DOI: 10.1145/3475731.3484957

Hierarchical Semantic Enhanced Directional Graph Network for Visual Commonsense Reasoning

Published: 22 October 2021

Abstract

The visual commonsense reasoning (VCR) task aims to advance research on cognition-level correlation reasoning. It requires not only a thorough understanding of the correlated details of a scene but also the ability to infer correlations using related commonsense knowledge. Existing approaches rely on region-word affinity to perform semantic alignment between the visual and linguistic domains, neglecting the implicit correspondences (e.g., word-scene, region-phrase, and phrase-scene) between visual concepts and linguistic words. Although previous work has delivered promising results, these methods still struggle to produce interpretable reasoning. To this end, we present a novel hierarchical semantic enhanced directional graph network. More specifically, we design a Modality Interaction Unit (MIU) module, which captures high-order cross-modal alignment by aggregating hierarchical vision-language relationships. We then propose a direction clue-aware graph reasoning (DCGR) module, in which valuable entities are dynamically selected at each reasoning step according to their importance, leading to a more interpretable reasoning procedure. Finally, heterogeneous graph attention is introduced to filter out irrelevant parts of the final answers. Extensive experiments on the VCR benchmark dataset demonstrate that our method achieves competitive results and better interpretability compared with several state-of-the-art baselines.
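The abstract only names the key idea behind the DCGR module (per-step, importance-driven selection of entities on a directed graph); no implementation details are given on this page. Purely as an illustration, the minimal NumPy sketch below shows one way such a reasoning step could work, assuming attention-style importance scores against a clue vector and a hard top-k gate. All names and shapes here (entity_feats, adj, clue, k) are hypothetical and are not the authors' code.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def dcgr_step(entity_feats, adj, clue, k):
    """One illustrative reasoning step (assumed, not from the paper):
    score every entity against the current clue vector, keep only the
    top-k most important entities, and pass messages along the directed
    edges of the selected sub-graph."""
    scores = entity_feats @ clue            # (N,) importance w.r.t. the clue
    weights = softmax(scores)
    keep = np.argsort(weights)[-k:]         # indices of the k key entities
    mask = np.zeros_like(weights)
    mask[keep] = 1.0
    gated_adj = adj * np.outer(mask, mask)  # silence edges touching dropped nodes
    messages = gated_adj @ entity_feats     # aggregate features from selected neighbours
    # Residual update weighted by importance; unselected entities change little.
    return entity_feats + weights[:, None] * messages, keep

# Toy usage: 6 entities with 4-dim features on a random directed graph.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))
adj = rng.integers(0, 2, size=(6, 6)).astype(float)
updated, kept = dcgr_step(feats, adj, clue=rng.normal(size=4), k=3)
```

Under this reading, inspecting `kept` at every step is what would make the procedure interpretable: each reasoning step exposes which entities it judged important.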



      Published In

      Trustworthy AI'21: Proceedings of the 1st International Workshop on Trustworthy AI for Multimedia Computing
      October 2021
      42 pages
      ISBN:9781450386746
      DOI:10.1145/3475731


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. graph learning
      2. vision and language task
      3. visual commonsense reasoning

      Qualifiers

      • Research-article


      Conference

      MM '21: ACM Multimedia Conference
      October 24, 2021
      Virtual Event, China


