DOI: 10.1145/3475731.3484957

Hierarchical Semantic Enhanced Directional Graph Network for Visual Commonsense Reasoning

Published: 22 October 2021

Abstract

The visual commonsense reasoning (VCR) task aims to advance research on cognition-level correlation reasoning. It requires not only a thorough understanding of the correlated details of a scene but also the ability to infer correlations using related commonsense knowledge. Existing approaches rely on region-word affinity to perform semantic alignment between the visual and linguistic domains, neglecting the implicit correspondences (e.g., word-scene, region-phrase, and phrase-scene) between visual concepts and linguistic words. Although previous work has delivered promising results, these methods still struggle to produce interpretable reasoning. To this end, we present a novel hierarchical semantic enhanced directional graph network. More specifically, we design a Modality Interaction Unit (MIU) module, which captures high-order cross-modal alignment by aggregating hierarchical vision-language relationships. We then propose a direction clue-aware graph reasoning (DCGR) module, in which valuable entities are dynamically selected at each reasoning step according to their importance, leading to a more interpretable reasoning procedure. Finally, heterogeneous graph attention is introduced to filter out irrelevant parts of the final answers. Extensive experiments on the VCR benchmark dataset demonstrate that our method achieves competitive results and better interpretability compared with several state-of-the-art baselines.
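The abstract only names the key idea behind the DCGR module (per-step, importance-driven selection of entities on a directed graph); no implementation details are given on this page. Purely as an illustration, the minimal NumPy sketch below shows one way such a reasoning step could work, assuming attention-style importance scores against a clue vector and a hard top-k gate. All names and shapes here (entity_feats, adj, clue, k) are hypothetical and are not the authors' code.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def dcgr_step(entity_feats, adj, clue, k):
    """One illustrative reasoning step (assumed, not from the paper):
    score every entity against the current clue vector, keep only the
    top-k most important entities, and pass messages along the directed
    edges of the selected sub-graph."""
    scores = entity_feats @ clue            # (N,) importance w.r.t. the clue
    weights = softmax(scores)
    keep = np.argsort(weights)[-k:]         # indices of the k key entities
    mask = np.zeros_like(weights)
    mask[keep] = 1.0
    gated_adj = adj * np.outer(mask, mask)  # silence edges touching dropped nodes
    messages = gated_adj @ entity_feats     # aggregate features from selected neighbours
    # Residual update weighted by importance; unselected entities change little.
    return entity_feats + weights[:, None] * messages, keep

# Toy usage: 6 entities with 4-dim features on a random directed graph.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))
adj = rng.integers(0, 2, size=(6, 6)).astype(float)
updated, kept = dcgr_step(feats, adj, clue=rng.normal(size=4), k=3)
```

Under this reading, inspecting `kept` at every step is what would make the procedure interpretable: each reasoning step exposes which entities it judged important.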



      Published In

      Trustworthy AI'21: Proceedings of the 1st International Workshop on Trustworthy AI for Multimedia Computing
      October 2021
      42 pages
      ISBN:9781450386746
      DOI:10.1145/3475731


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. graph learning
      2. vision and language task
      3. visual commonsense reasoning

      Qualifiers

      • Research-article


      Conference

      MM '21: ACM Multimedia Conference
      October 24, 2021
      Virtual Event, China


