A Hierarchical Multimodal Attention-based Neural Network for Image Captioning

Published: 07 August 2017

Abstract

This paper develops a novel hierarchical multimodal attention-based model to generate more accurate and descriptive captions for images. Our model is an end-to-end neural network comprising three related sub-networks: a deep convolutional neural network that encodes image content, a recurrent neural network that identifies the objects in an image sequentially, and a multimodal attention-based recurrent neural network that generates the caption. The main contribution of our work is that the hierarchical structure and the multimodal attention mechanism are applied together, so that each caption word is generated with multimodal attention over both the intermediate semantic objects and the global visual content. Our experiments on two benchmark datasets yield very positive results.
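The abstract does not give the model's exact equations, but the core idea, attending jointly over intermediate semantic-object embeddings and the global visual feature at each decoding step, can be sketched roughly as follows. This is only an illustrative NumPy sketch: the weight matrices `W_h` and `W_m`, the additive fusion, and all shapes are assumptions, not the authors' formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def multimodal_attention_step(h, obj_embeds, global_feat, W_h, W_m):
    """One decoding step: attend over semantic-object embeddings and the
    global visual feature, then fuse the attended context with the decoder
    hidden state h to condition the next caption word (illustrative only)."""
    # Candidate modalities: each detected object plus the global image feature.
    modalities = np.vstack([obj_embeds, global_feat])   # shape (k + 1, d)
    # Bilinear relevance score between the hidden state and each modality.
    scores = modalities @ (W_m @ h)                     # shape (k + 1,)
    alpha = softmax(scores)                             # attention weights
    context = alpha @ modalities                        # weighted sum, shape (d,)
    # Simple linear fusion of hidden state and context (an assumption).
    fused = np.tanh(W_h @ h + context)
    return fused, alpha

rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=d)                 # decoder hidden state
objs = rng.normal(size=(3, d))         # embeddings of 3 detected objects
g = rng.normal(size=d)                 # global CNN image feature
W_h = rng.normal(size=(d, d))
W_m = rng.normal(size=(d, d))
fused, alpha = multimodal_attention_step(h, objs, g, W_h, W_m)
print(alpha.shape, fused.shape)        # weights over 4 modalities; fused context
```

The sketch shows only why the mechanism is called "multimodal": the attention distribution `alpha` spans both the object-level (intermediate semantic) features and the global image feature, so each generated word can draw on either level.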



      Information

      Published In

      SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
      August 2017
      1476 pages
      ISBN:9781450350228
      DOI:10.1145/3077136

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. hierarchical recurrent neural network
      2. image captioning
      3. long short-term memory model
      4. multimodal attention

      Qualifiers

      • Short-paper

      Funding Sources

      • National Natural Science Fund of China
      • The Application of Big Data Computing Platform in Smart Lingang New City based on BIM and GIS
      • Shanghai Municipal Science and Technology Commission
      • Shanghai Municipality Program of Technology Research Leader

      Conference

      SIGIR '17

      Acceptance Rates

      SIGIR '17 Paper Acceptance Rate 78 of 362 submissions, 22%;
      Overall Acceptance Rate 792 of 3,983 submissions, 20%

      Article Metrics

      • Downloads (last 12 months): 14
      • Downloads (last 6 weeks): 2
      Reflects downloads up to 16 Oct 2024

      Cited By

      • (2024) MGTANet: Multi-Scale Guided Token Attention Network for Image Captioning. Proceedings of the 2024 3rd International Conference on Cyber Security, Artificial Intelligence and Digital Economy, pp. 237-245. DOI: 10.1145/3672919.3672964. Online publication date: 1-Mar-2024.
      • (2024) A Survey on Automatic Image Captioning Approaches: Contemporary Trends and Future Perspectives. Archives of Computational Methods in Engineering. DOI: 10.1007/s11831-024-10190-8. Online publication date: 16-Oct-2024.
      • (2023) Zero-Shot Image Classification with Rectified Embedding Vectors Using a Caption Generator. Applied Sciences, 13(12), 7071. DOI: 10.3390/app13127071. Online publication date: 13-Jun-2023.
      • (2023) Evolution of visual data captioning Methods, Datasets, and evaluation Metrics: A comprehensive survey. Expert Systems with Applications, 221, 119773. DOI: 10.1016/j.eswa.2023.119773. Online publication date: Jul-2023.
      • (2023) PSNet: position-shift alignment network for image caption. International Journal of Multimedia Information Retrieval, 12(2). DOI: 10.1007/s13735-023-00307-3. Online publication date: 27-Nov-2023.
      • (2022) Visuals to Text: A Comprehensive Review on Automatic Image Captioning. IEEE/CAA Journal of Automatica Sinica, 9(8), 1339-1365. DOI: 10.1109/JAS.2022.105734. Online publication date: Aug-2022.
      • (2022) Topic Guided Image Captioning with Scene and Spatial Features. Advanced Information Networking and Applications, pp. 180-191. DOI: 10.1007/978-3-030-99587-4_16. Online publication date: 31-Mar-2022.
      • (2021) Deep learning for insider threat detection. Computers and Security, 104(C). DOI: 10.1016/j.cose.2021.102221. Online publication date: 1-May-2021.
      • (2021) Image captioning with adaptive incremental global context attention. Applied Intelligence, 52(6), 6575-6597. DOI: 10.1007/s10489-021-02734-3. Online publication date: 13-Sep-2021.
      • (2020) DPAST-RNN: A Dual-Phase Attention-Based Recurrent Neural Network Using Spatiotemporal LSTMs for Time Series Prediction. Neural Information Processing, pp. 568-578. DOI: 10.1007/978-3-030-63836-8_47. Online publication date: 19-Nov-2020.
