A Hierarchical Multimodal Attention-based Neural Network for Image Captioning

Published: 07 August 2017

Abstract

This paper develops a novel hierarchical multimodal attention-based model to generate more accurate and descriptive captions for images. Our model is an end-to-end neural network comprising three related sub-networks: a deep convolutional neural network that encodes image content, a recurrent neural network that identifies the objects in an image sequentially, and a multimodal attention-based recurrent neural network that generates the caption. The main contribution of our work is that the hierarchical structure and the multimodal attention mechanism are applied together, so that each caption word is generated with multimodal attention over both the intermediate semantic objects and the global visual content. Our experiments on two benchmark datasets yield very positive results.
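The abstract does not give the model's exact equations, but the core idea, attending jointly over intermediate semantic-object embeddings and the global visual feature at each decoding step, can be sketched roughly as follows. This is only an illustrative NumPy sketch: the weight matrices `W_h` and `W_m`, the additive fusion, and all shapes are assumptions, not the authors' formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def multimodal_attention_step(h, obj_embeds, global_feat, W_h, W_m):
    """One decoding step: attend over semantic-object embeddings and the
    global visual feature, then fuse the attended context with the decoder
    hidden state h to condition the next caption word (illustrative only)."""
    # Candidate modalities: each detected object plus the global image feature.
    modalities = np.vstack([obj_embeds, global_feat])   # shape (k + 1, d)
    # Bilinear relevance score between the hidden state and each modality.
    scores = modalities @ (W_m @ h)                     # shape (k + 1,)
    alpha = softmax(scores)                             # attention weights
    context = alpha @ modalities                        # weighted sum, shape (d,)
    # Simple linear fusion of hidden state and context (an assumption).
    fused = np.tanh(W_h @ h + context)
    return fused, alpha

rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=d)                 # decoder hidden state
objs = rng.normal(size=(3, d))         # embeddings of 3 detected objects
g = rng.normal(size=d)                 # global CNN image feature
W_h = rng.normal(size=(d, d))
W_m = rng.normal(size=(d, d))
fused, alpha = multimodal_attention_step(h, objs, g, W_h, W_m)
print(alpha.shape, fused.shape)        # weights over 4 modalities; fused context
```

The sketch shows only why the mechanism is called "multimodal": the attention distribution `alpha` spans both the object-level (intermediate semantic) features and the global image feature, so each generated word can draw on either level.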



      Information

      Published In

      SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
      August 2017
      1476 pages
      ISBN:9781450350228
      DOI:10.1145/3077136

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. hierarchical recurrent neural network
      2. image captioning
      3. long short-term memory model
      4. multimodal attention

      Qualifiers

      • Short-paper

      Funding Sources

      • National Natural Science Fund of China
      • The Application of Big Data Computing Platform in Smart Lingang New City based on BIM and GIS
      • Shanghai Municipal Science and Technology Commission
      • Shanghai Municipality Program of Technology Research Leader

      Conference

      SIGIR '17

      Acceptance Rates

      SIGIR '17 Paper Acceptance Rate 78 of 362 submissions, 22%;
      Overall Acceptance Rate 792 of 3,983 submissions, 20%

      Article Metrics

      • Downloads (last 12 months): 14
      • Downloads (last 6 weeks): 2
      Reflects downloads up to 16 Oct 2024

      Cited By

      • (2024) MGTANet: Multi-Scale Guided Token Attention Network for Image Captioning. Proceedings of the 2024 3rd International Conference on Cyber Security, Artificial Intelligence and Digital Economy, pp. 237-245. DOI: 10.1145/3672919.3672964. Online publication date: 1-Mar-2024.
      • (2024) A Survey on Automatic Image Captioning Approaches: Contemporary Trends and Future Perspectives. Archives of Computational Methods in Engineering. DOI: 10.1007/s11831-024-10190-8. Online publication date: 16-Oct-2024.
      • (2023) Zero-Shot Image Classification with Rectified Embedding Vectors Using a Caption Generator. Applied Sciences, 13(12), 7071. DOI: 10.3390/app13127071. Online publication date: 13-Jun-2023.
      • (2023) Evolution of visual data captioning Methods, Datasets, and evaluation Metrics: A comprehensive survey. Expert Systems with Applications, 221, 119773. DOI: 10.1016/j.eswa.2023.119773. Online publication date: Jul-2023.
      • (2023) PSNet: position-shift alignment network for image caption. International Journal of Multimedia Information Retrieval, 12(2). DOI: 10.1007/s13735-023-00307-3. Online publication date: 27-Nov-2023.
      • (2022) Visuals to Text: A Comprehensive Review on Automatic Image Captioning. IEEE/CAA Journal of Automatica Sinica, 9(8), 1339-1365. DOI: 10.1109/JAS.2022.105734. Online publication date: Aug-2022.
      • (2022) Topic Guided Image Captioning with Scene and Spatial Features. Advanced Information Networking and Applications, pp. 180-191. DOI: 10.1007/978-3-030-99587-4_16. Online publication date: 31-Mar-2022.
      • (2021) Deep learning for insider threat detection. Computers and Security, 104(C). DOI: 10.1016/j.cose.2021.102221. Online publication date: 1-May-2021.
      • (2021) Image captioning with adaptive incremental global context attention. Applied Intelligence, 52(6), 6575-6597. DOI: 10.1007/s10489-021-02734-3. Online publication date: 13-Sep-2021.
      • (2020) DPAST-RNN: A Dual-Phase Attention-Based Recurrent Neural Network Using Spatiotemporal LSTMs for Time Series Prediction. Neural Information Processing, pp. 568-578. DOI: 10.1007/978-3-030-63836-8_47. Online publication date: 19-Nov-2020.
