DOI:10.1145/3581783.3611817

Target-Guided Composed Image Retrieval

Published: 27 October 2023

Abstract

Composed image retrieval (CIR) is a new and flexible image retrieval paradigm that retrieves the target image for a multimodal query consisting of a reference image and its corresponding modification text. Although existing efforts have achieved compelling success, they overlook two aspects: (1) modeling the conflict relationship between the reference image and the modification text, which would improve multimodal query composition, and (2) adaptively modeling the matching degree, which would improve the ranking of candidate images that match the given query to different degrees. To address these two limitations, we propose a Target-Guided Composed Image Retrieval network (TG-CIR). TG-CIR first extracts unified global and local attribute features for the reference/target image and the modification text, using the contrastive language-image pre-training (CLIP) model as the backbone, with an orthogonal regularization that promotes independence among the attribute features. TG-CIR then applies a target-query relationship-guided multimodal query composition module comprising a target-free student composition branch and a target-based teacher composition branch; the target-query relationship is injected into the teacher branch to guide the conflict relationship modeling of the student branch. Finally, in addition to the conventional batch-based classification loss, TG-CIR introduces a batch-based target similarity-guided matching degree regularization to promote the metric learning process. Extensive experiments on three benchmark datasets demonstrate the superiority of our proposed method.
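
The abstract names three training signals without giving their formulas: an orthogonal regularization over attribute features, a batch-based classification loss, and a target similarity-guided matching degree regularization. The sketch below shows one plausible reading of each in plain Python. It is illustrative only and is not the authors' implementation: the function names, the squared-cosine form of the orthogonality penalty, the temperature value, and the use of KL divergence for the matching degree term are all assumptions made for this example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    # L2-normalize; a zero vector is returned unchanged.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]

def softmax(scores, temperature):
    # Numerically stable softmax of scores / temperature.
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def orthogonal_penalty(attribute_features):
    # Assumed form: sum of squared cosine similarities between
    # distinct attribute features (zero when all are orthogonal).
    feats = [unit(f) for f in attribute_features]
    n = len(feats)
    return sum(dot(feats[i], feats[j]) ** 2
               for i in range(n) for j in range(n) if i != j)

def batch_classification_loss(queries, targets, temperature=0.1):
    # Query i should select target i among all targets in the batch
    # (cross-entropy over cosine similarities).
    q = [unit(x) for x in queries]
    t = [unit(x) for x in targets]
    loss = 0.0
    for i, qi in enumerate(q):
        probs = softmax([dot(qi, tj) for tj in t], temperature)
        loss -= math.log(probs[i])
    return loss / len(q)

def matching_degree_regularizer(queries, targets, temperature=0.1):
    # Assumed form: KL divergence from the target-to-target similarity
    # distribution to the query-to-target similarity distribution, so
    # candidates similar to the true target are also ranked highly.
    q = [unit(x) for x in queries]
    t = [unit(x) for x in targets]
    total = 0.0
    for i in range(len(q)):
        p_target = softmax([dot(t[i], tj) for tj in t], temperature)
        p_query = softmax([dot(q[i], tj) for tj in t], temperature)
        total += sum(p * math.log(p / r) for p, r in zip(p_target, p_query))
    return total / len(q)
```

Under these assumptions, the composed query features would be trained on a weighted sum of the three terms. The teacher-student guidance of the composition module is not covered by this sketch.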

Supplementary Material

MP4 File (mm-video.mp4)
Presentation video

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. composed image retrieval
  2. multimodal query composition
  3. multimodal retrieval

Qualifiers

  • Research-article

Funding Sources

  • Shandong Provincial Natural Science Foundation
  • Defense Advanced Research Projects Agency (DARPA)

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%

Article Metrics

  • Downloads (last 12 months): 363
  • Downloads (last 6 weeks): 34
Reflects downloads up to 22 Oct 2024

Cited By

  • (2024) SPIRIT: Style-guided Patch Interaction for Fashion Image Retrieval with Text Feedback. ACM Transactions on Multimedia Computing, Communications, and Applications 20:6 (1-17). DOI:10.1145/3640345. Online publication date: 8-Mar-2024
  • (2024) Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (240-250). DOI:10.1145/3626772.3657831. Online publication date: 10-Jul-2024
  • (2024) CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2177-2187). DOI:10.1145/3626772.3657823. Online publication date: 10-Jul-2024
  • (2024) LDRE: LLM-based Divergent Reasoning and Ensemble for Zero-Shot Composed Image Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (80-90). DOI:10.1145/3626772.3657740. Online publication date: 10-Jul-2024
  • (2024) Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (229-239). DOI:10.1145/3626772.3657727. Online publication date: 10-Jul-2024
  • (2024) PhotoScout: Synthesis-Powered Multi-Modal Image Search. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (1-15). DOI:10.1145/3613904.3642319. Online publication date: 11-May-2024
  • (2024) Self-Training Boosted Multi-Factor Matching Network for Composed Image Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:5 (3665-3678). DOI:10.1109/TPAMI.2023.3346434. Online publication date: May-2024
