
Dynamic Contrastive Distillation for Image-Text Retrieval

Published: 14 April 2023

Abstract

Recent advances in vision-and-language pretraining (VLP) have significantly improved the performance of cross-modal image-text retrieval (ITR) systems. However, the growing size of VLP models makes them hard to deploy in real-world search scenarios because of their high inference latency. To alleviate this problem, we present a novel plug-in dynamic contrastive distillation (DCD) framework that compresses large VLP models for the ITR task. Technically, we face two challenges: 1) the typical uni-modal metric-learning approach is difficult to apply directly to cross-modal tasks, because the GPU memory consumed by cross-modal fusion features limits the number of negative samples that can be optimized; and 2) statically optimizing the student network on hard samples of varying difficulty is inefficient and hinders both distillation and student optimization. We therefore propose a multi-modal contrastive learning method that balances training cost and effectiveness: a teacher network identifies hard samples for the student to learn from, so the student both inherits knowledge from the pre-trained teacher and learns effectively from hard samples. To handle hard sample pairs, we further propose dynamic distillation, which adaptively weights samples of different difficulties to better balance the difficulty of the transferred knowledge against the student's own learning ability. We successfully apply the proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e., ViLT and METER. Extensive experiments on the MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework. We further provide in-depth analyses and discussions that explain how the performance improvements arise.
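The core recipe described above, teacher-mined hard negatives combined with a difficulty-dependent weight on each pair, can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration under stated assumptions, not the authors' released implementation; the function name dcd_loss and the hyper-parameters num_hard, temperature, and gamma are illustrative choices.

# Minimal sketch (assumption, not the paper's code): the teacher's similarity
# scores pick hard negatives for the student, and a focal-style weight derived
# from teacher confidence plays the role of the dynamic difficulty weighting.
import torch
import torch.nn.functional as F

def dcd_loss(student_img, student_txt, teacher_img, teacher_txt,
             num_hard=8, temperature=0.05, gamma=2.0):
    """Contrastive distillation over teacher-mined hard negatives.

    All inputs are L2-normalised [batch, dim] embeddings; num_hard must be
    smaller than the batch size.
    """
    with torch.no_grad():
        # Teacher similarity matrix; row i scores image i against every caption.
        t_sim = teacher_img @ teacher_txt.t()
        # Exclude the matched caption on the diagonal, then keep the captions
        # the teacher finds hardest (highest-scoring negatives) per image.
        mask = torch.eye(t_sim.size(0), dtype=torch.bool, device=t_sim.device)
        t_neg = t_sim.masked_fill(mask, float('-inf'))
        hard_idx = t_neg.topk(num_hard, dim=1).indices          # [B, num_hard]
        # Dynamic weight: pairs the teacher already separates well contribute
        # less, so the student focuses on genuinely hard pairs.
        t_prob = F.softmax(t_sim / temperature, dim=1)
        pos_prob = t_prob.diagonal()                             # [B]
        weight = (1.0 - pos_prob).pow(gamma)                     # [B]

    # Student InfoNCE restricted to the teacher-selected hard negatives.
    s_sim = student_img @ student_txt.t() / temperature
    pos = s_sim.diagonal().unsqueeze(1)                          # [B, 1]
    neg = s_sim.gather(1, hard_idx)                              # [B, num_hard]
    logits = torch.cat([pos, neg], dim=1)
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    per_sample = F.cross_entropy(logits, targets, reduction='none')
    return (weight * per_sample).mean()

Note that negative selection and weighting are computed under torch.no_grad() from the frozen teacher, so the extra cost over a plain contrastive loss in this sketch is negligible.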


Cited By

  • Parameter-Efficient and Student-Friendly Knowledge Distillation, IEEE Transactions on Multimedia, vol. 26, pp. 4230–4241, 2024. DOI: 10.1109/TMM.2023.3321480
  • Restructuring the Teacher and Student in Self-Distillation, IEEE Transactions on Image Processing, vol. 33, pp. 5551–5563, 2024. DOI: 10.1109/TIP.2024.3463421
  • Harnessing the Power of Prompt Experts: Efficient Knowledge Distillation for Enhanced Language Understanding, in Machine Learning and Knowledge Discovery in Databases: Research Track and Demo Track, 2024, pp. 218–234. DOI: 10.1007/978-3-031-70371-3_13

Published In

IEEE Transactions on Multimedia, Volume 25, 2023, 8932 pages
Publisher: IEEE Press
Qualifier: Research article
