VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval

Huang, Siteng; Gong, Biao; Pan, Yulin; Jiang, Jianwen; Lv, Yiliang; Li, Yuyuan; Wang, Donglin

arXiv:2211.12764 (cs)

[Submitted on 23 Nov 2022 (v1), last revised 22 Mar 2023 (this version, v3)]

Abstract: Many recent studies leverage the pre-trained CLIP for text-video cross-modal retrieval by tuning the backbone with additional heavy modules, which not only imposes a heavy computational burden through many more parameters but also causes forgetting of knowledge from upstream models. In this work, we propose VoP: Text-Video Co-operative Prompt Tuning for efficient tuning on the text-video retrieval task. The proposed VoP is an end-to-end framework that introduces both video and text prompts, and it can be regarded as a powerful baseline with only 0.1% trainable parameters. Further, based on the spatio-temporal characteristics of videos, we develop three novel video prompt mechanisms to improve performance at different scales of trainable parameters. The basic idea of the VoP enhancement is to model the frame position, frame context, and layer function with specific trainable prompts, respectively. Extensive experiments show that, compared to full fine-tuning, the enhanced VoP achieves a 1.4% average R@1 gain across five text-video retrieval benchmarks with 6x less parameter overhead. The code will be available at this https URL.
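
The R@1 figure quoted in the abstract is recall at rank 1: the fraction of text queries for which the matching video is the top-scored result. The following is a minimal sketch of that metric, assuming a square text-to-video similarity matrix in which the ground-truth video for text i sits at index i. The function name `recall_at_k` and the toy scores are illustrative assumptions, not taken from the paper or its code release.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of text queries whose matching video ranks in the top k.

    Assumes similarity[i][j] scores text i against video j, and that the
    ground-truth match for text i is video i (one caption per video).
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank: how many videos score strictly higher than the
        # true match. Ties are resolved in favour of the true match.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix (made-up numbers, not results from the paper).
toy_similarity = [
    [0.9, 0.1, 0.3],  # text 0: correct video ranked first
    [0.2, 0.4, 0.8],  # text 1: correct video ranked second
    [0.1, 0.2, 0.7],  # text 2: correct video ranked first
]
r_at_1 = recall_at_k(toy_similarity, k=1)
r_at_2 = recall_at_k(toy_similarity, k=2)
```

On this toy matrix two of the three queries retrieve their video at rank 1, so `r_at_1` is 2/3, and all three succeed within the top 2. Benchmarks with several captions per video need a different ground-truth mapping than the diagonal assumed here.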

Comments:	Accepted by CVPR 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2211.12764 [cs.CV]
	(or arXiv:2211.12764v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2211.12764

Submission history

From: Siteng Huang
[v1] Wed, 23 Nov 2022 08:20:29 UTC (763 KB)
[v2] Wed, 8 Mar 2023 06:31:05 UTC (757 KB)
[v3] Wed, 22 Mar 2023 02:36:52 UTC (758 KB)