DOI: 10.1145/3503161.3548028
research-article

Learning Granularity-Unified Representations for Text-to-Image Person Re-identification

Published: 10 October 2022

Abstract

Text-to-image person re-identification (ReID) aims to retrieve pedestrian images of an identity of interest from textual descriptions. It is challenging due to both rich intra-modal variations and significant inter-modal gaps. Existing works usually ignore the difference in feature granularity between the two modalities, i.e., visual features are usually fine-grained while textual features are coarse, which is mainly responsible for the large inter-modal gaps. In this paper, we propose an end-to-end transformer-based framework, denoted as LGUR, to learn granularity-unified representations for both modalities. The LGUR framework contains two modules: a Dictionary-based Granularity Alignment (DGA) module and a Prototype-based Granularity Unification (PGU) module. In DGA, to align the granularities of the two modalities, we introduce a Multi-modality Shared Dictionary (MSD) to reconstruct both visual and textual features. In addition, DGA has two important factors, i.e., cross-modality guidance and foreground-centric reconstruction, which facilitate the optimization of MSD. In PGU, we adopt a set of shared and learnable prototypes as queries to extract diverse and semantically aligned features for both modalities in the granularity-unified feature space, which further improves ReID performance. Comprehensive experiments show that LGUR consistently outperforms state-of-the-art methods by large margins on both the CUHK-PEDES and ICFG-PEDES datasets. Code will be released at https://github.com/ZhiyinShao-H/LGUR.
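The two ideas named in the abstract, reconstructing both modalities from one shared dictionary and then pooling them with shared prototype queries, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the attention-style reconstruction, the scaled dot-product scoring, the random initialisation, and every name and size below (`reconstruct_with_dictionary`, `extract_with_prototypes`, `shared_dictionary`, `shared_prototypes`, the token counts) are assumptions chosen only to show the data flow. Cross-modality guidance, foreground-centric reconstruction, and training are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reconstruct_with_dictionary(features, dictionary):
    """Rebuild each feature as a convex combination of dictionary atoms.

    features:   (num_tokens, dim) visual or textual token features
    dictionary: (num_atoms, dim) atoms shared by both modalities
    returns:    (num_tokens, dim) reconstructed features
    """
    dim = features.shape[-1]
    weights = softmax(features @ dictionary.T / np.sqrt(dim))
    return weights @ dictionary


def extract_with_prototypes(features, prototypes):
    """Use shared prototypes as attention queries over the token features.

    features:   (num_tokens, dim)
    prototypes: (num_prototypes, dim), shared by both modalities
    returns:    (num_prototypes, dim), one pooled feature per prototype
    """
    dim = features.shape[-1]
    attention = softmax(prototypes @ features.T / np.sqrt(dim))
    return attention @ features


# Hypothetical sizes; random values stand in for learned parameters.
rng = np.random.default_rng(0)
dim, num_atoms, num_prototypes = 16, 32, 6
shared_dictionary = rng.normal(size=(num_atoms, dim))
shared_prototypes = rng.normal(size=(num_prototypes, dim))

visual_tokens = rng.normal(size=(48, dim))  # e.g. image patch features
text_tokens = rng.normal(size=(20, dim))    # e.g. word features

visual_out = extract_with_prototypes(
    reconstruct_with_dictionary(visual_tokens, shared_dictionary),
    shared_prototypes,
)
text_out = extract_with_prototypes(
    reconstruct_with_dictionary(text_tokens, shared_dictionary),
    shared_prototypes,
)
print(visual_out.shape, text_out.shape)
```

In this sketch both modalities pass through the same dictionary and the same prototypes, so an image with 48 tokens and a sentence with 20 tokens both end up as the same number of feature vectors in one space, where they could be compared prototype by prototype.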

Supplementary Material

MP4 File (MM22-fp1181.mp4)
The LGUR framework contains two modules: a Dictionary-based Granularity Alignment (DGA) module and a Prototype-based Granularity Unification (PGU) module. In DGA, to align the granularities of the two modalities, we introduce a Multi-modality Shared Dictionary (MSD) to reconstruct both visual and textual features. In PGU, we adopt a set of shared and learnable prototypes as queries to extract diverse and semantically aligned features for both modalities in the granularity-unified feature space, which further improves ReID performance.


    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN: 9781450392037
    DOI: 10.1145/3503161

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. person re-identification
    2. text-to-image retrieval

    Qualifiers

    • Research-article

    Funding Sources

    • the Program for Guangdong Introducing Innovative and Entrepreneurial Teams
    • Guangdong Basic and Applied Basic Research Foundation
    • Guangdong Provincial Key Laboratory of Human Digital Twin
    • CCF-Baidu Open Fund
    • the National Natural Science Foundation of China

    Conference

    MM '22

    Acceptance Rates

    Overall Acceptance Rate 995 of 4,171 submissions, 24%

    Article Metrics

    • Downloads (Last 12 months): 180
    • Downloads (Last 6 weeks): 17
    Reflects downloads up to 22 Oct 2024

    Citations

    Cited By

    • (2025) Graph-based Consistent Reconstruction and Alignment for imbalanced text–image person re-identification. Expert Systems with Applications 260 (125429). DOI: 10.1016/j.eswa.2024.125429. Online publication date: Jan-2025
    • (2024) Fine-grained Semantics-aware Representation Learning for Text-based Person Retrieval. Proceedings of the 2024 International Conference on Multimedia Retrieval (92-100). DOI: 10.1145/3652583.3658054. Online publication date: 30-May-2024
    • (2024) Dense captioning for Text-Image ReID. Proceedings of the Fourteenth Indian Conference on Computer Vision, Graphics and Image Processing (1-8). DOI: 10.1145/3627631.3627648. Online publication date: 31-Jan-2024
    • (2024) MACA: Memory-aided Coarse-to-fine Alignment for Text-based Person Search. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2497-2501). DOI: 10.1145/3626772.3657915. Online publication date: 10-Jul-2024
    • (2024) Cross-Modal Adaptive Dual Association for Text-to-Image Person Retrieval. IEEE Transactions on Multimedia 26 (6609-6620). DOI: 10.1109/TMM.2024.3355644. Online publication date: 2024
    • (2024) Text-to-Image Person Re-Identification Based on Multimodal Graph Convolutional Network. IEEE Transactions on Multimedia 26 (6025-6036). DOI: 10.1109/TMM.2023.3344354. Online publication date: 2024
    • (2024) Text-to-Image Vehicle Re-Identification: Multi-Scale Multi-View Cross-Modal Alignment Network and a Unified Benchmark. IEEE Transactions on Intelligent Transportation Systems 25:7 (7673-7686). DOI: 10.1109/TITS.2023.3348599. Online publication date: Jul-2024
    • (2024) VGSG: Vision-Guided Semantic-Group Network for Text-Based Person Search. IEEE Transactions on Image Processing 33 (163-176). DOI: 10.1109/TIP.2023.3337653. Online publication date: 1-Jan-2024
    • (2024) Fine-Granularity Alignment for Text-Based Person Retrieval Via Semantics-Centric Visual Division. IEEE Transactions on Circuits and Systems for Video Technology 34:9 (8242-8252). DOI: 10.1109/TCSVT.2024.3392831. Online publication date: Sep-2024
    • (2024) An Overview of Text-Based Person Search: Recent Advances and Future Directions. IEEE Transactions on Circuits and Systems for Video Technology 34:9 (7803-7819). DOI: 10.1109/TCSVT.2024.3376373. Online publication date: Sep-2024
