Learning Granularity-Unified Representations for Text-to-Image Person Re-identification

Published: 10 October 2022


Text-to-image person re-identification (ReID) aims to retrieve pedestrian images of an identity of interest from textual descriptions. It is challenging due to both rich intra-modal variations and significant inter-modal gaps. Existing works usually ignore the difference in feature granularity between the two modalities, i.e., visual features are usually fine-grained while textual features are coarse, which is mainly responsible for the large inter-modal gaps. In this paper, we propose an end-to-end transformer-based framework, denoted LGUR, to learn granularity-unified representations for both modalities. The LGUR framework contains two modules: a Dictionary-based Granularity Alignment (DGA) module and a Prototype-based Granularity Unification (PGU) module. In DGA, to align the granularities of the two modalities, we introduce a Multi-modality Shared Dictionary (MSD) to reconstruct both visual and textual features. In addition, DGA has two important factors, i.e., cross-modality guidance and foreground-centric reconstruction, which facilitate the optimization of MSD. In PGU, we adopt a set of shared, learnable prototypes as queries to extract diverse and semantically aligned features for both modalities in the granularity-unified feature space, which further improves ReID performance. Comprehensive experiments show that LGUR consistently outperforms state-of-the-art methods by large margins on both the CUHK-PEDES and ICFG-PEDES datasets. Code will be released at
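The two modules described in the abstract can be illustrated with a minimal sketch. It assumes that both the dictionary-based reconstruction and the prototype queries behave like plain scaled dot-product attention; this is an illustrative simplification, not the paper's actual architecture, and it omits cross-modality guidance, foreground-centric reconstruction, and all training. Every name, size, and random feature below is hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(queries, keys, values):
    """Scaled dot-product attention: each query returns a convex
    combination of the value vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def reconstruct_with_dictionary(features, dictionary):
    """DGA-style step (assumed form): rewrite each local feature as a
    mixture of atoms from the shared dictionary."""
    return attend(features, dictionary, dictionary)


def query_with_prototypes(prototypes, features):
    """PGU-style step (assumed form): each shared prototype pools the
    reconstructed features into one output vector."""
    return attend(prototypes, features, features)


def random_matrix(rows, dim, rng):
    """Stand-in for learned parameters or encoder outputs."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


rng = random.Random(0)
dim = 8
dictionary = random_matrix(16, dim, rng)  # shared by both modalities
prototypes = random_matrix(4, dim, rng)   # shared by both modalities
visual = random_matrix(24, dim, rng)      # stand-in for image patch features
textual = random_matrix(10, dim, rng)     # stand-in for word features

visual_out = query_with_prototypes(
    prototypes, reconstruct_with_dictionary(visual, dictionary))
textual_out = query_with_prototypes(
    prototypes, reconstruct_with_dictionary(textual, dictionary))
```

In this sketch, both modalities are rebuilt from the same dictionary atoms and pooled by the same prototypes, so an image with 24 local features and a sentence with 10 word features both end up as 4 vectors in one feature space. That is the sense in which the granularities are unified here; how LGUR achieves it in detail is specified in the paper itself.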



Author Tags

1. person re-identification
2. text-to-image retrieval


