Key Parts Spatio-Temporal Learning for Video Person Re-identification

Wei Guo, Institute of Information Engineering, Chinese Academy of Sciences, China, and School of Cyber Security, University of Chinese Academy of Sciences, China, [email protected]
Hao Wang, College of Software, Henan Normal University, China, [email protected]

DOI: https://doi.org/10.1145/3595916.3626417
MMAsia '23: ACM Multimedia Asia 2023, Tainan, Taiwan, December 2023

Person re-identification (Re-ID) is a technology for identifying specific pedestrians across different scenarios. In recent years, Re-ID has been widely used in surveillance, supermarkets, and smart cities. However, many challenges remain in this field, including complex backgrounds, pose changes, and occlusion. We propose a novel Key Parts Spatio-temporal Learning (KSTL) framework to alleviate these problems. Specifically, we first use a mask method based on keypoint detection to locate and extract key part features of the human body. Then, we introduce a Spatio-temporal Learning (STL) block based on key parts to enable key part features to be transferred and learned across multiple frames. Finally, we fuse the learned key part features and global features as the final video representation. The proposed method not only accurately learns key part features but also makes full use of the temporal information in the video, thus achieving good retrieval results. We conduct extensive experiments on three public benchmarks, and the results demonstrate the effectiveness and superiority of KSTL.

CCS Concepts: • Computer systems organization → Embedded systems; Redundancy; Robotics; • Networks → Network reliability

Keywords: person re-identification, video, keypoint detection, spatio-temporal learning, feature fusion

ACM Reference Format:
Wei Guo and Hao Wang. 2023. Key Parts Spatio-Temporal Learning for Video Person Re-identification. In ACM Multimedia Asia 2023 (MMAsia '23), December 06–08, 2023, Tainan, Taiwan. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3595916.3626417

1 INTRODUCTION

Person re-identification [31] (Re-ID) is mainly used for image processing and analysis in complex environments. In recent years, it has been increasingly applied in many real-world scenarios such as intelligent security and supermarkets, attracting wide attention from academia and industry. Existing Re-ID methods fall into two main categories: image-based [6, 9] and video-based [1, 17]. The former recognizes pedestrians by learning features from static images; however, it relies heavily on the quality of the captured still images. Different from static images, video sequences contain long-term temporal information and can provide richer spatio-temporal cues, bringing richer information dimensions and further optimization space for Re-ID.

Figure 1
Figure 1: Some problems in video-based person Re-ID.

The general principle of video-based Re-ID [8, 24] is to extract and aggregate spatial and temporal cues from video sequences to generate discriminative representations. The basic approach is to perform the same feature extraction operation on each frame and aggregate the results along the time dimension by methods such as temporal pooling. However, this approach makes the information extracted from consecutive frames highly redundant, and it is difficult to capture the discriminative key parts of the video under conditions such as occlusion, posture changes, and complex backgrounds, as shown in Figure 1, which degrades model performance. Meanwhile, simple multi-frame aggregation fails to fully exploit the temporal dependencies among frames.

To explicitly address the above issues, we propose a Key Parts Spatio-temporal Learning (KSTL) framework to better mine pedestrian features in videos. We first use the keypoint detection technology [4] to precisely locate the keypoint positions of pedestrians, and perform mask operation on them to extract key part features. Then, we construct Spatio-temporal Learning (STL) block, which can transfer and fuse key part features in time, space and spacetime. Finally, after multi-block STL learning, we fuse the learned key part features with the global features to enhance the contribution of key parts to the learned pedestrian features and weaken the background features. Our method achieves good pedestrian recognition results.

The main contributions of this paper are as follows:

Figure 2
Figure 2: The overall structure of our proposed method. Given a video $I=\left\lbrace I_{t} \right\rbrace _{t=1}^{T}$, we use the CNN backbone to extract the global features of the video. At the same time, we use keypoint detection and mask operation to accurately locate the key part area of the human body, and then send it into the backbone network and STL block for key part feature extraction and learning. Finally, the key part features Fk and global features Fg are sent to the fusion block for information fusion as the final video representation.

  • We propose a mask method based on keypoint detection, which accurately locates and extracts key part features.
  • We construct a spatio-temporal learning block based on the features of key parts to transfer and fuse the features of key parts in time, space and spacetime.
  • We introduce a fusion block, which fuses the learned key part features and global features on a frame-by-frame basis as the final video representation. Extensive experiments are conducted on multiple datasets, and our method achieves higher retrieval accuracy than SOTA methods.

2 RELATED WORK

A video sequence contains richer temporal information than a single image, so video-based Re-ID offers a larger optimization space for improving model performance. Attention mechanisms [25] and the exploitation of spatio-temporal information [30] are two commonly used means.

Attention mechanism. Bai et al. [1] propose a salient-to-broad module that gradually expands the attention region and supplements the salient region with additional information from the broader region, yielding more powerful video representations. Zhao et al. [33] decompose single-frame features into attribute-related sub-feature groups and weight the sub-features by attribute recognition to form the final video representation. Zhang et al. [32] propose an attentive feature aggregation module to finely aggregate spatio-temporal features into a discriminative video representation. Chen et al. [5] divide a video sequence into multiple short clips and use an attention mechanism to mitigate the impact of noisy frames. Si et al. [23] propose a Dual ATtention Matching network (DuATM) to achieve feature refinement and feature-pair alignment.

Utilization of spatio-temporal information. Yan et al. [28] use multi-granularity hypergraphs to fully mine spatial and temporal cues in video sequences. McLaughlin et al. [21] combine a convolutional neural network (CNN) with a recurrent neural network (RNN): the CNN extracts frame-by-frame features, and the RNN lets information flow between different frames. Yang et al. [29] propose a spatio-temporal graph convolutional network to model the spatial relationships within video sequences. Wang et al. [27] propose a pyramid spatio-temporal aggregation framework for progressively aggregating frame-level features. Some works [18, 19] apply 3D convolutional neural networks to video-based Re-ID and achieve good results, but these methods incur a large computational overhead and their performance is difficult to optimize.

3 PROPOSED METHOD

In this section, we detail the proposed Key Parts Spatio-temporal Learning framework. We first give an overview of the overall architecture of KSTL, and then describe key part feature extraction, spatio-temporal learning, and fusion in the subsequent subsections.

3.1 Overview

The overall architecture of our proposed KSTL is shown in Figure 2. Our method mainly consists of a backbone network, keypoint detection, a mask operation, spatio-temporal learning blocks, and a fusion block. First, we use the backbone network (ResNet50 [11] in our work) to extract the global features of the video. At the same time, we use keypoint detection to locate pedestrian keypoints, expand a region centered on each keypoint to form a key part, mask out everything outside the key parts, and feed the masked frames into the backbone network to extract key part features. Then, we input the key part features into the STL blocks, in which key part features within the same frame or across different frames are learned and transferred to each other. Finally, we fuse the learned key part features with the global features as the final pedestrian representation.
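As a concrete illustration of the keypoint detection step, the following sketch uses torchvision's pre-trained Keypoint R-CNN, the detector named in Section 4.1, to obtain 17 keypoints per frame. The helper name, the choice of the highest-scoring detection, and the zero fallback for frames without a detection are our illustrative assumptions rather than the paper's exact pipeline.

```python
import torch
import torchvision

# Pre-trained Keypoint R-CNN from torchvision (>= 0.13 weights API).
detector = torchvision.models.detection.keypointrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

@torch.no_grad()
def detect_keypoints(frames):
    """frames: list of T tensors, each (3, H, W) with values in [0, 1].
    Returns a (T, 17, 2) tensor of (x, y) keypoint coordinates."""
    outputs = detector(frames)                       # one prediction dict per frame
    coords = []
    for out in outputs:
        if len(out["keypoints"]) == 0:
            coords.append(torch.zeros(17, 2))        # no person detected in this frame
        else:
            # keep the highest-scoring person; keypoints are (17, 3): x, y, visibility
            coords.append(out["keypoints"][0, :, :2])
    return torch.stack(coords)
```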

3.2 Feature Extraction of Key Parts

Key parts refer to specific parts of a pedestrian's body, such as the head and limbs, which are important and recognizable areas. By extracting and learning the features of these key parts, the model can more reliably confirm whether two sequences show the same person.

Formally, given a video, we first sample T frames $I=\left\lbrace I_{t} \right\rbrace _{t=1}^{T}$ as the input of our network. We use the backbone network to extract the feature $F_{t}$ for each frame of the video, which can be expressed as:

\begin{equation} F_{t} = CNN\left(I_{t} \right),t=1,2,...,T \end{equation}
(1)
where $F_{t}\in R^{C\times H\times W}$, and H, W, and C denote the height, width, and number of channels, respectively. We denote the global feature of the whole video as $F_{g} = \left\lbrace F_{t} \right\rbrace _{t=1}^{T}$, $F_{g}\in R^{T\times C\times H\times W}$.
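As a brief illustration of Eq. (1), the sketch below extracts frame-wise global feature maps with a ResNet-50 backbone whose final pooling and classifier are removed; the function name and the batch handling are our assumptions.

```python
import torch
import torch.nn as nn
import torchvision

# ResNet-50 backbone truncated before the global pooling and classifier.
resnet = torchvision.models.resnet50(weights="DEFAULT")
backbone = nn.Sequential(*list(resnet.children())[:-2])   # keep only the convolutional stages

def extract_global_features(video):
    """video: (B, T, 3, 256, 128) -> F_g: (B, T, C, H, W) with C = 2048."""
    b, t = video.shape[:2]
    frames = video.flatten(0, 1)                 # fold time into the batch dimension
    feats = backbone(frames)                     # Eq. (1) applied to every frame
    return feats.view(b, t, *feats.shape[1:])    # regroup into F_g
```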

At the same time, we apply keypoint detection to all frames of the video; each frame yields 17 keypoint position coordinates of the human body. We denote the keypoint detection result of frame $I_{t}$ as $k_{t} = \left\lbrace k_{t,i} \right\rbrace _{i=1}^{17}$, where $k_{t,i}$ represents the position coordinates of keypoint i in frame $I_{t}$. Then, a region centered on each keypoint is expanded to form the corresponding key part. Specifically, we first construct a 0-1 mask matrix $m_{t,i}$ whose values are 1 inside the key part and 0 elsewhere, which can be expressed as:

\begin{equation} m_{t,i} = Mask\left(k_{t,i} \right), i=1,2,...,17 \end{equation}
(2)
where Mask represents the mask operation and $m_{t,i}\in R^{C\times H\times W}$. Then we element-wise multiply the mask matrix $m_{t,i}$ with the corresponding frame $I_{t}$ to generate the masked frame based on key parts; the masked results are shown in Figure 3. The masked frame is sent to the backbone network to extract the key part feature $F_{t,k_{i}}$, which can be expressed as:
\begin{equation} F_{t,k_{i} }=CNN\left(I_{t} \otimes m_{t,i} \right) \end{equation}
(3)
where ⊗ denotes the element-wise product and $F_{t,k_{i}}\in R^{C\times H\times W}$ represents the feature of key part i of frame $I_{t}$. The key part features of a frame are expressed as $F_{t,k} = \left\lbrace F_{t,k_{i}} \right\rbrace _{i=1}^{17}$, $F_{t,k}\in R^{17\times C\times H\times W}$. The key part features of the whole video are denoted as $F_{k} = \left\lbrace F_{t,k} \right\rbrace _{t=1}^{T}$, $F_{k}\in R^{T\times 17\times C\times H\times W}$.
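The following sketch illustrates Eqs. (2)-(3): it builds a square 0-1 mask around each of the 17 keypoints, multiplies it with the frame, and reuses the backbone on the masked frames. The radius of 9 pixels follows Section 4.1, while the helper names and the choice to build the mask on the image grid rather than the feature grid are our assumptions.

```python
import torch

def keypart_masks(keypoints, height, width, r=9):
    """keypoints: (T, 17, 2) pixel coordinates -> masks: (T, 17, 1, height, width)."""
    t = keypoints.shape[0]
    masks = torch.zeros(t, 17, 1, height, width)
    for f in range(t):
        for i, (x, y) in enumerate(keypoints[f].round().long().tolist()):
            x0, x1 = max(x - r, 0), min(x + r + 1, width)     # clip the square to the frame
            y0, y1 = max(y - r, 0), min(y + r + 1, height)
            masks[f, i, 0, y0:y1, x0:x1] = 1.0                # Eq. (2): 1 inside the key part
    return masks

def extract_keypart_features(backbone, frames, masks):
    """frames: (T, 3, H, W), masks: (T, 17, 1, H, W) -> key part features: (T, 17, C, h, w)."""
    t = frames.shape[0]
    masked = frames.unsqueeze(1) * masks          # Eq. (3): I_t ⊗ m_{t,i}, shape (T, 17, 3, H, W)
    feats = backbone(masked.flatten(0, 1))        # run every masked frame through the CNN
    return feats.view(t, 17, *feats.shape[1:])
```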
Figure 3
Figure 3: Key parts mask operation display.

3.3 Spatio-temporal Learning and Fusion

In order to alleviate the problems of occlusion and pedestrian pose changes in person re-identification, we construct an STL block based on key parts, which can learn and transfer key part features in time, space, and spacetime. STL takes the key part features $F_{k}$ of the whole video as input, and its core idea is to update each part feature $F_{t,k_{i}}$ according to the relationships between $F_{t,k_{i}}$ and other related part features $\left\lbrace F_{t^{^{\prime }},k_{j}} \right\rbrace _{t^{^{\prime }}=1}^{T}$ in the video. The principle can be expressed as follows:

\begin{equation} F_{t,k_{i} }^{l+1} =\frac{1}{N} \sum _{t^{^{\prime }}=1}^{T} \sum _{\forall j}f\left(F_{t,k_{i} }^{l},F_{t^{^{\prime }},k_{j} }^{l} \right)F_{t^{^{\prime }},k_{j} }^{l} \end{equation}
(4)
where l indexes the STL block, i is the index of the key part to be updated, and j indexes all positions related to i, which may lie in space, time, or spacetime. The function f computes the similarity between key parts i and j. Finally, the result is normalized by a factor N, where N = 17 × T in this paper.
Figure 4
Figure 4: The structure of spatio-temporal learning block.

The structure of the STL block is shown in Figure 4. It takes the key part features Fk as input, sends them into two convolutions for embedding, and then measures similarity with a dot product, which can be expressed as:

\begin{equation} f\left(F_{t,k_{i} }^{l},F_{t^{^{\prime }},k_{j} }^{l} \right)=\theta \left(F_{t,k_{i} }^{l} \right)^{T} \psi \left(F_{t^{^{\prime }},k_{j} }^{l} \right) \end{equation}
(5)
where θ and ψ denote 1×1×1 convolutions. To reduce the computational complexity, we perform dimensionality reduction and reshaping on the key part features; the features of all other key parts are then weighted and summed according to their similarities, and each key part feature is updated after normalization.
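A minimal, non-local-style sketch of one STL block is given below. It implements the similarity of Eq. (5), the normalized weighted sum of Eq. (4), and the residual connection of Eq. (6) introduced in the next paragraph. For brevity it assumes each key part feature map has been spatially pooled to a C-dimensional descriptor beforehand; the paper instead reduces and reshapes the full maps, so this is an illustrative simplification rather than the exact layer.

```python
import torch
import torch.nn as nn

class STLBlock(nn.Module):
    def __init__(self, channels, reduced=256):
        super().__init__()
        self.theta = nn.Conv1d(channels, reduced, kernel_size=1)   # theta embedding
        self.psi = nn.Conv1d(channels, reduced, kernel_size=1)     # psi embedding

    def forward(self, fk):
        """fk: (B, T, 17, C) pooled key part descriptors for the whole video."""
        b, t, p, c = fk.shape
        tokens = fk.reshape(b, t * p, c)                      # N = T*17 part tokens
        x = tokens.transpose(1, 2)                            # (B, C, N) for the 1x1 convolutions
        sim = self.theta(x).transpose(1, 2) @ self.psi(x)     # Eq. (5): (B, N, N) similarities
        updated = (sim @ tokens) / (t * p)                    # Eq. (4): weighted sum, normalized by N
        return (updated + tokens).reshape(b, t, p, c)         # Eq. (6): residual connection

# Three stacked blocks, matching the best configuration in Section 4.3.
stl = nn.Sequential(*[STLBlock(2048) for _ in range(3)])
```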

To avoid model performance degradation as the number of STL blocks used increases, we also build a residual connection, which can be expressed as:

\begin{equation} F_{t,k_{i} }^{l+1}=Res\left(F_{t,k_{i} }^{l+1},F_{t,k_{i} }^{l} \right) \end{equation}
(6)
where Res represents the residual connection. Finally, we feed the global features $F_{g}$ and the updated key part features $F_{k}$ into a fusion block for feature fusion. Because their dimensions differ, we first reduce the dimension of the key part features $F_{k}$ inside the fusion block, which can be expressed as:
\begin{equation} F_{k }=\sigma (W_{\phi } \left(F_{k } \right)) \end{equation}
(7)
where $W_{\phi }$ is a learnable weight and σ is an activation function; the dimension of $F_{k}$ is reduced from $R^{T\times 17\times C\times H\times W}$ to $R^{T\times C\times H\times W}$. Next, we fuse the key part features $F_{k}$ into the global features $F_{g}$ of the video to highlight the contribution of the key parts in the overall video representation. Specifically, we add the global features and key part features frame by frame and take the result, after global pooling, as the final representation of the video.
\begin{equation} \mathbf {F}=GAP\left(F_{g} + \mathbf {W}F_{k} \right) \end{equation}
(8)
where W denotes the balance weight.
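To make Eqs. (7)-(8) concrete, the sketch below collapses the 17 key parts of each frame with a learnable projection $W_{\phi}$, adds the result to the global features frame by frame with a balance weight, and applies global average pooling. The 1×1 convolution, the sigmoid choice for σ, and the scalar balance weight are our assumptions.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Conv2d(17 * channels, channels, kernel_size=1)   # W_phi
        self.act = nn.Sigmoid()                                           # sigma
        self.balance = nn.Parameter(torch.ones(1))                        # balance weight W

    def forward(self, fg, fk):
        """fg: (B, T, C, H, W) global features, fk: (B, T, 17, C, H, W) key part features."""
        b, t, p, c, h, w = fk.shape
        fk = self.act(self.reduce(fk.reshape(b * t, p * c, h, w)))   # Eq. (7): drop the part axis
        fk = fk.view(b, t, c, h, w)
        fused = fg + self.balance * fk                                # Eq. (8): frame-by-frame addition
        return fused.mean(dim=(1, 3, 4))                              # GAP over T, H, W -> (B, C)
```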

3.4 Training Schemes

To optimize the proposed KSTL framework, we use a cross-entropy loss $ \mathcal {L}_{xent}$ and a hard triplet loss $ \mathcal {L}_{tri}$ for supervised learning. The hard triplet loss is well suited to samples with small inter-identity differences and is widely used in person Re-ID. The overall loss function $ \mathcal {L}_{all}$ can be expressed as:

\begin{equation} \begin{aligned} \mathcal {L}_{all} = &\sum _{i=1}^{N} \left[m+\max _{pos=1...N}\left\Vert \mathbf {F}_i-\mathbf {F}_{pos} \right\Vert _{2}-\min _{neg=1...N} \left\Vert \mathbf {F}_i-\mathbf {F}_{neg} \right\Vert _{2}\right]_{+ } \\ &-\sum _{i=1}^N \log \frac{\exp \left(\mathbf {W}_{y_i} \mathbf {F}_i+b_{y_i}\right)}{\sum _{j=1}^C \exp \left(\mathbf {W}_j \mathbf {F}_i+b_j\right)} \end{aligned} \end{equation}
(9)
where N is the mini-batch size, C is the number of identity classes in the training set, $y_i$ is the label of sample i, m is the triplet margin, and $\mathbf{F}_i$, $\mathbf{F}_{pos}$, and $\mathbf{F}_{neg}$ are the anchor, hardest positive, and hardest negative features in the hard triplet loss, respectively.
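A compact sketch of Eq. (9) is shown below, combining a batch-hard triplet term with a cross-entropy term over the fused video features. The logits are assumed to come from a linear classifier on F, and the margin value is illustrative.

```python
import torch
import torch.nn.functional as F

def kstl_loss(features, logits, labels, margin=0.3):
    """features: (N, C) video representations F, logits: (N, num_ids), labels: (N,)."""
    dist = torch.cdist(features, features)                          # pairwise L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)                # identity mask
    hardest_pos = dist.masked_fill(~same, float('-inf')).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    tri = F.relu(margin + hardest_pos - hardest_neg).sum()           # hard triplet term
    xent = F.cross_entropy(logits, labels, reduction='sum')          # cross-entropy term
    return tri + xent
```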

4 EXPERIMENTS

4.1 Experimental Settings

Dataset and Evaluation Metrics. The MARS [34] dataset is currently one of the most commonly used public datasets for video-based Re-ID and is captured by 6 cameras. The iLIDS-VID [26] dataset is captured in a public place with two non-overlapping cameras and is challenging due to complex backgrounds, large illumination changes, and frequent occlusions. The PRID-2011 [13] dataset is collected in a simple and uncrowded outdoor environment. We adopt the Cumulative Matching Characteristics (CMC) [2] and mean Average Precision (mAP) [35] as evaluation metrics.

Implementation Details. We use ResNet50 pretrained on ImageNet [7] as the backbone network, and PyTorch's [22] built-in pre-trained keypoint detection model, Keypoint R-CNN [10], for keypoint detection. We randomly sample 32 video sequences from 8 persons as a batch and resize the input video frames to 256 × 128. In the implementation, we expand each keypoint by 9 pixels up, down, left, and right to form the key part area. We adopt the Adam [16] optimizer with weight decay 5e-4. The initial learning rate is set to 3e-4 and is reduced by a factor of 10 every 100 epochs. We train for a total of 200 epochs and test every 10 epochs. All experiments are implemented in PyTorch on two Tesla V100 GPUs.
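The optimization schedule described above might be set up as follows; `model`, `train_one_epoch`, and `evaluate` are assumed placeholders rather than code from the paper.

```python
import torch

# Adam with weight decay 5e-4, lr 3e-4 decayed by 10x every 100 epochs,
# 200 epochs in total, evaluation every 10 epochs (Section 4.1).
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)

for epoch in range(200):
    train_one_epoch(model, optimizer)        # one pass over the sampled tracklets
    scheduler.step()
    if (epoch + 1) % 10 == 0:
        evaluate(model)                      # periodic testing
```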

4.2 Comparisons with State-of-the-Arts

We conduct extensive experiments on three benchmark datasets, MARS, iLIDS-VID, and PRID-2011, and compare the results with state-of-the-art methods. For iLIDS-VID and PRID-2011, we only report the cumulative matching accuracy because their gallery sets contain only a single correct match for each query.

The comparison with state-of-the-art methods is shown in Table 1. On the iLIDS-VID dataset, our method achieves 93.4% Rank-1 accuracy, which is 0.7% higher than SDCL [3] and outperforms all compared methods. On the PRID-2011 dataset, our method achieves 96.7% Rank-1 accuracy, 0.2% higher than SDCL. On the MARS dataset, our method achieves 91.5% Rank-1 accuracy, 0.4% higher than SDCL and the best among the compared methods. It is worth noting that pedestrian backgrounds in the iLIDS-VID dataset are complex and some videos contain severe occlusion; achieving the best Rank-1 accuracy on this dataset further demonstrates the effectiveness of focusing on key parts of the human body. Overall, the performance of our KSTL framework is highly competitive among current Re-ID methods.

Table 1: Performance (%) comparison with state-of-the-art methods.
Methods         | MARS mAP | MARS Rank-1 | iLIDS-VID Rank-1 | PRID-2011 Rank-1
MG-RAFA [32]    | 85.9     | 88.8        | 88.6             | 95.9
BiCnet-TKS [14] | 86.0     | 90.2        | -                | -
DSANet [15]     | 86.6     | 91.1        | -                | -
GRL [20]        | 84.8     | 91.0        | 90.4             | 96.2
DenseIL [12]    | 87.0     | 90.8        | 92.0             | -
SINet [1]       | 86.2     | 91.0        | 92.5             | 96.5
SDCL [3]        | 86.5     | 91.1        | 92.7             | 96.5
KSTL (ours)     | 86.3     | 91.5        | 93.4             | 96.7
Figure 5
Figure 5: The influence of the coverage of key parts on MARS dataset.

4.3 Ablation Study

In this section, we evaluate the impact of parameter settings and components on model performance. First, we directly fuse the extracted key part features into the global features without using STL blocks; the impact of key part coverage on model performance, evaluated on the MARS dataset, is shown in Figure 5. We observe that the model performs best when 9 pixels around each keypoint position are retained, with 83.2% mAP and 87.6% Rank-1 accuracy. This is because when the pixel range is too small, the key parts are not effectively covered, and when it is too large, too many background features are extracted, which weakens the pedestrian representation.

Table 2: Influence of the number of STL blocks on model performance.
Number of STL blocks | MARS mAP | MARS Rank-1 | iLIDS-VID Rank-1 | PRID-2011 Rank-1
1                    | 85.92    | 89.35       | 90.66            | 93.4
2                    | 86.18    | 91.28       | 92.37            | 95.2
3                    | 86.33    | 91.52       | 93.46            | 96.71
4                    | 86.25    | 90.87       | 92.95            | 96.83

Then, under the optimal coverage setting, we analyze the impact of the number of STL blocks over four settings; the experimental results are shown in Table 2. Specifically, the model with 2 STL blocks is significantly better than the one with only 1 block. The comprehensive performance of the model is best with 3 STL blocks, reaching 86.33% mAP and 91.52% Rank-1 accuracy on the MARS dataset, 93.46% Rank-1 accuracy on the iLIDS-VID dataset, and 96.71% Rank-1 accuracy on the PRID-2011 dataset. This is because the STL block can fill in incomplete part features with effective part features from other frames, yielding a more delicate representation and more accurate retrieval. At the same time, we find that with 4 STL blocks the pedestrian representation is weakened: excessive spatio-temporal learning extracts too many redundant features and makes the key part features overly similar to each other. Considering that computing resources and training time increase significantly with the number of blocks, we use only 3 STL blocks in the framework for key part feature learning.

Table 3: Ablation results of components.
Components | MARS mAP | MARS Rank-1 | iLIDS-VID Rank-1 | PRID-2011 Rank-1
Baseline   | 78.3     | 85.7        | 79.7             | 88.6
Key Parts  | 83.2     | 87.6        | 88.3             | 93.4
KSTL       | 86.3     | 91.5        | 93.4             | 96.7

Finally, we evaluate the contribution of each component to the final performance of the model, as shown in Table 3. The first row of the table is the baseline, which uses the backbone network to extract features of the entire video and obtains the final video representation after global pooling. In the second row, we only use the original key part features and add them to the global features; we observe that mAP increases by 4.8% and Rank-1 by 1.9% on the MARS dataset, and Rank-1 increases by 9.6% and 4.9% on the iLIDS-VID and PRID-2011 datasets, respectively. This suggests that key part features lead to more discriminative representations. The overall performance of the model is shown in the third row. Compared with using only key part features, mAP increases by 3.1% and Rank-1 by 3.9% on the MARS dataset, and Rank-1 accuracy increases by 5.1% and 3.3% on the iLIDS-VID and PRID-2011 datasets, respectively. This shows that, through spatio-temporal learning based on key parts, our method extracts meaningful key information in space and time and improves model performance, which matches our design motivation.

4.4 Visualization

We also visualize the model retrieval results, as shown in Figure 6. The visualization shows that our framework can still retrieve pedestrians well under conditions such as occlusion and complex backgrounds. This is because the fusion of key part features highlights the key parts of the pedestrian and weakens the background features, while the STL block fully mines the long-range dependencies of key parts in the video to achieve more robust re-identification. In summary, our method not only accurately learns the features of key parts but also makes full use of the temporal information in the video, and is suitable for various complex scenarios.

Figure 6
Figure 6: Visualization of retrieval results on MARS.

5 CONCLUSION

In this paper, we propose a Key Parts Spatio-temporal Learning framework for video-based person Re-ID to achieve more accurate pedestrian retrieval. The framework learns carefully designed fine-grained key part features in time, space, and spacetime, making full use of the spatio-temporal cues in video sequences. We first perform a key part mask operation according to the detected keypoints to extract key part features. Then, the STL block learns and transfers the key part features spatio-temporally. Finally, the key part features are fused into the global features of the video to highlight key features and weaken redundant ones, generating a more discriminative video representation. We conduct extensive experiments on three person Re-ID benchmarks, and the framework achieves good performance compared to state-of-the-art methods.

REFERENCES

  • Shutao Bai, Bingpeng Ma, Hong Chang, Rui Huang, and Xilin Chen. 2022. Salient-to-broad transition for video person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7339–7348.
  • Ruud M Bolle, Jonathan H Connell, Sharath Pankanti, Nalini K Ratha, and Andrew W Senior. 2005. The relation between the ROC curve and the CMC. In Fourth IEEE workshop on automatic identification advanced technologies (AutoID’05). IEEE, 15–20.
  • Chengzhi Cao, Xueyang Fu, Hongjian Liu, Yukun Huang, Kunyu Wang, Jiebo Luo, and Zheng-Jun Zha. 2023. Event-Guided Person Re-Identification via Sparse-Dense Complementary Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17990–17999.
  • Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7291–7299.
  • Dapeng Chen, Hongsheng Li, Tong Xiao, Shuai Yi, and Xiaogang Wang. 2018. Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1169–1178.
  • Jiaxin Chen, Yunhong Wang, Jie Qin, Li Liu, and Ling Shao. 2017. Fast person re-identification via cross-camera semantic binary transformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3873–3882.
  • Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. IEEE, 248–255.
  • Jiyang Gao and Ram Nevatia. 2018. Revisiting temporal modeling for video-based person reid. arXiv preprint arXiv:1805.02104 (2018).
  • Niloofar Gheissari, Thomas B Sebastian, and Richard Hartley. 2006. Person reidentification using spatiotemporal appearance. In 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), Vol. 2. IEEE, 1528–1535.
  • Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision. 2961–2969.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Tianyu He, Xin Jin, Xu Shen, Jianqiang Huang, Zhibo Chen, and Xian-Sheng Hua. 2021. Dense interaction learning for video-based person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1490–1501.
  • Martin Hirzer, Csaba Beleznai, Peter M Roth, and Horst Bischof. 2011. Person re-identification by descriptive and discriminative classification. In Image Analysis: 17th Scandinavian Conference, SCIA 2011, Ystad, Sweden, May 2011. Proceedings 17. Springer, 91–102.
  • Ruibing Hou, Hong Chang, Bingpeng Ma, Rui Huang, and Shiguang Shan. 2021. Bicnet-tks: Learning efficient spatial-temporal representation for video person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2014–2023.
  • Minjung Kim, MyeongAh Cho, and Sangyoun Lee. 2023. Feature Disentanglement Learning with Switching and Aggregation for Video-based Person Re-Identification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1603–1612.
  • Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Jianing Li, Jingdong Wang, Qi Tian, Wen Gao, and Shiliang Zhang. 2019. Global-local temporal representations for video person re-identification. In Proceedings of the IEEE/CVF international conference on computer vision. 3958–3967.
  • Jianing Li, Shiliang Zhang, and Tiejun Huang. 2019. Multi-scale 3d convolution network for video based person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 8618–8625.
  • Xingyu Liao, Lingxiao He, Zhouwang Yang, and Chi Zhang. 2019. Video-based person re-identification via 3d convolutional networks and non-local attention. In Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part VI 14. Springer, 620–634.
  • Xuehu Liu, Pingping Zhang, Chenyang Yu, Huchuan Lu, and Xiaoyun Yang. 2021. Watching you: Global-guided reciprocal learning for video-based person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 13334–13343.
  • Niall McLaughlin, Jesus Martinez Del Rincon, and Paul Miller. 2016. Recurrent convolutional network for video-based person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1325–1334.
  • Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. (2017).
  • Jianlou Si, Honggang Zhang, Chun-Guang Li, Jason Kuen, Xiangfei Kong, Alex C Kot, and Gang Wang. 2018. Dual attention matching network for context-aware feature sequence based person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5363–5372.
  • Yumin Suh, Jingdong Wang, Siyu Tang, Tao Mei, and Kyoung Mu Lee. 2018. Part-aligned bilinear representations for person re-identification. In Proceedings of the European conference on computer vision (ECCV). 402–419.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • Taiqing Wang, Shaogang Gong, Xiatian Zhu, and Shengjin Wang. 2014. Person re-identification by video ranking. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV 13. Springer, 688–703.
  • Yingquan Wang, Pingping Zhang, Shang Gao, Xia Geng, Hu Lu, and Dong Wang. 2021. Pyramid spatial-temporal aggregation for video-based person re-identification. In Proceedings of the IEEE/CVF international conference on computer vision. 12026–12035.
  • Yichao Yan, Jie Qin, Jiaxin Chen, Li Liu, Fan Zhu, Ying Tai, and Ling Shao. 2020. Learning multi-granular hypergraphs for video-based person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2899–2908.
  • Jinrui Yang, Wei-Shi Zheng, Qize Yang, Ying-Cong Chen, and Qi Tian. 2020. Spatial-temporal graph convolutional network for video-based person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 3289–3299.
  • Mang Ye, Xiangyuan Lan, and Pong C Yuen. 2018. Robust anchor embedding for unsupervised video person re-identification in the wild. In Proceedings of the European Conference on Computer Vision (ECCV). 170–186.
  • Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven CH Hoi. 2021. Deep learning for person re-identification: A survey and outlook. IEEE transactions on pattern analysis and machine intelligence 44, 6 (2021), 2872–2893.
  • Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, and Zhibo Chen. 2020. Multi-granularity reference-aided attentive feature aggregation for video-based person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10407–10416.
  • Yiru Zhao, Xu Shen, Zhongming Jin, Hongtao Lu, and Xian-sheng Hua. 2019. Attribute-driven feature disentangling and temporal aggregation for video person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4913–4922.
  • Liang Zheng, Zhi Bie, Yifan Sun, Jingdong Wang, Chi Su, Shengjin Wang, and Qi Tian. 2016. Mars: A video benchmark for large-scale person re-identification. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI 14. Springer, 868–884.
  • Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. 2015. Scalable person re-identification: A benchmark. In Proceedings of the IEEE international conference on computer vision. 1116–1124.

This work is licensed under a Creative Commons Attribution International 4.0 License.

MMAsia '23, December 06–08, 2023, Tainan, Taiwan

© 2023 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0205-1/23/12.