CN112906605A - Cross-modal pedestrian re-identification method with high accuracy - Google Patents

Info

Publication number: CN112906605A (application CN202110243887.7A; granted as CN112906605B)
Authority: CN (China)
Prior art keywords: pedestrian, cross, modal, sample, double
Other languages: Chinese (zh)
Inventors: 张立言, 杜国栋, 徐旭
Applicant and assignee: Nanjing University of Aeronautics and Astronautics
Legal status: Active (granted)

Classifications

    • G06V40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045: Neural network architectures; combinations of networks
    • G06N3/048: Activation functions
    • G06N3/08: Learning methods
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis
    • Y02T10/40: Engine management systems


Abstract

The invention discloses a high-accuracy cross-modal pedestrian re-identification method comprising the following steps: acquire real pedestrian video footage recorded in a surveillance environment from a dataset and process it into pedestrian image samples with identity labels; build a multi-scale joint dual-stream cross-modal deep network, initialize its parameters, and train it under supervision using the pedestrian image samples and identity labels as supervision information; finally, feed a query image of the pedestrian of interest to the trained dual-stream cross-modal deep network, which returns a list of pedestrian targets ranked by similarity to the query. The method can process heterogeneous samples of the two modalities simultaneously, extract the modality-shared features of the samples, and fuse global-scale and local-scale features into a more discriminative representation.

Description

Cross-modal pedestrian re-identification method with high accuracy
Technical Field
The invention discloses a method that applies deep learning and prior knowledge to achieve high-accuracy cross-modality, cross-camera pedestrian matching, and belongs to the field of computer vision.
Background
With the development of society, road surveillance systems have become increasingly widespread. Owing to the limited performance of surveillance cameras and the changing environmental conditions of surveillance, face recognition does not work well for cross-camera pedestrian tracking, so the topic of pedestrian re-identification (Re-ID) has grown increasingly prominent [1]. Pedestrian re-identification aims to retrieve and screen the same person across cameras and thereby determine that person's trajectory [1]. Moreover, because cameras at night mostly capture infrared (IR) images, which differ from the RGB images captured in daytime, conventional Re-ID methods struggle to overcome the gap between the two modalities; cross-modal Re-ID was proposed to solve this problem [3].
With the wide deployment of surveillance equipment and the arrival of the big-data era, pedestrian retrieval and matching have become increasingly significant in the field of public safety. However, owing to installation constraints and the limited performance of monitoring equipment, conventional face recognition does not work well under road surveillance [1]. Research into pedestrian re-identification techniques that suit the road-surveillance task environment and can track a specific person across cameras is therefore receiving ever more attention.
Current research on pedestrian re-identification focuses mainly on three directions: single-modal Re-ID, unsupervised Re-ID, and cross-modal Re-ID, alongside smaller topics such as occluded Re-ID, dense crowds, and cross-resolution matching. Single-modal Re-ID was proposed earliest and is therefore the most mature; it laid the foundation for the other research directions. Unsupervised Re-ID arose from the difficulty of obtaining labels in practical applications and is arguably an indispensable step toward real-world deployment of Re-ID. Cross-modal Re-ID is motivated by the practical needs of tracking and detection: most criminal activity occurs at night, so matching pedestrian images captured at night against those captured in daytime has become an increasingly important subject.
Single-modal Re-ID has achieved very high matching accuracy on the existing Re-ID datasets, and a number of strong baselines have been proposed. Single-modal Re-ID is supervised training on RGB images with manual labels; its methods mainly aim to mine discriminative details in pedestrian samples to form sample features and thereby improve matching accuracy. Among these methods, self-attention and part-level-feature approaches are particularly effective. Self-attention in pedestrian re-identification partitions a pedestrian image into blocks and reweights the content of each block using the relations between blocks and each block's weight in the final matching, so that the reweighted blocks provide stronger discriminative power. Part-level-feature methods are more direct: the representative PCB model [18] slices the image horizontally, and each resulting slice represents the whole original pedestrian sample, forcing the model to attend to detailed image regions. The slicing is based on prior knowledge about the human body: PCB splits the image into six stripes, while some methods split it into three stripes corresponding to the head, upper body, and lower body.
Cross-modal Re-ID must overcome far more problems than single-modal Re-ID. In addition to extracting well-characterized sample features, modality differences must be overcome. Most current papers [3][10] use a two-stream network structure to process the two modalities separately and then a shared layer to extract modality-shared characteristics, thereby obtaining reliable features. Other methods exploit the role of modality-specific features; one typical approach fills in the features a modality lacks using labeled samples of the other modality, achieving feature balance between the samples. A similar effect can be achieved with GANs: such methods retain the content of the original sample while transferring the style of the other modality, which augments the dataset and balances the samples across modalities [2].
Reference documents:
[1]Ye,Mang.(2020).Deep Learning for Person Re-identification:A Survey and Outlook.
[2]Choi,Seokeon&Lee,Sumin&Kim,Youngeun&Kim,Taekyung&Kim,Changick.(2020).Hi-CMD:Hierarchical Cross-Modality Disentanglement for Visible-Infrared Person Re-Identification.10254-10263.10.1109/CVPR42600.2020.01027.
[3]Ye,Mang&Shen,Jianbing&Crandall,David&Shao,Ling&Luo,Jiebo.(2020).Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification.
[4]Wang,Guan-An&Zhang,Tianzhu&Yang,Yang&Cheng,Jian&Chang,Jianlong&Liang,Xu&Hou,Zeng-Guang.(2020).Cross-Modality Paired-Images Generation for RGB-Infrared Person Re-Identification.Proceedings of the AAAI Conference on Artificial Intelligence.34.12144-12151.10.1609/aaai.v34i07.6894.
[5]Y.Lu et al.,"Cross-Modality Person Re-Identification With Shared-Specific Feature Transfer,"2020IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR),Seattle,WA,USA,2020,pp.13376-13386,doi:10.1109/CVPR42600.2020.01339.
[6]Jia,Mengxi&Zhai,Yunpeng&Lu,Shijian&Ma,Siwei&Zhang,Jian.(2020).A Similarity Inference Metric for RGB-Infrared Cross-Modality Person Re-identification.
[7]Fan,Xing&Luo,Hao&Zhang,Chi&Jiang,Wei.(2020).Cross-Spectrum Dual-Subspace Pairing for RGB-infrared Cross-Modality Person Re-Identification.
[8]Zhang,Ziyue&Jiang,Shuai&Huang,Congzhentao&Li,Yang&Xu,Richard.(2020).RGB-IR Cross-modality Person ReID based on Teacher-Student GAN Model.
[9]Wang,Guanan&Zhang,Tianzhu&Cheng,Jian&Liu,Si&Yang,Yang&Hou,Zengguang.(2019).RGB-Infrared Cross-Modality Person Re-Identification via Joint Pixel and Feature Alignment.
[10]Zhu,Yuanxin&Yang,Zhao&Wang,Li&Zhao,Sai&Hu,Xiao&Tao,Dapeng.(2019).Hetero-Center Loss for Cross-Modality Person Re-Identification.Neurocomputing.386.10.1016/j.neucom.2019.12.100.
[11]Wang,Zhixiang&Wang,Zheng&Zheng,Yinqiang&Chuang,Yung-Yu&Satoh,Shin'ich.(2019).Learning to Reduce Dual-Level Discrepancy for Infrared-Visible Person Re-Identification.618-626.10.1109/CVPR.2019.00071.
[12]Hao,Yi&Wang,Nannan&Gao,Xinbo&Li,Jie&Wang,Xiaoyu.(2019).Dual-alignment Feature Embedding for Cross-modality Person Re-identification.57-65.10.1145/3343031.3351006.
[13]Ye,Mang&Lan,Xiangyuan&Leng,Qingming.(2019).Modality-aware Collaborative Learning for Visible Thermal Person Re-Identification.347-355.10.1145/3343031.3351043.
[14]Liu,Haijun&Cheng,Jian.(2019).Enhancing the Discriminative Feature Learning for Visible-Thermal Cross-Modality Person Re-Identification.
[15]Basaran,Emrah&Gökmen,Muhittin&Kamasak,Mustafa.(2020).An efficient framework for visible-infrared cross modality person re-identification.Signal Processing:Image Communication.87.115933.10.1016/j.image.2020.115933.
[16]Pingyang,Dai&Ji,Rongrong&Wang,Haibin&Wu,Qiong&Huang,Yuyu.(2018).Cross-Modality Person Re-Identification with Generative Adversarial Training.677-683.10.24963/ijcai.2018/94.
[17]Wang,Guanan&Zhang,Tianzhu&Cheng,Jian&Liu,Si&Yang,Yang&Hou,Zengguang.(2019).RGB-Infrared Cross-Modality Person Re-Identification via Joint Pixel and Feature Alignment.
[18]Sun,Y.,Zheng,L.,Yang,Y.,Tian,Q.,&Wang,S.(2018).Beyond Part Models:Person Retrieval with Refined Part Pooling.ArXiv,abs/1711.09349.
[19]Wang,Guanshuo&Yuan,Yufeng&Chen,Xiong&Li,Jiwei&Zhou,Xi.(2018).Learning Discriminative Features with Multiple Granularities for Person Re-Identification.
[20]Ge,Y.,Chen,D.,&Li,H.(2020).Mutual Mean-Teaching:Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification.ArXiv,abs/2001.01526.
Disclosure of the Invention
To remedy the defects of the prior art, a more effective method is needed to handle the difference between the two modalities, RGB and IR, so as to form a more reasonable feature space that facilitates subsequent detection and matching. Achieving this requires handling not only the similarity relations between different-identity samples within a single modality but also the similarity relations between same-identity samples across modalities, so that a query in one modality can ultimately retrieve the same person in the other modality.
The invention provides a multi-scale joint dual-stream network structure that processes heterogeneous samples of the two modalities simultaneously, extracts the modality-shared features of the samples, and fuses global-scale and local-scale features into a more discriminative representation; the design is reasonable, meets the modeling requirements, and works well.
In order to achieve the purpose, the invention adopts the technical scheme that:
a cross-modal pedestrian re-identification method with high accuracy comprises the following steps:
step 1, acquiring pedestrian video information under a real monitoring environment from a data set, preprocessing the whole segment of pedestrian video information to obtain a pedestrian image sample, intercepting key pedestrian images in the video and marking corresponding pedestrian identity information;
step 2, building a multi-scale combined double-current cross-modal depth network, initializing network parameters, using the pedestrian image sample obtained in the step 1 and pedestrian identity information as supervision information, performing supervised training on the double-current cross-modal depth network, after the training is finished, finely adjusting hyper-parameters in the double-current cross-modal depth network according to a final training effect, and fixing the network parameters;
and 3, taking the interested pedestrian target query as the input of the double-flow cross-modal depth network, giving a pedestrian target list with higher similarity with the query target by the double-flow cross-modal depth network, and searching the pedestrian target with the same identity according to the pedestrian target list by an operator from high similarity to low similarity to track the pedestrian.
In step 1, each preprocessed pedestrian image sample comprises an image showing the pedestrian's physical appearance, the identity label corresponding to the image sample, and the information of the original video sequence of the sample.
In step 1, the dataset is the SYSU-MM01 real-world dataset.
In step 2, the dual-stream cross-modal deep network uses ResNet-50 pre-trained on ImageNet, taken from the PyTorch model library, as its backbone network; the network is divided into a local branch and a global branch, each branch containing a dual-stream structure for processing samples of the two modalities.
In step 2, within the global branch, the layer0 part of ResNet-50 forms the dual-stream structure, while layers 1 to 4 share parameters between the two streams. The dual-stream part does not share parameters; it extracts the features of the two modalities separately and retains some modality-specific information, while the parameter-sharing part extracts the modality-shared features of the two modalities' samples, which are used for the subsequent optimization. The subsequent optimization comprises: reducing the dimensionality of the extracted features with a linear layer, which cuts the parameter count and computational load, and then optimizing the feature space on the final features with a hard-mining triplet loss and a cross-entropy loss, wherein the cross-entropy loss optimizes the intra-modality sample relations and the triplet loss optimizes the inter-modality sample relations.
The local branch contains two sub-branches that split the samples into three and six stripes, respectively. In the local branch's dual-stream structure, none of the ResNet-50 layers share parameters, which retains more modality-specific features. After the backbone, the sample features are split horizontally into three stripes corresponding to the head, upper body, and lower body, or into six stripes corresponding to finer body parts. The features of the two modalities' samples are concatenated after splitting and reduced in dimensionality by a parameter-sharing linear layer; each split feature block is then optimized against its objective functions independently. A cross-entropy function optimizes the intra-modality feature space, and a cluster-center-based objective function is introduced that uses the cluster centers of same-identity samples to pull same-identity samples closer and push different-identity samples apart.
In the final matching stage, the features of all local branches are concatenated with the global-branch features to form a more discriminative descriptor.
The invention has the beneficial effects that:
(1) Unlike existing cross-modal pedestrian re-identification methods, the method is a multi-scale joint feature-characterization method: it retains the coarse-grained representation of traditional methods and additionally fuses the corresponding fine-grained local features of the samples, improving the discriminative power of the representation.
(2) The invention introduces a cluster-center-based objective function that handles the inter-modality feature space more effectively; the new objective suits the optimization of local features without introducing corresponding negative effects, greatly improving the accuracy of the method. It handles modality differences effectively: the objective uses Euclidean distance as its metric and the mean of same-identity sample features as the cluster center. Minimizing the distance between centers pulls same-identity samples of the two modalities together, while a hyperparameter margin pushes different-identity samples of the two modalities apart, forming a good feature space.
(3) The invention introduces a heterogeneous dual-stream network structure: a network with more shared layers processes global information, and a network with fewer shared layers processes local information, each meeting the requirements of the objective functions of its branch, which further improves the cross-modal processing capability.
(4) The invention combines global coarse-grained features with local fine-grained features. The global features extracted by the global branch provide comprehensive sample information, while the local branch provides fine local features on the basis of feature splitting, compensating for the detailed information the global features lack and improving the discriminative capability of the model.
Drawings
Fig. 1 is a schematic structural diagram of a dual-stream cross-modal depth network.
Detailed Description
The present invention is further described below.
The high-accuracy cross-modal pedestrian re-identification method of the invention comprises the following steps.
Step 1, data preparation and formal definition: the raw data for pedestrian re-identification is usually surveillance video, from which the key pedestrian content must be cropped manually or algorithmically. Since the method is an image-based cross-modal pedestrian re-identification algorithm, an image detection-and-cropping algorithm serves as the front end: it crops pedestrian images from the video clips and labels the corresponding pedestrian identities to distinguish different pedestrian targets. The invention uses the SYSU-MM01 real-world dataset, which has not been deeply hand-curated, contains some noisy labels, and matches practical application scenarios. The full pedestrian video is preprocessed into pedestrian image samples: key pedestrian images are cropped from the video and labeled with the corresponding identities. Each preprocessed pedestrian image sample comprises an image showing the pedestrian's physical appearance, the identity label corresponding to the image sample, and the information of the original video sequence of the sample.
Step 2, building the multi-scale joint cross-modal dual-stream pedestrian re-identification network: build the network structure according to the model schematic in FIG. 1. Using the pedestrian image samples and identity labels obtained in step 1 as supervision information, train the dual-stream cross-modal deep network under supervision; after training, fine-tune the network's hyperparameters according to the final training results until a satisfactory result is reached, then fix the network parameters.
ResNet-50 pre-trained on ImageNet, from the PyTorch model library, is used as the backbone network of the model. The model is divided into local and global branches, each containing a dual-stream structure to handle samples of the two modalities. The dual-stream network consists of two parallel networks with differing structures and differing degrees of parameter sharing; handling the cross-modal task with a dual-stream structure preserves some modality-specific features, which benefits the subsequent optimization.
In the global branch, the layer0 part of ResNet-50 forms the dual-stream structure, and layers 1 to 4 share parameters between the streams. The dual-stream part does not share parameters; it extracts the features of the two modalities separately and retains some modality-specific information, while the parameter-sharing part aims to extract the modality-shared features of the two modalities' samples and to use these shared features for the subsequent optimization. A linear layer then reduces the dimensionality of the extracted features, cutting the parameter count and computational load; the final features are optimized with a hard-mining triplet loss and a cross-entropy loss, wherein the cross-entropy loss aims to optimize the intra-modality sample relations and the triplet loss aims to optimize the inter-modality sample relations. The batch-hard triplet loss is expressed as:
$$L_{tri}=\sum_{i=1}^{P}\sum_{a=1}^{K}\left[\,m+\max_{p=1\ldots K} D\!\left(f(x_a^i),f(x_p^i)\right)-\min_{\substack{j=1\ldots P,\;n=1\ldots K\\ j\neq i}} D\!\left(f(x_a^i),f(x_n^j)\right)\right]_{+}$$

where m ≥ 0 is the margin hyperparameter and [·]₊ = max(·, 0).
Here P is the number of pedestrian identity labels randomly selected per mini-batch and K the number of pedestrian samples selected per identity, so each mini-batch contains P × K samples in total. f denotes the feature-extraction operation performed by the model, and D is the metric: the Euclidean distance is used to judge the distance between two samples.
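The batch-hard triplet loss above can be sketched in plain NumPy for clarity; the function name and default margin are illustrative, and a real implementation would operate on GPU tensors inside the training loop.

```python
import numpy as np

def batch_hard_triplet_loss(feats, labels, margin=0.3):
    """Batch-hard triplet loss: for each anchor, take its hardest
    (farthest) positive and hardest (closest) negative in the batch."""
    # Pairwise Euclidean distances D(f_i, f_j)
    diff = feats[:, None, :] - feats[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)
    same = labels[:, None] == labels[None, :]
    loss = 0.0
    for a in range(len(feats)):
        hardest_pos = dist[a][same[a]].max()    # farthest same-identity sample
        hardest_neg = dist[a][~same[a]].min()   # closest different-identity sample
        loss += max(0.0, margin + hardest_pos - hardest_neg)
    return loss / len(feats)
```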
The cross entropy loss is expressed as:
$$L_{id}=-\sum_{i=1}^{P\times K}\log\frac{\exp\!\left(W_{y_i}^{\top} f_i\right)}{\sum_{k=1}^{C}\exp\!\left(W_{k}^{\top} f_i\right)}$$

where f_i denotes the feature of sample i extracted by the network, y_i is its identity label, C is the number of identities, and W_k is the corresponding weight vector for identity k, of the same dimension as f.
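A minimal NumPy sketch of this identity cross-entropy follows, assuming one classifier weight vector W_k per identity; the function name is illustrative.

```python
import numpy as np

def id_cross_entropy(feats, labels, W):
    """Identity (softmax) cross-entropy: the logit for class k is W_k^T f."""
    logits = feats @ W.T                          # (N, C)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```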
The local branch contains two sub-branches that split the samples into three and six stripes, respectively. Unlike the global branch, the local branch's dual-stream structure shares no ResNet-50 parameters between the streams, retaining more modality-specific features for the subsequent optimization. After the backbone, the sample features are split horizontally into three stripes corresponding to the head, upper body, and lower body, or into six stripes corresponding to finer body parts. The features of the two modalities' samples are concatenated after splitting and reduced in dimensionality by a parameter-sharing linear module; each split feature block is then optimized against its objective functions independently. The linear module performs three operations: linear-layer dimensionality reduction, ReLU re-activation, and regularization. As before, the cross-entropy function mainly optimizes the intra-modality feature space. A cluster-center-based objective function is introduced; its purpose is to use the cluster centers of same-identity samples to pull same-identity samples closer and push different-identity samples apart. The cluster-center-based objective is expressed as:
$$c_i^{V}=\frac{1}{K}\sum_{k=1}^{K} f\!\left(v_k^{i}\right),\qquad c_i^{I}=\frac{1}{K}\sum_{k=1}^{K} f\!\left(t_k^{i}\right)$$

$$L_{cc}=\sum_{i=1}^{P}\left\|c_i^{V}-c_i^{I}\right\|_2+\sum_{i=1}^{P}\sum_{\substack{j=1\\ j\neq i}}^{P}\left[\,\mu-\left\|c_i^{V}-c_j^{I}\right\|_2\right]_{+}$$

where v_k^i and t_k^i denote the k-th RGB and IR samples of identity i, respectively.
Here ‖·‖ denotes the Euclidean distance: summing and averaging the sample features yields the center of each same-identity sample cluster, and same-identity clusters are pulled together, while different-identity cluster centers are gradually pushed apart under the action of the hyperparameter μ.
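Since the original equation survives only as an image placeholder, the following NumPy sketch is one plausible reading of the cluster-center objective described in words above: same-identity RGB and IR centers are pulled together, and different-identity centers are pushed at least μ apart. The function name and the exact pairing of centers are assumptions.

```python
import numpy as np

def cluster_center_loss(feats_v, feats_i, labels, mu=0.5):
    """Cluster-center objective (sketch): pull the RGB and IR centers of
    the same identity together; push centers of different identities
    apart until they are at least `mu` away."""
    ids = np.unique(labels)
    cv = np.stack([feats_v[labels == i].mean(0) for i in ids])  # RGB centers
    ci = np.stack([feats_i[labels == i].mean(0) for i in ids])  # IR centers
    pull = np.linalg.norm(cv - ci, axis=1).sum()   # same identity, cross-modal
    push = 0.0
    for a in range(len(ids)):
        for b in range(len(ids)):
            if a != b:
                push += max(0.0, mu - np.linalg.norm(cv[a] - ci[b]))
    return pull + push
```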
In the final matching stage, the features of all local branches are concatenated with the global-branch features to form a more discriminative descriptor.
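The descriptor construction can be sketched as follows; the stripe counts follow the three- and six-block split described above, while the choice of average pooling and the function name are assumptions made here for illustration.

```python
import numpy as np

def build_descriptor(feat_map, num_stripes=(3, 6)):
    """Sketch of the final descriptor: average-pool the backbone feature
    map globally and over horizontal stripes, then concatenate."""
    C, H, W = feat_map.shape
    parts = [feat_map.mean(axis=(1, 2))]            # global coarse feature (C,)
    for n in num_stripes:
        for s in np.array_split(np.arange(H), n):   # horizontal stripes
            parts.append(feat_map[:, s, :].mean(axis=(1, 2)))
    return np.concatenate(parts)                    # (C * (1 + 3 + 6),)
```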
In the invention, the model is trained on the labeled pedestrian sample data for 80 epochs, with the learning rate initialized to 0.01 and decayed to 0.1 of its value every 20 epochs as training progresses. After the full training schedule, the trained model parameters are saved to facilitate the subsequent detection processing.
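The stated schedule (base rate 0.01, multiplied by 0.1 every 20 epochs) corresponds to a simple step decay, sketched here as a standalone function:

```python
def learning_rate(epoch, base_lr=0.01, step=20, gamma=0.1):
    """Step schedule from the training setup: lr = base_lr * gamma^(epoch // step)."""
    return base_lr * (gamma ** (epoch // step))
```

In PyTorch this is typically expressed as `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)`.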
Step 3: sort the pedestrian target images to be detected and select images with richer characteristics as query images for the trained model; the pedestrian sample images generated from all the surveillance videos serve as the gallery (detection set) for detection and matching. The model outputs the best-matching pedestrian samples, arranged from high to low similarity, and an operator finds the same-identity pedestrian targets in this list to track the pedestrian.
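Ranking the gallery by similarity to the query, as in step 3, reduces to sorting Euclidean distances between descriptors. A minimal sketch (function name illustrative):

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Rank gallery samples by ascending Euclidean distance to the query,
    i.e. from highest to lowest similarity."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    order = np.argsort(dists)   # gallery indices, most similar first
    return order, dists[order]
```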

Claims (5)

1. A high-accuracy cross-modal pedestrian re-identification method, characterized in that the method comprises the following steps:
step 1, acquiring pedestrian video recorded in a real surveillance environment from a dataset, preprocessing the full video into pedestrian image samples, cropping the key pedestrian images from the video, and labeling the corresponding pedestrian identities;
step 2, building a multi-scale joint dual-stream cross-modal deep network, initializing its parameters, performing supervised training of the dual-stream cross-modal deep network using the pedestrian image samples and identity labels obtained in step 1 as supervision information, and, after training, fine-tuning the network's hyperparameters according to the final training results and fixing the network parameters;
and step 3, feeding a query image of the pedestrian of interest to the dual-stream cross-modal deep network, the dual-stream cross-modal deep network returning a list of pedestrian targets ranked by similarity to the query, and an operator searching the list from high to low similarity for the same-identity pedestrian target to track the pedestrian.
2. The method of claim 1, wherein the cross-modal pedestrian re-identification method with high accuracy is characterized in that: in the first step, the preprocessed pedestrian image sample comprises an image containing the physical appearance characteristics of the pedestrian, pedestrian identity information corresponding to the image sample, and original video sequence information of the pedestrian image sample.
3. The high-accuracy cross-modal pedestrian re-identification method according to claim 1, characterized in that: in step 1, the data set is the SYSU-MM01 real-world data set.
4. The high-accuracy cross-modal pedestrian re-identification method according to claim 1, characterized in that: in step 2, the dual-stream cross-modal deep network uses ResNet-50, pre-trained on ImageNet and taken from the PyTorch model library, as the backbone network; the dual-stream cross-modal deep network is divided into a local branch and a global branch, and each branch contains a dual-stream structure for processing the sample features of the two modalities.
5. The high-accuracy cross-modal pedestrian re-identification method according to claim 4, characterized in that: in step 2, in the global branch, the layer0 part of ResNet-50 serves as the dual-stream structure, and the subsequent layer1 to layer4 serve as the parameter-sharing network structure; the dual-stream part does not share parameters, extracts the features of the two modalities separately, and retains part of the modality-specific feature information, while the parameter-sharing part extracts modality-shared features from the samples of the two different modalities, and these shared features are used for the subsequent optimization operations; the subsequent optimization operations comprise: reducing the dimension of the extracted features with a linear layer to reduce the number of model parameters and the computational load, and, after the final features are obtained, optimizing the feature space with a hard-sample triplet loss and a cross-entropy loss, wherein the cross-entropy loss optimizes the sample relations within each modality and the triplet loss optimizes the sample relations between modalities;
the local branch comprises two sub-branches, which split the samples into three blocks and six blocks respectively; in the dual-stream structure of the local branch, none of the layers of ResNet-50 share parameters, so more modality-specific features are retained; after the backbone, the sample features are horizontally divided: three blocks correspond to the head, upper body and lower body, while six blocks correspond to finer human body parts; the features of the two modal samples are concatenated after this cutting and fed into a parameter-sharing linear layer for dimension reduction, after which each cut feature block optimizes its objective function independently; a cross-entropy function is adopted to optimize the feature space within each modality, and an objective function based on cluster centers is introduced, using the cluster centers of same-identity samples to pull same-identity samples closer while pushing different-identity samples farther apart;
in the final matching and detection stage, the features of all local branches are concatenated with the features of the global branch to form a more discriminative descriptor.
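The dual-stream idea in claims 4 and 5 — a modality-specific stem per input stream feeding a parameter-sharing trunk — can be illustrated with a deliberately miniature sketch. The layer sizes and weight matrices below are hypothetical stand-ins: the two stems play the role of ResNet-50's non-shared layer0 copies, and the single trunk plays the role of the shared layer1–layer4.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical miniature analogue of the patent's dual-stream global branch:
# each modality has its own (non-shared) stem, while one trunk is shared.
W_stem_rgb = rng.standard_normal((8, 4))   # stem for visible-light samples
W_stem_ir  = rng.standard_normal((8, 4))   # stem for infrared samples
W_trunk    = rng.standard_normal((3, 8))   # parameters shared by both modalities

def embed(x: np.ndarray, modality: str) -> np.ndarray:
    """Map a raw sample to a modality-shared feature space."""
    stem = W_stem_rgb if modality == "rgb" else W_stem_ir
    h = np.maximum(stem @ x, 0.0)          # modality-specific features (ReLU)
    return W_trunk @ h                     # projection by the shared trunk

x = np.ones(4)                             # the same raw input, fed to each stream
f_rgb = embed(x, "rgb")
f_ir = embed(x, "ir")
```

Because the stems are not shared, the same input produces different intermediate features per modality, while the shared trunk keeps both embeddings in one comparable feature space — which is what makes cross-modal matching possible.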
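The hard-sample triplet loss named in claim 5 is commonly implemented as a "batch-hard" objective: for each anchor, only the farthest same-identity sample and the nearest different-identity sample are penalised. The sketch below is a plain-numpy illustration of that general technique, not the patent's exact formulation; the margin value and toy features are assumptions.

```python
import numpy as np

def batch_hard_triplet_loss(feats: np.ndarray, labels, margin: float = 0.3) -> float:
    """Batch-hard triplet loss over one mini-batch of feature vectors.

    For each anchor, select its hardest positive (farthest same-identity
    sample) and hardest negative (nearest different-identity sample), and
    penalise when the positive is not closer by at least `margin`.
    """
    labels = np.asarray(labels)
    # (N, N) pairwise Euclidean distances between all features
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=2)
    same = labels[:, None] == labels[None, :]
    n = len(labels)
    losses = []
    for i in range(n):
        pos = d[i][same[i] & (np.arange(n) != i)]  # same identity, excluding self
        neg = d[i][~same[i]]                       # all other identities
        losses.append(max(0.0, pos.max() - neg.min() + margin))
    return float(np.mean(losses))

labels = [0, 0, 1, 1]
# identities well separated -> hinge never activates
feats_good = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.1, 0.0]])
# identities interleaved -> every anchor is penalised
feats_bad = np.array([[0.0, 0.0], [1.0, 0.0], [0.1, 0.0], [1.1, 0.0]])
loss_good = batch_hard_triplet_loss(feats_good, labels)
loss_bad = batch_hard_triplet_loss(feats_bad, labels)
```

In a cross-modal batch the anchor and its hardest positive/negative can come from different modalities, which is how this loss "optimizes the sample relations between modalities" as the claim states.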
CN202110243887.7A 2021-03-05 2021-03-05 Cross-mode pedestrian re-identification method with high accuracy Active CN112906605B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110243887.7A CN112906605B (en) 2021-03-05 2021-03-05 Cross-mode pedestrian re-identification method with high accuracy


Publications (2)

Publication Number Publication Date
CN112906605A true CN112906605A (en) 2021-06-04
CN112906605B CN112906605B (en) 2024-02-20

Family

ID=76108318

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110243887.7A Active CN112906605B (en) 2021-03-05 2021-03-05 Cross-mode pedestrian re-identification method with high accuracy

Country Status (1)

Country Link
CN (1) CN112906605B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113610071A (en) * 2021-10-11 2021-11-05 深圳市一心视觉科技有限公司 Face living body detection method and device, electronic equipment and storage medium
CN113705489A (en) * 2021-08-31 2021-11-26 中国电子科技集团公司第二十八研究所 Remote sensing image fine-grained airplane identification method based on priori regional knowledge guidance
CN113887382A (en) * 2021-09-29 2022-01-04 合肥工业大学 Cross-modal pedestrian re-identification method based on RGB-D, storage medium and equipment
CN113963150A (en) * 2021-11-16 2022-01-21 北京中电兴发科技有限公司 Pedestrian re-identification method based on multi-scale twin cascade network
CN114998925A (en) * 2022-04-22 2022-09-02 四川大学 Robust cross-modal pedestrian re-identification method facing twin noise label
CN115859175A (en) * 2023-02-16 2023-03-28 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Hydraulic shock absorber equipment abnormity detection method based on cross-mode generative learning
CN113705489B (en) * 2021-08-31 2024-06-07 中国电子科技集团公司第二十八研究所 Remote sensing image fine-granularity airplane identification method based on priori regional knowledge guidance

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200285896A1 (en) * 2019-03-09 2020-09-10 Tongji University Method for person re-identification based on deep model with multi-loss fusion training strategy
CN111931637A (en) * 2020-08-07 2020-11-13 华南理工大学 Cross-modal pedestrian re-identification method and system based on double-current convolutional neural network



Also Published As

Publication number Publication date
CN112906605B (en) 2024-02-20

Similar Documents

Publication Publication Date Title
CN112906605B (en) Cross-mode pedestrian re-identification method with high accuracy
CN109961051B (en) Pedestrian re-identification method based on clustering and block feature extraction
Lin et al. RSCM: Region selection and concurrency model for multi-class weather recognition
Wang et al. A survey of vehicle re-identification based on deep learning
Huang et al. Multi-graph fusion and learning for RGBT image saliency detection
CN111666851B (en) Cross domain self-adaptive pedestrian re-identification method based on multi-granularity label
CN105528794A (en) Moving object detection method based on Gaussian mixture model and superpixel segmentation
Tang et al. Multi-modal metric learning for vehicle re-identification in traffic surveillance environment
Ren et al. A novel squeeze YOLO-based real-time people counting approach
Zheng et al. Robust multi-modality person re-identification
Liu et al. An end-to-end deep model with discriminative facial features for facial expression recognition
CN112801019B (en) Method and system for eliminating re-identification deviation of unsupervised vehicle based on synthetic data
Ren et al. Parallel RCNN: A deep learning method for people detection using RGB-D images
CN114511878A (en) Visible light infrared pedestrian re-identification method based on multi-modal relational polymerization
Shi et al. An underground abnormal behavior recognition method based on an optimized alphapose-st-gcn
CN103577804A (en) Abnormal human behavior identification method based on SIFT flow and hidden conditional random fields
Yin Object Detection Based on Deep Learning: A Brief Review
Muchtar et al. A unified smart surveillance system incorporating adaptive foreground extraction and deep learning-based classification
Mursalin et al. Deep learning for 3D ear detection: A complete pipeline from data generation to segmentation
Lu et al. Multimode gesture recognition algorithm based on convolutional long short-term memory network
Sarker et al. Transformer-Based Person Re-Identification: A Comprehensive Review
Zou et al. Research on human movement target recognition algorithm in complex traffic environment
Zhang et al. Two-stage domain adaptation for infrared ship target segmentation
Song et al. Application and evaluation of image-based information acquisition in railway transportation
Zhu et al. Learning camera invariant deep features for semi-supervised person re-identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant