CN114022686A - Pedestrian re-identification method oriented to occlusion scene - Google Patents

Pedestrian re-identification method oriented to occlusion scene Download PDF

Info

Publication number
CN114022686A
Authority
CN
China
Prior art keywords
features
global
loss
pedestrian
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111484998.3A
Other languages
Chinese (zh)
Inventor
王蓉
孙义博
张文靖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PEOPLE'S PUBLIC SECURITY UNIVERSITY OF CHINA
Original Assignee
PEOPLE'S PUBLIC SECURITY UNIVERSITY OF CHINA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PEOPLE'S PUBLIC SECURITY UNIVERSITY OF CHINA filed Critical PEOPLE'S PUBLIC SECURITY UNIVERSITY OF CHINA
Priority to CN202111484998.3A priority Critical patent/CN114022686A/en
Publication of CN114022686A publication Critical patent/CN114022686A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a pedestrian re-identification method for occlusion scenes. A global contrast pooling module is introduced into the global feature branch to fuse the average-pooled and max-pooled features and extract global features that are more robust to background noise and occlusion. A One-vs-rest relation module is introduced into the local feature branch to associate each local feature with the remaining local features and extract local features that contain global information. In the metric learning stage, the training of the model is supervised by combining three loss functions, namely cross-entropy loss, hard sample sampling triplet loss and center loss, so that the network can extract more discriminative pedestrian features. Finally, the method is evaluated on the Occluded-DukeMTMC dataset, which fully demonstrates its effectiveness and advancement in addressing the occlusion problem in pedestrian re-identification.

Description

Pedestrian re-identification method oriented to occlusion scene
Technical Field
The invention relates to a pedestrian re-identification method, in particular to a pedestrian re-identification method facing an occlusion scene.
Background
Pedestrian re-identification is a popular and challenging research topic in computer vision. Given a query image of a specific pedestrian, computer vision techniques are used to determine whether that pedestrian appears under other non-overlapping cameras, which largely compensates for the limited field of view of a single fixed camera. It is usually combined with technologies such as pedestrian detection and pedestrian tracking and applied in fields such as pattern recognition.
Compared with traditional pedestrian re-identification methods, methods based on deep learning can extract more discriminative feature descriptions through automatic learning and simple similarity measurement, and achieve better performance. However, in practical application scenarios, pedestrian re-identification still faces various problems: 1) variable pedestrian postures caused by the angle and position of the camera; 2) changes in image resolution caused by the shooting distance; 3) cross-modal re-identification caused by image-domain variation; 4) occlusion. In occlusion scenes, pedestrians captured by the camera are often blocked by obstacles (such as luggage, counters, other people, cars and trees), or part of the pedestrian's body leaves the camera's field of view, so the identity information contained in the pedestrian image is reduced while noise from the occluded region interferes with matching, making it easy to match the target pedestrian with the wrong person. Therefore, accurately matching pedestrian images in which only part of the body is visible, efficiently exploiting the features of the non-occluded regions and the relationships among local features, and reducing the noise interference of the occluded regions are the keys to improving pedestrian re-identification accuracy in occlusion scenes.
At present, a common way for the academic community to address occlusion in pedestrian re-identification is to detect human-body key points with a pose estimator and use them as auxiliary information to guide the model to focus on the visible regions of the image, so as to distinguish the occluded and non-occluded regions of a pedestrian image and reduce the noise interference of the occluded regions. However, these methods usually use the local features of the pedestrian directly and do not consider the relationships among body parts, so two different pedestrians with similar features at the corresponding parts are easily confused. In addition, some methods supervise the training of the model with only a single loss function, so the class information of the samples is not fully exploited and mined, which ultimately reduces re-identification accuracy. Therefore, the invention provides a pedestrian re-identification method for occlusion scenes.
Disclosure of Invention
In order to overcome the shortcomings of the existing technology, the invention provides a pedestrian re-identification method for occlusion scenes. A global contrast pooling module is introduced into the global feature branch to fuse the average-pooled and max-pooled features and extract global features that are more robust to background noise and occlusion; a One-vs-rest relation module is introduced into the local feature branch to associate each local feature with the remaining local features and extract local features that contain global information; and in the metric learning stage, the training of the model is supervised by combining three loss functions, namely cross-entropy loss, hard sample sampling triplet loss and center loss, so that the network can extract more discriminative pedestrian features.
In order to solve the technical problems, the invention adopts the technical scheme that: a pedestrian re-identification method facing an occlusion scene comprises the following steps:
step S1: generating human-body key points by means of a pre-trained pose estimator and combining them with the global features, so that the model pays more attention to the non-occluded pedestrian regions and the noise interference caused by the occluded regions is reduced;
step S2: introducing a global contrast pooling module, so that the noise interference caused by background clutter and occlusion is reduced and the information of the whole pedestrian body region is better expressed;
step S3: performing deeper feature extraction on the local block features through a One-vs-rest relation module, so that the features at each local level contain the information of the corresponding part and of the other body parts, and the relationships among body parts are better reflected;
step S4: supervising the model with multi-loss joint training, so that the model ensures the accuracy of the predicted labels while taking inter-class dispersion and intra-class compactness into account.
Step S1 specifically includes the following steps:
step S11: inputting a pedestrian picture, namely the original picture, and generating the human-body key points $M_i$, the coordinates $(x_i, y_i)$ of each key point and the corresponding confidence $c_i$ through a pre-trained pose estimator;
step S12: removing key points with low confidence through a filtering mechanism, namely keeping the coordinates of key points whose confidence $c_i$ is greater than the threshold $\theta$ and removing the coordinates of key points whose confidence $c_i$ is smaller than the threshold $\theta$, the filtering mechanism being shown in formula 1:

$$M_i = \begin{cases} (x_i,\, y_i), & c_i > \theta \\ 0, & c_i \le \theta \end{cases}, \qquad i = 1, 2, \dots, N \tag{1}$$

wherein $M_i$, $(x_i, y_i)$ and $c_i$ respectively denote the i-th key point, its coordinates and its confidence, $\theta$ denotes the threshold of the filtering mechanism, and $N$ denotes the number of key points;
step S13: mapping the key points onto the original picture to generate heat maps, down-sampling them by bilinear interpolation, and combining them with the global features to form the pose-guided features;
step S14: concatenating the pose-guided features obtained through average pooling and maximum pooling with the average-pooled global features from the local feature branch, and reducing the dimension through a fully connected layer;
step S15: performing label prediction on the extracted global features through a fully connected layer and a softmax layer.
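For illustration, a minimal PyTorch-style sketch of the key-point filtering (formula 1) and the pose-guided feature fusion of steps S13 to S15 is given below; the function names, tensor shapes and the threshold value are assumptions of the sketch rather than the exact implementation of the invention, and the final fully connected reduction and softmax classification are only indicated in comments.

```python
import torch
import torch.nn.functional as F

def filter_keypoints(conf, theta=0.2):
    """Formula 1: keep only key points whose confidence exceeds the threshold theta."""
    return conf > theta                                   # boolean visibility mask, shape (N,)

def pose_guided_features(feat_map, heatmaps, mask):
    """Steps S13-S14: fuse key-point heat maps with the global feature map."""
    B, C, H, W = feat_map.shape
    # Down-sample the heat maps to the feature-map resolution by bilinear interpolation
    hm = F.interpolate(heatmaps, size=(H, W), mode='bilinear', align_corners=False)
    hm = hm * mask.view(1, -1, 1, 1).float()              # zero out low-confidence key points
    guided = feat_map.unsqueeze(1) * hm.unsqueeze(2)      # (B, N, C, H, W) pose-guided maps
    avg = guided.flatten(3).mean(-1)                      # average pooling -> (B, N, C)
    mx = guided.flatten(3).amax(-1)                       # maximum pooling -> (B, N, C)
    global_avg = feat_map.flatten(2).mean(-1)             # average-pooled global feature (B, C)
    fused = torch.cat([avg.flatten(1), mx.flatten(1), global_avg], dim=1)
    return fused  # a fully connected layer would reduce this vector before the softmax classifier
```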
Step S2 specifically includes the following steps:
step S21: the local block features $P_1 \sim P_6$ are passed through global maximum pooling and global average pooling to obtain $P_{max}$ and $P_{avg}$, and the two are subtracted to obtain the contrast feature $P_{cont}$, the calculation process being shown in formula 2:

$$P_{cont} = P_{max} - P_{avg} \tag{2}$$

step S22: $P_{max}$ and $P_{cont}$ are each reduced in dimension through a sub-network consisting of a 1×1 convolution layer, a BN normalization layer and a ReLU activation function layer to obtain the features $\tilde{P}_{max}$ and $\tilde{P}_{cont}$; after concatenation, the result is fed to the sub-network again to be reduced from 2c dimensions to c dimensions, and finally added to $\tilde{P}_{max}$ to obtain the representative global feature $Q_0$, the specific implementation process being shown in formula 3:

$$Q_0 = CBR\big(Concat(\tilde{P}_{max},\, \tilde{P}_{cont})\big) + \tilde{P}_{max}, \qquad \tilde{P}_{max} = CBR(P_{max}),\ \tilde{P}_{cont} = CBR(P_{cont}) \tag{3}$$

wherein $Q_0$ denotes the global contrast feature, CBR denotes the sub-network consisting of the 1×1 convolution layer, the BN normalization layer and the ReLU activation function layer, and Concat(·) denotes the concatenation operation.
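For illustration, the global contrast pooling of formulas 2 and 3 can be sketched in PyTorch as follows; stacking the six block features as a (B, C, 6, 1) tensor, together with the class and parameter names, is an illustrative assumption rather than the exact implementation.

```python
import torch
import torch.nn as nn

def cbr(in_ch, out_ch):
    """CBR sub-network: 1x1 convolution + BN normalization + ReLU activation."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class GlobalContrastPooling(nn.Module):
    """Global contrast pooling (formulas 2-3): fuses the max- and average-pooled
    part features into one global contrast feature Q_0."""
    def __init__(self, channels, reduced):
        super().__init__()
        self.cbr_max = cbr(channels, reduced)
        self.cbr_cont = cbr(channels, reduced)
        self.cbr_fuse = cbr(2 * reduced, reduced)

    def forward(self, parts):
        # parts: (B, C, P, 1) stack of the P local block features P_1..P_P
        p_max = parts.max(dim=2, keepdim=True).values   # global maximum pooling
        p_avg = parts.mean(dim=2, keepdim=True)         # global average pooling
        p_cont = p_max - p_avg                          # contrast feature (formula 2)
        t_max, t_cont = self.cbr_max(p_max), self.cbr_cont(p_cont)
        fused = self.cbr_fuse(torch.cat([t_max, t_cont], dim=1))  # 2c -> c
        return fused + t_max                            # global contrast feature Q_0 (formula 3)
```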
Step S3 specifically includes the following steps:
step S31: taking $P_i$ as an example, global average pooling is performed on the local features $P_j$ other than $P_i$ to obtain $R_i$, the calculation process being shown in formula 4:

$$R_i = \frac{1}{5}\sum_{\substack{j=1 \\ j \ne i}}^{6} P_j \tag{4}$$

wherein $P_j$ denotes the local block features other than the i-th one;
step S32: $P_i$ and $R_i$ are each reduced in dimension through a sub-network consisting of a 1×1 convolution layer, a BN normalization layer and a ReLU activation function layer to obtain the features $\tilde{P}_i$ and $\tilde{R}_i$; after concatenation, the result is fed to the sub-network again to be reduced from 2c dimensions to c dimensions, and finally integrated with $\tilde{P}_i$ to obtain the local feature $Q_i$ with global ties, the specific implementation process being shown in formula 5:

$$Q_i = CBR\big(Concat(\tilde{P}_i,\, \tilde{R}_i)\big) + \tilde{P}_i, \qquad \tilde{P}_i = CBR(P_i),\ \tilde{R}_i = CBR(R_i) \tag{5}$$

wherein $Q_i$ denotes the local feature with global ties, CBR denotes the sub-network consisting of the 1×1 convolution layer, the BN normalization layer and the ReLU activation function layer, and Concat(·) denotes the concatenation operation.
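Similarly, the One-vs-rest relation module of formulas 4 and 5 can be sketched in PyTorch as follows; sharing one CBR sub-network across all parts and passing the parts as a list of tensors are simplifying assumptions of the sketch.

```python
import torch
import torch.nn as nn

def cbr(in_ch, out_ch):
    """CBR sub-network: 1x1 convolution + BN normalization + ReLU activation."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class OneVsRestRelation(nn.Module):
    """One-vs-rest relation module (formulas 4-5): relates each part feature P_i
    to the average R_i of the remaining parts."""
    def __init__(self, channels, reduced):
        super().__init__()
        self.cbr_part = cbr(channels, reduced)
        self.cbr_rest = cbr(channels, reduced)
        self.cbr_fuse = cbr(2 * reduced, reduced)

    def forward(self, parts):
        # parts: list of P tensors, each of shape (B, C, 1, 1)
        outputs = []
        for i, p_i in enumerate(parts):
            rest = [p for j, p in enumerate(parts) if j != i]
            r_i = torch.stack(rest, dim=0).mean(dim=0)   # average of the other parts (formula 4)
            t_p, t_r = self.cbr_part(p_i), self.cbr_rest(r_i)
            q_i = self.cbr_fuse(torch.cat([t_p, t_r], dim=1)) + t_p   # formula 5
            outputs.append(q_i)
        return outputs
```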
Step S4 specifically includes the following steps:
step S41: the pose-guided features are constrained with a cross-entropy loss function, i.e. the difference between the predicted label and the ground-truth label is calculated, the calculation process being shown in formula 6 and formula 7:

$$L_{ID\_loss} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{K} \mathbb{1}\{y_n = c\}\,\log \hat{y}_{n,c} \tag{6}$$

$$\hat{y}_{n,c} = \frac{\exp(\tilde{q}_{n,c})}{\sum_{k=1}^{K}\exp(\tilde{q}_{n,k})} \tag{7}$$

wherein $L_{ID\_loss}$ denotes the cross-entropy loss function, $K$ denotes the number of classes, $c$ denotes a class, $\tilde{q}_{n,c}$ denotes the output value of feature $q_n$ for class $c$ after the fully connected classification layer, $N$ and $y_n$ respectively denote the number of input images in a mini-batch and the ground-truth label, and $\hat{y}_{n,c}$ denotes the predicted label of feature $q_n$;
step S42: the global contrast features and the local relation features are constrained with three loss functions, namely the cross-entropy loss, the hard sample sampling triplet loss and the center loss.
The calculation process of the hard sample sampling triplet loss function is shown in formula 8:

$$L_{TriHard} = \frac{1}{N_K N_M}\sum_{k=1}^{N_K}\sum_{m=1}^{N_M}\left[\alpha + \max_{n=1,\dots,N_M} d\!\left(x^{a}_{k,m},\, x^{p}_{k,n}\right) - \min_{\substack{l=1,\dots,N_K,\ l\ne k \\ n=1,\dots,N_M}} d\!\left(x^{a}_{k,m},\, x^{n}_{l,n}\right)\right]_{+} \tag{8}$$

wherein $N_K$ denotes the number of entities in a mini-batch, $N_M$ denotes the number of images of each entity, $\alpha$ denotes a threshold parameter that controls the distance between positive and negative sample pairs in the feature space, $x^{a}_{k,m}$, $x^{p}_{k,n}$ and $x^{n}_{l,n}$ respectively denote the anchor picture, the positive sample and the negative sample, $k$ and $l$ denote entity indices, $m$ and $n$ denote image indices, and $d(\cdot,\cdot)$ denotes the distance in the feature space.
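A compact batch-hard implementation consistent with formula 8 can be sketched as follows; the Euclidean distance and the default margin value are assumptions of the sketch.

```python
import torch

def trihard_loss(feats, labels, margin=0.3):
    """Batch-hard triplet loss (formula 8): for every anchor, take the farthest
    positive and the closest negative within the mini-batch."""
    dist = torch.cdist(feats, feats, p=2)                        # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)            # same-identity mask
    hardest_pos = (dist * same.float()).max(dim=1).values        # hardest positive per anchor
    hardest_neg = (dist + same.float() * 1e6).min(dim=1).values  # hardest negative per anchor
    return torch.relu(margin + hardest_pos - hardest_neg).mean()
```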
The calculation process of the center loss function is shown in formula 9:

$$L_{Center\_loss} = \frac{1}{2}\sum_{i=1}^{m}\left\| f_i - c_{y_i} \right\|_2^2 \tag{9}$$

wherein $m$ denotes the batch size, $f_i$ and $y_i$ respectively denote the feature and the label of the i-th picture in the batch, and $c_{y_i}$ denotes the feature center of the sample data belonging to class $y_i$.
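A learnable-center implementation of formula 9 could look like the following sketch; the random initialization of the centers and the class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Center loss (formula 9): penalizes the squared distance between each
    feature and the learnable center of its class."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):
        # feats: (m, feat_dim), labels: (m,)
        diff = feats - self.centers[labels]   # distance to the corresponding class center
        return 0.5 * diff.pow(2).sum()        # summed over the batch as in formula 9
```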
step S43: the loss functions of steps S41 and S42 are combined to jointly supervise the training of the model, so that the inter-class distance and the intra-class distance of the samples are constrained while the classification error is reduced, the network learns more discriminative features, and the generalization ability of the model is improved; the calculation process is shown in formula 10, formula 11 and formula 12:

$$L_{ML} = \lambda L_{GCLR} + (1-\lambda) L_{PGF} \tag{10}$$

$$L_{PGF} = L_{ID\_loss} \tag{11}$$

$$L_{GCLR} = L_{TriHard} + \alpha L_{ID\_loss} + \beta L_{Center\_loss} \tag{12}$$

wherein $L_{ML}$ denotes the joint loss function, $L_{PGF}$ denotes the cross-entropy loss applied to the pose-guided features, $L_{GCLR}$ denotes the three losses applied to the global contrast and local relation features, $L_{TriHard}$ denotes the hard sample sampling triplet loss function, $L_{ID\_loss}$ denotes the cross-entropy loss function, $L_{Center\_loss}$ denotes the center loss function, and $\alpha$, $\beta$ and $\lambda$ denote weighting coefficients that balance the respective loss terms.
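The joint supervision of formulas 10 to 12 can then be expressed as a thin wrapper around the individual losses, as in the sketch below; it assumes a standard cross-entropy loss for L_ID_loss and loss callables such as the trihard_loss and CenterLoss sketches above, and the default coefficients follow the values reported in the embodiment (2, 5e-4 and 0.2).

```python
import torch.nn as nn

id_loss = nn.CrossEntropyLoss()   # L_ID_loss

def joint_loss(logits_pgf, logits_gclr, feats_gclr, labels,
               trihard_loss, center_loss, alpha=2.0, beta=5e-4, lam=0.2):
    """Joint supervision: L_ML = lam * L_GCLR + (1 - lam) * L_PGF (formulas 10-12)."""
    l_pgf = id_loss(logits_pgf, labels)                      # formula 11
    l_gclr = (trihard_loss(feats_gclr, labels)
              + alpha * id_loss(logits_gclr, labels)
              + beta * center_loss(feats_gclr, labels))      # formula 12
    return lam * l_gclr + (1 - lam) * l_pgf                  # formula 10
```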
According to the invention, the global feature branch generates human-body key points by means of a pre-trained pose estimator and maps them into the original feature map to generate heat maps, so that, combined with the global features, the model focuses more on the non-occluded pedestrian regions and the noise interference caused by the occluded regions is reduced; a global contrast pooling module is introduced, so that the noise interference caused by background clutter and occlusion is reduced and the information of the whole pedestrian body region is better expressed. In the local feature branch, the feature map generated by the backbone network is horizontally partitioned, local block features are obtained after maximum pooling, and deeper feature extraction is then performed on them through a One-vs-rest relation module, so that the features at each local level contain the information of the corresponding part and of the other body parts, and the relationships among body parts are better reflected. For metric learning, the model is supervised with multi-loss joint training, so that it ensures the accuracy of the predicted labels while taking inter-class dispersion and intra-class compactness into account.
Drawings
Fig. 1 is a diagram of a network model architecture of the present invention.
FIG. 2 is a schematic diagram of pose keypoint generation according to the present invention.
FIG. 3 is a heat map of the present invention.
FIG. 4 is a diagram of the global contrast pooling process of the present invention.
FIG. 5 is a schematic diagram of One-vs-rest relationship module according to the present invention.
FIG. 6 is a diagram of a multiple loss fusion process of the present invention.
FIG. 7 is a CMC graph of the pedestrian re-identification method based on feature association improvement on Occluded-DukeMTMC data set in accordance with the present invention.
FIG. 8 is a CMC graph of the pedestrian re-identification method based on feature association and multi-loss fusion on Occluded-DukeMTMC data set.
FIG. 9 is a graph of the visual ranking results of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
As shown in fig. 1, the pedestrian re-identification method facing the occlusion scene includes two branches:
1. Global feature branch: human-body key points are generated by a pre-trained pose estimator and mapped into the original feature map; combined with the global features, this makes the model pay more attention to the non-occluded pedestrian regions and reduces the noise interference caused by the occluded regions. A global contrast pooling module is introduced to reduce the noise interference caused by background clutter and occlusion and to better express the information of the whole pedestrian body region;
2. local feature branching: the local part block features are extracted more deeply through a One-vs-rest relation module, so that the features of each local layer can contain the information of the corresponding part and other body parts, and the relation among the body parts is better reflected; for metric learning, a model is supervised by adopting a mode of multi-loss function joint training, so that the model ensures the accuracy of a predicted label and considers the inter-class dispersion and the intra-class compactness.
The invention improves and optimizes a baseline method (PGFA); the original model and the improved models are trained and tested on the dataset, so that the improvement with the highest recognition accuracy can be selected. The recognition accuracy is reflected in two evaluation indexes, CMC and mAP.
The invention is further illustrated by the following examples.
As shown in fig. 2, a pre-trained AlphaPose estimator is used to generate the human-body key points of the input picture together with the coordinates and confidence of each key point, and key points with low confidence are filtered out, the filtering mechanism being shown in formula 1. Next, the key points are mapped onto the original picture to generate heat maps, which are down-sampled by bilinear interpolation and combined with the global features to form the pose-guided features, as shown in FIG. 3. The pose-guided features obtained after average pooling and maximum pooling are then concatenated with the average-pooled global features from the local feature branch, and the dimension is reduced through a fully connected layer. Finally, label prediction is performed on the extracted global features through a fully connected layer and a softmax layer (the output layer of the neural network).
$$M_i = \begin{cases} (x_i,\, y_i), & c_i > \theta \\ 0, & c_i \le \theta \end{cases}, \qquad i = 1, 2, \dots, N \tag{1}$$

wherein $M_i$, $(x_i, y_i)$ and $c_i$ respectively denote the i-th key point, its coordinates and its confidence, $\theta$ denotes the threshold of the filtering mechanism, and $N$ denotes the number of key points.
as shown in fig. 4, the local blocking feature P1~P6Obtaining P through global maximum pooling and global average poolingmaxAnd PavgSubtracting the two to obtain a difference characteristic PcontThe calculation process is shown in formula 2:
Pcont=Pmax-Pavgequation 2
PmaxAnd PcontThe two parts of features are respectively reduced in dimension through a sub-network consisting of a 1 × 1 convolution layer, a BN (batch normalization) normalization layer and a ReLU (rectified Linear Unit) activation function layer to obtain the features
Figure RE-GDA0003454395450000091
After splicing, the two are conveyed to a sub-network again to be reduced from 2c dimension to c dimension, and finally, the two are connected with
Figure RE-GDA0003454395450000092
Adding to obtain a representative global feature Q0The specific implementation process is shown in formula 3:
Figure RE-GDA0003454395450000093
wherein Q is0Represents the global contrast feature, CBR represents the sub-network consisting of the 1 × 1 convolution layer, the BN normalization layer, and the ReLU activation function layer, and Concat (·) represents the splicing operation.
As shown in fig. 5, the local part block features are subjected to deeper feature extraction through the One-vs-rest relation module, so that the features of each local layer can contain information of the corresponding part and other body parts, and the relationship among the body parts is better reflected.
Taking $P_i$ as an example, global average pooling is performed on the local features $P_j$ other than $P_i$ to obtain $R_i$, the calculation process being shown in formula 4:

$$R_i = \frac{1}{5}\sum_{\substack{j=1 \\ j \ne i}}^{6} P_j \tag{4}$$

wherein $P_j$ denotes the local block features other than the i-th one.
$P_i$ and $R_i$ are each reduced in dimension through a sub-network consisting of a 1×1 convolution layer, a BN normalization layer and a ReLU activation function layer to obtain the features $\tilde{P}_i$ and $\tilde{R}_i$; after concatenation, the result is fed to the sub-network again to be reduced from 2c dimensions to c dimensions, and finally integrated with $\tilde{P}_i$ to obtain the local feature $Q_i$ with global ties, the specific implementation process being shown in formula 5:

$$Q_i = CBR\big(Concat(\tilde{P}_i,\, \tilde{R}_i)\big) + \tilde{P}_i, \qquad \tilde{P}_i = CBR(P_i),\ \tilde{R}_i = CBR(R_i) \tag{5}$$

wherein $Q_i$ denotes the local relation feature with global ties, CBR denotes the sub-network consisting of the 1×1 convolution layer, the BN normalization layer and the ReLU activation function layer, and Concat(·) denotes the concatenation operation.
As shown in fig. 6, the training of the model is supervised through joint training with three loss functions, namely cross-entropy loss, hard sample sampling triplet loss and center loss, so that the inter-class distance and the intra-class distance of the samples are constrained while the classification error is reduced, the network learns more discriminative features, and the generalization ability of the model is improved. The calculation process of the cross-entropy loss function is shown in formula 6 and formula 7:
$$L_{ID\_loss} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{K} \mathbb{1}\{y_n = c\}\,\log \hat{y}_{n,c} \tag{6}$$

$$\hat{y}_{n,c} = \frac{\exp(\tilde{q}_{n,c})}{\sum_{k=1}^{K}\exp(\tilde{q}_{n,k})} \tag{7}$$

wherein $L_{ID\_loss}$ denotes the cross-entropy loss function, $K$ denotes the number of classes, $c$ denotes a class, $\tilde{q}_{n,c}$ denotes the output value of feature $q_n$ for class $c$ after the fully connected classification layer, $N$ and $y_n$ respectively denote the number of input images in a mini-batch and the ground-truth label, and $\hat{y}_{n,c}$ denotes the predicted label of feature $q_n$;
the calculation process of the hard sample sampling triplet loss function is shown in formula 8:

$$L_{TriHard} = \frac{1}{N_K N_M}\sum_{k=1}^{N_K}\sum_{m=1}^{N_M}\left[\alpha + \max_{n=1,\dots,N_M} d\!\left(x^{a}_{k,m},\, x^{p}_{k,n}\right) - \min_{\substack{l=1,\dots,N_K,\ l\ne k \\ n=1,\dots,N_M}} d\!\left(x^{a}_{k,m},\, x^{n}_{l,n}\right)\right]_{+} \tag{8}$$

wherein $N_K$ denotes the number of entities in a mini-batch, $N_M$ denotes the number of images of each entity, $\alpha$ denotes a threshold parameter that controls the distance between positive and negative sample pairs in the feature space, $x^{a}_{k,m}$, $x^{p}_{k,n}$ and $x^{n}_{l,n}$ respectively denote the anchor picture, the positive sample and the negative sample, $k$ and $l$ denote entity indices, $m$ and $n$ denote image indices, and $d(\cdot,\cdot)$ denotes the distance in the feature space;
the calculation process of the center loss function is shown in formula 9:

$$L_{Center\_loss} = \frac{1}{2}\sum_{i=1}^{m}\left\| f_i - c_{y_i} \right\|_2^2 \tag{9}$$

wherein $m$ denotes the batch size, $f_i$ and $y_i$ respectively denote the feature and the label of the i-th picture in the batch, and $c_{y_i}$ denotes the feature center of the sample data belonging to class $y_i$;
the joint loss function finally adopted consists of the cross-entropy loss function, the hard sample sampling triplet loss function and the center loss function, and the calculation process is shown in formula 10, formula 11 and formula 12:

$$L_{ML} = \lambda L_{GCLR} + (1-\lambda) L_{PGF} \tag{10}$$

$$L_{PGF} = L_{ID\_loss} \tag{11}$$

$$L_{GCLR} = L_{TriHard} + \alpha L_{ID\_loss} + \beta L_{Center\_loss} \tag{12}$$

wherein $L_{ML}$ denotes the joint loss function, $L_{PGF}$ denotes the cross-entropy loss applied to the pose-guided features, $L_{GCLR}$ denotes the three losses applied to the global contrast and local relation features, $L_{TriHard}$ denotes the hard sample sampling triplet loss function, $L_{ID\_loss}$ denotes the cross-entropy loss function, $L_{Center\_loss}$ denotes the center loss function, and $\alpha$, $\beta$ and $\lambda$ denote weighting coefficients that balance the respective loss terms.
The experiments were implemented on an NVIDIA TITAN V GPU using the PyTorch deep learning framework. ResNet50 with the average pooling layer and the fully connected layer removed is adopted as the backbone network, and the convolution stride of the last stage is set to 1. During training, the input images are augmented by random flipping and random erasing, the image size is adjusted to 384×128, the total number of epochs is set to 60, and the batch size is set to 16. The initial learning rate is set to 0.01 and is automatically multiplied by 0.1 every 20 epochs; the model is optimized by stochastic gradient descent, with the weight decay parameter set to $5\times10^{-4}$ and the momentum set to 0.9; the three weighting coefficients $\alpha$, $\beta$ and $\lambda$ are set to 2, $5\times10^{-4}$ and 0.2 respectively. PyTorch is a Python-based scientific computing package that provides two high-level features: 1. tensor computation with powerful GPU acceleration (similar to NumPy); 2. deep neural networks built on an automatic differentiation system. ResNet is an abbreviation of Residual Network, a family of networks widely used for object classification and as the backbone of classical computer vision models, typical examples being ResNet50 and ResNet101.
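The following configuration sketch is consistent with the settings reported above; it uses the stock torchvision ResNet50 and standard torchvision transforms as assumptions and omits the re-identification branches and the loss wiring.

```python
import torch
import torchvision
from torch import nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR
from torchvision import transforms

# Backbone: ResNet50 without its classification head, last-stage stride set to 1.
backbone = torchvision.models.resnet50()
backbone.layer4[0].conv2.stride = (1, 1)
backbone.layer4[0].downsample[0].stride = (1, 1)
backbone.fc = nn.Identity()

# Optimizer and schedule reported above: SGD, lr 0.01, x0.1 every 20 of 60 epochs.
optimizer = SGD(backbone.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
scheduler = StepLR(optimizer, step_size=20, gamma=0.1)

# Data augmentation reported above: resize to 384x128, random flip, random erasing.
train_transform = transforms.Compose([
    transforms.Resize((384, 128)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.RandomErasing(),
])
```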
The method trains and tests on Occluded-DukeMTMC data set. The Occluded-DukeMTMC data set is a large-scale data set specially constructed by Miao et al for the pedestrian re-identification problem in an occlusion scene, wherein a training set comprises 702 entities and 15618 images, and a test set comprises 1110 entities, 2210 images of an Occluded pedestrian to be queried and 17661 images of a candidate pedestrian.
The evaluation indexes adopted by the invention are CMC and mAP, which are commonly used in retrieval tasks. CMC refers to the cumulative matching characteristic curve, whose abscissa is k and whose ordinate is Rank-k, representing the probability that the target to be queried appears among the top k matching results. mAP refers to the mean average precision, with a value range of [0, 1]; compared with the CMC index, it better reflects the overall performance of the pedestrian re-identification model. The larger the mAP, the more positive samples the model retrieves and the higher they are ranked, indicating better model performance.
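A simplified NumPy sketch of the CMC and mAP computation is given below; the full Occluded-DukeMTMC protocol additionally filters gallery images that share the identity and camera of the query, which is omitted here for brevity.

```python
import numpy as np

def cmc_and_map(dist, query_ids, gallery_ids, max_rank=10):
    """Compute Rank-k accuracies (CMC) and mean average precision from a
    query-by-gallery distance matrix."""
    cmc = np.zeros(max_rank)
    aps, valid = [], 0
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                         # gallery sorted by ascending distance
        matches = (gallery_ids[order] == query_ids[i]).astype(np.int32)
        if matches.sum() == 0:
            continue                                        # query with no positive sample is skipped
        valid += 1
        first_hit = int(np.argmax(matches))                 # rank of the first correct match
        if first_hit < max_rank:
            cmc[first_hit:] += 1
        hits = np.cumsum(matches)
        precision = hits / (np.arange(len(matches)) + 1)    # precision at every position
        aps.append((precision * matches).sum() / matches.sum())
    return cmc / valid, float(np.mean(aps))
```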
TABLE 1 Performance of pedestrian re-identification method on Occluded-DukeMTMC dataset based on feature correlation improvement
An ablation experiment is conducted on the pedestrian re-identification method improved by feature association, where PGFA denotes the baseline method, OURS_IDloss(R) denotes introducing only the One-vs-rest relation module, OURS_IDloss(G) denotes introducing only the global contrast pooling module, and OURS_IDloss denotes introducing the global contrast pooling module and the One-vs-rest relation module simultaneously. As can be seen from Table 1, Rank-1 and mAP of the method that introduces only the One-vs-rest relation module are improved by 2.7% and 2.2% respectively; Rank-1 and mAP of the method that introduces only the global contrast pooling module are not improved; and the method that introduces both the global contrast pooling module and the One-vs-rest relation module achieves higher re-identification accuracy, with Rank-1 and mAP improved by 3.1% and 2.7% respectively.
FIG. 7 shows the cumulative matching curve of the pedestrian re-identification method improved by feature association on the Occluded-DukeMTMC dataset. It intuitively reflects that introducing the global contrast pooling module and the One-vs-rest relation module at the same time enables the model to extract more discriminative and fine-grained pedestrian features and achieve higher re-identification accuracy.
TABLE 2 Performance of pedestrian re-identification method on Occluded-DukeMTMC dataset based on feature association and multiple loss fusion
An ablation experiment is conducted on the pedestrian re-identification method based on feature association and multi-loss fusion, where PGFA denotes the baseline method, OURS_IDloss denotes performing metric learning with only the cross-entropy loss function, OURS_IDloss+TriHard denotes performing metric learning with the cross-entropy loss function and the hard sample sampling triplet loss function simultaneously, and OURS_IDloss+TriHard+Centerloss denotes performing metric learning with the three loss functions of cross-entropy loss, hard sample sampling triplet loss and center loss simultaneously. As can be seen from Table 2, Rank-1 and mAP of the method using cross-entropy loss and hard sample sampling triplet loss are improved by 2.1% and 3.3% respectively; Rank-1 and mAP of the method using cross-entropy loss, hard sample sampling triplet loss and center loss simultaneously are improved by 3.5% and 4.2% respectively.
FIG. 8 shows the cumulative matching curve of the pedestrian re-identification method based on feature association and multi-loss fusion on the Occluded-DukeMTMC dataset. It intuitively reflects that supervising the training of the model with the three loss functions of cross-entropy loss, hard sample sampling triplet loss and center loss simultaneously achieves higher re-identification accuracy.
As shown in fig. 9, a pedestrian image is randomly selected as the target to be queried, the improved methods are tested, and the visual ranking results are returned. Query denotes the target to be queried, and the images marked with T/F are the query results returned from the candidate gallery: T indicates that the query result and the target to be queried belong to the same entity, and F indicates that they do not. It can be seen that the improved pedestrian re-identification methods return better ranking results. PGFA (Pose-Guided Feature Alignment) denotes the baseline method, OURS_IDloss(R) denotes introducing only the One-vs-rest relation module, OURS_IDloss(G) denotes introducing only the global contrast pooling module, OURS_IDloss denotes introducing the global contrast pooling module and the One-vs-rest relation module simultaneously, OURS_IDloss+TriHard denotes adopting the cross-entropy loss and the hard sample sampling triplet loss while introducing the global contrast pooling module and the One-vs-rest relation module, and OURS_IDloss+TriHard+Centerloss denotes adopting the cross-entropy loss, the hard sample sampling triplet loss and the center loss while introducing the global contrast pooling module and the One-vs-rest relation module. As can be seen from fig. 9, the invention is effective and advanced in solving the occlusion problem in pedestrian re-identification and can achieve higher re-identification accuracy.
The above embodiments are not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make variations, modifications, additions or substitutions within the technical scope of the present invention.

Claims (5)

1. A pedestrian re-identification method facing an occlusion scene is characterized in that: the method comprises the following steps:
step S1: generating human-body key points by means of a pre-trained pose estimator and combining them with the global features, so that the model pays more attention to the non-occluded pedestrian regions and the noise interference caused by the occluded regions is reduced;
step S2: introducing a global contrast pooling module, so that the noise interference caused by background clutter and occlusion is reduced and the information of the whole pedestrian body region is better expressed;
step S3: performing deeper feature extraction on the local block features through a One-vs-rest relation module, so that the features at each local level contain the information of the corresponding part and of the other body parts, and the relationships among body parts are better reflected;
step S4: supervising the model with multi-loss joint training, so that the model ensures the accuracy of the predicted labels while taking inter-class dispersion and intra-class compactness into account.
2. The occlusion scene-oriented pedestrian re-identification method according to claim 1, wherein: the step S1 specifically includes the following steps:
step S11: inputting a pedestrian picture, namely the original picture, and generating the human-body key points $M_i$, the coordinates $(x_i, y_i)$ of each key point and the corresponding confidence $c_i$ through a pre-trained pose estimator;
step S12: removing key points with low confidence through a filtering mechanism, namely keeping the coordinates of key points whose confidence $c_i$ is greater than the threshold $\theta$ and removing the coordinates of key points whose confidence $c_i$ is smaller than the threshold $\theta$, the filtering mechanism being shown in formula 1:

$$M_i = \begin{cases} (x_i,\, y_i), & c_i > \theta \\ 0, & c_i \le \theta \end{cases}, \qquad i = 1, 2, \dots, N \tag{1}$$

wherein $M_i$, $(x_i, y_i)$ and $c_i$ respectively denote the i-th key point, its coordinates and its confidence, $\theta$ denotes the threshold of the filtering mechanism, and $N$ denotes the number of key points;
step S13: mapping the key points onto the original picture to generate heat maps, down-sampling them by bilinear interpolation, and combining them with the global features to form the pose-guided features;
step S14: concatenating the pose-guided features obtained through average pooling and maximum pooling with the average-pooled global features from the local feature branch, and reducing the dimension through a fully connected layer;
step S15: performing label prediction on the extracted global features through a fully connected layer and a softmax layer.
3. The occlusion scene-oriented pedestrian re-identification method according to claim 1, wherein: the step S2 specifically includes the following steps:
step S21: the local block features $P_1 \sim P_6$ are passed through global maximum pooling and global average pooling to obtain $P_{max}$ and $P_{avg}$, and the two are subtracted to obtain the contrast feature $P_{cont}$, the calculation process being shown in formula 2:

$$P_{cont} = P_{max} - P_{avg} \tag{2}$$

step S22: $P_{max}$ and $P_{cont}$ are each reduced in dimension through a sub-network consisting of a 1×1 convolution layer, a BN normalization layer and a ReLU activation function layer to obtain the features $\tilde{P}_{max}$ and $\tilde{P}_{cont}$; after concatenation, the result is fed to the sub-network again to be reduced from 2c dimensions to c dimensions, and finally added to $\tilde{P}_{max}$ to obtain the representative global feature $Q_0$, the specific implementation process being shown in formula 3:

$$Q_0 = CBR\big(Concat(\tilde{P}_{max},\, \tilde{P}_{cont})\big) + \tilde{P}_{max}, \qquad \tilde{P}_{max} = CBR(P_{max}),\ \tilde{P}_{cont} = CBR(P_{cont}) \tag{3}$$

wherein $Q_0$ denotes the global contrast feature, CBR denotes the sub-network consisting of the 1×1 convolution layer, the BN normalization layer and the ReLU activation function layer, and Concat(·) denotes the concatenation operation.
4. The occlusion scene-oriented pedestrian re-identification method according to claim 1, wherein: the step S3 specifically includes the following steps:
step S31: taking $P_i$ as an example, global average pooling is performed on the local features $P_j$ other than $P_i$ to obtain $R_i$, the calculation process being shown in formula 4:

$$R_i = \frac{1}{5}\sum_{\substack{j=1 \\ j \ne i}}^{6} P_j \tag{4}$$

wherein $P_j$ denotes the local block features other than the i-th one;
step S32: $P_i$ and $R_i$ are each reduced in dimension through a sub-network consisting of a 1×1 convolution layer, a BN normalization layer and a ReLU activation function layer to obtain the features $\tilde{P}_i$ and $\tilde{R}_i$; after concatenation, the result is fed to the sub-network again to be reduced from 2c dimensions to c dimensions, and finally integrated with $\tilde{P}_i$ to obtain the local feature $Q_i$ with global ties, the specific implementation process being shown in formula 5:

$$Q_i = CBR\big(Concat(\tilde{P}_i,\, \tilde{R}_i)\big) + \tilde{P}_i, \qquad \tilde{P}_i = CBR(P_i),\ \tilde{R}_i = CBR(R_i) \tag{5}$$

wherein $Q_i$ denotes the local feature with global ties, CBR denotes the sub-network consisting of the 1×1 convolution layer, the BN normalization layer and the ReLU activation function layer, and Concat(·) denotes the concatenation operation.
5. The occlusion scene-oriented pedestrian re-identification method according to claim 1, wherein: the step S4 specifically includes the following steps:
step S41: the pose-guided features are constrained with a cross-entropy loss function, i.e. the difference between the predicted label and the ground-truth label is calculated, the calculation process being shown in formula 6 and formula 7:

$$L_{ID\_loss} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{K} \mathbb{1}\{y_n = c\}\,\log \hat{y}_{n,c} \tag{6}$$

$$\hat{y}_{n,c} = \frac{\exp(\tilde{q}_{n,c})}{\sum_{k=1}^{K}\exp(\tilde{q}_{n,k})} \tag{7}$$

wherein $L_{ID\_loss}$ denotes the cross-entropy loss function, $K$ denotes the number of classes, $c$ denotes a class, $\tilde{q}_{n,c}$ denotes the output value of feature $q_n$ for class $c$ after the fully connected classification layer, $N$ and $y_n$ respectively denote the number of input images in a mini-batch and the ground-truth label, and $\hat{y}_{n,c}$ denotes the predicted label of feature $q_n$;
step S42: constraining the global contrast features and the local relation features with three loss functions, namely the cross-entropy loss, the hard sample sampling triplet loss and the center loss;
the calculation process of the hard sample sampling triplet loss function is shown in formula 8:

$$L_{TriHard} = \frac{1}{N_K N_M}\sum_{k=1}^{N_K}\sum_{m=1}^{N_M}\left[\alpha + \max_{n=1,\dots,N_M} d\!\left(x^{a}_{k,m},\, x^{p}_{k,n}\right) - \min_{\substack{l=1,\dots,N_K,\ l\ne k \\ n=1,\dots,N_M}} d\!\left(x^{a}_{k,m},\, x^{n}_{l,n}\right)\right]_{+} \tag{8}$$

wherein $N_K$ denotes the number of entities in a mini-batch, $N_M$ denotes the number of images of each entity, $\alpha$ denotes a threshold parameter that controls the distance between positive and negative sample pairs in the feature space, $x^{a}_{k,m}$, $x^{p}_{k,n}$ and $x^{n}_{l,n}$ respectively denote the anchor picture, the positive sample and the negative sample, $k$ and $l$ denote entity indices, $m$ and $n$ denote image indices, and $d(\cdot,\cdot)$ denotes the distance in the feature space;
the calculation process of the center loss function is shown in formula 9:

$$L_{Center\_loss} = \frac{1}{2}\sum_{i=1}^{m}\left\| f_i - c_{y_i} \right\|_2^2 \tag{9}$$

wherein $m$ denotes the batch size, $f_i$ and $y_i$ respectively denote the feature and the label of the i-th picture in the batch, and $c_{y_i}$ denotes the feature center of the sample data belonging to class $y_i$;
step S43: combining the loss functions of steps S41 and S42 to jointly supervise the training of the model, so that the inter-class distance and the intra-class distance of the samples are constrained while the classification error is reduced, the network learns more discriminative features, and the generalization ability of the model is improved; the calculation process is shown in formula 10, formula 11 and formula 12:

$$L_{ML} = \lambda L_{GCLR} + (1-\lambda) L_{PGF} \tag{10}$$

$$L_{PGF} = L_{ID\_loss} \tag{11}$$

$$L_{GCLR} = L_{TriHard} + \alpha L_{ID\_loss} + \beta L_{Center\_loss} \tag{12}$$

wherein $L_{ML}$ denotes the joint loss function, $L_{PGF}$ denotes the cross-entropy loss applied to the pose-guided features, $L_{GCLR}$ denotes the three losses applied to the global contrast and local relation features, $L_{TriHard}$ denotes the hard sample sampling triplet loss function, $L_{ID\_loss}$ denotes the cross-entropy loss function, $L_{Center\_loss}$ denotes the center loss function, and $\alpha$, $\beta$ and $\lambda$ denote weighting coefficients that balance the respective loss terms.
CN202111484998.3A 2021-12-07 2021-12-07 Pedestrian re-identification method oriented to occlusion scene Pending CN114022686A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111484998.3A CN114022686A (en) 2021-12-07 2021-12-07 Pedestrian re-identification method oriented to occlusion scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111484998.3A CN114022686A (en) 2021-12-07 2021-12-07 Pedestrian re-identification method oriented to occlusion scene

Publications (1)

Publication Number Publication Date
CN114022686A true CN114022686A (en) 2022-02-08

Family

ID=80067980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111484998.3A Pending CN114022686A (en) 2021-12-07 2021-12-07 Pedestrian re-identification method oriented to occlusion scene

Country Status (1)

Country Link
CN (1) CN114022686A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956560A (en) * 2016-05-06 2016-09-21 电子科技大学 Vehicle model identification method based on pooling multi-scale depth convolution characteristics
KR20200121206A (en) * 2019-04-15 2020-10-23 계명대학교 산학협력단 Teacher-student framework for light weighted ensemble classifier combined with deep network and random forest and the classification method based on thereof
CN111339903A (en) * 2020-02-21 2020-06-26 河北工业大学 Multi-person human body posture estimation method
CN113469006A (en) * 2021-06-24 2021-10-01 浙江华巽科技有限公司 Pedestrian attribute identification method based on graph convolution
CN113408492A (en) * 2021-07-23 2021-09-17 四川大学 Pedestrian re-identification method based on global-local feature dynamic alignment
CN113642515A (en) * 2021-08-30 2021-11-12 北京航空航天大学 Pedestrian recognition method and device based on attitude association, electronic equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孙义博 (Sun Yibo), 王蓉 (Wang Rong): "基于特征关联和多损失融合的行人再识别方法" [Pedestrian re-identification method based on feature association and multi-loss fusion], 《中国科技论文》 [China Sciencepaper], 23 March 2022 (2022-03-23) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824695A (en) * 2023-06-07 2023-09-29 南通大学 Pedestrian re-identification non-local defense method based on feature denoising

Similar Documents

Publication Publication Date Title
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN112200111B (en) Global and local feature fused occlusion robust pedestrian re-identification method
CN110532920B (en) Face recognition method for small-quantity data set based on FaceNet method
CN106846361B (en) Target tracking method and device based on intuitive fuzzy random forest
CN111709313B (en) Pedestrian re-identification method based on local and channel combination characteristics
CN110070066A (en) A kind of video pedestrian based on posture key frame recognition methods and system again
CN110728694B (en) Long-time visual target tracking method based on continuous learning
CN110163117B (en) Pedestrian re-identification method based on self-excitation discriminant feature learning
CN113239907B (en) Face recognition detection method and device, electronic equipment and storage medium
CN112801051A (en) Method for re-identifying blocked pedestrians based on multitask learning
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN112668544A (en) Pedestrian re-identification method based on hard sample confusion and feature robustness enhancement
CN113343247A (en) Biological characteristic identification counterattack sample attack safety evaluation method, system, device, processor and computer readable storage medium thereof
CN117935299A (en) Pedestrian re-recognition model based on multi-order characteristic branches and local attention
CN112507924A (en) 3D gesture recognition method, device and system
Kadim et al. Deep-learning based single object tracker for night surveillance.
CN115223239A (en) Gesture recognition method and system, computer equipment and readable storage medium
CN115376159A (en) Cross-appearance pedestrian re-recognition method based on multi-mode information
Li et al. Egocentric action recognition by automatic relation modeling
CN114022686A (en) Pedestrian re-identification method oriented to occlusion scene
CN113378620B (en) Cross-camera pedestrian re-identification method in surveillance video noise environment
CN108257148B (en) Target suggestion window generation method of specific object and application of target suggestion window generation method in target tracking
Budiarsa et al. Face recognition for occluded face with mask region convolutional neural network and fully convolutional network: a literature review
CN113298037B (en) Vehicle weight recognition method based on capsule network
CN113032612B (en) Construction method of multi-target image retrieval model, retrieval method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination