CN112101150B - Multi-feature fusion pedestrian re-identification method based on orientation constraint - Google Patents

Multi-feature fusion pedestrian re-identification method based on orientation constraint

Info

Publication number
CN112101150B
CN112101150B
Authority
CN
China
Prior art keywords
orientation
pedestrian
samples
network
different
Prior art date
Legal status
Active
Application number
CN202010901241.9A
Other languages
Chinese (zh)
Other versions
CN112101150A (en)
Inventor
艾明晶
单国志
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202010901241.9A priority Critical patent/CN112101150B/en
Publication of CN112101150A publication Critical patent/CN112101150A/en
Application granted granted Critical
Publication of CN112101150B publication Critical patent/CN112101150B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training
    • G06V40/25Recognition of walking or running movements, e.g. gait recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-feature fusion pedestrian re-identification method based on orientation constraint, and proposes a new network model that addresses factors such as orientation change and local occlusion. Existing research on pedestrian re-identification mostly ignores the influence of orientation differences; the proposed method preferentially retrieves targets with the same orientation as the query image while maintaining accuracy. First, a pedestrian orientation classifier is designed to label the orientation of the pedestrian in each picture; then the pictures are fed into a two-branch convolutional neural network that extracts global and local pedestrian features and performs constrained training, with one branch processing samples of the same orientation and the other processing samples of different orientations. A mixed loss function with orientation constraints is also designed, and the network weights are learned by combining three loss terms, effectively improving accuracy. Experiments show that the invention achieves rank-1 accuracies of 94.71% and 87.31% on the Market-1501 and DukeMTMC-ReID data sets, respectively, and its average level is superior to most existing methods.

Description

Multi-feature fusion pedestrian re-identification method based on orientation constraint
Technical Field
The invention relates to the fields of computer vision and image processing, in particular to a multi-feature fusion pedestrian re-identification method based on orientation constraint (figure 1). The method mainly overcomes the adverse effect of orientation differences on re-identification through a two-branch orientation-constraint network model, addresses local occlusion by fusing global and local features, and can preferentially retrieve pedestrian targets with the same orientation as the query image while maintaining re-identification accuracy, so that the retrieval results are more accurate and orderly and better match practical application scenarios.
Background
Pedestrian Re-identification (ReID) is a technique that uses computer vision to determine whether a specific pedestrian appears in an image or a video, and is a sub-problem of image retrieval. Because of its application value in video surveillance and security, it has become a research hotspot in recent years. In 2006, pedestrian re-identification was first separated from target tracking and studied as an independent vision topic. To date, research methods fall mainly into two categories: traditional methods based on hand-crafted features and deep learning methods based on neural networks. Before 2014, pedestrian re-identification mainly relied on traditional image processing to extract low-level color features, texture features and mid-level attribute features, but because these features are easily disturbed by the external environment and are not sufficiently discriminative, high accuracy could not be achieved.
In recent years, the wide application of deep learning in computer vision has brought breakthrough progress to this technology. However, it remains a very challenging problem owing to local occlusion, pose variation, orientation differences, illumination and resolution. According to research emphasis, deep-learning-based pedestrian re-identification methods can generally be divided into metric learning and feature extraction. Among them, approaches based on pedestrian orientation are most relevant to the work of the present invention; the related research background is described below.
(1) Methods based on metric learning
The goal of metric learning is to make the maximum distance between samples belonging to the same class smaller than the minimum distance between samples of different classes. In deep learning, the main focus of implementing metric learning is how to design the corresponding loss function. At the beginning of this research, pedestrian re-identification was simply treated as a classification problem: the pictures belonging to the same pedestrian are taken as one category, a fully connected layer is appended to the end of a Convolutional Neural Network (CNN), the outputs are converted into a probability distribution by a softmax function, and training is finally carried out with a cross-entropy loss.
With continued research, metric learning methods directly map the pictures of pedestrians into a high-dimensional space to form a clustering effect: pictures of the same pedestrian are regarded as a positive sample pair and pictures of different pedestrians as a negative sample pair, and the essence is to make the distance between positive sample pairs in the high-dimensional space smaller than the distance between negative sample pairs. Typical metric learning losses include the contrastive loss, triplet loss, quadruplet loss, and so on.
In 2015, related scholars proposed the triplet loss function in research on face recognition, which has become a typical metric loss; its schematic diagram is shown in fig. 2, where (Anchor, Positive) is a positive sample pair and (Anchor, Negative) is a negative sample pair. Through continuous iteration of the training process, the distance between positive pairs gradually decreases and the distance between negative pairs increases, achieving the clustering purpose, as shown in formula (1). (Reference 1: Schroff, Florian; Kalenichenko, Dmitry; Philbin, James. "FaceNet: A Unified Embedding for Face Recognition and Clustering," 2015.)
L_tri = [d(a, p) - d(a, n) + α]_+    (1)
d(x_i, x_j) = || x_i - x_j ||_2    (2)
[x]_+ = max(x, 0)    (3)
Here d denotes the distance between two feature vectors, for which the Euclidean distance is generally adopted, as shown in formula (2). [x]_+ denotes the larger of x and 0, as shown in formula (3). The reference sample (Anchor) is denoted by a, p denotes the positive sample (Positive), n denotes the negative sample (Negative), and α is the margin controlling the sample distances.
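As an illustration, a minimal PyTorch sketch of formulas (1)-(3) is given below; the function and parameter names are illustrative and assume the anchor, positive and negative features are already extracted as batches of vectors.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Triplet loss of formula (1): [d(a,p) - d(a,n) + alpha]_+ with Euclidean d."""
    d_ap = F.pairwise_distance(anchor, positive, p=2)   # d(a, p), formula (2)
    d_an = F.pairwise_distance(anchor, negative, p=2)   # d(a, n)
    # [x]_+ = max(x, 0), formula (3), averaged over the batch
    return torch.clamp(d_ap - d_an + margin, min=0.0).mean()
```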
In 2016, Cheng et al. proposed an improved triplet loss function that additionally considers the absolute distance, which greatly improved re-identification performance. (Reference 2: D. Cheng, Y. H. Gong, S. P. Zhou, J. J. Wang and N. N. Zheng, "Person re-identification by multi-channel parts-based CNN with improved triplet loss function," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA: IEEE, pages 1335-1344, 2016.)
In 2017, Chen et al. proposed the quadruplet loss, which uses one more negative sample image than the triplet, namely: reference sample a, positive sample p, and negative samples n1 and n2. As shown in formula (4), the former term is called the strong push and the latter the weak push. By adding the weak push, the quadruplet directly considers the absolute distance between positive and negative samples, so that the model can learn better features. (Reference 3: W. H. Chen, X. T. Chen, J. G. Zhang and K. Q. Huang, "Beyond triplet loss: a deep quadruplet network for person re-identification," in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, USA: IEEE, 2017.)
L_q = [d(a, p) - d(a, n1) + α]_+ + [d(a, p) - d(n1, n2) + β]_+    (4)
Where n1 denotes the first negative sample, n2 denotes the second negative sample, and α and β are threshold parameters.
In 2017, Hermans et al. proposed a hard sample mining method that focuses on the relations between the input samples for pedestrian re-identification. The basic idea is that the selected sample pairs should be as hard as possible: within a training batch, the sample least similar to (farthest from) the reference sample anchor among the images of the same person is selected as the positive sample, and the sample most similar to (closest to) the anchor among the images of other persons is selected as the negative sample. The hard triplets obtained in this way markedly improve generalization. (Reference 4: A. Hermans, L. Beyer and B. Leibe, "In defense of the triplet loss for person re-identification," arXiv preprint arXiv:1703.07737, 2017.)
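A minimal sketch of this batch-hard mining idea is shown below, assuming a batch of already-extracted feature vectors and integer identity labels; the names are illustrative.

```python
import torch

def batch_hard_triplet_loss(features, labels, margin=0.3):
    """Batch-hard triplet loss in the style of Hermans et al.: for every sample,
    take its farthest positive and closest negative inside the batch (a sketch)."""
    dist = torch.cdist(features, features, p=2)             # pairwise Euclidean distances
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)     # positive mask (includes diagonal)
    # hardest positive: maximum distance among samples with the same identity
    d_ap = (dist * same_id.float()).max(dim=1).values
    # hardest negative: minimum distance among samples with a different identity
    d_an = dist.masked_fill(same_id, float('inf')).min(dim=1).values
    return torch.clamp(d_ap - d_an + margin, min=0.0).mean()
```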
In addition, Xiao et al. further proposed a margin sample mining loss that combines the advantages of the quadruplet loss and hard sample mining. Metric-learning methods built on the triplet loss have thus become the most widely used approach to similarity measurement in pedestrian re-identification, and fig. 3 summarizes the evolution of the metric losses.
(2) Methods based on feature representation
In general, the feature description of an image is divided into three levels: low-level color features, mid-level attribute features, and deep features. At present, the main approach is to extract deep features with deep neural networks. Initially, CNN-based methods mainly extracted global features from the whole pedestrian picture; as research progressed, it became widely recognized that global features alone cannot achieve a sufficient degree of discrimination, so methods based on semantic features and on local features have become current research hotspots.
Methods based on global features. In the early stage of research, some methods directly used classical models such as ResNet and GoogLeNet to extract global features from the whole pedestrian picture. For example, in 2017 Sun et al. proposed the SVDNet pedestrian re-identification network, which iteratively optimizes a converged network model by decomposing the fully connected layer weights with singular value decomposition. (Reference 5: Y. Sun, L. Zheng, W. Deng and S. Wang, "SVDNet for Pedestrian Retrieval," 2017 IEEE International Conference on Computer Vision (ICCV), Venice, 2017.) In addition, Luo et al. summarized a set of training tricks that form a strong baseline for deep person re-identification. (Reference 6: H. Luo, Y. Gu, X. Liao, S. Lai, and W. Jiang, "Bags of Tricks and A Strong Baseline for Deep Person Re-identification," arXiv preprint arXiv:1903.07071, 2019.)
Methods based on semantic features. These methods developed gradually along with research on human pose estimation; the main idea is to obtain local Regions of Interest (ROI) by locating skeleton key points or by semantic image segmentation, and to obtain a richer feature representation in combination with global features. In 2017, the Spindle Net method proposed by Zhao et al. was a representative semantic-feature study: it first extracts 14 human body key points with a pose detection model, then divides 7 ROI regions using these key points, which enter the same CNN as the original picture to extract features. (Reference 7: H. Y. Zhao, M. Q. Tian, S. Y. Sun, J. Shao, J. J. Yan and S. Yi et al., "Spindle Net: person re-identification with human body region guided feature decomposition and fusion," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, USA: IEEE, pages 907-915, 2017.) In addition, the GLAD (Global-Local-Alignment Descriptor) model proposed by Wei et al. is a comparatively classical semantic feature extraction method. (Reference 8: L. Wei, S. Zhang, H. Yao, W. Gao and Q. Tian, "GLAD: Global-Local-Alignment Descriptor for Scalable Person Re-Identification," in IEEE Transactions on Multimedia, vol. 21, no. 4.)
Methods based on local features. Although it is reasonable to extract local features according to semantic image division, it is not always necessary, especially since current human pose estimation is not ideal and erroneous pose estimates introduce errors. Therefore, much research now divides the pedestrian image horizontally or vertically and then aligns the parts with a certain strategy. In 2018, the PCB method proposed by Tsinghua University and the AlignedReID method proposed by Megvii were two typical methods. PCB divides the picture into 6 parts from top to bottom, obtains 6 local features by horizontal pooling, and then passes each feature through a fully connected layer and computes a cross-entropy loss for representation learning. (Reference 9: Y. Sun, et al., "Beyond part models: Person retrieval with refined part pooling," in Proceedings of the European Conference on Computer Vision (ECCV), 2018.) AlignedReID likewise horizontally divides the picture into 8 parts, extracts features with a CNN, and realizes local alignment using a shortest-path method. (Reference 10: X. Zhang, H. Luo, X. Fan, W. L. Xiao, Y. X. Sun and Q. Xiao et al., "AlignedReID: Surpassing human-level performance in person re-identification," arXiv preprint arXiv:1711.08184, 2017.)
(3) Methods based on orientation and viewpoint
Most current research focuses on the important factor of local occlusion, while factors such as orientation change receive relatively little attention, so the network model cannot adapt to complex orientation changes and may misjudge when the orientation difference is obvious. The related research on orientation change is analyzed below. In 2018, the DVAML method attempted to learn feature spaces between samples of the same orientation and of different orientations, but it did not achieve very high accuracy. (Reference 11: P. Chen, X. Xu and C. Deng, "Deep view-aware metric learning for person re-identification," in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 620-626, International Joint Conferences on Artificial Intelligence Organization, 2018.) Sun et al. generated a large virtual pedestrian data set, PersonX, using the Unity engine and quantitatively analyzed the impact of orientation on pedestrian re-identification, which greatly inspired the present invention. (Reference 12: X. Sun and L. Zheng, "Dissecting person re-identification from the viewpoint of viewpoint," arXiv preprint arXiv:1812.02162, 2018.)
In summary, most existing pedestrian re-identification techniques solve the common occlusion problem by fusing global and local features with inter-class constrained training, but there is relatively little research on orientation differences; in particular, a network may misjudge when samples of the same person in different orientations differ markedly, or when samples of different pedestrians in the same orientation are very similar. The present invention mainly solves this problem: it improves identification accuracy and can preferentially retrieve samples with the same orientation as the query image.
Disclosure of Invention
The invention aims to overcome the influence of factors such as pedestrian orientation differences and local occlusion on ReID, and to improve the accuracy and reliability of pedestrian re-identification through the proposed pedestrian orientation classification algorithm and pedestrian re-identification network model, thereby providing a research basis for target tracking and other computer vision tasks.
The invention mainly studies the pedestrian re-identification problem from the angle of orientation differences. As shown in fig. 4, from the orientation perspective, one of the main factors affecting ReID is that samples of the same person in different orientations have lower similarity (e.g., a and b in fig. 4 are images of the same pedestrian in different orientations), while different pedestrians with the same orientation are sometimes very similar (e.g., c, d, e, f in fig. 4 are images of different pedestrians in the same orientation). In terms of feature similarity, the distance between images of the same person should be smaller than the distance between images of different persons. Meanwhile, for the same person, the distance between images with the same orientation should be smaller than the distance between images with different orientations, which is reasonable because images of the same person in the same orientation should have the highest similarity.
Based on these considerations, the invention proposes an effective pedestrian orientation classification method for judging the orientation of the pedestrian in a picture, and on this basis proposes a multi-feature fusion pedestrian re-identification network model based on orientation constraint. The model comprises two different branches that respectively process samples with the same orientation and with different orientations; each branch fuses global and local features to represent the pedestrian, and finally three loss functions are combined for constrained training. In this way, accurate pedestrian re-identification is realized and targets with the same orientation as the query picture can be retrieved preferentially, which has great practical value.
As shown in fig. 5, a represents the original pedestrian sample data, where different colors represent different pedestrians and different shapes represent different body orientations. The pedestrian re-identification network maps each pedestrian sample to a high-dimensional space; the result obtained without considering the orientation factor is represented by graph b, where only the inter-class distance is considered, while the result of the invention is shown in graph c: different pedestrians are distinguished and, at the same time, orientation-level clusters are formed, so that the model can preferentially identify pedestrians with the same orientation.
Next, the main content of the present invention will be described in detail, which specifically includes the following steps:
the method comprises the following steps: design pedestrian orientation classifier based on multi-feature fusion
Orientation information (front, back, left and right) is an inherent attribute of a pedestrian image and greatly influences the discriminative ability of a re-identification network, but existing ReID data sets were not annotated with this attribute when collected. Therefore, the invention first designs a pedestrian orientation classifier based on multi-feature fusion for accurately determining the orientation of the pedestrian in a picture.
Orientation classification is actually a multi-class task. To improve classification accuracy, the invention designs the orientation classification network model shown in fig. 6. As shown in the figure, for a pedestrian image, 18 joint key points of the pedestrian are first extracted with the PAFs method (the OpenPose human pose key point extraction network); these 18 key points roughly describe the contour of the pedestrian and also give the precise coordinate position of each key point in the image. Secondly, by dividing these coordinates transversely, the whole pedestrian image can be split into three body parts: head, upper body and lower body.
The entire pedestrian image and the three body parts form the input to a convolutional neural network, which extracts features of the four image parts using convolution modules (here, the ResNet50 network) and then concatenates the resulting feature vectors into a combined vector for the final pedestrian representation. Finally, a fully connected layer is added at the end of the network, four-way classification is performed with a softmax loss function, and continuous training and iteration yield the final classification result.
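A minimal sketch of this classifier is given below, assuming the head, upper-body and lower-body crops have already been produced from the pose key points; whether the four parts share one ResNet-50 trunk is an assumption, and all names are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

class OrientationClassifier(nn.Module):
    """Sketch of the multi-feature orientation classifier: the whole image and the
    head / upper-body / lower-body crops each pass through a ResNet-50 trunk; the
    four feature vectors are concatenated and fed to a fully connected layer for
    the four-way orientation softmax."""
    def __init__(self, num_orientations=4):
        super().__init__()
        trunk = models.resnet50(weights=None)
        trunk.fc = nn.Identity()                 # keep the 2048-d pooled feature
        self.trunk = trunk                       # assumed shared across the four parts
        self.classifier = nn.Linear(2048 * 4, num_orientations)

    def forward(self, whole, head, upper, lower):
        feats = [self.trunk(x) for x in (whole, head, upper, lower)]
        combined = torch.cat(feats, dim=1)       # fused global + local representation
        return self.classifier(combined)         # logits for front / back / left / right

# training step (illustrative): nn.CrossEntropyLoss()(model(whole, head, upper, lower), labels)
```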
The advantage of this method is that the global feature is fused with the three local body features, which strengthens the robustness of the features; in particular, local features of the head and feet can better distinguish orientation differences in certain cases. The network is trained with the RAP data set, because the RAP data set is already labeled with orientation, which reduces the labeling cost. It should be noted that the classifier is pre-trained with this training set, and the weights of this network are not changed during the implementation of the subsequent steps.
Finally, the orientation classification network is used to label the orientation information of the two large-scale pedestrian re-identification data sets Market-1501 and DukeMTMC-ReID, providing a foundation for the orientation constraints in subsequent pedestrian re-identification. Experiments show that the method performs well on pedestrian orientation recognition and is superior to most methods.
Step two: sampling difficult samples based on pedestrian orientation, selecting triplets for training
As shown in formula (1), the triplet loss is the most widely used metric, but the training process depends to a great extent on how the triplet samples are selected, and overly simple triplets are not conducive to learning image features. Practice has shown that hard sample mining within a training batch is a relatively effective triplet selection strategy: for each training batch, P pedestrians are randomly selected and K different pictures are randomly chosen from each pedestrian's images, i.e., one batch contains P × K images; then, for each picture, the hardest positive sample (the least similar) and the hardest negative sample (the most similar) are selected to form a triplet.
The second step of the invention is to add the consideration of the orientation of the pedestrian to the widely used sampling strategy of the difficult sample, and provide a sampling strategy of the difficult sample based on the orientation of the pedestrian for selecting the training triples. This is based on the simple assumption that the distance between differently oriented samples of the same person is greater than the distance between the same oriented samples, i.e. samples of the same person in the same direction should be more similar.
Specifically, on the basis of batch-wise hard sample mining, P pedestrians are still randomly selected for each training batch, but the K pictures of each pedestrian are not chosen purely at random: it is ensured that among the K pictures there are both samples with the same orientation and samples with different orientations. For example, with K = 4 in the experiments, when selecting samples for each pedestrian, two different orientations are first chosen (the orientations are divided into four directions: front, back, left and right), and then two samples of that pedestrian are selected from each orientation, so that the training batch contains samples of different orientations.
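A minimal sketch of this orientation-aware P × K sampling is shown below; the index structure, the fallback to sampling with replacement when an orientation has too few images, and all names are assumptions for illustration.

```python
import random

def sample_orientation_batch(index, P=32, K=4):
    """Orientation-aware P*K batch sampling (a sketch): `index` is assumed to map
    pid -> {orientation: [image paths]}.  For each chosen pedestrian, two distinct
    orientations are picked and K//2 images are drawn from each, so every identity
    contributes both same-orientation and different-orientation pairs."""
    pids = random.sample(list(index.keys()), P)
    batch = []
    for pid in pids:
        orients = random.sample(list(index[pid].keys()), 2)   # two distinct orientations
        for o in orients:
            imgs = index[pid][o]
            k = K // 2
            picks = random.sample(imgs, k) if len(imgs) >= k else random.choices(imgs, k=k)
            batch.extend((img, pid, o) for img in picks)
    return batch   # P*K tuples of (image, identity, orientation label)
```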
The advantage of this orientation-based sample selection strategy is that, when positive and negative samples are mined for each pedestrian, it is more likely that a less similar positive sample (i.e., a sample with a different orientation) is found. To verify the correctness of this strategy, the invention performed experimental verification on the Market-1501 data set on top of the baseline. The results show that, with all other strategies unchanged, merely changing the sample selection strategy improves the two performance indexes mAP and rank-1 by about 0.7%. The verification of this strategy also lays the foundation for building the orientation-constrained re-identification network model.
Step three: multi-feature pedestrian re-identification network design based on orientation constraint
On the basis of the orientation judgment and the sampling strategy verification of the first two steps, the third step of the method designs a specific pedestrian re-identification network model. The method is divided into the following three aspects:
3.1 design of network architecture
The overall network structure of the invention is shown in fig. 1. The main purpose of the network is to overcome the influence of orientation differences on re-identification, that is, to make similar samples that belong to the same orientation but different classes more separable, and to pull samples that belong to different orientations but the same class closer together. The network is a two-branch structure that maps a sample into two different feature spaces simultaneously, with each feature space corresponding to one network branch. Each branch is designed with a different mixed loss function, so that the first branch (called the same-orientation branch) focuses mainly on samples with the same orientation, while the second branch (called the different-orientation branch) better adapts to the variation of samples with different orientations.
First, based on the pedestrian-orientation-based triplet sampling method proposed in step two, the network selects a batch of input images (N images, where N = P × K). Then, the trained orientation classifier is used to judge the orientation of the pedestrian in each picture to obtain the corresponding orientation label. Thus, each pedestrian picture can be represented by the triplet (I, Y, O), where I denotes the image, Y denotes the ID of the pedestrian (i.e., which pedestrian it belongs to), and O denotes the orientation label of the pedestrian.
As shown in fig. 1, for the input images of the batch, the network first extracts their simple features through a shared convolution network module; because these convolution modules belong to the lower layers, the shared module can extract common color, attribute and texture features. Then, two branch convolution networks are added after the shared convolution module to map the samples into two different high-dimensional subspaces. The invention uses ResNet50, currently the most common choice, as the backbone network; its structure has a clear hierarchy, so a shared layer and branch layers can be divided easily. After testing, the first layer is used as the shared module and the last three layers as the branches, and finally the upper and lower networks each output an N × d feature vector. Thus, two different features are extracted for each picture, and 2N features are output in total.
The two feature spaces differ mainly in that the first branch selects only triplets in which the pedestrians have the same orientation, while the second branch selects only triplets in which the pedestrians have different orientations. Because of this difference in triplets, the upper and lower branches each have their own emphasis and can better adapt to variations caused by orientation differences. Different triplets represent different training strategies; the feature fusion strategy used in the network and the training strategy of each branch are introduced below.
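A minimal PyTorch sketch of this shared-plus-two-branch backbone is given below; the stride modification and BN-neck mentioned in the experimental details are omitted, and the class and attribute names are illustrative assumptions.

```python
import torch.nn as nn
from torchvision import models

class TwoBranchOrientationNet(nn.Module):
    """Sketch of the two-branch orientation-constraint backbone: the stem and
    layer1 of ResNet-50 are shared, while layers 2-4 are duplicated so that the
    same batch is mapped into two feature spaces (same-orientation branch and
    different-orientation branch)."""
    def __init__(self):
        super().__init__()
        stem = models.resnet50(weights=None)
        self.shared = nn.Sequential(stem.conv1, stem.bn1, stem.relu,
                                    stem.maxpool, stem.layer1)   # shared low-level module
        self.branch_same = self._branch()   # trained with same-orientation triplets
        self.branch_diff = self._branch()   # trained with different-orientation triplets

    @staticmethod
    def _branch():
        r = models.resnet50(weights=None)
        return nn.Sequential(r.layer2, r.layer3, r.layer4,
                             nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, x):
        shared = self.shared(x)
        return self.branch_same(shared), self.branch_diff(shared)   # two N x 2048 outputs
```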
3.2 Multi-feature fusion strategy
The design of the network structure is mainly designed by considering orientation factors, which is the main innovation of the invention. In order to improve the distinguishing degree of the features, the invention also adopts a multi-feature fusion mode when each branch extracts the image features.
The combination of the global feature and the local feature can better describe the pedestrian feature of an image and can overcome the problem of occlusion of body parts to a certain extent. Therefore, when extracting image features, each branch of the network does not simply extract a global representation, but also adopts a mode of combining global features and local features. By introducing local features, the expression capacity of the features can be enhanced on one hand; on the other hand, especially in the same orientation branch, the negative sample pair (a pair of images not belonging to the same pedestrian) of the same orientation is likely to be very similar, and the local features can capture some detail differences, thereby better distinguishing the positive and negative samples.
In order to avoid complicated semantic division, the invention refers to the simple and effective horizontal partition method AlignedReID; as shown in fig. 7, AlignedReID horizontally divides the picture into 8 parts and then extracts features with a CNN. To solve the alignment problem between parts, for the 8 local features of two pictures AlignedReID computes the distance matrix between them and then finds a shortest path from the start point to the end point; the total distance of this shortest path is the final local distance between the two pictures. Similarly, in the present invention the body parts are obtained by horizontal division, and automatic alignment is realized by a shortest-path-based dynamic programming method. In the training stage of the network, the triplets are mined directly from the global features, but local-feature triplets are added in the upper and lower branches for auxiliary training; in the testing stage only the global features are used. This further improves the performance of the model.
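The shortest-path alignment can be sketched as below for two sets of stripe features; the exponential squashing follows the AlignedReID formulation, and the function name and the assumption that it is used for distance computation (rather than inside the differentiable training graph) are illustrative.

```python
import torch

def aligned_local_distance(local_a, local_b):
    """AlignedReID-style local distance (a sketch): `local_a`, `local_b` are (H, C)
    tensors of H horizontal-stripe features for two images.  The stripe-to-stripe
    distance matrix is traversed by dynamic programming from the top-left to the
    bottom-right corner; the shortest-path cost is the aligned local distance."""
    d = torch.cdist(local_a, local_b, p=2)            # H x H stripe distances
    d = (torch.exp(d) - 1.0) / (torch.exp(d) + 1.0)   # squash to [0, 1) as in AlignedReID
    H, W = d.shape
    cost = torch.zeros_like(d)
    for i in range(H):
        for j in range(W):
            if i == 0 and j == 0:
                cost[i, j] = d[i, j]
            elif i == 0:
                cost[i, j] = cost[i, j - 1] + d[i, j]
            elif j == 0:
                cost[i, j] = cost[i - 1, j] + d[i, j]
            else:
                cost[i, j] = torch.minimum(cost[i - 1, j], cost[i, j - 1]) + d[i, j]
    return cost[-1, -1]                                # total cost of the shortest path
```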
3.3 network training strategy
In the practice of convolutional neural networks, the key after the network structure is determined is the selection of a loss function, and the characteristics of the loss function are directly related to the learning effect of the network. The triple loss can consider the relative distance between a positive sample and a negative sample, the softmax classification loss can learn the distribution of the samples in the feature space, and the center loss can enable the samples of the same class to be close to the class center. Many studies show that the three kinds of loss joint training can achieve better effect, so that most of the current researches adopt a mixed training strategy.
The hybrid training strategy adopted in the present invention is described in detail below, and is composed of three parts, namely, same-orientation branches, different-orientation branches and cross constraints. This is one of the key factors that the present invention can achieve in addition to the network structure.
Same-orientation branch. For the same-orientation branch, only samples with the same orientation are selected from a batch to form triplets, that is, both the positive and the negative sample chosen for the current training sample have the same orientation as it (since each batch is selected according to the orientation-based sampling strategy described in step two, samples with the same orientation must exist). Triplets selected in this way have obvious advantages: while learning pedestrian identity information, the network also learns certain pedestrian orientation information, which reduces the complexity of re-identification to some extent, so that samples that belong to the same pedestrian and have the same orientation cluster more tightly; moreover, the apparent features of different pedestrians in the same orientation are often very similar, so the negative samples have a certain difficulty and are more representative. Here a denotes the anchor, p denotes a positive sample, n denotes a negative sample, s denotes the same orientation, and d denotes a different orientation, as shown in formula (5):
L_triSame = [d(a, ps) - d(a, ns) + α]_+    (5)
where ps denotes a sample of the same pedestrian in the same orientation, ns denotes a sample of a different pedestrian in the same orientation, and the remaining symbols have the same meaning as in formula (1).
Meanwhile, in order to better learn the feature distribution, this strategy adds a softmax classification loss and a center loss on top of the triplets. It should be noted that, since the same-orientation branch only considers samples of the same orientation, classification is not performed purely according to the pedestrian id; instead, the id and the orientation are regarded as a combined label, and each orientation of each person forms one category. For example, with M pedestrians and four orientations there are M × 4 different categories. Both the softmax loss and the center loss are computed at this combined classification level. As shown below, formula (6) is the softmax loss and formula (7) is the center loss.
L_ceSame = -(1/N) Σ_{i=1..N} log( exp(f_i(label_i)) / Σ_{k=1..M×T} exp(f_i(k)) )    (6)
L_center = (1/2) Σ_{i=1..N} || f_i - C_label_i ||_2^2    (7)
where N denotes the batch size, f_i denotes the feature vector of the i-th image, f_i(k) denotes the k-th dimension of the i-th feature vector, label_i denotes the combined label of id and orientation, M × T denotes the number of categories, which is also the length of the feature vector obtained after the fully connected layer, and C_label_i denotes the class center of class label_i.
The final total loss for the same orientation branch is made up of three parts (as shown in equation 8). It can be seen that the purpose of these three parts of lost training is consistent, and they together make the same-orientation samples with people form a good clustering effect in the feature space. In fact, even though this branch only considers samples of the same orientation, the rank-1 accuracy obtained on Market-1501 by performing experiments using this branch alone is already over 90%. Where λ represents a weighting factor for the center loss.
L_same = L_triSame + L_ceSame + λ L_center    (8)
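For illustration, a minimal sketch of how the three terms of formula (8) could be combined is given below; the class and function names are assumptions, the center loss uses a batch mean rather than the exact normalization of formula (7), and the triplet distances are assumed to be mined from same-orientation samples as described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Center loss in the spirit of formula (7): pulls every feature towards its
    class centre; `num_classes` would be M*T (identity x orientation) here."""
    def __init__(self, num_classes, feat_dim=2048):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):
        return 0.5 * (feats - self.centers[labels]).pow(2).sum(dim=1).mean()

def same_orientation_loss(feats, logits, combo_labels, d_ap_same, d_an_same,
                          center_loss, alpha=1.0, lam=5e-4):
    """Hybrid loss of formula (8): same-orientation triplet term (5), softmax loss
    over the combined id-and-orientation classes (6), and center loss (7)."""
    l_tri = torch.clamp(d_ap_same - d_an_same + alpha, min=0.0).mean()
    l_ce = F.cross_entropy(logits, combo_labels)
    return l_tri + l_ce + lam * center_loss(feats, combo_labels)
```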
Different-orientation branch. Clearly, the first branch ignores the relations between samples of different orientations, so the second branch is used to handle the training of differently oriented samples and thereby make up for the deficiency of the first branch. Because of the different orientations, even the same person often shows very different apparent features. Therefore, the triplets of this branch are constructed from samples of different orientations: when a triplet is selected, both the positive and the negative sample are images whose orientation differs from that of the training sample. The purpose is to let samples of different orientations receive more attention, so that the inter-class distance can be enlarged.
Similar to equation (5), the triplets of differently oriented branches are represented as equation (9):
L_triDiff = [d(a, pd) - d(a, nd) + β]_+    (9)
where pd represents a sample of different orientations of the same person, nd represents different pedestrians in different orientations, and β is the distance threshold.
In order to take into account the distribution and the intra-class distance of the samples, the softmax loss and the center loss are also used, but since the branch only considers samples with different orientations, the branch is classified only according to the id of the sample, and M persons have M categories, which are different from the branch with the same orientation. The formula (10) is a softmax loss formula, and the central loss is the same as the formula (7).
L_ceDiff = -(1/N) Σ_{i=1..N} log( exp(f_i(label_i)) / Σ_{k=1..M} exp(f_i(k)) )    (10)
Wherein M represents the number of classes of pedestrians, and the rest characters have the same meaning as formula (6).
Finally, the loss of the different-orientation branch is likewise the sum of three parts, as shown in formula (11), which compensates for the relations between differently oriented samples that the first branch does not consider.
L_diff = L_triDiff + L_ceDiff + λ L_center    (11)
Cross-constraint training. The first two branches respectively consider the relations between samples under orientation constraints. In order to take the overall distribution of the samples into account, the training strategy adds cross constraints between the branches, still mainly based on triplet losses, as in formulas (12) and (13).
L_cross = [d(a, pd) - d(a, ns) + θ]_+    (12)
Formula (12) is also a triplet loss, where θ is the distance margin. For a training sample a, the selected positive sample has a different orientation from a, while the negative sample has the same orientation as a. This selection ensures that the positive-sample distances obtained in the two branches are always smaller than the negative-sample distances, thereby organically combining the training of the two branches.
L_intra = [d(a, ps) - d(a, pd) + δ]_+    (13)
Formula (13) does not consider negative samples but only the relative relation between positive samples, so it is an intra-class constraint. A relatively small margin δ is chosen to ensure that within a class (i.e., among samples of the same pedestrian) the distance between samples of the same orientation is smaller than the distance between samples of different orientations. This is meaningful because, with such an intra-class constraint, samples with the same orientation are often retrieved first for a query image, which increases the probability that they belong to the same person.
In summary, when the network is trained by combining the above three losses in the training phase, the total loss function can be represented by equation (14), where μ is a weight parameter and can take a relatively small value.
L_Total = L_same + L_diff + L_cross + μ L_intra    (14)
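The cross constraint and intra-class constraint of formulas (12)-(14) can be sketched as follows; the distance arguments are assumed to be mined with the orientation rules described above, and the default margins mirror the values listed in the experimental details.

```python
import torch

def cross_constraint(d_a_pd, d_a_ns, theta=0.7):
    """Formula (12): positive sample with a different orientation, negative sample
    with the same orientation as the anchor."""
    return torch.clamp(d_a_pd - d_a_ns + theta, min=0.0).mean()

def intra_constraint(d_a_ps, d_a_pd, delta=0.001):
    """Formula (13): within one identity, same-orientation samples should sit
    slightly closer than different-orientation samples."""
    return torch.clamp(d_a_ps - d_a_pd + delta, min=0.0).mean()

# total objective of formula (14), with mu a small weight for the intra-class term:
#   L_total = L_same + L_diff + cross_constraint(...) + mu * intra_constraint(...)
```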
Since the distance between the features of two sample images represents their similarity, the test stage can be converted into distance calculation. Likewise, the orientation of each sample image is first judged by the designed orientation classifier; if two samples have the same orientation, the Euclidean distance is calculated with the features from the first branch, and if they have different orientations, the distance is calculated with the features from the second branch. Finally, the distances are fused into a distance matrix, and performance is tested on the test gallery based on this matrix.
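A minimal sketch of this test-time fusion is shown below; the feature matrices are assumed to be the branch outputs for the query and gallery sets, and the function name is illustrative.

```python
import torch

def fused_distance_matrix(q_same, q_diff, g_same, g_diff, q_orient, g_orient):
    """Test-stage distance fusion (a sketch): if a query/gallery pair has the same
    predicted orientation, its Euclidean distance comes from the same-orientation
    branch features, otherwise from the different-orientation branch features."""
    d_same = torch.cdist(q_same, g_same, p=2)     # distances in branch-1 feature space
    d_diff = torch.cdist(q_diff, g_diff, p=2)     # distances in branch-2 feature space
    same_orient = q_orient.unsqueeze(1) == g_orient.unsqueeze(0)   # Q x G orientation mask
    return torch.where(same_orient, d_same, d_diff)
```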
In general, the main contribution of the invention is the design of a pedestrian re-identification neural network model based on orientation constraint and multi-feature fusion, addressing a classical topic in the field of computer vision. Starting from the multiple factors that influence pedestrian re-identification accuracy, the method comprehensively considers pedestrian orientation differences, local occlusion and related problems, handles pedestrians of the same orientation and of different orientations separately, and overcomes the influence of orientation differences on re-identification to a certain extent; since the proposed orientation constraint scheme is not involved in other existing pedestrian re-identification methods, it has a certain degree of innovation.
In addition, in cooperation with the proposed pedestrian re-identification method, the invention also provides a sampling strategy for selecting the triple sample based on the pedestrian orientation information and a classifier design scheme for judging the pedestrian orientation, which are also one of the main contributions of the invention. Experimental results prove that the pedestrian re-identification scheme provided by the invention is superior to most of the existing methods in identification accuracy and is more applicable in practical scenes.
Drawings
Fig. 1 is an overall structure diagram of the orientation constraint re-identification network proposed by the present invention, which is described in detail in step three.
Fig. 2 is a schematic diagram of a triplet loss function.
Fig. 3 is a diagram illustrating the evolution of the metric loss function.
Fig. 4 is an example of pedestrian image contrast for different orientations.
Fig. 5 is a schematic diagram of the recognition effect of the network model of the present invention.
FIG. 6 is a proposed global and local feature based orientation classifier of the present invention.
FIG. 7 is a schematic diagram of an AlignedReiD method based on local feature alignment.
FIG. 8 is a schematic view of vector angles for orientation classification based on pose joint points.
Fig. 9 is an example of a data set sample used in the experiment.
FIG. 10 is an example of the results of an experiment of the present invention on a data set.
Fig. 11 is a schematic diagram of a distance curve obtained in the training process of the proposed network model.
Detailed Description
The technical scheme, the experimental method and the test result of the invention are further described in detail with reference to the accompanying drawings and specific experimental embodiments.
The invention relates to a pedestrian re-identification subject in the field of computer vision, and provides a multi-feature fusion pedestrian re-identification method based on orientation constraint.
The experimental procedure is specifically described below.
The method comprises the following steps: the data set is prepared (taking Market-1501 as an example), the orientation of each image in the data set is judged using the method based on the combination of global and local features (method 3), and orientation labels are annotated.
Step two: and (3) constructing a two-branch convolutional neural network, realizing a corresponding loss function, inputting a training set sample into the network for training, observing the training condition, and continuously iterating to obtain a training model.
Step three: testing is performed according to the training result; for each query image, the pedestrian images with the same id as the query image are retrieved from the gallery to form a result sequence, and the corresponding evaluation indexes are calculated at the same time.
The experimental conditions and conclusions of this patent are described in detail below.
(1) Experimental results of pedestrian orientation classifier
To test the accuracy of the orientation classifier proposed in the invention, comparative experiments were carried out on the RAP data set against two other methods.
The first method classifies based on the relative positions of pose joint points: the joint key points of each pedestrian image are first extracted with the PAFs method, then the left-shoulder and right-shoulder joint points are selected to form a left-to-right vector, and finally the clockwise angle between this vector and the vertical direction (from top to bottom) is computed. The pedestrian orientation can then be judged from the range of this angle (with 45 degrees as the classification interval), as shown in fig. 8.
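A minimal sketch of this geometric cue is given below; image coordinates (x to the right, y downward), the sign convention of the angle, and the fact that the interval-to-class mapping is left to the caller are all illustrative assumptions.

```python
import math

def shoulder_vector_angle(left_shoulder, right_shoulder):
    """Angle between the left-to-right shoulder vector and the downward vertical,
    in degrees in [0, 360).  The orientation class is then read off from
    45-degree intervals of this angle (mapping not reproduced here)."""
    vx = right_shoulder[0] - left_shoulder[0]
    vy = right_shoulder[1] - left_shoulder[1]
    return math.degrees(math.atan2(vx, vy)) % 360.0

# e.g. shoulder_vector_angle((52, 80), (96, 82)) is roughly 90 degrees (shoulders horizontal)
```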
The second method directly trains the convolutional neural network ResNet50 to realize four-way classification of the pedestrian images.
The third method is a classification method based on the fusion of global features and local features.
The classification accuracy and performance of the three pedestrian orientation classification methods are compared through experiments, the experimental results are shown in table 1, and obviously, the method provided by the invention has certain advantages in accuracy.
TABLE 1 comparison of Performance of three pedestrian orientation classifiers
Method Description of the method Accuracy (%)
Method 1 Classification based on the relative positions of pose joint points (mathematical method) 82.07
Method 2 Classification using a CNN based on the global feature of the pedestrian image 87.33
Method 3 Classification based on the combination of global and local features (the invention) 89.03
(2) Pedestrian re-identification data set and evaluation index
The test data sets and evaluation indexes used in the ReID experiments are introduced next. As shown in fig. 9, the proposed method was tested on the two large public data sets Market-1501 and DukeMTMC-ReID. Market-1501 includes 1501 pedestrians captured by 6 cameras and 32668 detected pedestrian bounding boxes; the training set contains 751 persons with 12,936 images, on average 17.2 training images per person, while the test set contains 750 persons with 19,732 images, on average 26.3 test images per person. DukeMTMC-ReID is the pedestrian re-identification subset of the DukeMTMC pedestrian tracking data set; it contains 36,411 pictures of 1404 pedestrians in total, of which 16,522 images of 702 pedestrians are used for training and the remaining images are used for testing.
In the pedestrian re-identification task, the testing process usually gives one (or a group of) image(s) to be queried (query), then calculates the similarity between the query and the images in a candidate set (gallery) according to the model, and sorts the gallery images by similarity from large to small, so that images closer to the front are more similar to the query image. To evaluate the performance of a pedestrian re-identification algorithm, the current practice is to compute the corresponding indexes on public data sets and then compare with other models. The CMC curve (Cumulative Matching Characteristics) and mAP (mean Average Precision) are the two most commonly used evaluation criteria.
In the experiments, the most commonly used indexes of the CMC curve, rank-1 and rank-5, and the mAP index are mainly reported. rank-k refers to the probability that a correct result appears among the top k (highest-confidence) retrieval results. The mAP index reflects the average level: the higher the mAP, the higher the correct results with the same identity as the query are ranked in the whole list, and the better the model.
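For clarity, a minimal sketch of the rank-k and mAP computation is given below; the camera-id filtering applied by the standard Market-1501 protocol is omitted, and the function name is illustrative.

```python
import numpy as np

def cmc_and_map(dist, q_ids, g_ids, topk=(1, 5)):
    """CMC rank-k and mAP from a Q x G distance matrix and identity labels (a sketch)."""
    order = np.argsort(dist, axis=1)                                 # gallery sorted per query
    matches = (g_ids[order] == q_ids[:, None]).astype(np.float64)    # Q x G hit matrix
    cmc = {k: float((matches[:, :k].sum(axis=1) > 0).mean()) for k in topk}
    aps = []
    for row in matches:                                              # average precision per query
        hits = np.where(row > 0)[0]
        if hits.size == 0:
            continue
        precision = np.arange(1, hits.size + 1) / (hits + 1)         # precision at each hit
        aps.append(precision.mean())
    return cmc, float(np.mean(aps))
```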
(3) ReID experimental details and main parameter configuration
In the experiments, the present invention uses ResNet50 as the backbone network, with the first layer and all layers before it as the shared module and the last three layers as branch modules (weights not shared). The convolution stride of the last layer is set to 1, the final 2048-dimensional features are obtained through global average pooling, and a batch normalization layer and a fully connected layer are added to compute the classification loss.
For all input data, the method resizes all images to 256 × 128 and sets the batch size to 128, containing 32 pedestrians with 4 pictures each (N = 128, P = 32, K = 4). The images are then randomly augmented and cropped, and each image is processed with Random Erasing (REA) with a probability of 0.5. It should be noted that when distinguishing between left and right orientations, horizontal flipping cannot be used, as it would change the orientation of the pedestrian.
In training, the network is trained for 120 epochs with an initial learning rate of 3.5 × 10^-4. The first 10 epochs use a learning-rate warm-up strategy, and the learning rate is then reduced to 0.1 times its previous value at epochs 35, 75 and 95. In the design of the loss function, the corresponding distance margins and weight parameters are selected through experiments as follows: α = 1, β = 0.7, θ = 0.7, δ = 0.001, λ = 0.0005, μ = 0.1.
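This schedule could be realized as in the sketch below; the optimizer choice (Adam), its weight decay, and the linear warm-up shape are assumptions, since the text only specifies the learning-rate values and decay epochs.

```python
import torch

def build_optimizer_and_scheduler(model, base_lr=3.5e-4, warmup_epochs=10,
                                  milestones=(35, 75, 95), gamma=0.1):
    """Sketch of the training schedule: linear warm-up for the first 10 epochs,
    then the learning rate is multiplied by 0.1 at epochs 35, 75 and 95."""
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr, weight_decay=5e-4)

    def lr_factor(epoch):
        factor = (epoch + 1) / warmup_epochs if epoch < warmup_epochs else 1.0
        for m in milestones:
            if epoch >= m:
                factor *= gamma
        return factor

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
    return optimizer, scheduler
```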
(4) Re-recognition of network experimental results
Based on the above evaluation indexes and experimental details, the method was tested on the two data sets and the corresponding experimental results were obtained. As shown in tables 2 and 3, the experiments compare the method of the present invention with other relatively advanced network models. For comparison, the methods most closely related to the present invention were selected, including: global-feature-based methods, metric learning methods, pose-based methods, and horizontal-partition local methods. In particular, a comparison is also made with the closest viewpoint-based method. RR indicates that the retrieval results are re-ranked.
TABLE 2 Comparison of results with other methods (Market-1501 data set)
Name of method Rank-1(%) Rank-5(%) mAP(%)
PCB 92.3 97.2 77.4
AlignedReID 91.8 97.1 79.3
PIE 87.33 95.56 69.25
GLAD 89.9 - 73.9
Spindle 76.9 91.5 -
HA-CNN 91.2 - 75.7
TriHard 86.67 93.38 81.07
HPM 94.2 97.5 82.7
PGR 93.87 97.74 77.21
OSCNN 83.9 - 73.5
Ours 94.71 98.06 84.11
Ours+RR 94.87 98.30 92.71
TABLE 3 Comparison of results with other methods (DukeMTMC-ReID data set)
Name of method Rank-1(%) Rank-5(%) mAP(%)
PCB 81.7 89.7 66.1
AlignedReID 81.2 - 67.4
PIE 80.84 88.30 64.09
HA-CNN 80.5 - 63.8
HPM 86.6 - 74.3
PGR 83.63 91.66 65.98
SVDNet 76.7 - 56.8
Ours 87.31 93.54 73.20
Ours+RR 90.63 94.25 87.67
To further demonstrate the effectiveness of the proposed network structure and training strategy, an ablation experiment was designed on the Market-1501 data set. First, ResNet50 is used as the backbone network and the network is trained with the combination of triplet loss and cross-entropy loss; the test result is taken as the baseline.
Next, the sample selection mode of each batch is changed from random selection to orientation-based selection strategy, and the loss function and other parameters of the network are kept unchanged. After repeated experiments, the strategy was found to bring about a 0.7% improvement for rank-1 and mAP.
Then, experiments were conducted considering only the same-orientation branch or only the different-orientation branch, testing each branch alone. Each branch alone performed relatively poorly, which was expected because a single branch only considers a single orientation combination and misses many representative triplets.
Finally, on the basis of two-branch co-training, the experiment verifies the effects of cross-constraint and introduction of local features, and the result proves that the cross-constraint is very effective because the cross-constraint makes negative samples in the same orientation and positive samples in different orientations more separable.
Specific ablation test results are shown in table 4.
TABLE 4 comparison of ablation test results
The above comparison and ablation experiments show that the method is superior to existing methods in pedestrian re-identification accuracy. Meanwhile, because the method adds an intra-class constraint to the cross-space constraints, there is a small margin between samples of the same pedestrian in different orientations, i.e., samples of the same pedestrian in the same orientation are closer. This is very meaningful in practical applications: for example, in target tracking, when two targets are very similar, it may be preferable to identify pedestrians with the same orientation, which is often the correct choice. As shown in fig. 10, some retrieval examples on the data set are given, where the leftmost image is the query image and the following five images are the top five retrieved images, arranged from high to low similarity; this intuitively verifies the effectiveness of the invention.
As shown in fig. 11, in the data set training process, the present invention records four distance relationships between samples on two data sets, namely, the distance between the same person in the same orientation (curve a in the figure), the distance between the same person in different orientations (curve B in the figure), the distance between different pedestrians in different orientations (curve C), and the distance between different pedestrians in the same orientation (curve D). The relative relation of the four distance curves can represent the training process and the purpose of the method, and has certain descriptive significance.
In conclusion, the invention proposes a multi-feature fusion pedestrian re-identification method based on orientation constraint. Through the two-branch re-identification network model, different orientation combinations receive attention and the influence of orientation differences on re-identification is overcome; the proposed network achieves rank-1 accuracies of 94.71% and 87.31% on the Market-1501 and DukeMTMC-ReID data sets, respectively, and its average level is superior to most current methods. Meanwhile, the invention proposes a pedestrian orientation classifier based on multi-feature fusion and an orientation-based sample selection strategy, annotates the orientation information of the two data sets accordingly, and, by fusing global and local features, again demonstrates the important influence of orientation change and local occlusion on pedestrian re-identification. In particular, the method can preferentially retrieve pedestrians with the same orientation, and can provide references for further analysis of orientation factors and for the construction of future pedestrian re-identification data sets.

Claims (2)

1. A multi-feature fusion pedestrian re-identification method based on orientation constraint is characterized by comprising the following steps:
processing same-orientation and different-orientation samples separately through a two-branch orientation-constraint network model, fusing global and local features in each branch to represent pedestrians, and finally training with a joint constraint of three loss terms to obtain the network weight parameters; the method comprises the following implementation steps:
s1, designing an orientation constraint network:
the orientation constraint network is a two-branch network structure in which a pedestrian sample is simultaneously mapped to two different feature spaces, each feature space corresponding to one network branch, and a different mixed loss function is designed for each branch; this design makes the first branch focus mainly on samples with the same orientation, while the second branch is better suited to samples with different orientations, and joint training of the two branches gives the network model good adaptability to pedestrian orientation; the first branch is called the same-orientation branch and the second branch the different-orientation branch;
firstly, the network selects a batch of N pedestrian images, where N = P × K, P being the number of pedestrians in the batch and K the number of different samples selected for each pedestrian; then the trained orientation classifier is used to judge the orientation of the pedestrian in each picture and obtain the corresponding orientation label; each pedestrian picture can thus be represented by the triple (I, Y, O), where I denotes the image, Y the pedestrian ID, i.e. which pedestrian the image belongs to, and O the orientation label of the pedestrian;
for the input images of the batch, the orientation constraint network first extracts simple features, including common color, attribute and texture features, through a shared convolution module; two branch convolution networks are then appended after the shared module, and the samples are mapped to two different high-dimensional subspaces through the learning of the weight parameters; taking ResNet50 as the backbone network, ResNet50 is divided into a shared part and a branch part: the first convolution stage of ResNet50 serves as the shared module and the last three stages serve as the branches, so that the upper and lower branch networks each output a feature matrix of dimension N × d; two different features are thus extracted from each pedestrian image, giving 2N features in total;
after the 2N features are obtained, the network selects different types of orientation-based triplets in each branch and iterates the training with the orientation-constrained mixed loss function to obtain the final network weights;
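A minimal sketch of this two-branch backbone is given below; it is an illustration only, not the patented implementation, and the module split (stem plus first residual stage shared, last three stages duplicated per branch), the pooling choice and all names (TwoBranchOrientationNet, branch_same, branch_diff) are assumptions made for the sketch.

```python
# A minimal PyTorch sketch (not the patented implementation) of the two-branch
# orientation-constraint backbone in step S1: the ResNet50 stem and first
# residual stage are shared, the last three stages are duplicated into a
# same-orientation branch and a different-orientation branch, and each branch
# emits one feature vector per image (an N x d matrix per batch).
import copy
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TwoBranchOrientationNet(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)  # load pretrained weights in practice
        # Shared shallow module: stem + first residual stage (low-level features).
        self.shared = nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu,
            backbone.maxpool, backbone.layer1,
        )
        # Two deep branches: independent copies of the last three stages.
        deep = nn.Sequential(backbone.layer2, backbone.layer3, backbone.layer4)
        self.branch_same = deep
        self.branch_diff = copy.deepcopy(deep)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        shared = self.shared(x)
        f_same = self.pool(self.branch_same(shared)).flatten(1)  # N x 2048
        f_diff = self.pool(self.branch_diff(shared)).flatten(1)  # N x 2048
        return f_same, f_diff

# Example: a batch of N = P*K images yields 2N feature vectors in total.
# f_same, f_diff = TwoBranchOrientationNet()(torch.randn(8, 3, 256, 128))
```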
s2, a difficult sample sampling strategy based on the orientation of the pedestrian:
in the network training process, P pedestrians are randomly selected in each training batch, and K different images are randomly selected for each pedestrian; the selection of the K images is orientation-based, and it must be ensured that the K images contain both samples with the same orientation and samples with different orientations, so as to guarantee sampling diversity; after network mapping, positive and negative sample pairs are selected for each image to form triplets that participate in the computation of the mixed loss; specifically, for a pedestrian image a, the hardest positive sample and the hardest negative sample in the batch are selected: the hardest positive sample is the positive sample whose feature vector is farthest from that of a, i.e. the positive sample least similar to a, while the hardest negative sample is the negative sample whose feature vector is closest to that of a, i.e. the negative sample most similar to a;
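The orientation-aware hard-example selection of step S2 can be sketched as follows; this is a minimal illustration under the assumption that features are compared with the Euclidean distance and that IDs and orientation labels are integer tensors, and the helper name hard_mine is illustrative.

```python
# A minimal sketch (assumed details, not the original code) of the
# orientation-aware hard-example mining of step S2: for each anchor image,
# pick the hardest positive (same ID, largest feature distance) and the
# hardest negative (different ID, smallest feature distance), restricted to
# the required orientation relation.
import torch

def hard_mine(feats, pids, orients, same_orientation=True):
    """feats: (N, d) float tensor; pids, orients: (N,) integer tensors.
    Returns per-anchor indices of the hardest positive and hardest negative."""
    dist = torch.cdist(feats, feats)                 # pairwise Euclidean distances
    same_id = pids.unsqueeze(0) == pids.unsqueeze(1)
    same_or = orients.unsqueeze(0) == orients.unsqueeze(1)
    or_mask = same_or if same_orientation else ~same_or
    eye = torch.eye(len(pids), dtype=torch.bool, device=feats.device)

    pos_mask = same_id & ~eye & or_mask              # same pedestrian, required orientation
    neg_mask = ~same_id & or_mask                    # different pedestrian, required orientation

    # Hardest positive = farthest valid positive; hardest negative = closest valid negative.
    pos_idx = dist.masked_fill(~pos_mask, float('-inf')).argmax(dim=1)
    neg_idx = dist.masked_fill(~neg_mask, float('inf')).argmin(dim=1)
    # Anchors without any valid positive/negative should be excluded in practice.
    return pos_idx, neg_idx
```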
s3, network joint training strategy:
on the specific training strategy, the network jointly trains three loss functions, namely the triplet loss, the softmax classification loss and the center loss, with orientation-dependent constraints added on the different branches; for the same-orientation branch, only samples with the same orientation in a batch are selected to form triplets, i.e. when positive and negative samples are chosen for the current training sample only same-orientation samples are used, so that the network learns the identity information of the pedestrian while ignoring the influence of orientation information; let a denote a pedestrian image in the batch, p a positive sample of the same pedestrian as a, n a negative sample from a different pedestrian, s the same orientation and d a different orientation; the loss is given by formula (1):
L_triSame = Σ_{a=1}^{N} [ d(f_a, f_ps) − d(f_a, f_ns) + α ]_+        (1)
wherein ps denotes a sample of the same pedestrian in the same orientation, ns denotes a sample of a different pedestrian with the same orientation, d(·,·) denotes the Euclidean distance between two feature vectors, and α denotes the distance margin of the triplet loss;
for the classification loss, the same-orientation branch takes the pedestrian identity and the orientation as a combined label, so that M pedestrians with the four orientations front, back, left and right are divided into M × 4 different categories; formula (2) is the softmax loss and formula (3) is the center loss, where N denotes the number of images in the batch, f_i the feature vector of the i-th image, f_i(k) the k-th dimension of the i-th feature vector, label_i the combined label of pedestrian ID and orientation, M × T the number of classification categories (with T = 4 orientations, i.e. M × 4), which is also the length of the feature vector output by the fully connected layer, and c_label_i the class center of class label_i;
L_ceSame = −(1/N) Σ_{i=1}^{N} log( exp(f_i(label_i)) / Σ_{k=1}^{M×T} exp(f_i(k)) )        (2)
L_center = (1/2) Σ_{i=1}^{N} || f_i − c_label_i ||²        (3)
finally, the total loss of the same-orientation branch consists of three parts, as in formula (4);
L_same = L_triSame + L_ceSame + λ·L_center        (4)
wherein λ represents a weighting factor of the center loss;
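Formulas (1) to (4) can be combined into one mixed loss as sketched below; this is an illustrative fragment, not the original code: it reuses the hard_mine helper sketched above and assumes orientation labels in {0, 1, 2, 3}, logits from a fully connected layer with M × 4 outputs, and a learnable (M × 4, d) center matrix, with illustrative margin and weight values.

```python
# A minimal sketch of the same-orientation branch loss of formula (4):
# orientation-restricted triplet loss (1) + softmax loss over M*4 joint
# (ID, orientation) classes (2) + center loss (3).
import torch
import torch.nn.functional as F

def same_branch_loss(feats, logits, centers, pids, orients, alpha=0.3, lam=0.005):
    # (1) Triplet loss: positives and negatives both share the anchor's orientation.
    pos_idx, neg_idx = hard_mine(feats, pids, orients, same_orientation=True)
    d_ap = (feats - feats[pos_idx]).norm(dim=1)
    d_an = (feats - feats[neg_idx]).norm(dim=1)
    l_tri = F.relu(d_ap - d_an + alpha).mean()

    # (2) Softmax loss over the joint label: class = pedestrian ID * 4 + orientation.
    joint = pids * 4 + orients
    l_ce = F.cross_entropy(logits, joint)

    # (3) Center loss: pull each feature towards the center of its joint class.
    l_center = 0.5 * (feats - centers[joint]).pow(2).sum(dim=1).mean()

    # (4) L_same = L_triSame + L_ceSame + lambda * L_center
    return l_tri + l_ce + lam * l_center
```

The different-orientation branch described next follows the same pattern, with same_orientation=False in the mining step and a plain M-way ID classifier in place of the joint label.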
the different-orientation branch mainly handles the training of samples with different orientations, so its triplets are composed of samples with different orientations; the triplet loss of the different-orientation branch is given by formula (5):
L_triDiff = Σ_{a=1}^{N} [ d(f_a, f_pd) − d(f_a, f_nd) + β ]_+        (5)
where pd denotes a sample of the same pedestrian in a different orientation, nd denotes a sample of a different pedestrian in a different orientation, and β is again a distance margin; the difference lies in the design of the classification loss, which classifies only the pedestrian ID, giving M categories for M pedestrians; formula (6) is the softmax loss of the different-orientation branch, and the center loss is the same as formula (3);
L_ceDiff = −(1/N) Σ_{i=1}^{N} log( exp(f_i(label_i)) / Σ_{k=1}^{M} exp(f_i(k)) )        (6)
wherein M denotes the number of pedestrian classes, label_i here is the pedestrian ID label only, and the meanings of the remaining symbols are the same as in formula (2);
finally, the total loss of the different-orientation branch is obtained as shown in formula (7);
L_diff = L_triDiff + L_ceDiff + λ·L_center        (7)
meanwhile, cross constraints between the branches are added during training, still based mainly on the triplet loss, as shown in formulas (8) and (9);
L_cross = Σ_{a=1}^{N} [ d(f_a, f_pd) − d(f_a, f_ns) + θ ]_+        (8)
formula (8) is also a triplet loss, where θ is the distance margin; for a training sample a, the selected positive sample has a different orientation from a, while the selected negative sample has the same orientation as a;
L_intra = Σ_{a=1}^{N} [ d(f_a, f_ps) − d(f_a, f_pd) + δ ]_+        (9)
formula (9) does not consider negative samples and only constrains the relative relation between positive samples, forming an intra-class constraint: a margin threshold δ is set so that, within the same class, i.e. among the samples of the same pedestrian, the distance between samples with the same orientation is smaller than the distance between samples with different orientations;
L_Total = L_same + L_diff + L_cross + μ·L_intra        (10)
the overall loss function can be represented by equation (10), where μ is a weighting parameter.
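The cross constraint of formula (8), the intra-class constraint of formula (9) and the total loss of formula (10) can be sketched as follows; again a minimal illustration with assumed names and margin values, reusing the hard_mine helper from the earlier sketch.

```python
# A minimal sketch of formulas (8)-(10): positives in a different orientation
# are pulled closer than same-orientation negatives (L_cross), and
# same-orientation positives are kept slightly closer than different-orientation
# positives of the same pedestrian (L_intra).
import torch
import torch.nn.functional as F

def cross_and_intra_loss(feats, pids, orients, theta=0.3, delta=0.1):
    pd_idx, _ = hard_mine(feats, pids, orients, same_orientation=False)      # same ID, different orientation
    ps_idx, ns_idx = hard_mine(feats, pids, orients, same_orientation=True)  # same / different ID, same orientation

    d_a_pd = (feats - feats[pd_idx]).norm(dim=1)
    d_a_ns = (feats - feats[ns_idx]).norm(dim=1)
    d_a_ps = (feats - feats[ps_idx]).norm(dim=1)

    l_cross = F.relu(d_a_pd - d_a_ns + theta).mean()   # formula (8)
    l_intra = F.relu(d_a_ps - d_a_pd + delta).mean()   # formula (9)
    return l_cross, l_intra

def total_loss(l_same, l_diff, l_cross, l_intra, mu=0.5):
    # Formula (10): L_Total = L_same + L_diff + L_cross + mu * L_intra
    return l_same + l_diff + l_cross + mu * l_intra
```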
2. The multi-feature fusion pedestrian re-identification method based on orientation constraint as claimed in claim 1, which further proposes a pedestrian orientation classification method based on the fusion of global features and local features, characterized in that: the global features and the local features of the pedestrian image are combined so that pedestrians in different orientations become more separable, the orientations comprising front, back, left and right; the method comprises the following implementation steps:
for a pedestrian image, 18 body joint key points of the pedestrian are first extracted with the PAFs method and used to describe the contour of the pedestrian; at the same time, the coordinate position of each key point in the image is obtained accurately;
secondly, the whole pedestrian image is divided transversely into three body parts, namely the head, the upper body and the lower body, so that the whole pedestrian image and the three body parts form multiple inputs to a convolutional neural network; the features of the four image parts are extracted by separate convolution modules using a ResNet50 network, and the obtained feature vectors are then concatenated into a combined vector for the final pedestrian representation;
and finally, a fully connected layer is added at the end of the network, four-way classification is performed with a softmax loss function, and the training is iterated continuously to obtain the final classification result.
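A minimal sketch of the orientation classifier of claim 2 is given below; the key-point detection and the cropping of the head, upper-body and lower-body regions are assumed to be done beforehand (e.g. with an OpenPose-style PAFs detector), and the class name, feature dimension and use of a separate ResNet50 trunk per part are illustrative assumptions.

```python
# A minimal sketch (assumed structure, not the original implementation) of the
# orientation classifier of claim 2: the whole image plus head, upper-body and
# lower-body crops each pass through a ResNet50 trunk; the four feature vectors
# are concatenated and a fully connected layer performs the four-way
# front/back/left/right classification.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class OrientationClassifier(nn.Module):
    def __init__(self, feat_dim=2048, num_orientations=4):
        super().__init__()
        def trunk():
            m = resnet50(weights=None)
            m.fc = nn.Identity()          # keep the 2048-d pooled feature
            return m
        # One trunk per input: whole image, head, upper body, lower body.
        self.trunks = nn.ModuleList([trunk() for _ in range(4)])
        self.fc = nn.Linear(4 * feat_dim, num_orientations)

    def forward(self, whole, head, upper, lower):
        feats = [t(x) for t, x in zip(self.trunks, (whole, head, upper, lower))]
        joint = torch.cat(feats, dim=1)   # fused global + local representation
        return self.fc(joint)             # train with softmax cross-entropy
```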
CN202010901241.9A 2020-09-01 2020-09-01 Multi-feature fusion pedestrian re-identification method based on orientation constraint Active CN112101150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010901241.9A CN112101150B (en) 2020-09-01 2020-09-01 Multi-feature fusion pedestrian re-identification method based on orientation constraint

Publications (2)

Publication Number Publication Date
CN112101150A CN112101150A (en) 2020-12-18
CN112101150B true CN112101150B (en) 2022-08-12

Family

ID=73757029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010901241.9A Active CN112101150B (en) 2020-09-01 2020-09-01 Multi-feature fusion pedestrian re-identification method based on orientation constraint

Country Status (1)

Country Link
CN (1) CN112101150B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613480A (en) * 2021-01-04 2021-04-06 上海明略人工智能(集团)有限公司 Face recognition method, face recognition system, electronic equipment and storage medium
CN114821629A (en) * 2021-01-27 2022-07-29 天津大学 Pedestrian re-identification method for performing cross image feature fusion based on neural network parallel training architecture
CN113011440B (en) * 2021-03-19 2023-11-28 中联煤层气有限责任公司 Coal-bed gas well site monitoring and re-identification technology
CN113011429B (en) * 2021-03-19 2023-07-25 厦门大学 Real-time street view image semantic segmentation method based on staged feature semantic alignment
CN113159142B (en) * 2021-04-02 2024-02-20 杭州电子科技大学 Loss function variable super-parameter determination method for fine-granularity image classification
CN113095263B (en) * 2021-04-21 2024-02-20 中国矿业大学 Training method and device for pedestrian re-recognition model under shielding and pedestrian re-recognition method and device under shielding
CN113688776B (en) * 2021-09-06 2023-10-20 北京航空航天大学 Space-time constraint model construction method for cross-field target re-identification
CN113723345B (en) * 2021-09-09 2023-11-14 河北工业大学 Domain self-adaptive pedestrian re-identification method based on style conversion and joint learning network
CN113642547B (en) * 2021-10-18 2022-02-11 中国海洋大学 Unsupervised domain adaptive character re-identification method and system based on density clustering
CN114299542A (en) * 2021-12-29 2022-04-08 北京航空航天大学 Video pedestrian re-identification method based on multi-scale feature fusion
CN114937231B (en) * 2022-07-21 2022-09-30 成都西物信安智能系统有限公司 Target identification tracking method
CN115661688B (en) * 2022-10-09 2024-04-26 武汉大学 Unmanned aerial vehicle target re-identification method, system and equipment with rotation invariance
CN115661722B (en) * 2022-11-16 2023-06-06 北京航空航天大学 Pedestrian re-identification method combining attribute and orientation
CN115631464B (en) * 2022-11-17 2023-04-04 北京航空航天大学 Pedestrian three-dimensional representation method oriented to large space-time target association
CN116403269B (en) * 2023-05-17 2024-03-26 智慧眼科技股份有限公司 Method, system, equipment and computer storage medium for analyzing occlusion human face
CN117238039B (en) * 2023-11-16 2024-03-19 暗物智能科技(广州)有限公司 Multitasking human behavior analysis method and system based on top view angle

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017000115A1 (en) * 2015-06-29 2017-01-05 北京旷视科技有限公司 Person re-identification method and device
CN106709449A (en) * 2016-12-22 2017-05-24 深圳市深网视界科技有限公司 Pedestrian re-recognition method and system based on deep learning and reinforcement learning
US20180374233A1 (en) * 2017-06-27 2018-12-27 Qualcomm Incorporated Using object re-identification in video surveillance
CN108960127A (en) * 2018-06-29 2018-12-07 厦门大学 Pedestrian's recognition methods again is blocked based on the study of adaptive depth measure

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"A Fast Mode Decision Algorithm for Multiview Video Coding";Mingjing Ai 等;《IEEE》;20101018;全文 *
"Orientation-Guided Similarity Learning for Person Re-identification";Na Jiang 等;《IEEE》;20181129;全文 *
"Person Re-identification by Features Fusion";Wan Xin 等;《IEEE》;20160905;全文 *

Also Published As

Publication number Publication date
CN112101150A (en) 2020-12-18

Similar Documents

Publication Publication Date Title
CN112101150B (en) Multi-feature fusion pedestrian re-identification method based on orientation constraint
CN107832672B (en) Pedestrian re-identification method for designing multi-loss function by utilizing attitude information
Bai et al. Group-sensitive triplet embedding for vehicle reidentification
Qu et al. RGBD salient object detection via deep fusion
Guo et al. Efficient and deep person re-identification using multi-level similarity
Sarfraz et al. Deep view-sensitive pedestrian attribute inference in an end-to-end model
Liu et al. Matching-cnn meets knn: Quasi-parametric human parsing
Bashir et al. Vr-proud: Vehicle re-identification using progressive unsupervised deep architecture
Wang et al. A survey of vehicle re-identification based on deep learning
CN108520226B (en) Pedestrian re-identification method based on body decomposition and significance detection
CN113408492B (en) Pedestrian re-identification method based on global-local feature dynamic alignment
CN111507217A (en) Pedestrian re-identification method based on local resolution feature fusion
CN110263712B (en) Coarse and fine pedestrian detection method based on region candidates
Tang et al. Weakly supervised learning of deformable part-based models for object detection via region proposals
Wang et al. Traffic sign detection using a cascade method with fast feature extraction and saliency test
CN113221625A (en) Method for re-identifying pedestrians by utilizing local features of deep learning
Wang et al. S3D: scalable pedestrian detection via score scale surface discrimination
Tang et al. Weakly-supervised part-attention and mentored networks for vehicle re-identification
Fu et al. Learning latent features with local channel drop network for vehicle re-identification
Li et al. VRID-1: A basic vehicle re-identification dataset for similar vehicles
Zhang et al. Bioinspired scene classification by deep active learning with remote sensing applications
Liu et al. Multi-attention deep reinforcement learning and re-ranking for vehicle re-identification
Cai et al. Beyond photo-domain object recognition: Benchmarks for the cross-depiction problem
Lu et al. It’s okay to be wrong: Cross-view geo-localization with step-adaptive iterative refinement
Chen et al. Part alignment network for vehicle re-identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant