CN112101217B

CN112101217B - Pedestrian re-identification method based on semi-supervised learning

Info

Publication number: CN112101217B
Application number: CN202010970306.5A
Authority: CN
Inventors: 葛永新; 高志顺
Original assignee: Zhenjiang Qidi Digital World Technology Co ltd
Current assignee: Zhenjiang Qidi Digital World Technology Co ltd
Priority date: 2020-09-15
Filing date: 2020-09-15
Publication date: 2024-04-26
Anticipated expiration: 2040-09-15
Also published as: CN112101217A

Abstract

The invention discloses a pedestrian re-identification method based on semi-supervised learning, which comprises the following steps: S100, learning a projection matrix $U \in \mathbb{R}^{d \times c}$ that projects the original d-dimensional feature space into a c-dimensional subspace, so that $U^T X \in \mathbb{R}^{c \times N}$ meets the following conditions in the new subspace: the Euclidean distance between pairs of samples from the same pedestrian is smaller, and the Euclidean distance between pairs of samples from different pedestrians is larger; samples from the same pedestrian are defined as homogeneous samples, and samples from different pedestrians are defined as heterogeneous samples; S200, projecting a new sample into the new subspace using the projection matrix $U \in \mathbb{R}^{d \times c}$ to obtain a predicted sample sequence, in which the samples of the training sample set are arranged in ascending order of their Euclidean distance to the new sample. The method makes full use of the accurately labeled samples: the contrastive loss function constrains the positive sample pairs while also making full use of the negative sample pairs, so that identification is fast and identification accuracy is higher.

Description

Pedestrian re-identification method based on semi-supervised learning

Technical Field

The invention relates to the technical field of pedestrian re-identification, in particular to a pedestrian re-identification method based on semi-supervised learning.

Background

Pedestrian re-identification (person re-identification), also known as pedestrian re-recognition, is a technique that uses computer vision to determine whether a particular pedestrian is present in an image or video sequence. It is widely regarded as a sub-problem of image retrieval: given a pedestrian image from one surveillance camera, images of the same pedestrian are retrieved across devices. It compensates for the limited field of view of fixed cameras, can be combined with pedestrian detection and pedestrian tracking technologies, and is widely applicable to fields such as intelligent video surveillance and intelligent security.

Although in recent years computer vision practitioners have proposed a number of algorithms from different perspectives for pedestrian re-recognition tasks, attempting to continually boost the recognition rate on the public data set, pedestrian re-recognition remains a very challenging task due to the effects of several realistic factors.

At present, semi-supervised learning is commonly used to solve the pedestrian re-identification task, and the flow is roughly as follows: first, the unlabeled samples are automatically labeled; second, the labeled samples and the automatically labeled samples are trained together, and the model is optimized so that it has better discrimination capability. Two problems exist in the automatic labeling and in the use of the labeled and unlabeled samples:

① Existing methods automatically label the unlabeled samples by applying a K-Nearest Neighbor (KNN) algorithm in the new space being learned. If the learned space does not yet discriminate well, the automatic labels contain relatively large errors. When the model is then trained on data with large labeling errors, the additional training samples might be expected to improve its generalization capability, but in practice the more the model is trained, the worse its discrimination capability becomes.

② When training is performed, only positive sample pairs in the training set are constrained, and negative sample pairs are not concerned, so that the training samples are not fully utilized.

Disclosure of Invention

Aiming at the problems existing in the prior art, the technical problems to be solved by the invention are as follows: the existing method for solving the task of re-identifying pedestrians by using semi-supervised learning has the problems that automatic labeling errors are easily influenced by new spaces obtained by learning and the utilization of training samples is insufficient.

In order to solve the technical problems, the invention adopts the following technical scheme: the pedestrian re-identification method based on semi-supervised learning comprises the following steps:

S100: learning a projection matrix $U \in \mathbb{R}^{d \times c}$ that projects the original d-dimensional feature space into a c-dimensional subspace such that $U^T X \in \mathbb{R}^{c \times N}$ satisfies, in the new subspace: the Euclidean distance between pairs of samples from the same pedestrian is smaller, and the Euclidean distance between pairs of samples from different pedestrians is larger; samples from the same pedestrian are defined as homogeneous samples, and samples from different pedestrians are defined as heterogeneous samples;

S200: projecting a new sample into the new subspace using the projection matrix $U \in \mathbb{R}^{d \times c}$ to obtain a predicted sample sequence, in which the samples of the training sample set are arranged in ascending order of their Euclidean distance to the new sample.

Preferably, the method for learning the projection matrix $U \in \mathbb{R}^{d \times c}$ in S100 specifically includes:

S110, building a training sample set, wherein the training sample set comprises a plurality of samples, the samples comprise labeled samples and unlabeled samples, and the labels of the samples of the same pedestrian in the labeled samples are the same;

Let $X = [X_L, X_U] \in \mathbb{R}^{d \times N}$ denote all training samples, where N is the number of all pictures contained in the training set, d is the length of the feature vector, $X_L \in \mathbb{R}^{d \times N_L}$ represents the $N_L$ labeled samples, and $X_U \in \mathbb{R}^{d \times N_U}$ represents the $N_U$ unlabeled samples;

S120, establishing an objective function as follows:

where $L(U)$ is a regression function, $\Omega(U)$ is a regularization constraint, and $\alpha, \lambda > 0$ are balance coefficients;

S130, the labeled sample loss function is a contrastive loss function: for each of the $N_P$ sampled sample pairs, if the two samples of the pair come from the same pedestrian, then the Euclidean distance $d_n$ between them in the new projection space should be as small as possible, close to 0; conversely, $d_n$ should be at least greater than a predetermined threshold margin > 0; a loss is incurred if these conditions are not met;

S140, unlabeled sample loss function: the unlabeled samples are labeled using the K mutual nearest neighbors method, and the loss function for the unlabeled samples is as follows:

where, if $U^T x_i$ and $U^T x_j$ are K mutual nearest neighbors and $x_i$ and $x_j$ come from different cameras, then take

otherwise $W_{ij} = 0$ (8);

after the unlabeled samples have been labeled, the samples thus labeled are used to further constrain the existing subspace, where the constraint weight is the cosine distance of the two samples in the new projection space;

S150: regularization term: the projection matrix U is constrained using the $L_{2,1}$ norm:

$\Omega(U) = \|U\|_{2,1}$ (4).

Preferably, the labeled sample loss function of S130 is:

Wherein:

Preferably, the $N_P$ sample pairs in S130 are drawn with a sampling strategy that maximizes the top-k recognition rate, i.e., for each image, all samples among its k nearest neighbors are sampled.

Preferably, in S140, the method for labeling the unlabeled samples using K mutual nearest neighbors is as follows:

the K nearest neighbors $N(x, k)$ of a sample x are defined as follows:

$N(x, k) = \{x_1, x_2, \ldots, x_k\}, \quad |N(x, k)| = k$ (5);

where $|\cdot|$ denotes the number of samples in the set; the K mutual nearest neighbors $R(x, k)$ are then defined as follows:

$R(x, k) = \{x_i \mid (x_i \in N(x, k)) \wedge (x \in N(x_i, k))\}$ (6).

Compared with the prior art, the invention has at least the following advantages:

(1) The invention uses K mutual nearest neighbors to make the automatic labeling of unlabeled samples more reliable.

(2) The accurately labeled samples are fully utilized. The contrastive loss function commonly used in training deep neural networks constrains the positive sample pairs while also making full use of the negative sample pairs. It should be noted that any type of loss for identification or classification may be used as an alternative to the labeled sample loss function.

(3) To allow the model to be migrated conveniently to a deep model later, an end-to-end training approach is used, with stochastic gradient descent as the training strategy. A batch generation strategy that maximizes the top-k recognition rate is proposed, which addresses problems such as the slow convergence of pairwise training under random batches and model overfitting.

Drawings

Fig. 1 shows the K mutual nearest neighbor sampling strategy for the pedestrian re-identification problem. First row: one picture to be retrieved and its 10 nearest neighbors, where P1-P4 are positive samples and N1-N6 are negative samples. Second row: every two columns show the 10 nearest neighbors of the corresponding image in the first row. The thick-line rectangular frame without chamfered corners and the thin-line rectangular frame with chamfered corners mark the picture to be retrieved and the positive sample pictures, respectively.

Fig. 2 shows the sampling strategy: the negative sample closest to the image to be retrieved is the hardest negative sample; the positive sample whose distance is just smaller than that of the hardest negative sample is the moderate positive sample; the samples inside the box are those selected by the sampling strategy proposed here.

Fig. 3 illustrates moderate positive sample sampling.

Detailed Description

The invention is described in further detail below with reference to the accompanying drawings.

Referring to fig. 1-3, the pedestrian re-recognition method based on semi-supervised learning comprises the following steps:

S100: learning a projection matrix $U \in \mathbb{R}^{d \times c}$ that projects the original d-dimensional feature space into a c-dimensional subspace such that $U^T X \in \mathbb{R}^{c \times N}$ satisfies, in the new subspace: the Euclidean distance between pairs of samples from the same pedestrian is smaller, and the Euclidean distance between pairs of samples from different pedestrians is larger; samples from the same pedestrian are defined as homogeneous samples, and samples from different pedestrians are defined as heterogeneous samples.

The method for learning the projection matrix $U \in \mathbb{R}^{d \times c}$ specifically comprises the following steps:

S120, establishing an objective function as follows:

where $L(U)$ is a regression function whose purpose is to make labeled sample pairs with the same label closer, and sample pairs with different labels farther apart, in the new space they are mapped to; the unlabeled-sample term is a weighted regression function that uses the unlabeled samples to improve the discriminative power of the model; $\Omega(U)$ is a regularization constraint that selects the more discriminative features from the original feature space and avoids overfitting; and $\alpha, \lambda > 0$ are balance coefficients;
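As a reading aid, the structure of the objective described above can be sketched as a sum of three terms. The formula itself is not preserved in the published text, so the symbol $R(U)$ for the unlabeled-sample term and the assignment of $\alpha$ and $\lambda$ to particular terms are assumptions made here for illustration, not the patent's own notation.

```latex
\min_{U \in \mathbb{R}^{d \times c}} \; L(U) \;+\; \alpha\, R(U) \;+\; \lambda\, \Omega(U),
\qquad \alpha, \lambda > 0
```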

S130, labeled sample loss function: the purpose of this constraint is to make full use of the label information of the labeled samples. In order to use the constraints of positive and negative sample pairs at the same time, a contrastive loss function is used for training:

Wherein:

Here, for each of the $N_P$ sampled sample pairs, if the two samples of the pair come from the same pedestrian, then the Euclidean distance $d_n$ between them in the new projection space should be as small as possible, close to 0; conversely, $d_n$ should be at least greater than a predetermined threshold margin > 0; a loss is incurred if these conditions are not met;
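The behavior described for the labeled-sample term matches the widely used contrastive loss, so a minimal NumPy sketch may help. The exact formula is not preserved in the text; the squared form, the averaging over pairs, and the names `contrastive_loss`, `X1`, `X2` and `same` are assumptions for illustration only.

```python
import numpy as np

def contrastive_loss(U, X1, X2, same, margin=0.5):
    """Mean contrastive loss over N_P sample pairs (assumed standard form).

    U      : (d, c) projection matrix
    X1, X2 : (d, N_P) arrays; column n of each holds one member of pair n
    same   : (N_P,) booleans, True if pair n shows the same pedestrian
    """
    same = np.asarray(same, dtype=bool)
    # Project both members of every pair into the c-dimensional subspace.
    Z1, Z2 = U.T @ X1, U.T @ X2
    # Euclidean distance d_n between the projected pair members.
    d = np.linalg.norm(Z1 - Z2, axis=0)
    # Positive pairs are pulled towards distance 0.
    pos = d ** 2
    # Negative pairs are penalised only while they are closer than the margin.
    neg = np.maximum(0.0, margin - d) ** 2
    return float(np.mean(np.where(same, pos, neg)))
```

A negative pair that is already farther apart than `margin` contributes nothing, which is what lets the term use negative pairs without pushing them apart indefinitely.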

S140, unlabeled sample loss function: in order to make effective use of the discrimination information in the unlabeled samples while reducing the adverse effect of labeling errors on the model, the K mutual nearest neighbors method, rather than K nearest neighbors, is adopted to label the unlabeled samples, and only positive sample pairs are constrained in this term. The specific loss function is as follows:

otherwise $W_{ij} = 0$ (8);

The rationale for this term is that sample pairs that are K mutual nearest neighbors in a learned subspace with discriminative power are most likely from the same pedestrian. After the unlabeled samples have been labeled in this way, the samples thus labeled are used to further constrain the existing subspace, where the constraint weight is the cosine distance of the two samples in the new projection space.
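The weighting scheme described for the unlabeled-sample term can be illustrated with a short NumPy sketch. The published text does not preserve the formula, so several points are assumptions: the weight is taken to be the cosine similarity of the two projected samples, the loss is taken to be a weighted sum of squared Euclidean distances, and the mutual-neighbor relation is passed in as a precomputed boolean matrix `mutual`. The function and argument names are hypothetical.

```python
import numpy as np

def unlabeled_loss(U, X_U, mutual, cam):
    """Weighted pairwise loss on unlabeled samples (assumed form).

    U      : (d, c) projection matrix
    X_U    : (d, N_U) unlabeled samples, one per column
    mutual : (N_U, N_U) booleans, True where i and j are K mutual nearest neighbors
    cam    : (N_U,) camera id of each sample
    Returns (loss, W).
    """
    Z = U.T @ X_U                                  # (c, N_U) projected samples
    G = Z.T @ Z                                    # pairwise inner products
    norms = np.linalg.norm(Z, axis=0) + 1e-12      # guard against zero vectors
    cos = G / np.outer(norms, norms)               # cosine similarity (assumed weight)
    cross_cam = cam[:, None] != cam[None, :]       # pairs seen by different cameras
    W = np.where(mutual & cross_cam, cos, 0.0)     # W_ij = 0 for every other pair
    sq = np.sum(Z ** 2, axis=0)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * G, 0.0)  # squared distances
    return float(np.sum(W * D2)), W
```

Only pairs that are both mutual neighbors and cross-camera receive a nonzero weight, so the term constrains likely positive pairs and ignores everything else.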

S150: regularization term: the regularization term is added to make the learned projection matrix sparser while avoiding overfitting. Here the $L_{2,1}$ norm is used to constrain the projection matrix U:

$\Omega(U) = \|U\|_{2,1}$ (4).
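For readers unfamiliar with the $L_{2,1}$ norm, the sketch below uses the common convention of summing the Euclidean norms of the rows. The text does not spell out the convention, so this choice is an assumption; it is consistent with the stated goal of selecting features, since each row of U corresponds to one original feature dimension.

```python
import numpy as np

def l21_norm(U):
    """L2,1 norm under the row convention: sum of the Euclidean norms of the rows."""
    return float(np.sum(np.linalg.norm(U, axis=1)))

# A row of zeros means the corresponding input feature is ignored entirely.
U = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])
value = l21_norm(U)   # 5 + 0 + 1
```

Penalising this quantity drives whole rows towards zero, which is why it acts as a feature selector rather than merely shrinking individual entries.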

As an improvement, the $N_P$ sample pairs in S130 are drawn with a sampling strategy that maximizes the top-k recognition rate, i.e., for each image, all samples among its k nearest neighbors are sampled. This avoids overfitting while making maximum use of the discrimination information in the labeled samples.

When optimizing with stochastic gradient descent, all samples need to be fed into the model in batches. In the conventional scheme, samples are drawn at random: each time a small number of classes is randomly selected, and two images are taken for each class. When the loss is calculated, all sample pairs that can be formed from the images in a batch take part in the calculation. Although many sample pairs can be evaluated at once in this way, the randomness of the class sampling means that the resulting optimization direction may not be the one that makes the objective drop fastest. Therefore, each time an optimization pass is completed, the distances between all samples under the current model are calculated, and, to make the objective drop faster, only the hardest negative sample pair under the current model is selected for each image, as in fig. 2.

It should be noted that some positive sample pairs have very large intra-class differences caused by drastic appearance changes, and training on them would most likely make the model overfit, as in fig. 3. To avoid this overfitting, a moderate positive sample is taken for each picture in the manner of fig. 2, i.e., the positive sample whose distance is just smaller than that of the hardest negative sample. In order to use as much of the information provided by the labeled samples as possible, a sampling strategy that maximizes the top-k recognition rate is proposed. As shown in fig. 2, for each image, all samples among its k nearest neighbors are sampled, which avoids overfitting while making maximum use of the discrimination information in the labeled samples.
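The top-k sampling rule described above can be sketched in a few lines of NumPy. This is an illustrative reading of the rule, not the authors' implementation: it pairs every labeled sample with each of its k nearest neighbors under the current projection and records whether the pair is positive. The name `topk_pairs` and the tuple layout are hypothetical.

```python
import numpy as np

def topk_pairs(Z, labels, k):
    """Pair every sample with each of its k nearest neighbours.

    Z      : (c, N) projected labeled samples, one per column
    labels : (N,) pedestrian identity of each sample
    Returns a list of (i, j, same) tuples.
    """
    Z = np.asarray(Z, dtype=float)
    sq = np.sum(Z ** 2, axis=0)
    D = sq[:, None] + sq[None, :] - 2.0 * (Z.T @ Z)    # squared distances
    np.fill_diagonal(D, np.inf)                         # a sample is not its own neighbour
    nn = np.argsort(D, axis=1, kind="stable")[:, :k]    # k nearest per sample
    return [(i, int(j), bool(labels[i] == labels[j]))
            for i in range(Z.shape[1]) for j in nn[i]]
```

Because the neighbor lists depend on the current projection, the pairs would be regenerated after each optimization pass, in line with the recomputation of distances described above.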

As an improvement, the method for labeling the unlabeled samples using K mutual nearest neighbors in S140 is as follows:

As shown in fig. 1, P1-P4 are four positive samples of the picture to be retrieved, but they do not occupy the first four positions among its nearest neighbors, so a large error would be introduced if the K nearest neighbor result were used directly. However, it is worth noting that the picture to be retrieved and each of the four positive samples are among each other's K nearest neighbors; this relation is referred to as K mutual nearest neighbors. Labeling unlabeled data in this manner reduces the errors introduced to some extent.

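The K nearest neighbors $N(x, k)$ of a sample x are defined as follows:

$N(x, k) = \{x_1, x_2, \ldots, x_k\}, \quad |N(x, k)| = k$ (5);

The K mutual nearest neighbors $R(x, k)$ are then defined as follows:

$R(x, k) = \{x_i \mid (x_i \in N(x, k)) \wedge (x \in N(x_i, k))\}$ (6).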
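Definitions (5) and (6) translate directly into code. The sketch below is one possible reading in NumPy, with samples stored as columns and neighbor sets returned as sets of column indices; the function names `knn_sets` and `mutual_knn` are illustrative.

```python
import numpy as np

def knn_sets(Z, k):
    """N(x, k) for every column of Z, as a list of sets of column indices."""
    Z = np.asarray(Z, dtype=float)
    sq = np.sum(Z ** 2, axis=0)
    D = sq[:, None] + sq[None, :] - 2.0 * (Z.T @ Z)    # squared distances
    np.fill_diagonal(D, np.inf)                         # exclude the sample itself
    order = np.argsort(D, axis=1, kind="stable")[:, :k]
    return [set(int(j) for j in row) for row in order]

def mutual_knn(Z, k):
    """R(x, k): neighbours x_i of x for which x is also among x_i's k nearest."""
    N = knn_sets(Z, k)
    return [{j for j in N[i] if i in N[j]} for i in range(len(N))]
```

An isolated sample still has k nearest neighbors under (5), but its mutual set under (6) can be empty, which is how the stricter relation filters out unreliable automatic labels.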

S200: projecting a new sample into the new subspace using the projection matrix $U \in \mathbb{R}^{d \times c}$ to obtain a predicted sample sequence, in which the samples of the training sample set are arranged in ascending order of their Euclidean distance to the new sample. The sample ranked first in the predicted sequence is the one most likely to show the same person as the new sample.
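Step S200 amounts to a projection followed by a distance sort. A minimal NumPy sketch follows, assuming samples are stored as columns; the name `rank_gallery` is illustrative.

```python
import numpy as np

def rank_gallery(U, x_new, X_train):
    """Return training-sample indices in ascending order of distance, plus the distances.

    U       : (d, c) learned projection matrix
    x_new   : (d,) feature vector of the new sample
    X_train : (d, N) training samples, one per column
    """
    z = U.T @ x_new                                  # (c,) projected new sample
    Z = U.T @ X_train                                # (c, N) projected training samples
    d = np.linalg.norm(Z - z[:, None], axis=0)       # Euclidean distance to each sample
    return np.argsort(d, kind="stable"), d

U = np.eye(2)
X_train = np.array([[3.0, 1.0, 0.0],
                    [0.0, 0.0, 2.0]])
order, dist = rank_gallery(U, np.zeros(2), X_train)
```

The first index in `order` identifies the training sample most likely to show the same person as the new sample.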

Experiment and analysis:

Feature selection: in order to quickly verify the validity of the proposed method, the LOMO and GOG features commonly used in pedestrian re-identification tasks are used.

Parameter setting: the algorithm is implemented using the Theano framework. The minimum interval (margin) is set to 0.5, the balance coefficients α and λ are set to 0.005 and 0.0001, respectively, the subspace dimension c is set to 512, and the batch size, learning rate and k are set to 32, 1 and 10, respectively.
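For anyone reproducing the setup, the reported values can be collected in a single configuration object. The dictionary below only restates the numbers given above; the key names are illustrative and do not come from the original implementation.

```python
# Hyperparameters as reported in the text; the key names are hypothetical.
config = {
    "margin": 0.5,          # minimum interval of the contrastive loss
    "alpha": 0.005,         # balance coefficient alpha
    "lambda": 0.0001,       # balance coefficient lambda
    "subspace_dim": 512,    # dimension c of the projected subspace
    "batch_size": 32,
    "learning_rate": 1.0,
    "k": 10,                # neighbourhood size
}
```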

VIPeR database test results and analysis

The VIPeR database is one of the most popular databases for pedestrian re-identification tasks. It contains 1264 images of 632 pedestrians, acquired by two cameras under different illumination conditions with a viewpoint difference of about 90°. The training set consists of 316 pedestrians and the test set consists of the remaining 316 pedestrians; experiments are carried out under both semi-supervised and fully supervised settings.

Semi-supervised experiments: for the semi-supervised setting, 1/3 of the pedestrians' pictures in the training set are randomly selected and their labels removed to serve as unlabeled samples, and the remaining 2/3 of the pedestrians' pictures serve as labeled samples. The experimental results are shown in Table 4.1. Compared with SSCDL and DLLAP, the performance of the method is greatly improved; in particular, the Rank-1 recognition rate reaches 47.5% after the LOMO and GOG features are combined.
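The split described above can be sketched as follows. The text does not say whether the 1/3 is drawn per identity or per image, nor how the fraction is rounded, so this sketch assumes a split by pedestrian identity with rounding down; the function name and the fixed seed are illustrative.

```python
import random

def split_identities(ids, unlabeled_fraction=1 / 3, seed=0):
    """Randomly split pedestrian ids into (labeled, unlabeled) lists."""
    rng = random.Random(seed)          # fixed seed only for reproducibility
    ids = list(ids)
    rng.shuffle(ids)
    n_unlabeled = int(len(ids) * unlabeled_fraction)   # rounds down (assumption)
    return ids[n_unlabeled:], ids[:n_unlabeled]

# 316 training identities, as in the VIPeR training set described above.
labeled, unlabeled = split_identities(range(316))
```

Under these assumptions the 316 training identities divide into 211 labeled and 105 unlabeled pedestrians.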

Table 4.1 Comparison of recognition rates (%) of semi-supervised learning methods on the VIPeR database

Method	Rank-1	Rank-5	Rank-10	Rank-20
SSCDL	25.6	53.7	68.2	83.6
DLLAP	32.5	61.8	74.3	84.1
LOMO+Our	34.2	65.2	76.4	85.4
GOG+Our	42.4	73.4	83.9	91.0
LOMO+GOG+Our	47.5	78.3	86.9	92.1
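The Rank-k figures in such tables are usually obtained by checking, for every query, whether its true match appears among the k closest gallery samples. The sketch below shows that computation under the assumption that every query identity occurs in the gallery, as in the single-shot VIPeR protocol; the evaluation code of the original experiments is not given, and the function name is illustrative.

```python
import numpy as np

def rank_k_rate(D, query_ids, gallery_ids, ks=(1, 5, 10, 20)):
    """Percentage of queries whose true match appears within the top k.

    D : (Q, G) distances between projected queries and gallery samples.
    Assumes each query identity is present in the gallery.
    """
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    order = np.argsort(D, axis=1, kind="stable")        # nearest gallery sample first
    hits = gallery_ids[order] == query_ids[:, None]     # (Q, G) booleans
    first_hit = hits.argmax(axis=1)                     # 0-based rank of the first correct match
    return {k: 100.0 * float(np.mean(first_hit < k)) for k in ks}

D = np.array([[0.1, 0.5, 0.9],
              [0.8, 0.2, 0.3]])
rates = rank_k_rate(D, query_ids=[0, 2], gallery_ids=[0, 1, 2], ks=(1, 2))
```

In this toy case the first query is matched at rank 1 and the second at rank 2, so the Rank-1 rate is 50% and the Rank-2 rate is 100%.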

Fully supervised experiment: the proposed method is also evaluated under the fully supervised setting, i.e., using the labels of all training samples. The experimental results are shown in Table 4.2. Compared with DLLAP and L1Graph, the proposed method shows a larger improvement when GOG features are used and when LOMO and GOG features are combined. Compared with the fully supervised setting, the semi-supervised setting with combined LOMO and GOG features achieves a Rank-1 recognition rate of 47.5% using only 2/3 of the training sample labels, only 3 percentage points below the 50.5% of the fully supervised case, which demonstrates the effectiveness of the proposed method.

Table 4.2 Comparison of recognition rates (%) under the fully supervised setting on the VIPeR database

Method	Rank-1	Rank-5	Rank-10	Rank-20
DLLAP [41]	38.5	70.8	78.5	86.1
L1Graph [42]	41.5	-	-	-
LOMO+Our	36.1	68.2	79.6	88.5
GOG+Our	48.6	77.1	87.3	92.9
LOMO+GOG+Our	50.5	79.6	88.8	94.3

The method uses a contrastive loss function to make full use of the label information of the labeled samples, and uses the K mutual nearest neighbors method instead of the K nearest neighbors method to label the unlabeled samples. The experimental results on the public pedestrian re-identification dataset VIPeR confirm the effectiveness of the method.

Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered by the scope of the claims of the present invention.

Claims

1. The pedestrian re-identification method based on semi-supervised learning is characterized by comprising the following steps of:

S100: learning a projection matrix $U \in \mathbb{R}^{d \times c}$ that projects the original d-dimensional feature space into a c-dimensional subspace such that $U^T X \in \mathbb{R}^{c \times N}$ satisfies, in the new subspace: the Euclidean distance between pairs of samples from the same pedestrian is smaller, and the Euclidean distance between pairs of samples from different pedestrians is larger; samples from the same pedestrian are defined as homogeneous samples, and samples from different pedestrians are defined as heterogeneous samples;

The method for learning the projection matrix $U \in \mathbb{R}^{d \times c}$ in S100 specifically comprises the following steps:

Let $X = [X_L, X_U] \in \mathbb{R}^{d \times N}$ denote all training samples, where N is the number of all pictures contained in the training set, d is the length of the feature vector, $X_L \in \mathbb{R}^{d \times N_L}$ represents the $N_L$ labeled samples, and $X_U \in \mathbb{R}^{d \times N_U}$ represents the $N_U$ unlabeled samples;

S120, establishing an objective function as follows:

wherein $L(U)$ is a regression function, $\Omega(U)$ is a regularization constraint, and $\alpha, \lambda > 0$ are balance coefficients;

the labeled sample loss function of S130 is:

Wherein:

otherwise $W_{ij} = 0$ (8);

$\Omega(U) = \|U\|_{2,1}$ (4).

2. The pedestrian re-identification method based on semi-supervised learning of claim 1, wherein the $N_P$ sample pairs in S130 are drawn with a sampling strategy that maximizes the top-k recognition rate, i.e., for each image, all samples among its k nearest neighbors are sampled.

3. The pedestrian re-identification method based on semi-supervised learning of claim 1, wherein the method for labeling the unlabeled samples using K mutual nearest neighbors in S140 is as follows:

the K nearest neighbors $N(x, k)$ of a sample x are defined as follows:

$N(x, k) = \{x_1, x_2, \ldots, x_k\}, \quad |N(x, k)| = k$ (5);

the K mutual nearest neighbors $R(x, k)$ are then defined as follows:

$R(x, k) = \{x_i \mid (x_i \in N(x, k)) \wedge (x \in N(x_i, k))\}$ (6).