CN111126360A

CN111126360A - Cross-domain pedestrian re-identification method based on unsupervised combined multi-loss model

Info

Publication number: CN111126360A
Application number: CN202010143811.2A
Authority: CN
Inventors: 田玉敏; 杨芸; 吴自力; 王笛; 潘蓉
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2019-11-15
Filing date: 2020-03-04
Publication date: 2020-05-08
Anticipated expiration: 2040-03-04
Also published as: CN111126360B

Abstract

The invention discloses a cross-domain pedestrian re-identification method based on an unsupervised combined multi-loss model, and mainly solves the problem that the existing unsupervised cross-domain pedestrian re-identification method is low in identification rate. The scheme is that 1) a data set is obtained and divided into a training set and a testing set; 2) carrying out various preprocessing and expansion on the training set; 3) selecting a residual error network as a reference network model, initializing network parameters, and adjusting a network structure; 4) constructing a target domain loss function; 5) fusing a target domain loss function, a source domain loss function and a triple loss function to obtain a total loss function; 6) training the residual error network by using the total loss function to obtain a trained network model; 7) and inputting the test set into the trained network model, and outputting a recognition result. The invention improves the recognition rate of unsupervised cross-domain pedestrian re-recognition, effectively avoids the occurrence of over-fitting condition, and can be used for the intelligent security of suspect target searching and cross-camera tracking of pedestrians.

Description

Cross-domain pedestrian re-identification method based on unsupervised combined multi-loss model

Technical Field

The invention belongs to the field of computer vision and deep learning, and particularly relates to a cross-domain pedestrian re-identification method which can be used for the intelligent security field of suspect target searching and pedestrian cross-camera tracking.

Background

With the construction of digital cities, surveillance videos are more and more widely applied to various places where people live, the massive surveillance videos play an important role in public safety fields such as city public security and crime tracking, and how to better process the data is a great challenge in the future.

Pedestrian re-recognition, also known as pedestrian re-recognition, is a technique that uses computer vision techniques to determine whether a particular pedestrian is present in an image or video sequence, and is widely recognized as a sub-problem in image retrieval. Given a monitored pedestrian image, the pedestrian image is retrieved across the device.

In real life, under the influence of factors such as the visual angle of a camera, the posture of a pedestrian, shielding and illumination, images of the same person under different cameras are greatly different, and images of different pedestrians are likely to be very similar. This makes pedestrian re-identification a very challenging hot topic in the field of computer vision. The traditional pedestrian re-identification method is mainly researched from two aspects, namely, a pedestrian feature representation method is researched, and robust pedestrian features are designed to enable the pedestrian feature representation method to have invariance to factors such as visual angle change of a camera, pedestrian posture change, illumination change and background interference. And secondly, measurement learning, namely, the distance between different images of the same pedestrian is smaller and the distance between different pedestrians is larger by learning a distance measurement function.

In recent years, deep learning is rapidly developed, more and more researchers use deep learning technology to research pedestrian re-identification, and due to the good performance of deep learning in image high-level feature extraction and the appearance of large-scale data sets, the performance of the pedestrian re-identification method based on deep learning on each data set is better than that of the traditional pedestrian re-identification method. The existing pedestrian re-identification method based on deep learning comprises supervised learning and unsupervised learning, in the supervised learning, all data sets used for training need to be provided with labels, however, the labeling of data needs to consume a large amount of manpower and material resources, in real life, massive data are not provided with labels, and in the case of huge data, the manual labeling of the data is almost impossible. The pedestrian re-identification model learned through supervised learning has very poor performance on a label-free test set, the generalization capability of the model is very weak, and the expandability and the practicability are lacked in practical application. However, the pure utilization of unsupervised learning has poor performance of the learned network model due to no label of data, and cannot be applied in actual situations. By combining the two methods, how to obtain a better recognition result on a network model trained by a source domain with a label in a target domain without the label is a hotspot of pedestrian re-recognition research.

Hehe Fan, Liang Zheng et al, published in "Unsupervised person re-identification," Clustering and fining ". The thesis learns a deep network model as an initial pedestrian feature extractor using a labeled source domain data set, and then improves the network model on a target domain data set through unsupervised clustering. But this method cannot fully utilize the data information of the tagged source domain.

The paper Image-Image domain adaptation with predicted self-similarity and domain-similarity for person re-identification published by Wei jian Deng et al mainly studies how to reduce the difference of two cross-domain datasets at the pixel level. Although the method has good identification effect, the pedestrian re-identification error rate is high under the crossed cameras due to the fact that the intra-domain data characteristic distribution of the target domain is ignored.

Disclosure of Invention

The invention aims to provide a cross-domain pedestrian re-identification method based on an unsupervised joint multi-loss model, aiming at solving the problems in the prior art, so that the error rate of pedestrian re-identification on a label-free target domain data set is reduced, and the identification precision is improved.

In order to achieve the purpose, the technical scheme of the invention comprises the following steps:

1) acquiring a target domain data set without a label and a source domain data set with a label from an open website, randomly selecting a part of pictures in the target domain data set as a test set, dividing the test set into a query set and a candidate set, and taking the rest pictures in the target domain data set and the pictures in the source domain data set as training sets;

2) sequentially turning over, cutting and randomly erasing the pictures in the training set, preprocessing the data, and expanding the preprocessed pictures in the training set through cross-camera style migration; unifying the pictures of the training set to the same size;

3) selecting a residual error network as a reference network model, initializing network parameters, and adjusting the reference network structure;

4) constructing a target domain loss function:

4a) measuring the characteristics of each sample picture in the target domain data set by adopting cosine distance, sorting the distance from large to small, selecting the first k pictures as sample neighbors, and giving weight w to the neighbors of each picture_i,j：

Wherein i is the sample number, j is the sample neighbor number, k is the neighbor number, x_t,iIs a target domain sampleThis picture, M (x)_t,iK) is a set of numbers of k neighbors of the sample picture;

4b) according to the weight w_i,jConstructing a target domain loss function L_tgt：

Wherein the content of the first and second substances,

is a randomly sampled picture in the training set, n_tIs the number of pictures of the batch training target domain,

is a picture

Probability of belonging to neighbor j;

5) fusing the target domain loss function with the existing source domain loss function and the triple loss function to obtain a total loss function L:

L＝λ₁L_src+λ₂L_tgt+λ₃L_T

wherein L is_srcAs a source domain loss function, L_TIs a triplet loss function, λ₁Is a weighted value, λ, of the proportion of the source domain loss function₂Is the proportional weighted value of the loss function of the target domain, lambda₃The weight value of the proportion of the triple loss function is calculated;

6) inputting the training set picture into a residual error network, training the residual error network by using a total loss function L and optimizing a training process to obtain a trained residual error network model;

7) inputting the pictures in the query set and the candidate set into a trained residual error network model for feature extraction, respectively calculating Euclidean distances between the features of the target pictures in the query set and the features of the pictures in the candidate set, sequencing the calculated Euclidean distances, and selecting the picture closest to the target in the query set in the candidate set as a recognition result.

The invention has the following advantages:

firstly, the invention effectively avoids the occurrence of the over-fitting condition and improves the generalization capability of the model because the data is subjected to various pretreatments of overturning, cutting and random erasing;

secondly, the data set is subjected to cross-camera style migration, so that the error rate of pedestrian re-identification under the view angle of the crossed cameras is reduced while data is expanded;

thirdly, the method greatly excavates the characteristic distribution characteristics of the label-free target domain data set due to the construction of the target domain loss function, fuses the source domain loss function, the target domain loss function and the triple loss function, and obviously improves the recognition rate of the unsupervised cross-domain pedestrian re-recognition.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention;

FIG. 2 is a diagram of a residual network architecture in accordance with the present invention;

fig. 3 is a schematic diagram of a residual error network training process in the present invention.

Detailed Description

The invention is described in detail below with reference to the figures and specific embodiments. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments that can be obtained by a person skilled in the art based on the embodiments of the present invention without any inventive step shall fall within the scope of the present invention.

Referring to fig. 1, the specific implementation of this example is as follows:

the method comprises the steps of firstly, acquiring a target domain data set and a source domain data set, and organizing the acquired data sets into a training set and a testing set.

Downloading a Market-1501 data set and a DukeMTMC-reiD data set from a public website, taking the Market-1501 data set as a target domain data set and taking the DukeMTMC-reiD data set as a source domain data set;

using DukeMTMC-reiD data set as the first training set R₁；

The Market-1501 data set is divided into 6: 4 into a second training set R₂And the test set T is further processed by the following steps of: 2 into a candidate set G and a query set Q;

the first training set R₁And a second training set R₂Forming a total training set R.

Step two, preprocessing the pictures in the training set R and expanding the second training set R₂And unifying the picture size.

2.1) carrying out the following preprocessing on the training set R:

randomly extracting a plurality of pictures on the training set R for turning, including horizontal turning and vertical turning;

randomly extracting a plurality of pictures on the training set R for cutting, wherein the pictures are 2/3 of the length and the width of the original picture;

randomly extracting a plurality of pictures on the training set R to erase random areas in random sizes;

2.2) using the cyclic generation to generate a countermeasure network to carry out cross-camera style migration, and using a second training set R₂Pictures shot by different cameras are input into the cyclic generation countermeasure network to obtain pictures with converted camera visual angles, the converted pictures are stored in a folder, and the pictures in the folder and the pictures before conversion are added into a second training set R₂To the second training set R₂The expansion is performed and all pictures are unified to 256 x 128 in size.

Referring to fig. 2, the structure of the residual error network selected in this example is described in further detail:

and step three, selecting a reference network model, initializing network parameters and adjusting the reference network structure.

3.1) selecting a residual error network ResNet-50 as a reference network model, and initializing the residual error network ResNet-50 by using model parameters pre-trained on ImageNet by utilizing transfer learning;

the residual error network ResNet-50 comprises five convolution layers, five pooling layers and a full connection layer, and the structural relationship is as follows in sequence: the first convolution layer, the first normalization processing layer, the first pooling layer, the second convolution layer, the second normalization processing layer, the second pooling layer, the third convolution layer, the third normalization processing layer, the third pooling layer, the fourth convolution layer, the fourth normalization processing layer, the fourth pooling layer, the fifth convolution layer, the fifth normalization processing layer, the fifth pooling layer and the 1000-dimensional full-connection layer;

3.2) adjusting the structure of ResNet-50:

3a) all layers before the fifth pooling layer of ResNet-50 are reserved, the last 1000-dimensional full-link layer is removed, and a 4096-dimensional output full-link layer is added;

3b) sequentially adding a normalization processing layer, a linear rectification function layer, a random discarding layer and two components after the 4096-dimensional output full connection layer, wherein the first component is a classification module of source domain data and comprises an M-dimensional full connection layer and an activation function layer; the second component is a sample memory module of the target domain, which is a key value pair structure with a key slot for storing picture characteristics and a value slot for storing the input number of the picture, and the key has uniqueness.

Referring to fig. 3, the training process of the residual error network in this example is described in further detail:

and step four, constructing a target domain loss function.

Based on the following three considerations, the objective domain loss function needs to be constructed:

regarding each sample of the target domain as an independent class, wherein each sample belongs to the class of the sample and is far away from other samples as far as possible;

the image after the cross-camera style migration and the image before the migration represent the same identity of the pedestrian, and the distance between the two images needs to be shortened;

for each sample picture, a sample with the same identity as the pedestrian exists in the neighbor of the sample picture, and the distance between the sample picture and the neighbor with the same identity is required to be shortened;

the objective loss function is constructed as follows:

4a) measuring the characteristics of each sample picture in the target domain data set by adopting cosine distanceThe cosine distance is the product of multiplying two eigenvectors and dividing the product by the modulus of the two eigenvectors, the larger the cosine distance is, the smaller the included angle between the two eigenvectors is, the closer the two eigenvectors are, otherwise, the farther the two eigenvectors are; according to the sorting of the distances from big to small, the first k pictures are selected as sample neighbors, and the neighbors of each picture are given a weight w_i,j：

Wherein i is the sample number, j is the sample neighbor number, k is the neighbor number, x_t,iIs a target domain sample picture, M (x)_t,iK) is a sample picture x_t,iThe number of k neighbors, k being 6 in this example;

4b) according to the weight w_i,jTo obtain a target domain loss function L_tgt：

Wherein the content of the first and second substances,

is a picture

Probability of belonging to neighbor j;

and step five, fusing the target domain loss function with the existing source domain loss function and the triple loss function.

5.1) Source Domain loss function:

the source domain loss function is a cross entropy loss function, and the formula is as follows:

in the formula, i is the sample number, m is the number of source domain pictures in the training batch, and x_s,iFor source domain sample pictures, y_s,iIs x_s,iLabel information carried by the specimen, p (y)_s,i|x_s,i) Is x_s,iBelong to y_s,iThe probability of (d);

5.2) triple loss function:

because the source domain data set and the target domain data set do not have the same pedestrian identity, a picture before the target domain data set is subjected to style migration is randomly selected as an anchor sample, the picture after the picture is subjected to style migration is a positive sample, the picture of the source domain data set is taken as a negative sample, and the anchor sample, the positive sample and the negative sample form a three-tuple object.

Constructing a triple loss function by using the triple objects, wherein the formula is as follows:

wherein D is an edge parameter representing the minimum distance between the anchor point and the positive sample and the distance between the anchor point and the negative sample, D (.) is the Euclidean distance, and x is_tFor the picture before the target domain data set format migration,

for pictures after style migration, x_sThe source domain picture is taken, X is the size of the batch training picture set, and X is taken as 128 in this example;

5.3) fusing the target domain loss function, the source domain loss function and the triple loss function to obtain a total loss function L:

L＝λ₁L_src+λ₂L_tgt+λ₃L_T

wherein L is_srcAs a source domain loss function, L_TIs a triplet loss function, λ₁Is a weighted value, λ, of the proportion of the source domain loss function₂Is the proportional weighted value of the loss function of the target domain, lambda₃For the proportion weighted value of the triple loss function, lambda is taken in the example₁＝0.5，λ₂＝0.3，λ₃＝0.2。

And step six, inputting the training set R into the residual error network ResNet-50, and training the residual error network ResNet-50 to obtain a trained residual error network model.

The training process is as follows:

6a) all the features stored in the sample memory are initialized to 0, and the sample memory is updated by the following formula:

K[i]←αK[i]+(1-α)f(x_t,i)

wherein α is a hyperparameter, α epsilon [0,1 ∈ ]]Controlling the update frequency, K [ i ]]For sample picture x_t,iFeatures stored in the sample memory, f (x)_t,i) Is x_t,iThe feature vector of the sample, which is α in this example, is 0.01 as the initial value;

6b) setting the total iteration times of training as 60 times, the number of batch pictures as 128, selecting the optimizer as a random gradient optimizer, setting the learning rate as 0.1 in the first 40 iterations, and dividing the learning rate by 10 in each iteration in the last 20 iterations;

6c) dividing a training set R into a plurality of small-batch data sets by taking 128 as a base number by adopting a small-batch iteration method, sequentially sending each small-batch data set into a residual error network ResNet-50 to extract a characteristic vector, inputting the obtained characteristic vector into a total loss function L to calculate loss, and calculating the gradient of the loss obtained by calculation;

6d) inputting the obtained gradient into a random gradient optimizer to update the weight and the threshold of the network;

6e) and repeating the steps 6c) to 6d), and stopping training when the total iteration times reach 60 times to obtain a trained model.

And step seven, testing the trained model and outputting a recognition result.

7.1) inputting the query set Q and the candidate set G into a trained residual error network model for feature extraction;

7.2) respectively calculating the Euclidean distance of the characteristics of the target picture in the query set Q and the characteristics of each picture in the candidate set G, wherein the formula is as follows:

wherein x is the target picture, y_iN is the number of pictures in the candidate set G for inquiring the pictures;

7.3) ordering the calculated Euclidean distances from small to large, and selecting the picture closest to the target picture in the query set Q from the candidate set G as an identification result.

The effects of the present invention can be further explained by the evaluation results of the following experiments.

1. Experimental methods and evaluation criteria

The method used in the experiment was the present invention and the existing HHL method, an unsupervised cross-domain pedestrian re-identification method, HHL for short, proposed in the paper "general aperson regenerative model-and homogeneous", published by Zhuin Zhong, et al 2018 at the ECCV (European Conference on Computer Vision) Conference.

Evaluation criteria, using three criteria mAP, rank-1, rank-5 in the area of pedestrian re-identification, wherein:

mAP is average precision mean value, which is to sum and average the average precision in the multi-classification task;

rank-1 is the probability that the result of the 1 st graph in the search results is correct;

rank-5 is the probability that the results of the first 5 pictures in the search results are correct;

2. contents and results of the experiments

The results of the tests on the Market-1501 data set using the method of the present invention and the prior HHL method are shown in table 1.

TABLE 1

Method of producing a composite material	mAP	rank-1	rank-5
				The invention	43.1	75.2	87.9
HHL	31.4	66.4	78.8

As can be seen from Table 1, compared with the prior art, the average accuracy mean mAP of the method is improved by 11.7, the probability rank-1 that the result of the 1 st graph in the search result is correct is improved by 8.8, and the probability rank-5 that the result of the first 5 graphs in the search result is correct is improved by 9.1.

Claims

1. A cross-domain pedestrian re-identification method based on an unsupervised combined multi-loss model is characterized by comprising the following steps:

4) constructing a target domain loss function:

Wherein i is the sample number, j is the sample neighbor number, k is the neighbor number, x_t,iIs a target domain sample picture, M (x)_t,iK) is a set of numbers of k neighbors of the sample picture;

Wherein the content of the first and second substances,

is a picture

Probability of belonging to neighbor j;

L＝λ₁L_src+λ₂L_tgt+λ₃L_T

wherein L is_srcAs a source domain loss function, L_TIs a triplet loss function, λ₁Is a weighted value, λ, of the proportion of the source domain loss function₂Occupied for target domain loss functionProportional weight value, λ₃The weight value of the proportion of the triple loss function is calculated;

2. The method of claim 1, wherein 2) the cyclic generation countermeasure network is utilized to perform the cross-camera style migration, the pictures taken by different cameras in the target domain data set are input into the cyclic generation countermeasure network, the pictures with converted camera view angles are obtained, and the converted pictures and the pictures before conversion are added into the target domain data set together to expand the data set.

3. The method of claim 1, wherein in 3), migration learning is used, model parameters pre-trained on ImageNet are used for initializing the residual error network, and the structure of the residual error network is adjusted, so that the following is achieved:

3a) removing the last full connection layer of the residual error network and adding a 4096-dimensional output full connection layer;

3b) sequentially adding a normalization processing layer, a linear rectification function layer, a random discarding layer and two components after the 4096-dimensional output full connection layer, wherein the first component is a classification module of source domain data and comprises an M-dimensional full connection layer and an activation function layer; the second component is a sample memory module of the target domain, which is a key value pair structure with a key slot for storing picture characteristics and a value slot for storing the input number of the picture, the key having uniqueness.

4. The method of claim 1, wherein the cosine distance in 4) is a product of multiplying two eigenvectors and dividing the product by a modulus of the two eigenvectors, and wherein the larger the cosine distance, the smaller the included angle between the two eigenvectors, the closer the two eigenvectors are, and vice versa, the farther the distance is.

5. The method of claim 1, wherein the source domain loss function in 5) is a cross-entropy loss function, and the formula is as follows:

in the formula, i is the sample number, m is the number of source domain pictures in the training batch, and x_s,iFor source domain sample pictures, y_s,iIs x_s,iLabel information carried by the specimen, p (y)_s,i|x_s,i) Is x_s,iBelong to y_s,iThe probability of (c).

6. The method according to claim 1, wherein 5) a picture before the target domain data set style migration is randomly selected as an anchor sample, the picture after the picture style migration is a positive sample, the picture of the source domain data set is a negative sample, and a triplet object is formed by the anchor sample, the positive sample and the negative sample;

for pictures after style migration, x_sIs the source domain picture, X is the batch training picture set size.

7. The method of claim 1, wherein 6) the residual network is trained using the total loss function L, and the method is implemented as follows:

K[i]←αK[i]+(1-α)f(x_t,i)

wherein α is a hyperparameter, α epsilon [0,1 ∈ ]]Controlling the update frequency, K [ i ]]For sample picture x_t,iFeatures stored in the sample memory, f (x)_t,i) Is x_t,iA feature vector of the sample;

6b) setting the total iteration times, the learning rate and the number of batch pictures of training, and selecting an optimizer;

6c) dividing a data set into a plurality of small-batch data sets by using a small-batch iteration method based on the number of batch pictures, sequentially sending each small-batch data set into a residual error network to extract a characteristic vector, and inputting the obtained characteristic vector into a total loss function L to calculate loss;

6d) solving the gradient of the calculated loss, and inputting the obtained gradient into the selected optimizer to update the network parameters;

6e) and repeating the steps 6c) to 6d), and stopping training when the total iteration times are reached to obtain a trained model.

8. The method according to claim 1, wherein 7) the euclidean distances are calculated for the features of the target pictures in the query set and the picture features in the candidate set respectively by the following formula:

wherein Q is a query set, G is a candidate set, x is a target picture, and y_iFor querying pictures, n is the number of pictures.