CN112069940A

CN112069940A - Cross-domain pedestrian re-identification method based on staged feature learning

Info

Publication number: CN112069940A
Application number: CN202010854297.3A
Authority: CN
Inventors: 种衍文; 彭程威; 潘少明
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2020-08-24
Filing date: 2020-08-24
Publication date: 2020-12-11
Anticipated expiration: 2040-08-24
Also published as: CN112069940B

Abstract

The invention belongs to the technical field of pedestrian re-identification and discloses a cross-domain pedestrian re-identification method based on staged feature learning. The method comprises: obtaining labeled source domain samples from a labeled source domain data set, and inputting them into a pedestrian general similarity model for training to obtain an initial model; obtaining unlabeled target domain samples from an unlabeled target domain data set, and inputting them into the initial model for feature extraction to obtain pedestrian features; computing pseudo labels for the target domain samples from the pedestrian features using a mutual neighbor pseudo label assignment method; forming labeled target domain samples from the pseudo labels, and inputting them into the pedestrian general similarity model for training to obtain a deployment model; and inputting a sample to be queried into the deployment model for feature extraction to obtain the query pedestrian feature, then matching it and outputting a retrieval result. The invention solves the problem of low cross-domain pedestrian re-identification accuracy in the prior art and achieves high-accuracy cross-domain performance.

Description

Cross-domain pedestrian re-identification method based on staged feature learning

Technical Field

The invention relates to the technical field of pedestrian re-identification, in particular to a cross-domain pedestrian re-identification method based on staged feature learning.

Background

Pedestrian re-identification (Re-ID) is an important research direction in computer vision and pattern recognition. Its goal is to automatically retrieve a specific pedestrian from images or video sequences captured by cameras with non-overlapping views. It has become a hot research problem in recent years owing to the large-scale deployment of surveillance networks and growing public safety awareness.

Early pedestrian re-identification methods are supervised: model performance can be ensured only by supervised training on a large amount of labeled data collected from the application scene. When the trained model is deployed in another application scene, its accuracy drops sharply because of the difference in data distribution. Such methods therefore require a sufficient amount of labeled data to be collected and used for training in every deployment scenario, which greatly increases deployment cost.

With growing research interest and increasing real-world demand, cross-domain pedestrian re-identification has been proposed. The cross-domain setting requires data from two domains: source domain data, i.e., existing labeled data, which may come from a public academic data set and is cheap to acquire; and target domain data, i.e., unlabeled data collected from the application scenario. Unlike the earlier supervised methods, the target domain data does not need laborious labeling, so the deployment cost is greatly reduced.

At present, cross-domain pedestrian re-identification methods fall mainly into two types. The first type is based on style transfer: a generative adversarial network is used to reduce the difference in data distribution between the two domains, and a pedestrian re-identification model suited to the target domain is trained by transferring the source domain data to the style of the target domain. The second type is based on pseudo label training: an initial model is first trained on the source domain data, the initial model is then used to compute features of the unlabeled target domain samples, which are clustered and given pseudo labels, and finally the pseudo labels are used to train and fine-tune the initial model.

Each type has its own problems. In the first, the quality of the style transfer is difficult to control, and misleading samples are generated that corrupt the supervision information. In the second, the initial model is not specially designed, so its performance on the target domain is low and the pseudo labels obtained by clustering are unreliable. In both cases the result is that cross-domain pedestrian re-identification accuracy is low.

Disclosure of Invention

The embodiment of the application solves the problem of low cross-domain pedestrian re-identification accuracy in the prior art by providing the cross-domain pedestrian re-identification method based on staged feature learning.

The embodiment of the application provides a cross-domain pedestrian re-identification method based on staged feature learning, which comprises the following steps of:

step 1, obtaining labeled source domain samples based on a labeled source domain data set, inputting the labeled source domain samples into a pedestrian general similarity model for training to obtain a pedestrian general similarity model trained on the source domain data, and using the pedestrian general similarity model trained on the source domain data as an initial model;

step 2, obtaining unlabeled target domain samples based on an unlabeled target domain data set, inputting the unlabeled target domain samples into the initial model for feature extraction, and obtaining pedestrian features formed by concatenating pedestrian local features and pedestrian global features; computing pseudo labels for the target domain samples from the pedestrian features using a mutual neighbor pseudo label assignment method;

step 3, forming a labeled target domain sample based on the pseudo label of the target domain sample, inputting the labeled target domain sample into the pedestrian general similarity model for training to obtain a pedestrian general similarity model trained by using target domain data, wherein the pedestrian general similarity model trained by using the target domain data is used as a deployment model;

step 4, inputting the sample to be queried into the deployment model for feature extraction to obtain the query pedestrian feature, comparing the query pedestrian feature against the features in a pedestrian database, and matching and outputting a retrieval result.

Preferably, the pedestrian general similarity model uses ResNet50 as the basis of the style-independent backbone network: the part before layer4 is used to construct the style-independent backbone network, layer4 is used as the residual block, and instance normalization is added to the residual block of the style-independent backbone network.

Preferably, the pedestrian general similarity model comprises a style-independent backbone network, a first branch and a second branch, wherein the first branch and the second branch are respectively connected with the output end of the style-independent backbone network;

the first branch comprises an attention model, a residual block, a global maximum pooling layer and a bottleneck layer which are connected in sequence;

the second branch comprises a residual block, a global maximum pooling layer, a background elimination module and a bottleneck layer; the output end of the residual block is respectively connected with the input ends of the global maximum pooling layer and the background elimination module, the output end of the background elimination module is connected with the input end of the global maximum pooling layer, and the output end of the global maximum pooling layer is connected with the input end of the bottleneck layer;

the first branch is used for obtaining the local features of the pedestrians, and the second branch is used for obtaining the global features of the pedestrians.

Preferably, the attention model is constructed from a spatial transformer network, the bottleneck layer consists of a fully connected layer and batch normalization, and the background elimination module uses SCDA (Selective Convolutional Descriptor Aggregation) from the field of fine-grained image retrieval to eliminate the background.

Preferably, the pedestrian local feature and the pedestrian global feature are both constrained by a triplet loss, and both use a cross entropy loss as an optimization objective.

Preferably, the mutual neighbor pseudo label assignment method is implemented as follows:

computing the k nearest neighbors of each sample according to the Euclidean distance between sample features; if two samples a and b are each among the k nearest neighbors of the other, a and b satisfy the mutual-neighbor relation, and two samples satisfying this strongly constrained relation are considered to belong to the same pedestrian;

propagating the mutual-neighbor relation: when samples a and c satisfy the mutual-neighbor relation and samples a and b satisfy the mutual-neighbor relation, the three samples a, b and c are considered to belong to the same individual.

Preferably, in step 1, the labeled source domain samples are preprocessed before being input into the pedestrian general similarity model for training; the preprocessing comprises scaling the sample image, converting it to the tensor type specified by the PyTorch framework, and normalizing it;

in step 2, the unlabeled target domain samples are preprocessed before being input into the initial model for feature extraction; the preprocessing comprises scaling the sample image, converting it to the tensor type specified by the PyTorch framework, and normalizing it.

One or more technical solutions provided in the embodiments of the present application have at least the following technical effects or advantages:

In the embodiments of the application, to address the problem that pedestrian features can hardly achieve intra-domain discrimination and inter-domain generalization at the same time, cross-domain pedestrian re-identification is divided into two stages, domain-invariant feature learning and domain-specific feature learning, and each stage is specifically designed, so that high-accuracy cross-domain performance is achieved. In addition, to address the low reliability of target domain pseudo labels, the invention designs a mutual-neighbor-based pseudo label assignment method. It effectively solves the problem that existing clustering algorithms have difficulty grouping images of the same pedestrian whose appearances differ greatly, and thereby provides high-quality pseudo labels for domain-specific feature learning on the unlabeled samples of the target domain.

Drawings

FIG. 1 is an overall frame diagram of a cross-domain pedestrian re-identification method based on staged feature learning according to the present invention;

FIG. 2 is a schematic diagram of an improved residual block of the present invention;

FIG. 3 is an illustration of parameters in an attention model of the present invention;

FIG. 4 is an overall flowchart of a cross-domain pedestrian re-identification method based on staged feature learning according to the present invention;

FIG. 5 is a cross-domain pedestrian re-identification result obtained by the cross-domain pedestrian re-identification method based on staged feature learning provided by the invention.

Detailed Description

In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.

The invention provides a cross-domain pedestrian re-identification method based on staged feature learning, which mainly comprises the following steps of:

Step 1, obtaining labeled source domain samples based on a labeled source domain data set, inputting the labeled source domain samples into a pedestrian general similarity model for training to obtain a pedestrian general similarity model trained on the source domain data, and using the pedestrian general similarity model trained on the source domain data as an initial model.

Step 2, obtaining unlabeled target domain samples based on an unlabeled target domain data set, inputting the unlabeled target domain samples into the initial model for feature extraction, and obtaining pedestrian features formed by concatenating pedestrian local features and pedestrian global features; and computing pseudo labels for the target domain samples from the pedestrian features using a mutual neighbor pseudo label assignment method.

Step 3, forming labeled target domain samples based on the pseudo labels of the target domain samples, inputting the labeled target domain samples into the pedestrian general similarity model for training to obtain a pedestrian general similarity model trained on the target domain data, and using the pedestrian general similarity model trained on the target domain data as a deployment model.

The pedestrian general similarity model uses ResNet50 as the basis of the style-independent backbone network: the part before layer4 is used to construct the style-independent backbone network, layer4 is used as the residual block, and instance normalization is added to the residual block of the style-independent backbone network.

Referring to fig. 1, the pedestrian general similarity model includes a style-independent backbone network, a first branch, and a second branch, where the first branch and the second branch are respectively connected to an output end of the style-independent backbone network; the first branch comprises an attention model, a residual block, a global maximum pooling layer and a bottleneck layer which are connected in sequence; the second branch comprises a residual block, a global maximum pooling layer, a background elimination module and a bottleneck layer; the output end of the residual block is respectively connected with the input ends of the global maximum pooling layer and the background elimination module, the output end of the background elimination module is connected with the input end of the global maximum pooling layer, and the output end of the global maximum pooling layer is connected with the input end of the bottleneck layer; the first branch is used for obtaining the local features of the pedestrians, and the second branch is used for obtaining the global features of the pedestrians.

The attention model is constructed from a spatial transformer network, the bottleneck layer consists of a fully connected layer and batch normalization, and the background elimination module uses SCDA from the field of fine-grained image retrieval to eliminate the background.

The pedestrian local feature and the pedestrian global feature are both constrained by a triplet loss, and both use a cross entropy loss as an optimization objective.
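The two loss terms named above can be illustrated with a minimal pure-Python sketch. The margin value of 0.3, the equal weighting of the two terms, and the function names are assumptions made for illustration; the source does not specify them.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss: the anchor should be closer to the positive
    (same pedestrian) than to the negative (different pedestrian) by at
    least `margin`. The margin value is an assumption."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def cross_entropy(logits, label):
    """Softmax cross entropy for one sample, using the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def total_loss(anchor, positive, negative, logits, label, margin=0.3):
    """Unweighted sum of both terms (the weighting is an assumption)."""
    return triplet_loss(anchor, positive, negative, margin) + cross_entropy(logits, label)
```

In the method, such a loss would be evaluated once for the local feature and once for the global feature, with the classifier logits coming from the corresponding branch.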

The mutual neighbor pseudo label assignment method is specifically implemented as follows: the k nearest neighbors of each sample are computed according to the Euclidean distance between sample features; if two samples a and b are each among the k nearest neighbors of the other, a and b satisfy the mutual-neighbor relation, and two samples satisfying this strongly constrained relation are considered to belong to the same pedestrian. The mutual-neighbor relation is then propagated: if samples a and c satisfy the mutual-neighbor relation and samples a and b satisfy the mutual-neighbor relation, the three samples a, b and c are considered to belong to the same individual.

The present invention is further described below.

This embodiment provides a cross-domain pedestrian re-identification method based on staged feature learning. PyTorch is used as the framework for building the convolutional neural network, and a staged learning process of domain-invariant features and domain-specific features is constructed, so that a target pedestrian can be retrieved efficiently and automatically. Specifically, because pedestrian features can hardly achieve inter-domain generalization and intra-domain discrimination at the same time, cross-domain pedestrian feature learning is divided into two stages: domain-invariant feature learning and domain-specific feature learning. In the first stage, domain-invariant feature learning, highly general pedestrian features are learned from the source domain data through the design of the network structure, and the resulting model is used as the initial model. In the second stage, domain-specific feature learning, the invention proposes a novel pseudo label assignment method that exploits the similarity naturally existing among target domain samples, based on the general pedestrian features extracted by the initial model; the pseudo labels are then used as supervision information to train on the target domain samples and learn domain-specific features from the target domain.

The method mainly comprises two parts: a pedestrian general similarity model designed for domain-invariant feature learning, and a mutual neighbor pseudo label assignment method designed for domain-specific feature learning.

In order to ignore data-set-specific characteristics and learn highly general pedestrian features, the pedestrian general similarity model is designed as shown in FIG. 1. The scheme focuses on two main factors that affect model generalization: camera style differences and background information.

In domain-invariant feature learning, style-independent pedestrian features are extracted from the source domain samples by a style-independent backbone network, so that differences in camera style do not cause differences in the pedestrian features. The classic ResNet50 is chosen as the basis of the style-independent backbone: specifically, the part before layer4 is used to build the style-independent backbone, while layer4 is used as the residual block shown in FIG. 1. Because ResNet50 is not specifically designed to handle style, the invention adds Instance Normalization to the residual block of the style-independent backbone, as shown in FIG. 2. By computing statistics for each sample independently, Instance Normalization removes the style statistics shared within a sample and gives the network, to a certain extent, the ability to extract style-independent features.
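The per-sample statistics that Instance Normalization relies on can be shown with a small pure-Python sketch. It normalizes each channel of a single sample by that channel's own mean and variance. The nested-list layout, the `eps` value, and the omission of learnable scale and shift parameters are assumptions for illustration; where the operation sits inside the residual block is defined by FIG. 2, which is not reproduced here.

```python
import math


def instance_norm(feature_map, eps=1e-5):
    """Normalize ONE sample: `feature_map` is a list of channels, each a
    2-D list (H x W). Every channel is normalized with its own mean and
    variance, so no statistics are shared across samples."""
    normalized = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        normalized.append([[(v - mean) * scale for v in row] for row in channel])
    return normalized


# The same content seen under two hypothetical camera styles: the second
# map is the first with its contrast doubled and its brightness shifted.
dark = [[[1.0, 2.0], [3.0, 4.0]]]
bright = [[[2.0 * v + 10.0 for v in row] for row in dark[0]]]

norm_dark = instance_norm(dark)
norm_bright = instance_norm(bright)
```

After normalization the two maps are almost identical, which is the sense in which a per-sample brightness or contrast change is removed. In PyTorch the same operation is available as `torch.nn.InstanceNorm2d`.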

After the style-independent features are extracted, the network splits into an upper branch and a lower branch. In the upper branch, the network automatically locates the most discriminative pedestrian region through an attention model; the attention model is built from an STN (Spatial Transformer Network), and its specific parameters are shown in FIG. 3. The pedestrian local features are then obtained through the residual block, global max pooling and the bottleneck layer; the local features represent the features that are most discriminative for identifying a pedestrian. The bottleneck layer consists of a 1024-dimensional fully connected layer and batch normalization.

The lower branch bypasses the attention model and learns the global features of the pedestrian. The lower branch also behaves differently in training and in inference: at inference time, the pedestrian global features are obtained after background elimination. Background elimination uses SCDA (Selective Convolutional Descriptor Aggregation), a classic method from the field of fine-grained retrieval. Its basic principle is that the response of the features extracted by a trained model, averaged over the channels, is stronger in the region of the main object; by using the mean response intensity as a threshold, erroneous responses of individual channels in the background region can be filtered out, so that the influence of background information is ignored.

Both the local features and the global features are, on the one hand, directly constrained by a triplet loss and, on the other hand, passed through a classifier that outputs predicted categories, from which a cross entropy loss is computed against the labels. In this stage of domain-invariant feature learning, the real labels from the source domain are used as the labels required for supervision.
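The background-elimination principle described above can be sketched in a few lines of pure Python: average the responses over the channels, threshold that map at its own mean, and apply global max pooling only over the positions that survive. This is a simplified illustration, not the patented module: the nested-list layout and function names are assumptions, the toy values are invented, and the published SCDA method additionally keeps only the largest connected component of the mask, which is omitted here.

```python
def scda_mask(feature_map):
    """Return an H x W boolean mask that is True where the channel-averaged
    response exceeds the mean of the averaged response map."""
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    response = [
        [sum(feature_map[c][i][j] for c in range(channels)) / channels
         for j in range(width)]
        for i in range(height)
    ]
    threshold = sum(sum(row) for row in response) / (height * width)
    return [[response[i][j] > threshold for j in range(width)] for i in range(height)]


def masked_global_max_pool(feature_map, mask):
    """Global max pooling per channel, restricted to masked positions.
    Falls back to plain max pooling if the mask keeps nothing."""
    pooled = []
    for channel in feature_map:
        kept = [channel[i][j]
                for i in range(len(mask))
                for j in range(len(mask[0]))
                if mask[i][j]]
        pooled.append(max(kept) if kept else max(max(row) for row in channel))
    return pooled


# Toy 3-channel, 2 x 3 feature map. Channel 1 has a spurious strong
# response at position (0, 0), which lies in the "background".
features = [
    [[0.0, 0.9, 0.8], [0.0, 0.7, 0.1]],
    [[1.0, 0.8, 0.9], [0.1, 0.9, 0.0]],
    [[0.0, 0.7, 0.9], [0.0, 0.8, 0.1]],
]
mask = scda_mask(features)
pooled = masked_global_max_pool(features, mask)
```

Plain global max pooling on channel 1 would return the spurious background value 1.0; with the mask applied, the pooled value comes from the foreground positions instead.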

In domain-specific feature learning, pseudo labels are first computed for the unlabeled samples in the target domain, and the model is then retrained according to the above process using the pseudo labels as supervision information. To obtain the pseudo labels, the pedestrian general similarity model performs inference on the unlabeled target domain samples to obtain the local features and the global features of each pedestrian, and the two are directly concatenated to form the final pedestrian feature. After the features of all target domain samples have been extracted, the pseudo labels are computed using the mutual neighbor pseudo label assignment method provided by this scheme.

The mutual neighbor pseudo label assignment method operates as follows. First, the k nearest neighbors of each sample are obtained by computing the Euclidean distance between sample features; if two samples a and b are each among the k nearest neighbors of the other, a and b satisfy the mutual-neighbor relation. Samples a and b that satisfy this strongly constrained relation are considered to belong to the same pedestrian. The mutual-neighbor relation is then propagated: when a and c are mutual neighbors and a and b are mutual neighbors, a, b and c are considered to belong to the same individual.
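The assignment procedure above can be sketched in pure Python as a mutual k-nearest-neighbor test followed by propagation. The union-find structure used for the propagation, the function names, and the toy one-dimensional features are assumptions for illustration; the source specifies only the mutual-neighbor rule and its propagation.

```python
import math


def k_nearest(features, k):
    """For every sample, return the set of indices of its k nearest
    neighbours under Euclidean distance (the sample itself excluded)."""
    neighbours = []
    for i, fi in enumerate(features):
        order = sorted(
            (j for j in range(len(features)) if j != i),
            key=lambda j: math.dist(fi, features[j]),
        )
        neighbours.append(set(order[:k]))
    return neighbours


def assign_pseudo_labels(features, k):
    """Merge every pair of mutual k-nearest neighbours into one identity
    and propagate the relation transitively with a union-find structure."""
    n = len(features)
    neighbours = k_nearest(features, k)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a in range(n):
        for b in neighbours[a]:
            if a in neighbours[b]:  # a and b are neighbours of each other
                parent[find(a)] = find(b)

    # Number the resulting groups 0, 1, 2, ... in order of first appearance.
    label_of_root = {}
    return [label_of_root.setdefault(find(i), len(label_of_root)) for i in range(n)]


# Toy features: indices 0-4 are views of one pedestrian that drift
# gradually in feature space; indices 5-7 are a second pedestrian.
features = [[-0.5], [0.0], [1.0], [2.0], [2.5], [10.0], [10.4], [10.8]]
labels = assign_pseudo_labels(features, k=2)
```

With `k=2`, samples 1 and 3 are not among each other's nearest neighbours, yet both are mutual neighbours of sample 2, so propagation places all of samples 0 to 4 under one pseudo label.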

The advantage of this method is that it accounts for the visual differences caused by viewing a pedestrian from different angles. If a is a frontal image, b a side image and c a rear image of the same pedestrian, a conventional clustering algorithm has difficulty deciding that a and c, whose appearances differ greatly, belong to the same person. However, the frontal image a and the rear image c each bear a certain similarity to the side image b, so propagating the mutual-neighbor relation through b links a and c. The resulting pseudo labels are then used to supervise model training.

Through the above domain invariant feature learning and the domain specific feature learning, a final cross-domain pedestrian re-identification model is obtained. Deploying the model to a target domain scene, receiving pedestrian picture input, executing inference to obtain pedestrian features, then performing Euclidean distance comparison between the features in a pedestrian database, and outputting a retrieval result according to the ascending order of the distances, as shown in figure 5.

The cross-domain pedestrian re-identification process is further described below.

Using 384 × 128 × 3 images for training and testing, and referring to FIGS. 1 and 4, the main steps include:

(1) domain invariant feature learning

Images of the source domain data set (e.g., Market1501) are first scaled to 384 × 128 × 3, then converted to the tensor type specified by the PyTorch framework, and normalized. Training uses a batch size of 32 and runs for 64 epochs. The input samples pass through the training process of the pedestrian general similarity model in FIG. 1 to obtain global features and local features; the triplet loss and the cross entropy loss are computed for the global features and the local features according to the real labels provided by the source domain data, and the model parameters are then optimized by stochastic gradient descent. This completes domain-invariant feature learning and yields a model trained on the source domain data.

(2) Pseudo label assignment

Images of the target domain data (e.g., DukeMTMC) undergo the same data preprocessing as in step (1). Each sample is then input into the pedestrian general similarity model of FIG. 1 that has been trained on the source domain data, and the test process is executed to obtain pedestrian features formed by concatenating the local features and the global features. The pseudo labels of the target domain samples are then computed with the mutual neighbor pseudo label assignment method provided by the invention.

(3) Domain specific feature learning

Using the pseudo labels obtained in step (2) as supervision information, the target domain samples are input and trained according to the process in step (1). This completes the learning of domain-specific features and yields a model trained on the target domain data, which is used as the deployment model.

(4) Pedestrian retrieval

When a query arrives, the query sample undergoes the same data preprocessing as in step (1) and is input into the deployment model for inference to obtain its pedestrian features. These features are compared with the features in the pedestrian database, and the retrieval results are output in ascending order of Euclidean distance.
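The retrieval step above amounts to ranking database entries by Euclidean distance to the query feature. A minimal pure-Python sketch follows; the dictionary layout of the database, the image identifiers, the two-dimensional toy features, and the `top_k` parameter are hypothetical and used only for illustration.

```python
import math


def retrieve(query_feature, gallery, top_k=None):
    """Rank gallery entries by ascending Euclidean distance to the query.

    `gallery` maps an image identifier to its feature vector (in the method,
    the concatenated local and global features from the deployment model).
    Returns a list of (identifier, distance) pairs.
    """
    scored = [
        (image_id, math.dist(query_feature, feature))
        for image_id, feature in gallery.items()
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored if top_k is None else scored[:top_k]


# Hypothetical database with three stored pedestrian features.
gallery = {
    "img_a": [0.0, 1.0],
    "img_b": [3.0, 4.0],
    "img_c": [0.1, 0.9],
}
ranking = retrieve([0.0, 1.0], gallery)
```

The first entry of `ranking` is the closest match; a deployment would typically return only the first few entries, which is what `top_k` is for.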

To verify the performance of the invention, the cross-domain performance of the scheme was tested on the Market1501 and DukeMTMC-reID data sets, with the following results:

(1) Market1501 data set: Rank-1 was 82.2% and mAP was 58.6%.

(2) DukeMTMC-reID data set: Rank-1 was 75.7% and mAP was 58.0%.

The above data are superior to the cross-domain performance of existing schemes.

The cross-domain pedestrian re-identification method based on staged feature learning provided by the embodiment of the invention at least comprises the following technical effects:

(1) To address the problem that pedestrian features can hardly achieve intra-domain discrimination and inter-domain generalization at the same time, the invention divides cross-domain pedestrian re-identification into two stages, domain-invariant feature learning and domain-specific feature learning, and designs each stage specifically, thereby achieving high-accuracy cross-domain performance.

(2) To address the problem that pedestrian features are strongly affected by differences in camera style and background, a style-independent backbone network is constructed by introducing Instance Normalization, and the image background is eliminated by introducing SCDA (Selective Convolutional Descriptor Aggregation), so that pedestrian features with strong domain invariance are learned from the source domain samples; meanwhile, an attention mechanism is introduced, and a network architecture combining global features and local features is constructed to enhance the discriminative power of the features.

(3) To address the low reliability of target domain pseudo labels, the invention designs a mutual-neighbor-based pseudo label assignment method. It effectively solves the problem that existing clustering algorithms have difficulty grouping images of the same pedestrian whose appearances differ greatly, and thereby provides high-quality pseudo labels for domain-specific feature learning on the unlabeled samples of the target domain.

Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention has been described in detail with reference to examples, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims

1. A cross-domain pedestrian re-identification method based on staged feature learning is characterized by comprising the following steps of:

2. The method of claim 1, wherein the pedestrian universal similarity model selects ResNet50 as a component of a style-independent backbone network, selects a part before layer4 to construct the style-independent backbone network, and selects layer4 as a residual block, and adds instance normalization to the residual block of the style-independent backbone network.

3. The method for cross-domain pedestrian re-identification based on staged feature learning according to claim 1, wherein the pedestrian general similarity model comprises a style-independent backbone network, a first branch and a second branch, and the first branch and the second branch are respectively connected with an output end of the style-independent backbone network;

4. The method according to claim 3, wherein the attention model is constructed from a spatial transformer network, the bottleneck layer consists of a fully connected layer and batch normalization, and the background elimination module uses SCDA from the field of fine-grained image retrieval to eliminate the background.

5. The method of claim 3, wherein the pedestrian local features and the pedestrian global features are constrained by triplet loss and adopt cross entropy loss as an optimization function.

6. The method for cross-domain pedestrian re-identification based on staged feature learning as claimed in claim 1, wherein the mutual neighbor pseudo label assignment method is implemented as follows:

7. The method for cross-domain pedestrian re-identification based on staged feature learning according to claim 1, wherein in step 1, the labeled source domain samples are preprocessed before being input into the pedestrian general similarity model for training; the preprocessing comprises scaling the sample image, converting it to the tensor type specified by the PyTorch framework, and normalizing it;