CN112949608B - Pedestrian re-identification method based on twin semantic self-encoder and branch fusion - Google Patents
Pedestrian re-identification method based on twin semantic self-encoder and branch fusion
- Publication number
- CN112949608B (application CN202110410769.0A)
- Authority
- CN
- China
- Prior art keywords
- pedestrian
- identification
- network
- encoder
- semantic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/103—Static body considered as a whole, e.g. static pedestrian or occupant recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a pedestrian re-identification method based on a twin semantic self-encoder and branch fusion, which comprises the following steps: constructing a pedestrian re-identification data set and its twin data set, and training a convolutional self-encoder; taking the encoder network of the convolutional self-encoder as a feature extraction backbone network to extract twin feature vectors, training a twin semantic self-encoder, and constructing a semantic self-encoding branch network from the network weights of the encoding part of the twin semantic self-encoder; taking the encoder network of the convolutional self-encoder as the pedestrian re-identification backbone network, performing branch construction on it, and training to obtain a complete pedestrian re-identification network model; and inputting the pedestrian image to be identified and the set of pedestrian images to be queried into the trained model, extracting pedestrian re-identification features after multi-level feature fusion, and realizing pedestrian re-identification through similarity ranking. The invention can significantly improve the robustness of the pedestrian re-identification network model and the accuracy of pedestrian re-identification.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence and computer vision, and particularly relates to a pedestrian re-identification method based on twin semantic self-encoders and branch fusion.
Background
Pedestrian re-identification is an image recognition task that is widely applied in the field of computer vision and is particularly important for pedestrian surveillance. In addition, attribute recognition is also applicable to pedestrian surveillance and is generally used as an auxiliary task for pedestrian re-identification or pedestrian detection. The purpose of pedestrian re-identification is to retrieve a queried person from images captured by non-overlapping cameras; the purpose of attribute recognition is to predict the attributes present in an image, where the attributes describe detailed information about a person, such as gender, accessories, and the length and color of clothes.
In the conventional pedestrian re-identification method, a convolutional neural network such as ResNet is mostly adopted as a backbone network to extract features.
ZL2019105493455 discloses a pedestrian re-identification method using a twin generative adversarial network based on pose-guided pedestrian image generation. The method uses an interpretable convolutional neural network to extract features from input pedestrian images and applies an attention mechanism to build cross-channel representations of various occlusion patterns, thereby improving pedestrian detection performance.
ZL2020100926087 discloses a pedestrian re-identification method that fuses pedestrian attributes: predicted pedestrian attributes are fused with the feature vectors of the pedestrian identification branch, and the feature vectors of the two branches are fused by taking their product.
Disclosure of Invention
In order to solve the problems, the invention provides a pedestrian re-identification method based on a twin semantic self-encoder and branch fusion.
In order to achieve the purpose, the invention is realized by the following technical scheme:
the invention relates to a pedestrian re-identification method based on twin semantic self-encoder and branch fusion, which specifically comprises the following steps:
s1: collecting a pedestrian image, marking a pedestrian identity label and an attribute label, and constructing a pedestrian re-identification data set; then, respectively carrying out data enhancement processing on each image to obtain a twin data set of the data set;
s2: training a convolution self-encoder by utilizing the pedestrian re-identification data set in the S1;
s3: extracting feature vectors of all images in the pedestrian re-identification data set and the twin data set by utilizing an S2 convolution self-encoder network, wherein given one image and the image subjected to data enhancement, two feature vectors extracted by the convolution encoder can be called twin feature vectors;
s4: the twin feature vector extracted in the S3 is matched with corresponding attribute label information to pre-train a twin semantic self-encoder, network weight parameters from an input layer to a hidden layer are stored, a semantic self-encoding branch network is constructed by using the partial weight parameters, and the output of the branch network can be called as a semantic attribute feature vector;
s5: taking the encoder network of the convolution self-encoder in the S2 as a backbone network, and performing branch construction and multi-level feature fusion on the backbone network to construct a complete pedestrian re-identification network;
s6: inputting the S1 pedestrian re-recognition training data set into the pedestrian re-recognition network in S5, and performing iterative training on the network by using a self-constructed loss function to finally obtain a pedestrian re-recognition network model based on twin semantic self-encoder and branch fusion;
s7: respectively inputting the pedestrian image to be identified and the pedestrian image set to be inquired into a pedestrian re-identification network model in S6, and extracting pedestrian re-identification feature vectors of all the images;
s8: respectively calculating Euclidean similarity between the feature vectors of the pedestrian images to be identified and the feature vectors of all the pedestrian images to be inquired one by one, namely firstly solving the Euclidean distance between the feature vectors, and then carrying out normalization processing on the distance values to obtain the similarity between the feature vectors.
S9: and sequencing all the pedestrian images to be inquired from high to low according to the similarity scores, and obtaining a pedestrian re-identification result according to the similarity score sequencing, namely that the pedestrian in the pedestrian image to be identified is the same as the pedestrian in the pedestrian image to be inquired with the highest similarity score.
Wherein the branch construction specifically comprises:
s5-1: adding a semantic self-coding branch network of S4, and outputting as a semantic attribute feature vector;
s5-2: constructing a pedestrian attribute identification branch network and outputting the pedestrian attribute identification branch network as a pedestrian attribute feature vector;
s5-3: constructing a pedestrian re-identification branch network and outputting the pedestrian re-identification branch network as a final pedestrian re-identification feature vector;
s5-4: adding a feature fusion branch on a backbone network, inputting pedestrian feature vectors with different scales and sizes output by each convolution module of the backbone network, outputting the pedestrian feature vectors subjected to feature fusion, and then respectively inputting a semantic self-coding branch network, a pedestrian attribute identification branch network and a pedestrian re-identification branch network.
The multilevel feature fusion specifically comprises the following steps:
s5-5: performing multi-scale feature fusion on a backbone network, namely fusing middle-low-layer feature vectors with different scales extracted by different convolution modules of the backbone network;
s5-6: fusing the pedestrian attribute features and the semantic attribute feature vectors, wherein the obtained feature vectors are used for identifying the pedestrian attributes;
s5-7: and fusing the pedestrian attribute feature vector and the pedestrian re-identification feature vector on the pedestrian re-identification branch network, and using the obtained feature vector for final pedestrian re-identification.
The invention is further improved in that: the convolutional self-encoder in S2 is composed of an encoder and a decoder, the encoder is composed of 4 convolutional modules, the decoder is composed of 4 deconvolution modules and 1 additional convolutional layer, the convolutional modules include 1 convolutional layer and 1 maximum pooling layer, and the deconvolution module includes 1 upsampling layer and 1 convolutional layer.
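As an illustration of this architecture, the following is a minimal PyTorch sketch; the layer widths, kernel sizes and activation functions are assumptions, since the patent does not specify them:

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """One encoder module: 1 convolution layer followed by 1 max-pooling layer."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
        )

    def forward(self, x):
        return self.block(x)

class DeconvModule(nn.Module):
    """One decoder module: 1 up-sampling layer followed by 1 convolution layer."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class ConvAutoEncoder(nn.Module):
    """Encoder: 4 convolution modules; decoder: 4 deconvolution modules plus 1 additional conv layer."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            ConvModule(3, 64), ConvModule(64, 128),
            ConvModule(128, 256), ConvModule(256, 512),
        )
        self.decoder = nn.Sequential(
            DeconvModule(512, 256), DeconvModule(256, 128),
            DeconvModule(128, 64), DeconvModule(64, 32),
            nn.Conv2d(32, 3, kernel_size=3, padding=1),  # additional conv layer restoring 3 channels
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```

Training as in S2 would then minimize a reconstruction loss, e.g. `nn.MSELoss()(model(images), images)`, so that the encoder learns pedestrian features without identity labels.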
The invention is further improved in that: The S4 includes the following steps:
S4-1: training a twin semantic self-encoder: the semantic self-encoder has a three-layer structure comprising an input layer, a hidden layer and an output layer. The input layer holds the low-level visual features of an image; the hidden layer, also called the semantic attribute layer, contains high-level semantic features and is obtained by encoding the data of the input layer; the output layer holds the recovered data obtained by decoding the encoded data back according to the preceding encoding rule, and can therefore be regarded as a reconstruction of the low-level visual features of the input layer. The encoding and decoding processes each consist of multi-layer fully connected neural networks, and a mapping matrix performs the bidirectional mapping between the low-level visual features and the high-level semantic features, wherein the mapping matrix used in the decoding process from the hidden layer to the output layer is the transpose of the mapping matrix used in the encoding process from the input layer to the hidden layer.
The twin semantic autoencoder is a self-encoder structure obtained by inputting two twin feature vectors into two semantic autoencoders with the same structure respectively and training by using the same objective function, wherein the mapping matrixes finally obtained by the two semantic autoencoders are completely the same.
In order to make the input characteristic and the output characteristic of the twin semantic self-encoder as identical as possible, the objective function is:
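A plausible form of this objective, assuming the standard semantic auto-encoder formulation (reconstruction through the transposed mapping, with soft constraints tying the codes to S) applied to both twin inputs with a shared mapping matrix W, is:

$$\min_{W}\;\lVert X_1 - W^{\top}S\rVert_F^{2} + \lVert X_2 - W^{\top}S\rVert_F^{2} + \lambda_1\lVert WX_1 - S\rVert_F^{2} + \lambda_2\lVert WX_2 - S\rVert_F^{2}$$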
wherein X1 and X2 are the twin feature vectors, W is the mapping matrix, W^T is the transpose of W, S is the semantic attribute feature vector, and λ1 and λ2 are the weight coefficients of the constraints WX1 ≈ S and WX2 ≈ S, respectively. The main characteristic of this objective function is that an additional constraint built from the original data of the S vector is imposed on the mapping, so that the encoded data can be restored to the original data as faithfully as possible; furthermore, the two semantic self-encoders are trained jointly, so that the resulting mapping matrix converts visual features into semantic features more effectively, which improves the robustness of the model.
In the training stage of the twin semantic self-encoder, each pair of twin feature vectors is fed into the two input layers as the inputs X1 and X2 of the objective function; the corresponding attribute label information in the pedestrian re-identification data set is converted into word vectors, i.e. semantic attribute feature vectors, by the Word2Vec method and then fed into the hidden layer as the original data of the S vector in the objective function. The twin semantic self-encoder is trained in this way, the mapping matrix W is finally obtained, and the network weight parameters from the input layer to the hidden layer are saved for initializing the semantic self-encoding branch network.
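As an illustration of this conversion, a minimal sketch using gensim's Word2Vec is given below; the attribute vocabulary, corpus construction, vector dimensionality and the use of a mean vector per image are assumptions, since the patent only states that Word2Vec is used:

```python
from gensim.models import Word2Vec
import numpy as np

# Hypothetical corpus: each "sentence" is the list of attribute words annotated for one image.
attribute_corpus = [
    ["female", "backpack", "short-hair", "red-upper", "black-lower"],
    ["male", "handbag", "long-lower", "blue-upper"],
]

# Train word vectors on the attribute vocabulary (gensim 4.x API; hyper-parameters are assumptions).
w2v = Word2Vec(sentences=attribute_corpus, vector_size=64, window=5, min_count=1, workers=2)

def semantic_vector(attributes):
    """Semantic attribute feature vector S for one image, taken here as the mean of its word vectors."""
    return np.mean([w2v.wv[word] for word in attributes], axis=0)

s = semantic_vector(["female", "backpack", "short-hair"])  # original data for the S vector
```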
S4-2: constructing a semantic self-coding branch network: and (2) constructing a semantic self-coding branch network by using the network weight between the semantic self-coder and the hidden layer stored in the step (S4-1), namely, independently taking out the network in the coding process part as a feature extraction branch of the backbone network, and converting the medium-low layer feature vector extracted by the backbone network into a high-layer semantic attribute feature vector by using a mapping matrix W to realize the conversion of the pedestrian feature from a visual space to a semantic space.
The invention is further improved in that: the loss function of S6 is specifically:
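A plausible form of this loss, assuming the per-attribute losses are weighted and averaged and that λ scales the attribute branch against the re-identification branch (the symbol L_attr^(i) for the identification loss of the ith attribute is introduced here only for illustration), is:

$$L_{\text{total}} = L_{ID} + \lambda\cdot\frac{1}{M}\sum_{i=1}^{M} W_i\, L_{attr}^{(i)}$$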
wherein L_ID is the loss of the pedestrian re-identification branch; the identification losses of the individual pedestrian attributes are combined into the average total loss of the pedestrian attribute identification branch; M is the total number of pedestrian attribute categories; W_i is the weight of the identification loss of the ith pedestrian attribute, used to balance the contributions of the identification losses of the different attribute classes within the average total loss of the pedestrian attribute identification branch; and λ is used to balance the contributions of the pedestrian re-identification branch loss and the pedestrian attribute identification branch loss in the overall loss of the whole network.
The invention has the beneficial effects that:
(1) a pedestrian re-identification network model based on semantic self-encoding and on branch (i.e. multi-level) feature fusion assisted by attribute information is constructed; pedestrian features are extracted in a multi-branch, multi-level, multi-space and multi-scale manner, which can significantly improve the robustness of the pedestrian re-identification network model and the accuracy of pedestrian re-identification;
(2) the invention combines an unsupervised learning method, takes the encoder network obtained after pre-training the convolutional self-encoder as the feature extraction backbone network, and adopts a multi-scale feature fusion method to fuse the mid- and low-level feature vectors of different scales and sizes extracted by the different convolution modules, obtaining a feature vector that captures both local detail information and overall abstract information of the pedestrian image; this optimizes the design of the backbone network and enhances the feature discriminability and discriminative power of the pedestrian re-identification model;
(3) the method uses the semantic self-encoder to extract high-level semantic features, trains the semantic self-encoder with a twin network structure to obtain the semantic self-encoding branch network, and then combines this branch with the pedestrian attribute identification branch network and the pedestrian re-identification branch network for multi-branch feature fusion, so that the extracted features contain high-level semantic information and mid/low-level visual information at the same time and the semantic self-encoder has better transferability and generalization capability;
(4) according to the pedestrian re-identification method, the pedestrian attribute identification task and the pedestrian re-identification task are combined, a multi-task learning method is adopted, attribute information is used as assistance to train a pedestrian re-identification model, and the efficiency of pedestrian re-identification is remarkably improved.
The invention provides a twin semantic self-encoder structure, combines the twin semantic self-encoder structure with technologies such as a convolution self-encoder, multi-scale feature fusion, multi-task learning and the like, extracts pedestrian features in a multi-branch, multi-level, multi-space and multi-scale manner, and can obviously improve the robustness of a pedestrian re-identification network model and the accuracy of pedestrian re-identification.
Drawings
Fig. 1 is a flowchart of a pedestrian re-identification method according to the present invention.
Fig. 2 is a diagram of a network model architecture of the convolutional auto-encoder of the present invention.
FIG. 3 is a network model structure diagram of the twin semantic autoencoder of the present invention.
Fig. 4 is a network model structure diagram of the pedestrian re-identification method of the present invention.
Detailed Description
In the following description, for purposes of explanation, numerous implementation details are set forth in order to provide a thorough understanding of the embodiments of the present invention. It should be understood, however, that these implementation details should not be taken to limit the invention. That is, in some embodiments of the invention, such implementation details are not necessary.
The invention relates to a pedestrian re-identification method based on twin semantic self-encoder and branch fusion, which specifically comprises the following steps:
s1: the method includes the steps of collecting pedestrian images, marking pedestrian identity labels and Attribute labels, and constructing a pedestrian re-identification data set, wherein the images are marked with 23 kinds and 27 kinds of pedestrian attributes respectively by adopting a DukeMTMC-Attribute data set or a Market-1501-Attribute data set, and then each image is subjected to data enhancement processing respectively, such as mirror image transformation, rotation transformation, Gaussian blur and the like, so that a twin data set of the data set is obtained.
S2: the method comprises the steps of training a convolution self-encoder by utilizing a pedestrian re-identification data set in S1, wherein the convolution self-encoder consists of an encoder and a decoder, the encoder consists of 4 convolution modules, the decoder consists of 4 deconvolution modules and 1 additional convolution layer, each convolution module comprises 1 convolution layer and 1 maximum pooling layer, and each deconvolution module comprises 1 up-sampling layer and 1 convolution layer.
S3: and (3) extracting feature vectors of all images in the pedestrian re-identification data set and the twin data set by utilizing an S2 convolution self-encoder network, wherein given an image and the image subjected to data enhancement, two feature vectors extracted by the convolution encoder can be called twin feature vectors.
S4: the twin feature vector extracted in the S3 is matched with corresponding attribute label information to pre-train a twin semantic self-encoder, network weight parameters from an input layer to a hidden layer are stored, a semantic self-encoding branch network is constructed by using the partial weight parameters, and the output of the branch network can be called as a semantic attribute feature vector;
S4-1: training a twin semantic self-encoder: the semantic self-encoder has a three-layer structure comprising an input layer, a hidden layer and an output layer. The input layer holds the low-level visual features of an image; the hidden layer, also called the semantic attribute layer, contains high-level semantic features and is obtained by encoding the data of the input layer; the output layer holds the recovered data obtained by decoding the encoded data back according to the preceding encoding rule, and can therefore be regarded as a reconstruction of the low-level visual features of the input layer. The encoding and decoding processes each consist of multi-layer fully connected neural networks, and a mapping matrix performs the bidirectional mapping between the low-level visual features and the high-level semantic features, wherein the mapping matrix used in the decoding process from the hidden layer to the output layer is the transpose of the mapping matrix used in the encoding process from the input layer to the hidden layer.
The twin semantic autoencoder is a self-encoder structure obtained by inputting two twin feature vectors into two semantic autoencoders with the same structure respectively and training by using the same objective function, wherein the mapping matrixes finally obtained by the two semantic autoencoders are completely the same.
In order to make the input characteristic and the output characteristic of the twin semantic self-encoder as identical as possible, the objective function is:
wherein X1 and X2 are the twin feature vectors, W is the mapping matrix, W^T is the transpose of W, S is the semantic attribute feature vector, and λ1 and λ2 are the weight coefficients of the constraints WX1 ≈ S and WX2 ≈ S, respectively. The main characteristic of this objective function is that an additional constraint built from the original data of the S vector is imposed on the mapping, so that the encoded data can be restored to the original data as faithfully as possible; furthermore, the two semantic self-encoders are trained jointly, so that the resulting mapping matrix converts visual features into semantic features more effectively, which improves the robustness of the model.
In the training stage of the twin semantic self-encoder, each pair of twin feature vectors is fed into the two input layers as the inputs X1 and X2 of the objective function; the corresponding attribute label information in the pedestrian re-identification data set is converted into word vectors, i.e. semantic attribute feature vectors, by the Word2Vec method and then fed into the hidden layer as the original data of the S vector in the objective function. The twin semantic self-encoder is trained in this way, the mapping matrix W is finally obtained, and the network weight parameters from the input layer to the hidden layer are saved for initializing the semantic self-encoding branch network.
S4-2: constructing a semantic self-encoding branch network: the semantic self-encoding branch network is built from the network weights from the input layer to the hidden layer of the semantic self-encoder saved in S4-1; that is, the network of the encoding part is taken out on its own as a feature extraction branch of the backbone network, and the mapping matrix W converts the mid- and low-level feature vectors extracted by the backbone network into high-level semantic attribute feature vectors, realizing the conversion of pedestrian features from the visual space to the semantic space.
S5: taking the encoder network of the convolution self-encoder in the S2 as a backbone network, and performing branch construction and multi-level feature fusion on the backbone network to construct a complete pedestrian re-identification network;
the specific branch construction comprises the following steps:
s5-1: adding a semantic self-coding branch network of S4, and outputting as a semantic attribute feature vector;
s5-2: constructing a pedestrian attribute identification branch network and outputting the pedestrian attribute identification branch network as a pedestrian attribute feature vector;
s5-3: constructing a pedestrian re-identification branch network and outputting the pedestrian re-identification branch network as a final pedestrian re-identification feature vector;
s5-4: adding a feature fusion branch on the backbone network, inputting pedestrian feature vectors with different scales and sizes output by each convolution module of the backbone network, outputting the pedestrian feature vectors subjected to feature fusion, and then respectively inputting a semantic self-coding branch network, a pedestrian attribute identification branch network and a pedestrian re-identification branch network.
The specific multilevel feature fusion comprises the following steps:
s5-5: performing multi-scale feature fusion on a backbone network, namely fusing middle-low-layer feature vectors with different scales extracted by different convolution modules of the backbone network;
s5-6: fusing the pedestrian attribute features and the semantic attribute feature vectors, wherein the obtained feature vectors are used for identifying the pedestrian attributes;
s5-7: and fusing the pedestrian attribute feature vector and the pedestrian re-identification feature vector on the pedestrian re-identification branch network, and using the obtained feature vector for final pedestrian re-identification.
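Putting the branch construction (S5-1 to S5-4) and the fusion steps (S5-5 to S5-7) above together, one possible PyTorch sketch of the forward pass is given below; global average pooling and concatenation are assumed as the fusion operations, and the layer dimensions are placeholders, since the patent does not fix these details:

```python
import torch
import torch.nn as nn

class ReIDNetwork(nn.Module):
    """Backbone (4 conv modules) + fusion branch + semantic / attribute / re-identification branches."""
    def __init__(self, conv_modules, semantic_branch,
                 feat_dims=(64, 128, 256, 512), attr_dim=64, reid_dim=256):
        super().__init__()
        self.conv_modules = nn.ModuleList(conv_modules)    # the 4 encoder modules of the backbone
        self.gap = nn.AdaptiveAvgPool2d(1)                  # pools each scale to a vector
        fused_dim = sum(feat_dims)
        # semantic_branch: e.g. a linear map initialized from the mapping matrix W of S4,
        # taking a fused_dim vector and returning a semantic attribute feature vector (S5-1).
        self.semantic_branch = semantic_branch
        self.attr_branch = nn.Linear(fused_dim, attr_dim)   # pedestrian attribute feature vector (S5-2)
        self.reid_branch = nn.Linear(fused_dim, reid_dim)   # pedestrian re-identification feature vector (S5-3)

    def forward(self, x):
        multi_scale = []
        for module in self.conv_modules:                    # collect features at every scale (S5-4)
            x = module(x)
            multi_scale.append(self.gap(x).flatten(1))
        fused = torch.cat(multi_scale, dim=1)               # multi-scale feature fusion (S5-5)

        sem = self.semantic_branch(fused)
        attr = self.attr_branch(fused)
        reid = self.reid_branch(fused)

        attr_out = torch.cat([attr, sem], dim=1)            # fusion used for attribute recognition (S5-6)
        reid_out = torch.cat([attr, reid], dim=1)           # final re-identification feature (S5-7)
        return attr_out, reid_out
```

Here `attr_out` would feed the attribute classifiers during the training of S6, and `reid_out` is the pedestrian re-identification feature vector used in S7 to S9.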
S6: inputting a pedestrian re-recognition training data set into a pedestrian re-recognition network, and performing iterative training on the network by using a self-constructed loss function to finally obtain a pedestrian re-recognition network model based on twin semantic self-encoder and branch fusion;
the loss function is specifically:
wherein L_ID is the loss of the pedestrian re-identification branch; the identification losses of the individual pedestrian attributes are combined into the average total loss of the pedestrian attribute identification branch; M is the total number of pedestrian attribute categories; W_i is the weight of the identification loss of the ith pedestrian attribute, used to balance the contributions of the identification losses of the different attribute classes within the average total loss of the pedestrian attribute identification branch; and λ is used to balance the contributions of the pedestrian re-identification branch loss and the pedestrian attribute identification branch loss in the overall loss of the whole network.
S7: respectively inputting the pedestrian image to be identified and the pedestrian image set to be inquired into a pedestrian re-identification network model in S6, and extracting pedestrian re-identification feature vectors of all the images;
s8: respectively calculating Euclidean similarity between the feature vectors of the pedestrian images to be identified and the feature vectors of all the pedestrian images to be inquired one by one, namely firstly solving the Euclidean distance between the feature vectors, and then carrying out normalization processing on the distance values to obtain the similarity between the feature vectors.
S9: and sequencing all the pedestrian images to be inquired from high to low according to the similarity scores, and obtaining a pedestrian re-identification result according to the similarity score sequencing, namely that the pedestrian in the pedestrian image to be identified is the same as the pedestrian in the pedestrian image to be inquired with the highest similarity score.
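As an illustration of S8 and S9, a short NumPy sketch follows; the particular normalization of distances into similarity scores (1/(1+d)) is an assumption, since the patent only states that the distance values are normalized:

```python
import numpy as np

def euclidean_similarity(query_feat, gallery_feats):
    """query_feat: (D,), gallery_feats: (N, D); returns similarity scores in (0, 1]."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)  # Euclidean distances (S8)
    return 1.0 / (1.0 + dists)                                   # assumed normalization to similarities

def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted from the highest to the lowest similarity score (S9)."""
    sims = euclidean_similarity(query_feat, gallery_feats)
    order = np.argsort(-sims)
    return order, sims[order]
```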
The invention combines an unsupervised learning method, takes an encoder network obtained after a convolution self-encoder is pre-trained as a feature extraction backbone network, adopts a multi-scale feature fusion method, fuses medium-low layer feature vectors with different scales and sizes extracted by different convolution modules to obtain a feature vector which can give consideration to both local detail information and overall abstract information on a pedestrian image, thereby carrying out optimization design on the backbone network and enhancing the feature discrimination and the discrimination of a pedestrian re-identification model.
Although the semantic self-encoder is a three-layer structure including an input layer, a hidden layer and an output layer like the traditional self-encoder, the hidden layer of the semantic encoder in the invention corresponds to a semantic space, has obvious semantic information and is strongly supervised and constrained, so the semantic self-encoder does not belong to the category of unsupervised learning any more. The aim of training the semantic self-encoder is to obtain a mapping matrix, which can better convert visual features into semantic features, and the advantage of the semantic self-encoder is that the problem of mapping drift can be relieved.
The twin semantic self-encoder is a composite self-encoder structure that borrows the idea of twin networks and has better transferability and generalization capability. It consists of two completely identical semantic self-encoders that are trained simultaneously with the same objective function, so that the mapping matrices obtained by the two self-encoders are completely identical; the semantic information in the visual features is extracted by means of the semantic self-encoders, which enhances the feature description capability of the pedestrian re-identification model in the semantic space.
The above description is only an embodiment of the present invention, and is not intended to limit the present invention. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.
Claims (7)
1. A pedestrian re-identification method based on twin semantic self-encoder and branch fusion is characterized in that: the pedestrian re-identification method takes an encoder network of a convolution self-encoder as a backbone network, and carries out branch construction and branch fusion on the backbone network to construct a complete pedestrian re-identification network, and the pedestrian re-identification method specifically comprises the following steps:
s1: collecting pedestrian images, constructing a pedestrian re-identification data set, and obtaining a twin data set of the data set after data enhancement processing;
s2: training a convolution self-encoder by utilizing the pedestrian re-identification data set in the S1;
s3: extracting feature vectors of all images in the pedestrian re-identification data set and the twin data set thereof by using an S2 convolution self-encoder network;
s4: pre-training a twin semantic self-encoder by using the twin feature vector extracted in the S3 to construct a semantic self-encoding branch network, wherein the output of the branch network is a semantic attribute feature vector;
s5: taking the encoder network of the convolution self-encoder in the S2 as a backbone network, and performing branch construction and multi-level feature fusion on the backbone network to construct a complete pedestrian re-identification network;
s6: inputting the S1 pedestrian re-recognition training data set into the pedestrian re-recognition network in S5, performing iterative training on the network by using a loss function, and finally obtaining a pedestrian re-recognition network model based on twin semantic self-encoder and branch fusion;
s7: respectively inputting the pedestrian image to be identified and the pedestrian image set to be inquired into a pedestrian re-identification network model in S6, and extracting pedestrian re-identification feature vectors of all the images;
s8: respectively calculating Euclidean similarity between the feature vectors of the pedestrian images to be identified and the feature vectors of all the pedestrian images to be inquired one by one;
s9: and sequencing according to the similarity score calculated in the step S8 to obtain the pedestrian re-recognition result.
2. The pedestrian re-identification method based on twin semantic self-encoder and branch fusion as claimed in claim 1, wherein: the branch construction in S5 specifically includes:
s5-1: adding a semantic self-coding branch network of S4, and outputting as a semantic attribute feature vector;
s5-2: constructing a pedestrian attribute identification branch network and outputting the pedestrian attribute identification branch network as a pedestrian attribute feature vector;
s5-3: constructing a pedestrian re-identification branch network and outputting the pedestrian re-identification branch network as a final pedestrian re-identification feature vector;
s5-4: adding a feature fusion branch on a backbone network, inputting pedestrian feature vectors with different scales and sizes output by each convolution module of the backbone network, outputting the pedestrian feature vectors subjected to feature fusion, and then respectively inputting a semantic self-coding branch network, a pedestrian attribute identification branch network and a pedestrian re-identification branch network.
3. The pedestrian re-identification method based on the twin semantic self-encoder and the bifurcation fusion according to claim 1, characterized in that: the multi-level feature fusion in step S5 specifically includes:
s5-5: carrying out multi-scale feature fusion on a backbone network;
s5-6: fusing the pedestrian attribute features and the semantic attribute feature vectors, wherein the obtained feature vectors are used for identifying the pedestrian attributes;
s5-7: and fusing the pedestrian attribute feature vector and the pedestrian re-identification feature vector on the pedestrian re-identification branch network, and using the obtained feature vector for final pedestrian re-identification.
4. The pedestrian re-identification method based on twin semantic self-encoder and branch fusion as claimed in claim 1, wherein: the convolutional self-encoder in S2 is composed of two parts, an encoder and a decoder, wherein the encoder is composed of 4 convolutional modules, and the decoder is composed of 4 deconvolution modules and 1 additional convolutional layer.
5. The pedestrian re-identification method based on twin semantic self-encoder and branch fusion as claimed in claim 4, wherein: the convolution module comprises 1 convolution layer and 1 maximum pooling layer, and the deconvolution module comprises 1 upsampling layer and 1 convolution layer.
6. The pedestrian re-identification method based on twin semantic self-encoder and branch fusion as claimed in claim 1, wherein: the S4 includes the following steps:
s4-1: training a twin semantic autoencoder: the semantic self-encoder is of a three-layer structure comprising an input layer, a hidden layer and an output layer, wherein the hidden layer is a semantic attribute layer and contains high-layer semantic features, the semantic attribute layer is obtained by encoding data of the input layer, and a target function of the twin semantic self-encoder is set as follows:
wherein X1 and X2 are the twin feature vectors, W is the mapping matrix, W^T is the transpose of W, S is the semantic attribute feature vector, and λ1 and λ2 are the weight coefficients of the constraints WX1 ≈ S and WX2 ≈ S, respectively;
s4-2: constructing a semantic self-coding branch network: and constructing a semantic self-coding branch network by using the network weight from the input layer to the hidden layer of the semantic self-coder saved in the step S4-1.
7. The pedestrian re-identification method based on twin semantic self-encoder and branch fusion as claimed in claim 1, wherein: the loss function of S6 is specifically:
wherein L_ID is the loss of the pedestrian re-identification branch; the identification losses of the individual pedestrian attributes are combined into the average total loss of the pedestrian attribute identification branch; M is the total number of pedestrian attribute categories; W_i is the weight occupied by the identification loss of the ith pedestrian attribute; and λ is used for balancing the contributions of the pedestrian re-identification branch loss and the pedestrian attribute identification branch loss in the total loss of the whole network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110410769.0A CN112949608B (en) | 2021-04-15 | 2021-04-15 | Pedestrian re-identification method based on twin semantic self-encoder and branch fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110410769.0A CN112949608B (en) | 2021-04-15 | 2021-04-15 | Pedestrian re-identification method based on twin semantic self-encoder and branch fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112949608A CN112949608A (en) | 2021-06-11 |
CN112949608B (en) | 2022-08-02
Family
ID=76232832
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110410769.0A Active CN112949608B (en) | 2021-04-15 | 2021-04-15 | Pedestrian re-identification method based on twin semantic self-encoder and branch fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112949608B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113963150B (en) * | 2021-11-16 | 2022-04-08 | 北京中电兴发科技有限公司 | Pedestrian re-identification method based on multi-scale twin cascade network |
CN115205903B (en) * | 2022-07-27 | 2023-05-23 | 华中农业大学 | Pedestrian re-recognition method based on identity migration generation countermeasure network |
CN115471875B (en) * | 2022-10-31 | 2023-03-03 | 之江实验室 | Multi-code-rate pedestrian recognition visual feature coding compression method and device |
CN117612266B (en) * | 2024-01-24 | 2024-04-19 | 南京信息工程大学 | Cross-resolution pedestrian re-identification method based on multi-scale image and feature layer alignment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109063649A (en) * | 2018-08-03 | 2018-12-21 | 中国矿业大学 | Pedestrian's recognition methods again of residual error network is aligned based on twin pedestrian |
CN109784182A (en) * | 2018-12-17 | 2019-05-21 | 北京飞搜科技有限公司 | Pedestrian recognition methods and device again |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109063649A (en) * | 2018-08-03 | 2018-12-21 | 中国矿业大学 | Pedestrian's recognition methods again of residual error network is aligned based on twin pedestrian |
CN109784182A (en) * | 2018-12-17 | 2019-05-21 | 北京飞搜科技有限公司 | Pedestrian recognition methods and device again |
Also Published As
Publication number | Publication date |
---|---|
CN112949608A (en) | 2021-06-11 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | 
| SE01 | Entry into force of request for substantive examination | 
| GR01 | Patent grant | 