CN113449552A - Pedestrian re-identification method based on blocking indirect coupling GAN network
- Publication number
- CN113449552A CN113449552A CN202010218063.XA CN202010218063A CN113449552A CN 113449552 A CN113449552 A CN 113449552A CN 202010218063 A CN202010218063 A CN 202010218063A CN 113449552 A CN113449552 A CN 113449552A
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/22 — Physics; Computing; Electric digital data processing; Pattern recognition; Analysing; Matching criteria, e.g. proximity measures
- G06N3/045 — Physics; Computing arrangements based on specific computational models; Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
- G06N3/084 — Physics; Computing arrangements based on specific computational models; Computing arrangements based on biological models; Neural networks; Learning methods; Backpropagation, e.g. using gradient descent
Abstract
The invention discloses a pedestrian re-identification method based on a blocking indirect coupling GAN network. The blocked, indirectly coupled representation learning network serves as the encoder of the GAN network, that is, the blocking indirect coupling GAN network, and the GAN network is trained so that, through back-propagation from the discriminators, the decoder and so on, the encoder learns features that identify the same pedestrian across different poses. More robust features are thus obtained for pedestrian re-identification. At the final re-identification stage, only the encoder of the GAN network is used for feature extraction, so no extra pose information or computing power is needed. The recognition rate for the same pedestrian under illumination, scene and pose changes is higher; different pedestrians are strongly discriminated even when their clothing is similar; and the generated images are clearer and contain more detail.
Description
Technical Field
The invention belongs to the technical field of pedestrian re-identification, and particularly relates to a pedestrian re-identification method based on a blocking indirect coupling GAN network.
Background
Pedestrian re-identification (Re-ID) is an important application in security and intelligent surveillance systems, and is an effective alternative when face recognition cannot be used. Re-ID aims to find the same pedestrian across different cameras according to body shape, appearance and pose. Early pedestrian Re-ID methods were primarily based on traditional algorithms, including hand-crafted visual features and similarity measures. Unlike traditional methods, deep learning methods automatically extract better features and learn better similarity metrics, and with the development of deep learning, pedestrian Re-ID technology has made great progress. However, pedestrians in real scenes are often occluded by moving or stationary objects, and the distance from pedestrian to camera is not fixed, which results in low-resolution targets and large scale variations. In addition, the same person may exhibit large deformations and pose changes under different cameras, while different pedestrians may have very similar clothing, poses, body shapes and appearances. Pedestrian re-identification therefore remains an active research problem.
Block-based representation learning methods have proven very effective for pedestrian re-identification (Re-ID) and converge quickly, but the features extracted by existing block-based methods tend to be highly correlated across blocks, and representation-learning Re-ID methods are less effective for pedestrians with large pose variations. To reduce the influence of pose differences and body occlusion, generative adversarial networks (GANs) have been applied to Re-ID: during training, the features extracted from an input image are required to suffice for generating the same person in another pose, so that the extracted features identify the same pedestrian across poses. However, the features extracted by current GAN-based Re-ID methods are partially redundant, so the generated images contain little detail and are blurry.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the defects of the prior art, the invention provides a pedestrian re-identification method based on a blocking indirect coupling GAN network. The blocked, indirectly coupled representation learning network serves as the encoder of the GAN network, that is, the blocking indirect coupling GAN network, and the GAN network is trained so that, through back-propagation from the discriminators, the decoder and so on, the encoder learns features that identify the same pedestrian across different poses. More robust features are thus obtained for pedestrian re-identification.
The technical scheme: to achieve the above purpose, the invention adopts the following technical scheme. The pedestrian re-identification method based on the blocking indirect coupling GAN network comprises the following steps:
Step 1: input a pedestrian image into the encoder of the GAN network; the GAN network consists of an encoder, a decoder and two discriminators, namely an identity discriminator D_id and a pose discriminator D_pd;
Step 2: train the encoder to obtain features capable of distinguishing different pedestrians;
Step 3: input the pedestrian image again and extract pedestrian features through the encoder; input a target pose heat map and obtain pose features through a pose feature extraction network; then concatenate the pedestrian feature vector, the pose feature vector and a noise vector along the channel dimension, feed the result into the decoder of the GAN network, and train to obtain a generated image;
Step 4: input the input pedestrian image, the target-pose original image and the generated image into the identity discriminator D_id of the GAN network;
Step 5: input the pose map of the input pedestrian image, the target pose map and the generated image into the pose discriminator D_pd of the GAN network;
Step 6: train the whole GAN network, extract the features of the query image and the gallery images with the encoding network of the GAN network, and compute the similarity of the features for re-identification.
Further, the Encoder in step 2 is the blocked, indirectly coupled representation learning network (i.e. the encoding network Encoder), and step 2 includes:
Step 2-1: uniformly scale the pedestrian images to 384 × 128 and apply image enhancement operations such as horizontal flipping and random erasing;
Step 2-2: input the images obtained in step 2-1 into the encoding network for training, horizontally divide the obtained feature tensor into 6 blocks, obtain a 256-dimensional column vector for each block, and compute the cross-entropy loss against the true class labels;
Step 2-3: for the block vectors obtained in step 2-2, compute the cosine similarity of every pair of feature vectors and measure the L1 loss against the zero vector.
Further, the encoding network Encoder consists of a backbone network and two loss functions, specifically:
for step 2-2, average pooling and a 1 × 1 convolution are applied to each of the 6 blocks, yielding a 256-dimensional column vector per block;
for step 2-3, when the cosine similarity approaches zero, L2 regularization is applied to the network weight matrix, thereby obtaining block-independent features.
Further, the Decoder in step 3 performs an up-sampling process, and step 3 includes:
step 3-1: inputting the fused features, and obtaining a feature vector of a layer 1 through the operation of a deconvolution module;
step 3-2: inputting the output of the 1 st layer into a deconvolution module to obtain a feature vector of the 2 nd layer;
step 3-3: the up-sampling operation of the feature vector of the layer 2 integrates a deconvolution module and a bilinear interpolation module, so that a layer 3 feature vector is obtained;
step 3-4: the feature vector of the layer 3 is subjected to the up-sampling operation of the fusion module to obtain a feature vector of the layer 4;
step 3-5: and the feature vector of the 4 th layer is subjected to a deconvolution module to obtain a generated graph.
Further, the decoding process of the Decoder is as follows: the network input layer fuses the pedestrian features, the target pose features and the noise vector into a combined feature, which is then up-sampled and decoded back to the size of the input image;
the decoder's up-sampling sub-modules consist of a deconvolution module and a bilinear interpolation module; the bilinear interpolation module consists of a Bilinear-Conv network, with a 1 × 1 convolution layer added to obtain the same number of feature channels as the deconvolution module.
Further, in step 4, corresponding pixels of the images input to the identity discriminator D_id are subtracted and the squared difference is computed; a batch normalization (BN) layer and a fully connected (FC) layer follow, and finally a nonlinear function outputs the classification probability; if the input and the target are the same person, the label is set to true, otherwise false.
Further, in step 5, the images and pose maps input to the pose discriminator D_pd are concatenated along the channel dimension and then processed by four Conv-BN-ReLU sub-networks and a nonlinear function to obtain a confidence value in the interval [0, 1].
Further, in step 6 the query image and the images to be queried are input to the encoding network and features are extracted; the Euclidean distances between the query features and the gallery features are computed and sorted in ascending order (a smaller distance means greater similarity), and the Top1, Top5 and Top10 results are taken.
Advantageous effects: compared with the prior art, the invention has the following advantages: (1) at the final re-identification stage, only the encoder of the GAN network is used for feature extraction, so no extra pose information or computing power is needed;
(2) the recognition rate for the same pedestrian under illumination, scene and pose changes is higher;
(3) different pedestrians are strongly discriminated even when their clothing is similar;
(4) the generated images are clearer and contain more detail.
Drawings
FIG. 1 is a diagram of the overall architecture of the process of the present invention;
FIG. 2 is a block diagram of an encoder network of the present invention;
FIG. 3 is a diagram of a decoding network model architecture of the present invention;
FIG. 4 is a graph of the results of the present invention.
FIG. 5 is a pedestrian search result diagram of the present invention.
Detailed Description
The present invention will be further illustrated with reference to the accompanying drawings and specific examples. It should be understood that these examples are given only to illustrate the invention and are not intended to limit its scope.
The blocking indirect coupling GAN network consists of an encoder, a decoder and two discriminators;
As shown in FIG. 1, the pedestrian re-identification method based on the blocking indirect coupling GAN network of the present invention has the following steps:
Step 1: input a pedestrian image into the encoder of the GAN network;
Step 2: train the encoder to obtain features that distinguish different pedestrians; the trained encoder is composed of the blocked, indirectly coupled representation learning network;
Step 3: input the pedestrian image again and extract pedestrian features through the encoder; input a target pose heat map and obtain pose features through a pose feature extraction network; then concatenate the pedestrian feature vector, the pose feature vector and a noise vector along the channel dimension, feed the result into the decoder of the GAN network, and train to produce the pedestrian in the target pose (i.e. the generated image);
Step 4: input the input pedestrian image, the target-pose original image and the generated image into the identity discriminator D_id of the GAN network;
Step 5: input the pose map of the input pedestrian image, the target pose map and the generated image into the pose discriminator D_pd of the GAN network;
Step 6: train the whole GAN network, then use the encoding network of the GAN network to extract features of the query image and the gallery images, and compute the similarity of the features for re-identification.
The specific implementation is as follows:
The detailed steps of step 2 are as follows:
As shown in FIG. 2, the pedestrian image is uniformly scaled to 384 × 128, image enhancement operations such as horizontal flipping and random erasing are applied, and the processed image is input into the ResNet50 network of the encoder (ResNet50 is part of the encoder and serves as its backbone framework);
ResNet50 is adopted as the backbone network, and the original global average pooling layer and the structure behind it are deleted; the resulting three-dimensional tensor is then divided equally into six blocks along the horizontal direction, and average pooling and a 1 × 1 convolution are applied to each block, yielding a 256-dimensional column vector per block; a class label is predicted from each 256-dimensional block vector, and the cross-entropy loss against the true class label is computed;
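The block partition above can be sketched as follows. This is a NumPy shape sketch only: the 2048 × 24 × 8 backbone output size (ResNet50 on a 384 × 128 input) and the random projection standing in for the trained 1 × 1 convolution are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
feat = rng.standard_normal((2048, 24, 8))          # C x H x W backbone output (assumed size)
W_proj = rng.standard_normal((256, 2048)) * 0.01   # a 1x1 conv acts as a per-block linear map

blocks = np.split(feat, 6, axis=1)                 # six horizontal stripes, 4 rows each
block_vecs = []
for b in blocks:
    pooled = b.mean(axis=(1, 2))                   # average pooling -> 2048-dim vector
    block_vecs.append(W_proj @ pooled)             # project to a 256-dim column vector
block_vecs = np.stack(block_vecs)                  # 6 x 256, one vector per block
```

In the real encoder, each of the six 256-dimensional vectors would additionally feed a classifier trained with cross-entropy loss.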
To obtain richer features while keeping the redundancy between different blocks low, so that the blocks complement one another, the NCM loss computes the cosine similarity of every pair of feature vectors. To push any two feature vectors apart, the proposed blocking indirect coupling loss measures the L1 loss between the computed similarities and the zero vector. In addition, to prevent the network weight matrix from collapsing to the zero matrix as the similarity approaches 0, L2 regularization is applied to the weight matrix, yielding block-independent features.
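A minimal sketch of this indirect-coupling objective, assuming the NCM formulation reduces to plain pairwise cosine similarity (function and argument names are illustrative, not from the patent):

```python
import numpy as np
from itertools import combinations

def coupling_loss(block_vecs, weight, l2_coef=1e-4):
    """Blocking indirect coupling loss sketch: L1 distance between the pairwise
    cosine similarities and zero, plus L2 regularization of the projection
    weights to keep them away from the zero matrix."""
    sims = []
    for i, j in combinations(range(len(block_vecs)), 2):
        a, b = block_vecs[i], block_vecs[j]
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    l1 = np.abs(np.asarray(sims)).mean()       # |cos_sim - 0|, averaged over all pairs
    l2 = l2_coef * np.sum(weight ** 2)         # L2 regularization term
    return l1 + l2
```

Mutually orthogonal block vectors drive the L1 term to zero, which is exactly the decorrelation the patent aims for.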
Detailed description of step 3
The decoding network Decoder performs an up-sampling process. Specifically: the pedestrian features, the target pose features and the noise vector are fused as the input of the decoder, and up-sampling operations then decode this input back to the input image size (i.e. the generated image).
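The channel-wise fusion that forms the decoder input can be sketched as below; all channel counts here are illustrative assumptions, since the patent does not fix these dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical channel counts for the three inputs to the decoder.
ped_feat  = rng.standard_normal((1536, 1, 1))   # e.g. 6 blocks x 256 dims, reshaped
pose_feat = rng.standard_normal((128, 1, 1))    # encoding of the target pose heat map
noise     = rng.standard_normal((256, 1, 1))    # sampled noise vector

# Concatenate along the channel dimension to form the decoder input.
decoder_in = np.concatenate([ped_feat, pose_feat, noise], axis=0)
```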
The usual decoder up-sampling sub-module is a Relu-Transconv-Norm structure, i.e. a deconvolution module. To make full use of the feature information of the decoding layers, the deconvolution module is fused with a bilinear interpolation module; the bilinear interpolation module consists of a Bilinear-Conv network, with a 1 × 1 convolution layer added to obtain the same number of feature channels as the deconvolution module. Only two layers use the interpolation module, so the network still converges quickly.
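A sketch of the bilinear half of the fused up-sampling module: a ×2 bilinear interpolation followed by a 1 × 1 convolution that matches the deconvolution branch's channel count. The weights are random stand-ins for trained parameters, and the align-corners sampling convention is an assumption.

```python
import numpy as np

def bilinear_upsample_x2(x):
    """Bilinearly upsample a C x H x W array by a factor of 2 (align-corners style)."""
    c, h, w = x.shape
    ys = np.linspace(0, h - 1, 2 * h)
    xs = np.linspace(0, w - 1, 2 * w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]
    wx = (xs - x0)[None, None, :]
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def bilinear_conv_module(x, conv1x1_weight):
    """Bilinear-Conv sub-module: x2 interpolation, then a 1x1 convolution
    (a per-pixel linear map) matching the deconvolution branch's channels."""
    up = bilinear_upsample_x2(x)
    return np.einsum('oc,chw->ohw', conv1x1_weight, up)
```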
As shown in FIG. 3, the feature maps after successive up-sampling operations are named, in order, the layer-1, layer-2, layer-3 and layer-4 feature vectors, followed by the generated image.
Specifically, the fused features pass through a deconvolution module to give the layer-1 feature vector, and another deconvolution module gives the layer-2 feature vector; the up-sampling of the layer-2 feature vector fuses the deconvolution module with the bilinear interpolation module to give the layer-3 feature vector; the layer-3 feature vector passes through the fusion module to give the layer-4 feature vector; finally, the layer-4 feature vector passes through a deconvolution module to give the generated image. The fusion module is used only in the third and fourth up-sampling steps rather than in all layers, which effectively improves the quality of the generated image without slowing the convergence of the network.
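The five up-sampling stages described above can be traced with a small helper. The starting 12 × 4 spatial size is an assumption chosen so that five doublings reach the 384 × 128 input resolution; the patent does not state the intermediate sizes.

```python
def decoder_stage_trace(in_hw=(12, 4), n_stages=5, fused_stages=(3, 4)):
    """Trace the spatial size through the decoder: each stage doubles the
    resolution; stages 3 and 4 use the fused deconv + bilinear module,
    the remaining stages use a plain deconvolution module."""
    h, w = in_hw
    trace = []
    for s in range(1, n_stages + 1):
        h, w = 2 * h, 2 * w
        kind = 'fusion' if s in fused_stages else 'deconv'
        trace.append((s, kind, (h, w)))
    return trace
```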
Detailed steps of step 4 and step 5: discriminating real images from generated images.
The invention comprises two discriminators, shown in FIG. 1(d) and (e): an identity discriminator D_id and a pose discriminator D_pd. The identity discriminator D_id encodes the input image using ResNet50 as the backbone network, but its weight parameters are not shared with the Encoder network. Corresponding pixels of the encoded input image and target image (or generated image) are subtracted and the squared difference is computed. A batch normalization (BN) layer and a fully connected (FC) layer follow, and finally a nonlinear function outputs the classification probability. If the input and target are the same person, the label is set to true, otherwise false. These operations reduce the distance between different poses of the same ID.
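The squared-difference head of D_id can be sketched as follows. The feature dimension, the simple standardization standing in for batch normalization, and the function name are all illustrative assumptions:

```python
import numpy as np

def identity_score(f_a, f_b, fc_w, fc_b):
    """Hypothetical D_id head: element-wise squared difference of two image
    encodings, standardization as a stand-in for batch normalization, a fully
    connected layer, and a sigmoid that outputs the same-person probability."""
    d = (f_a - f_b) ** 2                        # corresponding-pixel squared difference
    d = (d - d.mean()) / (d.std() + 1e-5)       # BN stand-in (per-sample standardization)
    logit = fc_w @ d + fc_b                     # fully connected layer
    return 1.0 / (1.0 + np.exp(-logit))         # nonlinear function -> probability
```

Identical encodings give a zero difference vector, so the score collapses to the sigmoid of the bias alone.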
The pose discriminator D_pd determines whether the generated image has the same pose as the target. The pose discriminator uses PatchGAN: the input image and the pose map are first concatenated along the channel dimension and then processed by four Conv-BN-ReLU sub-networks and a nonlinear function to obtain a confidence map with values in the interval [0, 1], where the confidence map represents the degree of match between the input image and the pose map.
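A minimal sketch of this PatchGAN-style discriminator. The 18-channel keypoint heat map, the 64 × 64 input size, the channel widths, and the 4 × 4 stride-2 kernels are assumptions; batch normalization is omitted for brevity.

```python
import numpy as np

def conv_relu(x, weight, stride=2):
    """Minimal strided valid convolution followed by ReLU (BN omitted)."""
    c_out, c_in, k, _ = weight.shape
    _, h, w = x.shape
    oh, ow = (h - k) // stride + 1, (w - k) // stride + 1
    out = np.zeros((c_out, oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[:, i * stride:i * stride + k, j * stride:j * stride + k]
            out[:, i, j] = np.tensordot(weight, patch, axes=3)
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
img = rng.standard_normal((3, 64, 64))      # RGB input image
pose = rng.standard_normal((18, 64, 64))    # assumed 18-channel keypoint heat map
x = np.concatenate([img, pose], axis=0)     # cascade along the channel dimension

for c_in, c_out in zip([21, 8, 8, 8], [8, 8, 8, 1]):
    x = conv_relu(x, rng.standard_normal((c_out, c_in, 4, 4)) * 0.05)
conf = 1.0 / (1.0 + np.exp(-x))             # patch-wise confidence map in [0, 1]
```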
Detailed step of step 6:
The whole GAN network is trained so that it obtains features that distinguish different pedestrians and identify the same pedestrian across poses. Finally, the query image and the images to be queried are input to the encoding network and features are extracted; the Euclidean distances between the query features and the gallery features are computed, the gallery is sorted in ascending order of distance (a smaller distance means greater similarity), and the Top1, Top5 and Top10 results are taken.
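The retrieval step can be sketched as below (the function and key names are illustrative):

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats, ks=(1, 5, 10)):
    """Sort gallery images by Euclidean distance to the query feature
    (ascending: a smaller distance means greater similarity) and return
    the Top-k index lists."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    order = np.argsort(dists)                       # ascending sort of distances
    return {f"top{k}": order[:k].tolist() for k in ks}
```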
FIG. 4 shows the image-generation results of the present invention: the first row shows the input pedestrian images, the second row the target images the network should generate, the third row the images generated by a current state-of-the-art GAN network, and the fourth row the results of the present invention.
FIG. 5 shows Top10 pedestrian search results on the Market1501 data set, using the model trained by the present invention. The column to the left of the dotted line is the query image, and to the right are the Top10 retrieved images that best match the query.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements should also fall within the protection scope of the present invention.
Claims (8)
1. A pedestrian re-identification method based on a blocking indirect coupling GAN network, characterized by comprising the following steps:
Step 1: inputting a pedestrian image into the encoder of the GAN network, the GAN network consisting of an encoder, a decoder and two discriminators, the two discriminators being an identity discriminator D_id and a pose discriminator D_pd;
Step 2: training the encoder to obtain features capable of distinguishing different pedestrians;
Step 3: inputting the pedestrian image again and extracting pedestrian features through the encoder; inputting a target pose heat map and obtaining pose features through a pose feature extraction network; then concatenating the pedestrian feature vector, the pose feature vector and a noise vector along the channel dimension, feeding the result into the decoder of the GAN network, and training to obtain a generated image;
Step 4: inputting the input pedestrian image, the target-pose original image and the generated image into the identity discriminator D_id of the GAN network;
Step 5: inputting the pose map of the input pedestrian image, the target pose map and the generated image into the pose discriminator D_pd of the GAN network;
Step 6: training the whole GAN network, extracting the features of the query image and the gallery images with the encoding network of the GAN network, and computing the similarity of the features for re-identification.
2. The pedestrian re-identification method based on the blocking indirect coupling GAN network as claimed in claim 1, wherein: in step 2, the Encoder is the blocked, indirectly coupled representation learning network (i.e. the encoding network Encoder), and step 2 comprises:
Step 2-1: uniformly scaling the pedestrian images to 384 × 128 and applying image enhancement operations such as horizontal flipping and random erasing;
Step 2-2: inputting the images obtained in step 2-1 into the encoding network for training, horizontally dividing the obtained feature tensor into 6 blocks, obtaining a 256-dimensional column vector for each block, and computing the cross-entropy loss against the true class labels;
Step 2-3: for the block vectors obtained in step 2-2, computing the cosine similarity of every pair of feature vectors and measuring the L1 loss against the zero vector.
3. The pedestrian re-identification method based on the blocking indirect coupling GAN network as claimed in claim 2, wherein: the encoding network Encoder consists of a backbone network and two loss functions, specifically:
for step 2-2, average pooling and a 1 × 1 convolution are applied to each of the 6 blocks, yielding a 256-dimensional column vector per block;
for step 2-3, when the cosine similarity approaches zero, L2 regularization is applied to the network weight matrix, thereby obtaining block-independent features.
4. The pedestrian re-identification method based on the blocking indirect coupling GAN network as claimed in claim 1, wherein: the Decoder in step 3 performs an up-sampling process, and step 3 comprises:
step 3-1: inputting the fused features, and obtaining a feature vector of a layer 1 through the operation of a deconvolution module;
step 3-2: inputting the output of the 1 st layer into a deconvolution module to obtain a feature vector of the 2 nd layer;
step 3-3: the up-sampling operation of the feature vector of the layer 2 integrates a deconvolution module and a bilinear interpolation module, so that a layer 3 feature vector is obtained;
step 3-4: the feature vector of the layer 3 is subjected to the up-sampling operation of the fusion module to obtain a feature vector of the layer 4;
step 3-5: and the feature vector of the 4 th layer is subjected to a deconvolution module to obtain a generated graph.
5. The pedestrian re-identification method based on the blocking indirect coupling GAN network as claimed in claim 4, wherein: the decoding process of the Decoder is as follows: the network input layer fuses the pedestrian features, the target pose features and the noise vector into a combined feature, which is then up-sampled and decoded back to the size of the input image;
the decoder's up-sampling sub-modules consist of a deconvolution module and a bilinear interpolation module; the bilinear interpolation module consists of a Bilinear-Conv network, with a 1 × 1 convolution layer added to obtain the same number of feature channels as the deconvolution module.
6. The pedestrian re-identification method based on the blocking indirect coupling GAN network as claimed in claim 1, wherein: in step 4, corresponding pixels of the images input to the identity discriminator D_id are subtracted and the squared difference is computed; a batch normalization (BN) layer and a fully connected (FC) layer follow, and finally a nonlinear function outputs the classification probability; if the input and the target are the same person, the label is set to true, otherwise false.
7. The pedestrian re-identification method based on the blocking indirect coupling GAN network as claimed in claim 1, wherein: in step 5, the images and pose maps input to the pose discriminator D_pd are concatenated along the channel dimension and then processed by four Conv-BN-ReLU sub-networks and a nonlinear function to obtain a confidence value in the interval [0, 1].
8. The pedestrian re-identification method based on the blocking indirect coupling GAN network as claimed in claim 1, wherein: in step 6, the query image and the images to be queried are input to the encoding network and features are extracted; the Euclidean distances between the query features and the gallery features are computed and sorted in ascending order (a smaller distance means greater similarity), and the Top1, Top5 and Top10 results are taken.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010218063.XA CN113449552A (en) | 2020-03-25 | 2020-03-25 | Pedestrian re-identification method based on blocking indirect coupling GAN network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010218063.XA CN113449552A (en) | 2020-03-25 | 2020-03-25 | Pedestrian re-identification method based on blocking indirect coupling GAN network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113449552A true CN113449552A (en) | 2021-09-28 |
Family
ID=77807554
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010218063.XA Pending CN113449552A (en) | 2020-03-25 | 2020-03-25 | Pedestrian re-identification method based on blocking indirect coupling GAN network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113449552A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108537136A (en) * | 2018-03-19 | 2018-09-14 | Fudan University | Pedestrian re-identification method based on pose-normalized image generation
CN109886251A (en) * | 2019-03-11 | 2019-06-14 | Nanjing University of Posts and Telecommunications | End-to-end pedestrian re-identification method based on pose-guided adversarial learning
CN110427813A (en) * | 2019-06-24 | 2019-11-08 | China University of Mining and Technology | Pedestrian re-identification method based on a Siamese generative adversarial network with pose-guided pedestrian image generation
WO2019227294A1 (en) * | 2018-05-28 | 2019-12-05 | Huawei Technologies Co., Ltd. | Image processing method, related device and computer storage medium
History
- 2020-03-25: Application CN202010218063.XA filed in China (CN), patent/CN113449552A/en; status: Pending
Non-Patent Citations (2)
Title |
---|
YUE ZHANG; YI JIN; JIANQIANG CHEN; SHICHAO KAN; YIGANG CEN; QI CAO: "PGAN: Part-Based Nondirect Coupling Embedded GAN for Person Reidentification", IEEE MultiMedia * |
FENG MIN; ZHANG ZHICHENG; LYU JIN; YU LEI; HAN BIN: "Research on Cross-Modal Person Re-identification Based on Generative Adversarial Networks", Modern Information Technology, no. 04 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117612266A (en) * | 2024-01-24 | 2024-02-27 | 南京信息工程大学 | Cross-resolution pedestrian re-identification method based on multi-scale image and feature layer alignment |
CN117612266B (en) * | 2024-01-24 | 2024-04-19 | 南京信息工程大学 | Cross-resolution pedestrian re-identification method based on multi-scale image and feature layer alignment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107844779B (en) | Video key frame extraction method | |
Qu et al. | RGBD salient object detection via deep fusion | |
Zhang et al. | Joint human detection and head pose estimation via multistream networks for RGB-D videos | |
CN107239730B (en) | Quaternion deep neural network model method for intelligent automobile traffic sign recognition | |
CN110717411A (en) | Pedestrian re-identification method based on deep layer feature fusion | |
CN114783003B (en) | Pedestrian re-identification method and device based on local feature attention | |
Fooladgar et al. | Multi-modal attention-based fusion model for semantic segmentation of RGB-depth images | |
CN114187665B (en) | Multi-person gait recognition method based on human skeleton heat map | |
Liu et al. | Pedestrian detection algorithm based on improved SSD | |
CN110334622B (en) | Pedestrian retrieval method based on adaptive feature pyramid | |
Saqib et al. | Person head detection in multiple scales using deep convolutional neural networks | |
Wang et al. | MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection | |
CN111325169A (en) | Deep video fingerprint algorithm based on capsule network | |
Cho et al. | Semantic segmentation with low light images by modified CycleGAN-based image enhancement | |
Tian et al. | Snowflake removal for videos via global and local low-rank decomposition | |
Tao et al. | CENet: A channel-enhanced spatiotemporal network with sufficient supervision information for recognizing industrial smoke emissions | |
Lejbolle et al. | Attention in multimodal neural networks for person re-identification | |
CN113221770A (en) | Cross-domain pedestrian re-identification method and system based on multi-feature hybrid learning | |
Nguyen et al. | A robust triangular sigmoid pattern-based obstacle detection algorithm in resource-limited devices | |
Rahman et al. | Pedestrian Detection in Thermal Images Using Deep Saliency Map and Instance Segmentation | |
Xie et al. | Learning visual-spatial saliency for multiple-shot person re-identification | |
Nie et al. | Multiple person tracking by spatiotemporal tracklet association | |
CN113449552A (en) | Pedestrian re-identification method based on blocking indirect coupling GAN network | |
Huang et al. | Multi‐class obstacle detection and classification using stereovision and improved active contour models | |
Ma et al. | MSFNET: multi-stage fusion network for semantic segmentation of fine-resolution remote sensing data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||