CN115393902A - Pedestrian re-identification method based on comparison language image pre-training model CLIP - Google Patents
- Publication number
- CN115393902A CN115393902A CN202211173432.3A CN202211173432A CN115393902A CN 115393902 A CN115393902 A CN 115393902A CN 202211173432 A CN202211173432 A CN 202211173432A CN 115393902 A CN115393902 A CN 115393902A
- Authority
- CN
- China
- Prior art keywords
- text
- image
- encoder
- training
- pedestrian
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a pedestrian re-identification method based on the contrastive language-image pre-training model CLIP, which trains an image encoder that uses a CNN or a Transformer as its backbone network: fix the parameters of the text encoder and the image encoder, set a descriptive text containing learnable parameters for each identity, and feed each image and its corresponding descriptive text into the image encoder and the text encoder; compute the text-to-image and image-to-text contrastive loss functions and train the learnable parameters in the descriptive text; fix the text encoder and the descriptive texts, generate and store the text features of each identity, and feed the images into the image encoder; compute an image-to-text cross-entropy loss function and train the image encoder; the features for pedestrian re-identification are obtained from the image encoder. Compared with the prior art, the method applies a language-image pre-training model to the re-identification task, is simple, resolves the lack of textual label descriptions in the pedestrian re-identification task, and improves accuracy.
Description
Technical Field
The invention relates to the technical field of computer vision and digital image processing, and in particular to a pedestrian re-identification method based on the Contrastive Language-Image Pre-training model CLIP.
Background
The goal of the pedestrian re-identification task is to match the same person across different camera views. Previous pedestrian re-identification work usually adopts a convolutional neural network (CNN) as the backbone network, which tends to over-focus on local information and lacks the ability to attend to the whole body. To address this problem, many methods introduce prior knowledge into the network: for example, using multiple branches, each attending to different local-region features for identification, or introducing semantic segmentation results to help the network distinguish different body parts and mine features further. In addition, recent methods enlarge the attended region by using an attention mechanism or by adopting a Transformer network as the backbone.
Whether based on CNNs or Transformer networks, pre-training is very important for the re-identification task. Most backbone networks previously used for re-identification are pre-trained on the ImageNet classification task, whereas recent cross-modal language-image pre-training models such as CLIP change the pre-training task to relate visual information to language descriptions, so that the image model can perceive high-level semantics from text and learn transferable features. These models are trained on larger datasets of image-text pairs and adapt better to downstream tasks such as image classification and image segmentation. In the pedestrian re-identification task, however, the label is only an index without a concrete text description, so the descriptive text required by a language-image model cannot simply be generated, and it is difficult to fully exploit the text model.
Therefore, it is necessary to provide a pedestrian re-identification method based on the contrastive language-image pre-training model CLIP.
Disclosure of Invention
The invention aims to provide a pedestrian re-identification method based on the contrastive language-image pre-training model CLIP for the pedestrian re-identification task, which lacks text labels.
The purpose of the invention is realized as follows:
A pedestrian re-identification method based on the contrastive language-image pre-training model CLIP uses a text encoder with a Transformer as its backbone network to train an image encoder with a CNN or a Transformer as its backbone network, and uses the image encoder to generate image features for pedestrian re-identification. The method comprises the following specific steps:
Step 1: for the MSMT17 dataset, which contains thousands of identities, set a group of descriptive texts containing learnable parameters for each identity in the training set during the training stage, the template of the descriptive text being "A photo of a [X]_1 [X]_2 [X]_3 ... [X]_M person", where [X]_m (m ∈ {1, ..., M}) is the corresponding learnable token parameter;
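As an illustration, the per-identity learnable tokens of this template can be sketched as follows; the module name, embedding dimension, initialization, and prefix/suffix handling are hypothetical and not specified by the invention:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: independent learnable tokens [X]_1 ... [X]_M per identity,
# spliced between the embeddings of "A photo of a" and "person".
class IdentityPrompts(nn.Module):
    def __init__(self, num_identities: int, m_tokens: int = 5, dim: int = 512):
        super().__init__()
        # One independent set of M learnable token embeddings per identity
        self.tokens = nn.Parameter(torch.randn(num_identities, m_tokens, dim) * 0.02)

    def forward(self, identity_ids, prefix, suffix):
        # prefix: embeddings of "A photo of a"; suffix: embeddings of "person" + [EOS]
        x = self.tokens[identity_ids]                      # (B, M, dim)
        b = x.shape[0]
        return torch.cat([prefix.expand(b, -1, -1), x,
                          suffix.expand(b, -1, -1)], dim=1)

prompts = IdentityPrompts(num_identities=1041, m_tokens=5, dim=512)
prefix = torch.zeros(4, 512)   # stand-in for the embedded fixed prefix
suffix = torch.zeros(2, 512)   # stand-in for the embedded fixed suffix
out = prompts(torch.tensor([0, 3]), prefix, suffix)
print(out.shape)               # (2, 11, 512): 4 prefix + 5 learnable + 2 suffix
```

Because the tokens are ordinary parameters, only they receive gradients in Step 3 while both encoders stay frozen.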
Step 2: fix the parameters of the image encoder and the text encoder, and feed each image and its corresponding descriptive text into the image encoder and the text encoder;
Step 3: compute the image-to-text and text-to-image contrastive loss functions L_i2t and L_t2i, and train the learnable parameters in the descriptive text; the corresponding formula is as follows:
s(V_i, T_i) = V_i · T_i = g_I(img_i) · g_T(text_i)    (a)
where img_i is the classification token [CLS] output by the image encoder for the i-th image, text_i is the end-of-sequence token [EOS] output by the text encoder for the corresponding descriptive text, g_I and g_T are the linear layers that map the [CLS] token and the [EOS] token into the same space, finally yielding the image feature V_i and the text feature T_i, and s(V_i, T_i) is the similarity between the image feature V_i and the text feature T_i; B is the number of images in the current batch, a is an index within the current batch, y_i is the identity label of the i-th image, P(y_i) is the index set of all images in the same batch belonging to identity y_i, and |P(y_i)| is the number of images in this set;
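Equation (a) and contrastive losses consistent with the symbols defined above (batch size B, positive index set P(y_i)) can be sketched as follows. The text reproduces only equation (a), so the exact form of L_i2t and L_t2i shown here is an assumption in the style of supervised contrastive learning:

```python
import torch
import torch.nn.functional as F

# Hedged sketch: similarity s(V_i, T_j) as a dot product of normalized features,
# and image-to-text / text-to-image losses averaged over the positive set P(y_i).
def contrastive_losses(V, T, labels):
    V = F.normalize(V, dim=1)          # image features V_i
    T = F.normalize(T, dim=1)          # text features T_i
    sim = V @ T.t()                    # s(V_i, T_j) for the whole batch
    log_p_i2t = F.log_softmax(sim, dim=1)
    log_p_t2i = F.log_softmax(sim.t(), dim=1)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # mask for P(y_i)
    # Average negative log-probability over each anchor's positive set
    l_i2t = -(log_p_i2t * pos).sum(1) / pos.sum(1)
    l_t2i = -(log_p_t2i * pos).sum(1) / pos.sum(1)
    return l_i2t.mean(), l_t2i.mean()

V = torch.randn(8, 512)
T = torch.randn(8, 512)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
li, lt = contrastive_losses(V, T, labels)
```

A temperature scaling of the similarities, as in CLIP, could be added; it is omitted here because the text does not state one.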
Step 4: fix the text encoder and the descriptive texts, generate and store the text features of each identity, and feed the images into the image encoder;
Step 5: compute the image-to-text cross-entropy loss function and train the image encoder;
q_k = (1 - ε) δ_{k,y} + ε/N    (e)
where N is the number of identities in the training set, k indexes all identities in the dataset, q_k is the smoothed label representing the expected probability that the current image belongs to the k-th identity, δ_{k,y} is the Kronecker delta, equal to 1 when k = y and 0 otherwise, and ε is a parameter controlling the smoothing strength;
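Equation (e) and the image-to-text cross entropy against the stored per-identity text features can be sketched as follows; ε = 0.1 and the batch shapes are assumed values, not taken from the text:

```python
import torch
import torch.nn.functional as F

# Sketch of equation (e): smoothed target q_k = (1-ε)δ_{k,y} + ε/N,
# then cross entropy between q and the softmax over identity similarities.
def smoothed_cross_entropy(logits, y, eps=0.1):
    N = logits.shape[1]                         # number of identities
    q = torch.full_like(logits, eps / N)        # ε/N for every class k
    q.scatter_(1, y.unsqueeze(1), 1 - eps + eps / N)  # (1-ε) + ε/N at k = y
    return -(q * F.log_softmax(logits, dim=1)).sum(1).mean()

logits = torch.randn(4, 1041)   # similarities of 4 images to N = 1041 identity texts
y = torch.tensor([0, 5, 7, 1040])
loss = smoothed_cross_entropy(logits, y)
```

Each row of q sums to 1, so this reduces to ordinary cross entropy when ε = 0.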
Step 6: in the testing stage, feed the test-set images into the trained image encoder to obtain the corresponding image features and perform pedestrian re-identification: for each image in the query set of the test set, find the most similar images under other cameras in the gallery, and compute the mAP and Rank-1 metrics.
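The retrieval of the testing stage can be sketched as follows; the function computes only Rank-1 on synthetic features, excludes same-identity gallery entries from the same camera as is conventional, and omits mAP for brevity:

```python
import numpy as np

# Minimal retrieval sketch: rank gallery features by cosine similarity
# for each query and report the Rank-1 accuracy.
def rank1(q_feat, q_pid, q_cam, g_feat, g_pid, g_cam):
    q = q_feat / np.linalg.norm(q_feat, axis=1, keepdims=True)
    g = g_feat / np.linalg.norm(g_feat, axis=1, keepdims=True)
    sim = q @ g.T
    hits = 0
    for i in range(len(q)):
        # Drop gallery entries with the same identity AND the same camera
        valid = ~((g_pid == q_pid[i]) & (g_cam == q_cam[i]))
        order = np.argsort(-sim[i][valid])
        hits += int(g_pid[valid][order[0]] == q_pid[i])
    return hits / len(q)

q_feat = np.array([[1.0, 0.0], [0.0, 1.0]])
q_pid = np.array([0, 1]); q_cam = np.array([0, 0])
g_feat = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
g_pid = np.array([0, 1, 0]); g_cam = np.array([1, 1, 1])
print(rank1(q_feat, q_pid, q_cam, g_feat, g_pid, g_cam))   # 1.0
```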
The text encoder with a Transformer as its backbone network is used to train the image encoder with a CNN or a Transformer as its backbone network; the backbone networks are specifically selected as follows: the backbone network used by the text encoder is an 8-layer Transformer network, and the backbone network of the image encoder is either ResNet-50 (a CNN) or ViT-B/16 (a Transformer).
The descriptive texts containing the learnable parameters do not share learnable parameters across different identities; each serves as an ambiguous description of its identity to supplement the textual description missing from the re-identification task. For a dataset containing N identities, the text features of all identities have dimension N × C.
A group of learnable token parameters is set in advance for each camera in the dataset; the token of the corresponding camera is added to the classification token [CLS], and the image is then fed into the image encoder.
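The camera-token idea can be sketched as follows; the embedding dimension, zero initialization, and addition point are assumptions, as the text only states that the camera's token is added to the classification token [CLS]:

```python
import torch
import torch.nn as nn

# Hedged sketch: one learnable embedding per camera, added to the image
# [CLS] token so the encoder can compensate for camera-specific bias.
class CameraToken(nn.Module):
    def __init__(self, num_cameras: int, dim: int = 768):
        super().__init__()
        self.cam_embed = nn.Parameter(torch.zeros(num_cameras, dim))

    def forward(self, cls_token, cam_ids):
        return cls_token + self.cam_embed[cam_ids]

mod = CameraToken(num_cameras=15, dim=768)
cls_tok = torch.randn(4, 768)
out = mod(cls_tok, torch.tensor([0, 3, 14, 7]))
print(out.shape)   # (4, 768); zero-initialized, so initially out == cls_tok
```

Zero initialization keeps the pre-trained behavior unchanged at the start of training, which is a common but here assumed design choice.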
In addition to the image-to-text cross-entropy loss function used to train the image encoder, the identity loss function L_id and the triplet loss function L_tri commonly used in pedestrian re-identification are computed as follows:
L_tri = max(d_p - d_n + α, 0)    (g)
where p_k is the probability of belonging to class k predicted by the network, d_p and d_n are the distances to the hardest positive sample and the hardest negative sample, and α is the margin threshold set for L_tri.
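Equation (g) with batch-hard mining can be sketched as follows; the batch-hard selection of d_p and d_n is an assumption consistent with the "hardest positive/negative" wording, and α = 0.3 matches the embodiment:

```python
import torch

# Sketch of equation (g): L_tri = max(d_p - d_n + α, 0), with d_p / d_n the
# distances to the hardest positive / hardest negative within the batch.
def batch_hard_triplet(feats, labels, alpha=0.3):
    d = torch.cdist(feats, feats)                           # pairwise Euclidean distances
    pos = labels.unsqueeze(0) == labels.unsqueeze(1)        # same-identity mask
    d_p = (d * pos.float()).max(1).values                   # farthest positive
    d_n = d.masked_fill(pos, float('inf')).min(1).values    # closest negative
    return torch.clamp(d_p - d_n + alpha, min=0).mean()

feats = torch.tensor([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]])
labels = torch.tensor([0, 0, 1, 1])
loss = batch_hard_triplet(feats, labels)
print(loss.item())   # well-separated classes -> 0.0
```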
The invention uses a text encoder with a Transformer as its backbone network to train an image encoder with a CNN or a Transformer as its backbone network: fix the parameters of the text encoder and the image encoder, set a descriptive text containing learnable parameters for each identity, and feed each image and its corresponding descriptive text into the image encoder and the text encoder; compute the text-to-image and image-to-text contrastive loss functions and train the learnable parameters in the descriptive text; fix the text encoder and the descriptive texts, generate and store the text features of each identity, and feed the images into the image encoder; compute the image-to-text cross-entropy loss function and train the image encoder; the features for pedestrian re-identification are obtained from the image encoder. The invention applies the language-image pre-training model CLIP to the ReID task for the first time and solves the problem that the text encoder in CLIP is difficult to exploit because the ReID task has no text labels. The method is simple, makes the result of finding the most similar images under other cameras in the gallery more accurate for each query image, and improves the mAP and Rank-1 metrics of pedestrian re-identification.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a flow chart of an embodiment of the present invention.
Detailed Description
In order to more clearly explain the technical means, technical improvements and advantageous effects of the present invention, the present invention is described in detail below with reference to the accompanying drawings.
Example 1
Referring to FIGS. 1-2, the invention learns a group of fuzzy descriptive texts for each identity using a text encoder with a Transformer as its backbone network, obtains the text features of all identities from the descriptive texts and the text encoder, trains an image encoder with a CNN or a Transformer as its backbone network, and finally obtains the features for pedestrian re-identification from the image encoder. The specific steps are as follows:
S1: for the MSMT17 dataset, with 1041 identities in the training set and 3060 identities in the test set, set a group of descriptive texts containing learnable parameters for each of the 1041 identities during the training stage, the template of the descriptive text being "A photo of a [X]_1 [X]_2 [X]_3 ... [X]_M person", where [X]_m (m ∈ {1, ..., M}) is the corresponding learnable token parameter and M is set to 5;
S2: fix the parameters of the image encoder and the text encoder, and feed each image and its corresponding descriptive text into the image encoder and the text encoder;
S3: compute the image-to-text and text-to-image contrastive loss functions L_i2t and L_t2i, and train the learnable parameters in the descriptive text; the corresponding formula is as follows:
s(V_i, T_i) = V_i · T_i = g_I(img_i) · g_T(text_i)    (a)
where img_i is the classification token [CLS] output by the image encoder for the i-th image, text_i is the end-of-sequence token [EOS] output by the text encoder for the corresponding descriptive text, g_I and g_T are the linear layers that map the [CLS] token and the [EOS] token into the same space, finally yielding the image feature V_i and the text feature T_i, and s(V_i, T_i) is the similarity between the image feature V_i and the text feature T_i; B is the number of images in the current batch, a is an index within the current batch, y_i is the identity label of the i-th image, P(y_i) is the index set of all images in the same batch belonging to identity y_i, and |P(y_i)| is the number of images in this set;
S4: fix the text encoder and the descriptive texts, generate and store the text features of each identity, and feed the images into the image encoder;
S5: compute the image-to-text cross-entropy loss function L_i2tce, together with the identity loss function L_id and the triplet loss function L_tri commonly used in pedestrian re-identification, and train the image encoder;
q_k = (1 - ε) δ_{k,y} + ε/N    (e)
L_tri = max(d_p - d_n + α, 0)    (g)
where N is the number of identities in the training set (for the MSMT17 dataset, N = 1041), k indexes all identities in the dataset, q_k is the smoothed label representing the expected probability that the current image belongs to the k-th identity, δ_{k,y} is the Kronecker delta, equal to 1 when k = y and 0 otherwise, and ε is a parameter controlling the smoothing strength; p_k is the probability of belonging to class k predicted by the network, d_p and d_n are the distances to the hardest positive sample and the hardest negative sample, and α is the margin threshold set for L_tri, here set to 0.3;
S6: feed the test-set images through the trained image encoder to obtain the corresponding image features; for each image in the query set of the test set, find the most similar images under other cameras in the gallery, and compute the mAP and Rank-1 metrics. Finally, 63.0% mAP and 84.4% Rank-1 are obtained when a CNN is used as the image encoder backbone network, and 75.8% mAP and 89.7% Rank-1 when a Transformer is used.
The method applies a language-image pre-training model to the re-identification task, is simple, resolves the lack of textual label descriptions in the pedestrian re-identification task, and improves accuracy. The invention is not limited to the above preferred embodiments; any modifications, equivalent replacements and improvements within the spirit and principle of the invention shall fall within the protection scope of the invention.
Claims (3)
1. A pedestrian re-identification method based on the contrastive language-image pre-training model CLIP, characterized in that a text encoder with a Transformer as its backbone network is used to train an image encoder with a CNN or a Transformer as its backbone network, and the image features generated by the image encoder are used for pedestrian re-identification, the method comprising the following specific steps:
Step 1: for the MSMT17 dataset, which contains thousands of identities, set a group of descriptive texts containing learnable parameters for each identity in the training set during the training stage, the template of the descriptive text being "A photo of a [X]_1 [X]_2 [X]_3 ... [X]_M person", where [X]_m (m ∈ {1, ..., M}) is the corresponding learnable token parameter;
Step 2: fix the parameters of the image encoder and the text encoder, and feed each image and its corresponding descriptive text into the image encoder and the text encoder;
Step 3: compute the image-to-text and text-to-image contrastive loss functions L_i2t and L_t2i, and train the learnable parameters in the descriptive text; the corresponding formula is as follows:
s(V_i, T_i) = V_i · T_i = g_I(img_i) · g_T(text_i)    (a)
where img_i is the classification token [CLS] output by the image encoder for the i-th image, text_i is the end-of-sequence token [EOS] output by the text encoder for the corresponding descriptive text, g_I and g_T are the linear layers that map the [CLS] token and the [EOS] token into the same space, finally yielding the image feature V_i and the text feature T_i, and s(V_i, T_i) is the similarity between the image feature V_i and the text feature T_i; B is the number of images in the current batch, a is an index within the current batch, y_i is the identity label of the i-th image, P(y_i) is the index set of all images in the same batch belonging to identity y_i, and |P(y_i)| is the number of images in this set;
Step 4: fix the text encoder and the descriptive texts, generate and store the text features of each identity, and feed the images into the image encoder;
Step 5: compute the image-to-text cross-entropy loss function and train the image encoder;
q_k = (1 - ε) δ_{k,y} + ε/N    (e)
where N is the number of identities in the training set, k indexes all identities in the dataset, q_k is the smoothed label representing the expected probability that the current image belongs to the k-th identity, δ_{k,y} is the Kronecker delta, equal to 1 when k = y and 0 otherwise, and ε is a parameter controlling the smoothing strength;
Step 6: in the testing stage, feed the test-set images into the trained image encoder to obtain the corresponding image features and perform pedestrian re-identification: for each image in the query set, find the most similar images under other cameras in the gallery, and compute the mAP and Rank-1 metrics.
2. The method of claim 1, characterized in that the backbone networks are specifically selected as follows: the backbone network used by the text encoder is an 8-layer Transformer network, and the backbone network of the image encoder is either ResNet-50 (a CNN) or ViT-B/16 (a Transformer).
3. The pedestrian re-identification method according to claim 1, characterized in that the descriptive texts containing the learnable parameters do not share learnable parameters across different identities; each descriptive text serves as an ambiguous description of its identity to supplement the textual description missing from the re-identification task.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211173432.3A CN115393902A (en) | 2022-09-26 | 2022-09-26 | Pedestrian re-identification method based on comparison language image pre-training model CLIP |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115393902A true CN115393902A (en) | 2022-11-25 |
Family
ID=84129348
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116701637A (en) * | 2023-06-29 | 2023-09-05 | 中南大学 | Zero sample text classification method, system and medium based on CLIP |
CN116701637B (en) * | 2023-06-29 | 2024-03-08 | 中南大学 | Zero sample text classification method, system and medium based on CLIP |
CN117079048A (en) * | 2023-08-29 | 2023-11-17 | 贵州电网有限责任公司 | Geological disaster image recognition method and system based on CLIP model |
CN117079048B (en) * | 2023-08-29 | 2024-05-14 | 贵州电网有限责任公司 | Geological disaster image recognition method and system based on CLIP model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109086756B (en) | Text detection analysis method, device and equipment based on deep neural network | |
CN114241282B (en) | Knowledge distillation-based edge equipment scene recognition method and device | |
CN111985239B (en) | Entity identification method, entity identification device, electronic equipment and storage medium | |
CN115393902A (en) | Pedestrian re-identification method based on comparison language image pre-training model CLIP | |
CN110555475A (en) | few-sample target detection method based on semantic information fusion | |
CN109670494B (en) | Text detection method and system with recognition confidence | |
CN110851641B (en) | Cross-modal retrieval method and device and readable storage medium | |
CN111582126B (en) | Pedestrian re-recognition method based on multi-scale pedestrian contour segmentation fusion | |
CN111680484B (en) | Answer model generation method and system for visual general knowledge reasoning question and answer | |
CN111598041A (en) | Image generation text method for article searching | |
CN111738169A (en) | Handwriting formula recognition method based on end-to-end network model | |
CN112712052A (en) | Method for detecting and identifying weak target in airport panoramic video | |
CN115131753A (en) | Heterogeneous multi-task cooperative system in automatic driving scene | |
CN114722822B (en) | Named entity recognition method, named entity recognition device, named entity recognition equipment and named entity recognition computer readable storage medium | |
CN116994021A (en) | Image detection method, device, computer readable medium and electronic equipment | |
WO2021051502A1 (en) | Long short-term memory-based teaching method and apparatus, and computer device | |
CN116450829A (en) | Medical text classification method, device, equipment and medium | |
CN110929013A (en) | Image question-answer implementation method based on bottom-up entry and positioning information fusion | |
CN109409359A (en) | A kind of method for extracting video captions based on deep learning | |
CN115186670A (en) | Method and system for identifying domain named entities based on active learning | |
CN114898290A (en) | Real-time detection method and system for marine ship | |
CN114357166A (en) | Text classification method based on deep learning | |
CN113362088A (en) | CRNN-based telecommunication industry intelligent customer service image identification method and system | |
CN116384439B (en) | Target detection method based on self-distillation | |
CN113792703B (en) | Image question-answering method and device based on Co-Attention depth modular network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||