CN115393902A - Pedestrian re-identification method based on contrastive language-image pre-training model CLIP - Google Patents

Pedestrian re-identification method based on contrastive language-image pre-training model CLIP

Info

Publication number
CN115393902A
Authority
CN
China
Prior art keywords
text
image
encoder
training
pedestrian
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211173432.3A
Other languages
Chinese (zh)
Inventor
孙力
李思源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202211173432.3A priority Critical patent/CN115393902A/en
Publication of CN115393902A publication Critical patent/CN115393902A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a pedestrian re-identification method based on the contrastive language-image pre-training model CLIP, which trains an image encoder that uses a CNN or a Transformer as its backbone network. The parameters of the text encoder and the image encoder are fixed, a description text containing learnable parameters is set for each identity, and the images and the corresponding description texts are fed into the image encoder and the text encoder; the text-to-image and image-to-text contrastive loss functions are computed to train the learnable parameters in the description texts. The text encoder and the description texts are then fixed, text features are generated and stored for each identity, and the images are fed into the image encoder; an image-to-text cross-entropy loss function is computed to train the image encoder. The features for pedestrian re-identification are finally obtained from the image encoder. Compared with the prior art, the method applies a language-image pre-training model to the re-identification task, is simple, solves the problem that the pedestrian re-identification task lacks text label descriptions, and improves accuracy.

Description

Pedestrian re-identification method based on contrastive language-image pre-training model CLIP
Technical Field
The invention relates to the technical field of computer vision and digital image processing, and in particular to a pedestrian re-identification method based on the Contrastive Language-Image Pre-training model CLIP.
Background
The goal of the pedestrian re-identification task is to match the same subject across different camera views. Past pedestrian re-identification work usually adopts a convolutional neural network (CNN) as the backbone network, which tends to focus excessively on local information and lacks the ability to attend to the body as a whole. To solve this problem, many methods introduce prior knowledge into the network, for example, using different branches, each of which focuses on different local region features for identification, or introducing semantic segmentation results to help the network distinguish different body parts and mine further features. In addition, recent methods expand the attention region by using an attention mechanism or by adopting a Transformer network as the backbone network.
Whether the backbone is a CNN or a Transformer, pre-training is very important for the re-identification task. Most backbone networks previously used for re-identification are pre-trained on the ImageNet classification task, while recent cross-modal language-image pre-training models such as CLIP change the pre-training task to relate visual information to language descriptions, so that the image model can perceive high-level semantics from text and learn transferable features. These models are trained on much larger data sets of image-text pairs and adapt better to downstream tasks such as image classification and image segmentation. However, in the pedestrian re-identification task, the label is only an index and lacks a concrete text description, so the description text required by a language-image model cannot be generated in a straightforward way, and it is difficult to make full use of the text encoder.
Therefore, it is necessary to provide a pedestrian re-identification method based on the contrastive language-image pre-training model CLIP.
Disclosure of Invention
The invention aims to provide a pedestrian re-identification method based on the contrastive language-image pre-training model CLIP for the pedestrian re-identification task, which lacks text labels.
The purpose of the invention is realized as follows:
a pedestrian re-identification method based on a comparison language image pre-training model CLIP is characterized in that a text encoder using a Transformer as a backbone network is used for training an image encoder using CNN or the Transformer as the backbone network, and the image encoder is used for generating image features for pedestrian re-identification, and the method comprises the following specific steps:
Step 1: for the MSMT17 data set, which contains thousands of identities, a set of description texts containing learnable parameters is set for each identity in the training set during the training stage. The template of the description text is "A photo of a [X]_1 [X]_2 [X]_3 ... [X]_M person", where [X]_m (m ∈ 1...M) is the corresponding learnable token parameter;
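As a minimal sketch (not the patented implementation), the identity-specific prompt of Step 1 could be built in PyTorch roughly as follows; the class name, the default of M = 5 tokens (taken from the embodiment below), the embedding dimension, and the way the fixed words of the template are embedded are all assumptions:

```python
import torch
import torch.nn as nn

class IdentityPrompts(nn.Module):
    """One set of M learnable context tokens per identity, filling the template
    "A photo of a [X]_1 ... [X]_M person". A sketch under assumed shapes."""
    def __init__(self, num_ids: int, m_tokens: int = 5, embed_dim: int = 512):
        super().__init__()
        # [X]_1 ... [X]_M: independent learnable token embeddings for every identity
        self.ctx = nn.Parameter(torch.empty(num_ids, m_tokens, embed_dim))
        nn.init.normal_(self.ctx, std=0.02)

    def forward(self, prefix_emb, suffix_emb, identity_ids):
        # prefix_emb: embeddings of the fixed words "A photo of a", shape (P, D)
        # suffix_emb: embeddings of the fixed word "person" (plus end token), shape (S, D)
        # identity_ids: (B,) identity indices of the current batch
        ctx = self.ctx[identity_ids]                        # (B, M, D)
        batch = ctx.size(0)
        prefix = prefix_emb.unsqueeze(0).expand(batch, -1, -1)
        suffix = suffix_emb.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, ctx, suffix], dim=1)      # token sequence for the text encoder
```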
Step 2: fix the parameters of the image encoder and the text encoder, and feed the images and the corresponding description texts into the image encoder and the text encoder;
Step 3: compute the image-to-text and text-to-image contrastive loss functions L_i2t and L_t2i to train the learnable parameters in the description texts, with the corresponding formulas as follows;
s(V_i, T_i) = V_i · T_i = g_I(img_i) · g_T(text_i)    (a)
L_i2t(i) = -1/|P(y_i)| · Σ_{p∈P(y_i)} log[ exp(s(V_i, T_{y_p})) / Σ_{a=1}^{B} exp(s(V_i, T_{y_a})) ]    (b)
L_t2i(y_i) = -1/|P(y_i)| · Σ_{p∈P(y_i)} log[ exp(s(V_p, T_{y_i})) / Σ_{a=1}^{B} exp(s(V_a, T_{y_i})) ]    (c)
wherein img_i is the classification token [CLS] output by the image encoder for the i-th image, text_i is the output token [EOS] of the corresponding description text after passing through the text encoder, and g_I and g_T are the linear layers that map the [CLS] token and the [EOS] token into the same space, finally yielding the image feature V_i and the text feature T_i; s(V_i, T_i) is the similarity between the image feature V_i and the text feature T_i; B is the number of images contained in the current batch, a is the index within the current batch, y_i is the identity label of the i-th image, P(y_i) denotes the index set of all images in the same batch belonging to identity y_i, and |P(y_i)| denotes the number of images contained in this set;
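A short PyTorch sketch of formulas (b) and (c), assuming V holds the projected and L2-normalized image features of the batch (B × C) and T holds the projected and L2-normalized text features aligned per image, i.e. T[a] is the feature of identity y_a; the function name is an assumption:

```python
import torch
import torch.nn.functional as F

def prompt_contrastive_losses(V, T, y):
    """Sketch of L_i2t and L_t2i from formulas (b) and (c).
    V: (B, C) image features; T: (B, C) text features with T[a] = feature of identity y_a;
    y: (B,) identity labels. Both feature sets are assumed L2-normalized."""
    sim = V @ T.t()                                   # sim[i, a] = s(V_i, T_{y_a})
    same_id = y.unsqueeze(0) == y.unsqueeze(1)        # mask of P(y_i): images sharing an identity

    log_p_i2t = F.log_softmax(sim, dim=1)             # image -> text: softmax over text features
    log_p_t2i = F.log_softmax(sim, dim=0)             # text -> image: softmax over image features

    pos_cnt = same_id.sum(dim=1).clamp(min=1)         # |P(y_i)|
    L_i2t = -(log_p_i2t * same_id).sum(dim=1) / pos_cnt
    L_t2i = -(log_p_t2i * same_id).sum(dim=0) / pos_cnt
    return L_i2t.mean(), L_t2i.mean()
```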
Step 4: fix the text encoder and the description texts, generate and store the text features of each identity, and feed the images into the image encoder;
Step 5: compute the image-to-text cross-entropy loss function and train the image encoder;
L_i2tce(i) = Σ_{k=1}^{N} -q_k · log[ exp(s(V_i, T_k)) / Σ_{j=1}^{N} exp(s(V_i, T_j)) ]    (d)
q_k = (1 - ε)·δ_{k,y} + ε/N    (e)
where N is the number of identities contained in the training set of the data set, k is the index over all identities in the data set, q_k is the smoothed label representing the expected probability that the current image belongs to the k-th identity, δ_{k,y} is an impulse function equal to 1 when k = y and 0 otherwise, and ε is a parameter controlling the degree of smoothing;
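A sketch of the label-smoothed image-to-text cross-entropy of formulas (d)-(e), assuming the text features of all N identities have already been generated and stored as described in Step 4; the function name and the choice of ε are illustrative:

```python
import torch
import torch.nn.functional as F

def i2t_cross_entropy(V, T_all, y, epsilon=0.1):
    """V: (B, C) image features; T_all: (N, C) stored text features of all identities;
    y: (B,) identity labels; epsilon: smoothing parameter (value assumed)."""
    N = T_all.size(0)
    sim = V @ T_all.t()                               # sim[i, k] = s(V_i, T_k)
    log_p = F.log_softmax(sim, dim=1)
    # q_k = (1 - eps) * delta_{k,y} + eps / N   (formula (e))
    q = torch.full_like(log_p, epsilon / N)
    q.scatter_(1, y.unsqueeze(1), 1.0 - epsilon + epsilon / N)
    return -(q * log_p).sum(dim=1).mean()
```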
Step 6: in the testing stage, feed the test-set images into the trained image encoder to obtain the corresponding image features and perform pedestrian re-identification: for each image in the query set, find the most similar images under other cameras in the gallery, and compute the mAP and Rank-1 metrics.
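For the test stage, a hedged sketch of how the Rank-1 metric could be computed from the extracted features, excluding gallery images taken by the same camera as the query, as this step describes; all variable names are assumptions, and mAP would be computed from the same ranking lists:

```python
import torch
import torch.nn.functional as F

def rank1(query_feats, gallery_feats, query_ids, gallery_ids, query_cams, gallery_cams):
    """Cosine-similarity retrieval; returns the Rank-1 accuracy over the query set."""
    q = F.normalize(query_feats, dim=1)
    g = F.normalize(gallery_feats, dim=1)
    sim = q @ g.t()                                   # (num_query, num_gallery)
    hits = 0
    for i in range(sim.size(0)):
        valid = gallery_cams != query_cams[i]         # keep only images from other cameras
        order = sim[i].masked_fill(~valid, float('-inf')).argsort(descending=True)
        hits += int(gallery_ids[order[0]] == query_ids[i])
    return hits / sim.size(0)
```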
In the method, a text encoder using a Transformer as its backbone network is used to train an image encoder using a CNN (convolutional neural network) or a Transformer as its backbone network. Specifically, the backbone network used by the text encoder is an 8-layer Transformer network, and the backbone network of the image encoder is selected as ResNet-50 (a CNN network) or ViT-B/16 (a Transformer network).
The description texts containing the learnable parameters do not share learnable parameters among different identities; each serves as a fuzzy description of its identity to supplement the text descriptions that the re-identification task lacks. For a data set containing N identities, the text features of all identities have dimension N × C.
In addition, a set of learnable token parameters is set in advance for each camera in the data set; the token of the corresponding camera is added to the classification [CLS] token before the image is fed into the image encoder, as sketched below.
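One possible form of this per-camera token, shown purely as an assumption about how it could be added to the [CLS] token:

```python
import torch
import torch.nn as nn

class CameraToken(nn.Module):
    """A learnable embedding per camera, added to the classification [CLS] token (a sketch)."""
    def __init__(self, num_cameras: int, embed_dim: int = 768):
        super().__init__()
        self.cam_embed = nn.Parameter(torch.zeros(num_cameras, embed_dim))

    def forward(self, cls_token, cam_ids):
        # cls_token: (B, D) classification tokens; cam_ids: (B,) camera index of each image
        return cls_token + self.cam_embed[cam_ids]
```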
When the image-to-text cross-entropy loss function is computed to train the image encoder, the identity loss function L_id and the triplet loss function L_tri commonly used for pedestrian re-identification are also computed, as follows:
L_id = Σ_{k=1}^{N} -q_k · log(p_k)    (f)
L_tri = max(d_p - d_n + α, 0)    (g)
wherein p_k is the probability of belonging to class k predicted by the network, d_p and d_n denote the distances to the hardest positive sample and the hardest negative sample, respectively, and α is the margin threshold set for L_tri.
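A sketch of the triplet loss of formula (g) with batch-hard mining of the hardest positive and hardest negative; the margin value of 0.3 follows the embodiment below, and the function name is an assumption:

```python
import torch

def batch_hard_triplet_loss(feats, labels, margin=0.3):
    """feats: (B, C) image features; labels: (B,) identity labels."""
    dist = torch.cdist(feats, feats)                                  # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    d_p = dist.masked_fill(~same, float('-inf')).max(dim=1).values    # hardest positive distance
    d_n = dist.masked_fill(same, float('inf')).min(dim=1).values      # hardest negative distance
    return torch.clamp(d_p - d_n + margin, min=0).mean()
```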
The invention uses a text encoder with a Transformer backbone network to train an image encoder with a CNN or Transformer backbone network: the parameters of the text encoder and the image encoder are fixed, a description text containing learnable parameters is set for each identity, and the images and the corresponding description texts are fed into the image encoder and the text encoder; the text-to-image and image-to-text contrastive loss functions are computed to train the learnable parameters in the description texts; the text encoder and the description texts are then fixed, the text features of each identity are generated and stored, and the images are fed into the image encoder; the image-to-text cross-entropy loss function is computed to train the image encoder; and the features for pedestrian re-identification are obtained from the image encoder. The invention applies the language-image pre-training model CLIP to the ReID task for the first time and solves the problem that the text encoder in CLIP is difficult to utilize because the ReID task has no text labels. The method is simple, makes the result of finding the most similar images under other cameras in the gallery more accurate for each query image, and improves the mAP and Rank-1 metrics of pedestrian re-identification.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a flow chart of an embodiment of the present invention.
Detailed Description
In order to more clearly explain the technical means, technical improvements and advantageous effects of the present invention, the present invention is described in detail below with reference to the accompanying drawings.
Example 1
Referring to FIG. 1 and FIG. 2, the present invention learns a set of fuzzy description texts for each identity using a text encoder with a Transformer backbone network, obtains the text features of all identities from the description texts and the text encoder, trains an image encoder with a CNN or Transformer backbone network, and finally obtains the features for pedestrian re-identification from the image encoder. The specific steps are as follows:
S1: for the MSMT17 data set, which contains 1041 identities in the training set and 3060 identities in the test set, a set of description texts containing learnable parameters is set for each of the 1041 training identities during the training stage. The template of the description text is "A photo of a [X]_1 [X]_2 [X]_3 ... [X]_M person", where [X]_m (m ∈ 1...M) is the corresponding learnable token parameter and M is set to 5;
S2: fixing parameters of an image encoder and a text encoder, and sending the image and the corresponding description text into the image encoder and the text encoder;
S3: compute the image-to-text and text-to-image contrastive loss functions L_i2t and L_t2i to train the learnable parameters in the description texts, with the corresponding formulas as follows;
s(V_i, T_i) = V_i · T_i = g_I(img_i) · g_T(text_i)    (a)
L_i2t(i) = -1/|P(y_i)| · Σ_{p∈P(y_i)} log[ exp(s(V_i, T_{y_p})) / Σ_{a=1}^{B} exp(s(V_i, T_{y_a})) ]    (b)
L_t2i(y_i) = -1/|P(y_i)| · Σ_{p∈P(y_i)} log[ exp(s(V_p, T_{y_i})) / Σ_{a=1}^{B} exp(s(V_a, T_{y_i})) ]    (c)
wherein img_i is the classification token [CLS] output by the image encoder for the i-th image, text_i is the output token [EOS] of the corresponding description text after passing through the text encoder, and g_I and g_T are the linear layers that map the [CLS] token and the [EOS] token into the same space, finally yielding the image feature V_i and the text feature T_i; s(V_i, T_i) is the similarity between the image feature V_i and the text feature T_i; B is the number of images contained in the current batch, a is the index within the current batch, y_i is the identity label of the i-th image, P(y_i) denotes the index set of all images in the same batch belonging to identity y_i, and |P(y_i)| denotes the number of images contained in this set;
S4: fix the text encoder and the description texts, generate and store the text features of each identity, and feed the images into the image encoder;
S5: compute the image-to-text cross-entropy loss function L_i2tce, together with the identity loss function L_id and the triplet loss function L_tri commonly used for pedestrian re-identification, to train the image encoder;
L_i2tce(i) = Σ_{k=1}^{N} -q_k · log[ exp(s(V_i, T_k)) / Σ_{j=1}^{N} exp(s(V_i, T_j)) ]    (d)
q_k = (1 - ε)·δ_{k,y} + ε/N    (e)
L_id = Σ_{k=1}^{N} -q_k · log(p_k)    (f)
L_tri = max(d_p - d_n + α, 0)    (g)
where N is the number of identities in the training set of the data set (for the MSMT17 data set, N = 1041), k is the index over all identities in the data set, q_k is the smoothed label representing the expected probability that the current image belongs to the k-th identity, δ_{k,y} is an impulse function equal to 1 when k = y and 0 otherwise, ε is a parameter controlling the degree of smoothing, p_k is the probability of belonging to class k predicted by the network, d_p and d_n denote the distances to the hardest positive sample and the hardest negative sample, respectively, and α is the margin threshold set for L_tri, which is set to 0.3;
S6: obtain the corresponding image features of the test-set images through the trained image encoder, find the most similar images under other cameras in the gallery for each image in the query set, and compute the mAP and Rank-1 metrics. Finally, 63.0% mAP and 84.4% Rank-1 are obtained when the CNN is used as the image encoder backbone network, and 75.8% mAP and 89.7% Rank-1 are obtained when the Transformer is used as the image encoder backbone network.
The method applies the language-image pre-training model to the re-identification task, is simple, solves the problem that the pedestrian re-identification task lacks text label descriptions, and improves accuracy. The present invention is not limited to the above preferred embodiment, and any modifications, equivalent replacements and improvements made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (3)

1. A pedestrian re-identification method based on the contrastive language-image pre-training model CLIP, characterized in that a text encoder using a Transformer as its backbone network is used to train an image encoder using a CNN or a Transformer as its backbone network, and the image features generated by the image encoder are used to carry out pedestrian re-identification, the method comprising the following specific steps:
Step 1: for the MSMT17 data set, which contains thousands of identities, a set of description texts containing learnable parameters is set for each identity in the training set during the training stage. The template of the description text is "A photo of a [X]_1 [X]_2 [X]_3 ... [X]_M person", where [X]_m (m ∈ 1...M) is the corresponding learnable token parameter;
Step 2: fix the parameters of the image encoder and the text encoder, and feed the image and the corresponding description text into the image encoder and the text encoder;
Step 3: compute the image-to-text and text-to-image contrastive loss functions L_i2t and L_t2i to train the learnable parameters in the description texts, with the corresponding formulas as follows;
s(V_i, T_i) = V_i · T_i = g_I(img_i) · g_T(text_i)    (a)
L_i2t(i) = -1/|P(y_i)| · Σ_{p∈P(y_i)} log[ exp(s(V_i, T_{y_p})) / Σ_{a=1}^{B} exp(s(V_i, T_{y_a})) ]    (b)
L_t2i(y_i) = -1/|P(y_i)| · Σ_{p∈P(y_i)} log[ exp(s(V_p, T_{y_i})) / Σ_{a=1}^{B} exp(s(V_a, T_{y_i})) ]    (c)
wherein img_i is the classification token [CLS] output by the image encoder for the i-th image, text_i is the output token [EOS] of the corresponding description text after passing through the text encoder, and g_I and g_T are the linear layers that map the [CLS] token and the [EOS] token into the same space, finally yielding the image feature V_i and the text feature T_i; s(V_i, T_i) is the similarity between the image feature V_i and the text feature T_i; B is the number of images contained in the current batch, a is the index within the current batch, y_i is the identity label of the i-th image, P(y_i) denotes the index set of all images in the same batch belonging to identity y_i, and |P(y_i)| denotes the number of images contained in this set;
Step 4: fix the text encoder and the description texts, generate and store the text features of each identity, and feed the images into the image encoder;
Step 5: compute the image-to-text cross-entropy loss function and train the image encoder;
L_i2tce(i) = Σ_{k=1}^{N} -q_k · log[ exp(s(V_i, T_k)) / Σ_{j=1}^{N} exp(s(V_i, T_j)) ]    (d)
q_k = (1 - ε)·δ_{k,y} + ε/N    (e)
where N is the number of identities contained in the training set of the data set, k is the index over all identities in the data set, q_k is the smoothed label representing the expected probability that the current image belongs to the k-th identity, δ_{k,y} is an impulse function equal to 1 when k = y and 0 otherwise, and ε is a parameter controlling the degree of smoothing;
Step 6: in the testing stage, feed the test-set images into the trained image encoder to obtain the corresponding image features and perform pedestrian re-identification: for each image in the query set, find the most similar images under other cameras in the gallery, and compute the mAP and Rank-1 metrics.
2. The method of claim 1, wherein the image encoder using the CNN or the Transformer as the backbone network is trained with a text encoder using a Transformer as the backbone network; the backbone network used by the text encoder is an 8-layer Transformer network, and the backbone network of the image encoder is specifically selected as ResNet-50 (a CNN network) or ViT-B/16 (a Transformer network).
3. The pedestrian re-identification method according to claim 1, wherein the description texts containing the learnable parameters do not share learnable parameters among different identities; each description text serves as a fuzzy description of its identity to supplement the text descriptions that the re-identification task lacks.
CN202211173432.3A 2022-09-26 2022-09-26 Pedestrian re-identification method based on comparison language image pre-training model CLIP Pending CN115393902A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211173432.3A CN115393902A (en) 2022-09-26 2022-09-26 Pedestrian re-identification method based on comparison language image pre-training model CLIP

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211173432.3A CN115393902A (en) 2022-09-26 2022-09-26 Pedestrian re-identification method based on comparison language image pre-training model CLIP

Publications (1)

Publication Number Publication Date
CN115393902A true CN115393902A (en) 2022-11-25

Family

ID=84129348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211173432.3A Pending CN115393902A (en) 2022-09-26 2022-09-26 Pedestrian re-identification method based on comparison language image pre-training model CLIP

Country Status (1)

Country Link
CN (1) CN115393902A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116701637A (en) * 2023-06-29 2023-09-05 中南大学 Zero sample text classification method, system and medium based on CLIP
CN116701637B (en) * 2023-06-29 2024-03-08 中南大学 Zero sample text classification method, system and medium based on CLIP
CN117079048A (en) * 2023-08-29 2023-11-17 贵州电网有限责任公司 Geological disaster image recognition method and system based on CLIP model
CN117079048B (en) * 2023-08-29 2024-05-14 贵州电网有限责任公司 Geological disaster image recognition method and system based on CLIP model

Similar Documents

Publication Publication Date Title
CN109086756B (en) Text detection analysis method, device and equipment based on deep neural network
CN114241282B (en) Knowledge distillation-based edge equipment scene recognition method and device
CN111985239B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN115393902A (en) Pedestrian re-identification method based on comparison language image pre-training model CLIP
CN110555475A (en) few-sample target detection method based on semantic information fusion
CN109670494B (en) Text detection method and system with recognition confidence
CN110851641B (en) Cross-modal retrieval method and device and readable storage medium
CN111582126B (en) Pedestrian re-recognition method based on multi-scale pedestrian contour segmentation fusion
CN111680484B (en) Answer model generation method and system for visual general knowledge reasoning question and answer
CN111598041A (en) Image generation text method for article searching
CN111738169A (en) Handwriting formula recognition method based on end-to-end network model
CN112712052A (en) Method for detecting and identifying weak target in airport panoramic video
CN115131753A (en) Heterogeneous multi-task cooperative system in automatic driving scene
CN114722822B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and named entity recognition computer readable storage medium
CN116994021A (en) Image detection method, device, computer readable medium and electronic equipment
WO2021051502A1 (en) Long short-term memory-based teaching method and apparatus, and computer device
CN116450829A (en) Medical text classification method, device, equipment and medium
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN109409359A (en) A kind of method for extracting video captions based on deep learning
CN115186670A (en) Method and system for identifying domain named entities based on active learning
CN114898290A (en) Real-time detection method and system for marine ship
CN114357166A (en) Text classification method based on deep learning
CN113362088A (en) CRNN-based telecommunication industry intelligent customer service image identification method and system
CN116384439B (en) Target detection method based on self-distillation
CN113792703B (en) Image question-answering method and device based on Co-Attention depth modular network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination