CN115393902A - Pedestrian re-identification method based on comparison language image pre-training model CLIP - Google Patents
- Publication number
- CN115393902A CN115393902A CN202211173432.3A CN202211173432A CN115393902A CN 115393902 A CN115393902 A CN 115393902A CN 202211173432 A CN202211173432 A CN 202211173432A CN 115393902 A CN115393902 A CN 115393902A
- Authority
- CN
- China
- Prior art keywords
- text
- image
- encoder
- training
- pedestrian
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a pedestrian re-identification method based on the contrastive language-image pre-training model CLIP, which trains an image encoder that uses a CNN or a Transformer as its backbone network: fix the parameters of the text encoder and the image encoder, set a descriptive text containing learnable parameters for each identity, and feed each image and its corresponding descriptive text into the image encoder and the text encoder; compute the text-to-image and image-to-text contrastive loss functions and train the learnable parameters in the descriptive text; fix the text encoder and the descriptive texts, generate and store the text features of each identity, and feed the images into the image encoder; compute an image-to-text cross-entropy loss function and train the image encoder; the features for pedestrian re-identification are obtained from the image encoder. Compared with the prior art, the method applies a language-image pre-training model to the re-identification task, is simple, resolves the lack of textual label descriptions in the pedestrian re-identification task, and improves accuracy.
Description
Technical Field
The invention relates to the technical field of computer vision and digital image processing, and in particular to a pedestrian re-identification method based on the Contrastive Language-Image Pre-training model CLIP.
Background
The goal of the pedestrian re-identification task is to match the same person across different camera views. Previous pedestrian re-identification work usually adopts a convolutional neural network (CNN) as the backbone network, which tends to over-focus on local information and lacks the ability to attend to the whole body. To address this problem, many methods introduce prior knowledge into the network: for example, using multiple branches, each attending to different local-region features for identification, or introducing semantic segmentation results to help the network distinguish different body parts and mine features further. In addition, recent methods enlarge the attended region by using an attention mechanism or by adopting a Transformer network as the backbone.
Whether based on CNNs or Transformer networks, pre-training is very important for the re-identification task. Most backbone networks previously used for re-identification are pre-trained on the ImageNet classification task, whereas recent cross-modal language-image pre-training models such as CLIP change the pre-training task to relate visual information to language descriptions, so that the image model can perceive high-level semantics from text and learn transferable features. These models are trained on larger datasets of image-text pairs and adapt better to downstream tasks such as image classification and image segmentation. In the pedestrian re-identification task, however, the label is only an index without a concrete text description, so the descriptive text required by a language-image model cannot simply be generated, and it is difficult to fully exploit the text model.
Therefore, it is necessary to provide a pedestrian re-identification method based on the contrastive language-image pre-training model CLIP.
Disclosure of Invention
The invention aims to provide a pedestrian re-identification method based on the contrastive language-image pre-training model CLIP for the pedestrian re-identification task, which lacks text labels.
The purpose of the invention is realized as follows:
A pedestrian re-identification method based on the contrastive language-image pre-training model CLIP uses a text encoder with a Transformer as its backbone network to train an image encoder with a CNN or a Transformer as its backbone network, and uses the image encoder to generate image features for pedestrian re-identification. The method comprises the following specific steps:
Step 1: for the MSMT17 dataset, which contains thousands of identities, set a group of descriptive texts containing learnable parameters for each identity in the training set during the training stage, the template of the descriptive text being "A photo of a [X]_1 [X]_2 [X]_3 ... [X]_M person", where [X]_m (m ∈ {1, ..., M}) is the corresponding learnable token parameter;
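As an illustration, the per-identity learnable tokens of this template can be sketched as follows; the module name, embedding dimension, initialization, and prefix/suffix handling are hypothetical and not specified by the invention:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: independent learnable tokens [X]_1 ... [X]_M per identity,
# spliced between the embeddings of "A photo of a" and "person".
class IdentityPrompts(nn.Module):
    def __init__(self, num_identities: int, m_tokens: int = 5, dim: int = 512):
        super().__init__()
        # One independent set of M learnable token embeddings per identity
        self.tokens = nn.Parameter(torch.randn(num_identities, m_tokens, dim) * 0.02)

    def forward(self, identity_ids, prefix, suffix):
        # prefix: embeddings of "A photo of a"; suffix: embeddings of "person" + [EOS]
        x = self.tokens[identity_ids]                      # (B, M, dim)
        b = x.shape[0]
        return torch.cat([prefix.expand(b, -1, -1), x,
                          suffix.expand(b, -1, -1)], dim=1)

prompts = IdentityPrompts(num_identities=1041, m_tokens=5, dim=512)
prefix = torch.zeros(4, 512)   # stand-in for the embedded fixed prefix
suffix = torch.zeros(2, 512)   # stand-in for the embedded fixed suffix
out = prompts(torch.tensor([0, 3]), prefix, suffix)
print(out.shape)               # (2, 11, 512): 4 prefix + 5 learnable + 2 suffix
```

Because the tokens are ordinary parameters, only they receive gradients in Step 3 while both encoders stay frozen.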
Step 2: fix the parameters of the image encoder and the text encoder, and feed each image and its corresponding descriptive text into the image encoder and the text encoder;
Step 3: compute the image-to-text and text-to-image contrastive loss functions L_i2t and L_t2i, and train the learnable parameters in the descriptive text; the corresponding formula is as follows:
s(V_i, T_i) = V_i · T_i = g_I(img_i) · g_T(text_i)    (a)
where img_i is the classification token [CLS] output by the image encoder for the i-th image, text_i is the end-of-sequence token [EOS] output by the text encoder for the corresponding descriptive text, g_I and g_T are the linear layers that map the [CLS] token and the [EOS] token into the same space, finally yielding the image feature V_i and the text feature T_i, and s(V_i, T_i) is the similarity between the image feature V_i and the text feature T_i; B is the number of images in the current batch, a is an index within the current batch, y_i is the identity label of the i-th image, P(y_i) is the index set of all images in the same batch belonging to identity y_i, and |P(y_i)| is the number of images in this set;
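Equation (a) and contrastive losses consistent with the symbols defined above (batch size B, positive index set P(y_i)) can be sketched as follows. The text reproduces only equation (a), so the exact form of L_i2t and L_t2i shown here is an assumption in the style of supervised contrastive learning:

```python
import torch
import torch.nn.functional as F

# Hedged sketch: similarity s(V_i, T_j) as a dot product of normalized features,
# and image-to-text / text-to-image losses averaged over the positive set P(y_i).
def contrastive_losses(V, T, labels):
    V = F.normalize(V, dim=1)          # image features V_i
    T = F.normalize(T, dim=1)          # text features T_i
    sim = V @ T.t()                    # s(V_i, T_j) for the whole batch
    log_p_i2t = F.log_softmax(sim, dim=1)
    log_p_t2i = F.log_softmax(sim.t(), dim=1)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # mask for P(y_i)
    # Average negative log-probability over each anchor's positive set
    l_i2t = -(log_p_i2t * pos).sum(1) / pos.sum(1)
    l_t2i = -(log_p_t2i * pos).sum(1) / pos.sum(1)
    return l_i2t.mean(), l_t2i.mean()

V = torch.randn(8, 512)
T = torch.randn(8, 512)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
li, lt = contrastive_losses(V, T, labels)
```

A temperature scaling of the similarities, as in CLIP, could be added; it is omitted here because the text does not state one.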
Step 4: fix the text encoder and the descriptive texts, generate and store the text features of each identity, and feed the images into the image encoder;
Step 5: compute the image-to-text cross-entropy loss function and train the image encoder;
q_k = (1 - ε) δ_{k,y} + ε/N    (e)
where N is the number of identities in the training set, k indexes all identities in the dataset, q_k is the smoothed label representing the expected probability that the current image belongs to the k-th identity, δ_{k,y} is the Kronecker delta, equal to 1 when k = y and 0 otherwise, and ε is a parameter controlling the smoothing strength;
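Equation (e) and the image-to-text cross entropy against the stored per-identity text features can be sketched as follows; ε = 0.1 and the batch shapes are assumed values, not taken from the text:

```python
import torch
import torch.nn.functional as F

# Sketch of equation (e): smoothed target q_k = (1-ε)δ_{k,y} + ε/N,
# then cross entropy between q and the softmax over identity similarities.
def smoothed_cross_entropy(logits, y, eps=0.1):
    N = logits.shape[1]                         # number of identities
    q = torch.full_like(logits, eps / N)        # ε/N for every class k
    q.scatter_(1, y.unsqueeze(1), 1 - eps + eps / N)  # (1-ε) + ε/N at k = y
    return -(q * F.log_softmax(logits, dim=1)).sum(1).mean()

logits = torch.randn(4, 1041)   # similarities of 4 images to N = 1041 identity texts
y = torch.tensor([0, 5, 7, 1040])
loss = smoothed_cross_entropy(logits, y)
```

Each row of q sums to 1, so this reduces to ordinary cross entropy when ε = 0.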
Step 6: in the testing stage, feed the test-set images into the trained image encoder to obtain the corresponding image features and perform pedestrian re-identification: for each image in the query set of the test set, find the most similar images under other cameras in the gallery, and compute the mAP and Rank-1 metrics.
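The retrieval of the testing stage can be sketched as follows; the function computes only Rank-1 on synthetic features, excludes same-identity gallery entries from the same camera as is conventional, and omits mAP for brevity:

```python
import numpy as np

# Minimal retrieval sketch: rank gallery features by cosine similarity
# for each query and report the Rank-1 accuracy.
def rank1(q_feat, q_pid, q_cam, g_feat, g_pid, g_cam):
    q = q_feat / np.linalg.norm(q_feat, axis=1, keepdims=True)
    g = g_feat / np.linalg.norm(g_feat, axis=1, keepdims=True)
    sim = q @ g.T
    hits = 0
    for i in range(len(q)):
        # Drop gallery entries with the same identity AND the same camera
        valid = ~((g_pid == q_pid[i]) & (g_cam == q_cam[i]))
        order = np.argsort(-sim[i][valid])
        hits += int(g_pid[valid][order[0]] == q_pid[i])
    return hits / len(q)

q_feat = np.array([[1.0, 0.0], [0.0, 1.0]])
q_pid = np.array([0, 1]); q_cam = np.array([0, 0])
g_feat = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
g_pid = np.array([0, 1, 0]); g_cam = np.array([1, 1, 1])
print(rank1(q_feat, q_pid, q_cam, g_feat, g_pid, g_cam))   # 1.0
```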
The text encoder with a Transformer as its backbone network is used to train the image encoder with a CNN or a Transformer as its backbone network; the backbone networks are specifically selected as follows: the backbone network used by the text encoder is an 8-layer Transformer network, and the backbone network of the image encoder is either ResNet-50 (a CNN) or ViT-B/16 (a Transformer).
The descriptive texts containing the learnable parameters do not share learnable parameters across different identities; each serves as an ambiguous description of its identity to supplement the textual description missing from the re-identification task. For a dataset containing N identities, the text features of all identities have dimension N × C.
A group of learnable token parameters is set in advance for each camera in the dataset; the token of the corresponding camera is added to the classification token [CLS], and the image is then fed into the image encoder.
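The camera-token idea can be sketched as follows; the embedding dimension, zero initialization, and addition point are assumptions, as the text only states that the camera's token is added to the classification token [CLS]:

```python
import torch
import torch.nn as nn

# Hedged sketch: one learnable embedding per camera, added to the image
# [CLS] token so the encoder can compensate for camera-specific bias.
class CameraToken(nn.Module):
    def __init__(self, num_cameras: int, dim: int = 768):
        super().__init__()
        self.cam_embed = nn.Parameter(torch.zeros(num_cameras, dim))

    def forward(self, cls_token, cam_ids):
        return cls_token + self.cam_embed[cam_ids]

mod = CameraToken(num_cameras=15, dim=768)
cls_tok = torch.randn(4, 768)
out = mod(cls_tok, torch.tensor([0, 3, 14, 7]))
print(out.shape)   # (4, 768); zero-initialized, so initially out == cls_tok
```

Zero initialization keeps the pre-trained behavior unchanged at the start of training, which is a common but here assumed design choice.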
In addition to the image-to-text cross-entropy loss function used to train the image encoder, the identity loss function L_id and the triplet loss function L_tri commonly used in pedestrian re-identification are computed as follows:
L_tri = max(d_p - d_n + α, 0)    (g)
where p_k is the probability of belonging to class k predicted by the network, d_p and d_n are the distances to the hardest positive sample and the hardest negative sample, and α is the margin threshold set for L_tri.
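Equation (g) with batch-hard mining can be sketched as follows; the batch-hard selection of d_p and d_n is an assumption consistent with the "hardest positive/negative" wording, and α = 0.3 matches the embodiment:

```python
import torch

# Sketch of equation (g): L_tri = max(d_p - d_n + α, 0), with d_p / d_n the
# distances to the hardest positive / hardest negative within the batch.
def batch_hard_triplet(feats, labels, alpha=0.3):
    d = torch.cdist(feats, feats)                           # pairwise Euclidean distances
    pos = labels.unsqueeze(0) == labels.unsqueeze(1)        # same-identity mask
    d_p = (d * pos.float()).max(1).values                   # farthest positive
    d_n = d.masked_fill(pos, float('inf')).min(1).values    # closest negative
    return torch.clamp(d_p - d_n + alpha, min=0).mean()

feats = torch.tensor([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]])
labels = torch.tensor([0, 0, 1, 1])
loss = batch_hard_triplet(feats, labels)
print(loss.item())   # well-separated classes -> 0.0
```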
The invention uses a text encoder with a Transformer as its backbone network to train an image encoder with a CNN or a Transformer as its backbone network: fix the parameters of the text encoder and the image encoder, set a descriptive text containing learnable parameters for each identity, and feed each image and its corresponding descriptive text into the image encoder and the text encoder; compute the text-to-image and image-to-text contrastive loss functions and train the learnable parameters in the descriptive text; fix the text encoder and the descriptive texts, generate and store the text features of each identity, and feed the images into the image encoder; compute the image-to-text cross-entropy loss function and train the image encoder; the features for pedestrian re-identification are obtained from the image encoder. The invention applies the language-image pre-training model CLIP to the ReID task for the first time and solves the problem that the text encoder in CLIP is difficult to exploit because the ReID task has no text labels. The method is simple, makes the result of finding the most similar images under other cameras in the gallery more accurate for each query image, and improves the mAP and Rank-1 metrics of pedestrian re-identification.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a flow chart of an embodiment of the present invention.
Detailed Description
In order to more clearly explain the technical means, technical improvements and advantageous effects of the present invention, the present invention is described in detail below with reference to the accompanying drawings.
Example 1
Referring to FIGS. 1-2, the invention learns a group of fuzzy descriptive texts for each identity using a text encoder with a Transformer as its backbone network, obtains the text features of all identities from the descriptive texts and the text encoder, trains an image encoder with a CNN or a Transformer as its backbone network, and finally obtains the features for pedestrian re-identification from the image encoder. The specific steps are as follows:
S1: for the MSMT17 dataset, with 1041 identities in the training set and 3060 identities in the test set, set a group of descriptive texts containing learnable parameters for each of the 1041 identities during the training stage, the template of the descriptive text being "A photo of a [X]_1 [X]_2 [X]_3 ... [X]_M person", where [X]_m (m ∈ {1, ..., M}) is the corresponding learnable token parameter and M is set to 5;
S2: fix the parameters of the image encoder and the text encoder, and feed each image and its corresponding descriptive text into the image encoder and the text encoder;
S3: compute the image-to-text and text-to-image contrastive loss functions L_i2t and L_t2i, and train the learnable parameters in the descriptive text; the corresponding formula is as follows:
s(V_i, T_i) = V_i · T_i = g_I(img_i) · g_T(text_i)    (a)
where img_i is the classification token [CLS] output by the image encoder for the i-th image, text_i is the end-of-sequence token [EOS] output by the text encoder for the corresponding descriptive text, g_I and g_T are the linear layers that map the [CLS] token and the [EOS] token into the same space, finally yielding the image feature V_i and the text feature T_i, and s(V_i, T_i) is the similarity between the image feature V_i and the text feature T_i; B is the number of images in the current batch, a is an index within the current batch, y_i is the identity label of the i-th image, P(y_i) is the index set of all images in the same batch belonging to identity y_i, and |P(y_i)| is the number of images in this set;
S4: fix the text encoder and the descriptive texts, generate and store the text features of each identity, and feed the images into the image encoder;
S5: compute the image-to-text cross-entropy loss function L_i2tce, together with the identity loss function L_id and the triplet loss function L_tri commonly used in pedestrian re-identification, and train the image encoder;
q_k = (1 - ε) δ_{k,y} + ε/N    (e)
L_tri = max(d_p - d_n + α, 0)    (g)
where N is the number of identities in the training set (for the MSMT17 dataset, N = 1041), k indexes all identities in the dataset, q_k is the smoothed label representing the expected probability that the current image belongs to the k-th identity, δ_{k,y} is the Kronecker delta, equal to 1 when k = y and 0 otherwise, and ε is a parameter controlling the smoothing strength; p_k is the probability of belonging to class k predicted by the network, d_p and d_n are the distances to the hardest positive sample and the hardest negative sample, and α is the margin threshold set for L_tri, here set to 0.3;
S6: feed the test-set images through the trained image encoder to obtain the corresponding image features; for each image in the query set of the test set, find the most similar images under other cameras in the gallery, and compute the mAP and Rank-1 metrics. Finally, 63.0% mAP and 84.4% Rank-1 are obtained when a CNN is used as the image encoder backbone network, and 75.8% mAP and 89.7% Rank-1 when a Transformer is used.
The method applies a language-image pre-training model to the re-identification task, is simple, resolves the lack of textual label descriptions in the pedestrian re-identification task, and improves accuracy. The invention is not limited to the above preferred embodiments; any modifications, equivalent replacements and improvements within the spirit and principle of the invention shall fall within the protection scope of the invention.
Claims (3)
1. A pedestrian re-identification method based on the contrastive language-image pre-training model CLIP, characterized in that a text encoder with a Transformer as its backbone network is used to train an image encoder with a CNN or a Transformer as its backbone network, and the image features generated by the image encoder are used for pedestrian re-identification, the method comprising the following specific steps:
Step 1: for the MSMT17 dataset, which contains thousands of identities, set a group of descriptive texts containing learnable parameters for each identity in the training set during the training stage, the template of the descriptive text being "A photo of a [X]_1 [X]_2 [X]_3 ... [X]_M person", where [X]_m (m ∈ {1, ..., M}) is the corresponding learnable token parameter;
Step 2: fix the parameters of the image encoder and the text encoder, and feed each image and its corresponding descriptive text into the image encoder and the text encoder;
Step 3: compute the image-to-text and text-to-image contrastive loss functions L_i2t and L_t2i, and train the learnable parameters in the descriptive text; the corresponding formula is as follows:
s(V_i, T_i) = V_i · T_i = g_I(img_i) · g_T(text_i)    (a)
where img_i is the classification token [CLS] output by the image encoder for the i-th image, text_i is the end-of-sequence token [EOS] output by the text encoder for the corresponding descriptive text, g_I and g_T are the linear layers that map the [CLS] token and the [EOS] token into the same space, finally yielding the image feature V_i and the text feature T_i, and s(V_i, T_i) is the similarity between the image feature V_i and the text feature T_i; B is the number of images in the current batch, a is an index within the current batch, y_i is the identity label of the i-th image, P(y_i) is the index set of all images in the same batch belonging to identity y_i, and |P(y_i)| is the number of images in this set;
Step 4: fix the text encoder and the descriptive texts, generate and store the text features of each identity, and feed the images into the image encoder;
Step 5: compute the image-to-text cross-entropy loss function and train the image encoder;
q_k = (1 - ε) δ_{k,y} + ε/N    (e)
where N is the number of identities in the training set, k indexes all identities in the dataset, q_k is the smoothed label representing the expected probability that the current image belongs to the k-th identity, δ_{k,y} is the Kronecker delta, equal to 1 when k = y and 0 otherwise, and ε is a parameter controlling the smoothing strength;
Step 6: in the testing stage, feed the test-set images into the trained image encoder to obtain the corresponding image features and perform pedestrian re-identification: for each image in the query set, find the most similar images under other cameras in the gallery, and compute the mAP and Rank-1 metrics.
2. The method of claim 1, characterized in that the backbone networks are specifically selected as follows: the backbone network used by the text encoder is an 8-layer Transformer network, and the backbone network of the image encoder is either ResNet-50 (a CNN) or ViT-B/16 (a Transformer).
3. The pedestrian re-identification method according to claim 1, characterized in that the descriptive texts containing the learnable parameters do not share learnable parameters across different identities; each descriptive text serves as an ambiguous description of its identity to supplement the textual description missing from the re-identification task.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211173432.3A CN115393902A (en) | 2022-09-26 | 2022-09-26 | Pedestrian re-identification method based on comparison language image pre-training model CLIP |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115393902A true CN115393902A (en) | 2022-11-25 |
Family
ID=84129348
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116701637A (en) * | 2023-06-29 | 2023-09-05 | 中南大学 | Zero sample text classification method, system and medium based on CLIP |
CN116701637B (en) * | 2023-06-29 | 2024-03-08 | 中南大学 | Zero sample text classification method, system and medium based on CLIP |
CN117079048A (en) * | 2023-08-29 | 2023-11-17 | 贵州电网有限责任公司 | Geological disaster image recognition method and system based on CLIP model |
CN117079048B (en) * | 2023-08-29 | 2024-05-14 | 贵州电网有限责任公司 | Geological disaster image recognition method and system based on CLIP model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109086756B (en) | Text detection analysis method, device and equipment based on deep neural network | |
CN114241282B (en) | Knowledge distillation-based edge equipment scene recognition method and device | |
CN111985239B (en) | Entity identification method, entity identification device, electronic equipment and storage medium | |
CN115393902A (en) | Pedestrian re-identification method based on comparison language image pre-training model CLIP | |
CN110555475A (en) | few-sample target detection method based on semantic information fusion | |
CN109670494B (en) | Text detection method and system with recognition confidence | |
CN110851641B (en) | Cross-modal retrieval method and device and readable storage medium | |
CN111582126B (en) | Pedestrian re-recognition method based on multi-scale pedestrian contour segmentation fusion | |
CN111680484B (en) | Answer model generation method and system for visual general knowledge reasoning question and answer | |
CN111598041A (en) | Image generation text method for article searching | |
CN111738169A (en) | Handwriting formula recognition method based on end-to-end network model | |
CN112712052A (en) | Method for detecting and identifying weak target in airport panoramic video | |
CN115131753A (en) | Heterogeneous multi-task cooperative system in automatic driving scene | |
CN114722822B (en) | Named entity recognition method, named entity recognition device, named entity recognition equipment and named entity recognition computer readable storage medium | |
CN116994021A (en) | Image detection method, device, computer readable medium and electronic equipment | |
WO2021051502A1 (en) | Long short-term memory-based teaching method and apparatus, and computer device | |
CN116450829A (en) | Medical text classification method, device, equipment and medium | |
CN110929013A (en) | Image question-answer implementation method based on bottom-up entry and positioning information fusion | |
CN109409359A (en) | A kind of method for extracting video captions based on deep learning | |
CN115186670A (en) | Method and system for identifying domain named entities based on active learning | |
CN114898290A (en) | Real-time detection method and system for marine ship | |
CN114357166A (en) | Text classification method based on deep learning | |
CN113362088A (en) | CRNN-based telecommunication industry intelligent customer service image identification method and system | |
CN116384439B (en) | Target detection method based on self-distillation | |
CN113792703B (en) | Image question-answering method and device based on Co-Attention depth modular network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||