CN116935329B - Weak supervision text pedestrian retrieval method and system for class-level comparison learning - Google Patents

Weak supervision text pedestrian retrieval method and system for class-level comparison learning

Info

Publication number
CN116935329B
CN116935329B (application CN202311204550.0A)
Authority
CN
China
Prior art keywords
text
image
features
class
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311204550.0A
Other languages
Chinese (zh)
Other versions
CN116935329A (en)
Inventor
郑艳伟
赵新鹏
王鹏
孙恩涛
杜超
于东晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Shanghai Step Electric Corp
Original Assignee
Shandong University
Shanghai Step Electric Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University, Shanghai Step Electric Corp filed Critical Shandong University
Priority to CN202311204550.0A priority Critical patent/CN116935329B/en
Publication of CN116935329A publication Critical patent/CN116935329A/en
Application granted granted Critical
Publication of CN116935329B publication Critical patent/CN116935329B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The application belongs to the field of image processing, and particularly relates to a weakly supervised text-based pedestrian retrieval method and system with class-level contrast learning, which retrieve pedestrian images or videos from a natural language description in any scene containing pedestrians, including but not limited to elevators, streets and malls. Image and text features are extracted and clustered, and a class-level multi-modal memory module is then constructed according to the cluster IDs and dynamically updated during training. During training, a hybrid-level cross-modal matching module pulls matched images and texts closer from both the class level and the instance level, and pushes apart images and texts of different classes. The application greatly improves the accuracy of text-based pedestrian retrieval under weakly supervised conditions.

Description

Weak supervision text pedestrian retrieval method and system for class-level comparison learning
Technical Field
The application belongs to the field of image processing, and particularly relates to a weak supervision text pedestrian retrieval method and system for class level contrast learning.
Background
In recent years, pedestrian retrieval has attracted wide attention and has important application value in intelligent video surveillance. The goal of the task is, given a query such as a photograph of a pedestrian or a textual description of the pedestrian, to retrieve the corresponding pedestrian images from a database. Pedestrian retrieval can be divided into image-based pedestrian retrieval and text-based pedestrian retrieval. Image-based pedestrian retrieval requires at least one image of the pedestrian of interest as the query, and in practice such an image is often difficult to obtain.
Currently, text-based pedestrian retrieval is usually trained in a supervised manner. This means that, in addition to pedestrian images and the corresponding textual descriptions, pedestrian IDs have to be annotated, which amounts to a substantial increase in labeling cost and raises the threshold for applying text-based pedestrian retrieval. The difficulties faced by weakly supervised text-based pedestrian retrieval include not only the data gap between the two modalities of text and image, which supervised methods also face, but also how the model, without the guidance of pedestrian ID information, can retrieve all images of the same pedestrian captured by different cameras under interference such as illumination changes, occlusion, viewpoint changes and low resolution. Existing methods alleviate these two problems to some extent, but the effect is limited. First, previous work adopts single-modality pre-trained models as the backbone network, for example a ResNet trained on ImageNet as the image encoder and BERT as the text encoder. Pre-training is very important for text-based pedestrian retrieval, and models using single-modality pre-training lack the necessary cross-modal alignment between text and images, which limits the final performance. Second, previous work mostly uses instance-level cross-modal loss functions, ignoring the final objective of retrieving all images of the same pedestrian.
Disclosure of Invention
In order to overcome the deficiencies of the prior art, the application provides a weakly supervised text-based pedestrian retrieval method with class-level contrast learning, which retrieves pedestrian images or videos from a natural language description in any scene containing pedestrians, including but not limited to elevators, streets and malls. The technical solution is as follows:
a weak supervision text pedestrian retrieval method for category level comparison learning comprises the following steps:
s1, extracting image features and text features by using an image encoder and a text encoder of a CLIP model;
s2, clustering the image features and the text features by using a clustering algorithm;
s3, excavating valuable samples in the clustering outlier samples according to the corresponding relation between the images and the texts;
s4, calculating class center features of the images and class center features of texts according to the cluster IDs, and storing the class center features and the class center features of the texts into memory modules of respective modes;
s5, respectively calculating the class-level cross-modal contrast matching loss and the instance-level cross-modal projection loss to obtain the hybrid-level cross-modal matching loss;
s6, updating the CLIP model parameters in a gradient updating mode, and storing the parameters of the image encoder and the text encoder after training;
s7, extracting image and text features by adopting the parameters of the image encoder and the text encoder in the step S6 when in use, then calculating cosine similarity between the image features and the text features, sorting the pedestrian images to be retrieved according to the similarity, and returning a sorting result.
Preferably, in step S2, the image features and the text features are clustered with a clustering algorithm to obtain cluster labels {y_i^v} and {y_i^t}, where y_i^v is the cluster ID of the i-th image and y_i^t is the cluster ID of the i-th text; clustering outliers all receive the label -1.
Preferably, in step S3, the image outlier samples are mined as follows:
S31, suppose the outlier sample of the i-th image is denoted v_i^o; find all text descriptions paired with v_i^o, filter out the text outlier samples among them, and obtain a text description set P^t = {t_1, ..., t_k}, meaning that k text descriptions are paired with v_i^o and that these text samples all carry cluster labels;
S32, according to the correspondence between images and texts, traverse P^t and find the image sample paired with each text, obtaining a clustered image set P^v = {v_1, ..., v_k};
S33, calculate the distance from the image outlier sample v_i^o to every image sample in the set P^v, and sort all samples in P^v by this distance;
S34, traverse all samples v_j in the set P^v in turn; if v_j is not an outlier sample, change the cluster label of the image outlier sample v_i^o from -1 to the cluster label of v_j and end the traversal; otherwise continue the traversal; if, after traversing all samples in P^v, v_i^o is still an outlier sample, v_i^o is deemed not worth mining and the next image outlier sample is mined, until all image outlier samples have been tried and the mining of image outlier samples ends.
Preferably, in step S3, the text outlier samples are mined as follows:
S3-1, suppose the i-th text outlier sample is denoted t_i^o; according to the correspondence between images and texts, find the image v_i paired with t_i^o; if the image v_i is also an outlier sample, end the mining of t_i^o, keep t_i^o in the outlier state, and traverse the next text outlier sample; if the image v_i is a clustered sample, go to the next step;
S3-2, since one image may have a correspondence with several texts, find all text descriptions paired with the image v_i and obtain a text description set P^t = {t_1, ..., t_q}, meaning that q text descriptions are paired with the image v_i;
S3-3, calculate the distance from the text outlier sample t_i^o to every text sample in the set P^t, from near to far, and sort all samples in P^t by this distance;
S3-4, traverse all samples t_j in the set P^t in turn; if t_j is not an outlier sample, change the cluster label of the text outlier sample t_i^o from -1 to the cluster label of t_j and end the traversal; otherwise continue the traversal; if, after traversing all samples in P^t, t_i^o is still an outlier sample, t_i^o is deemed not worth mining and the next text outlier sample is mined, until all text outlier samples have been tried and the mining of text outlier samples ends.
Preferably, in step S4, the class center features of the images and the class center features of the texts are calculated as follows:
According to the cluster labels, the image features of the same category are summed and averaged, and the resulting feature is taken as the class center feature of that image category:
c_i^v = (1 / |Y_i^v|) Σ_{f^v ∈ Y_i^v} f^v
where c_i^v denotes the class center feature of the i-th image class, Y_i^v denotes the set of features of all samples of the i-th image class, and |·| denotes the number of features in the set;
According to the cluster labels, the text features of the same category are summed and averaged, and the resulting feature is taken as the class center feature of that text category:
c_i^t = (1 / |Y_i^t|) Σ_{f^t ∈ Y_i^t} f^t
where c_i^t denotes the class center feature of the i-th text class, Y_i^t denotes the set of features of all samples of the i-th text class, and |·| denotes the number of features in the set;
All the calculated image class center features and text class center features are used to initialize the class-level visual memory module and the class-level text memory module respectively; the class-level visual memory module stores the class center features of N_v image classes, and the class-level text memory module stores the class center features of N_t text classes.
Preferably, in step S5, the overall class-level cross-modal contrast matching loss L_cc is calculated as follows.
Given a mini-batch composed of N pedestrian image features and text description features, the image-to-text-class-center cross-modal contrast matching loss L_v2t is computed as:
L_v2t = -(1/N) Σ_{i=1}^{N} log [ exp(f_i^v · c^{t+} / τ_v) / Σ_{j=1}^{N_t} exp(f_i^v · c_j^t / τ_v) ]
where f_i^v is the feature of an image sample in the mini-batch, c^{t+} denotes the text class center feature with the same cluster ID as f_i^v, τ_v is a learnable temperature coefficient on the image side, and c_j^t denotes the class center feature of the j-th text class;
The text-to-image-class-center cross-modal contrast matching loss L_t2v is computed as:
L_t2v = -(1/N) Σ_{i=1}^{N} log [ exp(f_i^t · c^{v+} / τ_t) / Σ_{j=1}^{N_v} exp(f_i^t · c_j^v / τ_t) ]
where f_i^t is the feature of a text sample in the mini-batch, c^{v+} denotes the image class center feature with the same cluster ID as f_i^t, τ_t is a learnable temperature coefficient on the text side, and c_j^v is the class center feature of the j-th image class;
The overall class-level cross-modal contrast matching loss is L_cc = L_v2t + L_t2v.
Preferably, in step S5, the instance-level cross-modal projection loss is calculated as follows.
The instance-level cross-modal projection loss consists of the loss L_i2t of projecting image features into the text feature space and the loss L_t2i of projecting text features into the image feature space.
The projection loss L_i2t of image features onto text features is calculated as follows:
Given a mini-batch composed of N pedestrian image features and text description features, for each image feature f_i^v the mini-batch can be written as {(f_i^v, f_j^t), q_{i,j}}, j = 1, ..., N, where q_{i,j} = 1 indicates that f_i^v and f_j^t are image and text features of the same pedestrian and q_{i,j} = 0 indicates that f_i^v and f_j^t do not match; the probability that f_i^v and f_j^t match is defined as:
p_{i,j} = exp(f_i^v · f_j^t / τ) / Σ_{k=1}^{N} exp(f_i^v · f_k^t / τ)
where τ is a learnable parameter;
In a mini-batch, the number of positive samples matching the image feature f_i^v may be more than one, so the true matching probability is normalized and expressed as:
q'_{i,j} = q_{i,j} / Σ_{k=1}^{N} q_{i,k}
The KL divergence between the image-to-text matching probability and the true matching probability is calculated to obtain the image-to-text matching loss over the mini-batch:
L_i2t = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} · log ( p_{i,j} / (q'_{i,j} + ε) )
where ε is a hyper-parameter whose value approaches 0 and which prevents numerical overflow;
The projection loss L_t2i of text features onto image features is calculated as follows:
Given a mini-batch composed of N pedestrian image features and text description features, for each text feature f_i^t the mini-batch can be written as {(f_i^t, f_j^v), q_{i,j}}, j = 1, ..., N, where q_{i,j} = 1 indicates that f_i^t and f_j^v are text and image features of the same pedestrian and q_{i,j} = 0 indicates that f_i^t and f_j^v do not match; the probability that f_i^t and f_j^v match is defined as:
p_{i,j} = exp(f_i^t · f_j^v / τ) / Σ_{k=1}^{N} exp(f_i^t · f_k^v / τ)
where τ is a learnable parameter;
In a mini-batch, the number of positive samples matching the text feature f_i^t may be more than one, so the true matching probability of f_i^t with the image features is normalized and expressed as:
q'_{i,j} = q_{i,j} / Σ_{k=1}^{N} q_{i,k}
The KL divergence between the text-to-image matching probability and the true matching probability is calculated to obtain the text-to-image matching loss over the mini-batch:
L_t2i = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} · log ( p_{i,j} / (q'_{i,j} + ε) )
where ε is a hyper-parameter whose value approaches 0 and which prevents numerical overflow;
The overall instance-level cross-modal projection loss is L_cmpm = L_i2t + L_t2i.
The hybrid-level cross-modal matching loss is L = L_cc + L_cmpm.
Preferably, in step S6, the class-level visual memory module is updated as follows:
c_i^v ← m_v · c_i^v + (1 - m_v) · f_i^v
where m_v is a hyper-parameter controlling the update ratio of the visual memory module, c_i^v is the class center feature of the i-th image class, and f_i^v is a sample feature of the i-th image class in the mini-batch;
The class-level text memory module is updated as follows:
c_i^t ← m_t · c_i^t + (1 - m_t) · f_i^t
where m_t is a hyper-parameter controlling the update ratio of the text memory module, c_i^t is the class center feature of the i-th text class, and f_i^t is a sample feature of the i-th text class in the mini-batch.
Preferably, in step S6,
for one small batch of input data, optimizing model parameters by using an Adam optimizer according to the loss calculated in the step S5; repeating the steps S2-S6 after the whole training data set is trained once until the set iteration times are reached; after training is completed, the trained parameters of the image encoder and the text encoder are saved.
The weakly supervised text-based pedestrian retrieval system with class-level contrast learning comprises an image-text feature extraction module, an outlier sample mining module, a class-level multi-modal memory module and a hybrid-level cross-modal matching module;
the image-text feature extraction module is used to extract image features and text features; the image encoder and the text encoder of CLIP serve as the image feature encoder and the text feature encoder in the image-text feature extraction module, and are initialized with the pre-trained CLIP model;
the outlier sample mining module comprises an image outlier sample mining module and a text outlier sample mining module; the image outlier sample mining module is used to mine valuable outlier samples among the images, and the text outlier sample mining module is used to mine valuable outlier samples among the texts;
the multi-mode memory module of class level comprises a visual memory module of class level and a text memory module of class level;
class-level visual memory module and class-level text memory module: all the calculated image class center features and text class center features are used to initialize the class-level visual memory module and the class-level text memory module respectively, wherein the class-level visual memory module stores the class center features of N_v image classes and the class-level text memory module stores the class center features of N_t text classes;
the hybrid-level cross-modal matching module is used to calculate the class-level cross-modal contrast matching loss L_cc, the instance-level cross-modal projection loss L_cmpm and the hybrid-level cross-modal matching loss L, to update the CLIP model parameters by gradient updates, and to store the image encoder and text encoder parameters after training, which are used to calculate the cosine similarity between the image features of the pedestrian images to be retrieved and the text features.
Compared with the prior art, the application has the following beneficial effects:
1. The application considers the correspondence between texts and images at the class level, which alleviates the large appearance differences among pedestrians with the same ID caused by factors such as illumination and viewpoint changes, and effectively exploits the rich multi-modal knowledge of CLIP.
2. The application further improves the performance of text pedestrian retrieval under the condition of weak supervision, and reduces the difference of the performance between the weak supervision and the supervised text pedestrian retrieval.
Drawings
FIG. 1 is a flow chart of the present application.
FIG. 2 is a schematic diagram of the system of the present application.
Detailed Description
The following detailed description is exemplary and is intended to provide further explanation of the application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application.
FIG. 2 shows the weakly supervised text-based pedestrian retrieval system with class-level contrast learning, comprising an image-text feature extraction module, an outlier sample mining module, a class-level multi-modal memory module and a hybrid-level cross-modal matching module;
the image text feature extraction module is used for extracting image features and text features; the image encoder and the text encoder which adopt the CLIP respectively serve as the image encoder and the text feature encoder in the image text feature extraction module; initializing the image encoder and the text encoder using a pre-trained model of CLIP;
the outlier sample mining module comprises an image outlier sample mining module and a text outlier sample mining module;
the basic principle of the outlier sample mining module is to find the category of an outlier sample according to the correspondence between images and texts, namely that one image may correspond to several text descriptions while one text description corresponds to one image;
the image outlier sample mining module is used to mine valuable outlier samples among the images, and the text outlier sample mining module is used to mine valuable outlier samples among the texts;
the multi-mode memory module of class level comprises a visual memory module of class level and a text memory module of class level;
class-level visual memory module and class-level text memory module: all the calculated image class center features and text class center features are used to initialize the class-level visual memory module and the class-level text memory module respectively, wherein the class-level visual memory module stores the class center features of N_v image classes and the class-level text memory module stores the class center features of N_t text classes;
the hybrid-level cross-modal matching module is used to calculate the class-level cross-modal contrast matching loss L_cc, the instance-level cross-modal projection loss L_cmpm and the hybrid-level cross-modal matching loss L, to update the CLIP model parameters by gradient updates, and to store the image encoder and text encoder parameters after training, which are used to calculate the cosine similarity between image features and text features during retrieval.
The input image is preprocessed and resized to a fixed resolution, and the image data augmentation methods of random horizontal flipping, random cropping and random erasing are used.
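As an illustration only, this augmentation pipeline could be assembled with torchvision as sketched below; the 384×128 target resolution and the probabilities are assumptions for the example and are not values fixed by this embodiment.

```python
# Illustrative sketch of the image preprocessing and augmentation described above.
# The 384x128 resolution and the probabilities are assumed values, not part of the embodiment.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((384, 128)),            # resize to a fixed resolution (assumed value)
    transforms.RandomHorizontalFlip(p=0.5),   # random horizontal flipping
    transforms.Pad(10),
    transforms.RandomCrop((384, 128)),        # random cropping
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),          # random erasing on the resulting tensor
])
```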
The text description input during training is tokenized and encoded with a lower-cased byte-pair-encoding vocabulary of 49,152 tokens. [SOS] and [EOS] embedding vectors are inserted at the beginning and the end of the tokenized text description to mark the start and the end of the sentence. The maximum text description sequence length is 77. To learn the relative positional relationships between words in a sentence, position embeddings are also added to the word vector input sequence.
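A minimal sketch of this tokenization step, assuming the openai/CLIP Python package is used (its BPE tokenizer has a lower-cased vocabulary of 49,152 tokens, a context length of 77 and [SOS]/[EOS] tokens); the caption is a made-up example.

```python
# Sketch of tokenizing one pedestrian description with CLIP's BPE tokenizer.
import clip

caption = "a woman in a red coat and black trousers carrying a backpack"
tokens = clip.tokenize([caption], context_length=77, truncate=True)  # adds [SOS]/[EOS], pads to length 77
print(tokens.shape)  # torch.Size([1, 77])
```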
The base learning rate is set to a fixed value, the number of iterations is 60, and a warm-up strategy is adopted in the first 15 training rounds, during which the learning rate increases linearly from a small initial value to the base learning rate. The temperature parameters τ_v, τ_t and τ are initialized to 0.02 by default.
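One possible way to realize the 15-epoch linear warm-up over 60 training epochs is a LambdaLR schedule, sketched below; the base learning rate and the dummy parameter are placeholders, since the embodiment does not fix them here.

```python
# Sketch of the linear warm-up schedule: the learning rate grows linearly during the
# first 15 epochs and then stays at the base value. Base LR and parameters are placeholders.
import torch

params = [torch.nn.Parameter(torch.zeros(1))]     # placeholder parameters
optimizer = torch.optim.Adam(params, lr=1e-5)     # base learning rate is illustrative
warmup_epochs, total_epochs = 15, 60

def lr_lambda(epoch: int) -> float:
    return (epoch + 1) / warmup_epochs if epoch < warmup_epochs else 1.0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
for epoch in range(total_epochs):
    # ... one training epoch would run here ...
    scheduler.step()
```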
FIG. 1 shows a class level contrast learning weak supervision text pedestrian retrieval method, comprising the following steps:
S1, extracting image features and text features by using the image encoder and the text encoder of a CLIP (Contrastive Language-Image Pre-training, CLIP for short) model;
specifically, the image encoder and the text encoder of CLIP serve as the image feature encoder and the text feature encoder in the image-text feature extraction module, and are initialized with the pre-trained CLIP model;
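A minimal sketch of step S1 under the assumption that the openai/CLIP package with a ViT-B/16 checkpoint is used (the file name and caption are placeholders):

```python
# Sketch of extracting image and text features with a pre-trained CLIP encoder (step S1).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

image = preprocess(Image.open("pedestrian.jpg")).unsqueeze(0).to(device)  # placeholder file
text = clip.tokenize(["a man in a blue jacket and jeans"]).to(device)     # placeholder caption

with torch.no_grad():
    f_v = model.encode_image(image)           # image feature f^v
    f_t = model.encode_text(text)             # text feature f^t
f_v = f_v / f_v.norm(dim=-1, keepdim=True)    # L2-normalize for cosine similarity
f_t = f_t / f_t.norm(dim=-1, keepdim=True)
```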
S2, clustering the image features and the text features by using a clustering algorithm to obtain cluster labels {y_i^v} and {y_i^t}, where y_i^v is the cluster ID of the i-th image and y_i^t is the cluster ID of the i-th text; clustering outliers all receive the label -1.
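The embodiment does not name a specific clustering algorithm at this point; as one assumed possibility, DBSCAN over cosine distances yields exactly the labeling convention described above, with outliers receiving -1:

```python
# Sketch of step S2: clustering features so that outliers receive the label -1.
# DBSCAN and its eps/min_samples values are assumptions used only for illustration.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_features(feats: np.ndarray, eps: float = 0.6, min_samples: int = 4) -> np.ndarray:
    # feats: (num_samples, dim) L2-normalized features
    return DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(feats)

# image_feats / text_feats are the features extracted in step S1
image_labels = cluster_features(image_feats)   # cluster IDs {y_i^v}, outliers = -1
text_labels = cluster_features(text_feats)     # cluster IDs {y_i^t}, outliers = -1
```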
S3, excavating valuable samples in the clustering outlier samples according to the corresponding relation between the images and the texts;
The image outlier samples are mined as follows:
S31, suppose the outlier sample of the i-th image is denoted v_i^o; find all text descriptions paired with v_i^o, filter out the text outlier samples among them, and obtain a text description set P^t = {t_1, ..., t_k}, meaning that k text descriptions are paired with v_i^o and that these text samples all carry cluster labels;
S32, according to the correspondence between images and texts, traverse P^t and find the image sample paired with each text, obtaining a clustered image set P^v = {v_1, ..., v_k};
S33, calculate the distance from the image outlier sample v_i^o to every image sample in the set P^v, and sort all samples in P^v by this distance from near to far;
S34, traverse all samples v_j in the set P^v in turn; if v_j is not an outlier sample, change the cluster label of the image outlier sample v_i^o from -1 to the cluster label of v_j and end the traversal; otherwise continue the traversal; if, after traversing all samples in P^v, v_i^o is still an outlier sample, v_i^o is deemed not worth mining and the next image outlier sample is mined, until all image outlier samples have been tried and the mining of image outlier samples ends.
The text outlier samples are mined as follows:
S3-1, suppose the i-th text outlier sample is denoted t_i^o; according to the correspondence between images and texts, find the image v_i paired with t_i^o; if the image v_i is also an outlier sample, end the mining of t_i^o, keep t_i^o in the outlier state, and traverse the next text outlier sample; if the image v_i is a clustered sample, go to the next step;
S3-2, since one image may have a correspondence with several texts, find all text descriptions paired with the image v_i and obtain a text description set P^t = {t_1, ..., t_q}, meaning that q text descriptions are paired with the image v_i;
S3-3, calculate the distance from the text outlier sample t_i^o to every text sample in the set P^t, from near to far, and sort all samples in P^t by this distance;
S3-4, traverse all samples t_j in the set P^t in turn; if t_j is not an outlier sample, change the cluster label of the text outlier sample t_i^o from -1 to the cluster label of t_j and end the traversal; otherwise continue the traversal; if, after traversing all samples in P^t, t_i^o is still an outlier sample, t_i^o is deemed not worth mining and the next text outlier sample is mined, until all text outlier samples have been tried and the mining of text outlier samples ends.
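The text-outlier mining steps S3-1 to S3-4 can be sketched as below; the pairing maps txt2img and img2txts as well as all variable names are assumptions introduced only for illustration.

```python
# Sketch of mining text outlier samples (steps S3-1 to S3-4).
import numpy as np

def mine_text_outliers(txt_feats, txt_labels, img_labels, txt2img, img2txts):
    """txt2img[i]: index of the image paired with text i; img2txts[v]: list of texts paired with image v."""
    for i in np.where(txt_labels == -1)[0]:            # each text outlier t_i^o
        v = txt2img[i]
        if img_labels[v] == -1:                        # S3-1: paired image is also an outlier -> keep outlier state
            continue
        cands = [t for t in img2txts[v] if t != i]     # S3-2: all texts paired with image v
        dists = [np.linalg.norm(txt_feats[i] - txt_feats[t]) for t in cands]
        for t in [cands[k] for k in np.argsort(dists)]:   # S3-3: from near to far
            if txt_labels[t] != -1:                    # S3-4: adopt the label of the nearest clustered text
                txt_labels[i] = txt_labels[t]
                break
    return txt_labels
```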
S4, calculating class center features of the images and class center features of texts according to the cluster IDs, and storing the class center features and the class center features of the texts into memory modules of respective modes;
The class center features of the images and the class center features of the texts are calculated as follows:
According to the cluster labels, the image features of the same category are summed and averaged, and the resulting feature is taken as the class center feature of that image category:
c_i^v = (1 / |Y_i^v|) Σ_{f^v ∈ Y_i^v} f^v
where c_i^v denotes the class center feature of the i-th image class, Y_i^v denotes the set of features of all samples of the i-th image class, and |·| denotes the number of features in the set;
According to the cluster labels, the text features of the same category are summed and averaged, and the resulting feature is taken as the class center feature of that text category:
c_i^t = (1 / |Y_i^t|) Σ_{f^t ∈ Y_i^t} f^t
where c_i^t denotes the class center feature of the i-th text class, Y_i^t denotes the set of features of all samples of the i-th text class, and |·| denotes the number of features in the set;
All the calculated image class center features and text class center features are used to initialize the class-level visual memory module and the class-level text memory module respectively; the class-level visual memory module stores the class center features of N_v image classes, and the class-level text memory module stores the class center features of N_t text classes.
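A minimal sketch of step S4, computing each cluster's mean feature as its class center and stacking the centers into a memory bank (the L2-normalization of the centers is an extra assumption, not stated in the embodiment):

```python
# Sketch of building the class-level memory modules from cluster labels (step S4).
import torch
import torch.nn.functional as F

def build_class_memory(feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """feats: (n, d) features; labels: (n,) cluster IDs with -1 for outliers.
    Returns a (num_classes, d) memory whose i-th row is the class center c_i."""
    class_ids = sorted(set(labels.tolist()) - {-1})
    centers = [feats[labels == cid].mean(dim=0) for cid in class_ids]  # mean feature of each cluster
    return F.normalize(torch.stack(centers), dim=1)                    # normalization is an assumption

# visual_memory: N_v x d, text_memory: N_t x d (features and labels come from steps S1-S3)
visual_memory = build_class_memory(image_feats, image_labels)
text_memory = build_class_memory(text_feats, text_labels)
```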
S5, respectively calculating the class-level cross-modal contrast matching loss and the instance-level cross-modal projection loss to obtain the hybrid-level cross-modal matching loss;
The overall class-level cross-modal contrast matching loss L_cc is calculated as follows.
Given a mini-batch composed of N pedestrian image features and text description features, the image-to-text-class-center cross-modal contrast matching loss L_v2t is computed as:
L_v2t = -(1/N) Σ_{i=1}^{N} log [ exp(f_i^v · c^{t+} / τ_v) / Σ_{j=1}^{N_t} exp(f_i^v · c_j^t / τ_v) ]
where f_i^v is the feature of an image sample in the mini-batch, c^{t+} denotes the text class center feature with the same cluster ID as the sample f_i^v, τ_v is a learnable temperature coefficient on the image side, and c_j^t denotes the class center feature of the j-th text class;
The text-to-image-class-center cross-modal contrast matching loss L_t2v is computed as:
L_t2v = -(1/N) Σ_{i=1}^{N} log [ exp(f_i^t · c^{v+} / τ_t) / Σ_{j=1}^{N_v} exp(f_i^t · c_j^v / τ_t) ]
where f_i^t is the feature of a text sample in the mini-batch, c^{v+} denotes the image class center feature with the same cluster ID as the sample f_i^t, τ_t is a learnable temperature coefficient on the text side, and c_j^v is the class center feature of the j-th image class;
The overall class-level cross-modal contrast matching loss is L_cc = L_v2t + L_t2v.
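Under the reconstruction above, the class-level loss is an InfoNCE-style cross-entropy of each sample against all class centers of the other modality; the sketch below assumes that the batch labels index the positive center in that memory.

```python
# Sketch of the class-level cross-modal contrast matching loss (one direction).
import torch
import torch.nn.functional as F

def class_level_loss(batch_feats, positive_ids, memory, tau):
    """batch_feats: (N, d) features of one modality; positive_ids: (N,) index of the matching
    class center in the other modality's memory; memory: (num_classes, d); tau: temperature."""
    feats = F.normalize(batch_feats, dim=1)
    logits = feats @ memory.t() / tau            # similarity to every class center
    return F.cross_entropy(logits, positive_ids)

# L_cc = class_level_loss(f_v, y_v, text_memory, tau_v) + class_level_loss(f_t, y_t, visual_memory, tau_t)
```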
The instance-level cross-modal projection loss consists of the loss L_i2t of projecting image features into the text feature space and the loss L_t2i of projecting text features into the image feature space.
The projection loss L_i2t of image features onto text features is calculated as follows:
Given a mini-batch composed of N pedestrian image features and text description features, for each image feature f_i^v the mini-batch can be written as {(f_i^v, f_j^t), q_{i,j}}, j = 1, ..., N, where q_{i,j} = 1 indicates that f_i^v and f_j^t are image and text features of the same pedestrian and q_{i,j} = 0 indicates that f_i^v and f_j^t do not match; the probability that f_i^v and f_j^t match is defined as:
p_{i,j} = exp(f_i^v · f_j^t / τ) / Σ_{k=1}^{N} exp(f_i^v · f_k^t / τ)
where τ is a learnable parameter;
In a mini-batch, the number of positive samples matching the image feature f_i^v may be more than one, so the true matching probability is normalized and expressed as:
q'_{i,j} = q_{i,j} / Σ_{k=1}^{N} q_{i,k}
The KL (Kullback-Leibler) divergence between the image-to-text matching probability and the true matching probability is calculated to obtain the image-to-text matching loss over the mini-batch:
L_i2t = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} · log ( p_{i,j} / (q'_{i,j} + ε) )
where ε is a hyper-parameter whose value approaches 0 and which prevents numerical overflow;
The projection loss L_t2i of text features onto image features is calculated as follows:
Given a mini-batch composed of N pedestrian image features and text description features, for each text feature f_i^t the mini-batch can be written as {(f_i^t, f_j^v), q_{i,j}}, j = 1, ..., N, where q_{i,j} = 1 indicates that f_i^t and f_j^v are text and image features of the same pedestrian and q_{i,j} = 0 indicates that f_i^t and f_j^v do not match; the probability that f_i^t and f_j^v match is defined as:
p_{i,j} = exp(f_i^t · f_j^v / τ) / Σ_{k=1}^{N} exp(f_i^t · f_k^v / τ)
where τ is a learnable parameter;
In a mini-batch, the number of positive samples matching the text feature f_i^t may be more than one, so the true matching probability of f_i^t with the image features is normalized and expressed as:
q'_{i,j} = q_{i,j} / Σ_{k=1}^{N} q_{i,k}
The KL (Kullback-Leibler) divergence between the text-to-image matching probability and the true matching probability is calculated to obtain the text-to-image matching loss over the mini-batch:
L_t2i = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} · log ( p_{i,j} / (q'_{i,j} + ε) )
where ε is a hyper-parameter whose value approaches 0 and which prevents numerical overflow;
The overall instance-level cross-modal projection loss is L_cmpm = L_i2t + L_t2i.
The hybrid-level cross-modal matching loss is L = L_cc + L_cmpm.
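The instance-level projection loss can be sketched as the KL divergence between a softmax matching distribution and the normalized true matching distribution; the softmax-over-similarities form of p_{i,j} follows the reconstruction above and should be read as an assumption.

```python
# Sketch of the instance-level cross-modal projection loss (both directions).
import torch
import torch.nn.functional as F

def cmpm_loss(img_feats, txt_feats, match, tau, eps=1e-8):
    """img_feats, txt_feats: (N, d); match: (N, N) with q_ij = 1 when image i and text j
    describe the same pedestrian, else 0; tau: learnable scalar temperature."""
    img = F.normalize(img_feats, dim=1)
    txt = F.normalize(txt_feats, dim=1)
    sim = img @ txt.t() / tau
    p_i2t = F.softmax(sim, dim=1)                        # p_ij: image -> text matching probability
    p_t2i = F.softmax(sim.t(), dim=1)                    # text -> image matching probability
    q = match.float()
    q_i2t = q / q.sum(dim=1, keepdim=True)               # normalized true matching probability
    q_t2i = q.t() / q.t().sum(dim=1, keepdim=True)
    l_i2t = (p_i2t * (p_i2t.clamp_min(eps).log() - (q_i2t + eps).log())).sum(dim=1).mean()
    l_t2i = (p_t2i * (p_t2i.clamp_min(eps).log() - (q_t2i + eps).log())).sum(dim=1).mean()
    return l_i2t + l_t2i

# hybrid-level matching loss: L = L_cc + L_cmpm
```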
S6, updating the CLIP model parameters in a gradient updating mode, and storing the parameters of the image encoder and the text encoder after training;
In step S6, the class-level visual memory module is updated as follows:
c_i^v ← m_v · c_i^v + (1 - m_v) · f_i^v
where m_v is a hyper-parameter controlling the update ratio of the visual memory module, c_i^v is the class center feature of the i-th image class, and f_i^v is a sample feature of the i-th image class in the mini-batch;
The class-level text memory module is updated as follows:
c_i^t ← m_t · c_i^t + (1 - m_t) · f_i^t
where m_t is a hyper-parameter controlling the update ratio of the text memory module, c_i^t is the class center feature of the i-th text class, and f_i^t is a sample feature of the i-th text class in the mini-batch.
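The momentum update of the memory modules, as reconstructed above, can be sketched per sample; the momentum value 0.2 is only an example, since the embodiment treats m_v and m_t as hyper-parameters.

```python
# Sketch of the class-level memory update: c_i <- m * c_i + (1 - m) * f_i.
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_memory(memory, batch_feats, batch_labels, momentum=0.2):
    """memory: (num_classes, d); batch_feats: (N, d); batch_labels: (N,) cluster IDs (no outliers)."""
    for feat, label in zip(batch_feats, batch_labels):
        memory[label] = momentum * memory[label] + (1.0 - momentum) * feat
        memory[label] = F.normalize(memory[label], dim=0)   # keep the center normalized (assumption)
    return memory
```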
For one small batch of input data, optimizing model parameters by using an Adam optimizer according to the loss calculated in the step S5; repeating the steps S2-S6 after the whole training data set is trained once until the set iteration times are reached; after training is completed, the trained parameters of the image encoder and the text encoder are saved.
S7, extracting image and text features by adopting the parameters of the image encoder and the text encoder in the step S6 when in use, then calculating cosine similarity between the image features and the text features, sorting the pedestrian images to be retrieved according to the similarity, and returning a sorting result.
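A minimal sketch of the retrieval step S7: rank the gallery images by cosine similarity to the query text feature (function and variable names are illustrative).

```python
# Sketch of step S7: ranking pedestrian images by cosine similarity to a query description.
import torch
import torch.nn.functional as F

def rank_gallery(text_feat: torch.Tensor, gallery_feats: torch.Tensor, top_k: int = 10):
    """text_feat: (d,) feature of the query description; gallery_feats: (M, d) image features."""
    text_feat = F.normalize(text_feat, dim=0)
    gallery_feats = F.normalize(gallery_feats, dim=1)
    scores = gallery_feats @ text_feat                 # cosine similarity to every gallery image
    order = torch.argsort(scores, descending=True)     # sort images by similarity
    return order[:top_k], scores[order[:top_k]]
```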
Table 1 shows the performance test data on the CUHK-PEDES dataset
Method Rank-1 Rank-5 Rank-10 mAP mINP
Supervised text pedestrian retrieval
TIPCB(2022) 64.26 83.19 89.10 - -
CAIBC(2022) 64.43 82.87 88.37 - -
AXM-Net(2022) 64.44 80.52 86.77 58.73 -
LGUR(2022) 65.25 83.12 89.00 - -
IVT(2022) 65.59 83.11 89.21 - -
CFine(2022) 69.57 85.93 91.15 - -
IRRA(2023) 73.38 89.93 93.71 66.13 50.24
Weak supervision text pedestrian retrieval
CMMT(2021) 57.10 78.14 85.23 - -
CAIBC(2022) 58.64 79.02 85.93 - -
Baseline (CLIP-ViT-B/16) 58.45 78.87 85.3 54.14 39.83
CMMT(CLIP-ViT-B/16)(2021) 59.57 79.53 86.53 54.66 39.78
The present application 68.76 86.11 91.26 62.2 46.71
Table 2 shows the performance test data on the ICFG-PEDES data set
Method Rank-1 Rank-5 Rank-10 mAP mINP
Supervised text pedestrian retrieval
Dual Path(2020) 38.99 59.44 68.41 - -
ViTAA(2020) 50.98 68.79 75.78 - -
SSAN(2021) 54.23 72.63 79.53 - -
IVT(2022) 56.04 73.6 80.22 - -
ISANet(2022) 57.73 75.42 81.72 - -
CFine(2022) 60.83 76.55 82.42 - -
IRRA(2023) 63.46 80.25 85.82 38.06 7.93
Weak supervision text pedestrian retrieval
Baseline (CLIP-ViT-B/16) 53.83 73.33 80.46 30.6 5.35
CMMT(CLIP-ViT-B/16) 54.27 71.17 77.86 33.17 5.73
The present application 58.41 76.29 82.54 37.21 8.62
Table 3 shows the performance test data on the RSTPReid dataset
Method Rank-1 Rank-5 Rank-10 mAP mINP
Supervised text pedestrian retrieval
DSSL(2021) 39.05 62.6 73.95 - -
SSAN(2021) 43.5 67.8 77.15 - -
LBUL(2022) 45.55 68.2 77.85 - -
IVT(2022) 46.7 70 78.8 - -
CFine(2022) 50.55 72.5 81.6 - -
IRRA(2023) 60.2 81.3 88.2 47.17 25.28
Weak supervision text pedestrian retrieval
CMMT(CLIP-ViT-B/16) 52.25 76.45 84.55 41.98 22
Baseline (CLIP-ViT-B/16) 53.1 75.7 83.6 38.65 21.06
The present application 57.25 76.95 86.1 44.96 24.22
The method achieves Rank-1 accuracies of 68.76%, 58.41% and 57.25% on the international benchmark datasets CUHK-PEDES, ICFG-PEDES and RSTPReid respectively, which exceeds the performance of existing weakly supervised learning methods and even surpasses some supervised learning methods.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (8)

1. The weak supervision text pedestrian retrieval method for class level comparison learning is characterized by comprising the following steps of:
s1, extracting image features and text features by using an image encoder and a text encoder of a CLIP model;
s2, clustering the image features and the text features by using a clustering algorithm;
s3, excavating valuable samples in the clustering outlier samples according to the corresponding relation between the images and the texts;
s4, calculating class center features of the images and class center features of texts according to the cluster IDs, and storing the class center features and the class center features of the texts into memory modules of respective modes;
s5, respectively calculating the class-level cross-modal contrast matching loss and the instance-level cross-modal projection loss to obtain the hybrid-level cross-modal matching loss,
the class-level cross-modal contrast matching loss L_cc is calculated as follows:
given a mini-batch composed of N pedestrian image features and text description features, the image-to-text-class-center cross-modal contrast matching loss L_v2t is calculated as:
L_v2t = -(1/N) Σ_{i=1}^{N} log [ exp(f_i^v · c^{t+} / τ_v) / Σ_{j=1}^{N_t} exp(f_i^v · c_j^t / τ_v) ]
wherein f_i^v is the feature of an image sample in the mini-batch, c^{t+} represents the text class center feature with the same cluster ID as f_i^v, τ_v is a learnable temperature coefficient of the image, and c_j^t is the class center feature of the j-th text class;
the text-to-image-class-center cross-modal contrast matching loss L_t2v is calculated as:
L_t2v = -(1/N) Σ_{i=1}^{N} log [ exp(f_i^t · c^{v+} / τ_t) / Σ_{j=1}^{N_v} exp(f_i^t · c_j^v / τ_t) ]
wherein f_i^t is the feature of a text sample in the mini-batch, c^{v+} represents the image class center feature with the same cluster ID as f_i^t, τ_t is a learnable temperature coefficient of the text, and c_j^v is the class center feature of the j-th image class;
the class-level cross-modal contrast matching loss is L_cc = L_v2t + L_t2v;
the instance-level cross-modal projection loss is calculated as follows:
the instance-level cross-modal projection loss consists of the loss L_i2t of projecting image features into the text feature space and the loss L_t2i of projecting text features into the image feature space;
the projection loss L_i2t of image features onto text features is calculated as follows:
given a mini-batch composed of N pedestrian image features and text description features, for each image feature f_i^v the mini-batch can be expressed as {(f_i^v, f_j^t), q_{i,j}}, j = 1, ..., N, wherein q_{i,j} = 1 represents that f_i^v and f_j^t are image and text features of the same pedestrian and q_{i,j} = 0 represents that f_i^v and f_j^t do not match; the probability that f_i^v and f_j^t match is defined as:
p_{i,j} = exp(f_i^v · f_j^t / τ) / Σ_{k=1}^{N} exp(f_i^v · f_k^t / τ)
where τ is a learnable parameter;
in a mini-batch, the number of positive samples matching the image feature f_i^v may be more than one, so the true matching probability is normalized and expressed as:
q'_{i,j} = q_{i,j} / Σ_{k=1}^{N} q_{i,k}
the KL divergence between the image-to-text matching probability and the true matching probability is calculated to obtain the image-to-text matching loss over the mini-batch:
L_i2t = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} · log ( p_{i,j} / (q'_{i,j} + ε) )
wherein ε is a hyper-parameter whose value approaches 0;
the projection loss L_t2i of text features onto image features is calculated as follows:
given a mini-batch composed of N pedestrian image features and text description features, for each text feature f_i^t the mini-batch can be expressed as {(f_i^t, f_j^v), q_{i,j}}, j = 1, ..., N, wherein q_{i,j} = 1 represents that f_i^t and f_j^v are text and image features of the same pedestrian and q_{i,j} = 0 represents that f_i^t and f_j^v do not match; the probability that f_i^t and f_j^v match is defined as:
p_{i,j} = exp(f_i^t · f_j^v / τ) / Σ_{k=1}^{N} exp(f_i^t · f_k^v / τ)
where τ is a learnable parameter;
in a mini-batch, the number of positive samples matching the text feature f_i^t may be more than one, so the true matching probability of f_i^t with the image features is normalized and expressed as:
q'_{i,j} = q_{i,j} / Σ_{k=1}^{N} q_{i,k}
the KL divergence between the text-to-image matching probability and the true matching probability is calculated to obtain the text-to-image matching loss over the mini-batch:
L_t2i = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} · log ( p_{i,j} / (q'_{i,j} + ε) )
wherein ε is a hyper-parameter whose value approaches 0;
the instance-level cross-modal projection loss is L_cmpm = L_i2t + L_t2i;
the hybrid-level cross-modal matching loss is L = L_cc + L_cmpm;
S6, updating the CLIP model parameters in a gradient updating mode, and storing the parameters of the image encoder and the text encoder after training is finished;
s7, extracting image and text features by adopting the parameters of the image encoder and the text encoder in the step S6 when in use, then calculating cosine similarity between the image features and the text features, sorting the pedestrian images to be retrieved according to the similarity, and returning a sorting result.
2. The weak supervision text pedestrian retrieval method based on class level contrast learning as set forth in claim 1, wherein in step S2 the image features and the text features are clustered by a clustering algorithm to obtain cluster labels {y_i^v} and {y_i^t}; y_i^v is the cluster ID of the i-th image; y_i^t is the cluster ID of the i-th text; for cluster outliers, the labels are all -1.
3. The weak supervision text pedestrian retrieval method based on class level contrast learning as defined in claim 1, wherein in step S3 the image outlier samples are mined, specifically comprising the steps of:
S31, assuming that the outlier sample of the i-th image is represented as v_i^o, finding all text descriptions paired with v_i^o, filtering out the text outlier samples among them, and obtaining a text description set P^t = {t_1, ..., t_k} representing that k text descriptions are paired with v_i^o and that these text samples have cluster labels;
S32, traversing P^t according to the correspondence between images and texts and finding the image sample paired with each text, obtaining a clustered image set P^v = {v_1, ..., v_k};
S33, calculating the distance from the image outlier sample v_i^o to all image samples in the set P^v, and sorting all samples in the set P^v by this distance;
S34, traversing all samples v_j in the set P^v in turn; if v_j is not an outlier sample, changing the cluster label of the image outlier sample v_i^o from -1 to the cluster label of v_j and ending the traversal; if not, continuing the traversal; if, after traversing all samples in the set P^v, v_i^o is still an outlier sample, v_i^o is deemed not worth mining, and the next image outlier sample is mined, until all image outlier samples have been tried and the mining of image outlier samples ends.
4. The method for pedestrian retrieval of weakly supervised text for class level contrast learning as set forth in claim 1, wherein in step S3 the text outlier samples are mined by:
S3-1, assuming that the i-th text outlier sample is represented as t_i^o, finding the image v_i paired with t_i^o according to the correspondence between images and texts; if the image v_i is also an outlier sample, ending the mining of t_i^o, keeping t_i^o in the outlier state, and traversing the next text outlier sample; if the image v_i is a clustered sample, proceeding to the next step;
S3-2, since one image may have a correspondence with several texts, finding all text descriptions paired with the image v_i and obtaining a text description set P^t = {t_1, ..., t_q} representing that q text descriptions are paired with the image v_i;
S3-3, calculating the distance from the text outlier sample t_i^o to all text samples in the set P^t, and sorting all samples in the set P^t by this distance;
S3-4, traversing all samples t_j in the set P^t in turn; if t_j is not an outlier sample, changing the cluster label of the text outlier sample t_i^o from -1 to the cluster label of t_j and ending the traversal; if not, continuing the traversal; if, after traversing all samples in the set P^t, t_i^o is still an outlier sample, t_i^o is deemed not worth mining, and the next text outlier sample is mined, until all text outlier samples have been tried and the mining of text outlier samples ends.
5. The weak supervision text pedestrian retrieval method based on class level contrast learning as defined in claim 1, wherein in step S4 the class center features of the images and the class center features of the texts are calculated as follows:
according to the cluster labels, the image features of the same category are summed and averaged, and the resulting feature is taken as the class center feature of that image category:
c_i^v = (1 / |Y_i^v|) Σ_{f^v ∈ Y_i^v} f^v
wherein c_i^v denotes the class center feature of the i-th image class, Y_i^v denotes the set of features of all samples of the i-th image class, and |·| represents the number of features in the set;
according to the cluster labels, the text features of the same category are summed and averaged, and the resulting feature is taken as the class center feature of that text category:
c_i^t = (1 / |Y_i^t|) Σ_{f^t ∈ Y_i^t} f^t
wherein c_i^t denotes the class center feature of the i-th text class, Y_i^t denotes the set of features of all samples of the i-th text class, and |·| represents the number of features in the set;
all the calculated image class center features and text class center features are used to initialize the class-level visual memory module and the class-level text memory module respectively; the class-level visual memory module stores the class center features of N_v image classes, and the class-level text memory module stores the class center features of N_t text classes.
6. The weak supervision text pedestrian retrieval method based on class level contrast learning as defined in claim 1, wherein in step S6 the class-level visual memory module is updated in the following manner:
c_i^v ← m_v · c_i^v + (1 - m_v) · f_i^v
wherein m_v is a hyper-parameter used to control the update ratio of the visual memory module, c_i^v is the class center feature of the i-th image class, and f_i^v is a sample feature of the i-th image class in the mini-batch;
the class-level text memory module is updated in the following manner:
c_i^t ← m_t · c_i^t + (1 - m_t) · f_i^t
wherein m_t is a hyper-parameter used to control the update ratio of the text memory module, c_i^t is the class center feature of the i-th text class, and f_i^t is a sample feature of the i-th text class in the mini-batch.
7. The weak supervision text pedestrian retrieval method of class level contrast learning of claim 1 wherein in step S6,
for one small batch of input data, optimizing model parameters by using an Adam optimizer according to the loss calculated in the step S5; repeating the steps S2-S6 after the whole training data set is trained once until the set iteration times are reached; after training is completed, the trained parameters of the image encoder and the text encoder are saved.
8. The weak supervision text pedestrian retrieval system for class level contrast learning is characterized by comprising an image text feature extraction module, an outlier sample mining module, a class-level multi-modal memory module and a hybrid-level cross-modal matching module;
the image text feature extraction module is used for extracting image features and text features; the image encoder and the text encoder which adopt the CLIP respectively serve as the image encoder and the text feature encoder in the image text feature extraction module; initializing the image encoder and the text encoder using a pre-trained model of CLIP;
the outlier sample mining module comprises an image outlier sample mining module and a text outlier sample mining module; the image outlier sample mining module is used for mining valuable outlier samples in the images; the text outlier sample mining module is used for mining valuable outlier samples in the text;
the multi-mode memory module of class level comprises a visual memory module of class level and a text memory module of class level;
class-level visual memory module and class-level text memory module: all the calculated image class center features and text class center features are used to initialize the class-level visual memory module and the class-level text memory module respectively, wherein the class-level visual memory module stores the class center features of N_v image classes and the class-level text memory module stores the class center features of N_t text classes;
the hybrid-level cross-modal matching module is used for calculating the class-level cross-modal contrast matching loss L_cc, the instance-level cross-modal projection loss L_cmpm and the hybrid-level cross-modal matching loss L;
the overall class-level cross-modal contrast matching loss L_cc is calculated as follows:
given a mini-batch composed of N pedestrian image features and text description features, the image-to-text-class-center cross-modal contrast matching loss L_v2t is calculated as:
L_v2t = -(1/N) Σ_{i=1}^{N} log [ exp(f_i^v · c^{t+} / τ_v) / Σ_{j=1}^{N_t} exp(f_i^v · c_j^t / τ_v) ]
wherein f_i^v is the feature of an image sample in the mini-batch, c^{t+} represents the text class center feature with the same cluster ID as f_i^v, τ_v is a learnable temperature coefficient of the image, and c_j^t is the class center feature of the j-th text class;
the text-to-image-class-center cross-modal contrast matching loss L_t2v is calculated as:
L_t2v = -(1/N) Σ_{i=1}^{N} log [ exp(f_i^t · c^{v+} / τ_t) / Σ_{j=1}^{N_v} exp(f_i^t · c_j^v / τ_t) ]
wherein f_i^t is the feature of a text sample in the mini-batch, c^{v+} represents the image class center feature with the same cluster ID as f_i^t, τ_t is a learnable temperature coefficient of the text, and c_j^v is the class center feature of the j-th image class;
the overall class-level cross-modal contrast matching loss is L_cc = L_v2t + L_t2v;
the instance-level cross-modal projection loss is calculated as follows:
the instance-level cross-modal projection loss consists of the loss L_i2t of projecting image features into the text feature space and the loss L_t2i of projecting text features into the image feature space;
the projection loss L_i2t of image features onto text features is calculated as follows:
given a mini-batch composed of N pedestrian image features and text description features, for each image feature f_i^v the mini-batch can be expressed as {(f_i^v, f_j^t), q_{i,j}}, j = 1, ..., N, wherein q_{i,j} = 1 represents that f_i^v and f_j^t are image and text features of the same pedestrian and q_{i,j} = 0 represents that f_i^v and f_j^t do not match; the probability that f_i^v and f_j^t match is defined as:
p_{i,j} = exp(f_i^v · f_j^t / τ) / Σ_{k=1}^{N} exp(f_i^v · f_k^t / τ)
where τ is a learnable parameter;
in a mini-batch, the number of positive samples matching the image feature f_i^v may be more than one, so the true matching probability is normalized and expressed as:
q'_{i,j} = q_{i,j} / Σ_{k=1}^{N} q_{i,k}
the KL divergence between the image-to-text matching probability and the true matching probability is calculated to obtain the image-to-text matching loss over the mini-batch:
L_i2t = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} · log ( p_{i,j} / (q'_{i,j} + ε) )
wherein ε is a hyper-parameter whose value approaches 0;
the projection loss L_t2i of text features onto image features is calculated as follows:
given a mini-batch composed of N pedestrian image features and text description features, for each text feature f_i^t the mini-batch can be expressed as {(f_i^t, f_j^v), q_{i,j}}, j = 1, ..., N, wherein q_{i,j} = 1 represents that f_i^t and f_j^v are text and image features of the same pedestrian and q_{i,j} = 0 represents that f_i^t and f_j^v do not match; the probability that f_i^t and f_j^v match is defined as:
p_{i,j} = exp(f_i^t · f_j^v / τ) / Σ_{k=1}^{N} exp(f_i^t · f_k^v / τ)
where τ is a learnable parameter;
in a mini-batch, the number of positive samples matching the text feature f_i^t may be more than one, so the true matching probability of f_i^t with the image features is normalized and expressed as:
q'_{i,j} = q_{i,j} / Σ_{k=1}^{N} q_{i,k}
the KL divergence between the text-to-image matching probability and the true matching probability is calculated to obtain the text-to-image matching loss over the mini-batch:
L_t2i = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} · log ( p_{i,j} / (q'_{i,j} + ε) )
wherein ε is a hyper-parameter whose value approaches 0;
the overall instance-level cross-modal projection loss is L_cmpm = L_i2t + L_t2i;
the hybrid-level cross-modal matching loss is L = L_cc + L_cmpm;
And updating the CLIP model parameters in a gradient updating mode, and storing the image encoder and text encoder parameters after training is finished, wherein the parameters are used for calculating cosine similarity between the image characteristics and the text characteristics of the pedestrian image to be retrieved.
CN202311204550.0A 2023-09-19 2023-09-19 Weak supervision text pedestrian retrieval method and system for class-level comparison learning Active CN116935329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311204550.0A CN116935329B (en) 2023-09-19 2023-09-19 Weak supervision text pedestrian retrieval method and system for class-level comparison learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311204550.0A CN116935329B (en) 2023-09-19 2023-09-19 Weak supervision text pedestrian retrieval method and system for class-level comparison learning

Publications (2)

Publication Number Publication Date
CN116935329A CN116935329A (en) 2023-10-24
CN116935329B (en) 2023-12-01

Family

ID=88386536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311204550.0A Active CN116935329B (en) 2023-09-19 2023-09-19 Weak supervision text pedestrian retrieval method and system for class-level comparison learning

Country Status (1)

Country Link
CN (1) CN116935329B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
CN111753190A (en) * 2020-05-29 2020-10-09 中山大学 Meta learning-based unsupervised cross-modal Hash retrieval method
WO2023093574A1 (en) * 2021-11-25 2023-06-01 北京邮电大学 News event search method and system based on multi-level image-text semantic alignment model
CN115546831A (en) * 2022-10-11 2022-12-30 同济人工智能研究院(苏州)有限公司 Cross-modal pedestrian searching method and system based on multi-granularity attention mechanism
CN116186328A (en) * 2023-01-05 2023-05-30 厦门大学 Video text cross-modal retrieval method based on pre-clustering guidance
CN116759076A (en) * 2023-07-12 2023-09-15 山西大学 Unsupervised disease diagnosis method and system based on medical image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A survey of weakly supervised semantic segmentation methods; Li Bin'ai; Li Ying; Hao Mingyang; Gu Shuyu; Digital Communication World (Issue 07); full text *
Zeng Chengbin; Liu Jiqian. Video pedestrian detection algorithm based on graph cut and density clustering. Pattern Recognition and Artificial Intelligence. 2017, (Issue 07), full text. *

Also Published As

Publication number Publication date
CN116935329A (en) 2023-10-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant