CN116935329B - Weak supervision text pedestrian retrieval method and system for class-level comparison learning - Google Patents

Weak supervision text pedestrian retrieval method and system for class-level comparison learning

Info

Publication number
CN116935329B
CN116935329B (application CN202311204550.0A)
Authority
CN
China
Prior art keywords
text
image
features
class
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311204550.0A
Other languages
Chinese (zh)
Other versions
CN116935329A (en)
Inventor
郑艳伟
赵新鹏
王鹏
孙恩涛
杜超
于东晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Shanghai Step Electric Corp
Original Assignee
Shandong University
Shanghai Step Electric Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University, Shanghai Step Electric Corp filed Critical Shandong University
Priority to CN202311204550.0A priority Critical patent/CN116935329B/en
Publication of CN116935329A publication Critical patent/CN116935329A/en
Application granted granted Critical
Publication of CN116935329B publication Critical patent/CN116935329B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The application belongs to the field of image processing, and particularly relates to a weakly supervised text-based pedestrian retrieval method and system with class-level contrast learning, which retrieve pedestrian images or videos from a natural language description in any scene containing pedestrians, including but not limited to elevators, streets and malls. Image and text features are extracted and clustered, and a class-level multi-modal memory module is then constructed according to the cluster IDs and dynamically updated during training. During training, a hybrid-level cross-modal matching module pulls matched images and texts closer from both the class level and the instance level, and pushes apart images and texts of different classes. The application greatly improves the accuracy of text-based pedestrian retrieval under weakly supervised conditions.

Description

Weak supervision text pedestrian retrieval method and system for class-level comparison learning
Technical Field
The application belongs to the field of image processing, and particularly relates to a weak supervision text pedestrian retrieval method and system for class level contrast learning.
Background
In recent years, pedestrian retrieval has attracted wide attention and has important application value in intelligent video surveillance. The goal of the task is, given a query such as a photograph of a pedestrian or a textual description of the pedestrian, to retrieve the corresponding pedestrian images from a database. Pedestrian retrieval can be divided into image-based pedestrian retrieval and text-based pedestrian retrieval. Image-based pedestrian retrieval requires at least one image of the pedestrian of interest as the query, and in practice such an image is often difficult to obtain.
Currently, text-based pedestrian retrieval is usually trained in a supervised manner. This means that, in addition to pedestrian images and the corresponding textual descriptions, pedestrian IDs have to be annotated, which amounts to a substantial increase in labeling cost and raises the threshold for applying text-based pedestrian retrieval. The difficulties faced by weakly supervised text-based pedestrian retrieval include not only the data gap between the two modalities of text and image, which supervised methods also face, but also how the model, without the guidance of pedestrian ID information, can retrieve all images of the same pedestrian captured by different cameras under interference such as illumination changes, occlusion, viewpoint changes and low resolution. Existing methods alleviate these two problems to some extent, but the effect is limited. First, previous work adopts single-modality pre-trained models as the backbone network, for example a ResNet trained on ImageNet as the image encoder and BERT as the text encoder. Pre-training is very important for text-based pedestrian retrieval, and models using single-modality pre-training lack the necessary cross-modal alignment between text and images, which limits the final performance. Second, previous work mostly uses instance-level cross-modal loss functions, ignoring the final objective of retrieving all images of the same pedestrian.
Disclosure of Invention
In order to overcome the deficiencies of the prior art, the application provides a weakly supervised text-based pedestrian retrieval method with class-level contrast learning, which retrieves pedestrian images or videos from a natural language description in any scene containing pedestrians, including but not limited to elevators, streets and malls. The technical solution is as follows:
a weak supervision text pedestrian retrieval method for category level comparison learning comprises the following steps:
s1, extracting image features and text features by using an image encoder and a text encoder of a CLIP model;
s2, clustering the image features and the text features by using a clustering algorithm;
s3, excavating valuable samples in the clustering outlier samples according to the corresponding relation between the images and the texts;
s4, calculating class center features of the images and class center features of texts according to the cluster IDs, and storing the class center features and the class center features of the texts into memory modules of respective modes;
s5, respectively calculating the class-level cross-modal contrast matching loss and the instance-level cross-modal projection loss to obtain the hybrid-level cross-modal matching loss;
s6, updating the CLIP model parameters in a gradient updating mode, and storing the parameters of the image encoder and the text encoder after training;
s7, extracting image and text features by adopting the parameters of the image encoder and the text encoder in the step S6 when in use, then calculating cosine similarity between the image features and the text features, sorting the pedestrian images to be retrieved according to the similarity, and returning a sorting result.
Preferably, in step S2, the image features and the text features are clustered with a clustering algorithm to obtain cluster labels {y_i^v} and {y_i^t}, where y_i^v is the cluster ID of the i-th image and y_i^t is the cluster ID of the i-th text; clustering outliers all receive the label -1.
Preferably, in step S3, the image outlier samples are mined as follows:
S31, suppose the outlier sample of the i-th image is denoted v_i^o; find all text descriptions paired with v_i^o, filter out the text outlier samples among them, and obtain a text description set P^t = {t_1, ..., t_k}, meaning that k text descriptions are paired with v_i^o and that these text samples all carry cluster labels;
S32, according to the correspondence between images and texts, traverse P^t and find the image sample paired with each text, obtaining a clustered image set P^v = {v_1, ..., v_k};
S33, calculate the distance from the image outlier sample v_i^o to every image sample in the set P^v, and sort all samples in P^v by this distance;
S34, traverse all samples v_j in the set P^v in turn; if v_j is not an outlier sample, change the cluster label of the image outlier sample v_i^o from -1 to the cluster label of v_j and end the traversal; otherwise continue the traversal; if, after traversing all samples in P^v, v_i^o is still an outlier sample, v_i^o is deemed not worth mining and the next image outlier sample is mined, until all image outlier samples have been tried and the mining of image outlier samples ends.
Preferably, in step S3, the text outlier samples are mined as follows:
S3-1, suppose the i-th text outlier sample is denoted t_i^o; according to the correspondence between images and texts, find the image v_i paired with t_i^o; if the image v_i is also an outlier sample, end the mining of t_i^o, keep t_i^o in the outlier state, and traverse the next text outlier sample; if the image v_i is a clustered sample, go to the next step;
S3-2, since one image may have a correspondence with several texts, find all text descriptions paired with the image v_i and obtain a text description set P^t = {t_1, ..., t_q}, meaning that q text descriptions are paired with the image v_i;
S3-3, calculate the distance from the text outlier sample t_i^o to every text sample in the set P^t, from near to far, and sort all samples in P^t by this distance;
S3-4, traverse all samples t_j in the set P^t in turn; if t_j is not an outlier sample, change the cluster label of the text outlier sample t_i^o from -1 to the cluster label of t_j and end the traversal; otherwise continue the traversal; if, after traversing all samples in P^t, t_i^o is still an outlier sample, t_i^o is deemed not worth mining and the next text outlier sample is mined, until all text outlier samples have been tried and the mining of text outlier samples ends.
Preferably, in step S4, the class center features of the images and the class center features of the texts are calculated as follows:
According to the cluster labels, the image features of the same category are summed and averaged, and the resulting feature is taken as the class center feature of that image category:
c_i^v = (1 / |Y_i^v|) Σ_{f^v ∈ Y_i^v} f^v
where c_i^v denotes the class center feature of the i-th image class, Y_i^v denotes the set of features of all samples of the i-th image class, and |·| denotes the number of features in the set;
According to the cluster labels, the text features of the same category are summed and averaged, and the resulting feature is taken as the class center feature of that text category:
c_i^t = (1 / |Y_i^t|) Σ_{f^t ∈ Y_i^t} f^t
where c_i^t denotes the class center feature of the i-th text class, Y_i^t denotes the set of features of all samples of the i-th text class, and |·| denotes the number of features in the set;
All the calculated image class center features and text class center features are used to initialize the class-level visual memory module and the class-level text memory module respectively; the class-level visual memory module stores the class center features of N_v image classes, and the class-level text memory module stores the class center features of N_t text classes.
Preferably, in step S5, the overall class-level cross-modal contrast matching loss L_cc is calculated as follows.
Given a mini-batch composed of N pedestrian image features and text description features, the image-to-text-class-center cross-modal contrast matching loss L_v2t is computed as:
L_v2t = -(1/N) Σ_{i=1}^{N} log [ exp(f_i^v · c^{t+} / τ_v) / Σ_{j=1}^{N_t} exp(f_i^v · c_j^t / τ_v) ]
where f_i^v is the feature of an image sample in the mini-batch, c^{t+} denotes the text class center feature with the same cluster ID as f_i^v, τ_v is a learnable temperature coefficient on the image side, and c_j^t denotes the class center feature of the j-th text class;
The text-to-image-class-center cross-modal contrast matching loss L_t2v is computed as:
L_t2v = -(1/N) Σ_{i=1}^{N} log [ exp(f_i^t · c^{v+} / τ_t) / Σ_{j=1}^{N_v} exp(f_i^t · c_j^v / τ_t) ]
where f_i^t is the feature of a text sample in the mini-batch, c^{v+} denotes the image class center feature with the same cluster ID as f_i^t, τ_t is a learnable temperature coefficient on the text side, and c_j^v is the class center feature of the j-th image class;
The overall class-level cross-modal contrast matching loss is L_cc = L_v2t + L_t2v.
Preferably, in step S5, the instance-level cross-modal projection loss is calculated as follows.
The instance-level cross-modal projection loss consists of the loss L_i2t of projecting image features into the text feature space and the loss L_t2i of projecting text features into the image feature space.
The projection loss L_i2t of image features onto text features is calculated as follows:
Given a mini-batch composed of N pedestrian image features and text description features, for each image feature f_i^v the mini-batch can be written as {(f_i^v, f_j^t), q_{i,j}}, j = 1, ..., N, where q_{i,j} = 1 indicates that f_i^v and f_j^t are image and text features of the same pedestrian and q_{i,j} = 0 indicates that f_i^v and f_j^t do not match; the probability that f_i^v and f_j^t match is defined as:
p_{i,j} = exp(f_i^v · f_j^t / τ) / Σ_{k=1}^{N} exp(f_i^v · f_k^t / τ)
where τ is a learnable parameter;
In a mini-batch, the number of positive samples matching the image feature f_i^v may be more than one, so the true matching probability is normalized and expressed as:
q'_{i,j} = q_{i,j} / Σ_{k=1}^{N} q_{i,k}
The KL divergence between the image-to-text matching probability and the true matching probability is calculated to obtain the image-to-text matching loss over the mini-batch:
L_i2t = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} · log ( p_{i,j} / (q'_{i,j} + ε) )
where ε is a hyper-parameter whose value approaches 0 and which prevents numerical overflow;
The projection loss L_t2i of text features onto image features is calculated as follows:
Given a mini-batch composed of N pedestrian image features and text description features, for each text feature f_i^t the mini-batch can be written as {(f_i^t, f_j^v), q_{i,j}}, j = 1, ..., N, where q_{i,j} = 1 indicates that f_i^t and f_j^v are text and image features of the same pedestrian and q_{i,j} = 0 indicates that f_i^t and f_j^v do not match; the probability that f_i^t and f_j^v match is defined as:
p_{i,j} = exp(f_i^t · f_j^v / τ) / Σ_{k=1}^{N} exp(f_i^t · f_k^v / τ)
where τ is a learnable parameter;
In a mini-batch, the number of positive samples matching the text feature f_i^t may be more than one, so the true matching probability of f_i^t with the image features is normalized and expressed as:
q'_{i,j} = q_{i,j} / Σ_{k=1}^{N} q_{i,k}
The KL divergence between the text-to-image matching probability and the true matching probability is calculated to obtain the text-to-image matching loss over the mini-batch:
L_t2i = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} · log ( p_{i,j} / (q'_{i,j} + ε) )
where ε is a hyper-parameter whose value approaches 0 and which prevents numerical overflow;
The overall instance-level cross-modal projection loss is L_cmpm = L_i2t + L_t2i.
The hybrid-level cross-modal matching loss is L = L_cc + L_cmpm.
Preferably, in step S6, the class-level visual memory module is updated as follows:
c_i^v ← m_v · c_i^v + (1 - m_v) · f_i^v
where m_v is a hyper-parameter controlling the update ratio of the visual memory module, c_i^v is the class center feature of the i-th image class, and f_i^v is a sample feature of the i-th image class in the mini-batch;
The class-level text memory module is updated as follows:
c_i^t ← m_t · c_i^t + (1 - m_t) · f_i^t
where m_t is a hyper-parameter controlling the update ratio of the text memory module, c_i^t is the class center feature of the i-th text class, and f_i^t is a sample feature of the i-th text class in the mini-batch.
Preferably, in step S6,
for one small batch of input data, optimizing model parameters by using an Adam optimizer according to the loss calculated in the step S5; repeating the steps S2-S6 after the whole training data set is trained once until the set iteration times are reached; after training is completed, the trained parameters of the image encoder and the text encoder are saved.
The weakly supervised text-based pedestrian retrieval system with class-level contrast learning comprises an image-text feature extraction module, an outlier sample mining module, a class-level multi-modal memory module and a hybrid-level cross-modal matching module;
the image-text feature extraction module is used to extract image features and text features; the image encoder and the text encoder of CLIP serve as the image feature encoder and the text feature encoder in the image-text feature extraction module, and are initialized with the pre-trained CLIP model;
the outlier sample mining module comprises an image outlier sample mining module and a text outlier sample mining module; the image outlier sample mining module is used to mine valuable outlier samples among the images, and the text outlier sample mining module is used to mine valuable outlier samples among the texts;
the multi-mode memory module of class level comprises a visual memory module of class level and a text memory module of class level;
class-level visual memory module and class-level text memory module: all the calculated image class center features and text class center features are used to initialize the class-level visual memory module and the class-level text memory module respectively, wherein the class-level visual memory module stores the class center features of N_v image classes and the class-level text memory module stores the class center features of N_t text classes;
the hybrid-level cross-modal matching module is used to calculate the class-level cross-modal contrast matching loss L_cc, the instance-level cross-modal projection loss L_cmpm and the hybrid-level cross-modal matching loss L, to update the CLIP model parameters by gradient updates, and to store the image encoder and text encoder parameters after training, which are used to calculate the cosine similarity between the image features of the pedestrian images to be retrieved and the text features.
Compared with the prior art, the application has the following beneficial effects:
1. The application considers the correspondence between texts and images at the class level, which alleviates the large appearance differences among pedestrians with the same ID caused by factors such as illumination and viewpoint changes, and effectively exploits the rich multi-modal knowledge of CLIP.
2. The application further improves the performance of text pedestrian retrieval under the condition of weak supervision, and reduces the difference of the performance between the weak supervision and the supervised text pedestrian retrieval.
Drawings
FIG. 1 is a flow chart of the present application.
FIG. 2 is a schematic diagram of the system of the present application.
Detailed Description
The following detailed description is exemplary and is intended to provide further explanation of the application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application.
FIG. 2 shows the weakly supervised text-based pedestrian retrieval system with class-level contrast learning, comprising an image-text feature extraction module, an outlier sample mining module, a class-level multi-modal memory module and a hybrid-level cross-modal matching module;
the image text feature extraction module is used for extracting image features and text features; the image encoder and the text encoder which adopt the CLIP respectively serve as the image encoder and the text feature encoder in the image text feature extraction module; initializing the image encoder and the text encoder using a pre-trained model of CLIP;
the outlier sample mining module comprises an image outlier sample mining module and a text outlier sample mining module;
the basic principle of the outlier sample mining module is to find the category of an outlier sample according to the correspondence between images and texts, namely that one image may correspond to several text descriptions while one text description corresponds to one image;
the image outlier sample mining module is used to mine valuable outlier samples among the images, and the text outlier sample mining module is used to mine valuable outlier samples among the texts;
the multi-mode memory module of class level comprises a visual memory module of class level and a text memory module of class level;
class-level visual memory module and class-level text memory module: all the calculated image class center features and text class center features are used to initialize the class-level visual memory module and the class-level text memory module respectively, wherein the class-level visual memory module stores the class center features of N_v image classes and the class-level text memory module stores the class center features of N_t text classes;
the hybrid-level cross-modal matching module is used to calculate the class-level cross-modal contrast matching loss L_cc, the instance-level cross-modal projection loss L_cmpm and the hybrid-level cross-modal matching loss L, to update the CLIP model parameters by gradient updates, and to store the image encoder and text encoder parameters after training, which are used to calculate the cosine similarity between image features and text features during retrieval.
The input image is preprocessed and resized to a fixed resolution, and the image data augmentation methods of random horizontal flipping, random cropping and random erasing are used.
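As an illustration only, this augmentation pipeline could be assembled with torchvision as sketched below; the 384×128 target resolution and the probabilities are assumptions for the example and are not values fixed by this embodiment.

```python
# Illustrative sketch of the image preprocessing and augmentation described above.
# The 384x128 resolution and the probabilities are assumed values, not part of the embodiment.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((384, 128)),            # resize to a fixed resolution (assumed value)
    transforms.RandomHorizontalFlip(p=0.5),   # random horizontal flipping
    transforms.Pad(10),
    transforms.RandomCrop((384, 128)),        # random cropping
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),          # random erasing on the resulting tensor
])
```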
The text description input during training is tokenized and encoded with a lower-cased byte-pair-encoding vocabulary of 49,152 tokens. [SOS] and [EOS] embedding vectors are inserted at the beginning and the end of the tokenized text description to mark the start and the end of the sentence. The maximum text description sequence length is 77. To learn the relative positional relationships between words in a sentence, position embeddings are also added to the word vector input sequence.
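A minimal sketch of this tokenization step, assuming the openai/CLIP Python package is used (its BPE tokenizer has a lower-cased vocabulary of 49,152 tokens, a context length of 77 and [SOS]/[EOS] tokens); the caption is a made-up example.

```python
# Sketch of tokenizing one pedestrian description with CLIP's BPE tokenizer.
import clip

caption = "a woman in a red coat and black trousers carrying a backpack"
tokens = clip.tokenize([caption], context_length=77, truncate=True)  # adds [SOS]/[EOS], pads to length 77
print(tokens.shape)  # torch.Size([1, 77])
```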
The base learning rate is set to a fixed value, the number of iterations is 60, and a warm-up strategy is adopted in the first 15 training rounds, during which the learning rate increases linearly from a small initial value to the base learning rate. The temperature parameters τ_v, τ_t and τ are initialized to 0.02 by default.
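One possible way to realize the 15-epoch linear warm-up over 60 training epochs is a LambdaLR schedule, sketched below; the base learning rate and the dummy parameter are placeholders, since the embodiment does not fix them here.

```python
# Sketch of the linear warm-up schedule: the learning rate grows linearly during the
# first 15 epochs and then stays at the base value. Base LR and parameters are placeholders.
import torch

params = [torch.nn.Parameter(torch.zeros(1))]     # placeholder parameters
optimizer = torch.optim.Adam(params, lr=1e-5)     # base learning rate is illustrative
warmup_epochs, total_epochs = 15, 60

def lr_lambda(epoch: int) -> float:
    return (epoch + 1) / warmup_epochs if epoch < warmup_epochs else 1.0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
for epoch in range(total_epochs):
    # ... one training epoch would run here ...
    scheduler.step()
```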
FIG. 1 shows a class level contrast learning weak supervision text pedestrian retrieval method, comprising the following steps:
S1, extracting image features and text features by using the image encoder and the text encoder of a CLIP (Contrastive Language-Image Pre-training, CLIP for short) model;
specifically, the image encoder and the text encoder of CLIP serve as the image feature encoder and the text feature encoder in the image-text feature extraction module, and are initialized with the pre-trained CLIP model;
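A minimal sketch of step S1 under the assumption that the openai/CLIP package with a ViT-B/16 checkpoint is used (the file name and caption are placeholders):

```python
# Sketch of extracting image and text features with a pre-trained CLIP encoder (step S1).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

image = preprocess(Image.open("pedestrian.jpg")).unsqueeze(0).to(device)  # placeholder file
text = clip.tokenize(["a man in a blue jacket and jeans"]).to(device)     # placeholder caption

with torch.no_grad():
    f_v = model.encode_image(image)           # image feature f^v
    f_t = model.encode_text(text)             # text feature f^t
f_v = f_v / f_v.norm(dim=-1, keepdim=True)    # L2-normalize for cosine similarity
f_t = f_t / f_t.norm(dim=-1, keepdim=True)
```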
S2, clustering the image features and the text features by using a clustering algorithm to obtain cluster labels {y_i^v} and {y_i^t}, where y_i^v is the cluster ID of the i-th image and y_i^t is the cluster ID of the i-th text; clustering outliers all receive the label -1.
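The embodiment does not name a specific clustering algorithm at this point; as one assumed possibility, DBSCAN over cosine distances yields exactly the labeling convention described above, with outliers receiving -1:

```python
# Sketch of step S2: clustering features so that outliers receive the label -1.
# DBSCAN and its eps/min_samples values are assumptions used only for illustration.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_features(feats: np.ndarray, eps: float = 0.6, min_samples: int = 4) -> np.ndarray:
    # feats: (num_samples, dim) L2-normalized features
    return DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(feats)

# image_feats / text_feats are the features extracted in step S1
image_labels = cluster_features(image_feats)   # cluster IDs {y_i^v}, outliers = -1
text_labels = cluster_features(text_feats)     # cluster IDs {y_i^t}, outliers = -1
```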
S3, excavating valuable samples in the clustering outlier samples according to the corresponding relation between the images and the texts;
The image outlier samples are mined as follows:
S31, suppose the outlier sample of the i-th image is denoted v_i^o; find all text descriptions paired with v_i^o, filter out the text outlier samples among them, and obtain a text description set P^t = {t_1, ..., t_k}, meaning that k text descriptions are paired with v_i^o and that these text samples all carry cluster labels;
S32, according to the correspondence between images and texts, traverse P^t and find the image sample paired with each text, obtaining a clustered image set P^v = {v_1, ..., v_k};
S33, calculate the distance from the image outlier sample v_i^o to every image sample in the set P^v, and sort all samples in P^v by this distance from near to far;
S34, traverse all samples v_j in the set P^v in turn; if v_j is not an outlier sample, change the cluster label of the image outlier sample v_i^o from -1 to the cluster label of v_j and end the traversal; otherwise continue the traversal; if, after traversing all samples in P^v, v_i^o is still an outlier sample, v_i^o is deemed not worth mining and the next image outlier sample is mined, until all image outlier samples have been tried and the mining of image outlier samples ends.
The text outlier samples are mined as follows:
S3-1, suppose the i-th text outlier sample is denoted t_i^o; according to the correspondence between images and texts, find the image v_i paired with t_i^o; if the image v_i is also an outlier sample, end the mining of t_i^o, keep t_i^o in the outlier state, and traverse the next text outlier sample; if the image v_i is a clustered sample, go to the next step;
S3-2, since one image may have a correspondence with several texts, find all text descriptions paired with the image v_i and obtain a text description set P^t = {t_1, ..., t_q}, meaning that q text descriptions are paired with the image v_i;
S3-3, calculate the distance from the text outlier sample t_i^o to every text sample in the set P^t, from near to far, and sort all samples in P^t by this distance;
S3-4, traverse all samples t_j in the set P^t in turn; if t_j is not an outlier sample, change the cluster label of the text outlier sample t_i^o from -1 to the cluster label of t_j and end the traversal; otherwise continue the traversal; if, after traversing all samples in P^t, t_i^o is still an outlier sample, t_i^o is deemed not worth mining and the next text outlier sample is mined, until all text outlier samples have been tried and the mining of text outlier samples ends.
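The text-outlier mining steps S3-1 to S3-4 can be sketched as below; the pairing maps txt2img and img2txts as well as all variable names are assumptions introduced only for illustration.

```python
# Sketch of mining text outlier samples (steps S3-1 to S3-4).
import numpy as np

def mine_text_outliers(txt_feats, txt_labels, img_labels, txt2img, img2txts):
    """txt2img[i]: index of the image paired with text i; img2txts[v]: list of texts paired with image v."""
    for i in np.where(txt_labels == -1)[0]:            # each text outlier t_i^o
        v = txt2img[i]
        if img_labels[v] == -1:                        # S3-1: paired image is also an outlier -> keep outlier state
            continue
        cands = [t for t in img2txts[v] if t != i]     # S3-2: all texts paired with image v
        dists = [np.linalg.norm(txt_feats[i] - txt_feats[t]) for t in cands]
        for t in [cands[k] for k in np.argsort(dists)]:   # S3-3: from near to far
            if txt_labels[t] != -1:                    # S3-4: adopt the label of the nearest clustered text
                txt_labels[i] = txt_labels[t]
                break
    return txt_labels
```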
S4, calculating class center features of the images and class center features of texts according to the cluster IDs, and storing the class center features and the class center features of the texts into memory modules of respective modes;
The class center features of the images and the class center features of the texts are calculated as follows:
According to the cluster labels, the image features of the same category are summed and averaged, and the resulting feature is taken as the class center feature of that image category:
c_i^v = (1 / |Y_i^v|) Σ_{f^v ∈ Y_i^v} f^v
where c_i^v denotes the class center feature of the i-th image class, Y_i^v denotes the set of features of all samples of the i-th image class, and |·| denotes the number of features in the set;
According to the cluster labels, the text features of the same category are summed and averaged, and the resulting feature is taken as the class center feature of that text category:
c_i^t = (1 / |Y_i^t|) Σ_{f^t ∈ Y_i^t} f^t
where c_i^t denotes the class center feature of the i-th text class, Y_i^t denotes the set of features of all samples of the i-th text class, and |·| denotes the number of features in the set;
All the calculated image class center features and text class center features are used to initialize the class-level visual memory module and the class-level text memory module respectively; the class-level visual memory module stores the class center features of N_v image classes, and the class-level text memory module stores the class center features of N_t text classes.
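A minimal sketch of step S4, computing each cluster's mean feature as its class center and stacking the centers into a memory bank (the L2-normalization of the centers is an extra assumption, not stated in the embodiment):

```python
# Sketch of building the class-level memory modules from cluster labels (step S4).
import torch
import torch.nn.functional as F

def build_class_memory(feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """feats: (n, d) features; labels: (n,) cluster IDs with -1 for outliers.
    Returns a (num_classes, d) memory whose i-th row is the class center c_i."""
    class_ids = sorted(set(labels.tolist()) - {-1})
    centers = [feats[labels == cid].mean(dim=0) for cid in class_ids]  # mean feature of each cluster
    return F.normalize(torch.stack(centers), dim=1)                    # normalization is an assumption

# visual_memory: N_v x d, text_memory: N_t x d (features and labels come from steps S1-S3)
visual_memory = build_class_memory(image_feats, image_labels)
text_memory = build_class_memory(text_feats, text_labels)
```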
S5, respectively calculating the class-level cross-modal contrast matching loss and the instance-level cross-modal projection loss to obtain the hybrid-level cross-modal matching loss;
The overall class-level cross-modal contrast matching loss L_cc is calculated as follows.
Given a mini-batch composed of N pedestrian image features and text description features, the image-to-text-class-center cross-modal contrast matching loss L_v2t is computed as:
L_v2t = -(1/N) Σ_{i=1}^{N} log [ exp(f_i^v · c^{t+} / τ_v) / Σ_{j=1}^{N_t} exp(f_i^v · c_j^t / τ_v) ]
where f_i^v is the feature of an image sample in the mini-batch, c^{t+} denotes the text class center feature with the same cluster ID as the sample f_i^v, τ_v is a learnable temperature coefficient on the image side, and c_j^t denotes the class center feature of the j-th text class;
The text-to-image-class-center cross-modal contrast matching loss L_t2v is computed as:
L_t2v = -(1/N) Σ_{i=1}^{N} log [ exp(f_i^t · c^{v+} / τ_t) / Σ_{j=1}^{N_v} exp(f_i^t · c_j^v / τ_t) ]
where f_i^t is the feature of a text sample in the mini-batch, c^{v+} denotes the image class center feature with the same cluster ID as the sample f_i^t, τ_t is a learnable temperature coefficient on the text side, and c_j^v is the class center feature of the j-th image class;
The overall class-level cross-modal contrast matching loss is L_cc = L_v2t + L_t2v.
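Under the reconstruction above, the class-level loss is an InfoNCE-style cross-entropy of each sample against all class centers of the other modality; the sketch below assumes that the batch labels index the positive center in that memory.

```python
# Sketch of the class-level cross-modal contrast matching loss (one direction).
import torch
import torch.nn.functional as F

def class_level_loss(batch_feats, positive_ids, memory, tau):
    """batch_feats: (N, d) features of one modality; positive_ids: (N,) index of the matching
    class center in the other modality's memory; memory: (num_classes, d); tau: temperature."""
    feats = F.normalize(batch_feats, dim=1)
    logits = feats @ memory.t() / tau            # similarity to every class center
    return F.cross_entropy(logits, positive_ids)

# L_cc = class_level_loss(f_v, y_v, text_memory, tau_v) + class_level_loss(f_t, y_t, visual_memory, tau_t)
```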
The instance-level cross-modal projection loss consists of the loss L_i2t of projecting image features into the text feature space and the loss L_t2i of projecting text features into the image feature space.
The projection loss L_i2t of image features onto text features is calculated as follows:
Given a mini-batch composed of N pedestrian image features and text description features, for each image feature f_i^v the mini-batch can be written as {(f_i^v, f_j^t), q_{i,j}}, j = 1, ..., N, where q_{i,j} = 1 indicates that f_i^v and f_j^t are image and text features of the same pedestrian and q_{i,j} = 0 indicates that f_i^v and f_j^t do not match; the probability that f_i^v and f_j^t match is defined as:
p_{i,j} = exp(f_i^v · f_j^t / τ) / Σ_{k=1}^{N} exp(f_i^v · f_k^t / τ)
where τ is a learnable parameter;
In a mini-batch, the number of positive samples matching the image feature f_i^v may be more than one, so the true matching probability is normalized and expressed as:
q'_{i,j} = q_{i,j} / Σ_{k=1}^{N} q_{i,k}
The KL (Kullback-Leibler) divergence between the image-to-text matching probability and the true matching probability is calculated to obtain the image-to-text matching loss over the mini-batch:
L_i2t = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} · log ( p_{i,j} / (q'_{i,j} + ε) )
where ε is a hyper-parameter whose value approaches 0 and which prevents numerical overflow;
The projection loss L_t2i of text features onto image features is calculated as follows:
Given a mini-batch composed of N pedestrian image features and text description features, for each text feature f_i^t the mini-batch can be written as {(f_i^t, f_j^v), q_{i,j}}, j = 1, ..., N, where q_{i,j} = 1 indicates that f_i^t and f_j^v are text and image features of the same pedestrian and q_{i,j} = 0 indicates that f_i^t and f_j^v do not match; the probability that f_i^t and f_j^v match is defined as:
p_{i,j} = exp(f_i^t · f_j^v / τ) / Σ_{k=1}^{N} exp(f_i^t · f_k^v / τ)
where τ is a learnable parameter;
In a mini-batch, the number of positive samples matching the text feature f_i^t may be more than one, so the true matching probability of f_i^t with the image features is normalized and expressed as:
q'_{i,j} = q_{i,j} / Σ_{k=1}^{N} q_{i,k}
The KL (Kullback-Leibler) divergence between the text-to-image matching probability and the true matching probability is calculated to obtain the text-to-image matching loss over the mini-batch:
L_t2i = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} · log ( p_{i,j} / (q'_{i,j} + ε) )
where ε is a hyper-parameter whose value approaches 0 and which prevents numerical overflow;
The overall instance-level cross-modal projection loss is L_cmpm = L_i2t + L_t2i.
The hybrid-level cross-modal matching loss is L = L_cc + L_cmpm.
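The instance-level projection loss can be sketched as the KL divergence between a softmax matching distribution and the normalized true matching distribution; the softmax-over-similarities form of p_{i,j} follows the reconstruction above and should be read as an assumption.

```python
# Sketch of the instance-level cross-modal projection loss (both directions).
import torch
import torch.nn.functional as F

def cmpm_loss(img_feats, txt_feats, match, tau, eps=1e-8):
    """img_feats, txt_feats: (N, d); match: (N, N) with q_ij = 1 when image i and text j
    describe the same pedestrian, else 0; tau: learnable scalar temperature."""
    img = F.normalize(img_feats, dim=1)
    txt = F.normalize(txt_feats, dim=1)
    sim = img @ txt.t() / tau
    p_i2t = F.softmax(sim, dim=1)                        # p_ij: image -> text matching probability
    p_t2i = F.softmax(sim.t(), dim=1)                    # text -> image matching probability
    q = match.float()
    q_i2t = q / q.sum(dim=1, keepdim=True)               # normalized true matching probability
    q_t2i = q.t() / q.t().sum(dim=1, keepdim=True)
    l_i2t = (p_i2t * (p_i2t.clamp_min(eps).log() - (q_i2t + eps).log())).sum(dim=1).mean()
    l_t2i = (p_t2i * (p_t2i.clamp_min(eps).log() - (q_t2i + eps).log())).sum(dim=1).mean()
    return l_i2t + l_t2i

# hybrid-level matching loss: L = L_cc + L_cmpm
```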
S6, updating the CLIP model parameters in a gradient updating mode, and storing the parameters of the image encoder and the text encoder after training;
In step S6, the class-level visual memory module is updated as follows:
c_i^v ← m_v · c_i^v + (1 - m_v) · f_i^v
where m_v is a hyper-parameter controlling the update ratio of the visual memory module, c_i^v is the class center feature of the i-th image class, and f_i^v is a sample feature of the i-th image class in the mini-batch;
The class-level text memory module is updated as follows:
c_i^t ← m_t · c_i^t + (1 - m_t) · f_i^t
where m_t is a hyper-parameter controlling the update ratio of the text memory module, c_i^t is the class center feature of the i-th text class, and f_i^t is a sample feature of the i-th text class in the mini-batch.
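The momentum update of the memory modules, as reconstructed above, can be sketched per sample; the momentum value 0.2 is only an example, since the embodiment treats m_v and m_t as hyper-parameters.

```python
# Sketch of the class-level memory update: c_i <- m * c_i + (1 - m) * f_i.
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_memory(memory, batch_feats, batch_labels, momentum=0.2):
    """memory: (num_classes, d); batch_feats: (N, d); batch_labels: (N,) cluster IDs (no outliers)."""
    for feat, label in zip(batch_feats, batch_labels):
        memory[label] = momentum * memory[label] + (1.0 - momentum) * feat
        memory[label] = F.normalize(memory[label], dim=0)   # keep the center normalized (assumption)
    return memory
```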
For one small batch of input data, optimizing model parameters by using an Adam optimizer according to the loss calculated in the step S5; repeating the steps S2-S6 after the whole training data set is trained once until the set iteration times are reached; after training is completed, the trained parameters of the image encoder and the text encoder are saved.
S7, extracting image and text features by adopting the parameters of the image encoder and the text encoder in the step S6 when in use, then calculating cosine similarity between the image features and the text features, sorting the pedestrian images to be retrieved according to the similarity, and returning a sorting result.
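A minimal sketch of the retrieval step S7: rank the gallery images by cosine similarity to the query text feature (function and variable names are illustrative).

```python
# Sketch of step S7: ranking pedestrian images by cosine similarity to a query description.
import torch
import torch.nn.functional as F

def rank_gallery(text_feat: torch.Tensor, gallery_feats: torch.Tensor, top_k: int = 10):
    """text_feat: (d,) feature of the query description; gallery_feats: (M, d) image features."""
    text_feat = F.normalize(text_feat, dim=0)
    gallery_feats = F.normalize(gallery_feats, dim=1)
    scores = gallery_feats @ text_feat                 # cosine similarity to every gallery image
    order = torch.argsort(scores, descending=True)     # sort images by similarity
    return order[:top_k], scores[order[:top_k]]
```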
Table 1 shows the performance test data on the CUHK-PEDES dataset
Method Rank-1 Rank-5 Rank-10 mAP mINP
Supervised text pedestrian retrieval
TIPCB(2022) 64.26 83.19 89.10 - -
CAIBC(2022) 64.43 82.87 88.37 - -
AXM-Net(2022) 64.44 80.52 86.77 58.73 -
LGUR(2022) 65.25 83.12 89.00 - -
IVT(2022) 65.59 83.11 89.21 - -
CFine(2022) 69.57 85.93 91.15 - -
IRRA(2023) 73.38 89.93 93.71 66.13 50.24
Weak supervision text pedestrian retrieval
CMMT(2021) 57.10 78.14 85.23 - -
CAIBC(2022) 58.64 79.02 85.93 - -
Baseline (CLIP-ViT-B/16) 58.45 78.87 85.3 54.14 39.83
CMMT(CLIP-ViT-B/16)(2021) 59.57 79.53 86.53 54.66 39.78
The present application 68.76 86.11 91.26 62.2 46.71
Table 2 shows the performance test data on the ICFG-PEDES data set
Method Rank-1 Rank-5 Rank-10 mAP mINP
Supervised text pedestrian retrieval
Dual Path(2020) 38.99 59.44 68.41 - -
ViTAA(2020) 50.98 68.79 75.78 - -
SSAN(2021) 54.23 72.63 79.53 - -
IVT(2022) 56.04 73.6 80.22 - -
ISANet(2022) 57.73 75.42 81.72 - -
CFine(2022) 60.83 76.55 82.42 - -
IRRA(2023) 63.46 80.25 85.82 38.06 7.93
Weak supervision text pedestrian retrieval
Baseline (CLIP-ViT-B/16) 53.83 73.33 80.46 30.6 5.35
CMMT(CLIP-ViT-B/16) 54.27 71.17 77.86 33.17 5.73
The present application 58.41 76.29 82.54 37.21 8.62
Table 3 shows the performance test data on the RSTPReid dataset
Method Rank-1 Rank-5 Rank-10 mAP mINP
Supervised text pedestrian retrieval
DSSL(2021) 39.05 62.6 73.95 - -
SSAN(2021) 43.5 67.8 77.15 - -
LBUL(2022) 45.55 68.2 77.85 - -
IVT(2022) 46.7 70 78.8 - -
CFine(2022) 50.55 72.5 81.6 - -
IRRA(2023) 60.2 81.3 88.2 47.17 25.28
Weak supervision text pedestrian retrieval
CMMT(CLIP-ViT-B/16) 52.25 76.45 84.55 41.98 22
Baseline (CLIP-ViT-B/16) 53.1 75.7 83.6 38.65 21.06
The present application 57.25 76.95 86.1 44.96 24.22
The method achieves Rank-1 accuracies of 68.76%, 58.41% and 57.25% on the international benchmark datasets CUHK-PEDES, ICFG-PEDES and RSTPReid respectively, which exceeds the performance of existing weakly supervised learning methods and even surpasses some supervised learning methods.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (8)

1. The weak supervision text pedestrian retrieval method for class level comparison learning is characterized by comprising the following steps of:
s1, extracting image features and text features by using an image encoder and a text encoder of a CLIP model;
s2, clustering the image features and the text features by using a clustering algorithm;
s3, excavating valuable samples in the clustering outlier samples according to the corresponding relation between the images and the texts;
s4, calculating class center features of the images and class center features of texts according to the cluster IDs, and storing the class center features and the class center features of the texts into memory modules of respective modes;
s5, respectively calculating the class-level cross-modal contrast matching loss and the instance-level cross-modal projection loss to obtain the hybrid-level cross-modal matching loss,
the class-level cross-modal contrast matching loss L_cc is calculated as follows:
given a mini-batch composed of N pedestrian image features and text description features, the image-to-text-class-center cross-modal contrast matching loss L_v2t is calculated as:
L_v2t = -(1/N) Σ_{i=1}^{N} log [ exp(f_i^v · c^{t+} / τ_v) / Σ_{j=1}^{N_t} exp(f_i^v · c_j^t / τ_v) ]
wherein f_i^v is the feature of an image sample in the mini-batch, c^{t+} represents the text class center feature with the same cluster ID as f_i^v, τ_v is a learnable temperature coefficient of the image, and c_j^t is the class center feature of the j-th text class;
the text-to-image-class-center cross-modal contrast matching loss L_t2v is calculated as:
L_t2v = -(1/N) Σ_{i=1}^{N} log [ exp(f_i^t · c^{v+} / τ_t) / Σ_{j=1}^{N_v} exp(f_i^t · c_j^v / τ_t) ]
wherein f_i^t is the feature of a text sample in the mini-batch, c^{v+} represents the image class center feature with the same cluster ID as f_i^t, τ_t is a learnable temperature coefficient of the text, and c_j^v is the class center feature of the j-th image class;
the class-level cross-modal contrast matching loss is L_cc = L_v2t + L_t2v;
the instance-level cross-modal projection loss is calculated as follows:
the instance-level cross-modal projection loss consists of the loss L_i2t of projecting image features into the text feature space and the loss L_t2i of projecting text features into the image feature space;
the projection loss L_i2t of image features onto text features is calculated as follows:
given a mini-batch composed of N pedestrian image features and text description features, for each image feature f_i^v the mini-batch can be expressed as {(f_i^v, f_j^t), q_{i,j}}, j = 1, ..., N, wherein q_{i,j} = 1 represents that f_i^v and f_j^t are image and text features of the same pedestrian and q_{i,j} = 0 represents that f_i^v and f_j^t do not match; the probability that f_i^v and f_j^t match is defined as:
p_{i,j} = exp(f_i^v · f_j^t / τ) / Σ_{k=1}^{N} exp(f_i^v · f_k^t / τ)
where τ is a learnable parameter;
in a mini-batch, the number of positive samples matching the image feature f_i^v may be more than one, so the true matching probability is normalized and expressed as:
q'_{i,j} = q_{i,j} / Σ_{k=1}^{N} q_{i,k}
the KL divergence between the image-to-text matching probability and the true matching probability is calculated to obtain the image-to-text matching loss over the mini-batch:
L_i2t = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} · log ( p_{i,j} / (q'_{i,j} + ε) )
wherein ε is a hyper-parameter whose value approaches 0;
the projection loss L_t2i of text features onto image features is calculated as follows:
given a mini-batch composed of N pedestrian image features and text description features, for each text feature f_i^t the mini-batch can be expressed as {(f_i^t, f_j^v), q_{i,j}}, j = 1, ..., N, wherein q_{i,j} = 1 represents that f_i^t and f_j^v are text and image features of the same pedestrian and q_{i,j} = 0 represents that f_i^t and f_j^v do not match; the probability that f_i^t and f_j^v match is defined as:
p_{i,j} = exp(f_i^t · f_j^v / τ) / Σ_{k=1}^{N} exp(f_i^t · f_k^v / τ)
where τ is a learnable parameter;
in a mini-batch, the number of positive samples matching the text feature f_i^t may be more than one, so the true matching probability of f_i^t with the image features is normalized and expressed as:
q'_{i,j} = q_{i,j} / Σ_{k=1}^{N} q_{i,k}
the KL divergence between the text-to-image matching probability and the true matching probability is calculated to obtain the text-to-image matching loss over the mini-batch:
L_t2i = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} · log ( p_{i,j} / (q'_{i,j} + ε) )
wherein ε is a hyper-parameter whose value approaches 0;
the instance-level cross-modal projection loss is L_cmpm = L_i2t + L_t2i;
the hybrid-level cross-modal matching loss is L = L_cc + L_cmpm;
S6, updating the CLIP model parameters in a gradient updating mode, and storing the parameters of the image encoder and the text encoder after training is finished;
s7, extracting image and text features by adopting the parameters of the image encoder and the text encoder in the step S6 when in use, then calculating cosine similarity between the image features and the text features, sorting the pedestrian images to be retrieved according to the similarity, and returning a sorting result.
2. The weak supervision text pedestrian retrieval method based on class level contrast learning as set forth in claim 1, wherein in step S2 the image features and the text features are clustered by a clustering algorithm to obtain cluster labels {y_i^v} and {y_i^t}; y_i^v is the cluster ID of the i-th image; y_i^t is the cluster ID of the i-th text; for cluster outliers, the labels are all -1.
3. The weak supervision text pedestrian retrieval method based on class level contrast learning as defined in claim 1, wherein in step S3 the image outlier samples are mined, specifically comprising the steps of:
S31, assuming that the outlier sample of the i-th image is represented as v_i^o, finding all text descriptions paired with v_i^o, filtering out the text outlier samples among them, and obtaining a text description set P^t = {t_1, ..., t_k} representing that k text descriptions are paired with v_i^o and that these text samples have cluster labels;
S32, traversing P^t according to the correspondence between images and texts and finding the image sample paired with each text, obtaining a clustered image set P^v = {v_1, ..., v_k};
S33, calculating the distance from the image outlier sample v_i^o to all image samples in the set P^v, and sorting all samples in the set P^v by this distance;
S34, traversing all samples v_j in the set P^v in turn; if v_j is not an outlier sample, changing the cluster label of the image outlier sample v_i^o from -1 to the cluster label of v_j and ending the traversal; if not, continuing the traversal; if, after traversing all samples in the set P^v, v_i^o is still an outlier sample, v_i^o is deemed not worth mining, and the next image outlier sample is mined, until all image outlier samples have been tried and the mining of image outlier samples ends.
4. The method for pedestrian retrieval of weakly supervised text for class level contrast learning as set forth in claim 1, wherein in step S3 the text outlier samples are mined by:
S3-1, assuming that the i-th text outlier sample is represented as t_i^o, finding the image v_i paired with t_i^o according to the correspondence between images and texts; if the image v_i is also an outlier sample, ending the mining of t_i^o, keeping t_i^o in the outlier state, and traversing the next text outlier sample; if the image v_i is a clustered sample, proceeding to the next step;
S3-2, since one image may have a correspondence with several texts, finding all text descriptions paired with the image v_i and obtaining a text description set P^t = {t_1, ..., t_q} representing that q text descriptions are paired with the image v_i;
S3-3, calculating the distance from the text outlier sample t_i^o to all text samples in the set P^t, and sorting all samples in the set P^t by this distance;
S3-4, traversing all samples t_j in the set P^t in turn; if t_j is not an outlier sample, changing the cluster label of the text outlier sample t_i^o from -1 to the cluster label of t_j and ending the traversal; if not, continuing the traversal; if, after traversing all samples in the set P^t, t_i^o is still an outlier sample, t_i^o is deemed not worth mining, and the next text outlier sample is mined, until all text outlier samples have been tried and the mining of text outlier samples ends.
5. The weak supervision text pedestrian retrieval method based on class level contrast learning as defined in claim 1, wherein in step S4 the class center features of the images and the class center features of the texts are calculated as follows:
according to the cluster labels, the image features of the same category are summed and averaged, and the resulting feature is taken as the class center feature of that image category:
c_i^v = (1 / |Y_i^v|) Σ_{f^v ∈ Y_i^v} f^v
wherein c_i^v denotes the class center feature of the i-th image class, Y_i^v denotes the set of features of all samples of the i-th image class, and |·| represents the number of features in the set;
according to the cluster labels, the text features of the same category are summed and averaged, and the resulting feature is taken as the class center feature of that text category:
c_i^t = (1 / |Y_i^t|) Σ_{f^t ∈ Y_i^t} f^t
wherein c_i^t denotes the class center feature of the i-th text class, Y_i^t denotes the set of features of all samples of the i-th text class, and |·| represents the number of features in the set;
all the calculated image class center features and text class center features are used to initialize the class-level visual memory module and the class-level text memory module respectively; the class-level visual memory module stores the class center features of N_v image classes, and the class-level text memory module stores the class center features of N_t text classes.
6. The weak supervision text pedestrian retrieval method based on class level contrast learning as defined in claim 1, wherein in step S6 the class-level visual memory module is updated in the following manner:
c_i^v ← m_v · c_i^v + (1 - m_v) · f_i^v
wherein m_v is a hyper-parameter used to control the update ratio of the visual memory module, c_i^v is the class center feature of the i-th image class, and f_i^v is a sample feature of the i-th image class in the mini-batch;
the class-level text memory module is updated in the following manner:
c_i^t ← m_t · c_i^t + (1 - m_t) · f_i^t
wherein m_t is a hyper-parameter used to control the update ratio of the text memory module, c_i^t is the class center feature of the i-th text class, and f_i^t is a sample feature of the i-th text class in the mini-batch.
7. The weak supervision text pedestrian retrieval method of class level contrast learning of claim 1 wherein in step S6,
for one small batch of input data, optimizing model parameters by using an Adam optimizer according to the loss calculated in the step S5; repeating the steps S2-S6 after the whole training data set is trained once until the set iteration times are reached; after training is completed, the trained parameters of the image encoder and the text encoder are saved.
8. The weak supervision text pedestrian retrieval system for class level contrast learning is characterized by comprising an image text feature extraction module, an outlier sample mining module, a class-level multi-modal memory module and a hybrid-level cross-modal matching module;
the image text feature extraction module is used for extracting image features and text features; the image encoder and the text encoder which adopt the CLIP respectively serve as the image encoder and the text feature encoder in the image text feature extraction module; initializing the image encoder and the text encoder using a pre-trained model of CLIP;
the outlier sample mining module comprises an image outlier sample mining module and a text outlier sample mining module; the image outlier sample mining module is used for mining valuable outlier samples in the images; the text outlier sample mining module is used for mining valuable outlier samples in the text;
the multi-mode memory module of class level comprises a visual memory module of class level and a text memory module of class level;
class-level visual memory module and class-level text memory module: all the calculated image class center features and text class center features are used to initialize the class-level visual memory module and the class-level text memory module respectively, wherein the class-level visual memory module stores the class center features of N_v image classes and the class-level text memory module stores the class center features of N_t text classes;
the hybrid-level cross-modal matching module is used for calculating the class-level cross-modal contrast matching loss L_cc, the instance-level cross-modal projection loss L_cmpm and the hybrid-level cross-modal matching loss L;
the overall class-level cross-modal contrast matching loss L_cc is calculated as follows:
given a mini-batch composed of N pedestrian image features and text description features, the image-to-text-class-center cross-modal contrast matching loss L_v2t is calculated as:
L_v2t = -(1/N) Σ_{i=1}^{N} log [ exp(f_i^v · c^{t+} / τ_v) / Σ_{j=1}^{N_t} exp(f_i^v · c_j^t / τ_v) ]
wherein f_i^v is the feature of an image sample in the mini-batch, c^{t+} represents the text class center feature with the same cluster ID as f_i^v, τ_v is a learnable temperature coefficient of the image, and c_j^t is the class center feature of the j-th text class;
the text-to-image-class-center cross-modal contrast matching loss L_t2v is calculated as:
L_t2v = -(1/N) Σ_{i=1}^{N} log [ exp(f_i^t · c^{v+} / τ_t) / Σ_{j=1}^{N_v} exp(f_i^t · c_j^v / τ_t) ]
wherein f_i^t is the feature of a text sample in the mini-batch, c^{v+} represents the image class center feature with the same cluster ID as f_i^t, τ_t is a learnable temperature coefficient of the text, and c_j^v is the class center feature of the j-th image class;
the overall class-level cross-modal contrast matching loss is L_cc = L_v2t + L_t2v;
the instance-level cross-modal projection loss is calculated as follows:
the instance-level cross-modal projection loss consists of the loss L_i2t of projecting image features into the text feature space and the loss L_t2i of projecting text features into the image feature space;
the projection loss L_i2t of image features onto text features is calculated as follows:
given a mini-batch composed of N pedestrian image features and text description features, for each image feature f_i^v the mini-batch can be expressed as {(f_i^v, f_j^t), q_{i,j}}, j = 1, ..., N, wherein q_{i,j} = 1 represents that f_i^v and f_j^t are image and text features of the same pedestrian and q_{i,j} = 0 represents that f_i^v and f_j^t do not match; the probability that f_i^v and f_j^t match is defined as:
p_{i,j} = exp(f_i^v · f_j^t / τ) / Σ_{k=1}^{N} exp(f_i^v · f_k^t / τ)
where τ is a learnable parameter;
in a mini-batch, the number of positive samples matching the image feature f_i^v may be more than one, so the true matching probability is normalized and expressed as:
q'_{i,j} = q_{i,j} / Σ_{k=1}^{N} q_{i,k}
the KL divergence between the image-to-text matching probability and the true matching probability is calculated to obtain the image-to-text matching loss over the mini-batch:
L_i2t = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} · log ( p_{i,j} / (q'_{i,j} + ε) )
wherein ε is a hyper-parameter whose value approaches 0;
the projection loss L_t2i of text features onto image features is calculated as follows:
given a mini-batch composed of N pedestrian image features and text description features, for each text feature f_i^t the mini-batch can be expressed as {(f_i^t, f_j^v), q_{i,j}}, j = 1, ..., N, wherein q_{i,j} = 1 represents that f_i^t and f_j^v are text and image features of the same pedestrian and q_{i,j} = 0 represents that f_i^t and f_j^v do not match; the probability that f_i^t and f_j^v match is defined as:
p_{i,j} = exp(f_i^t · f_j^v / τ) / Σ_{k=1}^{N} exp(f_i^t · f_k^v / τ)
where τ is a learnable parameter;
in a mini-batch, the number of positive samples matching the text feature f_i^t may be more than one, so the true matching probability of f_i^t with the image features is normalized and expressed as:
q'_{i,j} = q_{i,j} / Σ_{k=1}^{N} q_{i,k}
the KL divergence between the text-to-image matching probability and the true matching probability is calculated to obtain the text-to-image matching loss over the mini-batch:
L_t2i = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} · log ( p_{i,j} / (q'_{i,j} + ε) )
wherein ε is a hyper-parameter whose value approaches 0;
the overall instance-level cross-modal projection loss is L_cmpm = L_i2t + L_t2i;
the hybrid-level cross-modal matching loss is L = L_cc + L_cmpm;
And updating the CLIP model parameters in a gradient updating mode, and storing the image encoder and text encoder parameters after training is finished, wherein the parameters are used for calculating cosine similarity between the image characteristics and the text characteristics of the pedestrian image to be retrieved.
CN202311204550.0A 2023-09-19 2023-09-19 Weak supervision text pedestrian retrieval method and system for class-level comparison learning Active CN116935329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311204550.0A CN116935329B (en) 2023-09-19 2023-09-19 Weak supervision text pedestrian retrieval method and system for class-level comparison learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311204550.0A CN116935329B (en) 2023-09-19 2023-09-19 Weak supervision text pedestrian retrieval method and system for class-level comparison learning

Publications (2)

Publication Number Publication Date
CN116935329A CN116935329A (en) 2023-10-24
CN116935329B (en) 2023-12-01

Family

ID=88386536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311204550.0A Active CN116935329B (en) 2023-09-19 2023-09-19 Weak supervision text pedestrian retrieval method and system for class-level comparison learning

Country Status (1)

Country Link
CN (1) CN116935329B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
CN111753190A (en) * 2020-05-29 2020-10-09 中山大学 Meta learning-based unsupervised cross-modal Hash retrieval method
WO2023093574A1 (en) * 2021-11-25 2023-06-01 北京邮电大学 News event search method and system based on multi-level image-text semantic alignment model
CN115546831A (en) * 2022-10-11 2022-12-30 同济人工智能研究院(苏州)有限公司 Cross-modal pedestrian searching method and system based on multi-granularity attention mechanism
CN116186328A (en) * 2023-01-05 2023-05-30 厦门大学 Video text cross-modal retrieval method based on pre-clustering guidance
CN116759076A (en) * 2023-07-12 2023-09-15 山西大学 Unsupervised disease diagnosis method and system based on medical image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A survey of weakly supervised semantic segmentation methods; Li Bin'ai; Li Ying; Hao Mingyang; Gu Shuyu; Digital Communication World (Issue 07); full text *
Zeng Chengbin; Liu Jiqian. Video pedestrian detection algorithm based on graph cut and density clustering. Pattern Recognition and Artificial Intelligence. 2017, (Issue 07), full text. *

Also Published As

Publication number Publication date
CN116935329A (en) 2023-10-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant