CN117152851B - Face and human body collaborative clustering method based on large model pre-training - Google Patents

Face and human body collaborative clustering method based on large model pre-training

Info

Publication number
CN117152851B
CN117152851B (application CN202311303822.2A)
Authority
CN
China
Prior art keywords
training
model
face
tag
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311303822.2A
Other languages
Chinese (zh)
Other versions
CN117152851A (en)
Inventor
温峻峰
李鑫
罗海涛
林群雄
孙全忠
洪小龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Tianwang Guangdong Technology Co ltd
Original Assignee
Zhongke Tianwang Guangdong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Tianwang Guangdong Technology Co ltd filed Critical Zhongke Tianwang Guangdong Technology Co ltd
Priority to CN202311303822.2A priority Critical patent/CN117152851B/en
Publication of CN117152851A publication Critical patent/CN117152851A/en
Application granted granted Critical
Publication of CN117152851B publication Critical patent/CN117152851B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70Multimodal biometrics, e.g. combining information from different biometric modalities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training
    • G06V40/25Recognition of walking or running movements, e.g. gait recognition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention provides a face and human body collaborative clustering method based on large model pre-training, comprising the following steps: acquiring original video data of pedestrians in various scenes to construct a dataset, and preprocessing the dataset; designing a pre-training model to pre-train on the preprocessed dataset; using two feature extraction networks to extract features from the face picture and the pedestrian picture respectively, fusing the two output feature vectors, and computing the loss functions, wherein the face features, the pedestrian features and the fused features are used for joint optimization over multiple loss functions; performing semi-supervised training on the model to further obtain and update the model parameters; and inputting face and human-body pictures into the model, invoking the model parameters obtained in the previous step, and comparing features through the network model to obtain the pedestrian retrieval result. The network framework of the invention realizes dual-channel input of faces and human bodies, fully combines the advantages of face recognition and pedestrian re-identification, and compensates for the shortcomings of either single technique.

Description

Face and human body collaborative clustering method based on large model pre-training
Technical Field
The invention relates to the technical field of computer vision, in particular to a face and human body collaborative clustering method based on large model pre-training.
Background
Today, as urban security becomes increasingly comprehensive, surveillance video is widely used throughout society. Identifying and locating specific pedestrians is of great significance to security problems, especially criminal investigation and search-and-rescue tasks. The most widely applied and mature technology for retrieving specific pedestrians is face recognition; however, owing to camera limitations such as resolution and shooting angle, high-quality face images are often difficult to obtain, making face recognition difficult to carry out. By comparing the advantages and disadvantages of face recognition and human-body recognition, it can be found that using the two methods simultaneously makes their information complementary; face and human body collaborative clustering thus effectively solves the pedestrian retrieval problem in most scenes.
For face recognition and pedestrian re-identification, especially the latter, most existing methods use a pre-trained model to accelerate training and improve results. Because manual labeling is costly, large-scale pedestrian datasets have long been lacking, so most methods can only use models pre-trained on the manually labeled ImageNet dataset. This brings only limited improvement, owing to the huge domain gap between the generic images in ImageNet and the person-centric images required by human-body recognition tasks. In view of this problem, research methods based partly on unsupervised pre-training have emerged, but they are still constrained by dataset size and the improvement is not significant.
Disclosure of Invention
The invention aims to provide a face and human body collaborative clustering method based on large model pre-training which, by designing a suitable pre-training model and retrieval strategy on a deep learning network framework, combines the advantages of face recognition and pedestrian re-identification, thereby improving pedestrian retrieval accuracy in complex scenes.
In order to achieve the above purpose, the invention provides a face and human body collaborative clustering method based on large model pre-training, comprising the following steps:
step 1, acquiring original video data of pedestrians in various scenes to construct a data set, and preprocessing the data set;
step 2, designing a pre-training model to pre-train the preprocessed data set;
step 3, extracting features from the face picture and the pedestrian picture respectively using two ResNet50 backbone networks, fusing the two output feature vectors, and computing the loss functions, wherein the face features, the pedestrian features and the fused features are used for joint optimization over multiple loss functions;
step 4, performing semi-supervised training on the model, and further acquiring and updating model parameters;
and step 5, inputting the face and human-body pictures into the model, invoking the model parameters obtained in step 4, and comparing features through the network model to obtain the pedestrian retrieval result.
Further, the collecting the original video data of pedestrians in various scenes to construct a data set comprises:
using the spatial and temporal correlation in the original video as weak supervision, tracking each person's trajectory over time with any multi-object tracking algorithm, generating a unique pseudo Re-ID tag from each trajectory, and thereby creating a new ultra-large-scale dataset with noisy tags;
the preprocessing of the data set comprises:
step 1.1, first deleting person images whose number of occurrence frames is less than 200;
step 1.2, within the trajectory of each identity, sampling 1 frame out of every 20 to reduce the number of duplicate images.
Further, the designing the pre-training model to pre-train the preprocessed data set includes:
step 2.1, initializing parameters: all data samples in the dataset constructed in the pre-training stage are denoted D = {(x_i, y_i)}_{i=1}^n, where n is the size of the dataset, x_i is a person image, and y_i ∈ {1, …, K} is its associated identity tag, K being the number of all identities recorded in the dataset;
an input person image x_i undergoes two randomly selected augmentations (T, T′), yielding two augmented images x_i^q = T(x_i) and x_i^k = T′(x_i); x_i^q is fed into the encoder E_q to obtain the query feature q_i, and x_i^k is fed into another encoder E_k to obtain the key feature k_i; E_k is the momentum version of E_q: the weights in E_k are an exponential moving average of the weights in E_q, and E_k is refreshed by momentum updates of the weights from E_q;
step 2.2, supervised classification: let ŷ_i be the corrected tag of image x_i; a classifier is added that converts the query feature q_i from encoder E_q into a probability p_i over the K categories, i.e. p_i ∈ R^K, and the classification cross-entropy loss L_cls(p_i, ŷ_i) is applied;
step 2.3, performing label correction through prototypes: the prototypes {c_1, c_2, …, c_K} are maintained as a feature vector dictionary, where c_k ∈ R^d is the prototype representing the centroid of the features of class k; in each training step, the similarity score s_i^k between the query feature q_i and each current prototype c_k is first computed as
s_i^k = exp(q_i · c_k / τ) / Σ_{k′=1}^{K} exp(q_i · c_{k′} / τ)
where τ is a hyper-parameter representing temperature, K represents the total number of prototypes and k indexes the k-th prototype in the formula; let p_i be the classification probability given by the classifier whose weights were updated in the previous step;
then the corrected tag ŷ_i of this step is generated by combining the prototype score s_i and the classification probability p_i:
a soft pseudo tag l_i is computed and converted into the corrected tag ŷ_i according to the threshold T,
ŷ_i = argmax_j l_i^j if max_j l_i^j > T, otherwise ŷ_i = y_i,
where argmax returns the category index j at which l_i attains its maximum; that is, if the highest score in l_i is greater than T, the corresponding category is selected as ŷ_i, otherwise the original tag y_i is retained;
step 2.4, prototype-based contrastive learning: the new corrected tags ŷ_i update the supervised classification cross-entropy loss L_cls; at the same time a prototype-based contrastive loss L_pro is proposed, which constrains the feature of each sample to stay close to the prototype it belongs to;
all prototypes are maintained as a dictionary and updated step by step with a momentum mechanism, expressed as follows:
c_{ŷ_i} ← m · c_{ŷ_i} + (1 − m) · q_i
where c_{ŷ_i} denotes the entry of the feature vector dictionary indexed by the corrected tag ŷ_i, q_i is the query feature from E_q, and m represents the momentum of the prototype;
step 2.5, contrastive learning under label guidance: self-supervised learning uses instance-level contrastive learning, where the instance-level contrastive loss function L_ins is as follows:
L_ins = −log [ exp(q_i · k_i^+ / τ) / ( exp(q_i · k_i^+ / τ) + Σ_{j=1}^{M} exp(q_i · k_j^− / τ) ) ]
where q_i represents the query feature; τ is the hyper-parameter representing temperature; k_i^+ is the positive sample generated by the momentum encoder E_k — its key property is that it shares the same instance as q_i, and k_i^+ equals k_i; k_j^− denotes the remaining features stored in the queue, which serve as negative samples; M represents the size of the queue and j traverses it from 1 to M; at the end of each training step, the queue is updated by enqueuing the new key features and dequeuing the oldest features; on top of this, a label-guided contrastive learning module is used, employing the corrected tags ŷ to ensure more reasonable contrastive grouping;
step 2.6, redesigning the queue: the corrected tag ŷ_i is recorded, and the new queue receives not only the key feature k_i but also the corrected tag ŷ_i; the model is pre-trained on the dataset using the following loss function:
L = L_cls + λ_pro · L_pro + λ_lgc · L_lgc
where L_cls denotes the cross-entropy loss function, L_pro and λ_pro denote the prototype-based contrastive loss function and its weight, and L_lgc and λ_lgc denote the label-guided contrastive loss function and its weight.
Further, the prototype-based contrastive loss L_pro is expressed as follows:
L_pro = −log [ exp(q_i · c_{ŷ_i} / τ) / Σ_{j=1}^{K} exp(q_i · c_j / τ) ]
where c_j denotes the j-th prototype vector.
Further, step 2.6 further includes: the key feature k_i and the corrected tag ŷ_i are used to distinguish positive and negative pairs; let K_i^+ be the new positive feature set and K_i^− the new negative feature set, where the features in K_i^+ carry the same corrected tag as the current instance i and the features in K_i^− do not:
K_i^+ = { k_j ∈ queue | ŷ_j = ŷ_i } ∪ { k_i } and K_i^− = { k_j ∈ queue | ŷ_j ≠ ŷ_i }
where k_i and ŷ_i are the key feature and corrected tag of the current instance i.
Further, the feature extraction network is a ResNet50 backbone; the global average pooling layer of the original ResNet50 backbone is removed and an adaptive global average pooling layer is appended at the end. The ResNet50 backbones adopt a deep mutual learning loss with a Wasserstein distance metric to constrain the model to learn the features common to the feature map extracted from the human-body image and the feature map extracted from the face image; the deep mutual learning loss function L_dml is as follows:
L_dml = W_distance(v_f, v_b)
where W_distance is the Wasserstein distance metric, v_f is the face feature output by the ResNet50 backbone, and v_b is the human-body feature output by the ResNet50 backbone.
Further, so that the output features contain only the common information needed to distinguish different pedestrians while redundant information is removed, a triplet loss function and a cross-entropy loss function are adopted to constrain the pedestrian features and the pedestrian identity information respectively, where the triplet loss function L_tri is as follows:
L_tri = (d_{a,p} − d_{a,n} + α)_+
where d_{a,p} is the distance between a positive sample pair, d_{a,n} is the distance between a negative sample pair, α is a manually set margin, and (z)_+ denotes max(z, 0);
the cross-entropy loss function L_id is as follows:
L_id = E[−log p(y_i | z_i)]
where y_i is the identity tag of the i-th input image, z_i is the predicted class vector of the i-th input image, and p(y_i | z_i) is the predicted probability that z_i belongs to identity tag y_i; the final overall loss function L_total is as follows:
L_total = λ_1 L_dml + λ_2 L_tri + λ_3 L_id
where λ_1, λ_2, λ_3 are weight values that balance the effect of the different losses during training.
Further, the specific steps of step 4 include:
step 4.1, dividing the data into a training group and a control group at a labeled-to-unlabeled ratio of 1:9, first training with the labeled data of the training group, and preliminarily obtaining and updating the model parameters;
step 4.2, inputting the unlabeled data of the control group, and clustering the unlabeled target-domain image features with a clustering algorithm to generate pseudo tags;
step 4.3, merging the pseudo-tag data and the tagged data into a new, expanded dataset, training the model of step 3 with this dataset, and computing the losses of the tagged data and the pseudo-tag data separately;
step 4.4, obtaining new model parameters from training, substituting the new parameters into the network, and repeating from step 4.3 until the total loss value computed by the loss function no longer changes between training cycles.
Further, the total loss over the tagged data and the pseudo-tag data is expressed as:
L_semi = (1/n) Σ_{i=1}^{n} Σ_{c=1}^{C} L(y_i^c, f_i^c) + α(t) · (1/n′) Σ_{i=1}^{n′} Σ_{c=1}^{C} L(y′_i^c, f′_i^c)
where y_i^c and f_i^c are the tag and prediction of the tagged data, y′_i^c and f′_i^c are the pseudo tag and prediction of the pseudo-tag data, n is the number of tagged samples in each mini-batch, n′ is the number of pseudo-tag samples in each mini-batch, C is the number of categories, and α(t) is the weight (penalty coefficient) of the pseudo-tag loss.
Further, α(t) is a weight parameter that gradually increases with the number of iterations; that is, as the number of model training rounds grows, the weight of the loss on the unlabeled data increases, whereby the model parameters are further obtained.
The beneficial technical effects of the invention are at least as follows:
1. The large-scale pre-training method can construct a large-scale noisy unlabeled dataset and train on it effectively with a pre-training framework that combines several methods, so that the network backbone learns preliminary pedestrian features, the large gap between the source and target datasets is reduced, and the commonly used ImageNet-pre-trained network model can be satisfactorily replaced.
2. The network framework of the invention realizes dual-channel input of the face and the human body, extracts and fuses the face and human-body features separately, fully combines the advantages of face recognition and pedestrian re-identification, compensates for the shortcomings of a single technology, and effectively improves pedestrian retrieval accuracy and robustness in complex scenes.
3. The training stage of the invention uses a clustering method to generate pseudo tags and can perform large-scale semi-supervised training with only a small amount of labeled data, greatly reducing the manual labeling cost while effectively improving the performance of the network model.
Drawings
The invention will be further described with reference to the accompanying drawings; the embodiments shown therein do not constitute any limitation of the invention, and one of ordinary skill in the art can obtain other drawings from the following drawings without inventive effort.
FIG. 1 is a flow chart of a face and human body collaborative clustering method based on large model pre-training.
FIG. 2 is a block diagram of a pre-training framework according to an embodiment of the present invention;
FIG. 3 is a flow chart of semi-supervised training in accordance with an embodiment of the present invention;
fig. 4 is a general structure diagram of a face and human body collaborative search method according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
In one or more embodiments, as shown in fig. 1, a face and human body collaborative clustering method based on large model pre-training is disclosed, the method comprising:
step 1, acquiring original video data of pedestrians in various scenes to construct a data set, and preprocessing the data set;
step 2, designing a pre-training model to pre-train the preprocessed data set;
step 3, using two feature extraction networks to extract features from the face picture and the pedestrian picture respectively, fusing the two output feature vectors, and computing the loss functions, wherein the face features, the pedestrian features and the fused features are used for joint optimization over multiple loss functions;
step 4, performing semi-supervised training on the model, and further acquiring and updating model parameters;
and step 5, inputting the face and human-body pictures into the model, invoking the model parameters obtained in step 4, and comparing features through the network model to obtain the pedestrian retrieval result.
Specifically, the specific steps of the step 1 are as follows:
the spatial and temporal correlation in the original video is used as weak supervision, and is realized by any multi-objective tracking algorithm to track a person's track in time, a unique pseudo Re-ID label is generated from the track, and a new ultra-large scale data set with noise labels is created.
Each person in the video is tracked frame by frame, and the dataset is preprocessed with the following steps:
step 1.1, first deleting person images whose number of occurrence frames is less than 200;
step 1.2, within the trajectory of each identity, sampling 1 frame out of every 20 to reduce the number of duplicate images.
For the preprocessed dataset, it can be ensured that at least 10 images are associated with each identity. This results in a noisy large-scale dataset in which, besides correctly annotated identities, there are two types of annotation error: one noise is the same person being marked with different identities, and the other is different persons being marked with the same identity.
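By way of illustration only, the following minimal Python sketch shows how such a noisy pseudo-labelled dataset could be assembled from tracker output; the `detections` format and all function and field names are assumptions introduced here, not part of the claimed method, while the 200-frame filter and the 1-sample-per-20-frames rule follow the two preprocessing steps above.

```python
from collections import defaultdict

def build_noisy_dataset(detections, min_frames=200, sample_every=20):
    """Build a pseudo-labelled dataset from tracked detections.

    `detections` is assumed to be an iterable of (track_id, frame_idx, crop)
    triples produced by any multi-object tracker; each track_id serves as a
    (noisy) pseudo Re-ID tag.
    """
    tracks = defaultdict(list)
    for track_id, frame_idx, crop in detections:
        tracks[track_id].append((frame_idx, crop))

    dataset = []
    for track_id, frames in tracks.items():
        if len(frames) < min_frames:          # step 1.1: drop short tracks
            continue
        frames.sort(key=lambda f: f[0])
        # step 1.2: keep 1 sample out of every 20 frames to reduce duplicates
        for frame_idx, crop in frames[::sample_every]:
            dataset.append((crop, track_id))  # (image, pseudo identity tag)
    return dataset
```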
In step 2, a new pre-training framework that uses the noisy tags is designed on the basis of convolutional neural networks in deep learning. Fig. 2 shows the overall structural design of the model, which includes a simple supervised learning module, a prototype-based contrastive learning module and a label-guided contrastive learning module; the framework combines supervised learning, prototype-based contrastive learning, label-guided contrastive learning and noisy-label correction. The pre-training model is designed to pre-train on the preprocessed dataset; the specific steps include:
step 2.1, initializing parameters: all data samples in the dataset constructed in the pre-training stage are denoted D = {(x_i, y_i)}_{i=1}^n, where n is the size of the dataset, x_i is a person image, and y_i ∈ {1, …, K} is its associated identity tag, K being the number of all identities recorded in the dataset;
The pre-training framework employs a twin network for contrastive representation learning. An input person image x_i undergoes two randomly selected augmentations (T, T′), yielding two augmented images x_i^q = T(x_i) and x_i^k = T′(x_i); x_i^q is fed into the encoder E_q to obtain the query feature q_i, and x_i^k is fed into another encoder E_k to obtain the key feature k_i; E_k is the momentum version of E_q: the weights in E_k are an exponential moving average of the weights in E_q, and E_k is refreshed by momentum updates of the weights from E_q;
step 2.2, supervised classification: let ŷ_i be the corrected tag of image x_i; a classifier is added that converts the query feature q_i from encoder E_q into a probability p_i over the K categories, i.e. p_i ∈ R^K, and the classification cross-entropy loss L_cls(p_i, ŷ_i) is applied.
Obtaining ŷ_i is not straightforward. Prototypes — the moving centroids of the average features over the training instances — are employed here to accomplish this task.
Step 2.3, performing label correction through prototypes: the prototypes are maintained as a feature vector dictionary {c_1, c_2, …, c_K}, where c_k ∈ R^d is the prototype representing the centroid of the features of class k; in each training step, the similarity score s_i^k between the query feature q_i and each current prototype c_k is first computed as
s_i^k = exp(q_i · c_k / τ) / Σ_{k′=1}^{K} exp(q_i · c_{k′} / τ)
where τ is a hyper-parameter indicating temperature and K is the total number of prototypes; let p_i be the classification probability given by the classifier whose weights were updated in the previous step;
then the corrected tag ŷ_i of this step is generated by combining the prototype score s_i and the classification probability p_i:
a soft pseudo tag l_i is computed and converted into the corrected tag ŷ_i according to the threshold T; if the highest score in l_i is greater than T, the corresponding category is selected as ŷ_i, otherwise the original tag y_i is retained.
Step 2.4, prototype-based contrastive learning: the new corrected tags ŷ_i update the supervised classification cross-entropy loss L_cls; at the same time a prototype-based contrastive loss L_pro is proposed, which constrains the feature of each sample to stay close to the prototype it belongs to.
All prototypes are maintained as a dictionary and updated step by step with a momentum mechanism, expressed as follows:
c_{ŷ_i} ← m · c_{ŷ_i} + (1 − m) · q_i
where c_{ŷ_i} denotes the entry of the feature vector dictionary indexed by the corrected tag ŷ_i, q_i is the query feature from E_q, and m represents the momentum of the prototype.
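A sketch of the prototype dictionary update and of L_pro under the same assumptions (normalised features; cross-entropy over prototype logits, which equals the −log-softmax form of L_pro given further below):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_prototypes(prototypes, q, y_hat, m=0.999):
    """Momentum update c_{y_hat} <- m*c_{y_hat} + (1-m)*q for each sample."""
    for feat, label in zip(q, y_hat):
        prototypes[label].mul_(m).add_(feat, alpha=1.0 - m)
        prototypes[label] = F.normalize(prototypes[label], dim=0)

def prototype_loss(q, prototypes, y_hat, tau=0.1):
    """L_pro: pull each query feature towards its corrected-tag prototype."""
    logits = q @ prototypes.t() / tau                # (B, K)
    return F.cross_entropy(logits, y_hat)
```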
Step 2.5, contrastive learning under label guidance: self-supervised learning uses instance-level contrastive learning, where the instance-level contrastive loss function L_ins is as follows:
L_ins = −log [ exp(q_i · k_i^+ / τ) / ( exp(q_i · k_i^+ / τ) + Σ_{j=1}^{M} exp(q_i · k_j^− / τ) ) ]
where q_i represents the query feature; τ is the hyper-parameter representing temperature; k_i^+ is the positive sample generated by the momentum encoder E_k — its key property is that it shares the same instance as q_i, and k_i^+ equals k_i; k_j^− denotes the remaining features stored in the queue, which serve as negative samples; M represents the size of the queue and j traverses it from 1 to M. At the end of each training step, the queue is updated by enqueuing the new key features and dequeuing the oldest features.
However, this instance-based contrastive learning is far from perfect, because it ignores the relationships between different instances: even when two instances depict the same person, it still enlarges the gap between them. Therefore, a label-guided contrastive learning module is used here, utilizing the corrected tags ŷ to ensure more reasonable contrastive grouping.
Step 2.6, redesigning the queue: the corrected tag ŷ_i is recorded, and the new queue receives not only the key feature k_i but also the corrected tag ŷ_i; the model is pre-trained on the dataset using the following loss function:
L = L_cls + λ_pro · L_pro + λ_lgc · L_lgc
where L_cls denotes the classification cross-entropy loss, L_pro denotes the prototype-based contrastive loss function, and L_lgc denotes the label-guided contrastive loss function.
In particular, the prototype-based contrastive loss L_pro is expressed as follows:
L_pro = −log [ exp(q_i · c_{ŷ_i} / τ) / Σ_{j=1}^{K} exp(q_i · c_j / τ) ]
where c_j denotes the j-th prototype vector.
Specifically, the key feature k_i and the corrected tag ŷ_i are used to distinguish positive and negative pairs; let K_i^+ be the new positive feature set and K_i^− the new negative feature set, where the features in K_i^+ carry the same corrected tag as the current instance i and the features in K_i^− do not:
K_i^+ = { k_j ∈ queue | ŷ_j = ŷ_i } ∪ { k_i } and K_i^− = { k_j ∈ queue | ŷ_j ≠ ŷ_i }
where k_i and ŷ_i are the key feature and corrected tag of the current instance i.
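The embodiment defines the positive and negative sets K_i^+ and K_i^− but does not reproduce a closed form for L_lgc; the sketch below uses the standard supervised-contrastive form over the label-carrying queue and should be read as an assumed instantiation, not the claimed formula.

```python
import torch
import torch.nn.functional as F

def label_guided_loss(q, y_hat, queue_feats, queue_tags, tau=0.1):
    """Contrast each query against queued keys, grouped by corrected tags.

    q:           (B, d) query features, L2-normalised
    y_hat:       (B,)   corrected tags of the batch
    queue_feats: (M, d) key features stored in the redesigned queue
    queue_tags:  (M,)   corrected tags stored alongside the keys
    """
    logits = q @ queue_feats.t() / tau                      # (B, M)
    pos_mask = (y_hat.unsqueeze(1) == queue_tags.unsqueeze(0)).float()
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # average the log-probability over the positive set K_i^+ of each query
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    return -(pos_mask * log_prob).sum(dim=1).div(pos_count).mean()
```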
During training, λ_pro and λ_lgc are both set to 1, the hyper-parameter τ is set to 0.1, and T is set to 0.8. The momentum m used to update the momentum encoder E_k and the prototypes is set to 0.999.
In step 3, two ResNet50 networks are used to extract features from the face picture and the pedestrian picture respectively, and the two feature vectors output by the networks are fused, so that the face features, the pedestrian features and the fused features are used for joint optimization over multiple loss functions. The feature extraction network takes ResNet50 as its backbone; the global average pooling layer of the original ResNet50 backbone is removed, and an adaptive global average pooling layer is appended at the end.
One learning goal of the ResNet50 backbones in the model is to make the distributions of the output face features and human-body features as similar as possible, so the model is constrained to learn the common features of the two feature maps by a deep mutual learning loss with a Wasserstein distance metric; the deep mutual learning loss function is as follows:
L_dml = W_distance(v_f, v_b)
where W_distance is the Wasserstein distance metric, v_f is the face feature output by ResNet50, and v_b is the human-body feature output by ResNet50.
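The embodiment does not specify how the Wasserstein distance between v_f and v_b is computed; the sketch below uses the inexpensive one-dimensional Wasserstein distance between sorted feature values as an assumed stand-in.

```python
import torch

def w_distance(v_f, v_b):
    """1-D Wasserstein distance between face and body feature distributions.

    Treats each feature vector's d values as samples of a 1-D distribution;
    for equal-length sorted samples, W1 is the mean absolute difference.
    """
    vf_sorted, _ = torch.sort(v_f, dim=-1)
    vb_sorted, _ = torch.sort(v_b, dim=-1)
    return (vf_sorted - vb_sorted).abs().mean()

def dml_loss(v_f, v_b):
    """Deep mutual learning loss L_dml = W_distance(v_f, v_b)."""
    return w_distance(v_f, v_b)
```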
in order to distinguish different pedestrians as far as possible and remove redundant information by enabling the output characteristics to only contain all common information as far as possible, the pedestrian characteristics and the pedestrian identity information are respectively restrained by adopting a triplet loss function and a cross entropy loss function, wherein the triplet loss function is as follows:
L tri =(d a,p -d a,n +α) +
wherein d a,p Distance d between positive sample pair a,n For the distance between negative samples, α is a artificially set threshold, (z) + Represents max (z, 0);
the cross entropy loss function is as follows:
L id =E[-log(p(y i |x i ))]
wherein y is i For the true class of the ith input image, x i For the predictive class vector of the ith input image, p (y i |x i ) To calculate x i Belonging to category y i Is used for the prediction probability of (1). Final total loss functionThe following are provided:
L total =λ 1 L dml2 L tri3 L id
wherein different lambda i The weight values are used to balance the effect of the different losses in the training process.
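An illustrative composition of the three losses, reusing `w_distance`/`dml_loss` from the sketch above; the margin and the λ weights are assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3):
    """L_tri = (d_{a,p} - d_{a,n} + alpha)_+ ; the margin value is assumed."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.relu(d_ap - d_an + margin).mean()

def total_loss(v_f, v_b, logits, labels, triplets, lambdas=(1.0, 1.0, 1.0)):
    """L_total = l1*L_dml + l2*L_tri + l3*L_id (weights are assumptions)."""
    l_dml = dml_loss(v_f, v_b)                  # from the earlier sketch
    l_tri = triplet_loss(*triplets)
    l_id = F.cross_entropy(logits, labels)      # L_id = E[-log p(y|z)]
    l1, l2, l3 = lambdas
    return l1 * l_dml + l2 * l_tri + l3 * l_id
```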
As shown in fig. 3, in step 4, specific steps include:
step 4.1, dividing the data into a training group and a control group at a labeled-to-unlabeled ratio of 1:9, first training with the labeled data of the training group, and preliminarily obtaining and updating the model parameters;
step 4.2, inputting the unlabeled data of the control group, and clustering the unlabeled target-domain image features with a clustering algorithm to generate pseudo tags (an illustrative clustering sketch follows this list);
step 4.3, merging the pseudo-tag data and the tagged data into a new, expanded dataset, performing semi-supervised training on the model of step 3 with this dataset, and computing the losses of the tagged data and the pseudo-tag data separately;
step 4.4, obtaining new model parameters from training, substituting the new parameters into the network, and repeating from step 4.3 until the total loss value computed by the loss function no longer changes between training cycles.
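The clustering algorithm of step 4.2 is not named in this embodiment; DBSCAN, a common choice for re-identification pseudo-labelling, is used below purely as an assumption.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def generate_pseudo_labels(features, eps=0.5, min_samples=4):
    """Cluster unlabeled target-domain features into pseudo identity tags.

    features: (N, d) numpy array of L2-normalised image features.
    Returns an (N,) array of tags; -1 marks outliers DBSCAN left unassigned.
    """
    return DBSCAN(eps=eps, min_samples=min_samples,
                  metric="cosine").fit_predict(features)
```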
Specifically, the total loss over the tagged data and the pseudo-tag data is expressed as:
L_semi = (1/n) Σ_{i=1}^{n} Σ_{c=1}^{C} L(y_i^c, f_i^c) + α(t) · (1/n′) Σ_{i=1}^{n′} Σ_{c=1}^{C} L(y′_i^c, f′_i^c)
where y_i^c and f_i^c are the tag and prediction of the tagged data, y′_i^c and f′_i^c are the pseudo tag and prediction of the pseudo-tag data, n is the number of tagged samples in each mini-batch, n′ is the number of pseudo-tag samples in each mini-batch, C is the number of categories, and α(t) is the weight of the pseudo-tag loss. Training obtains new model parameters; the new parameters are substituted into the network and the procedure repeats from step 4.3 until the network performance no longer improves significantly. α(t) is a weight parameter that gradually increases with the number of iterations: as the model training rounds iterate, the weight of the loss on the unlabeled data gradually increases. The model parameters are further obtained by this method.
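A minimal sketch of the weighted semi-supervised loss with a ramp-up α(t); the linear schedule and its constants are assumptions, as the embodiment only states that α(t) grows with the iteration count.

```python
import torch.nn.functional as F

def alpha(t, t_ramp=80, alpha_max=3.0):
    """alpha(t): weight that grows with the training round t (schedule assumed)."""
    return alpha_max * min(t / t_ramp, 1.0)

def semi_supervised_loss(logits_l, labels_l, logits_u, pseudo_labels, t):
    """Labeled cross-entropy plus alpha(t)-weighted pseudo-label cross-entropy."""
    loss_l = F.cross_entropy(logits_l, labels_l)
    loss_u = F.cross_entropy(logits_u, pseudo_labels)
    return loss_l + alpha(t) * loss_u
```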
In step 5, fig. 4 shows the overall structure of the face and human body collaborative retrieval method of this embodiment. The face and human-body pictures of the same pedestrian are input into the model, the model parameters trained in step 4 are invoked, and the pedestrian retrieval result closest to the target image is obtained through computation and comparison by the network model.
Specifically, the network first extracts and stores features for all image pairs in the gallery. When a pedestrian needs to be retrieved, the face/human-body image pair of that pedestrian is input, the model extracts its feature vector, the previously computed and stored feature gallery is loaded, the similarity between the query feature and all gallery features is computed and compared, and the retrieval result is given according to similarity.
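An illustrative retrieval sketch under the assumption of a dual-input model returning a fused feature and a pre-extracted gallery feature matrix; all names are assumptions.

```python
import torch
import torch.nn.functional as F

def search_gallery(model, face_img, body_img, gallery_feats, gallery_ids, top_k=10):
    """Rank gallery identities by cosine similarity to the fused query feature.

    gallery_feats: (N, d) fused features pre-extracted for all gallery pairs
    gallery_ids:   list of N identity records
    """
    model.eval()
    with torch.no_grad():
        query = model(face_img, body_img)        # fused face+body feature, (1, d)
        query = F.normalize(query, dim=-1)
        sims = query @ F.normalize(gallery_feats, dim=-1).t()
        scores, idx = sims.squeeze(0).topk(top_k)
    return [(gallery_ids[i], s.item()) for i, s in zip(idx, scores)]
```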
The foregoing embodiments are each described with its own emphasis; for parts not described or detailed in one embodiment, reference may be made to the related descriptions of the other embodiments.
While embodiments of the invention have been shown and described, it will be understood by those skilled in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A face and human body collaborative clustering method based on large model pre-training, characterized by comprising the following steps:
step 1, acquiring original video data of pedestrians in various scenes to construct a data set, and preprocessing the data set;
step 2, designing a pre-training model to pre-train the preprocessed data set;
step 3, extracting features from the face picture and the pedestrian picture respectively using two ResNet50 backbone networks, fusing the two output feature vectors, and computing the loss functions, wherein the face features, the pedestrian features and the fused features are used for joint optimization over multiple loss functions;
step 4, performing semi-supervised training on the model, and further acquiring and updating model parameters;
step 5, inputting the face and human-body pictures into the model, invoking the model parameters obtained in step 4, and comparing features through the network model to obtain the pedestrian retrieval result;
the design pre-training model pre-trains the preprocessed data set, and comprises the following steps:
step 2.1, initializing parameters: all data samples in the dataset constructed in the pre-training stage are denoted D = {(x_i, y_i)}_{i=1}^n, where n is the size of the dataset, x_i is a person image, and y_i ∈ {1, …, K} is its associated identity tag, K being the number of all identities recorded in the dataset;
an input person image x_i is fed to the twin network and two randomly selected augmentations (T, T′) are performed, yielding two augmented images x_i^q = T(x_i) and x_i^k = T′(x_i); x_i^q is fed into the encoder E_q to obtain the query feature q_i, and x_i^k is fed into another encoder E_k to obtain the key feature k_i; E_k is the momentum version of E_q: the weights in E_k are an exponential moving average of the weights in E_q, and E_k is refreshed by momentum updates of the weights from E_q;
step 2.2, supervised classification: let ŷ_i be the corrected tag of image x_i; a classifier is added that converts the query feature q_i from encoder E_q into a probability p_i over the K categories, i.e. p_i ∈ R^K, and the classification cross-entropy loss L_cls(p_i, ŷ_i) is applied;
Step 2.3, performing label correction through a prototype: prototype { c 1 ,c 2 ,…,c k Maintained as a feature vector dictionary, c k ∈R d ,R d Is a prototype representing the centroid of the class feature, in each training step, first by computing the query feature qi with each prototype c at present k The similarity between the two results in a similarity score
Where τ is a super-parameter representing temperature, K represents the total of the prototypeThe number, k, represents the kth prototype in the formula, let p be i For the classification probability given by the classifier that updated the weights in the previous step,
then, the correction label of this stepBy giving prototype scores->And classification probability p i The combination of the two components is generated,
here, a soft pseudo tag l is calculated i And converts it into correction labels according to the threshold TWherein the result of argmax is to make the function +.>The j point set taking the maximum value, j represents the number of traversals from 0 to M, if the highest score in li +.>If the number is greater than T, selecting the corresponding category as +.>Otherwise the original label y will remain i
Step 2.4, model-based comparative learning: from new correction tagsCross entropy loss of update supervision class->At the same time propose prototype-based contrast loss->The features used to constrain each sample are the same as the belonging prototype,
all prototypes were maintained as a dictionary, with stepwise updates of the momentum mechanism, expressed as follows:
wherein,the representation is based on correction tag->Feature vector dictionary, q i Is from E q M represents the momentum of the prototype;
step 2.5, contrastive learning under label guidance: self-supervised learning uses instance-level contrastive learning, where the instance-level contrastive loss function L_ins is as follows:
L_ins = −log [ exp(q_i · k_i^+ / τ) / ( exp(q_i · k_i^+ / τ) + Σ_{j=1}^{M} exp(q_i · k_j^− / τ) ) ]
where q_i represents the query feature; τ is the hyper-parameter representing temperature; k_i^+ is the positive sample generated by the momentum encoder E_k — its key property is that it shares the same instance as q_i, and k_i^+ equals k_i; k_j^− denotes the remaining features stored in the queue, which serve as negative samples; M represents the size of the queue and j traverses it from 1 to M; at the end of each training step, the queue is updated by enqueuing the new key features and dequeuing the oldest features; a label-guided contrastive learning module is used, employing the corrected tags ŷ to ensure more reasonable contrastive grouping;
step 2.6, redesigning the queue: the corrected tag ŷ_i is recorded, and the new queue receives not only the key feature k_i but also the corrected tag ŷ_i; the model is pre-trained on the dataset using the following loss function:
L = L_cls + λ_pro · L_pro + λ_lgc · L_lgc
where L_cls denotes the cross-entropy loss function, L_pro and λ_pro denote the prototype-based contrastive loss function and its weight, and L_lgc and λ_lgc denote the label-guided contrastive loss function and its weight.
2. The face and human body collaborative clustering method based on large model pre-training according to claim 1, wherein the acquiring original video data of pedestrians in various scenes to construct a dataset comprises:
using the spatial and temporal correlation in the original video as weak supervision information, tracking each person's trajectory over time with any multi-object tracking algorithm, generating a unique pseudo Re-ID tag from each trajectory, and creating a new ultra-large-scale dataset with noisy tags;
the preprocessing of the dataset comprises:
step 1.1, first deleting person images whose number of occurrence frames is less than 200;
step 1.2, within the trajectory of each identity, sampling 1 frame out of every 20 to reduce the number of duplicate images.
3. The face and human body collaborative clustering method based on large model pre-training according to claim 1, wherein the prototype-based contrastive loss L_pro is expressed as follows:
L_pro = −log [ exp(q_i · c_{ŷ_i} / τ) / Σ_{j=1}^{K} exp(q_i · c_j / τ) ]
where c_j denotes the j-th prototype vector.
4. The face and human body collaborative clustering method based on large model pre-training according to claim 1, wherein step 2.6 further comprises: the key feature k_i and the corrected tag ŷ_i are used to distinguish positive and negative pairs; let K_i^+ be the new positive feature set and K_i^− the new negative feature set, where the features in K_i^+ carry the same corrected tag as the current instance i and the features in K_i^− do not:
K_i^+ = { k_j ∈ queue | ŷ_j = ŷ_i } ∪ { k_i } and K_i^− = { k_j ∈ queue | ŷ_j ≠ ŷ_i }
where k_i and ŷ_i are the key feature and corrected tag of the current instance i.
5. The face and human body collaborative clustering method based on large model pre-training according to claim 1, wherein the ResNet50 backbone network deletes the global average pooling layer of the original ResNet50 backbone and appends an adaptive global average pooling layer at the end; the ResNet50 backbones adopt a deep mutual learning loss with a Wasserstein distance metric to constrain the model to learn the features common to the feature map extracted from the human-body image and the feature map extracted from the face image, with the deep mutual learning loss function L_dml as follows:
L_dml = W_distance(v_f, v_b)
where W_distance is the Wasserstein distance metric, v_f is the face feature output by the ResNet50 backbone, and v_b is the human-body feature output by the ResNet50 backbone.
6. The face and human body collaborative clustering method based on large model pre-training according to claim 5, wherein, so that the output features contain only the common information needed to distinguish different pedestrians while redundant information is removed, a triplet loss function and a cross-entropy loss function are adopted to constrain the pedestrian features and the pedestrian identity information respectively, where the triplet loss function L_tri is as follows:
L_tri = (d_{a,p} − d_{a,n} + α)_+
where d_{a,p} is the distance between a positive sample pair, d_{a,n} is the distance between a negative sample pair, α is a manually set margin, and (z)_+ denotes max(z, 0);
the cross-entropy loss function L_id is as follows:
L_id = E[−log p(y_i | z_i)]
where y_i is the identity tag of the i-th input image, z_i is the predicted class vector of the i-th input image, and p(y_i | z_i) is the predicted probability that z_i belongs to identity tag y_i; the final overall loss function L_total is as follows:
L_total = λ_1 L_dml + λ_2 L_tri + λ_3 L_id
where λ_1, λ_2, λ_3 are weight values that balance the effect of the different losses during training.
7. The face and human body collaborative clustering method based on large model pre-training according to claim 1, wherein the specific steps of step 4 include:
step 4.1, dividing the data into a training group and a control group at a labeled-to-unlabeled ratio of 1:9, first training with the labeled data of the training group, and preliminarily obtaining and updating the model parameters;
step 4.2, inputting the unlabeled data of the control group, and clustering the unlabeled target-domain image features with a clustering algorithm to generate pseudo tags;
step 4.3, merging the pseudo-tag data and the tagged data into a new, expanded dataset, training the model of step 3 with this dataset, and computing the losses of the tagged data and the pseudo-tag data separately;
step 4.4, obtaining new model parameters from training, substituting the new parameters into the network, and repeating from step 4.3 until the total loss value computed by the loss function no longer changes between training cycles.
8. The face and human body collaborative clustering method based on large model pre-training according to claim 7, wherein the total loss over the tagged data and the pseudo-tag data is expressed as:
L_semi = (1/n) Σ_{i=1}^{n} Σ_{c=1}^{C} L(y_i^c, f_i^c) + α(t) · (1/n′) Σ_{i=1}^{n′} Σ_{c=1}^{C} L(y′_i^c, f′_i^c)
where y_i^c and f_i^c are the tag and prediction of the tagged data, y′_i^c and f′_i^c are the pseudo tag and prediction of the pseudo-tag data, n is the number of tagged samples in each mini-batch, n′ is the number of pseudo-tag samples in each mini-batch, C is the number of categories, and α(t) is the weight (penalty coefficient) of the pseudo-tag loss.
9. The face and human body collaborative clustering method based on large model pre-training according to claim 1, wherein α(t) is a weight parameter that gradually increases with the number of iterations; that is, as the number of model training rounds grows, the weight of the loss on the unlabeled data increases, whereby the model parameters are further obtained.
CN202311303822.2A 2023-10-09 2023-10-09 Face and human body collaborative clustering method based on large model pre-training Active CN117152851B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311303822.2A CN117152851B (en) 2023-10-09 2023-10-09 Face and human body collaborative clustering method based on large model pre-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311303822.2A CN117152851B (en) 2023-10-09 2023-10-09 Face and human body collaborative clustering method based on large model pre-training

Publications (2)

Publication Number Publication Date
CN117152851A (en) 2023-12-01
CN117152851B (en) 2024-03-08

Family

ID=88898952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311303822.2A Active CN117152851B (en) 2023-10-09 2023-10-09 Face and human body collaborative clustering method based on large model pre-training

Country Status (1)

Country Link
CN (1) CN117152851B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344787A (en) * 2018-10-15 2019-02-15 浙江工业大学 A kind of specific objective tracking identified again based on recognition of face and pedestrian
CN111488804A (en) * 2020-03-19 2020-08-04 山西大学 Labor insurance product wearing condition detection and identity identification method based on deep learning
CN114511878A (en) * 2022-01-05 2022-05-17 南京航空航天大学 Visible light infrared pedestrian re-identification method based on multi-modal relational polymerization
CN115205890A (en) * 2022-05-13 2022-10-18 南京博雅集智智能技术有限公司 Method and system for re-identifying pedestrians of non-motor vehicles

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220067506A1 (en) * 2020-08-28 2022-03-03 Salesforce.Com, Inc. Systems and methods for partially supervised learning with momentum prototypes


Also Published As

Publication number Publication date
CN117152851A (en) 2023-12-01

Similar Documents

Publication Publication Date Title
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
CN110796057A (en) Pedestrian re-identification method and device and computer equipment
CN111259786A (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN112446342B (en) Key frame recognition model training method, recognition method and device
WO2021243947A1 (en) Object re-identification method and apparatus, and terminal and storage medium
CN112861976B (en) Sensitive image identification method based on twin graph convolution hash network
CN113642482B (en) Video character relation analysis method based on video space-time context
CN111460824A (en) Unmarked named entity identification method based on anti-migration learning
CN114896434B (en) Hash code generation method and device based on center similarity learning
CN111695531B (en) Cross-domain pedestrian re-identification method based on heterogeneous convolution network
CN114547249A (en) Vehicle retrieval method based on natural language and visual features
CN111291705B (en) Pedestrian re-identification method crossing multiple target domains
CN111462173B (en) Visual tracking method based on twin network discrimination feature learning
CN116682144A (en) Multi-modal pedestrian re-recognition method based on multi-level cross-modal difference reconciliation
CN107220597B (en) Key frame selection method based on local features and bag-of-words model human body action recognition process
CN113297936A (en) Volleyball group behavior identification method based on local graph convolution network
CN111259197B (en) Video description generation method based on pre-coding semantic features
CN110674265B (en) Unstructured information oriented feature discrimination and information recommendation system
CN117152851B (en) Face and human body collaborative clustering method based on large model pre-training
CN115797642A (en) Self-adaptive image semantic segmentation algorithm based on consistency regularization and semi-supervision field
CN115100694A (en) Fingerprint quick retrieval method based on self-supervision neural network
Zhang The literature review of action recognition in traffic context
CN114155403A (en) Image segmentation Hash sorting method based on deep learning
CN113920170A (en) Pedestrian trajectory prediction method and system combining scene context and pedestrian social relationship and storage medium
CN112487927A (en) Indoor scene recognition implementation method and system based on object associated attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant