CN117152851B - Face and human body collaborative clustering method based on large model pre-training - Google Patents

Face and human body collaborative clustering method based on large model pre-training

Info

Publication number
CN117152851B
CN117152851B (application CN202311303822.2A)
Authority
CN
China
Prior art keywords
training
model
face
tag
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311303822.2A
Other languages
Chinese (zh)
Other versions
CN117152851A (en)
Inventor
温峻峰
李鑫
罗海涛
林群雄
孙全忠
洪小龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Tianwang Guangdong Technology Co ltd
Original Assignee
Zhongke Tianwang Guangdong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Tianwang Guangdong Technology Co ltd filed Critical Zhongke Tianwang Guangdong Technology Co ltd
Priority to CN202311303822.2A priority Critical patent/CN117152851B/en
Publication of CN117152851A publication Critical patent/CN117152851A/en
Application granted granted Critical
Publication of CN117152851B publication Critical patent/CN117152851B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70Multimodal biometrics, e.g. combining information from different biometric modalities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training
    • G06V40/25Recognition of walking or running movements, e.g. gait recognition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention provides a face and human body collaborative clustering method based on large model pre-training, comprising the following steps: acquiring original video data of pedestrians in various scenes to construct a dataset, and preprocessing the dataset; designing a pre-training model to pre-train on the preprocessed dataset; using two feature extraction networks to extract features from the face picture and the pedestrian picture respectively, fusing the two output feature vectors, and computing the loss functions, wherein the face features, the pedestrian features and the fused features are used for joint optimization over multiple loss functions; performing semi-supervised training on the model to further obtain and update the model parameters; and inputting face and human-body pictures into the model, invoking the model parameters obtained in the previous step, and comparing features through the network model to obtain the pedestrian retrieval result. The network framework of the invention realizes dual-channel input of faces and human bodies, fully combines the advantages of face recognition and pedestrian re-identification, and compensates for the shortcomings of either single technique.

Description

Face and human body collaborative clustering method based on large model pre-training
Technical Field
The invention relates to the technical field of computer vision, in particular to a face and human body collaborative clustering method based on large model pre-training.
Background
Today, as urban security becomes increasingly comprehensive, surveillance video is widely used throughout society. Identifying and locating specific pedestrians is of great significance to security problems, especially criminal investigation and search-and-rescue tasks. The most widely applied and mature technology for retrieving specific pedestrians is face recognition; however, owing to camera limitations such as resolution and shooting angle, high-quality face images are often difficult to obtain, making face recognition difficult to carry out. By comparing the advantages and disadvantages of face recognition and human-body recognition, it can be found that using the two methods simultaneously makes their information complementary; face and human body collaborative clustering thus effectively solves the pedestrian retrieval problem in most scenes.
For face recognition and pedestrian re-identification, especially the latter, most existing methods use a pre-trained model to accelerate training and improve results. Because manual labeling is costly, large-scale pedestrian datasets have long been lacking, so most methods can only use models pre-trained on the manually labeled ImageNet dataset. This brings only limited improvement, owing to the huge domain gap between the generic images in ImageNet and the person-centric images required by human-body recognition tasks. In view of this problem, research methods based partly on unsupervised pre-training have emerged, but they are still constrained by dataset size and the improvement is not significant.
Disclosure of Invention
The invention aims to provide a face and human body collaborative clustering method based on large model pre-training which, by designing a suitable pre-training model and retrieval strategy on a deep learning network framework, combines the advantages of face recognition and pedestrian re-identification, thereby improving pedestrian retrieval accuracy in complex scenes.
In order to achieve the above purpose, the invention provides a face and human body collaborative clustering method based on large model pre-training, comprising the following steps:
step 1, acquiring original video data of pedestrians in various scenes to construct a data set, and preprocessing the data set;
step 2, designing a pre-training model to pre-train the preprocessed data set;
step 3, extracting features from the face picture and the pedestrian picture respectively using two ResNet50 backbone networks, fusing the two output feature vectors, and computing the loss functions, wherein the face features, the pedestrian features and the fused features are used for joint optimization over multiple loss functions;
step 4, performing semi-supervised training on the model, and further acquiring and updating model parameters;
and step 5, inputting the face and human-body pictures into the model, invoking the model parameters obtained in step 4, and comparing features through the network model to obtain the pedestrian retrieval result.
Further, the collecting the original video data of pedestrians in various scenes to construct a data set comprises:
using the spatial and temporal correlation in the original video as weak supervision, tracking each person's trajectory over time with any multi-object tracking algorithm, generating a unique pseudo Re-ID tag from each trajectory, and thereby creating a new ultra-large-scale dataset with noisy tags;
the preprocessing of the data set comprises:
step 1.1, first deleting person images whose number of occurrence frames is less than 200;
step 1.2, within the trajectory of each identity, sampling 1 frame out of every 20 to reduce the number of duplicate images.
Further, the designing the pre-training model to pre-train the preprocessed data set includes:
step 2.1, initializing parameters: all data samples in the dataset constructed in the pre-training stage are denoted D = {(x_i, y_i)}_{i=1}^n, where n is the size of the dataset, x_i is a person image, and y_i ∈ {1, …, K} is its associated identity tag, K being the number of all identities recorded in the dataset;
an input person image x_i undergoes two randomly selected augmentations (T, T′), yielding two augmented images x_i^q = T(x_i) and x_i^k = T′(x_i); x_i^q is fed into the encoder E_q to obtain the query feature q_i, and x_i^k is fed into another encoder E_k to obtain the key feature k_i; E_k is the momentum version of E_q: the weights in E_k are an exponential moving average of the weights in E_q, and E_k is refreshed by momentum updates of the weights from E_q;
step 2.2, supervised classification: let ŷ_i be the corrected tag of image x_i; a classifier is added that converts the query feature q_i from encoder E_q into a probability p_i over the K categories, i.e. p_i ∈ R^K, and the classification cross-entropy loss L_cls(p_i, ŷ_i) is applied;
step 2.3, performing label correction through prototypes: the prototypes {c_1, c_2, …, c_K} are maintained as a feature vector dictionary, where c_k ∈ R^d is the prototype representing the centroid of the features of class k; in each training step, the similarity score s_i^k between the query feature q_i and each current prototype c_k is first computed as
s_i^k = exp(q_i · c_k / τ) / Σ_{k′=1}^{K} exp(q_i · c_{k′} / τ)
where τ is a hyper-parameter representing temperature, K represents the total number of prototypes and k indexes the k-th prototype in the formula; let p_i be the classification probability given by the classifier whose weights were updated in the previous step;
then the corrected tag ŷ_i of this step is generated by combining the prototype score s_i and the classification probability p_i:
a soft pseudo tag l_i is computed and converted into the corrected tag ŷ_i according to the threshold T,
ŷ_i = argmax_j l_i^j if max_j l_i^j > T, otherwise ŷ_i = y_i,
where argmax returns the category index j at which l_i attains its maximum; that is, if the highest score in l_i is greater than T, the corresponding category is selected as ŷ_i, otherwise the original tag y_i is retained;
step 2.4, prototype-based contrastive learning: the new corrected tags ŷ_i update the supervised classification cross-entropy loss L_cls; at the same time a prototype-based contrastive loss L_pro is proposed, which constrains the feature of each sample to stay close to the prototype it belongs to;
all prototypes are maintained as a dictionary and updated step by step with a momentum mechanism, expressed as follows:
c_{ŷ_i} ← m · c_{ŷ_i} + (1 − m) · q_i
where c_{ŷ_i} denotes the entry of the feature vector dictionary indexed by the corrected tag ŷ_i, q_i is the query feature from E_q, and m represents the momentum of the prototype;
step 2.5, contrastive learning under label guidance: self-supervised learning uses instance-level contrastive learning, where the instance-level contrastive loss function L_ins is as follows:
L_ins = −log [ exp(q_i · k_i^+ / τ) / ( exp(q_i · k_i^+ / τ) + Σ_{j=1}^{M} exp(q_i · k_j^− / τ) ) ]
where q_i represents the query feature; τ is the hyper-parameter representing temperature; k_i^+ is the positive sample generated by the momentum encoder E_k — its key property is that it shares the same instance as q_i, and k_i^+ equals k_i; k_j^− denotes the remaining features stored in the queue, which serve as negative samples; M represents the size of the queue and j traverses it from 1 to M; at the end of each training step, the queue is updated by enqueuing the new key features and dequeuing the oldest features; on top of this, a label-guided contrastive learning module is used, employing the corrected tags ŷ to ensure more reasonable contrastive grouping;
step 2.6, redesigning the queue: the corrected tag ŷ_i is recorded, and the new queue receives not only the key feature k_i but also the corrected tag ŷ_i; the model is pre-trained on the dataset using the following loss function:
L = L_cls + λ_pro · L_pro + λ_lgc · L_lgc
where L_cls denotes the cross-entropy loss function, L_pro and λ_pro denote the prototype-based contrastive loss function and its weight, and L_lgc and λ_lgc denote the label-guided contrastive loss function and its weight.
Further, the prototype-based contrastive loss L_pro is expressed as follows:
L_pro = −log [ exp(q_i · c_{ŷ_i} / τ) / Σ_{j=1}^{K} exp(q_i · c_j / τ) ]
where c_j denotes the j-th prototype vector.
Further, step 2.6 further includes: the key feature k_i and the corrected tag ŷ_i are used to distinguish positive and negative pairs; let K_i^+ be the new positive feature set and K_i^− the new negative feature set, where the features in K_i^+ carry the same corrected tag as the current instance i and the features in K_i^− do not:
K_i^+ = { k_j ∈ queue | ŷ_j = ŷ_i } ∪ { k_i } and K_i^− = { k_j ∈ queue | ŷ_j ≠ ŷ_i }
where k_i and ŷ_i are the key feature and corrected tag of the current instance i.
Further, the feature extraction network is a ResNet50 backbone; the global average pooling layer of the original ResNet50 backbone is removed and an adaptive global average pooling layer is appended at the end. The ResNet50 backbones adopt a deep mutual learning loss with a Wasserstein distance metric to constrain the model to learn the features common to the feature map extracted from the human-body image and the feature map extracted from the face image; the deep mutual learning loss function L_dml is as follows:
L_dml = W_distance(v_f, v_b)
where W_distance is the Wasserstein distance metric, v_f is the face feature output by the ResNet50 backbone, and v_b is the human-body feature output by the ResNet50 backbone.
Further, so that the output features contain only the common information needed to distinguish different pedestrians while redundant information is removed, a triplet loss function and a cross-entropy loss function are adopted to constrain the pedestrian features and the pedestrian identity information respectively, where the triplet loss function L_tri is as follows:
L_tri = (d_{a,p} − d_{a,n} + α)_+
where d_{a,p} is the distance between a positive sample pair, d_{a,n} is the distance between a negative sample pair, α is a manually set margin, and (z)_+ denotes max(z, 0);
the cross-entropy loss function L_id is as follows:
L_id = E[−log p(y_i | z_i)]
where y_i is the identity tag of the i-th input image, z_i is the predicted class vector of the i-th input image, and p(y_i | z_i) is the predicted probability that z_i belongs to identity tag y_i; the final overall loss function L_total is as follows:
L_total = λ_1 L_dml + λ_2 L_tri + λ_3 L_id
where λ_1, λ_2, λ_3 are weight values that balance the effect of the different losses during training.
Further, the specific steps of step 4 include:
step 4.1, dividing the data into a training group and a control group at a labeled-to-unlabeled ratio of 1:9, first training with the labeled data of the training group, and preliminarily obtaining and updating the model parameters;
step 4.2, inputting the unlabeled data of the control group, and clustering the unlabeled target-domain image features with a clustering algorithm to generate pseudo tags;
step 4.3, merging the pseudo-tag data and the tagged data into a new, expanded dataset, training the model of step 3 with this dataset, and computing the losses of the tagged data and the pseudo-tag data separately;
step 4.4, obtaining new model parameters from training, substituting the new parameters into the network, and repeating from step 4.3 until the total loss value computed by the loss function no longer changes between training cycles.
Further, the total loss over the tagged data and the pseudo-tag data is expressed as:
L_semi = (1/n) Σ_{i=1}^{n} Σ_{c=1}^{C} L(y_i^c, f_i^c) + α(t) · (1/n′) Σ_{i=1}^{n′} Σ_{c=1}^{C} L(y′_i^c, f′_i^c)
where y_i^c and f_i^c are the tag and prediction of the tagged data, y′_i^c and f′_i^c are the pseudo tag and prediction of the pseudo-tag data, n is the number of tagged samples in each mini-batch, n′ is the number of pseudo-tag samples in each mini-batch, C is the number of categories, and α(t) is the weight (penalty coefficient) of the pseudo-tag loss.
Further, α(t) is a weight parameter that gradually increases with the number of iterations; that is, as the number of model training rounds grows, the weight of the loss on the unlabeled data increases, whereby the model parameters are further obtained.
The beneficial technical effects of the invention are at least as follows:
1. The large-scale pre-training method can construct a large-scale noisy unlabeled dataset and train on it effectively with a pre-training framework that combines several methods, so that the network backbone learns preliminary pedestrian features, the large gap between the source and target datasets is reduced, and the commonly used ImageNet-pre-trained network model can be satisfactorily replaced.
2. The network framework of the invention realizes dual-channel input of the face and the human body, extracts and fuses the face and human-body features separately, fully combines the advantages of face recognition and pedestrian re-identification, compensates for the shortcomings of a single technology, and effectively improves pedestrian retrieval accuracy and robustness in complex scenes.
3. The training stage of the invention uses a clustering method to generate pseudo tags and can perform large-scale semi-supervised training with only a small amount of labeled data, greatly reducing the manual labeling cost while effectively improving the performance of the network model.
Drawings
The invention will be further described with reference to the accompanying drawings; the embodiments shown therein do not constitute any limitation of the invention, and one of ordinary skill in the art can obtain other drawings from the following drawings without inventive effort.
FIG. 1 is a flow chart of a face and human body collaborative clustering method based on large model pre-training.
FIG. 2 is a block diagram of a pre-training framework according to an embodiment of the present invention;
FIG. 3 is a flow chart of semi-supervised training in accordance with an embodiment of the present invention;
fig. 4 is a general structure diagram of a face and human body collaborative search method according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
In one or more embodiments, as shown in fig. 1, a face and human body collaborative clustering method based on large model pre-training is disclosed, the method comprising:
step 1, acquiring original video data of pedestrians in various scenes to construct a data set, and preprocessing the data set;
step 2, designing a pre-training model to pre-train the preprocessed data set;
step 3, using two feature extraction networks to extract features from the face picture and the pedestrian picture respectively, fusing the two output feature vectors, and computing the loss functions, wherein the face features, the pedestrian features and the fused features are used for joint optimization over multiple loss functions;
step 4, performing semi-supervised training on the model, and further acquiring and updating model parameters;
and step 5, inputting the face and human-body pictures into the model, invoking the model parameters obtained in step 4, and comparing features through the network model to obtain the pedestrian retrieval result.
Specifically, the specific steps of the step 1 are as follows:
the spatial and temporal correlation in the original video is used as weak supervision, and is realized by any multi-objective tracking algorithm to track a person's track in time, a unique pseudo Re-ID label is generated from the track, and a new ultra-large scale data set with noise labels is created.
Each person in the video is tracked frame by frame, and the dataset is preprocessed with the following steps:
step 1.1, first deleting person images whose number of occurrence frames is less than 200;
step 1.2, within the trajectory of each identity, sampling 1 frame out of every 20 to reduce the number of duplicate images.
For the preprocessed dataset, it can be ensured that at least 10 images are associated with each identity. This results in a noisy large-scale dataset in which, besides correctly annotated identities, there are two types of annotation error: one noise is the same person being marked with different identities, and the other is different persons being marked with the same identity.
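By way of illustration only, the following minimal Python sketch shows how such a noisy pseudo-labelled dataset could be assembled from tracker output; the `detections` format and all function and field names are assumptions introduced here, not part of the claimed method, while the 200-frame filter and the 1-sample-per-20-frames rule follow the two preprocessing steps above.

```python
from collections import defaultdict

def build_noisy_dataset(detections, min_frames=200, sample_every=20):
    """Build a pseudo-labelled dataset from tracked detections.

    `detections` is assumed to be an iterable of (track_id, frame_idx, crop)
    triples produced by any multi-object tracker; each track_id serves as a
    (noisy) pseudo Re-ID tag.
    """
    tracks = defaultdict(list)
    for track_id, frame_idx, crop in detections:
        tracks[track_id].append((frame_idx, crop))

    dataset = []
    for track_id, frames in tracks.items():
        if len(frames) < min_frames:          # step 1.1: drop short tracks
            continue
        frames.sort(key=lambda f: f[0])
        # step 1.2: keep 1 sample out of every 20 frames to reduce duplicates
        for frame_idx, crop in frames[::sample_every]:
            dataset.append((crop, track_id))  # (image, pseudo identity tag)
    return dataset
```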
In step 2, a new pre-training framework that uses the noisy tags is designed on the basis of convolutional neural networks in deep learning. Fig. 2 shows the overall structural design of the model, which includes a simple supervised learning module, a prototype-based contrastive learning module and a label-guided contrastive learning module; the framework combines supervised learning, prototype-based contrastive learning, label-guided contrastive learning and noisy-label correction. The pre-training model is designed to pre-train on the preprocessed dataset; the specific steps include:
step 2.1, initializing parameters: all data samples in the dataset constructed in the pre-training stage are denoted D = {(x_i, y_i)}_{i=1}^n, where n is the size of the dataset, x_i is a person image, and y_i ∈ {1, …, K} is its associated identity tag, K being the number of all identities recorded in the dataset;
The pre-training framework employs a twin network for contrastive representation learning. An input person image x_i undergoes two randomly selected augmentations (T, T′), yielding two augmented images x_i^q = T(x_i) and x_i^k = T′(x_i); x_i^q is fed into the encoder E_q to obtain the query feature q_i, and x_i^k is fed into another encoder E_k to obtain the key feature k_i; E_k is the momentum version of E_q: the weights in E_k are an exponential moving average of the weights in E_q, and E_k is refreshed by momentum updates of the weights from E_q;
step 2.2, supervised classification: let ŷ_i be the corrected tag of image x_i; a classifier is added that converts the query feature q_i from encoder E_q into a probability p_i over the K categories, i.e. p_i ∈ R^K, and the classification cross-entropy loss L_cls(p_i, ŷ_i) is applied.
Obtaining ŷ_i is not straightforward. Prototypes — the moving centroids of the average features over the training instances — are employed here to accomplish this task.
Step 2.3, performing label correction through prototypes: the prototypes are maintained as a feature vector dictionary {c_1, c_2, …, c_K}, where c_k ∈ R^d is the prototype representing the centroid of the features of class k; in each training step, the similarity score s_i^k between the query feature q_i and each current prototype c_k is first computed as
s_i^k = exp(q_i · c_k / τ) / Σ_{k′=1}^{K} exp(q_i · c_{k′} / τ)
where τ is a hyper-parameter indicating temperature and K is the total number of prototypes; let p_i be the classification probability given by the classifier whose weights were updated in the previous step;
then the corrected tag ŷ_i of this step is generated by combining the prototype score s_i and the classification probability p_i:
a soft pseudo tag l_i is computed and converted into the corrected tag ŷ_i according to the threshold T; if the highest score in l_i is greater than T, the corresponding category is selected as ŷ_i, otherwise the original tag y_i is retained.
Step 2.4, prototype-based contrastive learning: the new corrected tags ŷ_i update the supervised classification cross-entropy loss L_cls; at the same time a prototype-based contrastive loss L_pro is proposed, which constrains the feature of each sample to stay close to the prototype it belongs to.
All prototypes are maintained as a dictionary and updated step by step with a momentum mechanism, expressed as follows:
c_{ŷ_i} ← m · c_{ŷ_i} + (1 − m) · q_i
where c_{ŷ_i} denotes the entry of the feature vector dictionary indexed by the corrected tag ŷ_i, q_i is the query feature from E_q, and m represents the momentum of the prototype.
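A sketch of the prototype dictionary update and of L_pro under the same assumptions (normalised features; cross-entropy over prototype logits, which equals the −log-softmax form of L_pro given further below):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_prototypes(prototypes, q, y_hat, m=0.999):
    """Momentum update c_{y_hat} <- m*c_{y_hat} + (1-m)*q for each sample."""
    for feat, label in zip(q, y_hat):
        prototypes[label].mul_(m).add_(feat, alpha=1.0 - m)
        prototypes[label] = F.normalize(prototypes[label], dim=0)

def prototype_loss(q, prototypes, y_hat, tau=0.1):
    """L_pro: pull each query feature towards its corrected-tag prototype."""
    logits = q @ prototypes.t() / tau                # (B, K)
    return F.cross_entropy(logits, y_hat)
```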
Step 2.5, contrastive learning under label guidance: self-supervised learning uses instance-level contrastive learning, where the instance-level contrastive loss function L_ins is as follows:
L_ins = −log [ exp(q_i · k_i^+ / τ) / ( exp(q_i · k_i^+ / τ) + Σ_{j=1}^{M} exp(q_i · k_j^− / τ) ) ]
where q_i represents the query feature; τ is the hyper-parameter representing temperature; k_i^+ is the positive sample generated by the momentum encoder E_k — its key property is that it shares the same instance as q_i, and k_i^+ equals k_i; k_j^− denotes the remaining features stored in the queue, which serve as negative samples; M represents the size of the queue and j traverses it from 1 to M. At the end of each training step, the queue is updated by enqueuing the new key features and dequeuing the oldest features.
However, this instance-based contrastive learning is far from perfect, because it ignores the relationships between different instances: even when two instances depict the same person, it still enlarges the gap between them. Therefore, a label-guided contrastive learning module is used here, utilizing the corrected tags ŷ to ensure more reasonable contrastive grouping.
Step 2.6, redesigning the queue: the corrected tag ŷ_i is recorded, and the new queue receives not only the key feature k_i but also the corrected tag ŷ_i; the model is pre-trained on the dataset using the following loss function:
L = L_cls + λ_pro · L_pro + λ_lgc · L_lgc
where L_cls denotes the classification cross-entropy loss, L_pro denotes the prototype-based contrastive loss function, and L_lgc denotes the label-guided contrastive loss function.
In particular, the prototype-based contrastive loss L_pro is expressed as follows:
L_pro = −log [ exp(q_i · c_{ŷ_i} / τ) / Σ_{j=1}^{K} exp(q_i · c_j / τ) ]
where c_j denotes the j-th prototype vector.
Specifically, the key feature k_i and the corrected tag ŷ_i are used to distinguish positive and negative pairs; let K_i^+ be the new positive feature set and K_i^− the new negative feature set, where the features in K_i^+ carry the same corrected tag as the current instance i and the features in K_i^− do not:
K_i^+ = { k_j ∈ queue | ŷ_j = ŷ_i } ∪ { k_i } and K_i^− = { k_j ∈ queue | ŷ_j ≠ ŷ_i }
where k_i and ŷ_i are the key feature and corrected tag of the current instance i.
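The embodiment defines the positive and negative sets K_i^+ and K_i^− but does not reproduce a closed form for L_lgc; the sketch below uses the standard supervised-contrastive form over the label-carrying queue and should be read as an assumed instantiation, not the claimed formula.

```python
import torch
import torch.nn.functional as F

def label_guided_loss(q, y_hat, queue_feats, queue_tags, tau=0.1):
    """Contrast each query against queued keys, grouped by corrected tags.

    q:           (B, d) query features, L2-normalised
    y_hat:       (B,)   corrected tags of the batch
    queue_feats: (M, d) key features stored in the redesigned queue
    queue_tags:  (M,)   corrected tags stored alongside the keys
    """
    logits = q @ queue_feats.t() / tau                      # (B, M)
    pos_mask = (y_hat.unsqueeze(1) == queue_tags.unsqueeze(0)).float()
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # average the log-probability over the positive set K_i^+ of each query
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    return -(pos_mask * log_prob).sum(dim=1).div(pos_count).mean()
```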
During training, λ_pro and λ_lgc are both set to 1, the hyper-parameter τ is set to 0.1, and T is set to 0.8. The momentum m used to update the momentum encoder E_k and the prototypes is set to 0.999.
In step 3, two ResNet50 networks are used to extract features from the face picture and the pedestrian picture respectively, and the two feature vectors output by the networks are fused, so that the face features, the pedestrian features and the fused features are used for joint optimization over multiple loss functions. The feature extraction network takes ResNet50 as its backbone; the global average pooling layer of the original ResNet50 backbone is removed, and an adaptive global average pooling layer is appended at the end.
One learning goal of the ResNet50 backbones in the model is to make the distributions of the output face features and human-body features as similar as possible, so the model is constrained to learn the common features of the two feature maps by a deep mutual learning loss with a Wasserstein distance metric; the deep mutual learning loss function is as follows:
L_dml = W_distance(v_f, v_b)
where W_distance is the Wasserstein distance metric, v_f is the face feature output by ResNet50, and v_b is the human-body feature output by ResNet50.
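The embodiment does not specify how the Wasserstein distance between v_f and v_b is computed; the sketch below uses the inexpensive one-dimensional Wasserstein distance between sorted feature values as an assumed stand-in.

```python
import torch

def w_distance(v_f, v_b):
    """1-D Wasserstein distance between face and body feature distributions.

    Treats each feature vector's d values as samples of a 1-D distribution;
    for equal-length sorted samples, W1 is the mean absolute difference.
    """
    vf_sorted, _ = torch.sort(v_f, dim=-1)
    vb_sorted, _ = torch.sort(v_b, dim=-1)
    return (vf_sorted - vb_sorted).abs().mean()

def dml_loss(v_f, v_b):
    """Deep mutual learning loss L_dml = W_distance(v_f, v_b)."""
    return w_distance(v_f, v_b)
```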
in order to distinguish different pedestrians as far as possible and remove redundant information by enabling the output characteristics to only contain all common information as far as possible, the pedestrian characteristics and the pedestrian identity information are respectively restrained by adopting a triplet loss function and a cross entropy loss function, wherein the triplet loss function is as follows:
L tri =(d a,p -d a,n +α) +
wherein d a,p Distance d between positive sample pair a,n For the distance between negative samples, α is a artificially set threshold, (z) + Represents max (z, 0);
the cross entropy loss function is as follows:
L id =E[-log(p(y i |x i ))]
wherein y is i For the true class of the ith input image, x i For the predictive class vector of the ith input image, p (y i |x i ) To calculate x i Belonging to category y i Is used for the prediction probability of (1). Final total loss functionThe following are provided:
L total =λ 1 L dml2 L tri3 L id
wherein different lambda i The weight values are used to balance the effect of the different losses in the training process.
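An illustrative composition of the three losses, reusing `w_distance`/`dml_loss` from the sketch above; the margin and the λ weights are assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3):
    """L_tri = (d_{a,p} - d_{a,n} + alpha)_+ ; the margin value is assumed."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.relu(d_ap - d_an + margin).mean()

def total_loss(v_f, v_b, logits, labels, triplets, lambdas=(1.0, 1.0, 1.0)):
    """L_total = l1*L_dml + l2*L_tri + l3*L_id (weights are assumptions)."""
    l_dml = dml_loss(v_f, v_b)                  # from the earlier sketch
    l_tri = triplet_loss(*triplets)
    l_id = F.cross_entropy(logits, labels)      # L_id = E[-log p(y|z)]
    l1, l2, l3 = lambdas
    return l1 * l_dml + l2 * l_tri + l3 * l_id
```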
As shown in fig. 3, in step 4, specific steps include:
step 4.1, dividing the data into a training group and a control group at a labeled-to-unlabeled ratio of 1:9, first training with the labeled data of the training group, and preliminarily obtaining and updating the model parameters;
step 4.2, inputting the unlabeled data of the control group, and clustering the unlabeled target-domain image features with a clustering algorithm to generate pseudo tags (an illustrative clustering sketch follows this list);
step 4.3, merging the pseudo-tag data and the tagged data into a new, expanded dataset, performing semi-supervised training on the model of step 3 with this dataset, and computing the losses of the tagged data and the pseudo-tag data separately;
step 4.4, obtaining new model parameters from training, substituting the new parameters into the network, and repeating from step 4.3 until the total loss value computed by the loss function no longer changes between training cycles.
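The clustering algorithm of step 4.2 is not named in this embodiment; DBSCAN, a common choice for re-identification pseudo-labelling, is used below purely as an assumption.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def generate_pseudo_labels(features, eps=0.5, min_samples=4):
    """Cluster unlabeled target-domain features into pseudo identity tags.

    features: (N, d) numpy array of L2-normalised image features.
    Returns an (N,) array of tags; -1 marks outliers DBSCAN left unassigned.
    """
    return DBSCAN(eps=eps, min_samples=min_samples,
                  metric="cosine").fit_predict(features)
```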
Specifically, the total loss over the tagged data and the pseudo-tag data is expressed as:
L_semi = (1/n) Σ_{i=1}^{n} Σ_{c=1}^{C} L(y_i^c, f_i^c) + α(t) · (1/n′) Σ_{i=1}^{n′} Σ_{c=1}^{C} L(y′_i^c, f′_i^c)
where y_i^c and f_i^c are the tag and prediction of the tagged data, y′_i^c and f′_i^c are the pseudo tag and prediction of the pseudo-tag data, n is the number of tagged samples in each mini-batch, n′ is the number of pseudo-tag samples in each mini-batch, C is the number of categories, and α(t) is the weight of the pseudo-tag loss. Training obtains new model parameters; the new parameters are substituted into the network and the procedure repeats from step 4.3 until the network performance no longer improves significantly. α(t) is a weight parameter that gradually increases with the number of iterations: as the model training rounds iterate, the weight of the loss on the unlabeled data gradually increases. The model parameters are further obtained by this method.
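A minimal sketch of the weighted semi-supervised loss with a ramp-up α(t); the linear schedule and its constants are assumptions, as the embodiment only states that α(t) grows with the iteration count.

```python
import torch.nn.functional as F

def alpha(t, t_ramp=80, alpha_max=3.0):
    """alpha(t): weight that grows with the training round t (schedule assumed)."""
    return alpha_max * min(t / t_ramp, 1.0)

def semi_supervised_loss(logits_l, labels_l, logits_u, pseudo_labels, t):
    """Labeled cross-entropy plus alpha(t)-weighted pseudo-label cross-entropy."""
    loss_l = F.cross_entropy(logits_l, labels_l)
    loss_u = F.cross_entropy(logits_u, pseudo_labels)
    return loss_l + alpha(t) * loss_u
```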
In step 5, fig. 4 shows the overall structure of the face and human body collaborative retrieval method of this embodiment. The face and human-body pictures of the same pedestrian are input into the model, the model parameters trained in step 4 are invoked, and the pedestrian retrieval result closest to the target image is obtained through computation and comparison by the network model.
Specifically, the network first extracts and stores features for all image pairs in the gallery. When a pedestrian needs to be retrieved, the face/human-body image pair of that pedestrian is input, the model extracts its feature vector, the previously computed and stored feature gallery is loaded, the similarity between the query feature and all gallery features is computed and compared, and the retrieval result is given according to similarity.
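An illustrative retrieval sketch under the assumption of a dual-input model returning a fused feature and a pre-extracted gallery feature matrix; all names are assumptions.

```python
import torch
import torch.nn.functional as F

def search_gallery(model, face_img, body_img, gallery_feats, gallery_ids, top_k=10):
    """Rank gallery identities by cosine similarity to the fused query feature.

    gallery_feats: (N, d) fused features pre-extracted for all gallery pairs
    gallery_ids:   list of N identity records
    """
    model.eval()
    with torch.no_grad():
        query = model(face_img, body_img)        # fused face+body feature, (1, d)
        query = F.normalize(query, dim=-1)
        sims = query @ F.normalize(gallery_feats, dim=-1).t()
        scores, idx = sims.squeeze(0).topk(top_k)
    return [(gallery_ids[i], s.item()) for i, s in zip(idx, scores)]
```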
The foregoing embodiments are each described with its own emphasis; for parts not described or detailed in one embodiment, reference may be made to the related descriptions of the other embodiments.
While embodiments of the invention have been shown and described, it will be understood by those skilled in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A face and human body collaborative clustering method based on large model pre-training, characterized by comprising the following steps:
step 1, acquiring original video data of pedestrians in various scenes to construct a data set, and preprocessing the data set;
step 2, designing a pre-training model to pre-train the preprocessed data set;
step 3, extracting features from the face picture and the pedestrian picture respectively using two ResNet50 backbone networks, fusing the two output feature vectors, and computing the loss functions, wherein the face features, the pedestrian features and the fused features are used for joint optimization over multiple loss functions;
step 4, performing semi-supervised training on the model, and further acquiring and updating model parameters;
step 5, inputting the face and human-body pictures into the model, invoking the model parameters obtained in step 4, and comparing features through the network model to obtain the pedestrian retrieval result;
the design pre-training model pre-trains the preprocessed data set, and comprises the following steps:
step 2.1, initializing parameters: all data samples in the dataset constructed in the pre-training stage are denoted D = {(x_i, y_i)}_{i=1}^n, where n is the size of the dataset, x_i is a person image, and y_i ∈ {1, …, K} is its associated identity tag, K being the number of all identities recorded in the dataset;
an input person image x_i is fed to the twin network and two randomly selected augmentations (T, T′) are performed, yielding two augmented images x_i^q = T(x_i) and x_i^k = T′(x_i); x_i^q is fed into the encoder E_q to obtain the query feature q_i, and x_i^k is fed into another encoder E_k to obtain the key feature k_i; E_k is the momentum version of E_q: the weights in E_k are an exponential moving average of the weights in E_q, and E_k is refreshed by momentum updates of the weights from E_q;
step 2.2, supervised classification: let ŷ_i be the corrected tag of image x_i; a classifier is added that converts the query feature q_i from encoder E_q into a probability p_i over the K categories, i.e. p_i ∈ R^K, and the classification cross-entropy loss L_cls(p_i, ŷ_i) is applied;
Step 2.3, performing label correction through a prototype: prototype { c 1 ,c 2 ,…,c k Maintained as a feature vector dictionary, c k ∈R d ,R d Is a prototype representing the centroid of the class feature, in each training step, first by computing the query feature qi with each prototype c at present k The similarity between the two results in a similarity score
Where τ is a super-parameter representing temperature, K represents the total of the prototypeThe number, k, represents the kth prototype in the formula, let p be i For the classification probability given by the classifier that updated the weights in the previous step,
then, the correction label of this stepBy giving prototype scores->And classification probability p i The combination of the two components is generated,
here, a soft pseudo tag l is calculated i And converts it into correction labels according to the threshold TWherein the result of argmax is to make the function +.>The j point set taking the maximum value, j represents the number of traversals from 0 to M, if the highest score in li +.>If the number is greater than T, selecting the corresponding category as +.>Otherwise the original label y will remain i
Step 2.4, model-based comparative learning: from new correction tagsCross entropy loss of update supervision class->At the same time propose prototype-based contrast loss->The features used to constrain each sample are the same as the belonging prototype,
all prototypes were maintained as a dictionary, with stepwise updates of the momentum mechanism, expressed as follows:
wherein,the representation is based on correction tag->Feature vector dictionary, q i Is from E q M represents the momentum of the prototype;
step 2.5, contrastive learning under label guidance: self-supervised learning uses instance-level contrastive learning, where the instance-level contrastive loss function L_ins is as follows:
L_ins = −log [ exp(q_i · k_i^+ / τ) / ( exp(q_i · k_i^+ / τ) + Σ_{j=1}^{M} exp(q_i · k_j^− / τ) ) ]
where q_i represents the query feature; τ is the hyper-parameter representing temperature; k_i^+ is the positive sample generated by the momentum encoder E_k — its key property is that it shares the same instance as q_i, and k_i^+ equals k_i; k_j^− denotes the remaining features stored in the queue, which serve as negative samples; M represents the size of the queue and j traverses it from 1 to M; at the end of each training step, the queue is updated by enqueuing the new key features and dequeuing the oldest features; a label-guided contrastive learning module is used, employing the corrected tags ŷ to ensure more reasonable contrastive grouping;
step 2.6, redesigning the queue: the corrected tag ŷ_i is recorded, and the new queue receives not only the key feature k_i but also the corrected tag ŷ_i; the model is pre-trained on the dataset using the following loss function:
L = L_cls + λ_pro · L_pro + λ_lgc · L_lgc
where L_cls denotes the cross-entropy loss function, L_pro and λ_pro denote the prototype-based contrastive loss function and its weight, and L_lgc and λ_lgc denote the label-guided contrastive loss function and its weight.
2. The face and human body collaborative clustering method based on large model pre-training according to claim 1, wherein the acquiring original video data of pedestrians in various scenes to construct a dataset comprises:
using the spatial and temporal correlation in the original video as weak supervision information, tracking each person's trajectory over time with any multi-object tracking algorithm, generating a unique pseudo Re-ID tag from each trajectory, and creating a new ultra-large-scale dataset with noisy tags;
the preprocessing of the dataset comprises:
step 1.1, first deleting person images whose number of occurrence frames is less than 200;
step 1.2, within the trajectory of each identity, sampling 1 frame out of every 20 to reduce the number of duplicate images.
3. The face and human body collaborative clustering method based on large model pre-training according to claim 1, wherein the prototype-based contrastive loss L_pro is expressed as follows:
L_pro = −log [ exp(q_i · c_{ŷ_i} / τ) / Σ_{j=1}^{K} exp(q_i · c_j / τ) ]
where c_j denotes the j-th prototype vector.
4. The face and human body collaborative clustering method based on large model pre-training according to claim 1, wherein step 2.6 further comprises: the key feature k_i and the corrected tag ŷ_i are used to distinguish positive and negative pairs; let K_i^+ be the new positive feature set and K_i^− the new negative feature set, where the features in K_i^+ carry the same corrected tag as the current instance i and the features in K_i^− do not:
K_i^+ = { k_j ∈ queue | ŷ_j = ŷ_i } ∪ { k_i } and K_i^− = { k_j ∈ queue | ŷ_j ≠ ŷ_i }
where k_i and ŷ_i are the key feature and corrected tag of the current instance i.
5. The face and human body collaborative clustering method based on large model pre-training according to claim 1, wherein the ResNet50 backbone network deletes the global average pooling layer of the original ResNet50 backbone and appends an adaptive global average pooling layer at the end; the ResNet50 backbones adopt a deep mutual learning loss with a Wasserstein distance metric to constrain the model to learn the features common to the feature map extracted from the human-body image and the feature map extracted from the face image, with the deep mutual learning loss function L_dml as follows:
L_dml = W_distance(v_f, v_b)
where W_distance is the Wasserstein distance metric, v_f is the face feature output by the ResNet50 backbone, and v_b is the human-body feature output by the ResNet50 backbone.
6. The face and human body collaborative clustering method based on large model pre-training according to claim 5, wherein, so that the output features contain only the common information needed to distinguish different pedestrians while redundant information is removed, a triplet loss function and a cross-entropy loss function are adopted to constrain the pedestrian features and the pedestrian identity information respectively, where the triplet loss function L_tri is as follows:
L_tri = (d_{a,p} − d_{a,n} + α)_+
where d_{a,p} is the distance between a positive sample pair, d_{a,n} is the distance between a negative sample pair, α is a manually set margin, and (z)_+ denotes max(z, 0);
the cross-entropy loss function L_id is as follows:
L_id = E[−log p(y_i | z_i)]
where y_i is the identity tag of the i-th input image, z_i is the predicted class vector of the i-th input image, and p(y_i | z_i) is the predicted probability that z_i belongs to identity tag y_i; the final overall loss function L_total is as follows:
L_total = λ_1 L_dml + λ_2 L_tri + λ_3 L_id
where λ_1, λ_2, λ_3 are weight values that balance the effect of the different losses during training.
7. The face and human body collaborative clustering method based on large model pre-training according to claim 1, wherein the specific steps of step 4 include:
step 4.1, dividing the data into a training group and a control group at a labeled-to-unlabeled ratio of 1:9, first training with the labeled data of the training group, and preliminarily obtaining and updating the model parameters;
step 4.2, inputting the unlabeled data of the control group, and clustering the unlabeled target-domain image features with a clustering algorithm to generate pseudo tags;
step 4.3, merging the pseudo-tag data and the tagged data into a new, expanded dataset, training the model of step 3 with this dataset, and computing the losses of the tagged data and the pseudo-tag data separately;
step 4.4, obtaining new model parameters from training, substituting the new parameters into the network, and repeating from step 4.3 until the total loss value computed by the loss function no longer changes between training cycles.
8. The face and human body collaborative clustering method based on large model pre-training according to claim 7, wherein the total loss over the tagged data and the pseudo-tag data is expressed as:
L_semi = (1/n) Σ_{i=1}^{n} Σ_{c=1}^{C} L(y_i^c, f_i^c) + α(t) · (1/n′) Σ_{i=1}^{n′} Σ_{c=1}^{C} L(y′_i^c, f′_i^c)
where y_i^c and f_i^c are the tag and prediction of the tagged data, y′_i^c and f′_i^c are the pseudo tag and prediction of the pseudo-tag data, n is the number of tagged samples in each mini-batch, n′ is the number of pseudo-tag samples in each mini-batch, C is the number of categories, and α(t) is the weight (penalty coefficient) of the pseudo-tag loss.
9. The face and human body collaborative clustering method based on large model pre-training according to claim 1, wherein α(t) is a weight parameter that gradually increases with the number of iterations; that is, as the number of model training rounds grows, the weight of the loss on the unlabeled data increases, whereby the model parameters are further obtained.
CN202311303822.2A 2023-10-09 2023-10-09 Face and human body collaborative clustering method based on large model pre-training Active CN117152851B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311303822.2A CN117152851B (en) 2023-10-09 2023-10-09 Face and human body collaborative clustering method based on large model pre-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311303822.2A CN117152851B (en) 2023-10-09 2023-10-09 Face and human body collaborative clustering method based on large model pre-training

Publications (2)

Publication Number Publication Date
CN117152851A (en) 2023-12-01
CN117152851B (en) 2024-03-08

Family

ID=88898952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311303822.2A Active CN117152851B (en) 2023-10-09 2023-10-09 Face and human body collaborative clustering method based on large model pre-training

Country Status (1)

Country Link
CN (1) CN117152851B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344787A (en) * 2018-10-15 2019-02-15 浙江工业大学 A kind of specific objective tracking identified again based on recognition of face and pedestrian
CN111488804A (en) * 2020-03-19 2020-08-04 山西大学 Labor insurance product wearing condition detection and identity identification method based on deep learning
CN114511878A (en) * 2022-01-05 2022-05-17 南京航空航天大学 Visible light infrared pedestrian re-identification method based on multi-modal relational polymerization
CN115205890A (en) * 2022-05-13 2022-10-18 南京博雅集智智能技术有限公司 Method and system for re-identifying pedestrians of non-motor vehicles

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220067506A1 (en) * 2020-08-28 2022-03-03 Salesforce.Com, Inc. Systems and methods for partially supervised learning with momentum prototypes


Also Published As

Publication number Publication date
CN117152851A (en) 2023-12-01

Similar Documents

Publication Publication Date Title
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
CN110796057A (en) Pedestrian re-identification method and device and computer equipment
CN111259786A (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN112446342B (en) Key frame recognition model training method, recognition method and device
WO2021243947A1 (en) Object re-identification method and apparatus, and terminal and storage medium
CN112861976B (en) Sensitive image identification method based on twin graph convolution hash network
CN113642482B (en) Video character relation analysis method based on video space-time context
CN111460824A (en) Unmarked named entity identification method based on anti-migration learning
CN114896434B (en) Hash code generation method and device based on center similarity learning
CN111695531B (en) Cross-domain pedestrian re-identification method based on heterogeneous convolution network
CN114547249A (en) Vehicle retrieval method based on natural language and visual features
CN111291705B (en) Pedestrian re-identification method crossing multiple target domains
CN111462173B (en) Visual tracking method based on twin network discrimination feature learning
CN116682144A (en) Multi-modal pedestrian re-recognition method based on multi-level cross-modal difference reconciliation
CN107220597B (en) Key frame selection method based on local features and bag-of-words model human body action recognition process
CN113297936A (en) Volleyball group behavior identification method based on local graph convolution network
CN111259197B (en) Video description generation method based on pre-coding semantic features
CN110674265B (en) Unstructured information oriented feature discrimination and information recommendation system
CN117152851B (en) Face and human body collaborative clustering method based on large model pre-training
CN115797642A (en) Self-adaptive image semantic segmentation algorithm based on consistency regularization and semi-supervision field
CN115100694A (en) Fingerprint quick retrieval method based on self-supervision neural network
Zhang The literature review of action recognition in traffic context
CN114155403A (en) Image segmentation Hash sorting method based on deep learning
CN113920170A (en) Pedestrian trajectory prediction method and system combining scene context and pedestrian social relationship and storage medium
CN112487927A (en) Indoor scene recognition implementation method and system based on object associated attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant