CN115601791B - Unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution


Info

Publication number
CN115601791B
Authority
CN
China
Prior art keywords
camera
network
domain
training
former
Prior art date
Legal status
Active
Application number
CN202211404730.9A
Other languages
Chinese (zh)
Other versions
CN115601791A (en)
Inventor
蒋敏
张千
孔军
陶雪峰
Current Assignee
Jiangnan University
Original Assignee
Jiangnan University
Priority date
Filing date
Publication date
Application filed by Jiangnan University
Priority to CN202211404730.9A
Publication of CN115601791A
Application granted
Publication of CN115601791B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/08: Learning methods
    • G06N3/088: Non-supervised learning, e.g. competitive learning
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74: Image or video pattern matching; proximity measures in feature spaces
    • G06V10/761: Proximity, similarity or dissimilarity measures
    • G06V10/762: Using clustering, e.g. of similar faces in social networks
    • G06V10/764: Using classification, e.g. of video objects
    • G06V10/82: Using neural networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Abstract

The invention relates to an unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution. The multi-branch network identification model Multiformer is built on a Transformer network and comprises single-camera-domain Intraformer networks and a multi-camera-domain Interformer network, and all the single-camera-domain Intraformer networks share backbone network parameters. This enhances generalization capability, alleviates to a certain extent the inter-domain differences caused by the background, illumination and the like of different camera domains, and improves the robustness of the model to noisy pseudo labels, thereby improving the accuracy of unsupervised pedestrian re-identification. Adaptive outlier sample re-distribution expands the number of pseudo labels, thereby enhancing the feature representation capability of the multi-branch network identification model Multiformer. When the model is trained, joint learning consisting of instance-level contrast learning and cluster-level contrast learning greatly improves clustering accuracy and alleviates the noisy pseudo label problem, effectively improving the accuracy and robustness of unsupervised pedestrian re-identification.

Description

Unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution
Technical Field
The invention relates to an unsupervised pedestrian re-identification method, in particular to an unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution.
Background
With extensive theoretical and practical research in computer vision, pedestrian re-identification has gradually become an important branch of the field; its purpose is to identify a target pedestrian across non-overlapping cameras. Pedestrian re-identification has a wide range of real-world applications, such as criminal searches, multi-camera tracking, missing-person searches, and the like.
At present, research on traditional pedestrian re-identification depends on large numbers of manually annotated images, which is inefficient and costly. Unsupervised pedestrian re-identification addresses this problem: the technique requires no additional annotation of pedestrian identities, and therefore has a much wider application space than traditional pedestrian re-identification.
Because of the diversity of objective environments and the subjective complexity of pedestrian behavior, unsupervised pedestrian re-identification still has many problems to be solved, mainly including: 1) the lack of true identity labels to supervise feature representation learning; without true identity labels, the model must first estimate pseudo identity labels for the training data. At present, similar images are mainly assigned the same labels through clustering or KNN search to generate pseudo labels for training, but if the estimated identities are incorrect, model learning is hindered. 2) Because of factors such as occlusion, differing viewing angles and background interference in pedestrian images, the estimated pseudo labels are noisy; the main task of a pedestrian re-identification model is to learn discriminative feature representations from different pedestrian images, so minimizing the influence of noisy pseudo labels while maximizing the discriminability of the model is also a major challenge of unsupervised pedestrian re-identification. 3) Pedestrian re-identification is in essence a multi-camera retrieval task; because of differences in background, viewing angle, lighting and the like among different cameras, how to fully learn pedestrian features that are invariant across cameras is also a problem to be solved.
In addition, traditional unsupervised pedestrian re-identification mainly adopts a CNN as the backbone network for feature extraction. A CNN can only process one local neighborhood at a time, its receptive field is limited, and it cannot capture global information well; moreover, the convolution and downsampling operations of a CNN cause considerable loss of detail and spatial information, so it cannot effectively meet the requirements of unsupervised pedestrian re-identification.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, and provides an unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution, which effectively improves the accuracy and robustness of unsupervised pedestrian re-identification.
According to the technical scheme provided by the invention, the unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution comprises the following steps:
constructing a multi-branch network recognition model Multiformer based on a Transformer network, so as to perform the required unsupervised pedestrian re-identification on pedestrian images acquired by m cameras using the constructed model, wherein,
the constructed multi-branch network recognition model Multiformer comprises a single-camera-domain Intraformer network constructed, based on a Transformer network, for each camera, and a multi-camera-domain Interformer network constructed, based on the Transformer network, for all cameras;
when the multi-branch network recognition model Multiformer is constructed, the single-camera-domain Intraformer networks of all cameras and the multi-camera-domain Interformer network adopt the same backbone network, and the single-camera-domain Intraformer networks of all cameras share backbone network parameters during training;
when pedestrians are re-identified, feature extraction is carried out on an identification image containing the pedestrian to be identified using the multi-camera-domain Interformer network, so as to find, according to the extracted pedestrian features, the pedestrian images matching those features among the pedestrian images acquired by the m cameras.
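As a concrete illustration of this retrieval step, the following is a minimal sketch of matching a query feature against gallery features; the function name, the tensor interface, and the use of cosine similarity as the matching measure are illustrative assumptions, not the patent's own code.

```python
import torch

def rank_gallery(query_feat: torch.Tensor, gallery_feats: torch.Tensor, topk: int = 10):
    """Rank gallery pedestrian images by cosine similarity to a query feature.

    query_feat:    (D,) feature of the identification image, e.g. from the
                   multi-camera-domain Interformer branch (hypothetical interface).
    gallery_feats: (G, D) features of the pedestrian images acquired by the m cameras.
    Returns the indices of the top-k most similar gallery images.
    """
    q = torch.nn.functional.normalize(query_feat.unsqueeze(0), dim=1)  # (1, D)
    g = torch.nn.functional.normalize(gallery_feats, dim=1)            # (G, D)
    sims = (q @ g.t()).squeeze(0)                                      # (G,)
    return sims.topk(min(topk, sims.numel())).indices

# usage: idx = rank_gallery(f_query, f_gallery); the matched images are gallery[idx]
```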
When constructing the multi-branch network identification model Multiformer, the construction steps comprise:
constructing a multi-branch network identification basic model based on a Transformer network, wherein the basic model comprises a multi-camera-domain base network based on the Transformer network and m single-camera-domain base networks based on the Transformer network; a classifier is configured in the multi-camera-domain base network and in each single-camera-domain base network, and each configured classifier is adaptively connected with the backbone network of the multi-camera-domain base network or of the corresponding single-camera-domain base network;
when the multi-branch network identification basic model is built, pre-training the backbone network used to build the multi-camera-domain base network on the ImageNet dataset, so as to obtain the multi-camera-domain backbone network pre-training parameters of the multi-camera-domain base network;
when constructing the single-camera-domain base networks, loading the obtained multi-camera-domain backbone network pre-training parameters into the backbone networks of all single-camera-domain base networks, so that the single-camera-domain base networks of all cameras share the backbone network parameters;
performing the required training on the constructed multi-branch network identification basic model, so that when the target training state is reached, the corresponding single-camera-domain Intraformer networks are formed from the trained single-camera-domain base networks, and the multi-camera-domain Interformer network is formed from the trained multi-camera-domain base network;
forming the multi-branch network identification model Multiformer from the multi-camera-domain Interformer network and the m single-camera-domain Intraformer networks.
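To make the branch layout concrete, here is a minimal PyTorch sketch of a model with one shared backbone, m single-camera classifier heads, and one multi-camera head. The use of nn.TransformerEncoder as a stand-in backbone, the depth, and all names are illustrative assumptions, not the patent's exact network.

```python
import torch
import torch.nn as nn

class MultiBranchReID(nn.Module):
    """Sketch: shared Transformer backbone + m single-camera branches + 1 multi-camera branch."""
    def __init__(self, num_cameras: int, feat_dim: int = 768, num_classes: int = 751):
        super().__init__()
        # One backbone shared by all branches (the patent shares backbone parameters).
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # depth is illustrative
        # One classifier per single-camera branch plus one for the multi-camera branch.
        self.single_heads = nn.ModuleList(nn.Linear(feat_dim, num_classes) for _ in range(num_cameras))
        self.multi_head = nn.Linear(feat_dim, num_classes)

    def forward(self, tokens: torch.Tensor, cam_id: int | None = None):
        feat = self.backbone(tokens)[:, 0]   # Cls-token feature, (B, D)
        if cam_id is None:                   # multi-camera-domain branch
            return feat, self.multi_head(feat)
        return feat, self.single_heads[cam_id](feat)
```

Because every branch calls the same `self.backbone`, the backbone parameters are shared by construction while each classifier head keeps its own parameters, matching the sharing scheme described above.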
When training the constructed multi-branch network identification basic model, the training process comprises the following steps:
step 1, performing feature extraction on a training data set by utilizing a multi-branch network identification basic model to obtain multi-camera domain picture features F mc Single-camera-domain picture feature F of the ith camera c_i ,i=1,…,m;
Step 2, regarding the obtained multi-camera domain picture characteristic F mc Single-camera-domain picture feature F of the ith camera c_i Clustering, wherein successfully clustered pictures form cluster points Inliers, and distributing cluster point pseudo labels to the pictures in the cluster points Inliers, and unsuccessfully clustered pictures form Outliers Outlers;
step 3, generating a cluster point pseudo tag clustering center based on the cluster point pseudo tags, performing self-adaptive outlier sample reassignment on Outliers by using the generated cluster point pseudo tag clustering center, so that after the self-adaptive outlier samples are reassigned, corresponding cluster point pseudo tags are assigned to Outliers in the Outliers, and a pseudo tag training set is formed by using all the cluster point pseudo tags;
Step 4, carrying out joint contrast learning on the multi-branch network identification basic model, so as to perform model network parameter optimization based on joint contrast learning, wherein,
for the i-th single-camera-domain base network, joint contrast learning is performed based on the training data set, the single-camera-domain picture feature F_c_i of the i-th camera, and the cluster point pseudo-label cluster centers;
for the multi-camera-domain base network, joint contrast learning is performed based on the training data set, the multi-camera-domain picture feature F_mc, and the cluster point pseudo-label cluster centers;
the joint contrast learning comprises cluster-level contrast learning and instance-level contrast learning;
Step 5, performing collaborative training of the single-camera-domain base networks and the multi-camera-domain base network on the multi-branch network identification basic model optimized by joint contrast learning, wherein,
the multi-camera-domain base network is trained using the multi-camera-domain picture feature F_mc and the pseudo label training set;
the i-th single-camera-domain base network is trained using the single-camera-domain picture feature F_c_i of the i-th camera and the pseudo label training set;
Step 6, repeating the training process of steps 1 to 5 until the target training state is reached.
For step 1, when extracting the multi-camera-domain picture feature F_mc, each training picture in the training data set is subjected to Split processing; a parameter Cls token is concatenated to the image block sequence obtained by the Split processing, and the position information of each image block and the camera information encoding of the training picture are embedded, to form the multi-camera-domain feature extraction information of the training picture;
the multi-camera-domain feature extraction information of the training picture is processed by the multi-camera-domain base network to extract the multi-camera-domain picture feature F_mc.
When extracting the single-camera-domain picture feature F_c_i of the i-th camera, the training pictures acquired by the i-th camera are subjected to Split processing; a parameter Cls token is concatenated to the image block sequence obtained by the Split processing, and the position information of each image block is embedded, to form the single-camera-domain feature extraction information of the training picture;
the single-camera-domain feature extraction information of the training picture is processed by the single-camera-domain base network corresponding to the i-th camera to extract the single-camera-domain picture feature F_c_i.
In step 2, when clustering the obtained multi-camera-domain picture feature F_mc and all single-camera-domain picture features F_c_i, the clustering method comprises the DBSCAN clustering method.
In step 3, the cluster point pseudo-label cluster centers are:

Φ_i = (1 / num_i) · Σ_{j=1}^{num_i} f_j,  i = 1, …, Y

wherein Y is the number of cluster point pseudo-label categories, Φ_i is the cluster center feature of the i-th class, f_j is the feature of the j-th picture in the i-th class, and num_i is the number of pictures contained in the i-th class;
the generated cluster point pseudo-label cluster centers are stored in a cluster center feature repository (Center Memory Bank);
an affinity matrix between the outlier samples in the Outliers and the cluster point pseudo-label cluster centers is calculated:

AFM(i, j) = ( Σ_{r=1}^{N} Φ_i_r · O_j_r ) / ( sqrt(Σ_{r=1}^{N} Φ_i_r²) · sqrt(Σ_{r=1}^{N} O_j_r²) )

where AFM(i, j) is the mutual similarity value between the i-th cluster center feature Φ_i and the j-th outlier sample in the affinity matrix AFM, O_j is the feature of the j-th outlier sample, Φ_i_r denotes the r-th value of the i-th cluster center feature Φ_i, O_j_r denotes the r-th value of the j-th outlier sample feature O_j, and N denotes the feature dimension;
when adaptive outlier sample reassignment is performed based on the calculated affinity matrix AFM, each outlier sample is assigned to the cluster point pseudo-label cluster center with which it has the strongest mutual similarity.
A mutual similarity threshold v is configured for the mutual similarity between an outlier sample and the cluster point pseudo-label cluster centers:

v = v(Num_O, v_start, γ, epoch, e_peak)   [adaptive schedule; equation image not reproduced]

where Num_O is the number of outlier samples in the Outliers, v_start is the initial value of the mutual similarity threshold v, γ is the threshold attenuation rate, epoch is the training round, e_peak denotes the training round at which the threshold v reaches its peak, and II(·) is an indicator function that equals 1 when the training round is smaller than e_peak, i.e. II(·) = II{epoch < e_peak};
when outlier samples are assigned based on the configured mutual similarity threshold v, the j-th outlier sample whose mutual similarity value AFM(i, j) is larger than the threshold v is assigned to the cluster point pseudo-label cluster center with the strongest mutual similarity.
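A minimal sketch of this reassignment rule follows, assuming the cosine-similarity form of the affinity matrix reconstructed above and a caller-supplied threshold v; names and shapes are illustrative.

```python
import numpy as np

def reassign_outliers(centers: np.ndarray, outliers: np.ndarray, v: float):
    """Adaptive outlier sample reassignment sketch.

    centers:  (Y, N) cluster point pseudo-label cluster centers.
    outliers: (M, N) features of the outlier samples.
    v:        mutual similarity threshold.
    Returns, per outlier sample, the pseudo label of the most similar center,
    or -1 if its best mutual similarity does not exceed v.
    """
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    o = outliers / np.linalg.norm(outliers, axis=1, keepdims=True)
    afm = c @ o.T                 # (Y, M) affinity matrix of mutual similarity values
    best = afm.argmax(axis=0)     # strongest-related cluster center per outlier sample
    best_sim = afm.max(axis=0)
    return np.where(best_sim > v, best, -1)
```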
In step 4, during joint contrast learning, the cluster contrast loss l_c is obtained after cluster-level contrast learning, and the instance contrast loss l_t is obtained after instance-level contrast learning, wherein,
for the cluster contrast loss l_c:

l_c = -log( exp(f(q) · Φ_+ / Γ) / Σ_{i=1}^{Y} exp(f(q) · Φ_i / Γ) )

wherein Φ_+ is the cluster center of the positive class for sample picture q, Γ is a set temperature parameter, and f(q) is the query instance feature of sample picture q;
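A minimal sketch of this cluster-level loss follows, implementing the softmax form of l_c via cross-entropy; the L2 normalization of features and centers is an assumption, not stated in the original.

```python
import torch
import torch.nn.functional as F

def cluster_contrast_loss(f_q: torch.Tensor, centers: torch.Tensor,
                          pos_idx: torch.Tensor, gamma: float = 0.05):
    """Cluster-level contrast loss sketch (softmax over cluster centers).

    f_q:     (B, D) query instance features f(q).
    centers: (Y, D) cluster centers from the Center Memory Bank.
    pos_idx: (B,) index of each query's positive cluster.
    gamma:   temperature, standing in for the parameter Γ above.
    """
    logits = F.normalize(f_q, dim=1) @ F.normalize(centers, dim=1).t() / gamma  # (B, Y)
    return F.cross_entropy(logits, pos_idx)  # equals -log softmax at the positive cluster
```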
for the instance contrast loss l_t:

l_t = Σ_{i=1}^{P} Σ_{a=1}^{K} max( β + max_{p=1,…,K} ‖f(x_a^i) - f(x_p^i)‖ - min_{j≠i, n=1,…,K} ‖f(x_a^i) - f(x_n^j)‖, 0 )

wherein P is the number of different pedestrians selected in a given sample batch, K is the number of sample pictures selected for each pedestrian, a indexes one picture among the K sample pictures, x_a^i is an anchor picture with identity i, x_p^i is a positive sample with identity i, x_n^j is a negative sample with identity j, f(x_a^i) and f(x_n^j) are the image features extracted from x_a^i and x_n^j respectively, and β is the minimum gap (margin) between the similarity of a positive sample pair and that of a negative sample pair.
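A minimal batch-hard sketch of this instance-level loss for a P×K batch follows; the batch-hard mining strategy and the averaging over the batch are assumptions consistent with the formula above.

```python
import torch

def instance_contrast_loss(feats: torch.Tensor, labels: torch.Tensor, beta: float = 0.3):
    """Batch-hard triplet-style sketch for a P×K batch (P identities, K images each).

    feats:  (P*K, D) image features f(x).
    labels: (P*K,) pseudo identity labels.
    beta:   margin between positive-pair and negative-pair distances.
    """
    dist = torch.cdist(feats, feats)                   # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # (B, B) positive-pair mask
    hardest_pos = (dist * same.float()).max(dim=1).values        # farthest positive
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values  # closest negative
    return torch.clamp(beta + hardest_pos - hardest_neg, min=0).mean()
```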
When optimizing the model network parameters based on joint contrast learning, the model network parameters θ are determined so as to minimize the loss function over the N_H training samples under the determined parameters θ, wherein the multi-camera-domain base network and all single-camera-domain base networks are optimized simultaneously:

θ* = argmin_θ Σ_{a=1}^{N_H} ( l_c(f(x_a)) + l_t(f(x_a)) )

where f(x_a) is the image feature extracted from the anchor image x_a.
For the co-training identity loss, there is:

l_id = -(1 / N_z) · Σ_{i=1}^{N_z} log p(ỹ_i | x_i)

wherein l_id is the co-training identity loss, ỹ_i is the pseudo label of x_i, N_z is the number of training samples in the training data set, and p(ỹ_i | x_i) is the probability that the multi-branch network identification basic model outputs ỹ_i as the identity label of training sample x_i.
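A minimal sketch, assuming the identity loss is the standard cross-entropy between the classifier outputs and the cluster point pseudo labels:

```python
import torch
import torch.nn.functional as F

def cotraining_identity_loss(logits: torch.Tensor, pseudo_labels: torch.Tensor):
    """Co-training identity loss sketch: cross-entropy against pseudo labels.

    logits:        (Nz, num_classes) classifier outputs for the Nz training samples.
    pseudo_labels: (Nz,) cluster point pseudo labels assigned above.
    """
    return F.cross_entropy(logits, pseudo_labels)  # averages -log p(y_i | x_i) over samples
```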
The invention has the following advantages: the multi-branch network identification model Multiformer is constructed based on a Transformer network and comprises single-camera-domain Intraformer networks and a multi-camera-domain Interformer network, with all single-camera-domain Intraformer networks sharing backbone network parameters. This enhances generalization capability, alleviates to a certain extent the inter-domain differences caused by the background, illumination and the like of different camera domains, improves the robustness of the model to noisy pseudo labels, and further improves the accuracy of unsupervised pedestrian re-identification.
Adaptive outlier sample re-distribution expands the number of pseudo labels, thereby enhancing the feature representation capability of the multi-branch network identification model Multiformer. During model training, the joint learning of instance-level contrast learning and cluster-level contrast learning greatly improves clustering accuracy and alleviates the noisy pseudo label problem, thereby effectively improving the accuracy and robustness of unsupervised pedestrian re-identification.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a flowchart of an embodiment of constructing a multi-branch network identification model according to the present invention.
Fig. 3 is a schematic diagram of an embodiment of a multi-branch network identification model of the present invention.
Fig. 4 is a schematic diagram of an embodiment of the single-camera-domain Intraformer network and the multi-camera-domain Interformer network of the present invention.
Fig. 5 is a diagram of the visual effect of the multi-branch network according to the present invention.
FIG. 6 is a diagram of one embodiment of the present invention for counting the distribution of Outliers after clustering.
FIG. 7 is a schematic diagram of adaptive outlier sample reassignment according to the present invention.
FIG. 8 is a schematic diagram of joint contrast learning in the present invention.
Fig. 9 is a visualization of the comparative experiments of the present invention.
Detailed Description
The invention will be further described with reference to the following specific drawings and examples.
In order to effectively improve the accuracy and robustness of unsupervised pedestrian re-identification, in one embodiment of the present invention, the unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution comprises:
constructing a multi-branch network recognition model Multiformer based on a Transformer network, so as to perform the required unsupervised pedestrian re-identification on pedestrian images acquired by m cameras using the constructed model, wherein,
the constructed multi-branch network recognition model Multiformer comprises a single-camera-domain Intraformer network constructed, based on a Transformer network, for each camera, and a multi-camera-domain Interformer network constructed, based on the Transformer network, for all cameras;
when the multi-branch network recognition model Multiformer is constructed, the single-camera-domain Intraformer networks of all cameras and the multi-camera-domain Interformer network adopt the same backbone network, and the single-camera-domain Intraformer networks of all cameras share backbone network parameters during training;
when pedestrians are re-identified, feature extraction is carried out on an identification image containing the pedestrian to be identified using the multi-camera-domain Interformer network, so as to find, according to the extracted pedestrian features, the pedestrian images matching those features among the pedestrian images acquired by the m cameras.
Fig. 1 shows an implementation flow chart of the unsupervised pedestrian re-identification. To implement it, a multi-branch network recognition model Multiformer based on a Transformer network needs to be built; that is, the model Multiformer is built on a Transformer network. The scene range of pedestrian image acquisition is determined by the m cameras, which forms the pedestrian re-identification area of the multi-branch network recognition model Multiformer; the built model can then perform unsupervised pedestrian re-identification on the pedestrian images acquired by the m cameras. A camera here is any device capable of acquiring pedestrian images, such as a commonly used video camera; the specific type and number of cameras can be selected as needed to meet the requirements of unsupervised pedestrian re-identification. In addition, the m cameras are generally installed in different areas, so pedestrian images in the scenes of m different areas can be acquired with them.
In order to improve the accuracy and robustness of unsupervised pedestrian re-identification, in an embodiment of the present invention the multi-branch network recognition model Multiformer needs to include a single-camera-domain Intraformer network constructed, based on a Transformer network, for each camera and a multi-camera-domain Interformer network constructed, based on the Transformer network, for all cameras, where the single-camera domain specifically refers to the acquisition range of one camera collecting pedestrian images, and the multi-camera domain refers to the acquisition range of all m cameras. Because the single-camera-domain Intraformer networks and the multi-camera-domain Interformer network are constructed based on a Transformer network, the characteristics of the Transformer network allow global information and picture details to be captured better, enhancing the utilization of globally effective information.
Because the unsupervised pedestrian re-identification task involves many different camera domains, and pedestrian pictures shot by different cameras exhibit large inter-domain differences due to interference from external factors such as angle, background and illumination, in one embodiment of the invention the single-camera-domain Intraformer networks of all cameras adopt the same backbone network and share its backbone parameters. This enhances the generalization capability of the multi-branch network recognition model Multiformer, alleviates to a certain extent the inter-domain differences caused by the background, illumination and the like of different camera domains, improves robustness to noisy pseudo labels, and further improves the accuracy of unsupervised pedestrian re-identification.
Fig. 5 is a T-SNE diagram plotted on the public Market-1501 dataset. Panel (a) shows the feature distribution obtained without the multi-branch network recognition model Multiformer proposed by the present invention, and panel (b) shows the distribution of features extracted by the multi-branch network recognition model Multiformer. Dots of the same color represent the same camera; the Market-1501 dataset contains photos from 6 cameras, so there are 6 colors in the figure. In panel (a), due to the influence of domain differences between cameras, image features from the same camera have higher similarity, which means the network's attention is not on the pedestrians but is disturbed by noise. In panel (b), the image features of each camera are uniformly distributed; it can be seen that after the multi-branch network recognition model Multiformer is introduced, the domain differences among cameras are significantly alleviated.
In one embodiment of the present invention, when constructing the multi-branch network identification model Multiformer, the construction steps comprise:
constructing a multi-branch network identification basic model based on a Transformer network, wherein the basic model comprises a multi-camera-domain base network based on the Transformer network and m single-camera-domain base networks based on the Transformer network; a classifier is configured in the multi-camera-domain base network and in each single-camera-domain base network, and each configured classifier is adaptively connected with the backbone network of the multi-camera-domain base network or of the corresponding single-camera-domain base network;
when the multi-branch network identification basic model is built, pre-training the backbone network used to build the multi-camera-domain base network on the ImageNet dataset, so as to obtain the multi-camera-domain backbone network pre-training parameters of the multi-camera-domain base network;
when constructing the single-camera-domain base networks, loading the obtained multi-camera-domain backbone network pre-training parameters into the backbone networks of all single-camera-domain base networks, so that the single-camera-domain base networks of all cameras share the backbone network parameters;
performing the required training on the constructed multi-branch network identification basic model, so that when the target training state is reached, the corresponding single-camera-domain Intraformer networks are formed from the trained single-camera-domain base networks, and the multi-camera-domain Interformer network is formed from the trained multi-camera-domain base network;
forming the multi-branch network identification model Multiformer from the multi-camera-domain Interformer network and the m single-camera-domain Intraformer networks.
As can be seen from the above description, since the multi-branch network identification model Multiformer comprises single-camera-domain Intraformer networks and a multi-camera-domain Interformer network, the multi-branch network identification basic model is constructed to include at least m single-camera-domain base networks for forming the single-camera-domain Intraformer networks and a multi-camera-domain base network for forming the multi-camera-domain Interformer network. That is, the m single-camera-domain base networks correspond one-to-one to the m single-camera-domain Intraformer networks finally formed, and the multi-camera-domain base network corresponds to the multi-camera-domain Interformer network.
In one embodiment of the invention, the single-camera-domain base networks and the multi-camera-domain base network all use the same backbone network; for example, they all use the Encoder of a Transformer network as the backbone network. In addition, a classifier is configured in the multi-camera-domain base network and in all single-camera-domain base networks, forming a multi-branch classifier.
Fig. 3 shows the architecture of the multi-camera-domain Interformer network and the single-camera-domain Intraformer networks formed after reaching the target training state; since only the corresponding network parameters are optimized during training, the architecture of the constructed single-camera-domain and multi-camera-domain base networks can also be read from fig. 3.
In fig. 3 and fig. 4, for the multi-camera-domain Interformer network, Split cuts the input picture to obtain a number of image blocks. Linear Projection of Flattened Patches denotes linear projection and dimension transformation, Embedding denotes the embedded data, and Feature Extraction denotes feature extraction, during which E_mc, Block and Token are obtained in sequence. In fig. 4, Branch-1 to Branch-m are the m single-camera-domain Intraformer networks.
Affinity Matrix is the affinity matrix, Pseudo Label is the pseudo label, AORA is the adaptive outlier sample reassignment strategy, Joint Contrastive Learning (JCL) is the joint contrast learning, MLP is the multilayer perceptron, and Classifier is the classifier. Joint Contrastive Learning (JCL) comprises Instance Contrastive Learning (instance-level contrast learning) and Cluster Contrastive Learning (cluster-level contrast learning).
Fig. 4 shows an implementation of the backbone network corresponding to the multi-camera-domain Interformer and single-camera-domain Intraformer networks; the backbone network in fig. 4 includes the above-mentioned Split, linear projection and dimension transformation processing, and the specific manner of forming the backbone network based on a Transformer is consistent with the prior art.
In one embodiment of the invention, the Classifier is adaptively connected with the backbone network through the MLP, whereby the information added by the Classifier can be determined. In a specific implementation, all classifiers adopt the same form and can be initialized with a normal distribution. After the constructed multi-branch network identification basic model is trained to the target state, the corresponding classifiers are obtained respectively.
For the single-camera-domain Intraformer networks, since they adopt the same backbone network as the multi-camera-domain Interformer network, the specific case of the m single-camera-domain Intraformer networks in fig. 3 can refer to the corresponding description of the multi-camera-domain Interformer network and is not repeated here.
In order to realize sharing of the backbone network parameters, in one embodiment of the invention the backbone network in the multi-camera-domain base network is pre-trained on the ImageNet dataset to obtain the multi-camera-domain backbone network pre-training parameters of the multi-camera-domain base network;
when the single-camera-domain base networks are constructed, the obtained multi-camera-domain backbone network pre-training parameters are loaded into the backbone networks of all single-camera-domain base networks, so that the single-camera-domain base networks of all cameras share the backbone network parameters.
In a specific implementation, ImageNet is an existing public dataset, and the manner and process of pre-training the backbone network of the multi-camera-domain base network with the ImageNet dataset are consistent with existing methods. After the multi-camera-domain backbone network pre-training parameters are obtained by pre-training, a Classifier is added to the multi-camera-domain base network. Likewise, after the pre-training parameters are loaded into each single-camera-domain base network, a Classifier is added to each single-camera-domain base network.
The Classifier can take an existing common classification form; the manner of adding the Classifier and its specific form can be selected according to actual needs, provided the classification requirements are met. After all Classifiers are added, the construction of the multi-branch network identification basic model is complete, and the model then needs to be trained.
As can be seen from the above description, the backbone networks of the single-camera-domain base networks constructed, based on the Transformer network, for each camera share network parameters, but the corresponding Classifier parameters are not shared. In implementation, after the backbone networks of the single-camera-domain base networks share network parameters, the corresponding backbone network parameters are essentially consistent.
In one embodiment of the present invention, the required training is performed on the multi-branch network identification basic model constructed as described above; specifically, after the single-camera-domain base networks sharing backbone parameters are configured for all cameras, the obtained multi-branch network identification basic model is trained until the target training state is reached.
When training the constructed multi-branch network identification basic model, the training process comprises:
Step 1, performing feature extraction on a training data set using the multi-branch network identification basic model, to obtain the multi-camera-domain picture feature F_mc and the single-camera-domain picture feature F_c_i of the i-th camera, i = 1, …, m;
Step 2, clustering the obtained multi-camera-domain picture feature F_mc and the single-camera-domain picture features F_c_i of each camera, wherein successfully clustered pictures form cluster points (Inliers) and are assigned cluster point pseudo labels, and unsuccessfully clustered pictures form Outliers;
Step 3, generating cluster point pseudo-label cluster centers based on the cluster point pseudo labels, and performing adaptive outlier sample reassignment on the Outliers using the generated cluster centers, so that after the adaptive reassignment the outlier samples in the Outliers are assigned corresponding cluster point pseudo labels; a pseudo label training set is formed from all the cluster point pseudo labels;
Step 4, carrying out joint contrast learning on the multi-branch network identification basic model, so as to perform model network parameter optimization based on joint contrast learning, wherein,
for the i-th single-camera-domain base network, joint contrast learning is performed based on the training data set, the single-camera-domain picture feature F_c_i of the i-th camera, and the cluster point pseudo-label cluster centers;
for the multi-camera-domain base network, joint contrast learning is performed based on the training data set, the multi-camera-domain picture feature F_mc, and the cluster point pseudo-label cluster centers;
the joint contrast learning comprises cluster-level contrast learning and instance-level contrast learning;
Step 5, performing collaborative training of the single-camera-domain base networks and the multi-camera-domain base network on the multi-branch network identification basic model optimized by joint contrast learning, wherein,
the multi-camera-domain base network is trained using the multi-camera-domain picture feature F_mc and the pseudo label training set;
the i-th single-camera-domain base network is trained using the single-camera-domain picture feature F_c_i of the i-th camera and the pseudo label training set;
during collaborative training, the network parameters of the multi-branch network identification basic model are optimized using the calculated co-training identity loss;
Step 6, repeating the training process of steps 1 to 5 until the target training state is reached.
Fig. 2 shows an embodiment of the training process for the multi-branch network identification basic model. During training, the steps of feature extraction, clustering to generate partial pseudo labels and outliers, adaptive outlier sample assignment, joint contrast learning, and multi-branch network collaborative training are generally required. The termination condition is model convergence: when the model is judged to have converged after training, training terminates and the target training state is reached; otherwise, training is repeated. In a specific implementation, the model is judged to be in a converged state when, during training, the accuracy of the model no longer increases and the model loss no longer decreases.
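The alternation of steps 1 to 6 can be summarized in the following orchestration sketch; every helper named here is hypothetical (several are sketched elsewhere in this document), not patent code, and the convergence test stands for the accuracy/loss plateau just described.

```python
# Orchestration sketch of steps 1-6; all helper names are hypothetical.
def train_multiformer(model, dataset, optimizer, max_epochs):
    for epoch in range(max_epochs):
        feats = extract_features(model, dataset)                     # step 1
        labels, centers, outlier_feats = cluster_with_dbscan(feats)  # step 2
        v = similarity_threshold(epoch)                              # adaptive threshold
        labels = apply_reassignment(                                 # step 3 (AORA)
            labels, reassign_outliers(centers, outlier_feats, v))
        for batch in make_pk_batches(dataset, labels):               # steps 4 and 5
            feat, logits = model(batch.tokens, batch.cam_id)
            loss = (cluster_contrast_loss(feat, centers, batch.pos_idx)  # joint contrast
                    + instance_contrast_loss(feat, batch.labels)
                    + cotraining_identity_loss(logits, batch.labels))    # co-training
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if converged():  # accuracy no longer rises and loss no longer falls
            break        # target training state reached (step 6)
```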
The training procedure is described in detail below.
Specifically, during training, a training data set needs to be provided or configured, wherein the training data set is composed of images shot and collected by the m cameras; the size of the training data set can be selected according to actual needs to meet the training requirements.
In one embodiment of the present invention, for step 1, when extracting the multi-camera-domain picture feature F_mc, each training picture in the training data set is subjected to Split processing; a parameter Cls token is concatenated to the image block sequence obtained by the Split processing, and the position information of each image block and the camera information of the training picture are encoded, to form the multi-camera-domain feature extraction information of the training picture;
the multi-camera-domain feature extraction information of the training picture is processed by the multi-camera-domain base network to extract the multi-camera-domain picture feature F_mc.
When extracting the single-camera-domain picture feature F_c_i of the i-th camera, the training pictures acquired by the i-th camera are subjected to Split processing; a parameter Cls token is concatenated to the image block sequence obtained by the Split processing, and the position information of each image block is embedded, to form the single-camera-domain feature extraction information of the training picture;
the single-camera-domain feature extraction information of the training picture is processed by the single-camera-domain base network corresponding to the i-th camera to extract the single-camera-domain picture feature F_c_i.
In particular, the training data input is the set X_mc ∈ R^{B×C×H×W} of pictures from all cameras (i.e., the pictures collected by the m cameras), where H×W is the resolution of the input picture, C is the number of channels (for RGB pictures, C = 3), and B is the batch size, which can be chosen according to the actual application scenario. The input picture is divided (Split) and the spatial dimensions flattened to obtain the image block (Patch) sequence

X_p ∈ R^{B×N×(P_h·P_w·C)}

where N is the number of Patches obtained by segmentation and P_h×P_w is the size of each cut image block Patch.
The Patch sequence X_p is linearly projected and dimensionally transformed to obtain the Patch encoding E_mc ∈ R^{B×N×D}, where D is the generated feature dimension. A parameter Cls token representing the global feature is concatenated to the Patch encoding E_mc, and the position encoding and camera information encoding are embedded, to obtain E_mc_cls ∈ R^{B×N′×D}. After training, the parameter Cls token contains a feature representation of the input picture for classification. The Cls token has size R^{B×1×D}; after it is concatenated to the Patch encoding E_mc, the Patch-number dimension N increases by 1 (N′ = N + 1).
The parameter Cls token is a learnable parameter of size R^{B×1×D}. The position encoding is the position information of each divided image block Patch in the original picture; the camera information encoding is formed from the camera number of the picture. Both have size R^{1×N′×D} and are initialized to 0.
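The tensor bookkeeping above can be made concrete with a short PyTorch sketch; the image and patch sizes, the convolutional implementation of Split plus linear projection, and the per-camera encoding table are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchCamEmbed(nn.Module):
    """Sketch of Split + linear projection + Cls token + position/camera encodings."""
    def __init__(self, num_cameras: int, img_hw=(256, 128), patch_hw=(16, 16),
                 in_ch: int = 3, dim: int = 768):
        super().__init__()
        n = (img_hw[0] // patch_hw[0]) * (img_hw[1] // patch_hw[1])  # N patches
        # Strided convolution performs Split and linear projection in one step.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_hw, stride=patch_hw)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))            # learnable Cls token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))        # position encoding, init 0
        self.cam = nn.Parameter(torch.zeros(num_cameras, 1, dim))  # camera information encoding

    def forward(self, x: torch.Tensor, cam_id: torch.Tensor):
        # x: (B, C, H, W) pictures; cam_id: (B,) camera indices.
        e = self.proj(x).flatten(2).transpose(1, 2)                # (B, N, D) Patch encoding E
        e = torch.cat([self.cls.expand(e.size(0), -1, -1), e], 1) # prepend Cls -> (B, N+1, D)
        return e + self.pos + self.cam[cam_id]                     # embed position + camera info
```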
E_mc_cls is fed into the Block network of the Transformer network, and the Block network processes the multi-camera-domain feature extraction information of the training picture to extract the multi-camera-domain picture feature F_mc; the extracted feature F_mc is the Token generated by the Interformer network in fig. 3 and fig. 4.
The input training pictures are classified according to their camera labels and fed into the corresponding single-camera-domain base networks; for example, the training pictures fed into the single-camera-domain base network corresponding to the i-th camera are the pictures collected by the i-th camera. Specifically, the input data of each single-camera-domain base network is the set X_c_i ∈ R^{B×C×H×W} of single-camera pictures, where c_i denotes the i-th camera. After the same segmentation and dimension transformation as in the multi-camera-domain base network, the image block Patch encoding E_c_i ∈ R^{B×N×D} is obtained. A parameter Cls token is concatenated to the Patch encoding E_c_i, into which the position encoding of each image block Patch is also embedded. Subsequently, the Patch encoding is fed into the Block network of the Transformer network to extract the single-camera-domain picture feature F_c_i; the extracted feature F_c_i is the Token generated by the Intraformer network in fig. 3 and fig. 4.
Fig. 4 shows an embodiment in which the Block network processes E_mc_cls together with the position information and camera information to obtain Token; the method and process by which the Block network obtains Token are similar to Block processing in an existing Transformer network and are not described in detail here.
In the multi-camera-domain base network used to form the multi-camera-domain Interformer network and the single-camera-domain base networks used to form the single-camera-domain Intraformer networks, the parameters of the Block network are determined by the above steps, so picture features can be extracted directly with the Block network; the specifics of the Block network are well known to those skilled in the art and are not described here.
In one embodiment of the present invention, in step 2, when clustering the obtained multi-camera-domain picture feature F_mc and all single-camera-domain picture features F_c_i, the clustering method comprises the DBSCAN clustering method.
In the clustering process, some extracted image features are disturbed by noise such as pedestrian pose and background, so they lie far from any cluster center point and cannot be clustered successfully; such samples are called outlier samples, and all outlier samples form the Outliers. In one embodiment of the invention, unsupervised pedestrian re-identification relies on pseudo labels derived from clustering for collaborative training, whereas outlier samples, lacking labels, cannot be utilized in training.
In a specific implementation, the clustering method can adopt the DBSCAN clustering method, which does not require the number of clusters to be specified and can learn the number of cluster categories autonomously. After clustering, cluster point pseudo labels are assigned to the successfully clustered pictures, and the unsuccessfully clustered pictures form the Outliers. Of course, other common clustering forms can also be adopted, provided the actual clustering requirements are met. When the DBSCAN clustering method is adopted, the specific conditions under which clustering forms the cluster points (Inliers) and the Outliers can be selected according to actual needs.
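A minimal sketch of this step with scikit-learn's DBSCAN follows; the eps and min_samples values and the use of plain cosine distance (re-ID pipelines often use re-ranked Jaccard distance instead) are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def pseudo_labels_dbscan(feats: np.ndarray, eps: float = 0.6, min_samples: int = 4):
    """Clustering sketch: DBSCAN pseudo labels from picture features.

    feats: (Nz, D) L2-normalized picture features.
    Returns labels where >= 0 marks a cluster point (Inliers) and -1 marks Outliers.
    """
    dist = np.clip(1.0 - feats @ feats.T, 0.0, None)  # cosine distance matrix
    return DBSCAN(eps=eps, min_samples=min_samples,
                  metric="precomputed").fit_predict(dist)

# Pictures with label >= 0 receive cluster point pseudo labels;
# pictures with label == -1 await adaptive outlier reassignment (step 3).
```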
Fig. 6 is a statistic of the number of outlier samples of the pedestrian re-identification dataset Market-1501 after DBSCAN clustering: outlier samples occupy more than 60% of all samples at the initial stage of training, and still occupy more than 10% after multiple iterations of the model. Compared with a convolutional neural network, a Transformer network has fewer inductive biases about the structure of the input data, such as locality and translation invariance, so more data is needed to train a Transformer network, especially at the beginning of model training. For better results, the outlier samples need to be fully utilized.
In step 3, in one embodiment of the present invention, the cluster point pseudo-label cluster centers are:

Φ_i = (1 / num_i) · Σ_{j=1}^{num_i} f_j,  i = 1, …, Y

wherein Y is the number of cluster point pseudo-label categories, Φ_i is the cluster center feature of the i-th class, f_j is the feature of the j-th picture in the i-th class, and num_i is the number of pictures contained in the i-th class;
the generated cluster point pseudo-label cluster centers are stored in a cluster center feature repository (Center Memory Bank);
an affinity matrix between the outlier samples in the Outliers and the cluster point pseudo-label cluster centers is calculated:

AFM(i, j) = ( Σ_{r=1}^{N} Φ_i_r · O_j_r ) / ( sqrt(Σ_{r=1}^{N} Φ_i_r²) · sqrt(Σ_{r=1}^{N} O_j_r²) )

where AFM(i, j) is the mutual similarity value between the i-th cluster center feature Φ_i and the j-th outlier sample, O_j is the feature of the j-th outlier sample, Φ_i_r denotes the r-th value of Φ_i, O_j_r denotes the r-th value of O_j, and N denotes the feature dimension;
when adaptive outlier sample reassignment is performed based on the calculated affinity matrix AFM, each outlier sample is assigned to the cluster point pseudo-label cluster center with which it has the strongest mutual similarity.
In a specific implementation, after clustering with the DBSCAN clustering method, the number of cluster point pseudo-label categories Y can be obtained from the formed cluster points (Inliers); likewise, the feature f_j of the j-th picture in the i-th class and the number of pictures num_i contained in the i-th class can be obtained. Thus, after clustering, the cluster point pseudo-label cluster centers can be generated as {Φ_1, Φ_2, …, Φ_i, …, Φ_Y}.
The feature dimension N is associated with the constructed multi-branch network identification model Multiformer, and for a given model the feature dimension N remains fixed; the i-th cluster center feature Φ_i therefore has the same feature dimension as the j-th outlier sample. The use of the affinity matrix AFM is consistent with the prior art; that is, based on the affinity matrix AFM, the mutual similarity value between a given outlier sample and a cluster center feature can be obtained.
As described above, the constructed multi-branch network identification basic model generally needs to be trained multiple times before convergence is determined. In one embodiment of the invention, the cluster center feature repository (Center Memory Bank) is used to store the cluster point pseudo-label cluster centers in each training round.
After the cluster point pseudo-label cluster centers are stored in the Center Memory Bank, the affinity matrix between the outlier samples in the Outliers and the cluster centers can be calculated, so that adaptive outlier sample re-distribution is performed based on the calculated affinity matrix. This expands the amount of data for training the model and enhances the feature representation capability of the model, achieving better performance.
When calculating the mutual similarity value AFM(i, j) between the i-th cluster center feature Φ_i and the j-th outlier sample in the affinity matrix AFM, the specific feature dimension N is determined by the i-th cluster center feature Φ_i. In a specific implementation, during adaptive outlier sample reassignment, the j-th outlier sample is assigned to the cluster point pseudo-label cluster center with the strongest mutual similarity; the strongest mutual similarity means that, for the j-th outlier sample, the corresponding mutual similarity value AFM(i, j) with the i-th cluster center feature Φ_i is the largest.
In one embodiment of the invention, a mutual similarity threshold v is configured for the mutual similarity between an outlier sample and the cluster point pseudo-label cluster centers:

v = v(Num_O, v_start, γ, epoch, e_peak)   [adaptive schedule; equation image not reproduced]

where Num_O is the number of outlier samples in the Outliers, v_start is the initial value of the mutual similarity threshold v, γ is the threshold attenuation rate, epoch is the training round, e_peak denotes the training round at which the threshold v reaches its peak, and II(·) is an indicator function that equals 1 when the training round is smaller than e_peak, i.e. II(·) = II{epoch < e_peak};
when outlier samples are assigned based on the configured mutual similarity threshold v, the j-th outlier sample whose mutual similarity value AFM(i, j) is larger than the threshold v is assigned to the cluster point pseudo-label cluster center with the strongest mutual similarity.
In specific implementation, the multi-branch network identification basic model has poor feature extraction capability in the initial training stage, and the extracted feature accuracy is relatively low, so that a smaller mutual similarity relation threshold v is adopted. Along with the training, the model feature extraction capability is gradually enhanced, and the threshold v of the mutual similarity relationship is adaptively increased; however, because some pictures in the data have a plurality of pedestrians, shielding, blurring and the like, and part of outlier sample points are always in the condition of oscillation or incapability of clustering, which is called as strong noise points, when iteration is performed to a certain round, the mutual similarity relation threshold v is adaptively reduced so as to ignore the interference of the strong noise points on the model, as shown in fig. 7.
The mutual similarity relation threshold v is configured based on the above conditions in the different training phases. The initial value v_start of the threshold v can be set to 0.6, the threshold decay rate γ can be set to 0.9, and e_peak is an empirical value that can generally be set to 10. epoch is the round of model training.
After the mutual similarity relation threshold v is configured, when the mutual similarity relation value AFM(i, j) is larger than the threshold v, the j-th outlier sample is assigned to the cluster point pseudo-label cluster center with the strongest mutual similarity relationship; otherwise, the j-th outlier sample is not assigned.
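The reassignment logic can be sketched as follows. The cosine-style affinity follows the per-dimension definition of AFM(i, j) given in the claims; the threshold schedule, whose exact closed form appears only as an equation image in the original, is an assumed stand-in that reproduces the described behavior (start near v_start, peak at e_peak, then decay at rate γ). All names are illustrative.

```python
import numpy as np

def affinity_matrix(centers: np.ndarray, outlier_feats: np.ndarray) -> np.ndarray:
    """AFM(i, j): cosine-style similarity between the i-th cluster center
    feature and the j-th outlier sample feature. Shape: (Y, Num_O)."""
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    o = outlier_feats / np.linalg.norm(outlier_feats, axis=1, keepdims=True)
    return c @ o.T

def threshold_v(epoch: int, v_start: float = 0.6, gamma: float = 0.9, e_peak: int = 10) -> float:
    """Assumed schedule only: rise toward a peak at e_peak, then decay at
    rate gamma to ignore strong noise points (the exact formula is an image
    in the original publication)."""
    if epoch < e_peak:
        return v_start * (epoch + 1) / e_peak      # adaptively increase early on
    return v_start * gamma ** (epoch - e_peak)     # adaptively decrease later

def reassign_outliers(afm: np.ndarray, v: float) -> np.ndarray:
    """Assign outlier j to the center with the strongest affinity when that
    affinity exceeds v; return -1 (leave unassigned) otherwise."""
    best = afm.argmax(axis=0)
    best_val = afm.max(axis=0)
    return np.where(best_val > v, best, -1)
```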
In step 4, during joint contrast learning, the cluster contrast loss l_c is obtained after cluster-level contrast learning, and the instance contrast loss l_t is obtained after instance-level contrast learning, wherein,
for the cluster contrast loss l_c, there is:

l_c = -\log \frac{\exp(f(q) \cdot \Phi_{+} / \Gamma)}{\sum_{i=1}^{Y} \exp(f(q) \cdot \Phi_{i} / \Gamma)}

wherein Φ_+ is a positive sample (cluster center) of the sample picture q, Γ is a set parameter, and f(q) is the query instance feature of the sample picture q;
the instance contrast loss l_t is:

l_t = \sum_{i=1}^{P} \sum_{a=1}^{K} \left[ \beta + \max_{p=1,\dots,K} \left\| f(x_a^i) - f(x_p^i) \right\|_2 - \min_{j \neq i,\; n=1,\dots,K} \left\| f(x_a^i) - f(x_n^j) \right\|_2 \right]_{+}

wherein P is the number of different pedestrians selected from a given sample, K is the number of sample pictures selected for each pedestrian in the given sample, a is one picture among the K sample pictures, x_a^i is an anchor picture with identity i, x_p^i is a positive sample with identity i, x_n^j is a negative sample with identity j, f(x_a^i) is the image feature extracted from the anchor picture x_a^i, f(x_n^j) is the image feature extracted from the negative sample x_n^j, and β is the minimum gap between the similarity of a positive sample pair and the similarity of a negative sample pair.
In specific implementation, cluster-level contrast learning is introduced in addition to contrast learning at the conventional instance level, in order to improve the clustering effect of the model and to reduce the distance between outlier samples and the cluster point pseudo-label cluster centers. Compared with instance-level contrast learning, cluster-level contrast learning mainly pulls a sample closer to its positive cluster and pushes it away from its negative clusters, and it can greatly reduce the computation of the model. Contrasting samples against clusters facilitates clustering, and this cluster-oriented contrast learning paradigm helps the model minimize the similarity between clusters so as to separate different clusters. The joint contrast learning is shown in fig. 8.
In specific implementation, during training, joint contrast learning is performed on the single-camera-domain basic networks and the multi-camera-domain basic network. In joint contrast learning, the purpose of cluster-level contrast learning is to minimize the distance between a sample picture q and its positive cluster and to maximize the distance between the sample picture q and its negative clusters, so the cluster contrast loss l_c can be obtained after cluster-level contrast learning.
The samples of each batch in cluster-level contrast learning only need to be contrasted against the cluster point pseudo-label cluster center features. In specific implementation, when cluster-level contrast learning is performed on a single-camera-domain basic network, the sample picture q is a picture captured by the camera corresponding to that single-camera-domain basic network; when cluster-level contrast learning is performed on the multi-camera-domain basic network, the sample picture q is any picture in the training data set.
From the above description, after clustering, each successfully clustered picture in the training data set is configured with a cluster point pseudo label. Therefore, after the sample picture q is determined, pictures of the same category as q are positive samples and pictures of different categories are negative samples; that is, the positive sample Φ_+ of the sample picture q and the query instance feature f(q) of the sample picture q can be determined by technical means commonly used in the technical field. Thus, the cluster contrast losses l_c corresponding to the single-camera-domain basic networks and the multi-camera-domain basic network can be obtained respectively.
In practice, the parameter Γ may be in the range [0, 1]; for example, Γ may be 0.5. Further, the minimum gap β between the similarity of a positive sample pair and the similarity of a negative sample pair is an empirical value, for which 0.3 is preferable.
In specific implementation, the purpose of instance-level contrast learning is to reduce the distance between similar samples and enlarge the distance between dissimilar samples. For a given set of samples, which are sample pictures selected from the training data set, P different pedestrians are selected and K pictures are selected for each pedestrian. For each image a, the least similar positive sample p and the most similar negative sample n are selected for instance-level contrast learning. Generally, P may be set to 8 and K to 32 in a given sample. Identity i or identity j refers to one of the P different pedestrians.
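A minimal sketch of this instance-level loss on one P×K mini-batch, assuming the batch-hard mining just described (hardest positive, hardest negative, two-norm distances, margin β); tensor names are illustrative:

```python
import torch

def instance_contrast_loss(feats: torch.Tensor, pids: torch.Tensor, beta: float = 0.3) -> torch.Tensor:
    """Batch-hard instance loss over a P*K batch.

    feats: (P*K, N) image features f(x).
    pids:  (P*K,) pedestrian identity per picture.
    For each anchor: least similar positive (max distance), most similar
    negative (min distance), hinge with margin beta.
    """
    dist = torch.cdist(feats, feats, p=2)               # pairwise two-norm distances
    same = pids.unsqueeze(0) == pids.unsqueeze(1)       # same-identity mask
    # the anchor itself sits among its positives at distance 0, harmless under max
    d_ap = dist.masked_fill(~same, float('-inf')).max(dim=1).values  # hardest positive
    d_an = dist.masked_fill(same, float('inf')).min(dim=1).values    # hardest negative
    return torch.relu(beta + d_ap - d_an).mean()        # mean over anchors
```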
Instance-level contrast learning helps the multi-branch network identification basic model learn the salient features that distinguish different samples, enhancing its feature representation capability. Combining the two forms of contrast learning greatly improves the clustering accuracy of the model and alleviates the noisy pseudo-label problem.
In one embodiment of the present invention, the query instance feature f(q) is used to perform contrast learning against the cluster center feature repository Center Memory Bank and to update the Center Memory Bank, as follows:
after the cluster contrast loss l_c is calculated, the cluster center feature is updated with the query instance feature f(q) according to:

Φ_+ ← (1 − u) f(q) + u Φ_+
where u is a momentum parameter that slowly updates the features in the cluster center feature repository Center Memory Bank, avoiding the loss of feature consistency that drastic oscillation would cause. The value of u is generally in the range [0, 1] and can be chosen according to actual needs.
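A sketch of the cluster-level loss together with the slow memory update follows; the InfoNCE-style form over all Y cluster centers matches the loss written above, and the names and the renormalization step are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def cluster_contrast_step(f_q: torch.Tensor, centers: torch.Tensor,
                          pos_idx: int, temperature: float = 0.5, u: float = 0.9) -> torch.Tensor:
    """Cluster contrast loss for one query feature f(q), plus memory update.

    f_q:         (N,) query instance feature.
    centers:     (Y, N) Center Memory Bank (requires_grad=False).
    pos_idx:     index of the positive cluster, i.e. the pseudo label of q.
    temperature: the parameter Gamma in the text.
    u:           momentum of the slow update Phi+ <- (1-u) f(q) + u Phi+.
    """
    logits = centers @ f_q / temperature                # similarity to every center
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([pos_idx]))
    with torch.no_grad():
        centers[pos_idx] = u * centers[pos_idx] + (1 - u) * f_q.detach()
        centers[pos_idx] /= centers[pos_idx].norm()     # renormalize (assumption)
    return loss
```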
In one embodiment of the invention, when the model network parameters are optimized based on joint contrast learning, the model network parameters θ are determined so as to minimize the loss function over the NH training samples under the determined model network parameters θ:

\theta^{*} = \arg\min_{\theta} \sum_{h=1}^{NH} l(x_h; \theta)
In the optimizing process, the multi-camera-domain basic network and all the single-camera-domain basic networks are optimized simultaneously, with:

\min_{\theta} \sum_{a} \left[ \beta + \max_{p} \left\| f(x_a) - f(x_p) \right\|_2 - \min_{n} \left\| f(x_a) - f(x_n) \right\|_2 \right]_{+}

wherein f(x_a) is the image feature extracted from the anchor image x_a.
In specific implementation, the NH training samples are samples selected from the training data set as required. Here max d_{a,p} denotes \max_{p} \| f(x_a) - f(x_p) \|_2, min d_{a,n} denotes \min_{n} \| f(x_a) - f(x_n) \|_2, and \| \cdot \|_2 denotes the two-norm operation.
In one embodiment of the present invention, for the co-training identity loss, there is:

l_{id} = -\frac{1}{Nz} \sum_{i=1}^{Nz} \log p(\tilde{y}_i \mid x_i)

wherein l_id is the co-training identity loss, ỹ_i is the pseudo label of x_i, Nz is the number of training samples in the training data set, and p(ỹ_i | x_i) is the probability that the multi-branch network identification basic model outputs the true identity label ỹ_i for the training sample x_i.

In specific implementation, the number Nz of training samples in the training data set may be selected as required. From the above description, during training, the cluster point pseudo labels and the reassignment of outlier samples give each training sample x_i its label, so the probability p(ỹ_i | x_i) that the multi-branch network identification basic model outputs the true identity label ỹ_i can be obtained.
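A corresponding sketch of the co-training identity loss as a cross-entropy of the classifier outputs against the pseudo labels; the classifier shape and names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def co_training_identity_loss(logits: torch.Tensor, pseudo_labels: torch.Tensor) -> torch.Tensor:
    """l_id = -(1/Nz) * sum_i log p(y_i | x_i).

    logits:        (Nz, num_ids) classifier outputs of a branch network.
    pseudo_labels: (Nz,) long tensor of cluster point pseudo labels
                   (after outlier reassignment).
    F.cross_entropy applies log-softmax, so this is exactly the averaged
    negative log-probability of each sample's labeled identity.
    """
    return F.cross_entropy(logits, pseudo_labels)
```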
In practice, as can be seen from the above description, the cluster contrast loss l_c, the instance contrast loss l_t and the co-training identity loss l_id are obtained during training, so the total loss after one round of training can be obtained as:

l = l_c + l_t + l_{id}

As can be seen from the above description, when judging whether the multi-branch network identification basic model has converged, the main judgment indexes include the model accuracy and the model loss, where the model accuracy can generally be the mean average precision mAP and the model loss is the total loss l.
After training, the specific calculation of the mean average precision mAP can be consistent with the prior art, so whether the multi-branch network identification basic model has converged can be judged effectively. When the multi-branch network identification basic model converges after training, the target training state is reached, and the trained basic model is used to form the multi-branch network identification model Multiformer.
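For reference, the mean average precision can be computed from ranked retrieval results in the standard re-ID fashion; this sketch is generic prior-art evaluation, not a detail of the patent, and its names are illustrative:

```python
import numpy as np

def mean_average_precision(ranked_matches: list) -> float:
    """ranked_matches: one boolean np.ndarray per query, True where the k-th
    ranked gallery image shares the query's identity. Returns the mAP."""
    aps = []
    for m in ranked_matches:
        if not m.any():
            continue                                   # queries without matches are skipped
        hits = np.cumsum(m)                            # matches seen up to each rank
        precision_at_hit = hits[m] / (np.flatnonzero(m) + 1)
        aps.append(precision_at_hit.mean())
    return float(np.mean(aps))
```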
After the multi-branch network identification model Multiformer is obtained, unsupervised pedestrian re-identification requires a query picture R in order to find pedestrians with features similar to R in the picture set captured by the m cameras. As can be seen from the above description, all features of the picture set captured by the m cameras are extracted with the multi-branch network identification model Multiformer; the specific manner and process of extracting the picture features can refer to the above description: for any picture, after Split segmentation, linear projection, dimension transformation and the like, the features are extracted with the multi-camera-domain Interformer network.
After the query picture R is processed by the same technical means, its corresponding features are extracted with the multi-camera-domain Interformer network. After the features of the query picture R are obtained, the feature similarity with the features extracted from the picture set is calculated; the specific manner and process of calculating the feature similarity may follow existing common practice. After the feature similarities are calculated, the pedestrian images matching the extracted pedestrian features can be selected according to actual requirements; for example, a feature similarity threshold can be set, and all pedestrian images meeting the threshold are determined as pedestrian images matching the query picture R. The feature similarity threshold and the like can be chosen as needed to meet the requirements of the actual application scenario.
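The retrieval step can be sketched as cosine ranking of gallery features against the query feature with a similarity threshold; the cosine choice and all names are assumptions of this sketch:

```python
import numpy as np

def retrieve(query_feat: np.ndarray, gallery_feats: np.ndarray, sim_threshold: float = 0.5) -> np.ndarray:
    """Rank gallery images by cosine similarity to the query picture R and
    keep those above the feature similarity threshold.

    query_feat:    (N,) feature of R from the Interformer branch.
    gallery_feats: (num_images, N) features of the m-camera picture set.
    """
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q                                       # cosine similarity per gallery image
    matches = np.where(sims > sim_threshold)[0]
    return matches[np.argsort(-sims[matches])]         # most similar first
```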
To verify the accuracy and robustness of the present invention, experiments were performed on three public data sets: Market-1501, MSMT17 and DukeMTMC-reID. Specifically, the DukeMTMC-reID data set contains 36411 images of 1812 identities taken by 8 cameras, with a training set of 702 identities and 16522 images and a test set of 702 identities. The Market-1501 data set contains 1501 pedestrians shot by 6 cameras, with a training set of 751 identities containing 12936 images and a test set of 750 identities containing 19732 images. The MSMT17 data set contains 4101 pedestrians and 126441 bounding boxes taken by 15 cameras; its training set contains 1041 pedestrians with 32621 bounding boxes in total, and its test set contains 3060 pedestrians with 93820 bounding boxes in total.
Since these data sets are acquired by multiple imaging devices, they contain varied poses, viewing angles and illumination, along with heavily cluttered backgrounds and occlusions between people in different scenes, which makes them very challenging.
Table 1 data set introduction
Data set | Total categories | Training categories | Test categories | Picture size
DukeMTMC-reID | 1812 | 702 | 1110 | 256*128
Market-1501 | 1501 | 751 | 750 | 256*128
MSMT17 | 4101 | 1041 | 3060 | 256*128
Table 1 lists the total number of categories, the training categories and the test categories of the three data sets; the picture size can typically be set to 256*128.
Table 2 accuracy of model on three pedestrian re-recognition tasks
Data set | Market-1501 | DukeMTMC-reID | MSMT17
mAP | 79.1% | 68.9% | 36.0%
Table 2 shows the test results of the unsupervised pedestrian re-identification method on the three unsupervised pedestrian re-identification tasks of Market-1501, DukeMTMC-reID and MSMT17, with the mean average precision mAP as the evaluation index.
The invention obtains a high recognition rate on all three data sets. Although the three data sets present difficulties such as occlusion, deformation, background clutter and low resolution, the robust feature representation capability of the Multiformer proposed by the invention, the cluster representation optimization capability of the joint contrast learning strategy, and the efficient data utilization capability of the adaptive outlier sample reassignment strategy together give the method good robustness to these difficulties and excellent performance.
In order to verify the performance improvement contributed by the multi-branch network identification model Multiformer, the adaptive outlier sample reassignment strategy and the joint contrast learning strategy to the overall unsupervised pedestrian re-identification task, an ablation experiment was carried out on the Market-1501 data set, as shown in Table 3. Specifically, ViT is taken as the baseline network, Multiformer denotes the multi-branch network identification model, JCL denotes the joint contrast learning module Joint Contrast Learning, and AORA denotes the adaptive outlier sample reassignment strategy.
As can be seen from Table 3, on the Market-1501 unsupervised pedestrian re-identification task, the accuracy of the baseline network alone is only 59.6%. When the network model structure is changed from the baseline network to the multi-branch network identification model Multiformer, the accuracy reaches 69.2%, indicating that Multiformer improves the feature representation capability of the model.
After the cluster center features are established to perform joint contrast learning, the accuracy of the model reaches 77.1%, indicating that cluster-level contrast learning effectively lets the model learn the similarity within positive clusters and the difference between negative clusters. On this basis, after the adaptive outlier sample reassignment strategy is added, the accuracy of the model reaches 79.1%, indicating that this module makes fuller use of the limited data samples and thus trains the model more thoroughly.
Table 3 Influence of different modules on the Market-1501 unsupervised pedestrian re-recognition task
[Table 3 appears as an image in the original publication; it reports the Market-1501 mAP for the ViT baseline and for the successive additions of Multiformer, JCL and AORA discussed above.]
To better demonstrate the effects of the Multiformer, the adaptive outlier sample reassignment strategy and the joint contrast learning strategy designed in the present invention, the visualization results are presented in fig. 9.
In summary, the multi-branch network identification model Multiformer is constructed based on the Transformer network and comprises the single-camera-domain Intraformer networks and the multi-camera-domain Interformer network, with all single-camera-domain Intraformer networks sharing backbone network parameters. This enhances generalization capability, alleviates to a certain extent the inter-domain differences caused by the backgrounds, illumination and the like of different camera domains, improves the robustness of the model to noisy pseudo labels, and further improves the accuracy of unsupervised pedestrian re-identification.
Adaptive outlier sample reassignment expands the number of pseudo labels, thereby enhancing the feature representation capability of the multi-branch network identification model Multiformer. During model training, the joint learning of instance-level contrast learning and cluster-level contrast learning greatly improves the clustering accuracy and alleviates the noisy pseudo-label problem.

Claims (9)

1. An unsupervised pedestrian re-identification method based on multi-former and outlier sample re-distribution is characterized by comprising the following steps:
constructing a multi-branch network recognition model multi-former based on a trans-former network to perform required unsupervised pedestrian re-recognition on pedestrian images acquired by m cameras using the constructed multi-branch network recognition model multi-former, wherein,
identifying a model multi-former for the constructed multi-branch network, wherein the model multi-former comprises a single-camera domain intra-former network constructed based on a trans-former network for each camera and a multi-camera domain inter-former network constructed based on the trans-former network for all cameras;
when a multi-branch network identification model multi-former is constructed, the single-camera-domain intra-former networks and multi-camera-domain inter-former networks of all cameras adopt the same backbone network, and the single-camera-domain intra-former networks of all cameras share backbone network parameters during training;
when a pedestrian is re-identified, carrying out feature extraction on an identification image containing the pedestrian to be identified by utilizing the multi-camera-domain inter-former network, so as to find and determine, from the pedestrian images acquired by the m cameras, the pedestrian images matched with the extracted pedestrian features;
when constructing the multi-branch network identification model multi-former, the construction steps comprise:
constructing a multi-branch network identification basic model based on a trans-former network, wherein the multi-branch network identification basic model comprises a multi-camera domain basic network based on the trans-former network and m single-camera domain basic networks based on the trans-former network, and a classifier is configured in the multi-camera domain basic network and all the single-camera domain basic networks, and the configured classifier is adaptively connected with the multi-camera domain basic network or a corresponding backbone network in the single-camera domain basic network;
when a multi-branch network identification basic model is built, pre-training a backbone network for building a multi-camera domain basic network based on an ImageNet data set to obtain multi-camera domain backbone network pre-training parameters of the multi-camera domain basic network;
When training the constructed single-camera-domain basic network, the obtained multi-camera-domain backbone network pre-training parameters are loaded to backbone networks of all single-camera-domain basic networks, so that the single-camera-domain basic networks of all cameras share the network backbone parameters;
performing the required training on the constructed multi-branch network identification basic model, so as to form, when the target training state is reached, the corresponding single-camera-domain intra-former networks based on the trained single-camera-domain basic networks and the multi-camera-domain inter-former network based on the trained multi-camera-domain basic network;
and forming the multi-branch network identification model multi-former from the multi-camera-domain inter-former network and the m single-camera-domain intra-former networks.
2. The method for unsupervised pedestrian re-recognition based on multi-former and outlier sample re-distribution according to claim 1, wherein the training process comprises:
step 1, for a training data set, extracting features by using the multi-branch network identification basic model, so as to obtain the multi-camera-domain picture feature F_mc and the single-camera-domain picture feature F_c_i of the ith camera, i = 1, …, m;
step 2, clustering the obtained multi-camera-domain picture feature F_mc and the single-camera-domain picture feature F_c_i of the ith camera, wherein successfully clustered pictures form cluster points Inliers and are assigned cluster point pseudo labels, and unsuccessfully clustered pictures form Outliers;
step 3, generating a cluster point pseudo tag clustering center based on the cluster point pseudo tags, performing self-adaptive outlier sample reassignment on Outliers by using the generated cluster point pseudo tag clustering center, so that after the self-adaptive outlier samples are reassigned, corresponding cluster point pseudo tags are assigned to Outliers in the Outliers, and a pseudo tag training set is formed by using all the cluster point pseudo tags;
step 4, carrying out joint contrast learning on the multi-branch network identification basic model to carry out model network parameter optimization based on the joint contrast learning on the multi-branch network identification basic model, wherein,
for the ith single-camera-domain basic network, joint contrast learning is performed based on the training data set, the single-camera-domain picture feature F_c_i of the ith camera and the cluster point pseudo-label cluster centers;
for the multi-camera-domain basic network, joint contrast learning is performed based on the training data set, the multi-camera-domain picture feature F_mc and the cluster point pseudo-label cluster centers;
the joint contrast learning comprises cluster-level contrast learning and instance-level contrast learning;
step 5, carrying out collaborative training of the single-camera-domain basic networks and the multi-camera-domain basic network on the multi-branch network identification basic model optimized by joint contrast learning, wherein,
for the multi-camera-domain basic network, training is performed by using the multi-camera-domain picture feature F_mc and the pseudo label training set;
for the ith single-camera-domain basic network, training is performed by using the single-camera-domain picture feature F_c_i of the ith camera and the pseudo label training set;
and step 6, repeating the training processes of the steps 1 to 5 until reaching the target training state.
3. The unsupervised pedestrian re-recognition method based on multi-former and outlier sample re-distribution of claim 2, wherein for step 1, when the multi-camera-domain picture feature F_mc is extracted, any training picture in the training data set is subjected to Split processing, each image block obtained by the Split processing is connected with a parameter Cls token, and the position information of each image block and the camera information code of the training picture are embedded, so as to configure and form the multi-camera-domain feature extraction information of the training picture;
the multi-camera-domain feature extraction information of the training picture is processed by the multi-camera-domain basic network to extract the multi-camera-domain picture feature F_mc;
when the single-camera-domain picture feature F_c_i of the ith camera is extracted, the training pictures acquired by the ith camera are subjected to Split processing, each image block obtained by the Split processing is connected with a parameter Cls token, and the position information of each image block is embedded, so as to form the single-camera-domain feature extraction information of the training picture;
the single-camera-domain feature extraction information of the training picture is processed by the single-camera-domain basic network corresponding to the ith camera to extract the single-camera-domain picture feature F_c_i.
4. The unsupervised pedestrian re-recognition method based on multi-former and outlier sample re-distribution according to claim 2, wherein in step 2, when the obtained multi-camera-domain picture feature F_mc and all single-camera-domain picture features F_c_i are clustered, the clustering method includes the DBSCAN clustering method.
5. The unsupervised pedestrian re-recognition method based on multi-former and outlier sample re-distribution according to claim 2, wherein in step 3, for the cluster point pseudo-label cluster centers, there is:

\Phi_i = \frac{1}{Num_i} \sum_{j=1}^{Num_i} f_j

wherein Y is the number of categories of the cluster point pseudo labels, Φ_i is the cluster center feature of the i-th category, f_j is the feature of the j-th picture in the i-th category, and Num_i is the number of pictures contained in the i-th category;
the generated cluster point pseudo tag cluster center is stored in a cluster center feature repository Center Memory Bank;
an affinity matrix between the outlier samples within the outlier Outliers and the cluster point pseudo tag cluster center is calculated, wherein,
the affinity matrix between the Outliers and the cluster point pseudo-label cluster centers is:

AFM(i, j) = \frac{\sum_{r=1}^{N} \Phi_{i_r} O_{j_r}}{\sqrt{\sum_{r=1}^{N} \Phi_{i_r}^{2}} \sqrt{\sum_{r=1}^{N} O_{j_r}^{2}}}

wherein AFM(i, j) is the mutual similarity relation value between the i-th cluster center feature Φ_i and the j-th outlier sample in the affinity matrix AFM, O_j is the feature of the j-th outlier sample, Φ_{i_r} represents the r-th value of the i-th cluster center feature Φ_i, O_{j_r} represents the r-th value of the j-th outlier sample feature O_j, and N represents the feature dimension;
and when the self-adaptive outlier samples are redistributed based on the calculated affinity matrix AFM, the outlier samples are distributed to the cluster point pseudo-label cluster center with the strongest mutual similarity relationship.
6. The method for unsupervised pedestrian re-recognition based on multi-former and outlier sample re-distribution as set forth in claim 5, wherein a mutual similarity relation threshold v is configured for the mutual similarity relationship between outlier samples and the cluster point pseudo-label cluster centers, wherein:

[the schedule for v is given as an equation image in the original publication, as a function of v_start, γ, epoch and e_peak]

where Num_O is the number of outlier samples in the Outliers, v_start is the initial value of the mutual similarity relation threshold v, γ is the threshold decay rate, epoch is the training round, e_peak is the training round at which the threshold v reaches its peak, and II(·) is an indicator function that equals 1 when the training round is smaller than e_peak, i.e. II(·) = II{epoch < e_peak};
When the outlier samples are distributed based on the configured mutual similarity relation threshold v, the j-th outlier sample with the mutual similarity relation value AFM (i, j) larger than the mutual similarity relation threshold v is distributed to the cluster point pseudo label cluster center with the strongest mutual similarity relation.
7. The method for unsupervised pedestrian re-recognition based on multi-former and outlier sample re-distribution of claim 5, wherein in step 4, during joint contrast learning, the cluster contrast loss l_c is obtained after cluster-level contrast learning and the instance contrast loss l_t is obtained after instance-level contrast learning, wherein,

for the cluster contrast loss l_c, there is:

l_c = -\log \frac{\exp(f(q) \cdot \Phi_{+} / \Gamma)}{\sum_{i=1}^{Y} \exp(f(q) \cdot \Phi_{i} / \Gamma)}

wherein Φ_+ is a positive sample of the sample picture q, Γ is a set parameter, and f(q) is the query instance feature of the sample picture q;
the instance contrast loss l_t is:

l_t = \sum_{i=1}^{P} \sum_{a=1}^{K} \left[ \beta + \max_{p=1,\dots,K} \left\| f(x_a^i) - f(x_p^i) \right\|_2 - \min_{j \neq i,\; n=1,\dots,K} \left\| f(x_a^i) - f(x_n^j) \right\|_2 \right]_{+}

wherein P is the number of different pedestrians selected from a given sample, K is the number of sample pictures selected for each pedestrian in the given sample, a is one picture of the K sample pictures, x_a^i is an anchor picture with identity i, x_p^i is a positive sample with identity i, x_n^j is a negative sample with identity j, f(x_a^i) is the image feature extracted from the anchor picture x_a^i, and β is the minimum gap between the similarity of a positive sample pair and the similarity of a negative sample pair.
8. The method for unsupervised pedestrian re-recognition based on multi-former and outlier sample re-distribution as set forth in claim 7, wherein when the model network parameters based on joint contrast learning are optimized, the model network parameters θ are determined so as to minimize the loss function of NH training samples under the determined model network parameters θ;

in the optimizing process, the multi-camera-domain basic network and all the single-camera-domain basic networks are optimized simultaneously, with:

\min_{\theta} \sum_{a} \left[ \beta + \max_{p} \left\| f(x_a) - f(x_p) \right\|_2 - \min_{n} \left\| f(x_a) - f(x_n) \right\|_2 \right]_{+}

wherein f(x_a) is the image feature extracted from the anchor image x_a.
9. The method for unsupervised pedestrian re-recognition based on multi-former and outlier sample re-distribution of claim 7, wherein for the co-training identity loss, there is:

l_{id} = -\frac{1}{Nz} \sum_{i=1}^{Nz} \log p(\tilde{y}_i \mid x_i)

wherein l_id is the co-training identity loss, ỹ_i is the pseudo label of x_i, Nz is the number of training samples in the training data set, and p(ỹ_i | x_i) is the probability that the multi-branch network identification basic model outputs the true identity label ỹ_i for the training sample x_i.
CN202211404730.9A 2022-11-10 2022-11-10 Unsupervised pedestrian re-identification method based on multi-former and outlier sample re-distribution Active CN115601791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211404730.9A CN115601791B (en) 2022-11-10 2022-11-10 Unsupervised pedestrian re-identification method based on multi-former and outlier sample re-distribution

Publications (2)

Publication Number Publication Date
CN115601791A CN115601791A (en) 2023-01-13
CN115601791B true CN115601791B (en) 2023-05-02


Country Status (1)

Country Link
CN (1) CN115601791B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022043741A1 (en) * 2020-08-25 2022-03-03 商汤国际私人有限公司 Network training method and apparatus, person re-identification method and apparatus, storage medium, and computer program
CN115205570A (en) * 2022-09-14 2022-10-18 中国海洋大学 Unsupervised cross-domain target re-identification method based on comparative learning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10425641B2 (en) * 2013-05-30 2019-09-24 Intel Corporation Quantization offset and cost factor modification for video encoding
GB2548870B (en) * 2016-03-31 2018-12-05 Ekkosense Ltd Remote monitoring
CN111723645B (en) * 2020-04-24 2023-04-18 浙江大学 Multi-camera high-precision pedestrian re-identification method for in-phase built-in supervised scene
CN114663685B (en) * 2022-02-25 2023-07-04 江南大学 Pedestrian re-recognition model training method, device and equipment
CN114596589A (en) * 2022-03-14 2022-06-07 大连理工大学 Domain-adaptive pedestrian re-identification method based on interactive cascade lightweight transformations
CN114972794A (en) * 2022-06-15 2022-08-30 上海理工大学 Three-dimensional object recognition method based on multi-view Pooll transducer

Also Published As

Publication number Publication date
CN115601791A (en) 2023-01-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant