CN115601791B - Unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution


Info

Publication number
CN115601791B
Authority
CN
China
Prior art keywords
camera
network
domain
training
former
Prior art date
Legal status
Active
Application number
CN202211404730.9A
Other languages
Chinese (zh)
Other versions
CN115601791A (en)
Inventor
蒋敏
张千
孔军
陶雪峰
Current Assignee
Jiangnan University
Original Assignee
Jiangnan University
Priority date
Filing date
Publication date
Application filed by Jiangnan University
Priority to CN202211404730.9A
Publication of CN115601791A
Application granted
Publication of CN115601791B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/08: Learning methods
    • G06N3/088: Non-supervised learning, e.g. competitive learning
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74: Image or video pattern matching; proximity measures in feature spaces
    • G06V10/761: Proximity, similarity or dissimilarity measures
    • G06V10/762: Using clustering, e.g. of similar faces in social networks
    • G06V10/764: Using classification, e.g. of video objects
    • G06V10/82: Using neural networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Abstract

The invention relates to an unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution. The multi-branch network identification model Multiformer is built on a Transformer network and comprises single-camera-domain Intraformer networks and a multi-camera-domain Interformer network, and all the single-camera-domain Intraformer networks share backbone network parameters. This enhances generalization capability, alleviates to a certain extent the inter-domain differences caused by the background, illumination and the like of different camera domains, and improves the robustness of the model to noisy pseudo labels, thereby improving the accuracy of unsupervised pedestrian re-identification. Adaptive outlier sample re-distribution expands the number of pseudo labels, thereby enhancing the feature representation capability of the multi-branch network identification model Multiformer. When the model is trained, joint learning consisting of instance-level contrast learning and cluster-level contrast learning greatly improves clustering accuracy and alleviates the noisy pseudo label problem, effectively improving the accuracy and robustness of unsupervised pedestrian re-identification.

Description

Unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution
Technical Field
The invention relates to an unsupervised pedestrian re-identification method, in particular to an unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution.
Background
With extensive theoretical and practical research in computer vision, pedestrian re-identification has gradually become an important branch of the field; its purpose is to identify a target pedestrian across non-overlapping cameras. Pedestrian re-identification has a wide range of real-world applications, such as criminal searches, multi-camera tracking, missing-person searches, and the like.
At present, research on traditional pedestrian re-identification depends on large numbers of manually annotated images, which is inefficient and costly. Unsupervised pedestrian re-identification addresses this problem: the technique requires no additional annotation of pedestrian identities, and therefore has a much wider application space than traditional pedestrian re-identification.
Because of the diversity of objective environments and the subjective complexity of pedestrian behavior, unsupervised pedestrian re-identification still has many problems to be solved, mainly including: 1) the lack of true identity labels to supervise feature representation learning; without true identity labels, the model must first estimate pseudo identity labels for the training data. At present, similar images are mainly assigned the same labels through clustering or KNN search to generate pseudo labels for training, but if the estimated identities are incorrect, model learning is hindered. 2) Because of factors such as occlusion, differing viewing angles and background interference in pedestrian images, the estimated pseudo labels are noisy; the main task of a pedestrian re-identification model is to learn discriminative feature representations from different pedestrian images, so minimizing the influence of noisy pseudo labels while maximizing the discriminability of the model is also a major challenge of unsupervised pedestrian re-identification. 3) Pedestrian re-identification is in essence a multi-camera retrieval task; because of differences in background, viewing angle, lighting and the like among different cameras, how to fully learn pedestrian features that are invariant across cameras is also a problem to be solved.
In addition, traditional unsupervised pedestrian re-identification mainly adopts a CNN as the backbone network for feature extraction. A CNN can only process one local neighborhood at a time, its receptive field is limited, and it cannot capture global information well; moreover, the convolution and downsampling operations of a CNN cause considerable loss of detail and spatial information, so it cannot effectively meet the requirements of unsupervised pedestrian re-identification.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, and provides an unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution, which effectively improves the accuracy and robustness of unsupervised pedestrian re-identification.
According to the technical scheme provided by the invention, the unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution comprises the following steps:
constructing a multi-branch network recognition model Multiformer based on a Transformer network, so as to perform the required unsupervised pedestrian re-identification on pedestrian images acquired by m cameras using the constructed model, wherein,
the constructed multi-branch network recognition model Multiformer comprises a single-camera-domain Intraformer network constructed, based on a Transformer network, for each camera, and a multi-camera-domain Interformer network constructed, based on the Transformer network, for all cameras;
when the multi-branch network recognition model Multiformer is constructed, the single-camera-domain Intraformer networks of all cameras and the multi-camera-domain Interformer network adopt the same backbone network, and the single-camera-domain Intraformer networks of all cameras share backbone network parameters during training;
when pedestrians are re-identified, feature extraction is carried out on an identification image containing the pedestrian to be identified using the multi-camera-domain Interformer network, so as to find, according to the extracted pedestrian features, the pedestrian images matching those features among the pedestrian images acquired by the m cameras.
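As a concrete illustration of this retrieval step, the following is a minimal sketch of matching a query feature against gallery features; the function name, the tensor interface, and the use of cosine similarity as the matching measure are illustrative assumptions, not the patent's own code.

```python
import torch

def rank_gallery(query_feat: torch.Tensor, gallery_feats: torch.Tensor, topk: int = 10):
    """Rank gallery pedestrian images by cosine similarity to a query feature.

    query_feat:    (D,) feature of the identification image, e.g. from the
                   multi-camera-domain Interformer branch (hypothetical interface).
    gallery_feats: (G, D) features of the pedestrian images acquired by the m cameras.
    Returns the indices of the top-k most similar gallery images.
    """
    q = torch.nn.functional.normalize(query_feat.unsqueeze(0), dim=1)  # (1, D)
    g = torch.nn.functional.normalize(gallery_feats, dim=1)            # (G, D)
    sims = (q @ g.t()).squeeze(0)                                      # (G,)
    return sims.topk(min(topk, sims.numel())).indices

# usage: idx = rank_gallery(f_query, f_gallery); the matched images are gallery[idx]
```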
When constructing the multi-branch network identification model Multiformer, the construction steps comprise:
constructing a multi-branch network identification basic model based on a Transformer network, wherein the basic model comprises a multi-camera-domain base network based on the Transformer network and m single-camera-domain base networks based on the Transformer network; a classifier is configured in the multi-camera-domain base network and in each single-camera-domain base network, and each configured classifier is adaptively connected with the backbone network of the multi-camera-domain base network or of the corresponding single-camera-domain base network;
when the multi-branch network identification basic model is built, pre-training the backbone network used to build the multi-camera-domain base network on the ImageNet dataset, so as to obtain the multi-camera-domain backbone network pre-training parameters of the multi-camera-domain base network;
when constructing the single-camera-domain base networks, loading the obtained multi-camera-domain backbone network pre-training parameters into the backbone networks of all single-camera-domain base networks, so that the single-camera-domain base networks of all cameras share the backbone network parameters;
performing the required training on the constructed multi-branch network identification basic model, so that when the target training state is reached, the corresponding single-camera-domain Intraformer networks are formed from the trained single-camera-domain base networks, and the multi-camera-domain Interformer network is formed from the trained multi-camera-domain base network;
forming the multi-branch network identification model Multiformer from the multi-camera-domain Interformer network and the m single-camera-domain Intraformer networks.
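To make the branch layout concrete, here is a minimal PyTorch sketch of a model with one shared backbone, m single-camera classifier heads, and one multi-camera head. The use of nn.TransformerEncoder as a stand-in backbone, the depth, and all names are illustrative assumptions, not the patent's exact network.

```python
import torch
import torch.nn as nn

class MultiBranchReID(nn.Module):
    """Sketch: shared Transformer backbone + m single-camera branches + 1 multi-camera branch."""
    def __init__(self, num_cameras: int, feat_dim: int = 768, num_classes: int = 751):
        super().__init__()
        # One backbone shared by all branches (the patent shares backbone parameters).
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # depth is illustrative
        # One classifier per single-camera branch plus one for the multi-camera branch.
        self.single_heads = nn.ModuleList(nn.Linear(feat_dim, num_classes) for _ in range(num_cameras))
        self.multi_head = nn.Linear(feat_dim, num_classes)

    def forward(self, tokens: torch.Tensor, cam_id: int | None = None):
        feat = self.backbone(tokens)[:, 0]   # Cls-token feature, (B, D)
        if cam_id is None:                   # multi-camera-domain branch
            return feat, self.multi_head(feat)
        return feat, self.single_heads[cam_id](feat)
```

Because every branch calls the same `self.backbone`, the backbone parameters are shared by construction while each classifier head keeps its own parameters, matching the sharing scheme described above.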
When training the constructed multi-branch network identification basic model, the training process comprises the following steps:
step 1, performing feature extraction on a training data set by utilizing a multi-branch network identification basic model to obtain multi-camera domain picture features F mc Single-camera-domain picture feature F of the ith camera c_i ,i=1,…,m;
Step 2, regarding the obtained multi-camera domain picture characteristic F mc Single-camera-domain picture feature F of the ith camera c_i Clustering, wherein successfully clustered pictures form cluster points Inliers, and distributing cluster point pseudo labels to the pictures in the cluster points Inliers, and unsuccessfully clustered pictures form Outliers Outlers;
step 3, generating a cluster point pseudo tag clustering center based on the cluster point pseudo tags, performing self-adaptive outlier sample reassignment on Outliers by using the generated cluster point pseudo tag clustering center, so that after the self-adaptive outlier samples are reassigned, corresponding cluster point pseudo tags are assigned to Outliers in the Outliers, and a pseudo tag training set is formed by using all the cluster point pseudo tags;
Step 4, carrying out joint contrast learning on the multi-branch network identification basic model, so as to perform model network parameter optimization based on joint contrast learning, wherein,
for the i-th single-camera-domain base network, joint contrast learning is performed based on the training data set, the single-camera-domain picture feature F_c_i of the i-th camera, and the cluster point pseudo-label cluster centers;
for the multi-camera-domain base network, joint contrast learning is performed based on the training data set, the multi-camera-domain picture feature F_mc, and the cluster point pseudo-label cluster centers;
the joint contrast learning comprises cluster-level contrast learning and instance-level contrast learning;
Step 5, performing collaborative training of the single-camera-domain base networks and the multi-camera-domain base network on the multi-branch network identification basic model optimized by joint contrast learning, wherein,
the multi-camera-domain base network is trained using the multi-camera-domain picture feature F_mc and the pseudo label training set;
the i-th single-camera-domain base network is trained using the single-camera-domain picture feature F_c_i of the i-th camera and the pseudo label training set;
Step 6, repeating the training process of steps 1 to 5 until the target training state is reached.
For step 1, when extracting the multi-camera-domain picture feature F_mc, each training picture in the training data set is subjected to Split processing; a parameter Cls token is concatenated to the image block sequence obtained by the Split processing, and the position information of each image block and the camera information encoding of the training picture are embedded, to form the multi-camera-domain feature extraction information of the training picture;
the multi-camera-domain feature extraction information of the training picture is processed by the multi-camera-domain base network to extract the multi-camera-domain picture feature F_mc.
When extracting the single-camera-domain picture feature F_c_i of the i-th camera, the training pictures acquired by the i-th camera are subjected to Split processing; a parameter Cls token is concatenated to the image block sequence obtained by the Split processing, and the position information of each image block is embedded, to form the single-camera-domain feature extraction information of the training picture;
the single-camera-domain feature extraction information of the training picture is processed by the single-camera-domain base network corresponding to the i-th camera to extract the single-camera-domain picture feature F_c_i.
In step 2, when clustering the obtained multi-camera-domain picture feature F_mc and all single-camera-domain picture features F_c_i, the clustering method comprises the DBSCAN clustering method.
In step 3, the cluster point pseudo-label cluster centers are:

Φ_i = (1 / num_i) · Σ_{j=1}^{num_i} f_j,  i = 1, …, Y

wherein Y is the number of cluster point pseudo-label categories, Φ_i is the cluster center feature of the i-th class, f_j is the feature of the j-th picture in the i-th class, and num_i is the number of pictures contained in the i-th class;
the generated cluster point pseudo-label cluster centers are stored in a cluster center feature repository (Center Memory Bank);
an affinity matrix between the outlier samples in the Outliers and the cluster point pseudo-label cluster centers is calculated:

AFM(i, j) = ( Σ_{r=1}^{N} Φ_i_r · O_j_r ) / ( sqrt(Σ_{r=1}^{N} Φ_i_r²) · sqrt(Σ_{r=1}^{N} O_j_r²) )

where AFM(i, j) is the mutual similarity value between the i-th cluster center feature Φ_i and the j-th outlier sample in the affinity matrix AFM, O_j is the feature of the j-th outlier sample, Φ_i_r denotes the r-th value of the i-th cluster center feature Φ_i, O_j_r denotes the r-th value of the j-th outlier sample feature O_j, and N denotes the feature dimension;
when adaptive outlier sample reassignment is performed based on the calculated affinity matrix AFM, each outlier sample is assigned to the cluster point pseudo-label cluster center with which it has the strongest mutual similarity.
A mutual similarity threshold v is configured for the mutual similarity between an outlier sample and the cluster point pseudo-label cluster centers:

v = v(Num_O, v_start, γ, epoch, e_peak)   [adaptive schedule; equation image not reproduced]

where Num_O is the number of outlier samples in the Outliers, v_start is the initial value of the mutual similarity threshold v, γ is the threshold attenuation rate, epoch is the training round, e_peak denotes the training round at which the threshold v reaches its peak, and II(·) is an indicator function that equals 1 when the training round is smaller than e_peak, i.e. II(·) = II{epoch < e_peak};
when outlier samples are assigned based on the configured mutual similarity threshold v, the j-th outlier sample whose mutual similarity value AFM(i, j) is larger than the threshold v is assigned to the cluster point pseudo-label cluster center with the strongest mutual similarity.
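A minimal sketch of this reassignment rule follows, assuming the cosine-similarity form of the affinity matrix reconstructed above and a caller-supplied threshold v; names and shapes are illustrative.

```python
import numpy as np

def reassign_outliers(centers: np.ndarray, outliers: np.ndarray, v: float):
    """Adaptive outlier sample reassignment sketch.

    centers:  (Y, N) cluster point pseudo-label cluster centers.
    outliers: (M, N) features of the outlier samples.
    v:        mutual similarity threshold.
    Returns, per outlier sample, the pseudo label of the most similar center,
    or -1 if its best mutual similarity does not exceed v.
    """
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    o = outliers / np.linalg.norm(outliers, axis=1, keepdims=True)
    afm = c @ o.T                 # (Y, M) affinity matrix of mutual similarity values
    best = afm.argmax(axis=0)     # strongest-related cluster center per outlier sample
    best_sim = afm.max(axis=0)
    return np.where(best_sim > v, best, -1)
```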
In step 4, during joint contrast learning, the cluster contrast loss l_c is obtained after cluster-level contrast learning, and the instance contrast loss l_t is obtained after instance-level contrast learning, wherein,
for the cluster contrast loss l_c:

l_c = -log( exp(f(q) · Φ_+ / Γ) / Σ_{i=1}^{Y} exp(f(q) · Φ_i / Γ) )

wherein Φ_+ is the cluster center of the positive class for sample picture q, Γ is a set temperature parameter, and f(q) is the query instance feature of sample picture q;
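A minimal sketch of this cluster-level loss follows, implementing the softmax form of l_c via cross-entropy; the L2 normalization of features and centers is an assumption, not stated in the original.

```python
import torch
import torch.nn.functional as F

def cluster_contrast_loss(f_q: torch.Tensor, centers: torch.Tensor,
                          pos_idx: torch.Tensor, gamma: float = 0.05):
    """Cluster-level contrast loss sketch (softmax over cluster centers).

    f_q:     (B, D) query instance features f(q).
    centers: (Y, D) cluster centers from the Center Memory Bank.
    pos_idx: (B,) index of each query's positive cluster.
    gamma:   temperature, standing in for the parameter Γ above.
    """
    logits = F.normalize(f_q, dim=1) @ F.normalize(centers, dim=1).t() / gamma  # (B, Y)
    return F.cross_entropy(logits, pos_idx)  # equals -log softmax at the positive cluster
```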
for the instance contrast loss l_t:

l_t = Σ_{i=1}^{P} Σ_{a=1}^{K} max( β + max_{p=1,…,K} ‖f(x_a^i) - f(x_p^i)‖ - min_{j≠i, n=1,…,K} ‖f(x_a^i) - f(x_n^j)‖, 0 )

wherein P is the number of different pedestrians selected in a given sample batch, K is the number of sample pictures selected for each pedestrian, a indexes one picture among the K sample pictures, x_a^i is an anchor picture with identity i, x_p^i is a positive sample with identity i, x_n^j is a negative sample with identity j, f(x_a^i) and f(x_n^j) are the image features extracted from x_a^i and x_n^j respectively, and β is the minimum gap (margin) between the similarity of a positive sample pair and that of a negative sample pair.
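A minimal batch-hard sketch of this instance-level loss for a P×K batch follows; the batch-hard mining strategy and the averaging over the batch are assumptions consistent with the formula above.

```python
import torch

def instance_contrast_loss(feats: torch.Tensor, labels: torch.Tensor, beta: float = 0.3):
    """Batch-hard triplet-style sketch for a P×K batch (P identities, K images each).

    feats:  (P*K, D) image features f(x).
    labels: (P*K,) pseudo identity labels.
    beta:   margin between positive-pair and negative-pair distances.
    """
    dist = torch.cdist(feats, feats)                   # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # (B, B) positive-pair mask
    hardest_pos = (dist * same.float()).max(dim=1).values        # farthest positive
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values  # closest negative
    return torch.clamp(beta + hardest_pos - hardest_neg, min=0).mean()
```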
When optimizing the model network parameters based on joint contrast learning, the model network parameters θ are determined so as to minimize the loss function over the N_H training samples under the determined parameters θ, wherein the multi-camera-domain base network and all single-camera-domain base networks are optimized simultaneously:

θ* = argmin_θ Σ_{a=1}^{N_H} ( l_c(f(x_a)) + l_t(f(x_a)) )

where f(x_a) is the image feature extracted from the anchor image x_a.
For the co-training identity loss, there is:

l_id = -(1 / N_z) · Σ_{i=1}^{N_z} log p(ỹ_i | x_i)

wherein l_id is the co-training identity loss, ỹ_i is the pseudo label of x_i, N_z is the number of training samples in the training data set, and p(ỹ_i | x_i) is the probability that the multi-branch network identification basic model outputs ỹ_i as the identity label of training sample x_i.
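A minimal sketch, assuming the identity loss is the standard cross-entropy between the classifier outputs and the cluster point pseudo labels:

```python
import torch
import torch.nn.functional as F

def cotraining_identity_loss(logits: torch.Tensor, pseudo_labels: torch.Tensor):
    """Co-training identity loss sketch: cross-entropy against pseudo labels.

    logits:        (Nz, num_classes) classifier outputs for the Nz training samples.
    pseudo_labels: (Nz,) cluster point pseudo labels assigned above.
    """
    return F.cross_entropy(logits, pseudo_labels)  # averages -log p(y_i | x_i) over samples
```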
The invention has the following advantages: the multi-branch network identification model Multiformer is constructed based on a Transformer network and comprises single-camera-domain Intraformer networks and a multi-camera-domain Interformer network, with all single-camera-domain Intraformer networks sharing backbone network parameters. This enhances generalization capability, alleviates to a certain extent the inter-domain differences caused by the background, illumination and the like of different camera domains, improves the robustness of the model to noisy pseudo labels, and further improves the accuracy of unsupervised pedestrian re-identification.
Adaptive outlier sample re-distribution expands the number of pseudo labels, thereby enhancing the feature representation capability of the multi-branch network identification model Multiformer. During model training, the joint learning of instance-level contrast learning and cluster-level contrast learning greatly improves clustering accuracy and alleviates the noisy pseudo label problem, thereby effectively improving the accuracy and robustness of unsupervised pedestrian re-identification.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a flowchart of an embodiment of constructing a multi-branch network identification model according to the present invention.
Fig. 3 is a schematic diagram of an embodiment of a multi-branch network identification model of the present invention.
Fig. 4 is a schematic diagram of an embodiment of the single-camera-domain Intraformer network and the multi-camera-domain Interformer network of the present invention.
Fig. 5 is a diagram of the visual effect of the multi-branch network according to the present invention.
FIG. 6 is a diagram of one embodiment of the present invention for counting the distribution of Outliers after clustering.
FIG. 7 is a schematic diagram of adaptive outlier sample reassignment according to the present invention.
FIG. 8 is a schematic diagram of joint contrast learning in the present invention.
Fig. 9 is a visualization of the comparative experiments of the present invention.
Detailed Description
The invention will be further described with reference to the following specific drawings and examples.
In order to effectively improve the accuracy and robustness of unsupervised pedestrian re-identification, in one embodiment of the present invention, the unsupervised pedestrian re-identification method based on Multiformer and outlier sample re-distribution comprises:
constructing a multi-branch network recognition model Multiformer based on a Transformer network, so as to perform the required unsupervised pedestrian re-identification on pedestrian images acquired by m cameras using the constructed model, wherein,
the constructed multi-branch network recognition model Multiformer comprises a single-camera-domain Intraformer network constructed, based on a Transformer network, for each camera, and a multi-camera-domain Interformer network constructed, based on the Transformer network, for all cameras;
when the multi-branch network recognition model Multiformer is constructed, the single-camera-domain Intraformer networks of all cameras and the multi-camera-domain Interformer network adopt the same backbone network, and the single-camera-domain Intraformer networks of all cameras share backbone network parameters during training;
when pedestrians are re-identified, feature extraction is carried out on an identification image containing the pedestrian to be identified using the multi-camera-domain Interformer network, so as to find, according to the extracted pedestrian features, the pedestrian images matching those features among the pedestrian images acquired by the m cameras.
Fig. 1 shows an implementation flow chart of the unsupervised pedestrian re-identification. To implement it, a multi-branch network recognition model Multiformer based on a Transformer network needs to be built; that is, the model Multiformer is built on a Transformer network. The scene range of pedestrian image acquisition is determined by the m cameras, which forms the pedestrian re-identification area of the multi-branch network recognition model Multiformer; the built model can then perform unsupervised pedestrian re-identification on the pedestrian images acquired by the m cameras. A camera here is any device capable of acquiring pedestrian images, such as a commonly used video camera; the specific type and number of cameras can be selected as needed to meet the requirements of unsupervised pedestrian re-identification. In addition, the m cameras are generally installed in different areas, so pedestrian images in the scenes of m different areas can be acquired with them.
In order to improve the accuracy and robustness of unsupervised pedestrian re-identification, in an embodiment of the present invention the multi-branch network recognition model Multiformer needs to include a single-camera-domain Intraformer network constructed, based on a Transformer network, for each camera and a multi-camera-domain Interformer network constructed, based on the Transformer network, for all cameras, where the single-camera domain specifically refers to the acquisition range of one camera collecting pedestrian images, and the multi-camera domain refers to the acquisition range of all m cameras. Because the single-camera-domain Intraformer networks and the multi-camera-domain Interformer network are constructed based on a Transformer network, the characteristics of the Transformer network allow global information and picture details to be captured better, enhancing the utilization of globally effective information.
Because the unsupervised pedestrian re-identification task involves many different camera domains, and pedestrian pictures shot by different cameras exhibit large inter-domain differences due to interference from external factors such as angle, background and illumination, in one embodiment of the invention the single-camera-domain Intraformer networks of all cameras adopt the same backbone network and share its backbone parameters. This enhances the generalization capability of the multi-branch network recognition model Multiformer, alleviates to a certain extent the inter-domain differences caused by the background, illumination and the like of different camera domains, improves robustness to noisy pseudo labels, and further improves the accuracy of unsupervised pedestrian re-identification.
Fig. 5 is a T-SNE diagram plotted on the public Market-1501 dataset. Panel (a) shows the feature distribution obtained without the multi-branch network recognition model Multiformer proposed by the present invention, and panel (b) shows the distribution of features extracted by the multi-branch network recognition model Multiformer. Dots of the same color represent the same camera; the Market-1501 dataset contains photos from 6 cameras, so there are 6 colors in the figure. In panel (a), due to the influence of domain differences between cameras, image features from the same camera have higher similarity, which means the network's attention is not on the pedestrians but is disturbed by noise. In panel (b), the image features of each camera are uniformly distributed; it can be seen that after the multi-branch network recognition model Multiformer is introduced, the domain differences among cameras are significantly alleviated.
In one embodiment of the present invention, when constructing the multi-branch network identification model Multiformer, the construction steps comprise:
constructing a multi-branch network identification basic model based on a Transformer network, wherein the basic model comprises a multi-camera-domain base network based on the Transformer network and m single-camera-domain base networks based on the Transformer network; a classifier is configured in the multi-camera-domain base network and in each single-camera-domain base network, and each configured classifier is adaptively connected with the backbone network of the multi-camera-domain base network or of the corresponding single-camera-domain base network;
when the multi-branch network identification basic model is built, pre-training the backbone network used to build the multi-camera-domain base network on the ImageNet dataset, so as to obtain the multi-camera-domain backbone network pre-training parameters of the multi-camera-domain base network;
when constructing the single-camera-domain base networks, loading the obtained multi-camera-domain backbone network pre-training parameters into the backbone networks of all single-camera-domain base networks, so that the single-camera-domain base networks of all cameras share the backbone network parameters;
performing the required training on the constructed multi-branch network identification basic model, so that when the target training state is reached, the corresponding single-camera-domain Intraformer networks are formed from the trained single-camera-domain base networks, and the multi-camera-domain Interformer network is formed from the trained multi-camera-domain base network;
forming the multi-branch network identification model Multiformer from the multi-camera-domain Interformer network and the m single-camera-domain Intraformer networks.
As can be seen from the above description, since the multi-branch network identification model Multiformer comprises single-camera-domain Intraformer networks and a multi-camera-domain Interformer network, the multi-branch network identification basic model is constructed to include at least m single-camera-domain base networks for forming the single-camera-domain Intraformer networks and a multi-camera-domain base network for forming the multi-camera-domain Interformer network. That is, the m single-camera-domain base networks correspond one-to-one to the m single-camera-domain Intraformer networks finally formed, and the multi-camera-domain base network corresponds to the multi-camera-domain Interformer network.
In one embodiment of the invention, the single-camera-domain base networks and the multi-camera-domain base network all use the same backbone network; for example, they all use the Encoder of a Transformer network as the backbone network. In addition, a classifier is configured in the multi-camera-domain base network and in all single-camera-domain base networks, forming a multi-branch classifier.
Fig. 3 shows the architecture of the multi-camera-domain Interformer network and the single-camera-domain Intraformer networks formed after reaching the target training state; since only the corresponding network parameters are optimized during training, the architecture of the constructed single-camera-domain and multi-camera-domain base networks can also be read from fig. 3.
In fig. 3 and fig. 4, for the multi-camera-domain Interformer network, Split cuts the input picture to obtain a number of image blocks. Linear Projection of Flattened Patches denotes linear projection and dimension transformation, Embedding denotes the embedded data, and Feature Extraction denotes feature extraction, during which E_mc, Block and Token are obtained in sequence. In fig. 4, Branch-1 to Branch-m are the m single-camera-domain Intraformer networks.
Affinity Matrix is the affinity matrix, Pseudo Label is the pseudo label, AORA is the adaptive outlier sample reassignment strategy, Joint Contrastive Learning (JCL) is the joint contrast learning, MLP is the multilayer perceptron, and Classifier is the classifier. Joint Contrastive Learning (JCL) comprises Instance Contrastive Learning (instance-level contrast learning) and Cluster Contrastive Learning (cluster-level contrast learning).
Fig. 4 shows an implementation of the backbone network corresponding to the multi-camera-domain Interformer and single-camera-domain Intraformer networks; the backbone network in fig. 4 includes the above-mentioned Split, linear projection and dimension transformation processing, and the specific manner of forming the backbone network based on a Transformer is consistent with the prior art.
In one embodiment of the invention, the Classifier is adaptively connected with the backbone network through the MLP, whereby the information added by the Classifier can be determined. In a specific implementation, all classifiers adopt the same form and can be initialized with a normal distribution. After the constructed multi-branch network identification basic model is trained to the target state, the corresponding classifiers are obtained respectively.
For the single-camera-domain Intraformer networks, since they adopt the same backbone network as the multi-camera-domain Interformer network, the specific case of the m single-camera-domain Intraformer networks in fig. 3 can refer to the corresponding description of the multi-camera-domain Interformer network and is not repeated here.
In order to realize sharing of the backbone network parameters, in one embodiment of the invention the backbone network in the multi-camera-domain base network is pre-trained on the ImageNet dataset to obtain the multi-camera-domain backbone network pre-training parameters of the multi-camera-domain base network;
when the single-camera-domain base networks are constructed, the obtained multi-camera-domain backbone network pre-training parameters are loaded into the backbone networks of all single-camera-domain base networks, so that the single-camera-domain base networks of all cameras share the backbone network parameters.
In a specific implementation, ImageNet is an existing public dataset, and the manner and process of pre-training the backbone network of the multi-camera-domain base network with the ImageNet dataset are consistent with existing methods. After the multi-camera-domain backbone network pre-training parameters are obtained by pre-training, a Classifier is added to the multi-camera-domain base network. Likewise, after the pre-training parameters are loaded into each single-camera-domain base network, a Classifier is added to each single-camera-domain base network.
The Classifier can take an existing common classification form; the manner of adding the Classifier and its specific form can be selected according to actual needs, provided the classification requirements are met. After all Classifiers are added, the construction of the multi-branch network identification basic model is complete, and the model then needs to be trained.
As can be seen from the above description, the backbone networks of the single-camera-domain base networks constructed, based on the Transformer network, for each camera share network parameters, but the corresponding Classifier parameters are not shared. In implementation, after the backbone networks of the single-camera-domain base networks share network parameters, the corresponding backbone network parameters are essentially consistent.
In one embodiment of the present invention, the required training is performed on the multi-branch network identification basic model constructed as described above; specifically, after the single-camera-domain base networks sharing backbone parameters are configured for all cameras, the obtained multi-branch network identification basic model is trained until the target training state is reached.
When training the constructed multi-branch network identification basic model, the training process comprises:
Step 1, performing feature extraction on a training data set using the multi-branch network identification basic model, to obtain the multi-camera-domain picture feature F_mc and the single-camera-domain picture feature F_c_i of the i-th camera, i = 1, …, m;
Step 2, clustering the obtained multi-camera-domain picture feature F_mc and the single-camera-domain picture features F_c_i of each camera, wherein successfully clustered pictures form cluster points (Inliers) and are assigned cluster point pseudo labels, and unsuccessfully clustered pictures form Outliers;
Step 3, generating cluster point pseudo-label cluster centers based on the cluster point pseudo labels, and performing adaptive outlier sample reassignment on the Outliers using the generated cluster centers, so that after the adaptive reassignment the outlier samples in the Outliers are assigned corresponding cluster point pseudo labels; a pseudo label training set is formed from all the cluster point pseudo labels;
Step 4, carrying out joint contrast learning on the multi-branch network identification basic model, so as to perform model network parameter optimization based on joint contrast learning, wherein,
for the i-th single-camera-domain base network, joint contrast learning is performed based on the training data set, the single-camera-domain picture feature F_c_i of the i-th camera, and the cluster point pseudo-label cluster centers;
for the multi-camera-domain base network, joint contrast learning is performed based on the training data set, the multi-camera-domain picture feature F_mc, and the cluster point pseudo-label cluster centers;
the joint contrast learning comprises cluster-level contrast learning and instance-level contrast learning;
Step 5, performing collaborative training of the single-camera-domain base networks and the multi-camera-domain base network on the multi-branch network identification basic model optimized by joint contrast learning, wherein,
the multi-camera-domain base network is trained using the multi-camera-domain picture feature F_mc and the pseudo label training set;
the i-th single-camera-domain base network is trained using the single-camera-domain picture feature F_c_i of the i-th camera and the pseudo label training set;
during collaborative training, the network parameters of the multi-branch network identification basic model are optimized using the calculated co-training identity loss;
Step 6, repeating the training process of steps 1 to 5 until the target training state is reached.
Fig. 2 shows an embodiment of the training process for the multi-branch network identification basic model. During training, the steps of feature extraction, clustering to generate partial pseudo labels and outliers, adaptive outlier sample assignment, joint contrast learning, and multi-branch network collaborative training are generally required. The termination condition is model convergence: when the model is judged to have converged after training, training terminates and the target training state is reached; otherwise, training is repeated. In a specific implementation, the model is judged to be in a converged state when, during training, the accuracy of the model no longer increases and the model loss no longer decreases.
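The alternation of steps 1 to 6 can be summarized in the following orchestration sketch; every helper named here is hypothetical (several are sketched elsewhere in this document), not patent code, and the convergence test stands for the accuracy/loss plateau just described.

```python
# Orchestration sketch of steps 1-6; all helper names are hypothetical.
def train_multiformer(model, dataset, optimizer, max_epochs):
    for epoch in range(max_epochs):
        feats = extract_features(model, dataset)                     # step 1
        labels, centers, outlier_feats = cluster_with_dbscan(feats)  # step 2
        v = similarity_threshold(epoch)                              # adaptive threshold
        labels = apply_reassignment(                                 # step 3 (AORA)
            labels, reassign_outliers(centers, outlier_feats, v))
        for batch in make_pk_batches(dataset, labels):               # steps 4 and 5
            feat, logits = model(batch.tokens, batch.cam_id)
            loss = (cluster_contrast_loss(feat, centers, batch.pos_idx)  # joint contrast
                    + instance_contrast_loss(feat, batch.labels)
                    + cotraining_identity_loss(logits, batch.labels))    # co-training
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if converged():  # accuracy no longer rises and loss no longer falls
            break        # target training state reached (step 6)
```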
The training procedure is described in detail below.
Specifically, during training, a training data set needs to be provided or configured, wherein the training data set is composed of images shot and collected by the m cameras; the size of the training data set can be selected according to actual needs to meet the training requirements.
In one embodiment of the present invention, for step 1, when extracting the multi-camera-domain picture feature F_mc, each training picture in the training data set is subjected to Split processing; a parameter Cls token is concatenated to the image block sequence obtained by the Split processing, and the position information of each image block and the camera information of the training picture are encoded, to form the multi-camera-domain feature extraction information of the training picture;
the multi-camera-domain feature extraction information of the training picture is processed by the multi-camera-domain base network to extract the multi-camera-domain picture feature F_mc.
When extracting the single-camera-domain picture feature F_c_i of the i-th camera, the training pictures acquired by the i-th camera are subjected to Split processing; a parameter Cls token is concatenated to the image block sequence obtained by the Split processing, and the position information of each image block is embedded, to form the single-camera-domain feature extraction information of the training picture;
the single-camera-domain feature extraction information of the training picture is processed by the single-camera-domain base network corresponding to the i-th camera to extract the single-camera-domain picture feature F_c_i.
In particular, the training data input is the set X_mc ∈ R^{B×C×H×W} of pictures from all cameras (i.e., the pictures collected by the m cameras), where H×W is the resolution of the input picture, C is the number of channels (for RGB pictures, C = 3), and B is the batch size, which can be chosen according to the actual application scenario. The input picture is divided (Split) and the spatial dimensions flattened to obtain the image block (Patch) sequence

X_p ∈ R^{B×N×(P_h·P_w·C)}

where N is the number of Patches obtained by segmentation and P_h×P_w is the size of each cut image block Patch.
The Patch sequence X_p is linearly projected and dimensionally transformed to obtain the Patch encoding E_mc ∈ R^{B×N×D}, where D is the generated feature dimension. A parameter Cls token representing the global feature is concatenated to the Patch encoding E_mc, and the position encoding and camera information encoding are embedded, to obtain E_mc_cls ∈ R^{B×N′×D}. After training, the parameter Cls token contains a feature representation of the input picture for classification. The Cls token has size R^{B×1×D}; after it is concatenated to the Patch encoding E_mc, the Patch-number dimension N increases by 1 (N′ = N + 1).
The parameter Cls token is a learnable parameter of size R^{B×1×D}. The position encoding is the position information of each divided image block Patch in the original picture; the camera information encoding is formed from the camera number of the picture. Both have size R^{1×N′×D} and are initialized to 0.
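The tensor bookkeeping above can be made concrete with a short PyTorch sketch; the image and patch sizes, the convolutional implementation of Split plus linear projection, and the per-camera encoding table are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchCamEmbed(nn.Module):
    """Sketch of Split + linear projection + Cls token + position/camera encodings."""
    def __init__(self, num_cameras: int, img_hw=(256, 128), patch_hw=(16, 16),
                 in_ch: int = 3, dim: int = 768):
        super().__init__()
        n = (img_hw[0] // patch_hw[0]) * (img_hw[1] // patch_hw[1])  # N patches
        # Strided convolution performs Split and linear projection in one step.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_hw, stride=patch_hw)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))            # learnable Cls token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))        # position encoding, init 0
        self.cam = nn.Parameter(torch.zeros(num_cameras, 1, dim))  # camera information encoding

    def forward(self, x: torch.Tensor, cam_id: torch.Tensor):
        # x: (B, C, H, W) pictures; cam_id: (B,) camera indices.
        e = self.proj(x).flatten(2).transpose(1, 2)                # (B, N, D) Patch encoding E
        e = torch.cat([self.cls.expand(e.size(0), -1, -1), e], 1) # prepend Cls -> (B, N+1, D)
        return e + self.pos + self.cam[cam_id]                     # embed position + camera info
```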
E_mc_cls is fed into the Block network of the Transformer network, and the Block network processes the multi-camera-domain feature extraction information of the training picture to extract the multi-camera-domain picture feature F_mc; the extracted feature F_mc is the Token generated by the Interformer network in fig. 3 and fig. 4.
The input training pictures are classified according to their camera labels and fed into the corresponding single-camera-domain base networks; for example, the training pictures fed into the single-camera-domain base network corresponding to the i-th camera are the pictures collected by the i-th camera. Specifically, the input data of each single-camera-domain base network is the set X_c_i ∈ R^{B×C×H×W} of single-camera pictures, where c_i denotes the i-th camera. After the same segmentation and dimension transformation as in the multi-camera-domain base network, the image block Patch encoding E_c_i ∈ R^{B×N×D} is obtained. A parameter Cls token is concatenated to the Patch encoding E_c_i, into which the position encoding of each image block Patch is also embedded. Subsequently, the Patch encoding is fed into the Block network of the Transformer network to extract the single-camera-domain picture feature F_c_i; the extracted feature F_c_i is the Token generated by the Intraformer network in fig. 3 and fig. 4.
Fig. 4 shows an embodiment in which the Block network processes E_mc_cls together with the position information and camera information to obtain Token; the method and process by which the Block network obtains Token are similar to Block processing in an existing Transformer network and are not described in detail here.
In the multi-camera-domain base network used to form the multi-camera-domain Interformer network and the single-camera-domain base networks used to form the single-camera-domain Intraformer networks, the parameters of the Block network are determined by the above steps, so picture features can be extracted directly with the Block network; the specifics of the Block network are well known to those skilled in the art and are not described here.
In one embodiment of the present invention, in step 2, when clustering the obtained multi-camera-domain picture feature F_mc and all single-camera-domain picture features F_c_i, the clustering method comprises the DBSCAN clustering method.
In the clustering process, some extracted image features are disturbed by noise such as pedestrian pose and background, so they lie far from any cluster center point and cannot be clustered successfully; such samples are called outlier samples, and all outlier samples form the Outliers. In one embodiment of the invention, unsupervised pedestrian re-identification relies on pseudo labels derived from clustering for collaborative training, whereas outlier samples, lacking labels, cannot be utilized in training.
In a specific implementation, the clustering method can adopt the DBSCAN clustering method, which does not require the number of clusters to be specified and can learn the number of cluster categories autonomously. After clustering, cluster point pseudo labels are assigned to the successfully clustered pictures, and the unsuccessfully clustered pictures form the Outliers. Of course, other common clustering forms can also be adopted, provided the actual clustering requirements are met. When the DBSCAN clustering method is adopted, the specific conditions under which clustering forms the cluster points (Inliers) and the Outliers can be selected according to actual needs.
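A minimal sketch of this step with scikit-learn's DBSCAN follows; the eps and min_samples values and the use of plain cosine distance (re-ID pipelines often use re-ranked Jaccard distance instead) are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def pseudo_labels_dbscan(feats: np.ndarray, eps: float = 0.6, min_samples: int = 4):
    """Clustering sketch: DBSCAN pseudo labels from picture features.

    feats: (Nz, D) L2-normalized picture features.
    Returns labels where >= 0 marks a cluster point (Inliers) and -1 marks Outliers.
    """
    dist = np.clip(1.0 - feats @ feats.T, 0.0, None)  # cosine distance matrix
    return DBSCAN(eps=eps, min_samples=min_samples,
                  metric="precomputed").fit_predict(dist)

# Pictures with label >= 0 receive cluster point pseudo labels;
# pictures with label == -1 await adaptive outlier reassignment (step 3).
```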
Fig. 6 is a statistic of the number of outlier samples of the pedestrian re-identification dataset Market-1501 after DBSCAN clustering: outlier samples occupy more than 60% of all samples at the initial stage of training, and still occupy more than 10% after multiple iterations of the model. Compared with a convolutional neural network, a Transformer network has fewer inductive biases about the structure of the input data, such as locality and translation invariance, so more data is needed to train a Transformer network, especially at the beginning of model training. For better results, the outlier samples need to be fully utilized.
In step 3, in one embodiment of the present invention, the cluster point pseudo-label cluster centers are:

Φ_i = (1 / num_i) · Σ_{j=1}^{num_i} f_j,  i = 1, …, Y

wherein Y is the number of cluster point pseudo-label categories, Φ_i is the cluster center feature of the i-th class, f_j is the feature of the j-th picture in the i-th class, and num_i is the number of pictures contained in the i-th class;
the generated cluster point pseudo-label cluster centers are stored in a cluster center feature repository (Center Memory Bank);
an affinity matrix between the outlier samples in the Outliers and the cluster point pseudo-label cluster centers is calculated:

AFM(i, j) = ( Σ_{r=1}^{N} Φ_i_r · O_j_r ) / ( sqrt(Σ_{r=1}^{N} Φ_i_r²) · sqrt(Σ_{r=1}^{N} O_j_r²) )

where AFM(i, j) is the mutual similarity value between the i-th cluster center feature Φ_i and the j-th outlier sample, O_j is the feature of the j-th outlier sample, Φ_i_r denotes the r-th value of Φ_i, O_j_r denotes the r-th value of O_j, and N denotes the feature dimension;
when adaptive outlier sample reassignment is performed based on the calculated affinity matrix AFM, each outlier sample is assigned to the cluster point pseudo-label cluster center with which it has the strongest mutual similarity.
In a specific implementation, after clustering with the DBSCAN clustering method, the number of cluster point pseudo-label categories Y can be obtained from the formed cluster points (Inliers); likewise, the feature f_j of the j-th picture in the i-th class and the number of pictures num_i contained in the i-th class can be obtained. Thus, after clustering, the cluster point pseudo-label cluster centers can be generated as {Φ_1, Φ_2, …, Φ_i, …, Φ_Y}.
The feature dimension N is associated with the constructed multi-branch network identification model Multiformer, and for a given model the feature dimension N remains fixed; the i-th cluster center feature Φ_i therefore has the same feature dimension as the j-th outlier sample. The use of the affinity matrix AFM is consistent with the prior art; that is, based on the affinity matrix AFM, the mutual similarity value between a given outlier sample and a cluster center feature can be obtained.
As described above, the constructed multi-branch network identification basic model generally needs to be trained multiple times before convergence is determined. In one embodiment of the invention, the cluster center feature repository (Center Memory Bank) is used to store the cluster point pseudo-label cluster centers in each training round.
After the cluster point pseudo-label cluster centers are stored in the Center Memory Bank, the affinity matrix between the outlier samples in the Outliers and the cluster centers can be calculated, so that adaptive outlier sample re-distribution is performed based on the calculated affinity matrix. This expands the amount of data for training the model and enhances the feature representation capability of the model, achieving better performance.
When calculating the mutual similarity value AFM(i, j) between the i-th cluster center feature Φ_i and the j-th outlier sample in the affinity matrix AFM, the specific feature dimension N is determined by the i-th cluster center feature Φ_i. In a specific implementation, during adaptive outlier sample reassignment, the j-th outlier sample is assigned to the cluster point pseudo-label cluster center with the strongest mutual similarity; the strongest mutual similarity means that, for the j-th outlier sample, the corresponding mutual similarity value AFM(i, j) with the i-th cluster center feature Φ_i is the largest.
In one embodiment of the invention, a mutual similarity threshold v is configured for the mutual similarity between an outlier sample and the cluster point pseudo-label cluster centers:

v = v(Num_O, v_start, γ, epoch, e_peak)   [adaptive schedule; equation image not reproduced]

where Num_O is the number of outlier samples in the Outliers, v_start is the initial value of the mutual similarity threshold v, γ is the threshold attenuation rate, epoch is the training round, e_peak denotes the training round at which the threshold v reaches its peak, and II(·) is an indicator function that equals 1 when the training round is smaller than e_peak, i.e. II(·) = II{epoch < e_peak};
when outlier samples are assigned based on the configured mutual similarity threshold v, the j-th outlier sample whose mutual similarity value AFM(i, j) is larger than the threshold v is assigned to the cluster point pseudo-label cluster center with the strongest mutual similarity.
In specific implementation, the multi-branch network identification basic model has poor feature extraction capability in the initial training stage, and the extracted feature accuracy is relatively low, so that a smaller mutual similarity relation threshold v is adopted. Along with the training, the model feature extraction capability is gradually enhanced, and the threshold v of the mutual similarity relationship is adaptively increased; however, because some pictures in the data have a plurality of pedestrians, shielding, blurring and the like, and part of outlier sample points are always in the condition of oscillation or incapability of clustering, which is called as strong noise points, when iteration is performed to a certain round, the mutual similarity relation threshold v is adaptively reduced so as to ignore the interference of the strong noise points on the model, as shown in fig. 7.
The mutual similarity relation threshold v is configured based on the above conditions in the different training phases. The initial value v_start of the threshold v can be set to 0.6, the threshold decay rate γ can be set to 0.9, and e_peak is an empirical value that can generally be set to 10. epoch is the round of model training.
After the mutual similarity relation threshold v is configured, when the mutual similarity relation value AFM(i, j) is larger than the threshold v, the j-th outlier sample is assigned to the cluster point pseudo-label cluster center with the strongest mutual similarity relationship; otherwise, the j-th outlier sample is not assigned.
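The reassignment logic can be sketched as follows. The cosine-style affinity follows the per-dimension definition of AFM(i, j) given in the claims; the threshold schedule, whose exact closed form appears only as an equation image in the original, is an assumed stand-in that reproduces the described behavior (start near v_start, peak at e_peak, then decay at rate γ). All names are illustrative.

```python
import numpy as np

def affinity_matrix(centers: np.ndarray, outlier_feats: np.ndarray) -> np.ndarray:
    """AFM(i, j): cosine-style similarity between the i-th cluster center
    feature and the j-th outlier sample feature. Shape: (Y, Num_O)."""
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    o = outlier_feats / np.linalg.norm(outlier_feats, axis=1, keepdims=True)
    return c @ o.T

def threshold_v(epoch: int, v_start: float = 0.6, gamma: float = 0.9, e_peak: int = 10) -> float:
    """Assumed schedule only: rise toward a peak at e_peak, then decay at
    rate gamma to ignore strong noise points (the exact formula is an image
    in the original publication)."""
    if epoch < e_peak:
        return v_start * (epoch + 1) / e_peak      # adaptively increase early on
    return v_start * gamma ** (epoch - e_peak)     # adaptively decrease later

def reassign_outliers(afm: np.ndarray, v: float) -> np.ndarray:
    """Assign outlier j to the center with the strongest affinity when that
    affinity exceeds v; return -1 (leave unassigned) otherwise."""
    best = afm.argmax(axis=0)
    best_val = afm.max(axis=0)
    return np.where(best_val > v, best, -1)
```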
In step 4, during joint contrast learning, the cluster contrast loss l_c is obtained after cluster-level contrast learning, and the instance contrast loss l_t is obtained after instance-level contrast learning, wherein,
for the cluster contrast loss l_c, there is:

l_c = -\log \frac{\exp(f(q) \cdot \Phi_{+} / \Gamma)}{\sum_{i=1}^{Y} \exp(f(q) \cdot \Phi_{i} / \Gamma)}

wherein Φ_+ is a positive sample (cluster center) of the sample picture q, Γ is a set parameter, and f(q) is the query instance feature of the sample picture q;
the instance contrast loss l_t is:

l_t = \sum_{i=1}^{P} \sum_{a=1}^{K} \left[ \beta + \max_{p=1,\dots,K} \left\| f(x_a^i) - f(x_p^i) \right\|_2 - \min_{j \neq i,\; n=1,\dots,K} \left\| f(x_a^i) - f(x_n^j) \right\|_2 \right]_{+}

wherein P is the number of different pedestrians selected from a given sample, K is the number of sample pictures selected for each pedestrian in the given sample, a is one picture among the K sample pictures, x_a^i is an anchor picture with identity i, x_p^i is a positive sample with identity i, x_n^j is a negative sample with identity j, f(x_a^i) is the image feature extracted from the anchor picture x_a^i, f(x_n^j) is the image feature extracted from the negative sample x_n^j, and β is the minimum gap between the similarity of a positive sample pair and the similarity of a negative sample pair.
In specific implementation, cluster-level contrast learning is introduced in addition to contrast learning at the conventional instance level, in order to improve the clustering effect of the model and to reduce the distance between outlier samples and the cluster point pseudo-label cluster centers. Compared with instance-level contrast learning, cluster-level contrast learning mainly pulls a sample closer to its positive cluster and pushes it away from its negative clusters, and it can greatly reduce the computation of the model. Contrasting samples against clusters facilitates clustering, and this cluster-oriented contrast learning paradigm helps the model minimize the similarity between clusters so as to separate different clusters. The joint contrast learning is shown in fig. 8.
In specific implementation, during training, joint contrast learning is performed on the single-camera-domain basic networks and the multi-camera-domain basic network. In joint contrast learning, the purpose of cluster-level contrast learning is to minimize the distance between a sample picture q and its positive cluster and to maximize the distance between the sample picture q and its negative clusters, so the cluster contrast loss l_c can be obtained after cluster-level contrast learning.
The samples of each batch in cluster-level contrast learning only need to be contrasted against the cluster point pseudo-label cluster center features. In specific implementation, when cluster-level contrast learning is performed on a single-camera-domain basic network, the sample picture q is a picture captured by the camera corresponding to that single-camera-domain basic network; when cluster-level contrast learning is performed on the multi-camera-domain basic network, the sample picture q is any picture in the training data set.
From the above description, after clustering, each successfully clustered picture in the training data set is configured with a cluster point pseudo label. Therefore, after the sample picture q is determined, pictures of the same category as q are positive samples and pictures of different categories are negative samples; that is, the positive sample Φ_+ of the sample picture q and the query instance feature f(q) of the sample picture q can be determined by technical means commonly used in the technical field. Thus, the cluster contrast losses l_c corresponding to the single-camera-domain basic networks and the multi-camera-domain basic network can be obtained respectively.
In practice, the parameter Γ may be in the range [0, 1]; for example, Γ may be 0.5. Further, the minimum gap β between the similarity of a positive sample pair and the similarity of a negative sample pair is an empirical value, for which 0.3 is preferable.
In specific implementation, the purpose of instance-level contrast learning is to reduce the distance between similar samples and enlarge the distance between dissimilar samples. For a given set of samples, which are sample pictures selected from the training data set, P different pedestrians are selected and K pictures are selected for each pedestrian. For each image a, the least similar positive sample p and the most similar negative sample n are selected for instance-level contrast learning. Generally, P may be set to 8 and K to 32 in a given sample. Identity i or identity j refers to one of the P different pedestrians.
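A minimal sketch of this instance-level loss on one P×K mini-batch, assuming the batch-hard mining just described (hardest positive, hardest negative, two-norm distances, margin β); tensor names are illustrative:

```python
import torch

def instance_contrast_loss(feats: torch.Tensor, pids: torch.Tensor, beta: float = 0.3) -> torch.Tensor:
    """Batch-hard instance loss over a P*K batch.

    feats: (P*K, N) image features f(x).
    pids:  (P*K,) pedestrian identity per picture.
    For each anchor: least similar positive (max distance), most similar
    negative (min distance), hinge with margin beta.
    """
    dist = torch.cdist(feats, feats, p=2)               # pairwise two-norm distances
    same = pids.unsqueeze(0) == pids.unsqueeze(1)       # same-identity mask
    # the anchor itself sits among its positives at distance 0, harmless under max
    d_ap = dist.masked_fill(~same, float('-inf')).max(dim=1).values  # hardest positive
    d_an = dist.masked_fill(same, float('inf')).min(dim=1).values    # hardest negative
    return torch.relu(beta + d_ap - d_an).mean()        # mean over anchors
```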
Instance-level contrast learning helps the multi-branch network identification basic model learn the salient features that distinguish different samples, enhancing its feature representation capability. Combining the two forms of contrast learning greatly improves the clustering accuracy of the model and alleviates the noisy pseudo-label problem.
In one embodiment of the present invention, the query instance feature f(q) is used to perform contrast learning against the cluster center feature repository Center Memory Bank and to update the Center Memory Bank, as follows:
after the cluster contrast loss l_c is calculated, the cluster center feature is updated with the query instance feature f(q) according to:

Φ_+ ← (1 − u) f(q) + u Φ_+
where u is a momentum parameter that slowly updates the features in the cluster center feature repository Center Memory Bank, avoiding the loss of feature consistency that drastic oscillation would cause. The value of u is generally in the range [0, 1] and can be chosen according to actual needs.
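A sketch of the cluster-level loss together with the slow memory update follows; the InfoNCE-style form over all Y cluster centers matches the loss written above, and the names and the renormalization step are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def cluster_contrast_step(f_q: torch.Tensor, centers: torch.Tensor,
                          pos_idx: int, temperature: float = 0.5, u: float = 0.9) -> torch.Tensor:
    """Cluster contrast loss for one query feature f(q), plus memory update.

    f_q:         (N,) query instance feature.
    centers:     (Y, N) Center Memory Bank (requires_grad=False).
    pos_idx:     index of the positive cluster, i.e. the pseudo label of q.
    temperature: the parameter Gamma in the text.
    u:           momentum of the slow update Phi+ <- (1-u) f(q) + u Phi+.
    """
    logits = centers @ f_q / temperature                # similarity to every center
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([pos_idx]))
    with torch.no_grad():
        centers[pos_idx] = u * centers[pos_idx] + (1 - u) * f_q.detach()
        centers[pos_idx] /= centers[pos_idx].norm()     # renormalize (assumption)
    return loss
```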
In one embodiment of the invention, when the model network parameters are optimized based on joint contrast learning, the model network parameters θ are determined so as to minimize the loss function over the NH training samples under the determined model network parameters θ:

\theta^{*} = \arg\min_{\theta} \sum_{h=1}^{NH} l(x_h; \theta)
In the optimizing process, the multi-camera-domain basic network and all the single-camera-domain basic networks are optimized simultaneously, with:

\min_{\theta} \sum_{a} \left[ \beta + \max_{p} \left\| f(x_a) - f(x_p) \right\|_2 - \min_{n} \left\| f(x_a) - f(x_n) \right\|_2 \right]_{+}

wherein f(x_a) is the image feature extracted from the anchor image x_a.
In specific implementation, the NH training samples are samples selected from the training data set as required. Here max d_{a,p} denotes \max_{p} \| f(x_a) - f(x_p) \|_2, min d_{a,n} denotes \min_{n} \| f(x_a) - f(x_n) \|_2, and \| \cdot \|_2 denotes the two-norm operation.
In one embodiment of the present invention, for the co-training identity loss, there is:

l_{id} = -\frac{1}{Nz} \sum_{i=1}^{Nz} \log p(\tilde{y}_i \mid x_i)

wherein l_id is the co-training identity loss, ỹ_i is the pseudo label of x_i, Nz is the number of training samples in the training data set, and p(ỹ_i | x_i) is the probability that the multi-branch network identification basic model outputs the true identity label ỹ_i for the training sample x_i.

In specific implementation, the number Nz of training samples in the training data set may be selected as required. From the above description, during training, the cluster point pseudo labels and the reassignment of outlier samples give each training sample x_i its label, so the probability p(ỹ_i | x_i) that the multi-branch network identification basic model outputs the true identity label ỹ_i can be obtained.
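A corresponding sketch of the co-training identity loss as a cross-entropy of the classifier outputs against the pseudo labels; the classifier shape and names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def co_training_identity_loss(logits: torch.Tensor, pseudo_labels: torch.Tensor) -> torch.Tensor:
    """l_id = -(1/Nz) * sum_i log p(y_i | x_i).

    logits:        (Nz, num_ids) classifier outputs of a branch network.
    pseudo_labels: (Nz,) long tensor of cluster point pseudo labels
                   (after outlier reassignment).
    F.cross_entropy applies log-softmax, so this is exactly the averaged
    negative log-probability of each sample's labeled identity.
    """
    return F.cross_entropy(logits, pseudo_labels)
```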
In practice, as can be seen from the above description, the cluster contrast loss l_c, the instance contrast loss l_t and the co-training identity loss l_id are obtained during training, so the total loss after one round of training can be obtained as:

l = l_c + l_t + l_{id}

As can be seen from the above description, when judging whether the multi-branch network identification basic model has converged, the main judgment indexes include the model accuracy and the model loss, where the model accuracy can generally be the mean average precision mAP and the model loss is the total loss l.
After training, the specific calculation of the mean average precision mAP can be consistent with the prior art, so whether the multi-branch network identification basic model has converged can be judged effectively. When the multi-branch network identification basic model converges after training, the target training state is reached, and the trained basic model is used to form the multi-branch network identification model Multiformer.
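For reference, the mean average precision can be computed from ranked retrieval results in the standard re-ID fashion; this sketch is generic prior-art evaluation, not a detail of the patent, and its names are illustrative:

```python
import numpy as np

def mean_average_precision(ranked_matches: list) -> float:
    """ranked_matches: one boolean np.ndarray per query, True where the k-th
    ranked gallery image shares the query's identity. Returns the mAP."""
    aps = []
    for m in ranked_matches:
        if not m.any():
            continue                                   # queries without matches are skipped
        hits = np.cumsum(m)                            # matches seen up to each rank
        precision_at_hit = hits[m] / (np.flatnonzero(m) + 1)
        aps.append(precision_at_hit.mean())
    return float(np.mean(aps))
```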
After the multi-branch network identification model Multiformer is obtained, unsupervised pedestrian re-identification requires a query picture R in order to find pedestrians with features similar to R in the picture set captured by the m cameras. As can be seen from the above description, all features of the picture set captured by the m cameras are extracted with the multi-branch network identification model Multiformer; the specific manner and process of extracting the picture features can refer to the above description: for any picture, after Split segmentation, linear projection, dimension transformation and the like, the features are extracted with the multi-camera-domain Interformer network.
After the query picture R is processed by the same technical means, its corresponding features are extracted with the multi-camera-domain Interformer network. After the features of the query picture R are obtained, the feature similarity with the features extracted from the picture set is calculated; the specific manner and process of calculating the feature similarity may follow existing common practice. After the feature similarities are calculated, the pedestrian images matching the extracted pedestrian features can be selected according to actual requirements; for example, a feature similarity threshold can be set, and all pedestrian images meeting the threshold are determined as pedestrian images matching the query picture R. The feature similarity threshold and the like can be chosen as needed to meet the requirements of the actual application scenario.
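The retrieval step can be sketched as cosine ranking of gallery features against the query feature with a similarity threshold; the cosine choice and all names are assumptions of this sketch:

```python
import numpy as np

def retrieve(query_feat: np.ndarray, gallery_feats: np.ndarray, sim_threshold: float = 0.5) -> np.ndarray:
    """Rank gallery images by cosine similarity to the query picture R and
    keep those above the feature similarity threshold.

    query_feat:    (N,) feature of R from the Interformer branch.
    gallery_feats: (num_images, N) features of the m-camera picture set.
    """
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q                                       # cosine similarity per gallery image
    matches = np.where(sims > sim_threshold)[0]
    return matches[np.argsort(-sims[matches])]         # most similar first
```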
To verify the accuracy and robustness of the present invention, experiments were performed on three public data sets: Market-1501, MSMT17 and DukeMTMC-reID. Specifically, the DukeMTMC-reID data set contains 36411 images of 1812 identities taken by 8 cameras, with a training set of 702 identities and 16522 images and a test set of 702 identities. The Market-1501 data set contains 1501 pedestrians shot by 6 cameras, with a training set of 751 identities containing 12936 images and a test set of 750 identities containing 19732 images. The MSMT17 data set contains 4101 pedestrians and 126441 bounding boxes taken by 15 cameras; its training set contains 1041 pedestrians with 32621 bounding boxes in total, and its test set contains 3060 pedestrians with 93820 bounding boxes in total.
Since these data sets are acquired by multiple imaging devices, they contain varied poses, viewing angles and illumination, along with heavily cluttered backgrounds and occlusions between people in different scenes, which makes them very challenging.
Table 1 data set introduction
Data set | Total categories | Training categories | Test categories | Picture size
DukeMTMC-reID | 1812 | 702 | 1110 | 256*128
Market-1501 | 1501 | 751 | 750 | 256*128
MSMT17 | 4101 | 1041 | 3060 | 256*128
Table 1 lists the total number of categories, the training categories and the test categories of the three data sets; the picture size can typically be set to 256*128.
Table 2 accuracy of model on three pedestrian re-recognition tasks
Data set | Market-1501 | DukeMTMC-reID | MSMT17
mAP | 79.1% | 68.9% | 36.0%
Table 2 shows the test results of the unsupervised pedestrian re-identification method on the three unsupervised pedestrian re-identification tasks of Market-1501, DukeMTMC-reID and MSMT17, with the mean average precision mAP as the evaluation index.
The invention obtains a high recognition rate on all three data sets. Although the three data sets present difficulties such as occlusion, deformation, background clutter and low resolution, the robust feature representation capability of the Multiformer proposed by the invention, the cluster representation optimization capability of the joint contrast learning strategy, and the efficient data utilization capability of the adaptive outlier sample reassignment strategy together give the method good robustness to these difficulties and excellent performance.
In order to verify the performance improvement contributed by the multi-branch network identification model Multiformer, the adaptive outlier sample reassignment strategy and the joint contrast learning strategy to the overall unsupervised pedestrian re-identification task, an ablation experiment was carried out on the Market-1501 data set, as shown in Table 3. Specifically, ViT is taken as the baseline network, Multiformer denotes the multi-branch network identification model, JCL denotes the joint contrast learning module Joint Contrast Learning, and AORA denotes the adaptive outlier sample reassignment strategy.
As can be seen from Table 3, on the Market-1501 unsupervised pedestrian re-identification task, the accuracy of the baseline network alone is only 59.6%. When the network model structure is changed from the baseline network to the multi-branch network identification model Multiformer, the accuracy reaches 69.2%, indicating that Multiformer improves the feature representation capability of the model.
After the cluster center features are established to perform joint contrast learning, the accuracy of the model reaches 77.1%, indicating that cluster-level contrast learning effectively lets the model learn the similarity within positive clusters and the difference between negative clusters. On this basis, after the adaptive outlier sample reassignment strategy is added, the accuracy of the model reaches 79.1%, indicating that this module makes fuller use of the limited data samples and thus trains the model more thoroughly.
Table 3 Influence of different modules on the Market-1501 unsupervised pedestrian re-recognition task
[Table 3 appears as an image in the original publication; it reports the Market-1501 mAP for the ViT baseline and for the successive additions of Multiformer, JCL and AORA discussed above.]
To better demonstrate the effects of the Multiformer, the adaptive outlier sample reassignment strategy and the joint contrast learning strategy designed in the present invention, the visualization results are presented in fig. 9.
In summary, the multi-branch network identification model Multiformer is constructed based on the Transformer network and comprises the single-camera-domain Intraformer networks and the multi-camera-domain Interformer network, with all single-camera-domain Intraformer networks sharing backbone network parameters. This enhances generalization capability, alleviates to a certain extent the inter-domain differences caused by the backgrounds, illumination and the like of different camera domains, improves the robustness of the model to noisy pseudo labels, and further improves the accuracy of unsupervised pedestrian re-identification.
Adaptive outlier sample reassignment expands the number of pseudo labels, thereby enhancing the feature representation capability of the multi-branch network identification model Multiformer. During model training, the joint learning of instance-level contrast learning and cluster-level contrast learning greatly improves the clustering accuracy and alleviates the noisy pseudo-label problem.

Claims (9)

1. An unsupervised pedestrian re-identification method based on multi-former and outlier sample re-distribution is characterized by comprising the following steps:
constructing a multi-branch network recognition model multi-former based on a trans-former network to perform required unsupervised pedestrian re-recognition on pedestrian images acquired by m cameras using the constructed multi-branch network recognition model multi-former, wherein,
identifying a model multi-former for the constructed multi-branch network, wherein the model multi-former comprises a single-camera domain intra-former network constructed based on a trans-former network for each camera and a multi-camera domain inter-former network constructed based on the trans-former network for all cameras;
when a multi-branch network identification model multi-former is constructed, the single-camera-domain intra-former networks and multi-camera-domain inter-former networks of all cameras adopt the same backbone network, and the single-camera-domain intra-former networks of all cameras share backbone network parameters during training;
when a pedestrian is re-identified, carrying out feature extraction on an identification image containing the pedestrian to be identified by utilizing the multi-camera-domain inter-former network, so as to find and determine, from the pedestrian images acquired by the m cameras, the pedestrian images matched with the extracted pedestrian features;
when constructing the multi-branch network identification model multi-former, the construction steps comprise:
constructing a multi-branch network identification basic model based on a trans-former network, wherein the multi-branch network identification basic model comprises a multi-camera domain basic network based on the trans-former network and m single-camera domain basic networks based on the trans-former network, and a classifier is configured in the multi-camera domain basic network and all the single-camera domain basic networks, and the configured classifier is adaptively connected with the multi-camera domain basic network or a corresponding backbone network in the single-camera domain basic network;
when a multi-branch network identification basic model is built, pre-training a backbone network for building a multi-camera domain basic network based on an ImageNet data set to obtain multi-camera domain backbone network pre-training parameters of the multi-camera domain basic network;
When training the constructed single-camera-domain basic network, the obtained multi-camera-domain backbone network pre-training parameters are loaded to backbone networks of all single-camera-domain basic networks, so that the single-camera-domain basic networks of all cameras share the network backbone parameters;
performing the required training on the constructed multi-branch network identification basic model, so as to form, when the target training state is reached, the corresponding single-camera-domain intra-former networks based on the trained single-camera-domain basic networks and the multi-camera-domain inter-former network based on the trained multi-camera-domain basic network;
and forming the multi-branch network identification model multi-former from the multi-camera-domain inter-former network and the m single-camera-domain intra-former networks.
2. The method for unsupervised pedestrian re-recognition based on multi-former and outlier sample re-distribution according to claim 1, wherein the training process comprises:
step 1, for a training data set, extracting features by using the multi-branch network identification basic model, so as to obtain the multi-camera-domain picture feature F_mc and the single-camera-domain picture feature F_c_i of the ith camera, i = 1, …, m;
step 2, clustering the obtained multi-camera-domain picture feature F_mc and the single-camera-domain picture feature F_c_i of the ith camera, wherein successfully clustered pictures form cluster points Inliers and are assigned cluster point pseudo labels, and unsuccessfully clustered pictures form Outliers;
step 3, generating a cluster point pseudo tag clustering center based on the cluster point pseudo tags, performing self-adaptive outlier sample reassignment on Outliers by using the generated cluster point pseudo tag clustering center, so that after the self-adaptive outlier samples are reassigned, corresponding cluster point pseudo tags are assigned to Outliers in the Outliers, and a pseudo tag training set is formed by using all the cluster point pseudo tags;
step 4, carrying out joint contrast learning on the multi-branch network identification basic model to carry out model network parameter optimization based on the joint contrast learning on the multi-branch network identification basic model, wherein,
for the ith single-camera-domain basic network, joint contrast learning is performed based on the training data set, the single-camera-domain picture feature F_c_i of the ith camera and the cluster point pseudo-label cluster centers;
for the multi-camera-domain basic network, joint contrast learning is performed based on the training data set, the multi-camera-domain picture feature F_mc and the cluster point pseudo-label cluster centers;
the joint contrast learning comprises cluster-level contrast learning and instance-level contrast learning;
step 5, carrying out collaborative training of the single-camera-domain basic networks and the multi-camera-domain basic network on the multi-branch network identification basic model optimized by joint contrast learning, wherein,
for the multi-camera-domain basic network, training is performed by using the multi-camera-domain picture feature F_mc and the pseudo label training set;
for the ith single-camera-domain basic network, training is performed by using the single-camera-domain picture feature F_c_i of the ith camera and the pseudo label training set;
and step 6, repeating the training processes of the steps 1 to 5 until reaching the target training state.
3. The unsupervised pedestrian re-recognition method based on multi-former and outlier sample re-distribution of claim 2, wherein for step 1, when the multi-camera-domain picture feature F_mc is extracted, any training picture in the training data set is subjected to Split processing, each image block obtained by the Split processing is connected with a parameter Cls token, and the position information of each image block and the camera information code of the training picture are embedded, so as to configure and form the multi-camera-domain feature extraction information of the training picture;
the multi-camera-domain feature extraction information of the training picture is processed by the multi-camera-domain basic network to extract the multi-camera-domain picture feature F_mc;
when the single-camera-domain picture feature F_c_i of the ith camera is extracted, the training pictures acquired by the ith camera are subjected to Split processing, each image block obtained by the Split processing is connected with a parameter Cls token, and the position information of each image block is embedded, so as to form the single-camera-domain feature extraction information of the training picture;
the single-camera-domain feature extraction information of the training picture is processed by the single-camera-domain basic network corresponding to the ith camera to extract the single-camera-domain picture feature F_c_i.
4. The unsupervised pedestrian re-recognition method based on multi-former and outlier sample re-distribution according to claim 2, wherein in step 2, when the obtained multi-camera-domain picture feature F_mc and all single-camera-domain picture features F_c_i are clustered, the clustering method includes the DBSCAN clustering method.
5. The unsupervised pedestrian re-recognition method based on multi-former and outlier sample re-distribution according to claim 2, wherein in step 3, for the cluster point pseudo-label cluster centers, there is:

\Phi_i = \frac{1}{Num_i} \sum_{j=1}^{Num_i} f_j

wherein Y is the number of categories of the cluster point pseudo labels, Φ_i is the cluster center feature of the i-th category, f_j is the feature of the j-th picture in the i-th category, and Num_i is the number of pictures contained in the i-th category;
the generated cluster point pseudo tag cluster center is stored in a cluster center feature repository Center Memory Bank;
an affinity matrix between the outlier samples within the outlier Outliers and the cluster point pseudo tag cluster center is calculated, wherein,
the affinity matrix between the Outliers and the cluster point pseudo-label cluster centers is:

AFM(i, j) = \frac{\sum_{r=1}^{N} \Phi_{i_r} O_{j_r}}{\sqrt{\sum_{r=1}^{N} \Phi_{i_r}^{2}} \sqrt{\sum_{r=1}^{N} O_{j_r}^{2}}}

wherein AFM(i, j) is the mutual similarity relation value between the i-th cluster center feature Φ_i and the j-th outlier sample in the affinity matrix AFM, O_j is the feature of the j-th outlier sample, Φ_{i_r} represents the r-th value of the i-th cluster center feature Φ_i, O_{j_r} represents the r-th value of the j-th outlier sample feature O_j, and N represents the feature dimension;
and when the self-adaptive outlier samples are redistributed based on the calculated affinity matrix AFM, the outlier samples are distributed to the cluster point pseudo-label cluster center with the strongest mutual similarity relationship.
6. The method for unsupervised pedestrian re-recognition based on multi-former and outlier sample re-distribution as set forth in claim 5, wherein a mutual similarity relation threshold v is configured for the mutual similarity relationship between outlier samples and the cluster point pseudo-label cluster centers, wherein:

[the schedule for v is given as an equation image in the original publication, as a function of v_start, γ, epoch and e_peak]

where Num_O is the number of outlier samples in the Outliers, v_start is the initial value of the mutual similarity relation threshold v, γ is the threshold decay rate, epoch is the training round, e_peak is the training round at which the threshold v reaches its peak, and II(·) is an indicator function that equals 1 when the training round is smaller than e_peak, i.e. II(·) = II{epoch < e_peak};
When the outlier samples are distributed based on the configured mutual similarity relation threshold v, the j-th outlier sample with the mutual similarity relation value AFM (i, j) larger than the mutual similarity relation threshold v is distributed to the cluster point pseudo label cluster center with the strongest mutual similarity relation.
7. The method for unsupervised pedestrian re-recognition based on multi-former and outlier sample re-distribution of claim 5, wherein in step 4, during joint contrast learning, the cluster contrast loss l_c is obtained after cluster-level contrast learning and the instance contrast loss l_t is obtained after instance-level contrast learning, wherein,

for the cluster contrast loss l_c, there is:

l_c = -\log \frac{\exp(f(q) \cdot \Phi_{+} / \Gamma)}{\sum_{i=1}^{Y} \exp(f(q) \cdot \Phi_{i} / \Gamma)}

wherein Φ_+ is a positive sample of the sample picture q, Γ is a set parameter, and f(q) is the query instance feature of the sample picture q;
the instance contrast loss l_t is:

l_t = \sum_{i=1}^{P} \sum_{a=1}^{K} \left[ \beta + \max_{p=1,\dots,K} \left\| f(x_a^i) - f(x_p^i) \right\|_2 - \min_{j \neq i,\; n=1,\dots,K} \left\| f(x_a^i) - f(x_n^j) \right\|_2 \right]_{+}

wherein P is the number of different pedestrians selected from a given sample, K is the number of sample pictures selected for each pedestrian in the given sample, a is one picture of the K sample pictures, x_a^i is an anchor picture with identity i, x_p^i is a positive sample with identity i, x_n^j is a negative sample with identity j, f(x_a^i) is the image feature extracted from the anchor picture x_a^i, and β is the minimum gap between the similarity of a positive sample pair and the similarity of a negative sample pair.
8. The method for unsupervised pedestrian re-recognition based on multi-former and outlier sample re-distribution as set forth in claim 7, wherein when the model network parameters based on joint contrast learning are optimized, the model network parameters θ are determined so as to minimize the loss function of NH training samples under the determined model network parameters θ;

in the optimizing process, the multi-camera-domain basic network and all the single-camera-domain basic networks are optimized simultaneously, with:

\min_{\theta} \sum_{a} \left[ \beta + \max_{p} \left\| f(x_a) - f(x_p) \right\|_2 - \min_{n} \left\| f(x_a) - f(x_n) \right\|_2 \right]_{+}

wherein f(x_a) is the image feature extracted from the anchor image x_a.
9. The method for unsupervised pedestrian re-recognition based on multi-former and outlier sample re-distribution of claim 7, wherein for the co-training identity loss, there is:

l_{id} = -\frac{1}{Nz} \sum_{i=1}^{Nz} \log p(\tilde{y}_i \mid x_i)

wherein l_id is the co-training identity loss, ỹ_i is the pseudo label of x_i, Nz is the number of training samples in the training data set, and p(ỹ_i | x_i) is the probability that the multi-branch network identification basic model outputs the true identity label ỹ_i for the training sample x_i.
CN202211404730.9A 2022-11-10 2022-11-10 Unsupervised pedestrian re-identification method based on multi-former and outlier sample re-distribution Active CN115601791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211404730.9A CN115601791B (en) 2022-11-10 2022-11-10 Unsupervised pedestrian re-identification method based on multi-former and outlier sample re-distribution

Publications (2)

Publication Number Publication Date
CN115601791A CN115601791A (en) 2023-01-13
CN115601791B true CN115601791B (en) 2023-05-02


Country Status (1)

Country Link
CN (1) CN115601791B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022043741A1 (en) * 2020-08-25 2022-03-03 商汤国际私人有限公司 Network training method and apparatus, person re-identification method and apparatus, storage medium, and computer program
CN115205570A (en) * 2022-09-14 2022-10-18 中国海洋大学 Unsupervised cross-domain target re-identification method based on comparative learning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10425641B2 (en) * 2013-05-30 2019-09-24 Intel Corporation Quantization offset and cost factor modification for video encoding
GB2548870B (en) * 2016-03-31 2018-12-05 Ekkosense Ltd Remote monitoring
CN111723645B (en) * 2020-04-24 2023-04-18 浙江大学 Multi-camera high-precision pedestrian re-identification method for in-phase built-in supervised scene
CN114663685B (en) * 2022-02-25 2023-07-04 江南大学 Pedestrian re-recognition model training method, device and equipment
CN114596589A (en) * 2022-03-14 2022-06-07 大连理工大学 Domain-adaptive pedestrian re-identification method based on interactive cascade lightweight transformations
CN114972794A (en) * 2022-06-15 2022-08-30 上海理工大学 Three-dimensional object recognition method based on multi-view Pooll transducer

Also Published As

Publication number Publication date
CN115601791A (en) 2023-01-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant