CN115830548B - Adaptive pedestrian re-identification method based on unsupervised multi-domain fusion - Google Patents

Adaptive pedestrian re-identification method based on unsupervised multi-domain fusion

Info

Publication number
CN115830548B
CN115830548B (application CN202310125639.1A)
Authority
CN
China
Prior art keywords
domain
complex
representing
model
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310125639.1A
Other languages
Chinese (zh)
Other versions
CN115830548A (en)
Inventor
贾明晖
于洁潇
张涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202310125639.1A priority Critical patent/CN115830548B/en
Publication of CN115830548A publication Critical patent/CN115830548A/en
Application granted granted Critical
Publication of CN115830548B publication Critical patent/CN115830548B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides an unsupervised multi-domain fusion adaptive pedestrian re-identification method. A Transformer is pre-trained on labeled surveillance pedestrian pictures captured by a plurality of monitoring devices at different places and in different time periods, and the trained network is then trained a second time on unlabeled surveillance pedestrian pictures. The same pedestrian can thus be found at different times on different devices, and the method is not limited to particular devices or places.

Description

Adaptive pedestrian re-identification method based on unsupervised multi-domain fusion
Technical Field
The invention belongs to the technical field of pedestrian re-identification, and particularly relates to an unsupervised multi-domain fusion adaptive pedestrian re-identification method.
Background
The purpose of pedestrian re-identification is to associate specific objects across different scenes and camera views; an essential component is the extraction of robust and discriminative features. The task has long been dominated by CNN-based methods, but these mainly focus on small discriminative regions, and their down-sampling operations (pooling and strided convolution) reduce the spatial resolution of the output feature map, which greatly weakens the ability to distinguish objects with similar appearances. Most attention-based methods are embedded in deep layers, prefer larger contiguous regions, and have difficulty extracting multiple diverse discriminable regions. Moreover, most existing methods are limited to particular devices and places, so their application range is limited and their matching accuracy is low.
Disclosure of Invention
In view of the above, the invention provides an unsupervised multi-domain fusion adaptive pedestrian re-identification method that finds the same pedestrian at different times across different devices, is not limited to particular devices or places, and has a wider application range and higher matching accuracy.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
An unsupervised multi-domain fusion adaptive pedestrian re-identification method comprises the following steps:
step 1: taking surveillance pedestrian pictures captured by a plurality of monitoring devices at different places and in different time periods as input of a Transformer network and a DBSCAN network, and dividing the pictures into a source domain and a target domain according to the amount of label information;
step 2: generating pseudo labels for the pedestrian data of the unlabeled target domain by using the DBSCAN network, so that the label data of the target domain are kept consistent with the label data of the source domain;
step 3: slicing and position-encoding the pedestrian images input from the source domain, building a Transformer network to obtain different degrees of attention on different pedestrian feature points, and retaining the trained model parameters;
step 4: slicing and position-encoding the pedestrian images of the target domain, and feeding them together with the source-domain images into the Transformer network, so that the different degrees of attention learned by the Transformer network on the source domain are transferred to the target domain;
step 5: constructing a complex graph convolutional neural network, taking the complex domain centers of the several domain data sets extracted by the Transformer network as input to the multi-domain-information-fused complex graph convolutional neural network, aligning the sample distributions of the different domains and reducing the differences between the data sets;
step 6: constructing a spatial context information fusion module with a double-layer MLP structure, taking the real-number multi-domain center features extracted by the Transformer network as its input, transposing the input features and feeding them into the double-layer MLP for information interaction, then transposing the output of the double-layer MLP again and adding it to the input features to form the output;
step 7: generating image features of the source domain and the target domain with the trained Transformer network and matching them, so that the same pedestrian appearing on the target device can be found from the surveillance images of the source device.
Further, step 3 specifically includes:
each pedestrian picture in the source-domain data is partitioned into patches; that is, the n-th picture of the i-th domain, denoted x_n^i, is divided into m patches, and each patch is then encoded to obtain its word-embedding representation, denoted E_n^i. The position information corresponding to each patch is obtained by positional encoding, computed as:
PE(pos, 2k) = sin(pos / 10000^{2k/d_m}),  PE(pos, 2k+1) = cos(pos / 10000^{2k/d_m})
where d_m denotes the dimension of the m-th patch after the input image is divided and pos denotes the position of the m-th patch; the input of the Transformer network can therefore be expressed as:
z_n^i = E_n^i + P_n^i
where P_n^i denotes the m-patch position information of the n-th picture of the i-th domain;
the Transformer network contains two modules, an MLP and a multi-head attention mechanism; z_n^i is fed into the multi-head attention mechanism so that the network attends to different parts of the picture to different degrees, computed as:
Q = z_n^i W_Q,  K = z_n^i W_K,  V = z_n^i W_V
h_n^i = softmax(Q K^T / sqrt(d_m)) V
where W_Q, W_K and W_V denote the projection matrices in the pre-training model, and h_n^i denotes the output feature of the attention mechanism for the sample image x_n^i, which is also the input of the MLP; the output of the Transformer is therefore:
y_n^i = MLP(h_n^i)
where y_n^i denotes the output of the Transformer network, i.e. the predicted label of picture x_n^i; the cross-entropy loss between the predicted labels of the source domain and the real labels of the source domain is computed and optimized, finally yielding a well-performing Transformer network pre-training model.
Further, step 4 specifically includes:
two models with completely identical structures are constructed, one being the training model and the other being the EMA model; the image samples of the target domain are sliced, i.e. the n-th picture of the target domain, denoted x_n^t, is divided into m patches, each patch is encoded to obtain its word-embedding representation, denoted E_n^t, and the corresponding positional encoding is computed as:
PE(pos, 2k) = sin(pos / 10000^{2k/d_m}),  PE(pos, 2k+1) = cos(pos / 10000^{2k/d_m})
where d_m denotes the dimension of the m-th patch after the input image is divided and pos denotes the position of the m-th patch; the target-domain input of the Transformer network can therefore be expressed as:
z_n^t = E_n^t + P_n^t
where P_n^t denotes the m-patch position information of the n-th picture of the target domain;
the encoded samples z_n^d of the source domain D_s and the target domain D_t are fed into the multi-head attention mechanism, where D_s and D_t denote the source-domain and target-domain data sample sets respectively, so that the Transformer network also forms different degrees of attention on the target-domain pictures and the attention is transferred, computed as:
Q = z_n^d W'_Q,  K = z_n^d W'_K,  V = z_n^d W'_V
f_n^d = softmax(Q K^T / sqrt(d_m)) V
where f_n^d is the sample feature extracted from the n-th image sample of the d-th domain, d ∈ {s, t}, with s denoting the source domain and t denoting the target domain, and W' denotes the projection matrix of the current training, updated in the following way:
W_ema ← α·W_ema + (1 − α)·W'
where W_ema denotes the optimal (EMA) model parameter matrix and α is the weight of the model obtained from previous training iterations relative to the current one; when the number of training iterations is 0, W_ema is the parameter matrix obtained by pre-training; through continuous training and optimization, a training model that achieves the best effect on the same equipment is finally obtained.
Further, step 5 specifically includes:
from the n-th sample feature f_n^d obtained in step 4, where d denotes the d-th domain, the real-number domain center of each domain is generated as:
c_d = (1/N) Σ_{n=1}^{N} f_n^d
where c_d denotes the real-number domain center of the d-th domain; it is mapped to a complex feature space, and each domain center is written as:
Re(u_d) = c_d W_re,  Im(u_d) = c_d W_im
where Re(·) denotes the operation of taking the real part of a feature in the complex feature space, Im(·) denotes the operation of taking the imaginary part of a feature in the complex feature space, u_d denotes the complex domain center of the d-th domain, and W_re and W_im are two trainable weights; two complex graph convolutional neural networks are constructed, one for the training model and one for the EMA model, the complex graph convolutional neural network of the training model being denoted G and that of the EMA model being denoted G_ema;
for the complex graph convolutional neural network of the training model, its node v_d is the complex domain center of the d-th domain, and its adjacency matrix A represents the vector relationships between the complex domain centers, with entry A_ab relating the a-th domain and the b-th domain; the complex graph convolutional neural network of the training model is therefore updated as:
U' = σ( D^{-1/2} (A + I) D^{-1/2} U W )
where the d-th row u_d of U is the complex domain center of the d-th domain, the d-th row u'_d of U' is the updated global complex-domain-center feature of the d-th domain, A denotes the adjacency matrix of the complex graph convolutional neural network, I denotes the complex identity matrix, D denotes the complex degree matrix, σ denotes a complex nonlinear activation function, and W is a learnable complex parameter;
for the complex graph convolutional neural network of the EMA model, the output is computed with the same method as for the training model; the output complex centers are then passed through a modulus operation, the Euclidean distance is used to measure the distance between the complex domain centers updated by the two complex graph convolutional neural networks, and an MSE loss function is used to further reduce the sample distribution differences between different domains, thereby optimizing the Transformer network model.
Further, step 6 specifically includes: two double-layer MLP structures are constructed, one for the training model and one for the EMA model;
for the training model, the real-number domain center c_d of the d-th domain obtained in step 5 is first transposed to obtain c_d^T, which is fed into the double-layer MLP whose output is:
o_d = W_2 · σ( W_1 · c_d^T )
where W_1 and W_2 denote two trainable weights and σ denotes a nonlinear activation function; the output o_d of the double-layer MLP is then transposed and added to the original input c_d as the output of the spatial context information fusion module:
s_d = o_d^T + c_d
the output s'_d of the spatial context information fusion module on the EMA model is obtained in the same way as for the training model; finally, the distance between the outputs of the spatial fusion modules on the training model and the EMA model is measured, further reducing the differences between the multiple domains and optimizing the Transformer network model.
Compared with the prior art, the adaptive pedestrian re-identification method based on unsupervised multi-domain fusion has the following advantages:
the invention provides a multi-source-domain Transformer and multi-domain information fusion technique that trains on data sets generated on different devices, so that the obtained sample features generalize better, solving the problem that the same pedestrian cannot be matched accurately because of inconsistent angles, shooting times and spatial positions across devices;
the invention constructs a complex graph convolutional neural network that uses a complex feature space to explore vector semantic associations, further reducing the inter-domain sample distribution difference at the vector-structure level;
the invention constructs a spatial context information fusion module that directly explores the spatial associations of the domain-center features, further reducing the inter-domain sample distribution difference at the spatial-context level.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of a first stage of training of the present invention;
FIG. 2 is a flow chart of the second stage of training of the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", etc. may explicitly or implicitly include one or more such feature. In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art in a specific case.
The invention will be described in detail below with reference to the drawings in connection with embodiments.
The invention provides a Transformer-based adaptive pedestrian re-identification method with unsupervised multi-domain fusion. Training of the model requires two stages. As shown in FIG. 1, in the first training stage, model pre-training is performed using pedestrian data captured by a plurality of different monitoring devices in different time periods and carrying a large number of labels, and the model parameters obtained by this training are retained. As shown in FIG. 2, in the second training stage, the model parameters of the first stage are imported into the network, and the model is fine-tuned using the same labeled data as in the first stage together with unlabeled pedestrian data captured by an additional target monitoring device, so that the same pedestrian can be found at different time periods and different places on the target device.
The invention discloses an unsupervised multi-domain fusion adaptive pedestrian re-identification method, which comprises the following steps:
step 1, taking the pictures of monitoring pedestrians of a plurality of monitoring devices at different places in different time periods as inputs of a transducer network and a DBSCAN network, and dividing the pictures into a source domain and a target domain according to the number of tag information.
Specifically, we take 4 groups of monitoring-device images as an example, of which 3 groups are fully labeled and 1 group is unlabeled. We regard the i-th group of N fully labeled monitoring-device images as source-domain data, defined as X_i = {x_1^i, x_2^i, ..., x_N^i}, each group containing N samples, where the data label corresponding to x_n^i is y_n^i. We regard the N unlabeled monitoring-device images as target-domain data, defined as X_t = {x_1^t, x_2^t, ..., x_N^t}, containing N samples.
Step 2: the DBSCAN network is used to generate pseudo labels for the pedestrian data of the unlabeled target domain, so that the label data of the target domain are kept consistent with the label data of the source domain.
Specifically, since the target-domain data contain no label data, the target-domain labels need to be generated by a clustering method. DBSCAN is a representative density-based clustering algorithm that defines a cluster as the maximal set of density-connected points. The target-domain data labels are obtained with DBSCAN as follows:
(1) Take any data point x_n^t in the target domain; the parameter pair (eps, MinPts) describes the tightness of the sample distribution in its neighborhood, where eps denotes the neighborhood distance threshold of a core point and MinPts denotes the minimum number of points required within the eps-neighborhood of a core point;
(2) If, for the parameters eps and MinPts, the selected data point x_n^t is a core point, find all data points that are density-reachable from x_n^t to form a cluster;
(3) If the selected data point x_n^t is an edge point, select another data point as the core point;
(4) Repeat steps (2) and (3) until all points have been processed.
Finally, each obtained cluster represents one category; the pseudo label of the n-th target-domain sample is therefore denoted ŷ_n^t, where the target domain contains N samples.
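For illustration only, the pseudo-label generation described above can be sketched with scikit-learn's DBSCAN applied to pre-extracted target-domain features; the feature extractor and the eps and min_samples values below are assumptions made for the sketch and are not specified by the patent.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def generate_pseudo_labels(target_features, eps=0.6, min_samples=4):
    """Cluster unlabeled target-domain features and return pseudo labels.

    target_features: (N, D) array of features extracted by the Transformer network.
    Samples labeled -1 are DBSCAN noise and can be excluded from training.
    """
    clusterer = DBSCAN(eps=eps, min_samples=min_samples, metric="euclidean")
    pseudo_labels = clusterer.fit_predict(np.asarray(target_features))
    keep = pseudo_labels != -1
    return pseudo_labels, keep

# example usage with a hypothetical feature extractor:
# feats = extract_features(transformer, target_images)
# labels, keep = generate_pseudo_labels(feats)
```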
Step 3: in the first training stage, the pedestrian images input from the source domain are sliced and position-encoded, a Transformer network is built to obtain different degrees of attention on different pedestrian feature points, and the trained model parameters are retained.
Specifically, we partition each pedestrian picture in the source-domain data into patches; that is, the n-th picture of the i-th domain, denoted x_n^i, is divided into m patches, and each patch is then encoded to obtain its word-embedding representation, denoted E_n^i. By positional encoding we can compute the position information corresponding to each patch, in the following way:
PE(pos, 2k) = sin(pos / 10000^{2k/d_m}),  PE(pos, 2k+1) = cos(pos / 10000^{2k/d_m})
where d_m denotes the dimension of the m-th patch after the input image is divided and pos denotes the position of the m-th patch. The input of the Transformer can therefore be expressed as:
z_n^i = E_n^i + P_n^i
where P_n^i denotes the m-patch position information of the n-th picture of the i-th domain. The Transformer network contains two modules, one being an MLP and the other a multi-head attention mechanism. We feed z_n^i into the multi-head attention mechanism so that the network attends to different parts of the picture to different degrees, computed as:
Q = z_n^i W_Q,  K = z_n^i W_K,  V = z_n^i W_V
h_n^i = softmax(Q K^T / sqrt(d_m)) V
where W_Q, W_K and W_V denote the projection matrices in the pre-trained model and h_n^i denotes the output feature of the attention mechanism, i.e. the feature of the sample image x_n^i, which is also the input of the MLP; the output of the Transformer is therefore:
y_n^i = MLP(h_n^i)
where y_n^i denotes the output of the Transformer network, i.e. the predicted label of picture x_n^i. The cross-entropy loss between the predicted labels and the real labels of the source domain is computed and optimized, finally yielding a well-performing Transformer network pre-training model.
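As a minimal sketch of the patch slicing, position information and attention-plus-MLP block described in this step, the following PyTorch module can be used; the image size, patch size, embedding dimension, number of heads, identity count and the use of a learned position parameter (rather than the sinusoidal encoding above) are illustrative assumptions, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

class PatchTransformerBlock(nn.Module):
    """Patch slicing, position information and one attention + MLP block (step 3 sketch)."""
    def __init__(self, img_size=256, patch=16, dim=768, heads=8, num_ids=1000):
        super().__init__()
        m = (img_size // patch) ** 2                       # number of patches per image
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # word embedding E
        self.pos = nn.Parameter(torch.zeros(1, m, dim))    # position information P (learned here)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.head = nn.Linear(dim, num_ids)                # identity classifier

    def forward(self, x):                                  # x: (B, 3, H, W)
        e = self.embed(x).flatten(2).transpose(1, 2)       # (B, m, dim)
        z = e + self.pos                                   # z = E + P
        h, _ = self.attn(z, z, z)                          # multi-head attention output h
        feat = self.mlp(h).mean(dim=1)                     # pooled sample feature
        return feat, self.head(feat)                       # feature and predicted label y

# pre-training on the labeled source domain:
# feat, logits = model(images)
# loss = nn.functional.cross_entropy(logits, labels)
```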
Step 4: in the second training stage, the pedestrian images of the target domain are sliced and position-encoded and are fed, together with the source-domain images, into the Transformer network, so that the different degrees of attention learned by the Transformer network on the source domain are transferred to the target domain.
In the second training stage we construct two models with exactly identical structures, one being the training model and the other the EMA model (i.e. the optimal model); both use the same initialization conditions. The purpose of the training model is to obtain the model parameters of the current training batch. The purpose of the EMA model is to maintain the average of the model parameters of the previous training batches and the current batch, thereby representing the optimal model.
Specifically, the image samples of the target domain are sliced; the n-th picture of the target domain, denoted x_n^t, is divided into m patches, each patch is encoded to obtain its word-embedding representation E_n^t, and the corresponding positional encoding is computed as:
PE(pos, 2k) = sin(pos / 10000^{2k/d_m}),  PE(pos, 2k+1) = cos(pos / 10000^{2k/d_m})
where d_m denotes the dimension of the m-th patch after the input image is divided and pos denotes the position of the m-th patch. The target-domain input of the Transformer can therefore be expressed as:
z_n^t = E_n^t + P_n^t
where P_n^t denotes the m-patch position information of the n-th picture of the target domain;
we feed the encoded samples z_n^d of the source domain D_s and the target domain D_t into the multi-head attention mechanism, where D_s and D_t denote the source-domain and target-domain data sample sets respectively, so that the network also forms different degrees of attention on the target-domain pictures and the attention is transferred, computed as:
Q = z_n^d W'_Q,  K = z_n^d W'_K,  V = z_n^d W'_V
f_n^d = softmax(Q K^T / sqrt(d_m)) V
where f_n^d is the sample feature extracted from the n-th image sample of the d-th domain, d ∈ {s, t}, with s denoting the source domain and t denoting the target domain, and W' denotes the projection matrix of the current training, updated in the following way:
W_ema ← α·W_ema + (1 − α)·W'
where W_ema denotes the optimal (EMA) model parameter matrix and α is the weight of the model obtained from previous training iterations relative to the current one. When the number of training iterations is 0, W_ema is the parameter matrix obtained by pre-training. Through continuous training and optimization, a training model that achieves the best results on all four different devices is finally obtained.
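A small sketch of the EMA parameter update used to maintain the optimal model in this step is shown below; the momentum value 0.999 is an assumption for illustration, not a value given by the patent.

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, train_model, alpha=0.999):
    """W_ema <- alpha * W_ema + (1 - alpha) * W_current, applied to every parameter."""
    for p_ema, p_cur in zip(ema_model.parameters(), train_model.parameters()):
        p_ema.mul_(alpha).add_(p_cur, alpha=1.0 - alpha)

# at iteration 0 the EMA model is a copy of the pre-trained parameters:
# ema_model = copy.deepcopy(pretrained_model)
# after every optimizer step: ema_update(ema_model, train_model)
```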
Step 5: a complex graph convolutional neural network module is constructed; the complex domain centers of the several domain data sets extracted by the Transformer network are taken as input to the multi-domain-information-fused complex graph convolutional neural network module, and the vector-space structural associations between domains are explored, so that the sample distributions of the different domains are further aligned and the differences between the data sets are reduced.
Specifically, from the n-th sample feature f_n^d obtained in step 4, where d denotes the d-th domain, the real-number domain center of each domain is generated as:
c_d = (1/N) Σ_{n=1}^{N} f_n^d
where c_d denotes the real-number domain center of the d-th domain. To further consider the structured relationships between domain centers and mine their semantic associations, we map them to a complex feature space to explore the vector semantic associations between domain centers; that is, each domain center is written as:
Re(u_d) = c_d W_re,  Im(u_d) = c_d W_im
where Re(·) denotes the operation of taking the real part of a feature in the complex feature space, Im(·) denotes the operation of taking the imaginary part of a feature in the complex feature space, u_d denotes the complex domain center of the d-th domain, and W_re and W_im are two trainable weights. Two complex graph convolutional neural networks are constructed, one for the training model and one for the EMA model; the complex graph convolutional neural network of the training model is denoted G and that of the EMA model is denoted G_ema.
For the complex graph convolutional neural network of the training model, its node v_d is the complex domain center of the d-th domain, and its adjacency matrix A represents the vector relationships between the complex domain centers, with entry A_ab relating the a-th domain and the b-th domain. The graph convolutional neural network is therefore updated in the following way:
U' = σ( D^{-1/2} (A + I) D^{-1/2} U W )
where the d-th row u_d of U is the complex domain center of the d-th domain, the d-th row u'_d of U' is the updated global complex-domain-center feature of the d-th domain, A denotes the adjacency matrix of the complex graph convolutional neural network, I denotes the complex identity matrix, D denotes the complex degree matrix, σ denotes a complex nonlinear activation function, and W is a learnable complex parameter.
For the complex graph convolutional neural network of the EMA model, the calculation is identical to that of the training model, yielding the corresponding updated complex domain centers; after the output complex centers are passed through a modulus operation, the Euclidean distance is used to measure the distance between the complex domain centers updated by the two complex graph convolutional neural networks, and an MSE loss function is used to further reduce the sample distribution differences between the different domains, thereby optimizing the Transformer network model.
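The following is a compact sketch of the complex graph convolution and the center-alignment loss of this step, using PyTorch complex tensors; the adjacency construction (similarity between complex centers with self-loops), the activation, and the layer sizes are assumptions made for the sketch, since the patent does not spell them out.

```python
import torch
import torch.nn as nn

class ComplexDomainGCN(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_re = nn.Linear(dim, dim, bias=False)   # W_re: real-part mapping
        self.w_im = nn.Linear(dim, dim, bias=False)   # W_im: imaginary-part mapping
        self.w_gcn = nn.Parameter(torch.randn(dim, dim, dtype=torch.cfloat) * 0.02)

    def forward(self, centers):                        # centers: (D, dim) real domain centers
        u = torch.complex(self.w_re(centers), self.w_im(centers))    # complex centers U
        # assumed adjacency: similarity between complex centers, plus self-loops (A + I)
        sim = torch.abs(u @ u.conj().T)
        a_hat = sim / sim.norm(dim=1, keepdim=True) + torch.eye(len(u))
        d_inv = torch.diag(a_hat.sum(1).pow(-0.5))                    # D^{-1/2}
        u_out = (d_inv @ a_hat @ d_inv).to(u.dtype) @ u @ self.w_gcn  # GCN update
        return torch.tanh(u_out.real) + 1j * torch.tanh(u_out.imag)   # complex activation

def center_alignment_loss(gcn_train, gcn_ema, centers_train, centers_ema):
    """MSE between the moduli of the updated complex centers of the two branches."""
    return nn.functional.mse_loss(torch.abs(gcn_train(centers_train)),
                                  torch.abs(gcn_ema(centers_ema)))
```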
Step 6: a spatial context information fusion module with a double-layer MLP structure is constructed; the real-number multi-domain center features extracted by the Transformer network are taken as its input, the input features are transposed and fed into the double-layer MLP for information interaction, and the output of the double-layer MLP is transposed again and added to the input features, exploring the associations among the dimensions of the domain-center features of each domain, i.e. the direct spatial associations between the domain centers, and thus further aligning the multi-domain sample distributions in space.
Specifically, we construct two double-layer MLP structures, one for the training model and one for the EMA model. For the training model, the real-number domain center c_d of the d-th domain obtained in step 5 is first transposed to obtain c_d^T, which is fed into the double-layer MLP whose output is:
o_d = W_2 · σ( W_1 · c_d^T )
where W_1 and W_2 denote two trainable weights and σ denotes a nonlinear activation function. The output o_d of the double-layer MLP is then transposed and added to the original input c_d as the output of the spatial context information fusion module:
s_d = o_d^T + c_d
In the same way as for the training model, we also obtain the output s'_d of the spatial context information fusion module on the EMA model. Finally, the distance between the outputs of the spatial fusion modules on the training model and the EMA model is measured, further reducing the differences between the multiple domains and optimizing the Transformer network model.
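A short sketch of this spatial context information fusion module follows: transpose the domain-center matrix, pass it through a two-layer MLP, transpose back and add the residual. The hidden size and the use of MSE as the distance between the two branches are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpatialContextFusion(nn.Module):
    """Double-layer MLP over the transposed domain-center features with a residual add."""
    def __init__(self, num_domains, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(num_domains, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_domains))

    def forward(self, centers):            # centers: (num_domains, dim)
        o = self.mlp(centers.T)            # interact along the domain axis: o = W2·σ(W1·c^T)
        return o.T + centers               # s = o^T + c

# distance between the training-model and EMA-model outputs, e.g.:
# loss = nn.functional.mse_loss(fusion_train(centers_train), fusion_ema(centers_ema))
```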
Step 7: image features of the source domain and of the target domain are generated with the trained Transformer network and matched, so that the same pedestrian appearing on the target device can be found from the surveillance images of the source device.
After the optimal model is obtained, the image features acquired by the different devices are matched, and the similarity between every pair of image features is measured with the Euclidean distance, so that images of the same pedestrian are found across different devices and pedestrian re-identification is realized.
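A minimal sketch of this final matching step ranks gallery images by Euclidean distance to each query feature; the use of torch.cdist, the top-k retrieval and the hypothetical extract() call are implementation assumptions the patent leaves open.

```python
import torch

def match_pedestrians(query_feats, gallery_feats, topk=5):
    """Return, for every query feature, the indices of the closest gallery features."""
    dist = torch.cdist(query_feats, gallery_feats, p=2)   # pairwise Euclidean distances
    return dist.topk(topk, dim=1, largest=False).indices  # smallest distance = best match

# q = model_ema.extract(source_images)   # hypothetical feature extraction call
# g = model_ema.extract(target_images)
# ranks = match_pedestrians(q, g)
```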
The method of the invention was preliminarily compared with Transformer-based and CNN-based pedestrian re-identification algorithms on four data sets: the average accuracy of the proposed algorithm is 65.9%, that of the Transformer-based algorithm is 62.8%, and that of the CNN-based algorithm is 56.3%. The method of the invention therefore improves the accuracy of pedestrian re-identification.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (3)

1. An unsupervised multi-domain fusion adaptive pedestrian re-identification method, characterized in that the method comprises the following steps:
step 1: taking surveillance pedestrian pictures captured by a plurality of monitoring devices at different places and in different time periods as input of a Transformer network and a DBSCAN network, and dividing the pictures into a source domain and a target domain according to the amount of label information;
step 2: generating pseudo labels for the pedestrian data of the unlabeled target domain by using the DBSCAN network, so that the label data of the target domain are kept consistent with the label data of the source domain;
step 3: slicing and position-encoding the pedestrian images input from the source domain, building a Transformer network to obtain different degrees of attention on different pedestrian feature points, and retaining the trained model parameters;
step 4: slicing and position-encoding the pedestrian images of the target domain, and feeding them together with the source-domain images into the Transformer network, so that the different degrees of attention learned by the Transformer network on the source domain are transferred to the target domain; this step specifically comprises:
constructing two models with completely identical structures, one being the training model and the other being the EMA model;
slicing the image samples of the target domain, i.e. dividing the n-th picture of the target domain, denoted x_n^t, into m patches, then encoding each patch to obtain its word-embedding representation, denoted E_n^t, and computing the corresponding positional encoding as:
PE(pos, 2k) = sin(pos / 10000^{2k/d_m}),  PE(pos, 2k+1) = cos(pos / 10000^{2k/d_m})
where d_m denotes the dimension of the m-th patch after the input image is divided and pos denotes the position of the m-th patch, so that the target-domain input of the Transformer network can be expressed as:
z_n^t = E_n^t + P_n^t
where P_n^t denotes the m-patch position information of the n-th picture of the target domain;
feeding the encoded samples z_n^d of the source domain and of the target domain into the multi-head attention mechanism, where D_s and D_t denote the source-domain and target-domain data sample sets respectively, so that the Transformer network also forms different degrees of attention on the target-domain pictures and the attention is transferred, computed as:
Q = z_n^d W'_Q,  K = z_n^d W'_K,  V = z_n^d W'_V
f_n^d = softmax(Q K^T / sqrt(d_m)) V
where f_n^d is the sample feature extracted from the n-th image sample of the d-th domain, d ∈ {s, t}, with s denoting the source domain and t denoting the target domain, and W' denotes the projection matrix of the current training, updated in the following way:
W_ema ← α·W_ema + (1 − α)·W'
where W_ema denotes the optimal model parameter matrix and α is the weight of the model obtained from previous training iterations relative to the current one; when the number of training iterations is 0, W_ema is the parameter matrix obtained by pre-training; through continuous training and optimization, a training model that achieves the best effect on the same equipment is finally obtained;
step 5: constructing a complex graph convolutional neural network, taking the complex domain centers of the several domain data sets extracted by the Transformer network as input to the multi-domain-information-fused complex graph convolutional neural network, aligning the sample distributions of the different domains and reducing the differences between the data sets; this step specifically comprises:
from the n-th sample feature f_n^d obtained in step 4, where d denotes the d-th domain, generating the real-number domain center of each domain as:
c_d = (1/N) Σ_{n=1}^{N} f_n^d
where c_d denotes the real-number domain center of the d-th domain; mapping it to a complex feature space, each domain center being written as:
Re(u_d) = c_d W_re,  Im(u_d) = c_d W_im
where Re(·) denotes the operation of taking the real part of a feature in the complex feature space, Im(·) denotes the operation of taking the imaginary part of a feature in the complex feature space, u_d denotes the complex domain center of the d-th domain, and W_re and W_im are two trainable weights; constructing two complex graph convolutional neural networks, one for the training model and one for the EMA model, the complex graph convolutional neural network of the training model being denoted G and that of the EMA model being denoted G_ema;
for the complex graph convolutional neural network of the training model, its node v_d is the complex domain center of the d-th domain, and its adjacency matrix A represents the vector relationships between the complex domain centers, with entry A_ab relating the a-th domain and the b-th domain; the complex graph convolutional neural network of the training model is therefore updated in the following way:
U' = σ( D^{-1/2} (A + I) D^{-1/2} U W )
where the d-th row u_d of U is the complex domain center of the d-th domain, the d-th row u'_d of U' is the updated global complex-domain-center feature of the d-th domain, A denotes the adjacency matrix of the complex graph convolutional neural network, I denotes the complex identity matrix, D denotes the complex degree matrix, σ denotes a complex nonlinear activation function, and W is a learnable complex parameter;
for the complex graph convolutional neural network of the EMA model, computing the output with the same method as for the training model, then passing the output complex centers through a modulus operation, using the Euclidean distance to measure the distance between the complex domain centers updated by the two complex graph convolutional neural networks, and using an MSE loss function to further reduce the sample distribution differences between the different domains, thereby optimizing the Transformer network model;
step 6: constructing a spatial context information fusion module with a double-layer MLP structure, taking the real-number multi-domain center features extracted by the Transformer network as its input, transposing the input features and feeding them into the double-layer MLP structure for information interaction, then transposing the output of the double-layer MLP structure again and adding it to the input features to form the output;
step 7: generating image features of the source domain and the target domain with the trained Transformer network and matching them, so that the same pedestrian appearing on the target device can be found from the surveillance images of the source device.
2. The unsupervised multi-domain fusion adaptive pedestrian re-identification method according to claim 1, characterized in that step 3 specifically comprises:
partitioning each pedestrian picture in the source-domain data into patches, i.e. dividing the n-th picture of the i-th domain, denoted x_n^i, into m patches, then encoding each patch to obtain its word-embedding representation, denoted E_n^i;
obtaining the position information corresponding to each patch by positional encoding, computed as:
PE(pos, 2k) = sin(pos / 10000^{2k/d_m}),  PE(pos, 2k+1) = cos(pos / 10000^{2k/d_m})
where d_m denotes the dimension of the m-th patch after the input image is divided and pos denotes the position of the m-th patch, so that the input of the Transformer network can be expressed as:
z_n^i = E_n^i + P_n^i
where P_n^i denotes the m-patch position information of the n-th picture of the i-th domain;
the Transformer network contains two modules, one being an MLP and the other a multi-head attention mechanism; z_n^i is fed into the multi-head attention mechanism so that the network attends to different parts of the picture to different degrees, computed as:
Q = z_n^i W_Q,  K = z_n^i W_K,  V = z_n^i W_V
h_n^i = softmax(Q K^T / sqrt(d_m)) V
where W_Q, W_K and W_V denote the projection matrices in the pre-training model and h_n^i denotes the output feature of the attention mechanism for the sample image x_n^i, which is also the input of the MLP, so that the output of the Transformer is:
y_n^i = MLP(h_n^i)
where y_n^i denotes the output of the Transformer network, i.e. the predicted label of picture x_n^i; the cross-entropy loss between the predicted labels of the source domain and the real labels of the source domain is computed and optimized, finally yielding a well-performing Transformer network pre-training model.
3. The unsupervised multi-domain fusion adaptive pedestrian re-identification method according to claim 1, characterized in that step 6 specifically comprises: constructing two double-layer MLP structures, one for the training model and one for the EMA model;
for the training model, the real-number domain center c_d of the d-th domain obtained in step 5 is first transposed to obtain c_d^T, which is fed into the double-layer MLP whose output is:
o_d = W_2 · σ( W_1 · c_d^T )
where W_1 and W_2 denote two trainable weights and σ denotes a nonlinear activation function; the output o_d of the double-layer MLP is then transposed and added to the original input c_d as the output of the spatial context information fusion module:
s_d = o_d^T + c_d
the output s'_d of the spatial context information fusion module on the EMA model is obtained in the same way as for the training model; finally, the distance between the outputs of the spatial fusion modules on the training model and the EMA model is measured, further reducing the differences between the multiple domains and optimizing the Transformer network model.
CN202310125639.1A 2023-02-17 2023-02-17 Self-adaptive pedestrian re-identification method based on non-supervision multi-field fusion Active CN115830548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310125639.1A CN115830548B (en) 2023-02-17 2023-02-17 Self-adaptive pedestrian re-identification method based on non-supervision multi-field fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310125639.1A CN115830548B (en) 2023-02-17 2023-02-17 Self-adaptive pedestrian re-identification method based on non-supervision multi-field fusion

Publications (2)

Publication Number Publication Date
CN115830548A CN115830548A (en) 2023-03-21
CN115830548B true CN115830548B (en) 2023-05-05

Family

ID=85521672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310125639.1A Active CN115830548B (en) 2023-02-17 2023-02-17 Self-adaptive pedestrian re-identification method based on non-supervision multi-field fusion

Country Status (1)

Country Link
CN (1) CN115830548B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110720906A (en) * 2019-09-25 2020-01-24 上海联影智能医疗科技有限公司 Brain image processing method, computer device, and readable storage medium
CN111476168A (en) * 2020-04-08 2020-07-31 山东师范大学 Cross-domain pedestrian re-identification method and system based on three stages
CN112288042A (en) * 2020-12-18 2021-01-29 蚂蚁智信(杭州)信息技术有限公司 Updating method and device of behavior prediction system, storage medium and computing equipment
CN112446423A (en) * 2020-11-12 2021-03-05 昆明理工大学 Fast hybrid high-order attention domain confrontation network method based on transfer learning
CN114677646A (en) * 2022-04-06 2022-06-28 上海电力大学 Vision transform-based cross-domain pedestrian re-identification method
CN115050045A (en) * 2022-04-06 2022-09-13 上海电力大学 Vision MLP-based pedestrian re-identification method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220092420A1 (en) * 2020-09-21 2022-03-24 Intelligent Fusion Technology, Inc. Method, device, and storage medium for deep learning based domain adaptation with data fusion for aerial image data analysis

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110720906A (en) * 2019-09-25 2020-01-24 上海联影智能医疗科技有限公司 Brain image processing method, computer device, and readable storage medium
CN111476168A (en) * 2020-04-08 2020-07-31 山东师范大学 Cross-domain pedestrian re-identification method and system based on three stages
CN112446423A (en) * 2020-11-12 2021-03-05 昆明理工大学 Fast hybrid high-order attention domain confrontation network method based on transfer learning
CN112288042A (en) * 2020-12-18 2021-01-29 蚂蚁智信(杭州)信息技术有限公司 Updating method and device of behavior prediction system, storage medium and computing equipment
CN114677646A (en) * 2022-04-06 2022-06-28 上海电力大学 Vision transform-based cross-domain pedestrian re-identification method
CN115050045A (en) * 2022-04-06 2022-09-13 上海电力大学 Vision MLP-based pedestrian re-identification method

Also Published As

Publication number Publication date
CN115830548A (en) 2023-03-21

Similar Documents

Publication Publication Date Title
CN111400620B (en) User trajectory position prediction method based on space-time embedded Self-orientation
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
CN111291212B (en) Zero sample sketch image retrieval method and system based on graph convolution neural network
CN107506740B (en) Human body behavior identification method based on three-dimensional convolutional neural network and transfer learning model
CN110781838A (en) Multi-modal trajectory prediction method for pedestrian in complex scene
CN109858390A (en) The Activity recognition method of human skeleton based on end-to-end space-time diagram learning neural network
Zhao et al. Where are you heading? dynamic trajectory prediction with expert goal examples
CN116738911B (en) Wiring congestion prediction method and device and computer equipment
CN115423847B (en) Twin multi-modal target tracking method based on Transformer
CN113065409A (en) Unsupervised pedestrian re-identification method based on camera distribution difference alignment constraint
Yin et al. Automerge: A framework for map assembling and smoothing in city-scale environments
CN116485839A (en) Visual tracking method based on attention self-adaptive selection of transducer
CN115841683A (en) Light-weight pedestrian re-identification method combining multi-level features
CN115051925A (en) Time-space sequence prediction method based on transfer learning
CN116524197B (en) Point cloud segmentation method, device and equipment combining edge points and depth network
CN115830548B (en) Self-adaptive pedestrian re-identification method based on non-supervision multi-field fusion
CN115631513B (en) Transformer-based multi-scale pedestrian re-identification method
CN115830643A (en) Light-weight pedestrian re-identification method for posture-guided alignment
CN115797557A (en) Self-supervision 3D scene flow estimation method based on graph attention network
CN116030255A (en) System and method for three-dimensional point cloud semantic segmentation
CN115034459A (en) Pedestrian trajectory time sequence prediction method
CN112801179A (en) Twin classifier certainty maximization method for cross-domain complex visual task
CN117612214B (en) Pedestrian search model compression method based on knowledge distillation
Chen et al. Memory segment matching network based image geo-localization
Sheng et al. Learning a deep metric: a lightweight relation network for loop closure in complex industrial scenarios

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant