CN115830548B - Adaptive pedestrian re-identification method based on unsupervised multi-domain fusion - Google Patents

Adaptive pedestrian re-identification method based on unsupervised multi-domain fusion

Info

Publication number
CN115830548B
CN115830548B (application CN202310125639.1A)
Authority
CN
China
Prior art keywords
domain
complex
representing
model
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310125639.1A
Other languages
Chinese (zh)
Other versions
CN115830548A (en)
Inventor
贾明晖
于洁潇
张涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202310125639.1A priority Critical patent/CN115830548B/en
Publication of CN115830548A publication Critical patent/CN115830548A/en
Application granted granted Critical
Publication of CN115830548B publication Critical patent/CN115830548B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides an unsupervised multi-domain fusion adaptive pedestrian re-identification method. A Transformer is pre-trained on labeled surveillance pedestrian pictures captured by a plurality of monitoring devices at different places and in different time periods, and the trained network is then trained a second time on unlabeled surveillance pedestrian pictures. The same pedestrian can thus be found at different times on different devices, and the method is not limited to particular devices or places.

Description

Adaptive pedestrian re-identification method based on unsupervised multi-domain fusion
Technical Field
The invention belongs to the technical field of pedestrian re-identification, and particularly relates to an unsupervised multi-domain fusion adaptive pedestrian re-identification method.
Background
The purpose of pedestrian re-identification is to associate specific objects across different scenes and camera views; an essential component is the extraction of robust and discriminative features. The task has long been dominated by CNN-based methods, but these mainly focus on small discriminative regions, and their down-sampling operations (pooling and strided convolution) reduce the spatial resolution of the output feature map, which greatly weakens the ability to distinguish objects with similar appearances. Most attention-based methods are embedded in deep layers, prefer larger contiguous regions, and have difficulty extracting multiple diverse discriminable regions. Moreover, most existing methods are limited to particular devices and places, so their application range is limited and their matching accuracy is low.
Disclosure of Invention
In view of the above, the invention provides an unsupervised multi-domain fusion adaptive pedestrian re-identification method that finds the same pedestrian at different times across different devices, is not limited to particular devices or places, and has a wider application range and higher matching accuracy.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
An unsupervised multi-domain fusion adaptive pedestrian re-identification method comprises the following steps:
step 1: taking surveillance pedestrian pictures captured by a plurality of monitoring devices at different places and in different time periods as input of a Transformer network and a DBSCAN network, and dividing the pictures into a source domain and a target domain according to the amount of label information;
step 2: generating pseudo labels for the pedestrian data of the unlabeled target domain by using the DBSCAN network, so that the label data of the target domain are kept consistent with the label data of the source domain;
step 3: slicing and position-encoding the pedestrian images input from the source domain, building a Transformer network to obtain different degrees of attention on different pedestrian feature points, and retaining the trained model parameters;
step 4: slicing and position-encoding the pedestrian images of the target domain, and feeding them together with the source-domain images into the Transformer network, so that the different degrees of attention learned by the Transformer network on the source domain are transferred to the target domain;
step 5: constructing a complex graph convolutional neural network, taking the complex domain centers of the several domain data sets extracted by the Transformer network as input to the multi-domain-information-fused complex graph convolutional neural network, aligning the sample distributions of the different domains and reducing the differences between the data sets;
step 6: constructing a spatial context information fusion module with a double-layer MLP structure, taking the real-number multi-domain center features extracted by the Transformer network as its input, transposing the input features and feeding them into the double-layer MLP for information interaction, then transposing the output of the double-layer MLP again and adding it to the input features to form the output;
step 7: generating image features of the source domain and the target domain with the trained Transformer network and matching them, so that the same pedestrian appearing on the target device can be found from the surveillance images of the source device.
Further, step 3 specifically includes:
each pedestrian picture in the source-domain data is partitioned into patches; that is, the n-th picture of the i-th domain, denoted x_n^i, is divided into m patches, and each patch is then encoded to obtain its word-embedding representation, denoted E_n^i. The position information corresponding to each patch is obtained by positional encoding, computed as:
PE(pos, 2k) = sin(pos / 10000^{2k/d_m}),  PE(pos, 2k+1) = cos(pos / 10000^{2k/d_m})
where d_m denotes the dimension of the m-th patch after the input image is divided and pos denotes the position of the m-th patch; the input of the Transformer network can therefore be expressed as:
z_n^i = E_n^i + P_n^i
where P_n^i denotes the m-patch position information of the n-th picture of the i-th domain;
the Transformer network contains two modules, an MLP and a multi-head attention mechanism; z_n^i is fed into the multi-head attention mechanism so that the network attends to different parts of the picture to different degrees, computed as:
Q = z_n^i W_Q,  K = z_n^i W_K,  V = z_n^i W_V
h_n^i = softmax(Q K^T / sqrt(d_m)) V
where W_Q, W_K and W_V denote the projection matrices in the pre-training model, and h_n^i denotes the output feature of the attention mechanism for the sample image x_n^i, which is also the input of the MLP; the output of the Transformer is therefore:
y_n^i = MLP(h_n^i)
where y_n^i denotes the output of the Transformer network, i.e. the predicted label of picture x_n^i; the cross-entropy loss between the predicted labels of the source domain and the real labels of the source domain is computed and optimized, finally yielding a well-performing Transformer network pre-training model.
Further, step 4 specifically includes:
two models with completely identical structures are constructed, one being the training model and the other being the EMA model; the image samples of the target domain are sliced, i.e. the n-th picture of the target domain, denoted x_n^t, is divided into m patches, each patch is encoded to obtain its word-embedding representation, denoted E_n^t, and the corresponding positional encoding is computed as:
PE(pos, 2k) = sin(pos / 10000^{2k/d_m}),  PE(pos, 2k+1) = cos(pos / 10000^{2k/d_m})
where d_m denotes the dimension of the m-th patch after the input image is divided and pos denotes the position of the m-th patch; the target-domain input of the Transformer network can therefore be expressed as:
z_n^t = E_n^t + P_n^t
where P_n^t denotes the m-patch position information of the n-th picture of the target domain;
the encoded samples z_n^d of the source domain D_s and the target domain D_t are fed into the multi-head attention mechanism, where D_s and D_t denote the source-domain and target-domain data sample sets respectively, so that the Transformer network also forms different degrees of attention on the target-domain pictures and the attention is transferred, computed as:
Q = z_n^d W'_Q,  K = z_n^d W'_K,  V = z_n^d W'_V
f_n^d = softmax(Q K^T / sqrt(d_m)) V
where f_n^d is the sample feature extracted from the n-th image sample of the d-th domain, d ∈ {s, t}, with s denoting the source domain and t denoting the target domain, and W' denotes the projection matrix of the current training, updated in the following way:
W_ema ← α·W_ema + (1 − α)·W'
where W_ema denotes the optimal (EMA) model parameter matrix and α is the weight of the model obtained from previous training iterations relative to the current one; when the number of training iterations is 0, W_ema is the parameter matrix obtained by pre-training; through continuous training and optimization, a training model that achieves the best effect on the same equipment is finally obtained.
Further, step 5 specifically includes:
from the n-th sample feature f_n^d obtained in step 4, where d denotes the d-th domain, the real-number domain center of each domain is generated as:
c_d = (1/N) Σ_{n=1}^{N} f_n^d
where c_d denotes the real-number domain center of the d-th domain; it is mapped to a complex feature space, and each domain center is written as:
Re(u_d) = c_d W_re,  Im(u_d) = c_d W_im
where Re(·) denotes the operation of taking the real part of a feature in the complex feature space, Im(·) denotes the operation of taking the imaginary part of a feature in the complex feature space, u_d denotes the complex domain center of the d-th domain, and W_re and W_im are two trainable weights; two complex graph convolutional neural networks are constructed, one for the training model and one for the EMA model, the complex graph convolutional neural network of the training model being denoted G and that of the EMA model being denoted G_ema;
for the complex graph convolutional neural network of the training model, its node v_d is the complex domain center of the d-th domain, and its adjacency matrix A represents the vector relationships between the complex domain centers, with entry A_ab relating the a-th domain and the b-th domain; the complex graph convolutional neural network of the training model is therefore updated as:
U' = σ( D^{-1/2} (A + I) D^{-1/2} U W )
where the d-th row u_d of U is the complex domain center of the d-th domain, the d-th row u'_d of U' is the updated global complex-domain-center feature of the d-th domain, A denotes the adjacency matrix of the complex graph convolutional neural network, I denotes the complex identity matrix, D denotes the complex degree matrix, σ denotes a complex nonlinear activation function, and W is a learnable complex parameter;
for the complex graph convolutional neural network of the EMA model, the output is computed with the same method as for the training model; the output complex centers are then passed through a modulus operation, the Euclidean distance is used to measure the distance between the complex domain centers updated by the two complex graph convolutional neural networks, and an MSE loss function is used to further reduce the sample distribution differences between different domains, thereby optimizing the Transformer network model.
Further, step 6 specifically includes: two double-layer MLP structures are constructed, one for the training model and one for the EMA model;
for the training model, the real-number domain center c_d of the d-th domain obtained in step 5 is first transposed to obtain c_d^T, which is fed into the double-layer MLP whose output is:
o_d = W_2 · σ( W_1 · c_d^T )
where W_1 and W_2 denote two trainable weights and σ denotes a nonlinear activation function; the output o_d of the double-layer MLP is then transposed and added to the original input c_d as the output of the spatial context information fusion module:
s_d = o_d^T + c_d
the output s'_d of the spatial context information fusion module on the EMA model is obtained in the same way as for the training model; finally, the distance between the outputs of the spatial fusion modules on the training model and the EMA model is measured, further reducing the differences between the multiple domains and optimizing the Transformer network model.
Compared with the prior art, the adaptive pedestrian re-identification method based on unsupervised multi-domain fusion has the following advantages:
the invention provides a multi-source-domain Transformer and multi-domain information fusion technique that trains on data sets generated on different devices, so that the obtained sample features generalize better, solving the problem that the same pedestrian cannot be matched accurately because of inconsistent angles, shooting times and spatial positions across devices;
the invention constructs a complex graph convolutional neural network that uses a complex feature space to explore vector semantic associations, further reducing the inter-domain sample distribution difference at the vector-structure level;
the invention constructs a spatial context information fusion module that directly explores the spatial associations of the domain-center features, further reducing the inter-domain sample distribution difference at the spatial-context level.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of a first stage of training of the present invention;
FIG. 2 is a flow chart of the second stage of training of the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", etc. may explicitly or implicitly include one or more such feature. In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art in a specific case.
The invention will be described in detail below with reference to the drawings in connection with embodiments.
The invention provides a Transformer-based adaptive pedestrian re-identification method with unsupervised multi-domain fusion. Training of the model requires two stages. As shown in FIG. 1, in the first training stage, model pre-training is performed using pedestrian data captured by a plurality of different monitoring devices in different time periods and carrying a large number of labels, and the model parameters obtained by this training are retained. As shown in FIG. 2, in the second training stage, the model parameters of the first stage are imported into the network, and the model is fine-tuned using the same labeled data as in the first stage together with unlabeled pedestrian data captured by an additional target monitoring device, so that the same pedestrian can be found at different time periods and different places on the target device.
The invention discloses an unsupervised multi-domain fusion adaptive pedestrian re-identification method, which comprises the following steps:
step 1, taking the pictures of monitoring pedestrians of a plurality of monitoring devices at different places in different time periods as inputs of a transducer network and a DBSCAN network, and dividing the pictures into a source domain and a target domain according to the number of tag information.
Specifically, we take 4 groups of monitoring-device images as an example, of which 3 groups are fully labeled and 1 group is unlabeled. We regard the i-th group of N fully labeled monitoring-device images as source-domain data, defined as X_i = {x_1^i, x_2^i, ..., x_N^i}, each group containing N samples, where the data label corresponding to x_n^i is y_n^i. We regard the N unlabeled monitoring-device images as target-domain data, defined as X_t = {x_1^t, x_2^t, ..., x_N^t}, containing N samples.
Step 2: the DBSCAN network is used to generate pseudo labels for the pedestrian data of the unlabeled target domain, so that the label data of the target domain are kept consistent with the label data of the source domain.
Specifically, since the target-domain data contain no label data, the target-domain labels need to be generated by a clustering method. DBSCAN is a representative density-based clustering algorithm that defines a cluster as the maximal set of density-connected points. The target-domain data labels are obtained with DBSCAN as follows:
(1) Take any data point x_n^t in the target domain; the parameter pair (eps, MinPts) describes the tightness of the sample distribution in its neighborhood, where eps denotes the neighborhood distance threshold of a core point and MinPts denotes the minimum number of points required within the eps-neighborhood of a core point;
(2) If, for the parameters eps and MinPts, the selected data point x_n^t is a core point, find all data points that are density-reachable from x_n^t to form a cluster;
(3) If the selected data point x_n^t is an edge point, select another data point as the core point;
(4) Repeat steps (2) and (3) until all points have been processed.
Finally, each obtained cluster represents one category; the pseudo label of the n-th target-domain sample is therefore denoted ŷ_n^t, where the target domain contains N samples.
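For illustration only, the pseudo-label generation described above can be sketched with scikit-learn's DBSCAN applied to pre-extracted target-domain features; the feature extractor and the eps and min_samples values below are assumptions made for the sketch and are not specified by the patent.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def generate_pseudo_labels(target_features, eps=0.6, min_samples=4):
    """Cluster unlabeled target-domain features and return pseudo labels.

    target_features: (N, D) array of features extracted by the Transformer network.
    Samples labeled -1 are DBSCAN noise and can be excluded from training.
    """
    clusterer = DBSCAN(eps=eps, min_samples=min_samples, metric="euclidean")
    pseudo_labels = clusterer.fit_predict(np.asarray(target_features))
    keep = pseudo_labels != -1
    return pseudo_labels, keep

# example usage with a hypothetical feature extractor:
# feats = extract_features(transformer, target_images)
# labels, keep = generate_pseudo_labels(feats)
```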
Step 3: in the first training stage, the pedestrian images input from the source domain are sliced and position-encoded, a Transformer network is built to obtain different degrees of attention on different pedestrian feature points, and the trained model parameters are retained.
Specifically, we partition each pedestrian picture in the source-domain data into patches; that is, the n-th picture of the i-th domain, denoted x_n^i, is divided into m patches, and each patch is then encoded to obtain its word-embedding representation, denoted E_n^i. By positional encoding we can compute the position information corresponding to each patch, in the following way:
PE(pos, 2k) = sin(pos / 10000^{2k/d_m}),  PE(pos, 2k+1) = cos(pos / 10000^{2k/d_m})
where d_m denotes the dimension of the m-th patch after the input image is divided and pos denotes the position of the m-th patch. The input of the Transformer can therefore be expressed as:
z_n^i = E_n^i + P_n^i
where P_n^i denotes the m-patch position information of the n-th picture of the i-th domain. The Transformer network contains two modules, one being an MLP and the other a multi-head attention mechanism. We feed z_n^i into the multi-head attention mechanism so that the network attends to different parts of the picture to different degrees, computed as:
Q = z_n^i W_Q,  K = z_n^i W_K,  V = z_n^i W_V
h_n^i = softmax(Q K^T / sqrt(d_m)) V
where W_Q, W_K and W_V denote the projection matrices in the pre-trained model and h_n^i denotes the output feature of the attention mechanism, i.e. the feature of the sample image x_n^i, which is also the input of the MLP; the output of the Transformer is therefore:
y_n^i = MLP(h_n^i)
where y_n^i denotes the output of the Transformer network, i.e. the predicted label of picture x_n^i. The cross-entropy loss between the predicted labels and the real labels of the source domain is computed and optimized, finally yielding a well-performing Transformer network pre-training model.
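As a minimal sketch of the patch slicing, position information and attention-plus-MLP block described in this step, the following PyTorch module can be used; the image size, patch size, embedding dimension, number of heads, identity count and the use of a learned position parameter (rather than the sinusoidal encoding above) are illustrative assumptions, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

class PatchTransformerBlock(nn.Module):
    """Patch slicing, position information and one attention + MLP block (step 3 sketch)."""
    def __init__(self, img_size=256, patch=16, dim=768, heads=8, num_ids=1000):
        super().__init__()
        m = (img_size // patch) ** 2                       # number of patches per image
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # word embedding E
        self.pos = nn.Parameter(torch.zeros(1, m, dim))    # position information P (learned here)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.head = nn.Linear(dim, num_ids)                # identity classifier

    def forward(self, x):                                  # x: (B, 3, H, W)
        e = self.embed(x).flatten(2).transpose(1, 2)       # (B, m, dim)
        z = e + self.pos                                   # z = E + P
        h, _ = self.attn(z, z, z)                          # multi-head attention output h
        feat = self.mlp(h).mean(dim=1)                     # pooled sample feature
        return feat, self.head(feat)                       # feature and predicted label y

# pre-training on the labeled source domain:
# feat, logits = model(images)
# loss = nn.functional.cross_entropy(logits, labels)
```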
Step 4: in the second training stage, the pedestrian images of the target domain are sliced and position-encoded and are fed, together with the source-domain images, into the Transformer network, so that the different degrees of attention learned by the Transformer network on the source domain are transferred to the target domain.
In the second training stage we construct two models with exactly identical structures, one being the training model and the other the EMA model (i.e. the optimal model); both use the same initialization conditions. The purpose of the training model is to obtain the model parameters of the current training batch. The purpose of the EMA model is to maintain the average of the model parameters of the previous training batches and the current batch, thereby representing the optimal model.
Specifically, the image samples of the target domain are sliced; the n-th picture of the target domain, denoted x_n^t, is divided into m patches, each patch is encoded to obtain its word-embedding representation E_n^t, and the corresponding positional encoding is computed as:
PE(pos, 2k) = sin(pos / 10000^{2k/d_m}),  PE(pos, 2k+1) = cos(pos / 10000^{2k/d_m})
where d_m denotes the dimension of the m-th patch after the input image is divided and pos denotes the position of the m-th patch. The target-domain input of the Transformer can therefore be expressed as:
z_n^t = E_n^t + P_n^t
where P_n^t denotes the m-patch position information of the n-th picture of the target domain;
we feed the encoded samples z_n^d of the source domain D_s and the target domain D_t into the multi-head attention mechanism, where D_s and D_t denote the source-domain and target-domain data sample sets respectively, so that the network also forms different degrees of attention on the target-domain pictures and the attention is transferred, computed as:
Q = z_n^d W'_Q,  K = z_n^d W'_K,  V = z_n^d W'_V
f_n^d = softmax(Q K^T / sqrt(d_m)) V
where f_n^d is the sample feature extracted from the n-th image sample of the d-th domain, d ∈ {s, t}, with s denoting the source domain and t denoting the target domain, and W' denotes the projection matrix of the current training, updated in the following way:
W_ema ← α·W_ema + (1 − α)·W'
where W_ema denotes the optimal (EMA) model parameter matrix and α is the weight of the model obtained from previous training iterations relative to the current one. When the number of training iterations is 0, W_ema is the parameter matrix obtained by pre-training. Through continuous training and optimization, a training model that achieves the best results on all four different devices is finally obtained.
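A small sketch of the EMA parameter update used to maintain the optimal model in this step is shown below; the momentum value 0.999 is an assumption for illustration, not a value given by the patent.

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, train_model, alpha=0.999):
    """W_ema <- alpha * W_ema + (1 - alpha) * W_current, applied to every parameter."""
    for p_ema, p_cur in zip(ema_model.parameters(), train_model.parameters()):
        p_ema.mul_(alpha).add_(p_cur, alpha=1.0 - alpha)

# at iteration 0 the EMA model is a copy of the pre-trained parameters:
# ema_model = copy.deepcopy(pretrained_model)
# after every optimizer step: ema_update(ema_model, train_model)
```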
Step 5: a complex graph convolutional neural network module is constructed; the complex domain centers of the several domain data sets extracted by the Transformer network are taken as input to the multi-domain-information-fused complex graph convolutional neural network module, and the vector-space structural associations between domains are explored, so that the sample distributions of the different domains are further aligned and the differences between the data sets are reduced.
Specifically, from the n-th sample feature f_n^d obtained in step 4, where d denotes the d-th domain, the real-number domain center of each domain is generated as:
c_d = (1/N) Σ_{n=1}^{N} f_n^d
where c_d denotes the real-number domain center of the d-th domain. To further consider the structured relationships between domain centers and mine their semantic associations, we map them to a complex feature space to explore the vector semantic associations between domain centers; that is, each domain center is written as:
Re(u_d) = c_d W_re,  Im(u_d) = c_d W_im
where Re(·) denotes the operation of taking the real part of a feature in the complex feature space, Im(·) denotes the operation of taking the imaginary part of a feature in the complex feature space, u_d denotes the complex domain center of the d-th domain, and W_re and W_im are two trainable weights. Two complex graph convolutional neural networks are constructed, one for the training model and one for the EMA model; the complex graph convolutional neural network of the training model is denoted G and that of the EMA model is denoted G_ema.
For the complex graph convolutional neural network of the training model, its node v_d is the complex domain center of the d-th domain, and its adjacency matrix A represents the vector relationships between the complex domain centers, with entry A_ab relating the a-th domain and the b-th domain. The graph convolutional neural network is therefore updated in the following way:
U' = σ( D^{-1/2} (A + I) D^{-1/2} U W )
where the d-th row u_d of U is the complex domain center of the d-th domain, the d-th row u'_d of U' is the updated global complex-domain-center feature of the d-th domain, A denotes the adjacency matrix of the complex graph convolutional neural network, I denotes the complex identity matrix, D denotes the complex degree matrix, σ denotes a complex nonlinear activation function, and W is a learnable complex parameter.
For the complex graph convolutional neural network of the EMA model, the calculation is identical to that of the training model, yielding the corresponding updated complex domain centers; after the output complex centers are passed through a modulus operation, the Euclidean distance is used to measure the distance between the complex domain centers updated by the two complex graph convolutional neural networks, and an MSE loss function is used to further reduce the sample distribution differences between the different domains, thereby optimizing the Transformer network model.
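The following is a compact sketch of the complex graph convolution and the center-alignment loss of this step, using PyTorch complex tensors; the adjacency construction (similarity between complex centers with self-loops), the activation, and the layer sizes are assumptions made for the sketch, since the patent does not spell them out.

```python
import torch
import torch.nn as nn

class ComplexDomainGCN(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_re = nn.Linear(dim, dim, bias=False)   # W_re: real-part mapping
        self.w_im = nn.Linear(dim, dim, bias=False)   # W_im: imaginary-part mapping
        self.w_gcn = nn.Parameter(torch.randn(dim, dim, dtype=torch.cfloat) * 0.02)

    def forward(self, centers):                        # centers: (D, dim) real domain centers
        u = torch.complex(self.w_re(centers), self.w_im(centers))    # complex centers U
        # assumed adjacency: similarity between complex centers, plus self-loops (A + I)
        sim = torch.abs(u @ u.conj().T)
        a_hat = sim / sim.norm(dim=1, keepdim=True) + torch.eye(len(u))
        d_inv = torch.diag(a_hat.sum(1).pow(-0.5))                    # D^{-1/2}
        u_out = (d_inv @ a_hat @ d_inv).to(u.dtype) @ u @ self.w_gcn  # GCN update
        return torch.tanh(u_out.real) + 1j * torch.tanh(u_out.imag)   # complex activation

def center_alignment_loss(gcn_train, gcn_ema, centers_train, centers_ema):
    """MSE between the moduli of the updated complex centers of the two branches."""
    return nn.functional.mse_loss(torch.abs(gcn_train(centers_train)),
                                  torch.abs(gcn_ema(centers_ema)))
```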
Step 6: a spatial context information fusion module with a double-layer MLP structure is constructed; the real-number multi-domain center features extracted by the Transformer network are taken as its input, the input features are transposed and fed into the double-layer MLP for information interaction, and the output of the double-layer MLP is transposed again and added to the input features, exploring the associations among the dimensions of the domain-center features of each domain, i.e. the direct spatial associations between the domain centers, and thus further aligning the multi-domain sample distributions in space.
Specifically, we construct two double-layer MLP structures, one for the training model and one for the EMA model. For the training model, the real-number domain center c_d of the d-th domain obtained in step 5 is first transposed to obtain c_d^T, which is fed into the double-layer MLP whose output is:
o_d = W_2 · σ( W_1 · c_d^T )
where W_1 and W_2 denote two trainable weights and σ denotes a nonlinear activation function. The output o_d of the double-layer MLP is then transposed and added to the original input c_d as the output of the spatial context information fusion module:
s_d = o_d^T + c_d
In the same way as for the training model, we also obtain the output s'_d of the spatial context information fusion module on the EMA model. Finally, the distance between the outputs of the spatial fusion modules on the training model and the EMA model is measured, further reducing the differences between the multiple domains and optimizing the Transformer network model.
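A short sketch of this spatial context information fusion module follows: transpose the domain-center matrix, pass it through a two-layer MLP, transpose back and add the residual. The hidden size and the use of MSE as the distance between the two branches are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpatialContextFusion(nn.Module):
    """Double-layer MLP over the transposed domain-center features with a residual add."""
    def __init__(self, num_domains, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(num_domains, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_domains))

    def forward(self, centers):            # centers: (num_domains, dim)
        o = self.mlp(centers.T)            # interact along the domain axis: o = W2·σ(W1·c^T)
        return o.T + centers               # s = o^T + c

# distance between the training-model and EMA-model outputs, e.g.:
# loss = nn.functional.mse_loss(fusion_train(centers_train), fusion_ema(centers_ema))
```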
Step 7: image features of the source domain and of the target domain are generated with the trained Transformer network and matched, so that the same pedestrian appearing on the target device can be found from the surveillance images of the source device.
After the optimal model is obtained, the image features acquired by the different devices are matched, and the similarity between every pair of image features is measured with the Euclidean distance, so that images of the same pedestrian are found across different devices and pedestrian re-identification is realized.
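A minimal sketch of this final matching step ranks gallery images by Euclidean distance to each query feature; the use of torch.cdist, the top-k retrieval and the hypothetical extract() call are implementation assumptions the patent leaves open.

```python
import torch

def match_pedestrians(query_feats, gallery_feats, topk=5):
    """Return, for every query feature, the indices of the closest gallery features."""
    dist = torch.cdist(query_feats, gallery_feats, p=2)   # pairwise Euclidean distances
    return dist.topk(topk, dim=1, largest=False).indices  # smallest distance = best match

# q = model_ema.extract(source_images)   # hypothetical feature extraction call
# g = model_ema.extract(target_images)
# ranks = match_pedestrians(q, g)
```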
The method of the invention was preliminarily compared with Transformer-based and CNN-based pedestrian re-identification algorithms on four data sets: the average accuracy of the proposed algorithm is 65.9%, that of the Transformer-based algorithm is 62.8%, and that of the CNN-based algorithm is 56.3%. The method of the invention therefore improves the accuracy of pedestrian re-identification.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (3)

1. An unsupervised multi-domain fusion adaptive pedestrian re-identification method, characterized in that the method comprises the following steps:
step 1: taking surveillance pedestrian pictures captured by a plurality of monitoring devices at different places and in different time periods as input of a Transformer network and a DBSCAN network, and dividing the pictures into a source domain and a target domain according to the amount of label information;
step 2: generating pseudo labels for the pedestrian data of the unlabeled target domain by using the DBSCAN network, so that the label data of the target domain are kept consistent with the label data of the source domain;
step 3: slicing and position-encoding the pedestrian images input from the source domain, building a Transformer network to obtain different degrees of attention on different pedestrian feature points, and retaining the trained model parameters;
step 4: slicing and position-encoding the pedestrian images of the target domain, and feeding them together with the source-domain images into the Transformer network, so that the different degrees of attention learned by the Transformer network on the source domain are transferred to the target domain; this step specifically comprises:
constructing two models with completely identical structures, one being the training model and the other being the EMA model;
slicing the image samples of the target domain, i.e. dividing the n-th picture of the target domain, denoted x_n^t, into m patches, then encoding each patch to obtain its word-embedding representation, denoted E_n^t, and computing the corresponding positional encoding as:
PE(pos, 2k) = sin(pos / 10000^{2k/d_m}),  PE(pos, 2k+1) = cos(pos / 10000^{2k/d_m})
where d_m denotes the dimension of the m-th patch after the input image is divided and pos denotes the position of the m-th patch, so that the target-domain input of the Transformer network can be expressed as:
z_n^t = E_n^t + P_n^t
where P_n^t denotes the m-patch position information of the n-th picture of the target domain;
feeding the encoded samples z_n^d of the source domain and of the target domain into the multi-head attention mechanism, where D_s and D_t denote the source-domain and target-domain data sample sets respectively, so that the Transformer network also forms different degrees of attention on the target-domain pictures and the attention is transferred, computed as:
Q = z_n^d W'_Q,  K = z_n^d W'_K,  V = z_n^d W'_V
f_n^d = softmax(Q K^T / sqrt(d_m)) V
where f_n^d is the sample feature extracted from the n-th image sample of the d-th domain, d ∈ {s, t}, with s denoting the source domain and t denoting the target domain, and W' denotes the projection matrix of the current training, updated in the following way:
W_ema ← α·W_ema + (1 − α)·W'
where W_ema denotes the optimal model parameter matrix and α is the weight of the model obtained from previous training iterations relative to the current one; when the number of training iterations is 0, W_ema is the parameter matrix obtained by pre-training; through continuous training and optimization, a training model that achieves the best effect on the same equipment is finally obtained;
step 5: constructing a complex graph convolutional neural network, taking the complex domain centers of the several domain data sets extracted by the Transformer network as input to the multi-domain-information-fused complex graph convolutional neural network, aligning the sample distributions of the different domains and reducing the differences between the data sets; this step specifically comprises:
from the n-th sample feature f_n^d obtained in step 4, where d denotes the d-th domain, generating the real-number domain center of each domain as:
c_d = (1/N) Σ_{n=1}^{N} f_n^d
where c_d denotes the real-number domain center of the d-th domain; mapping it to a complex feature space, each domain center being written as:
Re(u_d) = c_d W_re,  Im(u_d) = c_d W_im
where Re(·) denotes the operation of taking the real part of a feature in the complex feature space, Im(·) denotes the operation of taking the imaginary part of a feature in the complex feature space, u_d denotes the complex domain center of the d-th domain, and W_re and W_im are two trainable weights; constructing two complex graph convolutional neural networks, one for the training model and one for the EMA model, the complex graph convolutional neural network of the training model being denoted G and that of the EMA model being denoted G_ema;
for the complex graph convolutional neural network of the training model, its node v_d is the complex domain center of the d-th domain, and its adjacency matrix A represents the vector relationships between the complex domain centers, with entry A_ab relating the a-th domain and the b-th domain; the complex graph convolutional neural network of the training model is therefore updated in the following way:
U' = σ( D^{-1/2} (A + I) D^{-1/2} U W )
where the d-th row u_d of U is the complex domain center of the d-th domain, the d-th row u'_d of U' is the updated global complex-domain-center feature of the d-th domain, A denotes the adjacency matrix of the complex graph convolutional neural network, I denotes the complex identity matrix, D denotes the complex degree matrix, σ denotes a complex nonlinear activation function, and W is a learnable complex parameter;
for the complex graph convolutional neural network of the EMA model, computing the output with the same method as for the training model, then passing the output complex centers through a modulus operation, using the Euclidean distance to measure the distance between the complex domain centers updated by the two complex graph convolutional neural networks, and using an MSE loss function to further reduce the sample distribution differences between the different domains, thereby optimizing the Transformer network model;
step 6: constructing a spatial context information fusion module with a double-layer MLP structure, taking the real-number multi-domain center features extracted by the Transformer network as its input, transposing the input features and feeding them into the double-layer MLP structure for information interaction, then transposing the output of the double-layer MLP structure again and adding it to the input features to form the output;
step 7: generating image features of the source domain and the target domain with the trained Transformer network and matching them, so that the same pedestrian appearing on the target device can be found from the surveillance images of the source device.
2. The unsupervised multi-domain fusion adaptive pedestrian re-identification method according to claim 1, characterized in that step 3 specifically comprises:
partitioning each pedestrian picture in the source-domain data into patches, i.e. dividing the n-th picture of the i-th domain, denoted x_n^i, into m patches, then encoding each patch to obtain its word-embedding representation, denoted E_n^i;
obtaining the position information corresponding to each patch by positional encoding, computed as:
PE(pos, 2k) = sin(pos / 10000^{2k/d_m}),  PE(pos, 2k+1) = cos(pos / 10000^{2k/d_m})
where d_m denotes the dimension of the m-th patch after the input image is divided and pos denotes the position of the m-th patch, so that the input of the Transformer network can be expressed as:
z_n^i = E_n^i + P_n^i
where P_n^i denotes the m-patch position information of the n-th picture of the i-th domain;
the Transformer network contains two modules, one being an MLP and the other a multi-head attention mechanism; z_n^i is fed into the multi-head attention mechanism so that the network attends to different parts of the picture to different degrees, computed as:
Q = z_n^i W_Q,  K = z_n^i W_K,  V = z_n^i W_V
h_n^i = softmax(Q K^T / sqrt(d_m)) V
where W_Q, W_K and W_V denote the projection matrices in the pre-training model and h_n^i denotes the output feature of the attention mechanism for the sample image x_n^i, which is also the input of the MLP, so that the output of the Transformer is:
y_n^i = MLP(h_n^i)
where y_n^i denotes the output of the Transformer network, i.e. the predicted label of picture x_n^i; the cross-entropy loss between the predicted labels of the source domain and the real labels of the source domain is computed and optimized, finally yielding a well-performing Transformer network pre-training model.
3. The unsupervised multi-domain fusion adaptive pedestrian re-identification method according to claim 1, characterized in that step 6 specifically comprises: constructing two double-layer MLP structures, one for the training model and one for the EMA model;
for the training model, the real-number domain center c_d of the d-th domain obtained in step 5 is first transposed to obtain c_d^T, which is fed into the double-layer MLP whose output is:
o_d = W_2 · σ( W_1 · c_d^T )
where W_1 and W_2 denote two trainable weights and σ denotes a nonlinear activation function; the output o_d of the double-layer MLP is then transposed and added to the original input c_d as the output of the spatial context information fusion module:
s_d = o_d^T + c_d
the output s'_d of the spatial context information fusion module on the EMA model is obtained in the same way as for the training model; finally, the distance between the outputs of the spatial fusion modules on the training model and the EMA model is measured, further reducing the differences between the multiple domains and optimizing the Transformer network model.
CN202310125639.1A 2023-02-17 2023-02-17 Self-adaptive pedestrian re-identification method based on non-supervision multi-field fusion Active CN115830548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310125639.1A CN115830548B (en) 2023-02-17 2023-02-17 Self-adaptive pedestrian re-identification method based on non-supervision multi-field fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310125639.1A CN115830548B (en) 2023-02-17 2023-02-17 Self-adaptive pedestrian re-identification method based on non-supervision multi-field fusion

Publications (2)

Publication Number Publication Date
CN115830548A CN115830548A (en) 2023-03-21
CN115830548B true CN115830548B (en) 2023-05-05

Family

ID=85521672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310125639.1A Active CN115830548B (en) 2023-02-17 2023-02-17 Self-adaptive pedestrian re-identification method based on non-supervision multi-field fusion

Country Status (1)

Country Link
CN (1) CN115830548B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110720906A (en) * 2019-09-25 2020-01-24 上海联影智能医疗科技有限公司 Brain image processing method, computer device, and readable storage medium
CN111476168A (en) * 2020-04-08 2020-07-31 山东师范大学 Cross-domain pedestrian re-identification method and system based on three stages
CN112288042A (en) * 2020-12-18 2021-01-29 蚂蚁智信(杭州)信息技术有限公司 Updating method and device of behavior prediction system, storage medium and computing equipment
CN112446423A (en) * 2020-11-12 2021-03-05 昆明理工大学 Fast hybrid high-order attention domain confrontation network method based on transfer learning
CN114677646A (en) * 2022-04-06 2022-06-28 上海电力大学 Vision transform-based cross-domain pedestrian re-identification method
CN115050045A (en) * 2022-04-06 2022-09-13 上海电力大学 Vision MLP-based pedestrian re-identification method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220092420A1 (en) * 2020-09-21 2022-03-24 Intelligent Fusion Technology, Inc. Method, device, and storage medium for deep learning based domain adaptation with data fusion for aerial image data analysis

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110720906A (en) * 2019-09-25 2020-01-24 上海联影智能医疗科技有限公司 Brain image processing method, computer device, and readable storage medium
CN111476168A (en) * 2020-04-08 2020-07-31 山东师范大学 Cross-domain pedestrian re-identification method and system based on three stages
CN112446423A (en) * 2020-11-12 2021-03-05 昆明理工大学 Fast hybrid high-order attention domain confrontation network method based on transfer learning
CN112288042A (en) * 2020-12-18 2021-01-29 蚂蚁智信(杭州)信息技术有限公司 Updating method and device of behavior prediction system, storage medium and computing equipment
CN114677646A (en) * 2022-04-06 2022-06-28 上海电力大学 Vision transform-based cross-domain pedestrian re-identification method
CN115050045A (en) * 2022-04-06 2022-09-13 上海电力大学 Vision MLP-based pedestrian re-identification method

Also Published As

Publication number Publication date
CN115830548A (en) 2023-03-21

Similar Documents

Publication Publication Date Title
CN111400620B (en) User trajectory position prediction method based on space-time embedded Self-orientation
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
CN111291212B (en) Zero sample sketch image retrieval method and system based on graph convolution neural network
CN107506740B (en) Human body behavior identification method based on three-dimensional convolutional neural network and transfer learning model
CN110781838A (en) Multi-modal trajectory prediction method for pedestrian in complex scene
CN109858390A (en) The Activity recognition method of human skeleton based on end-to-end space-time diagram learning neural network
Zhao et al. Where are you heading? dynamic trajectory prediction with expert goal examples
CN116738911B (en) Wiring congestion prediction method and device and computer equipment
CN115423847B (en) Twin multi-modal target tracking method based on Transformer
CN113065409A (en) Unsupervised pedestrian re-identification method based on camera distribution difference alignment constraint
Yin et al. Automerge: A framework for map assembling and smoothing in city-scale environments
CN116485839A (en) Visual tracking method based on attention self-adaptive selection of transducer
CN115841683A (en) Light-weight pedestrian re-identification method combining multi-level features
CN115051925A (en) Time-space sequence prediction method based on transfer learning
CN116524197B (en) Point cloud segmentation method, device and equipment combining edge points and depth network
CN115830548B (en) Self-adaptive pedestrian re-identification method based on non-supervision multi-field fusion
CN115631513B (en) Transformer-based multi-scale pedestrian re-identification method
CN115830643A (en) Light-weight pedestrian re-identification method for posture-guided alignment
CN115797557A (en) Self-supervision 3D scene flow estimation method based on graph attention network
CN116030255A (en) System and method for three-dimensional point cloud semantic segmentation
CN115034459A (en) Pedestrian trajectory time sequence prediction method
CN112801179A (en) Twin classifier certainty maximization method for cross-domain complex visual task
CN117612214B (en) Pedestrian search model compression method based on knowledge distillation
Chen et al. Memory segment matching network based image geo-localization
Sheng et al. Learning a deep metric: a lightweight relation network for loop closure in complex industrial scenarios

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant