WO2023082561A1

WO2023082561A1 - Person re-identification method and system, and electronic device and storage medium

Info

Publication number: WO2023082561A1
Application number: PCT/CN2022/090217
Authority: WO
Inventors: 王立; 郭振华; 范宝余; 赵雅倩; 李仁刚
Original assignee: 苏州浪潮智能科技有限公司
Priority date: 2021-11-15
Filing date: 2022-04-29
Publication date: 2023-05-19
Also published as: CN114299442A

Abstract

A person re-identification method, comprising: constructing an auxiliary training model and a target model, which are based on convolutional neural networks; determining loss functions of the auxiliary training model and the target model, and training the auxiliary training model and the target model by using the loss functions; after the training of the auxiliary training model is completed, migrating knowledge of the auxiliary training model to the target model, so as to obtain a person re-identification model; inputting a person image into the person re-identification model to obtain an embedding layer feature of the person image; and performing similarity comparison on the embedding layer feature of the person image and an embedding layer feature of an image to be queried, and outputting a person re-identification result according to a similarity comparison result. By means of the method, the accuracy of person re-identification can be improved without increasing a parameter amount and a calculation amount.

Description

A pedestrian re-identification method, system, electronic device and storage medium

This application claims the priority of the Chinese patent application submitted to the China Patent Office on November 15, 2021, with the application number 202111344388.3, and the title of the invention is "A Pedestrian Re-identification Method, System, Electronic Device and Storage Medium", the entire content of which Incorporated in this application by reference.

technical field

The present application relates to the technical field of deep learning, in particular to a pedestrian re-identification method and system, an electronic device and a storage medium.

Background technique

Deep learning techniques can solve problems in the field of computer vision such as image classification, image segmentation and object detection. With the continuous development of deep learning technology, pedestrian re-identification technology has also made great progress.

Re-identification of pedestrians (Re-ID) is an important image recognition technology, which is widely used in public security systems, traffic supervision and other fields. Pedestrian re-identification searches cameras distributed in different locations to determine whether pedestrians in different camera fields of view are the same pedestrian. In order to further improve network performance, related technologies usually improve the accuracy of pedestrian re-identification technology by building a more complex network structure. However, deeper, wider or more complex networks usually bring a surge in the amount of parameters and calculations. The increase in the amount of parameters is not conducive to the storage and deployment of portable devices, and the increase in the amount of calculations is not conducive to real-time requirements. application in the scene.

Therefore, how to improve the accuracy of pedestrian re-identification without increasing the amount of parameters and calculation is a technical problem that those skilled in the art need to solve.

Contents of the invention

The purpose of this application is to provide a method and system for pedestrian re-identification, an electronic device and a storage medium, which can improve the accuracy of pedestrian re-identification without increasing the amount of parameters and computation.

In order to solve the above technical problems, this application provides a pedestrian re-identification method, the pedestrian re-identification method includes:

Build auxiliary training models and target models based on convolutional neural networks;

determining a loss function of the auxiliary training model and the target model, and using the loss function to train the auxiliary training model and the target model;

After the auxiliary training model is trained, the knowledge of the auxiliary training model is transferred to the target model to obtain a pedestrian re-identification model;

Input the pedestrian image to the pedestrian re-identification model to obtain the embedded layer features of the pedestrian image;

Comparing the features of the embedded layer of the pedestrian image with the embedded layer of the image to be queried for similarity comparison, and outputting a pedestrian re-identification result according to the similarity comparison result.

Optionally, construct an auxiliary training model and a target model based on a convolutional neural network, including:

Constructing the auxiliary training model comprising at least two convolutional neural networks, constructing the target model comprising at least two convolutional neural networks;

Or, using a convolutional neural network comprising at least two head networks to construct the auxiliary training model, and utilizing a convolutional neural network comprising at least two head networks to construct the target model;

Wherein, the head network includes a pooling layer, an embedding layer, a fully connected layer, an output layer and a softmax layer.

Optionally, determining the loss function of the auxiliary training model and the target model includes:

The cross-entropy loss function of the convolutional neural network is provided, and the cross-entropy loss function is used to calculate the cross-entropy loss of each of the convolutional neural networks;

A feature similarity loss function of the convolutional neural network is provided, and the feature similarity loss function is used to calculate the feature similarity loss between any two convolutional neural networks in the at least two convolutional neural networks;

A class center loss function of the convolutional neural network is provided, and the class center loss function is used to calculate the class center loss of each of the convolutional neural networks;

A loss function that constrains the class center distance of the convolutional neural network is provided, and the loss function that constrains the class center distance is used to calculate each of the convolutional neural networks that constrains the class center distance. loss;

The loss functions of the auxiliary training model and the target model are determined according to the cross-entropy loss function, the feature similarity loss function, the class center loss function, and the loss function constraining class center distances.

Optionally, the feature similarity loss function is:

Among them, L _m represents the feature similarity loss, N represents the number of samples, n represents the nth input sample, u and v represent the uth network and the vth network,

represents the output of the embedding layer of the nth input sample of the uth network,

Represents the output of the embedding layer for the nth input sample of the vth network.

Optionally, the class center loss function is:

in,

Represents the class center loss of the u-th network, N represents the number of samples, n represents the n-th input sample, u represents the u-th network,

Represents the embedding feature of the nth sample of the uth network, and the category of the nth sample is the c class,

represent

The corresponding class center, that is,

Represents the class center of the c-th class of embedding layer features of the u-th network.

Optionally, the pedestrian re-identification method also includes:

Perform weighted calculations on the latest determined class center and the current output embedding layer features to obtain the updated class center, that is, through the following formula:

in,

Represents the updated class center

represent

The corresponding class center, that is,

Represents the class center of the c-th class of the embedding layer features of the u-th network, and α and β represent weighted values.

Optionally, before weighting the last determined class center and the current output embedding layer features, it also includes:

Judging whether the feature classification corresponding to the currently output embedding layer feature is correct;

If yes, then enter the step of weighting the last determined class center and the current output embedding layer features.

Optionally, said constraining the class center distance includes:

Through difficult sample mining, find the minimum inter-class difference of each class center;

Using the minimum inter-class difference, constrain the class center distance.

Optionally, the loss function constraining the class center distance is:

in,

Represents the loss of the u-th convolutional neural network that constrains the class center distance, C represents the number of categories of samples,

represents the class center of the i-th class of the u-th network,

representative distance

nearest class center.

Optionally, the determination of the auxiliary training model and the The loss function of the target model includes:

Optionally, add the cross-entropy loss function, the feature similarity loss function, the class center loss function, and the loss function constraining the class center distance to obtain the auxiliary training model and the The loss function of the target model; wherein, the loss calculated by the auxiliary training model and the loss function of the target model is the cross-entropy loss function, the feature similarity loss function, the class center loss function, and the sum of the losses respectively calculated by the loss function that constrains the class center distance.

Optionally, using the loss function to train the auxiliary training model and the target model includes:

Step a: Initialize the weights of each network layer in the model to be trained, wherein the model to be trained is any one of the auxiliary training model and the template model;

Step b: select training data, input the training data to the model to be trained, propagate the training data forward in the model to be trained so that the training data passes through the various network layers in turn, and output Propagate the output value forward;

Step c: using a loss function to obtain the error between the forward propagation output value and the target value;

Step d: backpropagating the error in the model to be trained to obtain the backpropagation error of each network layer;

Step e: updating the weights of each network layer based on the backpropagation error;

Step f: Repeat steps b to e, and when the error is less than an error threshold, end the training of the model to be trained, or end the training of the model to be trained when the repetition reaches a specified number of times .

Optionally, the construction of an auxiliary training model and a target model based on a convolutional neural network includes:

Constructing the auxiliary training model and the target model based on a convolutional neural network according to preset rules; wherein, the preset rule is that the model complexity of the auxiliary training model is higher than the model complexity of the target model.

The present application also provides a pedestrian re-identification system, which includes:

Model building blocks for constructing convolutional neural network-based auxiliary training models and target models;

A model training module, configured to determine a loss function of the auxiliary training model and the target model, and use the loss function to train the auxiliary training model and the target model;

A knowledge transfer module, configured to transfer the knowledge of the auxiliary training model to the target model to obtain a pedestrian re-identification model after the training of the auxiliary training model is completed;

A feature extraction module, configured to input a pedestrian image to the pedestrian re-identification model to obtain the embedded layer features of the pedestrian image;

The pedestrian re-identification module is used to compare the similarity between the embedded layer features of the pedestrian image and the embedded layer of the image to be queried, and output the pedestrian re-identification result according to the similarity comparison result.

The present application also provides a storage medium on which a computer program is stored, and when the computer program is executed, the steps performed by the above-mentioned pedestrian re-identification method are realized.

The present application also provides an electronic device, including a memory and a processor, the memory stores a computer program, and the processor implements the steps performed by the above pedestrian re-identification method when calling the computer program in the memory.

The present application provides a pedestrian re-identification method, including: constructing an auxiliary training model and a target model based on a convolutional neural network; determining the loss function of the auxiliary training model and the target model, and using the loss function to train the The auxiliary training model and the target model; after the training of the auxiliary training model is completed, transfer the knowledge of the auxiliary training model to the target model to obtain a pedestrian re-identification model; input pedestrians to the pedestrian re-identification model image to obtain the embedded layer features of the pedestrian image; comparing the similarity between the embedded layer features of the pedestrian image and the embedded layer of the image to be queried, and outputting a pedestrian re-identification result according to the similarity comparison result.

The application constructs an auxiliary training model and a target model based on a convolutional neural network, and determines the loss functions of the auxiliary training model and the target model, and then uses the loss function to train the auxiliary training model and the target model. After the auxiliary training model is trained, the application transfers the knowledge learned in the auxiliary training model to the target model through knowledge transfer to obtain a pedestrian re-identification model. Since the pedestrian re-identification model includes knowledge learned from the auxiliary training model and the target model, the accuracy of the pedestrian re-identification model can be improved without additional reasoning costs. Therefore, the present application can improve the accuracy of pedestrian re-identification without increasing the amount of parameters and computation. At the same time, the present application also provides a pedestrian re-identification system, an electronic device and a storage medium, which have the above-mentioned beneficial effects and will not be repeated here.

Description of drawings

In order to illustrate the embodiments of the present application more clearly, the following will briefly introduce the accompanying drawings used in the embodiments. Obviously, the accompanying drawings in the following description are only some embodiments of the present application. As far as people are concerned, other drawings can also be obtained based on these drawings on the premise of not paying creative work.

FIG. 1 is a flowchart of a pedestrian re-identification method provided by an embodiment of the present application;

Fig. 2 is the schematic diagram of the first kind of model provided by the embodiment of the present application;

FIG. 3 is a schematic diagram of the second model provided by the embodiment of the present application;

Fig. 4 is a schematic diagram of a model retention result provided by the embodiment of the present application;

FIG. 5 is a schematic diagram of a pedestrian re-identification application provided by the embodiment of the present application;

FIG. 6 is a schematic structural diagram of a pedestrian re-identification system provided by an embodiment of the present application.

Detailed ways

In order to make the purposes, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below in conjunction with the drawings in the embodiments of the present application. Obviously, the described embodiments It is a part of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the scope of protection of this application.

The process described in the above-mentioned embodiments will be described below through embodiments in practical applications.

With the continuous development of deep learning, deep learning networks have achieved remarkable performance in various fields. In order to further improve network performance, scholars usually continue to improve its performance by constructing more complex network structures. However, improving network performance in this way has the following disadvantages: (1) Deeper, wider or more complex networks usually bring a surge in the number of parameters, which is not conducive to the storage and deployment of portable devices. For example, to realize the deployment of a real-time pedestrian detection and recognition program in a network camera, the network needs to have a small amount of parameters (easy to store) and a high recognition accuracy. (2) Deeper, wider or more complex networks usually increase the amount of calculation, which is not conducive to the application of scenarios with high real-time requirements. For example: retrieval and tracking of criminal suspects. A large calculation delay will cause the entire system to miss the best opportunity, and have a negative impact on system functions. Therefore, how to improve the performance of the network without increasing the amount of parameters and calculations has become a problem that needs to be solved.

In order to solve the above problems, this embodiment proposes a method for constructing, training, and inferring a convolutional neural network based on knowledge supervision. This method can realize knowledge transfer without increasing the amount of parameters and calculations (large model knowledge transfer to Small model migration), maximize the mining of network potential, and improve network performance. In this embodiment, multiple results for the same image can assist each other, so that more accurate results can be obtained by utilizing the knowledge learned by the group. Wherein, the multiple results include both final results and intermediate results.

This embodiment is based on the idea of knowledge-supervised learning. First, one or more networks are established, and the networks realize knowledge transfer and improve the generalization ability of each model through mutual supervised learning.

Please refer to FIG. 1 below. FIG. 1 is a flow chart of a pedestrian re-identification method provided by an embodiment of the present application.

Specific steps can include:

S101: Construct an auxiliary training model and a target model based on a convolutional neural network;

Among them, this embodiment can establish an auxiliary training model and a target model including one or more convolutional neural networks based on the idea of knowledge-supervised learning. The above-mentioned convolutional neural networks can realize knowledge transfer through mutual supervised learning to improve The generalization ability of each model.

As a feasible implementation, the present application can construct the auxiliary training model and the target model based on the convolutional neural network in the following manner: construct the auxiliary training model including at least two convolutional neural networks, and construct the auxiliary training model including at least two convolutional neural networks. The target model of the convolutional neural network; or, utilize the convolutional neural network comprising at least two head networks to construct the auxiliary training model, utilize the convolutional neural network comprising at least two head networks to construct the target model; Wherein, the head network includes a pooling layer, an embedding layer, a fully connected layer, an output layer and a softmax layer.

Please refer to FIG. 2. FIG. 2 is a schematic diagram of the first model provided by the embodiment of the present application. The schematic diagram of the model shows the implementation of establishing an auxiliary training model and a target model including two convolutional neural networks. The auxiliary training model and /or the target model may be the model shown in FIG. 2 . As shown in Figure 2, two convolutional neural networks Net1 and Net2 are established. The two convolutional neural networks can be isomorphic or heterogeneous. The output of the network can reduce the dimensionality of the feature map (Batchsize×Channel×H×W) into a vector through the Pooling layer. Here, e1 and e2 are used to represent the embedded layer features, and the dimensions of e1 and e2 are Batchsize×Channel. The above model includes a backbone model and a head model. The backbone network is used to extract features, and the head network is used to realize classification and loss function calculation. The head network includes a pooling layer pool, an embedding layer, a fully connected layer fc, an output layer, and a softmax layer. The head network can use the triplet loss function Triplet loss and the cross-entropy loss function for parameter adjustment.

Please refer to Fig. 3. Fig. 3 is a schematic diagram of the second model provided by the embodiment of the present application. The schematic diagram of the model shows the implementation of establishing an auxiliary training model and a target model including a convolutional neural network. The model is multi-headed Convolutional neural network (i.e. there are multiple head networks).

The auxiliary training model in this embodiment may be the model shown in any one of FIG. 2 and FIG. 3 , and the target training model may also be the model shown in any one of FIG. 2 and FIG. 3 . The complexity of the above-mentioned auxiliary training model is higher than that of the target model, and the complexity of the model can be measured by the amount of parameters and computation of the model. Specifically, in this embodiment, the auxiliary training model and the target model based on a convolutional neural network can be constructed according to preset rules; wherein, the preset rule is that the model complexity of the auxiliary training model is higher than that of the The model complexity of the target model.

S102: Determine a loss function of the auxiliary training model and the target model, and use the loss function to train the auxiliary training model and the target model;

Wherein, in this embodiment, the loss functions of the auxiliary training model and the target model can be the same, specifically, the loss functions of the two can be determined in the following manner: providing the cross-entropy loss function of the convolutional neural network, the cross-entropy loss The function is used to calculate the cross entropy loss of each of the convolutional neural networks; the feature similarity loss function of the convolutional neural network is provided, and the feature similarity loss function is used to calculate the at least two convolutional neural networks The feature similarity loss between any two convolutional neural networks; the class center loss function of the convolutional neural network is provided, and the class center loss function is used to calculate the class center loss of each of the convolutional neural networks ; Provide the loss function of the convolutional neural network that constrains the class center distance, and the loss function that constrains the class center distance is used to calculate each of the convolutional neural networks that constrains the class center distance The loss of the loss; according to the cross-entropy loss function, the feature similarity loss function, the class center loss function, and the loss function that constrains the class center distance to determine the auxiliary training model and the target model loss function.

In the pedestrian re-identification task, the features of the embedding layer are finally used for feature matching retrieval, so constrained optimization of them is of great significance for the pedestrian re-identification task.

This embodiment designs a new loss function, and the implementation steps of this function are as follows:

Loss function (1) Calculation process: Add a fully connected layer (fc) after the embedding layer to obtain the fully connected layer features, and perform softmax normalization on the fully connected layer features, and finally calculate the loss through the cross entropy loss function

The superscript 1 represents the first branch.

Loss function (2) Calculation process: The embedded features e1 and e2 of the convolutional neural network Net1 and convolutional neural network Net2 should be similar, because they both support the same pedestrian classification task during the training process, and during the inference process, e1 and e2 will be used for similarity comparison. Therefore, the embedded features e1 and e2 of Net1 and Net2 are functionally the same, and similar features should be learned.

Net1 and Net2 have different network structures and initialization weight coefficients, and their e1 and e2 have diversity, but their commonality is that they have excellent expressive ability for pedestrians. In order to take advantage of their commonality, they suppress noise. This embodiment provides a feature similarity loss function to implement a mutual learning mechanism, that is, e1 learns from e2, and e2 learns from e1 to obtain the feature similarity between the convolutional neural network Net1 and the convolutional neural network Net2 Loss L _m .

Its Loss function is:

Among them, n represents the nth input sample, and u and v represent the uth network and the vth network. The formula can be summarized as follows:

All samples of each batch are traversed. As mentioned above, assuming that the samples of each batch contain N samples, it is traversed N times. The samples are passed through each network in turn, and the output results of the samples in the embedding layer of each network are obtained. For example, for sample x _n , assuming there are 2 networks, there are 2 embedding layer output results

and

Similarly, if there are 3 networks, there are 3 embedding layer outputs.

For the output results of the embedding layer of all networks, a pairwise traversal is performed. For example, there are two networks 1 and 2 in this embodiment. Use the above formula to calculate the feature similarity loss L _e between the two networks. In the same way, assuming that there are 3 networks, there are 3 combinations without repeated traversal: (1, 2) (1, 3) (2, 3), and the feature similarity loss function L _e ( u,v).

Loss function (3) Calculation process: The features of e1 and e2 are similar, and have the following defects in the mutual learning process. For example: in the initial stage of the training process, the prediction of the network model is very inaccurate, and its network features e1 and e2 have large deviations and noises. The mutual learning between e1 and e2 may be inaccurate features to inaccurate features. Learning, may not have a good effect. In order to suppress the noise, this embodiment proposes an embedded feature optimization method, which can effectively reduce the noise of the embedded feature by using the class center loss function. The specific implementation is as follows:

The core idea of constructing the class center loss function is: the embedding layer features of each image are learned from their respective class centers. Because the various centers of image samples are relatively stable, it can effectively suppress the deviation caused by embedding layer features learning from other branches. The learning method is: Find the class centers of all types of samples, input all samples x _N into each network in turn, and obtain the embedding features of all samples

and

The superscripts 1 and 2 represent different branches, and the subscript N represents a total of N samples. For the output of each network, the class center of the sample is calculated separately. Assuming that all samples x _N contain a total of C categories (that is, C pedestrians), use the following formula to find the class center:

in,

Represents the class center of the c-th class of the embedding layer features of the first network. There are C class centers in total, with

express. Similarly, for multiple networks, the class center of the embedding layer of each network is obtained separately.

Represents the embedding feature of the nth sample of the first network, and its category is the cth category.

The u-th network learns each sample from the embedding layer class center corresponding to its sample category, and finally obtains the class center loss as follows:

in,

representative sample

The corresponding class center. Traverse the features of each network in turn, and calculate the loss of each network by the class center loss function

and

Since the network is continuously iteratively optimized, its samples are constantly changing relative to the class center of each network, and the class centers of various types of samples are dynamically updated in the following way

where the superscript u represents the uth network. The update method of the class center can use the method of first-in-first-out stack, and the class center of the n-step samples closest to the current step is obtained as the real class center.

In this embodiment, the class center may also be selected according to the classification probability corresponding to the embedding feature. That is: first judge whether the classification using this feature is correct, and if it is correct, it will be included in the calculation of the class center.

Loss function (4) Calculation process: For each network, the position of the class centers can be further constrained so that the centers of each class can be separated as much as possible, which is beneficial to distinguish different pedestrian characteristics and improve the discriminability of the network. Even if each pedestrian feature can be better separated. In this embodiment, a loss function that constrains the class center distance (class center distance) can be constructed:

represents the class center of the i-th class of the u-th network,

representative distance

nearest class center. In this embodiment, a difficult sample mining (difficult sample mining) method may be used to implement class-centered loss optimization. Difficult sample mining is not the mean value of the inter-class differences of all classes, but the minimum inter-class difference (ie, the smallest class center distance) of all classes.

In this embodiment, the loss functions (1)-(4) are combined to obtain the total loss function:

L _loss is the total loss of the model,

is the cross-entropy loss of the first convolutional neural network,

is the cross-entropy loss of the second convolutional neural network, L _m is the feature similarity loss of the first convolutional neural network and the second convolutional neural network,

is the class center loss of the first convolutional neural network,

is the cross-entropy loss of the second convolutional neural network,

is the loss that constrains the class center distance for the first convolutional neural network,

The loss that constrains the class center distance for the second convolutional neural network.

This embodiment provides a network structure for multi-model knowledge collaborative training, which protects and combines the above mutual learning loss, class center loss, and class center optimization loss functions for supervised learning training. The multi-model knowledge supervised training method mines the features in the network embedding layer, improves the discrimination of the embedding layer features, and deletes redundant models during reasoning, so no additional reasoning costs are required to improve accuracy. This method is in the field of image classification have a broad vision of application.

The following describes the process of training the model in the above embodiment. After a convolutional neural network is established, it needs to be trained to make it converge, and the trained network weights are obtained after convergence. In the inference process, the weight coefficients trained by the network are preloaded to perform final classification on the input data.

The model training idea of this embodiment is as follows: (1) according to different network structures, build a plurality of network models for training, usually select a larger model (i.e., auxiliary training model) and a smaller model (i.e. , target model) to achieve knowledge transfer. Calculate cross-entropy loss, mutual learning loss, class-centered loss, and class-centered optimization loss for all network models. Among them, the cross entropy loss is calculated by the cross entropy loss function, the mutual learning loss is calculated by the feature similarity loss function, the class center loss is calculated by the class center loss function, and the class center optimization loss is calculated by the loss function that constrains the class center distance. According to the above loss function, the network is trained to converge.

The convolutional neural network training process is as follows: The training process of the convolutional neural network is divided into two stages. The first stage is the stage in which the data propagates from the low level to the high level, that is, the forward propagation stage. The other stage is that when the results obtained by the forward propagation do not match the expectations, the stage of propagating the error from the high level to the bottom level is the stage of backpropagation. The training process includes the following steps:

Step 1. The network layer weights are initialized, generally using random initialization;

Step 2. The input image data is forward-propagated through the convolution layer, down-sampling layer, fully connected layer and other layers to obtain the output value;

Step 3. Calculate the error between the output value of the network and the target value (label):

Step 4. The error is reversely transmitted back to the network, and the backpropagation error of each layer of the network: fully connected layer, convolutional layer, etc. is obtained in turn.

Step 5. Each layer of the network adjusts all weight coefficients in the network according to the backpropagation error of each layer, that is, updates the weights.

Step 6. Randomly select new image data again, and then enter the second step to obtain the output value from the forward propagation of the network.

Step 7. Infinite reciprocating iterations. When the error between the output value of the network and the target value (label) is found to be less than a certain threshold, or the number of iterations exceeds a certain threshold, the training ends.

Step 8. Save the trained network parameters of all layers and store the trained weights.

Optionally, in this embodiment, the constraining the class center distance includes: finding the minimum inter-class difference of each class center through difficult sample mining; using the minimum inter-class difference to constrain the class center distance.

Before this step, there may also be acquired training data for pedestrian re-identification, and then use the training data to train the auxiliary training model and the target model respectively. In this embodiment, the accuracy of training and reasoning of the neural network is improved without increasing the amount of parameters and the amount of calculation of the network during reasoning.

S103: After the auxiliary training model is trained, transfer the knowledge of the auxiliary training model to the target model to obtain a pedestrian re-identification model;

Among them, after the training of the auxiliary training model is completed, the auxiliary training model has learned the knowledge information about pedestrian re-identification. This embodiment can transfer the above knowledge information to the target model through knowledge transfer. In this embodiment, the training is completed, and the transfer The target model with the knowledge of the auxiliary training model is used as the person re-identification model. The above knowledge refers to features in the network, and this embodiment of multiple views of the same data will provide additional regularization information, thereby improving network accuracy.

S104: Input the pedestrian image to the pedestrian re-identification model to obtain the embedded layer features of the pedestrian image;

After the pedestrian re-identification model is obtained, if the pedestrian re-identification task is received, the pedestrian image is input to the pedestrian re-identification model to obtain the embedded layer features of each pedestrian image.

S105: Compare the similarity between the embedded layer feature of the pedestrian image and the embedded layer of the image to be queried, and output a pedestrian re-identification result according to the similarity comparison result.

Among them, this embodiment can compare the similarity between the embedded layer features of the pedestrian image and the embedded layer of the image to be queried, and determine the pedestrian image with the highest similarity according to the similarity comparison result, so that the pedestrian image with the highest similarity can be used as the pedestrian image. re-identification results.

In this embodiment, an auxiliary training model and a target model based on a convolutional neural network are constructed, and loss functions of the auxiliary training model and the target model are determined, and then the auxiliary training model and the target model are trained using the loss function. After the auxiliary training model is trained, in this embodiment, the knowledge learned in the auxiliary training model is transferred to the target model through knowledge transfer to obtain a pedestrian re-identification model. Since the pedestrian re-identification model includes knowledge learned from the auxiliary training model and the target model, the accuracy of the pedestrian re-identification model can be improved without additional reasoning costs. Therefore, this embodiment can improve the accuracy of pedestrian re-identification without increasing the amount of parameters and computation.

The following provides a knowledge collaborative network training method using the above embodiments to train a model and apply it to the field of pedestrian re-identification. The training process has been described in detail above, and the specific method of reasoning application is explained below:

Please refer to FIG. 4, which is a schematic diagram of a model retention result provided by the embodiment of the present application. The reasoning process provided in this embodiment is as follows: remove all auxiliary training models, retain only one network model (ie, the target model), load pre-trained weights, classify images or extract image features.

During inference: remove the rest of the models (auxiliary training model), and only keep the main model (target model). Please refer to FIG. 5 , which is a schematic diagram of a pedestrian re-identification application provided by an embodiment of the present application. In Figure 5, conv represents the convolutional layer, bottleneck represents the bottleneck layer, and the bottleneck layer represents a specific network structure of ResNet. In the pedestrian re-identification application, the input images 1, 2, 3 and the image to be queried are input into the network to obtain the embedding layer features in the network, and the images 1, 2, and 3 constitute the query data set for the pedestrian re-identification task. The image to be queried is also input into the network to obtain the embedding layer features of the image to be queried. Compare the embedding layer features of the image to be queried with all the features in the query data set. The comparison method is to find the distance between the embedding layer features of the image to be queried and all the features in the query data set, that is, the distance between the vector and the query data sample with the smallest distance It is the same person as the image to be queried.

For pedestrian re-identification tasks, the discriminability of embedding features directly affects the highest performance of the model. Therefore, how to mine the features of the embedding layer of the model so that the samples can be correctly classified and discriminated is extremely important. Therefore, the present invention proposes a new embedding feature mining method and a multi-model collaborative training method, and establishes a basis for feature mining by establishing multiple neural network models. Embedding mining between branches is realized by mutual learning of embedding features between two models and constructing a new type of loss function. At the same time, the loss function learned from each classification center is combined with the embedding features in the branch, and combined into a new loss function to train the entire network.

The training method proposed in this embodiment does not increase the amount of parameters and calculations during network inference. By optimizing the training process, the potential of the network is tapped so that it can achieve optimal performance, thereby showing better results in the inference process. In this embodiment, for the task of pedestrian re-identification, the present invention proposes a method of embedding feature mining based on multi-model knowledge supervision and collaborative training, which can improve the accuracy of pedestrian re-identification without increasing the amount of parameters and computation.

Please refer to FIG. 6. FIG. 6 is a schematic structural diagram of a pedestrian re-identification system provided by an embodiment of the present application. The system may include:

Model construction module 601, for constructing the auxiliary training model and target model based on convolutional neural network;

A model training module 602, configured to determine a loss function of the auxiliary training model and the target model, and use the loss function to train the auxiliary training model and the target model;

A knowledge transfer module 603, configured to transfer the knowledge of the auxiliary training model to the target model to obtain a pedestrian re-identification model after the training of the auxiliary training model is completed;

A feature extraction module 604, configured to input a pedestrian image to the pedestrian re-identification model to obtain the embedded layer features of the pedestrian image;

The pedestrian re-identification module 605 is configured to perform a similarity comparison between the embedded layer features of the pedestrian image and the embedded layer of the image to be queried, and output a pedestrian re-identification result according to the similarity comparison result.

Optionally, the model construction module 601 is configured to construct the auxiliary training model comprising at least two convolutional neural networks, construct the target model comprising at least two convolutional neural networks; The convolutional neural network of the head network constructs the auxiliary training model, and utilizes the convolutional neural network comprising at least two head networks to construct the target model; wherein, the head network includes a pooling layer, an embedding layer, a full Connection layer, output layer and softmax layer.

Optionally, the model training module 602 is used to provide the cross-entropy loss function of the convolutional neural network, and the cross-entropy loss function is used to calculate the cross-entropy loss of each of the convolutional neural networks; The feature similarity loss function of the convolutional neural network, the feature similarity loss function is used to calculate the feature similarity loss between any two convolutional neural networks in the at least two convolutional neural networks; also used for The class center loss function of the convolutional neural network is provided, and the class center loss function is used to calculate the class center loss of each of the convolutional neural networks; it is also used to provide the class center of the convolutional neural network. A loss function that is constrained by distance, and the loss function that constrains the class center distance is used to calculate the loss of each of the convolutional neural networks that constrains the class center distance; , the feature similarity loss function, the class center loss function, and the loss function constraining the class center distance determine the loss functions of the auxiliary training model and the target model.

Optionally, also include:

The class center update module is used to perform weighted calculation on the latest determined class center and the currently output embedding layer features to obtain the updated class center.

Optionally, also include:

Judgment module, used to judge whether the feature classification corresponding to the embedded layer feature of the current output is correct before the weighted calculation of the latest determined class center and the current output embedded layer feature; if so, enter the latest determined The step of weighting the class center and the current output embedding layer features.

Optionally, with regard to the model training module 602, the constraining the class center distance includes: finding the minimum inter-class difference of each class center through difficult sample mining; using the minimum inter-class difference to constrain the class center distance .

Optionally, the model construction module 601 is configured to construct the auxiliary training model and the target model based on a convolutional neural network according to preset rules; wherein, the preset rule is the model complexity of the auxiliary training model Model complexity higher than the target model.

Since the embodiments of the system part correspond to the embodiments of the method part, please refer to the description of the embodiments of the method part for the embodiments of the system part, and details will not be repeated here.

The present application also provides a storage medium on which a computer program is stored. When the computer program is executed, the steps provided in the above-mentioned embodiments can be realized. The storage medium may include: U disk, mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disk and other media that can store program codes.

The present application also provides an electronic device, which may include a memory and a processor, where a computer program is stored in the memory, and when the processor invokes the computer program in the memory, the steps provided in the above embodiments can be implemented. Of course, the electronic device may also include various network interfaces, power supplies and other components.

Each embodiment in the description is described in a progressive manner, each embodiment focuses on the difference from other embodiments, and the same and similar parts of each embodiment can be referred to each other. As for the system disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and for relevant details, please refer to the description of the method part. It should be pointed out that those skilled in the art can make several improvements and modifications to the application without departing from the principles of the application, and these improvements and modifications also fall within the protection scope of the claims of the application.

It should also be noted that in this specification, relative terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply that these entities or operations There is no such actual relationship or order between the operations. Furthermore, the term "comprises", "comprises" or any other variation thereof is intended to cover a non-exclusive inclusion such that a process, method, article, or apparatus comprising a set of elements includes not only those elements, but also includes elements not expressly listed. other elements of or also include elements inherent in such a process, method, article, or device. Without further limitations, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or apparatus comprising said element.

Claims

A pedestrian re-identification method, characterized in that, comprising:

Build auxiliary training models and target models based on convolutional neural networks;

determining a loss function of the auxiliary training model and the target model, and using the loss function to train the auxiliary training model and the target model;

After the auxiliary training model is trained, the knowledge of the auxiliary training model is transferred to the target model to obtain a pedestrian re-identification model;

Input the pedestrian image to the pedestrian re-identification model to obtain the embedded layer features of the pedestrian image;

Comparing the features of the embedded layer of the pedestrian image with the embedded layer of the image to be queried for similarity comparison, and outputting a pedestrian re-identification result according to the similarity comparison result.
According to the described pedestrian re-identification method of claim 1, it is characterized in that constructing an auxiliary training model and a target model based on a convolutional neural network, comprising:

Constructing the auxiliary training model comprising at least two convolutional neural networks, constructing the target model comprising at least two convolutional neural networks;

Or, using a convolutional neural network comprising at least two head networks to construct the auxiliary training model, and utilizing a convolutional neural network comprising at least two head networks to construct the target model;

Wherein, the head network includes a pooling layer, an embedding layer, a fully connected layer, an output layer and a softmax layer.
The pedestrian re-identification method according to claim 2, wherein determining the loss function of the auxiliary training model and the target model comprises:

The cross-entropy loss function of the convolutional neural network is provided, and the cross-entropy loss function is used to calculate the cross-entropy loss of each of the convolutional neural networks;

A feature similarity loss function of the convolutional neural network is provided, and the feature similarity loss function is used to calculate the feature similarity loss between any two convolutional neural networks in the at least two convolutional neural networks;

A class center loss function of the convolutional neural network is provided, and the class center loss function is used to calculate the class center loss of each of the convolutional neural networks;

A loss function that constrains the class center distance of the convolutional neural network is provided, and the loss function that constrains the class center distance is used to calculate each of the convolutional neural networks that constrains the class center distance. loss;

The loss functions of the auxiliary training model and the target model are determined according to the cross-entropy loss function, the feature similarity loss function, the class center loss function, and the loss function constraining class center distances.
The pedestrian re-identification method according to claim 3, wherein the feature similarity loss function is:

Among them, L m represents the feature similarity loss, N represents the number of samples, n represents the nth input sample, u and v represent the uth network and the vth network,
represents the output of the embedding layer of the nth input sample of the uth network,
Represents the output of the embedding layer for the nth input sample of the vth network. 5. according to the described pedestrian re-identification method of claim 3, it is characterized in that, described class center loss function is:

in,
Represents the class center loss of the u-th network, N represents the number of samples, n represents the n-th input sample, u represents the u-th network,
Represents the embedding feature of the nth sample of the uth network, and the category of the nth sample is the c class,
represent
The corresponding class center, that is,
Represents the class center of the c-th class of embedding layer features of the u-th network.
The pedestrian re-identification method according to claim 5, further comprising:

Perform weighted calculations on the latest determined class center and the current output embedding layer features to obtain the updated class center, that is, through the following formula:

in,
Represents the updated class center
Represents the embedding feature of the nth sample of the uth network, and the category of the nth sample is the c class,
represent
The corresponding class center, that is,
Represents the class center of the c-th class of the embedding layer features of the u-th network, and α and β represent weighted values.
According to the described pedestrian re-identification method of claim 6, it is characterized in that, before carrying out weighted calculation to the class center determined last time and the embedding layer feature of current output, also include:

Judging whether the feature classification corresponding to the currently output embedding layer feature is correct;

If yes, then enter the step of weighting the last determined class center and the current output embedding layer features.
The pedestrian re-identification method according to claim 3, wherein said constraining the class center distance comprises:

Through difficult sample mining, find the minimum inter-class difference of each class center;

Using the minimum inter-class difference, constrain the class center distance.
The pedestrian re-identification method according to claim 8, wherein the loss function constraining the class center distance is:

in,
Represents the loss of the u-th convolutional neural network that constrains the class center distance, C represents the number of categories of samples,
represents the class center of the i-th class of the u-th network,
representative distance
nearest class center.
According to the described pedestrian re-identification method of claim 3, it is characterized in that, according to the described cross-entropy loss function, the feature similarity loss function, the class center loss function, and the constraint on the class center distance The loss function determining the loss function of the auxiliary training model and the target model includes:

Adding the cross-entropy loss function, the feature similarity loss function, the class center loss function, and the loss function constraining the class center distance to obtain the auxiliary training model and the target model loss function;

Wherein, the loss calculated by the loss function of the auxiliary training model and the target model is the cross-entropy loss function, the feature similarity loss function, the class center loss function, and the class center distance The sum of losses calculated separately by the constrained loss functions.
The pedestrian re-identification method according to claim 3, wherein said using said loss function to train said auxiliary training model and said target model comprises:

Step a: Initialize the weights of each network layer in the model to be trained, wherein the model to be trained is any one of the auxiliary training model and the template model;

Step b: select training data, input the training data to the model to be trained, propagate the training data forward in the model to be trained so that the training data passes through the various network layers in turn, and output Propagate the output value forward;

Step c: using a loss function to obtain the error between the forward propagation output value and the target value;

Step d: backpropagating the error in the model to be trained to obtain the backpropagation error of each network layer;

Step e: updating the weights of each network layer based on the backpropagation error;

Step f: Repeat steps b to e, and when the error is less than an error threshold, end the training of the model to be trained, or end the training of the model to be trained when the repetition reaches a specified number of times .
The pedestrian re-identification method according to any one of claims 1 to 11, wherein the construction of an auxiliary training model and a target model based on a convolutional neural network includes:

Constructing the auxiliary training model and the target model based on a convolutional neural network according to preset rules; wherein, the preset rule is that the model complexity of the auxiliary training model is higher than the model complexity of the target model.
A pedestrian re-identification system, characterized in that it includes:

Model building blocks for constructing convolutional neural network-based auxiliary training models and target models;

A model training module, configured to determine a loss function of the auxiliary training model and the target model, and use the loss function to train the auxiliary training model and the target model;

A knowledge transfer module, configured to transfer the knowledge of the auxiliary training model to the target model to obtain a pedestrian re-identification model after the training of the auxiliary training model is completed;

A feature extraction module, configured to input a pedestrian image to the pedestrian re-identification model to obtain the embedded layer features of the pedestrian image;

The pedestrian re-identification module is used to compare the similarity between the embedded layer features of the pedestrian image and the embedded layer of the image to be queried, and output the pedestrian re-identification result according to the similarity comparison result.
An electronic device, characterized in that it includes a memory and a processor, wherein a computer program is stored in the memory, and when the processor invokes the computer program in the memory, the pedestrian described in any one of claims 1 to 11 is implemented. Steps in the re-identification method.
A storage medium, characterized in that computer-executable instructions are stored in the storage medium, and when the computer-executable instructions are loaded and executed by a processor, pedestrian re-identification according to any one of claims 1 to 11 is realized method steps.