CN115690833A

CN115690833A - Pedestrian re-identification method based on deep active learning and model compression

Info

Publication number: CN115690833A
Application number: CN202211091748.8A
Authority: CN
Inventors: 付春玲; 侯巍; 郑文奎
Original assignee: Henan University
Current assignee: Henan University
Priority date: 2022-09-07
Filing date: 2022-09-07
Publication date: 2023-02-03

Abstract

The invention relates to the technical field of image processing, and in particular to a pedestrian re-identification method based on deep active learning and model compression. Samples are selected with an active-learning-based pedestrian re-identification strategy, labeled, and fed into a ResNet101 network for training to obtain a trained network model. The trained network model is then compressed with a knowledge-distillation-based model compression method: the feature output of the pedestrian Re-ID network model is acquired, and a feature distillation loss measuring the gap in feature information transfer is obtained from the local L2 norm between features; the feature distillation loss is added to the student model to obtain the total loss of the student model. The invention can be widely applied in fields such as security monitoring, pedestrian search and criminal investigation, addresses the problem that obtaining labeled data in practice consumes a large amount of manpower, can meet the design requirements of practical engineering systems, and has good engineering application value.

Description

Pedestrian re-identification method based on deep active learning and model compression

Technical Field

The invention relates to the technical field of image processing, in particular to a pedestrian re-identification method based on deep active learning and model compression.

Background

Pedestrian re-identification (Re-ID) aims to judge whether pedestrians captured in different scenes and by different cameras are the same person, and is of great significance in fields such as security monitoring, pedestrian search and criminal investigation. In recent years, deep-learning-based pedestrian Re-ID methods have advanced greatly and achieve high performance. Most supervised pedestrian Re-ID methods need to collect pairs of images of the same pedestrian under different cameras before training a model, and then manually annotate each pair with an ID to obtain a data set; data sets obtained in this way have enabled some supervised pedestrian Re-ID methods to achieve good results.

However, in practical applications, some state-of-the-art supervised methods are difficult to apply directly, for two reasons. On the one hand, the volume of pedestrian images collected by cameras is large, and the cost of producing complete annotations is high. On the other hand, even if a large number of pedestrian images have been labeled, a supervised method needs to train on the entire data set to achieve a certain effect; when the data set used is small, the advantage of deep learning is weakened. In addition, supervised methods treat all images without distinction and do not exploit the information inherent in the images. To address these difficulties, researchers began to focus on unsupervised and semi-supervised approaches. However, compared with methods based on supervised learning, pedestrian Re-ID models based on unsupervised and semi-supervised learning are inherently weaker, which affects pedestrian Re-ID performance in practical applications. In this case, active learning provides an effective way to select a sample set for labeling. Unlike the above methods, in the active learning setting the algorithm itself selects the samples to be labeled, so that the trained model achieves higher performance.

An active learning algorithm selects the most informative training data to optimize the deep model, so that the deep model achieves higher performance after learning without requiring a large amount of additional data labeling. Unlike supervised learning, active learning selects the most informative samples and passes them to one or more human annotators, who label them. The most critical part of this process is deciding which samples are more informative and therefore more worth sending to a human annotator for annotation.

In the past, different selection methods have been developed for a variety of computer vision tasks, such as classification, recognition and object detection. Recently, some works have combined active learning with deep learning models for pedestrian Re-ID, but the methods proposed so far still have two problems. First, they use the confidence score of the model as the basis for selecting samples, but the model is unreliable before it has been trained, so the samples they select are not necessarily the most informative. Second, the deep learning models they adopt are large in scale, which hinders deployment and adoption in practical applications.

Disclosure of Invention

In order to solve the technical problem that existing deep learning models are large in scale, which hinders deployment and adoption in practical applications, the invention aims to provide a pedestrian re-identification method based on deep active learning and model compression, and the adopted technical scheme is as follows:

collecting a plurality of images of the same pedestrian shot by a camera a and a camera b at different time and different places, and forming an image data set;

taking the ResNet101 structure as a backbone network of a pedestrian Re-ID network model, and extracting pedestrian image features by utilizing the backbone network; the loss function of the network model is a cross entropy loss function;

first, adopting a pedestrian re-identification strategy based on active learning: starting from a small warm-up pedestrian image set of size S, a set of S pedestrian images is obtained using the network model and a query function ψ(U, S, γ(·)); labels are manually annotated for the set of S pedestrian images; the S pedestrian images are then removed from the unlabeled data set U; the above process is repeated until the network model reaches the standard performance or the data is used up; wherein γ(·) is the query policy and S is greater than 1;

in each iteration, there are two calculation processes: gradient embedding calculation and sampling calculation; in the gradient embedding calculation, the norm of the gradient embedding of the last layer of the network model is determined by the certainty of the network model about the predicted class;

after the gradient embedding is calculated, a batch of samples of size S is selected by clustering with the k-means++ algorithm and labeled, and the samples are fed into the ResNet101 network for training to obtain a trained network model;

compressing the trained network model with a knowledge-distillation-based model compression method: first, an initial student model and a teacher model trained on a pedestrian Re-ID data set are given; while the student model is trained, the teacher model is frozen, and the knowledge of the teacher model is cyclically transferred to the student model in a dual-stream manner; meanwhile, the transfer progress and the transfer direction of the knowledge transfer process are monitored;

acquiring the feature output of the pedestrian Re-ID network model, and obtaining a feature distillation loss that measures the gap in feature information transfer according to the local L2 norm between features;

and adding the feature distillation loss to the student model to obtain the total loss of the student model.

Preferably, the gradient of the last layer of the network model is denoted $g_x^{y}$. Then:

$$g_x^{y} = \frac{\partial}{\partial W} L_{reid}\big(f(x;\theta),\, y\big)$$

wherein $g_x^{y}$ is the gradient of the last layer of the network model; $L_{reid}$ is the cross entropy loss function; $W$ is the weight of the last layer of the network model; $f(x;\theta)$ is the expression of the network model, $\theta$ is the weight parameter of the network model in the expression, and $x$ is a pedestrian image; $y$ is the true label of the pedestrian image $x$.

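Preferably, the transfer direction is:

$$L_{logit} = \frac{1}{2} D_{KL}\!\left(l_s \,\Big\|\, \frac{l_s + l_t}{2}\right) + \frac{1}{2} D_{KL}\!\left(l_t \,\Big\|\, \frac{l_s + l_t}{2}\right)$$

wherein $L_{logit}$ is the transfer direction of knowledge, $l_s$ is the logit output of the student model, $l_t$ is the logit output of the teacher model, and $D_{KL}$ is the KL divergence between the teacher model and the student model.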

Preferably, the local L2 norm between the features is:

$$d_f(T, S) = \sum_{i=1}^{WHC} \begin{cases} 0, & \text{if } S_i \le T_i \le 0 \\ (T_i - S_i)^2, & \text{otherwise} \end{cases}$$

wherein $d_f(T, S)$ is the local L2 norm between features, $T$ is the feature vector of the teacher model, $S$ is the feature vector of the student model, $T_i$ is the $i$-th component of the teacher model feature vector, and $S_i$ is the $i$-th component of the student model feature vector; $d_f$ is the symbol representing the local L2 norm; $W$ is the width of the pedestrian image, $H$ is the height of the pedestrian image, and $C$ is the number of channels of the pedestrian image.

Preferably, the characteristic distillation loss is:

$$L_{overhaul} = d_f\big(\sigma(f_t),\, r(f_s)\big)$$

wherein $L_{overhaul}$ is the feature distillation loss; $f_t$ is the feature output by the teacher model; $f_s$ is the feature output by the student model; $\sigma(\cdot)$ is a nonlinear function and $r(\cdot)$ is a regressor; $d_f$ is the symbol representing the local L2 norm.

Preferably, the overall loss is:

$$L_{kd} = L_{logit} + \alpha L_{overhaul} + L_{reid}$$

wherein $L_{kd}$ is the total loss, $L_{logit}$ is the transfer direction of knowledge, $L_{overhaul}$ is the feature distillation loss, $L_{reid}$ is the cross entropy loss function, and $\alpha$ is the scaling factor.

The embodiment of the invention at least has the following beneficial effects:

The method aims to maximize the performance of the pedestrian re-identification network model while minimizing the manual labeling cost. It applies an active learning algorithm to pedestrian re-identification and selects samples based on the gradient of the network model, so that the selected samples are both uncertain and diverse, and the network model can achieve good performance with only a small amount of labeled data. In addition, unlike traditional methods, the method compresses the network model with a knowledge-distillation-based model compression method, which reduces the scale of the network model while preserving its performance as far as possible. Therefore, the method can be widely applied in fields such as security monitoring, pedestrian search and criminal investigation; it addresses the problem that obtaining labeled data in practice requires a large amount of manpower, as well as the problem that a large-scale network model makes pedestrian re-identification harder to deploy; it can meet the design requirements of practical engineering systems and has good engineering application value.

Drawings

In order to illustrate the embodiments of the present invention, or the technical solutions and advantages of the prior art, more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a block diagram of an embodiment of a pedestrian re-identification method based on deep active learning and model compression according to the present invention;

FIG. 2 is a flow chart of active learning;

FIG. 3 is a flow chart of the knowledge-distillation-based model compression method.

Detailed Description

To further explain the technical means adopted by the present invention to achieve its intended objects, and their effects, the proposed solution, its specific implementation, structure, features and effects are described in detail below with reference to the accompanying drawings and preferred embodiments. In the following description, references to "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

Referring to FIG. 1, a block diagram of a pedestrian re-identification method based on deep active learning and model compression according to an embodiment of the present invention is shown, where the method includes the following steps:

Step 1, collecting a plurality of images of the same pedestrian shot by a camera a and a camera b at different times and places, and forming an image data set.

Specifically, after the image data set is obtained, the image data set is randomly divided into a training set and a test set.

Step 2, taking the ResNet101 structure as a backbone network of a pedestrian Re-ID network model, and extracting pedestrian image features by utilizing the backbone network; the loss function of the network model is a cross entropy loss function.

The ResNet101 structure above is a ResNet101 pre-trained on the ImageNet dataset, and the cross entropy loss function is:

$$L_{reid} = -\frac{1}{n} \sum_{x} \ln p_{y(x)}(x)$$

wherein $L_{reid}$ is the cross entropy loss function; $y(x)$ is the true identity label of the pedestrian image $x$; $p_i(x)$ is the probability, predicted by the backbone network model, that the pedestrian image $x$ belongs to class $i$; and $n$ is the total number of pedestrian images $x$.

It should be noted that, by using the cross entropy loss function, the network model can learn more robust image features, and can effectively learn pedestrian identity features.
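As a minimal sketch of this loss, assuming the class probabilities come from a softmax over per-identity scores and that the loss is the mean negative log-probability of the true identity, the calculation can be written in plain Python. The function names and the toy scores below are illustrative and do not come from the original text.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of per-identity scores."""
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def reid_cross_entropy(batch_scores, labels):
    """Mean negative log-probability of the true identity over n images."""
    n = len(labels)
    loss = 0.0
    for scores, y in zip(batch_scores, labels):
        loss -= math.log(softmax(scores)[y])
    return loss / n

# Toy batch (assumed values): two images, three identities.
batch_scores = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
labels = [0, 2]
loss = reid_cross_entropy(batch_scores, labels)
```

With two equally scored identities the loss of a single image reduces to ln 2, which is a convenient sanity check for an implementation.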

Step 3, adopting a pedestrian re-identification strategy based on active learning: starting from a small warm-up pedestrian image set of size S, a set of S pedestrian images is obtained using the network model and a query function ψ(U, S, γ(·)); labels are manually annotated for the set of S pedestrian images; the S pedestrian images are then removed from the unlabeled data set U; the above process is repeated until the network model reaches the standard performance or the data is used up; wherein γ(·) is the query policy and S is greater than 1.

Specifically, images in a pedestrian Re-ID task are manually labeled, and integrating the idea of active learning into the pedestrian Re-ID task can greatly reduce the cost of manual labeling; the greatest challenge in active learning is how to select the most informative samples $x$ from the unlabeled data set $U$ for manual labeling. Therefore, in this embodiment, the set of S pedestrian images $X = \{x_1, \ldots, x_S\}$ obtained by the method in step 3, together with the manually annotated labels $Y = \{y_1, \ldots, y_S\}$ of these images, constitutes the selected most informative set; the flow chart of active learning is shown in FIG. 2.

The standard performance is set by the implementer according to the specific conditions: because implementation scenarios differ, the standard performance required of the network model differs as well, so it is not described further here.
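The loop of step 3 can be sketched as follows. This is only an illustration under stated assumptions: `query` stands in for ψ(U, S, γ(·)), `oracle` for the human annotator, `train` for retraining the network model, and `good_enough` for the implementer-defined standard performance. All four callables, and the toy stand-ins used below, are hypothetical placeholders, and the sketch starts from an empty labeled set rather than a warm-up set.

```python
import random

def active_learning_loop(unlabeled, batch_size, query, oracle, train, good_enough):
    """Pool-based loop: query S samples, label them, remove them from U, retrain."""
    labeled = {}
    pool = list(unlabeled)
    model = train(labeled)
    while pool and not good_enough(model):
        picked = query(pool, batch_size, model)       # X = psi(U, S, gamma)
        for x in picked:
            labeled[x] = oracle(x)                    # manual annotation of X
        picked_set = set(picked)
        pool = [x for x in pool if x not in picked_set]  # U <- U \ X
        model = train(labeled)
    return model, labeled, pool

# Toy stand-ins (assumptions): random queries, a fake labeler, and a
# "model" that is just the number of labeled samples seen so far.
rng = random.Random(0)
model, labeled, pool = active_learning_loop(
    unlabeled=range(10),
    batch_size=3,
    query=lambda U, S, m: rng.sample(U, min(S, len(U))),
    oracle=lambda x: x % 3,
    train=lambda lab: len(lab),
    good_enough=lambda m: m >= 6,
)
```

The loop stops on either of the two conditions named in step 3: the stopping criterion is met, or the unlabeled pool is exhausted.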

Step 4, in each iteration, there are two calculation processes: gradient embedding calculation and sampling calculation; in the gradient embedding calculation, the norm of the gradient embedding of the last layer of the network model is determined by the certainty of the network model about the predicted class.

The norm of the gradient embedding of the last layer is determined by the certainty of the network model about the predicted class as follows: when the network model is highly certain about the predicted class, the gradient embedding of the last layer of the network model has a small norm; otherwise, it has a larger norm.

The active learning method needs to sample in the gradient embedding space of the samples, which ensures that the selected samples are both diverse and uncertain. Therefore, each iteration mainly involves two calculation processes: gradient embedding calculation and sampling calculation. Computing the gradient embedding only requires the parameter weights of the last layer of the network model, obtained from the expression of the network model. The last layer of the network model has a nonlinear function, namely:

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

wherein $z_i$ is the output of the $i$-th neuron of the last layer of the network model, and $z_j$ is the output of the $j$-th neuron of the last layer of the network model; $K$ is the number of neurons in the last layer of the network model; $e$ is the natural constant.

The weight of the last layer of the network model is

$$W = (W_1, \ldots, W_K)^{T} \in \mathbb{R}^{K \times d}$$

wherein $K$ is the number of neurons in the last layer of the network model; $W_1$ is the weight of the 1st neuron and $W_K$ is the weight of the $K$-th neuron; $T$ is the transposition operation; $d$ is the dimension of each weight vector; and $\mathbb{R}^{K \times d}$ is a vector space of dimension $K \times d$. With $V$ containing the weights of all preceding layers of the network model, the network model $f(x;\theta)$ is a parameterized network model with $\theta = (W, V)$, and its expression is $f(x;\theta) = \sigma(W \cdot z(x;V))$, wherein $z(x;V)$ is a nonlinear mapping function, parameterized by $V$, that converts the pedestrian image $x$ into a feature vector.

After the network model expression is obtained, for a pedestrian image $x$, with $p_i(x) = f(x;\theta)_i$ according to the network model, the cross entropy loss function in step 2 becomes:

$$L_{reid}\big(f(x;\theta),\, y\big) = \ln\!\left(\sum_{j=1}^{K} e^{W_j \cdot z(x;V)}\right) - W_y \cdot z(x;V)$$

wherein $L_{reid}$ is the cross entropy loss function; $y$ is the true label of the pedestrian image $x$; $W_j$ is the weight of the $j$-th neuron of the output layer of the network model; $W_y$ is the weight of the output layer of the network model corresponding to the true label $y$ of the pedestrian image $x$; $K$ is the number of neurons of the output layer of the network model; $z(x;V)$ is the nonlinear mapping function that transforms the pedestrian image $x$ into a feature vector; and $e$ is the natural constant.

It should be noted that the last layer of the network model is an output layer of the network model.
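The log-sum-exp form of the loss follows from substituting the softmax output into the per-image cross entropy. The short derivation below is a sketch that assumes the per-image loss is the negative log-probability of the true label, $-\ln p_y(x)$, and uses only symbols already defined in this step.

```latex
\begin{aligned}
L_{reid}\big(f(x;\theta),\, y\big)
  &= -\ln p_y(x) \\
  &= -\ln \frac{e^{W_y \cdot z(x;V)}}{\sum_{j=1}^{K} e^{W_j \cdot z(x;V)}} \\
  &= \ln\!\Big(\sum_{j=1}^{K} e^{W_j \cdot z(x;V)}\Big) - W_y \cdot z(x;V)
\end{aligned}
```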

As stated above, in the gradient embedding calculation, if the network model is highly certain about the predicted class, the gradient embedding of the last layer of the network model has a small norm; otherwise, it has a larger norm. That is, when $\hat{y} = \arg\max_i p_i(x)$ is the label predicted by the current network model and the confidence of the network model is high, the predicted label $\hat{y}$ is used to compute the gradient of the last layer of the network model, $g_x^{\hat{y}}$. In that case, the norm of $g_x^{\hat{y}}$ differs little from the norm of $g_x^{y}$ computed with the true label; therefore, the gradient embedding captures both the uncertainty of the network model and the potential update direction of the network model.

For the true label $y$, the gradient of the last layer of the network model is denoted $g_x^{y}$; then:

$$g_x^{y} = \frac{\partial}{\partial W} L_{reid}\big(f(x;\theta),\, y\big)$$

wherein $g_x^{y}$ is the gradient of the last layer of the network model; $L_{reid}$ is the cross entropy loss function; $W$ is the weight of the last layer of the network model; $f(x;\theta)$ is the expression of the network model, $\theta$ is the weight parameter of the network model in the expression, and $x$ is a pedestrian image; $y$ is the true label of the pedestrian image $x$; $z(x;V)$ is the nonlinear mapping function that transforms the pedestrian image $x$ into a feature vector.
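To make the relation between certainty and gradient norm concrete, the sketch below computes the last-layer gradient embedding in plain Python. It relies on the standard result for a softmax output with cross entropy loss, namely that the gradient with respect to the weight of neuron $i$ is $(p_i - \mathbb{1}[i = y])\, z(x;V)$, and it uses the predicted label when no true label is given. The helper names and the toy weights are assumptions made for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_embedding(z, W, label=None):
    """Flattened last-layer gradient; block i equals (p_i - 1[i == label]) * z."""
    scores = [sum(w_k * z_k for w_k, z_k in zip(w, z)) for w in W]
    p = softmax(scores)
    if label is None:
        # Hypothetical label: the class the current model predicts.
        label = max(range(len(p)), key=p.__getitem__)
    g = []
    for i, p_i in enumerate(p):
        coeff = p_i - (1.0 if i == label else 0.0)
        g.extend(coeff * z_k for z_k in z)
    return g

def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))

z = [1.0, 2.0]                                          # feature z(x; V), toy values
W_confident = [[5.0, 5.0], [-5.0, -5.0], [0.0, 0.0]]    # one class dominates
W_uncertain = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0]]      # near-uniform prediction

g_conf = gradient_embedding(z, W_confident)
g_unc = gradient_embedding(z, W_uncertain)
```

In this toy setting the confident prediction yields an embedding with a norm close to zero, while the near-uniform prediction yields a much larger one, which is the behavior step 4 relies on.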

Step 5, after the gradient embedding is calculated, a batch of samples of size S is selected by clustering with the k-means++ algorithm and labeled, and the samples are fed into the ResNet101 network for training to obtain a trained network model.

In the sampling process, a sample with higher confidence has a smaller loss gradient, and when the k-means++ algorithm is adopted, a sample with higher confidence has only a small probability of being selected as the next cluster center.

It should be noted that the k-means++ algorithm is a known technique and is not described in detail.
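For readers unfamiliar with it, the seeding stage of k-means++ can be sketched as below: the first center is drawn uniformly at random, and each further center is drawn with probability proportional to its squared distance from the nearest center already chosen. This is a generic illustration of that seeding rule applied to gradient embeddings, not the exact procedure of the embodiment; the function names and toy embeddings are assumptions.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_select(embeddings, batch_size, rng):
    """Return indices of up to batch_size samples chosen by k-means++ seeding."""
    n = len(embeddings)
    chosen = [rng.randrange(n)]                      # first center: uniform
    d2 = [sq_dist(e, embeddings[chosen[0]]) for e in embeddings]
    while len(chosen) < min(batch_size, n) and sum(d2) > 0.0:
        # Draw the next center with probability proportional to d2.
        nxt = rng.choices(range(n), weights=d2, k=1)[0]
        chosen.append(nxt)
        # Keep, for every sample, the squared distance to its nearest center.
        d2 = [min(old, sq_dist(e, embeddings[nxt]))
              for old, e in zip(d2, embeddings)]
    return chosen

# Toy gradient embeddings (assumed values): three near-zero vectors
# and three large, mutually distant ones.
embeddings = [[0.0, 0.0], [0.01, 0.0], [0.0, 0.01],
              [3.0, 0.0], [0.0, 3.0], [-3.0, -3.0]]
picked = kmeanspp_select(embeddings, 3, random.Random(0))
```

Because an already chosen sample has zero distance to itself, it receives zero weight and cannot be drawn twice, so the selected batch contains distinct samples.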

Step 6, compressing the trained network model with a knowledge-distillation-based model compression method: first, an initial student model and a teacher model trained on a pedestrian Re-ID data set are given; while the student model is trained, the teacher model is frozen, and the knowledge of the teacher model is cyclically transferred to the student model in a dual-stream manner; meanwhile, the transfer progress and the transfer direction of the knowledge transfer process are monitored.

In order to deploy a well-performing network model more conveniently in practical applications, this embodiment compresses the network model with a knowledge-distillation-based model compression method, so that the performance of the network model is preserved as far as possible while its scale is reduced. The flow chart of the knowledge-distillation-based model compression method is shown in FIG. 3.

Specifically, an initial student model and a teacher model trained on a pedestrian Re-ID data set are given. The teacher model consists of deeper network layers, while the student model is simpler and has fewer network layers; the student model is the pedestrian Re-ID network model to be deployed in the actual scene, and has fewer network layers than the ResNet101 network. While the student model is trained, the teacher model is frozen, and the knowledge of the teacher model is cyclically transferred to the student model in a dual-stream manner, so that the student model learns the classification capability of the teacher model as far as possible. Meanwhile, in order to monitor the transfer progress and the transfer direction of the knowledge transfer process, the difference between the student model and the teacher model is measured by the Jensen-Shannon (JS) divergence; the Jensen-Shannon divergence is a well-known technique and is not described in detail.

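The transfer direction is as follows:

$$L_{logit} = \frac{1}{2} D_{KL}\!\left(l_s \,\Big\|\, \frac{l_s + l_t}{2}\right) + \frac{1}{2} D_{KL}\!\left(l_t \,\Big\|\, \frac{l_s + l_t}{2}\right)$$

wherein $L_{logit}$ is the transfer direction of knowledge, $l_s$ is the logit output of the student model, $l_t$ is the logit output of the teacher model, and $D_{KL}$ is the KL divergence between the teacher model and the student model.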

It should be noted that, because the KL (Kullback-Leibler) divergence is itself asymmetric, training a neural network with the KL divergence can give different results depending on the order of its arguments; the JS divergence is symmetric in the outputs of the two models, which avoids this problem; thus, this embodiment uses the Jensen-Shannon (JS) divergence to measure the difference between the student model and the teacher model.
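The asymmetry of the KL divergence and the symmetry of the JS divergence can be checked numerically. The sketch below uses the standard definition of the JS divergence (the average of the KL divergences of each distribution to their mixture) and assumes that the logit outputs are first turned into probability distributions with a softmax; the toy logits are illustrative only.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for distributions with q_i > 0 wherever p_i > 0."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)

def js_divergence(p, q):
    """JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m the mixture."""
    m = [(p_i + q_i) / 2.0 for p_i, q_i in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy logits (assumed values) for a student and a teacher over three identities.
student = softmax([1.0, 0.2, -0.5])
teacher = softmax([2.0, 0.1, -1.0])

kl_st = kl_divergence(student, teacher)
kl_ts = kl_divergence(teacher, student)
js_st = js_divergence(student, teacher)
js_ts = js_divergence(teacher, student)
```

Swapping the arguments changes the KL value but leaves the JS value unchanged, and the JS value is bounded above by ln 2 when natural logarithms are used.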

Step 7, acquiring the feature output of the pedestrian Re-ID network model, and obtaining a feature distillation loss that measures the gap in feature information transfer according to the local L2 norm between features.

Specifically, in order to guarantee the stability of the two models and make full use of the information in the teacher model, the feature output of the pedestrian Re-ID network model is obtained first, and the local L2 norm $d_f(T, S)$ is then adopted to reduce adverse effects on the student model during feature information transfer. Meanwhile, the output value before the rectified linear unit (ReLU) in the teacher model is selected as the source feature, and the output value r(x) after the 1 × 1 convolution in the student model is selected as the target feature; the feature distillation loss, which measures the gap in feature information transfer, is defined according to the local L2 norm between the features, and the transfer of feature information is thereby optimized.

The local L2 norm between the features is:

$$d_f(T, S) = \sum_{i=1}^{WHC} \begin{cases} 0, & \text{if } S_i \le T_i \le 0 \\ (T_i - S_i)^2, & \text{otherwise} \end{cases}$$

wherein $d_f(T, S)$ is the local L2 norm between features, $T$ is the feature vector of the teacher model, $S$ is the feature vector of the student model, $T_i$ is the $i$-th component of the teacher model feature vector, and $S_i$ is the $i$-th component of the student model feature vector; $d_f$ is the symbol representing the local L2 norm; $W$ is the width of the pedestrian image, $H$ is the height of the pedestrian image, and $C$ is the number of channels of the pedestrian image.
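A small numeric sketch may help here. It assumes the local L2 norm takes the partial form in which a component contributes nothing when the teacher value is non-positive and the student value is already at or below it ($S_i \le T_i \le 0$), and contributes the squared difference otherwise. Both this reading and the toy feature values are assumptions made for illustration.

```python
def local_l2(teacher, student):
    """Sum of (T_i - S_i)^2, skipping components where S_i <= T_i <= 0."""
    total = 0.0
    for t, s in zip(teacher, student):
        if s <= t <= 0.0:
            continue                 # no penalty for this component
        total += (t - s) ** 2
    return total

# Flattened toy features (assumed values), one entry per component i.
teacher = [1.0, -0.5, -0.5, 2.0]
student = [0.5, -1.0, 0.3, 2.0]
distance = local_l2(teacher, student)
```

Here only the first and third components contribute (0.25 and 0.64): the second is skipped because the student is below a non-positive teacher value, and the fourth matches exactly.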

The feature distillation loss is:

$$L_{overhaul} = d_f\big(\sigma(f_t),\, r(f_s)\big)$$

wherein $L_{overhaul}$ is the feature distillation loss; $f_t$ is the feature output by the teacher model; $f_s$ is the feature output by the student model; $\sigma(\cdot)$ is a nonlinear function and $r(\cdot)$ is a regressor; $d_f$ is the symbol representing the local L2 norm.

Note that the feature $f_t$ output by the teacher model needs to match the feature $f_s$ output by the student model; therefore, this embodiment applies a mapping transformation to $f_t$ using the nonlinear function $\sigma(\cdot)$, and applies a mapping transformation to $f_s$ using a regressor $r(\cdot)$ composed of a 1 × 1 convolution and a normalization layer.

Step 8, adding the feature distillation loss to the student model to obtain the total loss of the student model.

The total loss is:

$$L_{kd} = L_{logit} + \alpha L_{overhaul} + L_{reid}$$

wherein $L_{kd}$ is the total loss, $L_{logit}$ is the transfer direction of knowledge, $L_{overhaul}$ is the feature distillation loss, $L_{reid}$ is the cross entropy loss function, and $\alpha$ is the scaling factor.

It should be noted that, in order for the student model to perform well on the specific pedestrian Re-ID task, the feature distillation loss is added to the student model to re-optimize it, which yields the total loss of the student model.
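As a worked example of how the three terms combine, the snippet below evaluates the total loss for one set of toy values. The numbers, including the scaling factor α = 0.1, are assumptions chosen only to show the arithmetic; the original text does not specify a value for α.

```python
def total_student_loss(l_logit, l_overhaul, l_reid, alpha):
    """L_kd = L_logit + alpha * L_overhaul + L_reid."""
    return l_logit + alpha * l_overhaul + l_reid

# Toy values (assumed): logit term, feature distillation term, cross entropy.
l_kd = total_student_loss(l_logit=0.12, l_overhaul=0.89, l_reid=1.5, alpha=0.1)
```

Only the feature distillation term is scaled, so α controls how strongly feature matching weighs against the logit and cross entropy terms.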

The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; the modifications or substitutions do not make the essence of the corresponding technical solutions deviate from the technical solutions of the embodiments of the present application, and are included in the protection scope of the present application.

Claims

1. A pedestrian re-identification method based on deep active learning and model compression is characterized by comprising the following steps:

first, adopting a pedestrian re-identification strategy based on active learning: starting from a small warm-up pedestrian image set of size S, a set of S pedestrian images is obtained using the network model and a query function ψ(U, S, γ(·)); labels are manually annotated for the set of S pedestrian images; the S pedestrian images are then removed from the unlabeled data set U; the above process is repeated until the network model reaches the standard performance or the data is used up; wherein γ(·) is the query policy and S is greater than 1;

in each iteration, there are two calculation processes: gradient embedding calculation and sampling calculation; in the gradient embedding calculation, the norm of the gradient embedding of the last layer of the network model is determined by the certainty of the network model about the predicted class;

compressing the trained network model with a knowledge-distillation-based model compression method: first, an initial student model and a teacher model trained on a pedestrian Re-ID data set are given; while the student model is trained, the teacher model is frozen, and the knowledge of the teacher model is cyclically transferred to the student model in a dual-stream manner; meanwhile, the transfer progress and the transfer direction of the knowledge transfer process are monitored;

acquiring the feature output of the pedestrian Re-ID network model, and obtaining a feature distillation loss that measures the gap in feature information transfer according to the local L2 norm between features;

2. The pedestrian re-identification method based on deep active learning and model compression as claimed in claim 1, wherein the gradient of the last layer of the network model is denoted $g_x^{y}$; then:

$$g_x^{y} = \frac{\partial}{\partial W} L_{reid}\big(f(x;\theta),\, y\big)$$

wherein $g_x^{y}$ is the gradient of the last layer of the network model; $L_{reid}$ is the cross entropy loss function; $W$ is the weight of the last layer of the network model; $f(x;\theta)$ is the expression of the network model, $\theta$ is the weight parameter of the network model in the expression, and $x$ is a pedestrian image; $y$ is the true label of the pedestrian image $x$.

3. The pedestrian re-identification method based on deep active learning and model compression as claimed in claim 1, wherein the transfer direction is as follows:

$$L_{logit} = \frac{1}{2} D_{KL}\!\left(l_s \,\Big\|\, \frac{l_s + l_t}{2}\right) + \frac{1}{2} D_{KL}\!\left(l_t \,\Big\|\, \frac{l_s + l_t}{2}\right)$$

wherein $L_{logit}$ is the transfer direction of knowledge, $l_s$ is the logit output of the student model, $l_t$ is the logit output of the teacher model, and $D_{KL}$ is the KL divergence between the teacher model and the student model.

4. The pedestrian re-identification method based on deep active learning and model compression as claimed in claim 1, wherein the local L2 norm between the features is:

$$d_f(T, S) = \sum_{i=1}^{WHC} \begin{cases} 0, & \text{if } S_i \le T_i \le 0 \\ (T_i - S_i)^2, & \text{otherwise} \end{cases}$$

wherein $d_f(T, S)$ is the local L2 norm between features, $T$ is the feature vector of the teacher model, $S$ is the feature vector of the student model, $T_i$ is the $i$-th component of the teacher model feature vector, and $S_i$ is the $i$-th component of the student model feature vector; $d_f$ is the symbol representing the local L2 norm; $W$ is the width of the pedestrian image, $H$ is the height of the pedestrian image, and $C$ is the number of channels of the pedestrian image.

5. The pedestrian re-identification method based on deep active learning and model compression as claimed in claim 1, wherein the feature distillation loss is:

$$L_{overhaul} = d_f\big(\sigma(f_t),\, r(f_s)\big)$$

wherein $L_{overhaul}$ is the feature distillation loss; $f_t$ is the feature output by the teacher model; $f_s$ is the feature output by the student model; $\sigma(\cdot)$ is a nonlinear function and $r(\cdot)$ is a regressor; $d_f$ is the symbol representing the local L2 norm.

6. The pedestrian re-identification method based on deep active learning and model compression as claimed in claim 1, wherein the total loss is:

$$L_{kd} = L_{logit} + \alpha L_{overhaul} + L_{reid}$$

wherein $L_{kd}$ is the total loss, $L_{logit}$ is the transfer direction of knowledge, $L_{overhaul}$ is the feature distillation loss, $L_{reid}$ is the cross entropy loss function, and $\alpha$ is the scaling factor.