CN112560631B - Knowledge distillation-based pedestrian re-identification method - Google Patents

Knowledge distillation-based pedestrian re-identification method Download PDF

Info

Publication number
CN112560631B
CN112560631B (application CN202011431855.1A)
Authority
CN
China
Prior art keywords
network
student
teacher
distillation
output
Prior art date
Legal status
Active
Application number
CN202011431855.1A
Other languages
Chinese (zh)
Other versions
CN112560631A (en)
Inventor
尚振宏
李粘粘
Current Assignee
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202011431855.1A priority Critical patent/CN112560631B/en
Publication of CN112560631A publication Critical patent/CN112560631A/en
Application granted granted Critical
Publication of CN112560631B publication Critical patent/CN112560631B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition


Abstract

The invention discloses a pedestrian re-identification method based on knowledge distillation, comprising the following steps: a pedestrian image training set is input into a teacher network, and the same data set is input into a student network; distillation is carried out simultaneously at multiple stages of the whole backbone network through the combined effect of student network transformation, feature distillation positions and a distance loss function, so that the feature output of the student network continuously approaches the feature output of the teacher network; the parameters of the student model are updated by minimizing a distillation loss function, training the student network; distance measurement is performed on the obtained feature vectors to retrieve the pedestrian target image with the highest similarity. The accuracy of the student network (resnet18) is thereby greatly improved, approaching that of the teacher network (resnet50). The method realizes person re-identification through knowledge-distillation transfer learning, effectively reducing computational complexity by replacing a large model with a small one while preserving the accuracy of the student model.

Description

Knowledge distillation-based pedestrian re-identification method
Technical Field
The invention relates to the field of computer vision and image processing, in particular to a pedestrian re-identification method based on knowledge distillation.
Background
The purpose of person re-identification is to find a particular pedestrian in a gallery of images taken by many different cameras. The difficulty of this problem lies in the fact that shooting angle, pedestrian pose, illumination intensity and occlusion can differ greatly between pictures. In the pedestrian re-identification module, a specified query image is compared with the pictures in the gallery, and pictures of the same person as the query image are retrieved. To compare the gallery pictures with the query picture, the system first extracts a feature representation describing each image, using either hand-crafted descriptors or a deep neural network. Usually the gallery features are computed and stored offline in advance, so that at test time only the features of the query image need to be extracted. Once extracted, the features can be compared with the gallery features by computing a similarity measure.
In actual application scenarios, computing resources are often limited, so the runtime cost of the algorithm must be optimized while the algorithm still maintains high accuracy. Pedestrian re-identification algorithms mainly comprise methods based on hand-crafted features and deep learning methods, and the accuracy of deep-learning-based pedestrian re-identification far exceeds that of the traditional hand-crafted approaches. However, the computational cost of running a deep neural network is high. A pedestrian re-identification method based on deep learning can therefore be adopted that reduces the computational cost relative to existing deep learning methods and better meets the requirements of actual scenarios.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a pedestrian re-identification method based on knowledge distillation, which reduces the computational overhead of existing deep learning methods and better meets the requirements of actual scenarios. A new knowledge distillation method is therefore proposed: a smaller model is trained with the support of a larger/deeper network, reducing the amount of computation while enabling the small model to achieve accuracy very close to that of the deep network.
In order to solve the technical problems, the technical scheme of the invention is as follows: a pedestrian re-identification method based on knowledge distillation comprises the following steps:
step 1, inputting a pedestrian image training set into a PCB with a resnet50 backbone serving as the teacher network, and inputting the same data set into a PCB with a resnet18 backbone serving as the student network;
step 2, carrying out distillation simultaneously at multiple stages of the whole backbone network through the combined effect of student network transformation, feature distillation positions and a distance loss function, so that the feature output of the student network continuously approaches the feature output of the teacher network;
step 3, updating the parameters of the student model by minimizing the distillation loss function L_distill, and training the student network;
step 4, performing distance measurement on the obtained feature vectors and retrieving the pedestrian target image with the highest similarity.
As a further description of the above technical solution: the teacher model in step 1 is a trained model, a complex model that performs the same task as the student model and is used to assist in training the student network. The teacher network is trained using a PCB network structure with resnet50 as the backbone, and the student network, a PCB with resnet18 as the backbone, imitates the teacher through the distillation method. The feature map output by the backbone network is divided evenly in the vertical direction into 6 parts, i.e. 6 tensors of spatial size 4 × 8; global average pooling is then applied to each part to obtain 6 features A; the channel dimension of each feature A is reduced by a 1 × 1 convolution, and each reduced feature is then connected to its own fully connected layer and softmax.
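As a hedged illustration of the PCB head described above, the following pure-Python sketch splits a backbone feature map into 6 horizontal stripes and applies global average pooling to each stripe. The function name and the toy channel count (C = 2) are our own assumptions for illustration, not part of the patented implementation; a 24 × 8 spatial map yields 6 stripes of 4 × 8 as in the text.

```python
def pcb_part_pool(feature_map, n_parts=6):
    """feature_map: nested list [C][H][W]; returns n_parts part features of length C."""
    C = len(feature_map)
    H = len(feature_map[0])
    W = len(feature_map[0][0])
    assert H % n_parts == 0, "height must divide evenly into stripes"
    stripe_h = H // n_parts
    parts = []
    for p in range(n_parts):
        rows = range(p * stripe_h, (p + 1) * stripe_h)
        # global average pooling over one stripe, per channel
        vec = [
            sum(feature_map[c][h][w] for h in rows for w in range(W)) / (stripe_h * W)
            for c in range(C)
        ]
        parts.append(vec)
    return parts

# A 2-channel 24x8 map (value = row index) yields 6 part features of length 2.
fmap = [[[float(h) for w in range(8)] for h in range(24)] for c in range(2)]
parts = pcb_part_pool(fmap)
print(len(parts), len(parts[0]))  # 6 part features, each of length C=2
```

In the real network each of the 6 pooled vectors would then pass through its own 1 × 1 convolution, fully connected layer and softmax, as described above.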
As a further description of the above technical solution: the student network transformation in step 2 is as follows: the dimensionality of the student network is changed by processing its feature map, raising it with a 1 × 1 convolution to the number of channels of the corresponding teacher network feature map, and the feature map before the ReLU is taken for distillation. The values in this feature map include both positive and negative numbers; the student network only needs to approach the positive values of the teacher network as closely as possible, while for negative values the student's output need not coincide exactly with the teacher's negative values — it only needs to be negative like the teacher's. In this way, after passing through the ReLU layer, the negative values of both the teacher network and the student network output 0.
As a further description of the above technical solution: in step 2, the distillation positions are selected at multiple down-sampling stages of the backbone network; when resnet is adopted as the backbone, distillation is performed at the ends of Conv2_x, Conv3_x, Conv4_x and Conv5_x of resnet. Structurally, the distillation method is divided into two parts: the first part distills at the different stages of the backbone network; the second part distills the features after the fully connected layer. Ultimately, the feature sFeatureD output by the student network after the fully connected layer should be as similar as possible to the feature tFeatureD output by the teacher network.
As a further description of the above technical solution: in step 2, for the loss function of the first part of distillation in the backbone network, let N, S ∈ R^(W×H×C) be the features extracted by the teacher network and the student network at each stage of the backbone network, and N_i, S_i ∈ R the values at the i-th position of the features, where R^(W×H×C) denotes the three-dimensional feature map with width W, height H and C channels. After the student feature used for distillation is converted by a 1 × 1 convolution and batch normalization to match the dimensions of the teacher feature used for distillation, the distance between the student and teacher features is calculated, as shown in formula (1);
d_p(N, S) = Σ_i δ_i,  where δ_i = 0 if S_i ≤ N_i ≤ 0, and δ_i = (N_i − S_i)^2 otherwise   (1)
In formula (1), N denotes the teacher's features, S the student's features, and d_p(N, S) the distance function. The distance loss computed by d_p(N, S) makes the output of the student network at multiple stages of the backbone more and more similar to the output of the teacher network at the corresponding stages, so that the human-body features extracted by the two networks also become more similar. With r denoting the transformation (1 × 1 convolution and batch normalization) applied after the student backbone extracts the feature map, the distillation loss function of the first part is defined as shown in formula (2):
L_distill1 = d_p(F_n, r(F_s))   (2)
In formula (2), F_n denotes the teacher features and F_s the student features before the transformation r.
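The exact form of d_p is rendered as an equation image in the original; assuming the partial-L2 distance implied by the surrounding ReLU discussion (no penalty at positions where both teacher and student values would be zeroed by the ReLU), a minimal pure-Python sketch over flattened features might look as follows. The transformation r (1 × 1 convolution plus batch normalization) is omitted here.

```python
def partial_l2(teacher, student):
    """Assumed partial L2 distance d_p: skip positions where S_i <= N_i <= 0."""
    total = 0.0
    for n_i, s_i in zip(teacher, student):
        if s_i <= n_i <= 0:
            # the ReLU would output 0 for both networks here, so no penalty
            continue
        total += (n_i - s_i) ** 2
    return total

# Only the second position is penalized: the first pair is jointly negative.
print(partial_l2([-1.0, 2.0], [-3.0, 1.0]))
```

This matches the text's intent: positive teacher values must be approached, while a student value that is "at least as negative" as a negative teacher value costs nothing.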
As a further description of the above technical solution: the second part of distillation in step 2 distills the extracted human-body features, i.e. the network features after the fully connected layer; the modified softmax function proposed by Hinton is shown in formula (3):
σ(z; T)_i = exp(z_i / T) / Σ_j exp(z_j / T)   (3)
In formula (3), T is a temperature parameter; when T = 1 this is the standard softmax function. As T increases, the probability distribution output by softmax becomes smoother, so that more of the teacher network's information can be utilized;
when the student network is trained, its softmax function uses the same T as the teacher network, and the loss function takes the soft labels output by the teacher network as the target; this loss is called the "distillation loss". Using the correct data labels (hard labels) during training gives better results: specifically, while computing the distillation loss, the standard loss with T = 1 is also computed using the hard labels; this loss is called the "student loss". The two losses are combined to obtain the distillation loss function of the second part, as shown in formula (4):
L_distill2(x; θ) = α·M(y, σ(z_s; T=1)) + β·M(σ(z_t; T=τ), σ(z_s; T=τ))   (4)
In formula (4), x is the input, θ the parameters of the student model, M the cross-entropy loss function, y the true label, σ the softmax function with temperature parameter T, τ > 1, α and β coefficients, and z_s, z_t the logits output by the student and the teacher, respectively.
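Formulas (3) and (4) can be sketched in pure Python as follows; the values of τ, α and β are illustrative assumptions (the patent does not fix them at this point).

```python
import math

def softmax_T(z, T=1.0):
    # formula (3): temperature-scaled softmax; larger T gives a smoother distribution
    exps = [math.exp(v / T) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    # M in formula (4): cross entropy between a target distribution and a prediction
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

def distill2_loss(y, z_s, z_t, n_classes, tau=4.0, alpha=0.5, beta=0.5):
    # formula (4): alpha * student loss (hard label, T = 1)
    #            + beta * distillation loss (teacher soft labels, T = tau)
    hard = [1.0 if i == y else 0.0 for i in range(n_classes)]
    student_loss = cross_entropy(hard, softmax_T(z_s, T=1.0))
    distill_loss = cross_entropy(softmax_T(z_t, tau), softmax_T(z_s, tau))
    return alpha * student_loss + beta * distill_loss

# Toy logits for a 3-class problem (illustrative values only).
loss = distill2_loss(y=2, z_s=[0.5, 1.0, 3.0], z_t=[0.2, 0.8, 3.5], n_classes=3)
print(round(loss, 4))
```

Raising T visibly flattens the distribution, which is exactly why the soft teacher targets carry more inter-class information than hard labels.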
As a further description of the above technical solution: the final loss function for knowledge-distillation pedestrian re-identification obtained in step 3 is shown in formula (5):
L_distill = λ·L_distill1 + μ·L_distill2   (5)
In formula (5), λ and μ are constant weighting coefficients.
As a further description of the above technical solution: in step 4, the feature vector of the image to be identified is compared with the pedestrian feature vectors of the image set, and the pedestrian target image with the highest similarity is retrieved.
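Step 4 can be sketched as a nearest-neighbour search over pre-computed gallery features. Euclidean distance is used here as an assumed metric (this paragraph of the patent does not specify one), and the feature values are illustrative.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query, gallery):
    """Return the index of the gallery feature closest to the query (highest similarity)."""
    return min(range(len(gallery)), key=lambda i: euclidean(query, gallery[i]))

# Toy gallery of pre-computed pedestrian feature vectors (illustrative values).
gallery = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
best = retrieve([1.0, 0.05], gallery)
print(best)  # index of the most similar gallery feature
```

In practice the gallery features would be the student network's outputs, computed and stored offline as described in the Background section.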
Compared with the prior art, the invention has the following beneficial effects: the invention provides a trade-off analysis between test-time accuracy and computational cost for the person re-identification problem, and proposes an improvement that optimizes this trade-off toward the configuration most appropriate to actual application conditions. To this end, the invention uses resnet50 (as teacher) to transfer knowledge into the more compact resnet18 (as student); the accuracy of resnet18 is thereby greatly improved, approaching that of resnet50. The method realizes person re-identification through knowledge-distillation transfer learning, effectively reducing computational complexity by replacing a large model with a small one while preserving the accuracy of the student model. The amount of computation is reduced, and the small model achieves accuracy very close to that of the deep network.
In the invention, because the dimensionality of the teacher network is higher, the 1 × 1 convolution plus batch normalization in the student network transformation keeps the dimensions of the student and teacher models as consistent as possible, which is more conducive to extracting feature information and makes training more stable. With the distance function used as the distance loss, when the loss is large, gradient back-propagation updates the parameters and minimizes the loss function, so that the student's output features continuously approach the teacher's features.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The technical solutions of the present invention will be described in further detail with reference to the drawings and a specific example, but the present invention is not limited to the following embodiment.
Example 1
Step 1, inputting a pedestrian image training set into a PCB with a resnet50 backbone serving as the teacher network, and inputting the same data set into a PCB with a resnet18 backbone serving as the student network;
in the step 1, the teacher model is a trained model and a complex model which completes the same task with the student model and is used for assisting in training a student network. The teacher network trains by using a network structure of a PCB with a backbone network as resnet50, and the student network trains by simulating teachers by using a distillation method by using a PCB with a backbone network as resnet 18. The feature graph output by the backbone network is longitudinally and uniformly divided into 6 parts, namely 6 tensors with the space size of 4 × 8, then global average pooling is respectively carried out to obtain 6 features A, the features A are reduced into the number of channels by using 1 × 1 convolution, and then the full connection layer and softmax are respectively connected.
Step 2, distilling the student network simultaneously at multiple stages of the whole backbone network through the combined effect of student network transformation, feature distillation positions and a distance loss function, so that the feature output of the student network continuously approaches the feature output of the teacher network;
The student network transformation process is as follows: the dimensionality of the student network is changed. Because the numbers of output feature map channels of the teacher and student networks differ at the various stages of the backbone network, the difference between the teacher and student feature maps cannot be computed directly; we therefore process the student network's feature map, raising its dimensionality with a 1 × 1 convolution to the number of channels of the corresponding teacher feature map. Furthermore, the distillation method in this embodiment fully takes the characteristics of the ReLU into account: the feature map before the ReLU is taken for distillation, and its values include both positive and negative numbers. The student network only needs to approach the positive values of the teacher network as closely as possible; for negative values, the student's output need not coincide exactly with the teacher's negative values — it only needs to be negative like the teacher's. In this way, after passing through the ReLU layer, the negative values of both the teacher network and the student network output 0.
Distillation positions are selected at multiple down-sampling stages of the backbone network. When resnet is used as the backbone, distillation is performed at the ends of Conv2_x, Conv3_x, Conv4_x and Conv5_x of resnet. Structurally, the distillation method is divided into two parts: the first part distills at the different stages of the backbone network; the second part distills the features after the fully connected layer. Ultimately, it is desirable that the feature sFeatureD output by the student network after the fully connected layer be as similar as possible to the feature tFeatureD output by the teacher network.
For the loss function of the first part of distillation in the backbone network, let N, S ∈ R^(W×H×C) be the features extracted by the teacher network and the student network at each stage of the backbone network, and N_i, S_i ∈ R the values at the i-th position. After the student feature used for distillation is converted by a 1 × 1 convolution and batch normalization to the same dimensions as the teacher feature used for distillation, the distance between the student and teacher network features is calculated, as shown in formula (1);

d_p(N, S) = Σ_i δ_i,  where δ_i = 0 if S_i ≤ N_i ≤ 0, and δ_i = (N_i − S_i)^2 otherwise   (1)
In formula (1), N denotes the teacher's features, S the student's features, and d_p(N, S) the distance function. The distance loss computed by d_p(N, S) makes the output of the student network at multiple stages of the backbone more and more similar to the output of the teacher network at the corresponding stages, so that the human-body features extracted by the two networks also become more similar. With r denoting the transformation (1 × 1 convolution and batch normalization) applied after the student backbone extracts the feature map, the distillation loss function of the first part is defined as shown in formula (2):
L_distill1 = d_p(F_n, r(F_s))   (2);
In formula (2), F_n denotes the teacher features and F_s the student features before the transformation r;
the second part of distillation is to distill the extracted human body characteristics, namely the network characteristics behind the full connection layer; the second part of distillation in the step 2 is to distill the extracted human body characteristics, namely the network characteristics after the full connection layer, and an improved Soft max function proposed by Hinton is utilized, as shown in formula (3):
Figure BDA0002826774730000071
In formula (3), T is a temperature parameter. When T = 1 this is the standard softmax function; as T increases, the probability distribution output by softmax becomes smoother, making more of the teacher model's information available. When training the student, the student's softmax function uses the same T as the teacher, and the loss function takes the soft labels output by the teacher as the target; this loss is called the "distillation loss". Using the correct data labels (hard labels) during training gives better results: specifically, while computing the distillation loss, we also compute the standard loss with T = 1 using the hard labels; this loss is called the "student loss". The formula combining the two losses is shown in formula (4):
L_distill2(x; θ) = α·M(y, σ(z_s; T=1)) + β·M(σ(z_t; T=τ), σ(z_s; T=τ))   (4)
In formula (4), x is the input, θ the parameters of the student model, M the cross-entropy loss function, y the true label, σ the softmax function with temperature parameter T, τ > 1, α and β coefficients, and z_s, z_t the logits output by the student and the teacher, respectively.
Step 3, updating the parameters of the student model by minimizing the distillation loss function L_distill, and training the student network; the final loss function for knowledge-distillation pedestrian re-identification is shown in formula (5):
L_distill = λ·L_distill1 + μ·L_distill2   (5)
In formula (5), λ = 2 and μ = 6.
Step 4, comparing the feature vector of the image to be identified with the pedestrian feature vectors of the image set, performing distance measurement on the obtained feature vectors, and retrieving the pedestrian target image with the highest similarity.
In this embodiment, the input picture size is 384 × 128 and the batch size is 64; resnet50 serves as the backbone of the teacher network, with parameters trained on ImageNet used as the pre-trained model. SGD with momentum 0.9 is chosen as the optimizer. The initial learning rate for distillation training is 0.5; the learning rate is decayed to 0.05 at epoch 20 and to 0.0005 at epoch 40, and training stops at epoch 60. During training, pictures are pre-processed with random horizontal flipping for data augmentation. The method of this embodiment and existing methods are verified and compared on the Market-1501 and DukeMTMC-reID data sets. These two data sets contain problems found in practical applications: each identity appears in different cameras with different viewing angles, poses and illumination changes, so testing on them is both highly challenging and meaningful.
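The learning-rate schedule of this embodiment can be sketched as a piecewise-constant function; whether each decay applies exactly at or just after the stated epoch is an assumption of this sketch.

```python
def distill_lr(epoch):
    """Learning rate for the distillation training described above:
    0.5 initially, decayed to 0.05 at epoch 20 and to 0.0005 at epoch 40;
    training stops at epoch 60."""
    if epoch < 20:
        return 0.5
    if epoch < 40:
        return 0.05
    return 0.0005

# Full 60-epoch schedule as used in the embodiment.
schedule = [distill_lr(e) for e in range(60)]
print(schedule[0], schedule[20], schedule[40])
```

In a framework such as PyTorch this would typically be wired into the optimizer via a lambda/step scheduler rather than computed by hand.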
The results are shown in tables 1 and 2.
TABLE 1 Experimental results on Market-1501 data set
(The results of Table 1 are presented as an image in the original publication and are not reproduced here.)
TABLE 2 Experimental results on DukeMTMC-reiD data set
(The results of Table 2 are presented as an image in the original publication and are not reproduced here.)
In the experimental method of this example, PCB+MKD indicates that resnet18 was distilled by resnet50 as teacher at the last four stages of resnet; PCB+FKD indicates that resnet18 was distilled by resnet50 as teacher after the last fully connected layer; and PCB+MKD+FKD indicates that the four-stage distillation and the distillation after the fully connected layer were applied simultaneously, with resnet50 as teacher distilling resnet18.
During distillation, it is desirable that the feature output by the student network after the fully connected layer (sFeatureD) be as similar as possible to the feature output by the teacher network (tFeatureD). Although ultimately only an sFeatureD similar to tFeatureD is needed, satisfactory results are difficult to achieve if sFeatureD is distilled alone: because of the large difference between the student's backbone and the teacher's backbone, it is extremely difficult to make sFeatureD and tFeatureD as close as possible by distilling only these two features. We therefore perform distillation simultaneously at multiple stages of the whole backbone network to aid the distillation between sFeatureD and tFeatureD, which makes an sFeatureD similar to tFeatureD much easier to obtain. The experimental results in Tables 1 and 2 show that the performance of the student network can even exceed that of the teacher network, demonstrating the effectiveness of the proposed method.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims (2)

1. A pedestrian re-identification method based on knowledge distillation is characterized by comprising the following steps:
step 1, inputting a pedestrian image training set into a PCB with a resnet50 backbone serving as the teacher network, and inputting the same data set into a PCB with a resnet18 backbone serving as the student network;
step 2, carrying out distillation simultaneously at multiple stages of the whole backbone network through the combined effect of student network transformation, feature distillation positions and a distance loss function, so that the feature output of the student network continuously approaches the feature output of the teacher network;
step 3, updating the parameters of the student model by minimizing the distillation loss function L_distill, and training the student network;
step 4, performing distance measurement on the obtained feature vectors and retrieving the pedestrian target image with the highest similarity;
the teacher network in step 1 is a trained model, a complex model that performs the same task as the student network and is used to assist in training the student network; the teacher network is trained using a PCB network structure with resnet50 as the backbone, and the student network, a PCB with resnet18 as the backbone, imitates the teacher model through the distillation method; the feature map output by the backbone network is divided evenly in the vertical direction into 6 parts, i.e. 6 tensors of spatial size 4 × 8; global average pooling is then applied to each part to obtain 6 features A; the channel dimension of each feature A is reduced by a 1 × 1 convolution, and each reduced feature is then connected to its own fully connected layer and softmax;
the student network transformation in step 2 is as follows: the dimensionality of the student network is changed by processing its feature map, raising it with a 1 × 1 convolution to the number of channels of the corresponding teacher network feature map, and the feature map before the ReLU is taken for distillation; the values in this feature map include both positive and negative numbers; the student network only needs to approach the positive values of the teacher network as closely as possible, while for negative values the student's output need not coincide exactly with the teacher's negative values — it only needs to be negative like the teacher's; in this way, after passing through the ReLU layer, the negative values of both the teacher network and the student network output 0;
in step 2, the distillation positions are selected at multiple down-sampling stages of the backbone network; when resnet is adopted as the backbone, distillation is performed at the ends of Conv2_x, Conv3_x, Conv4_x and Conv5_x of resnet; structurally, the distillation method is divided into two parts: the first part distills at the different stages of the backbone network, and the second part distills the features after the fully connected layer; ultimately, the feature sFeatureD output by the student network after the fully connected layer should be as similar as possible to the feature tFeatureD output by the teacher network;
in step 2, for the loss function of the first part of distillation in the backbone network, let N, S ∈ R^(W×H×C) be the features extracted by the teacher network and the student network at each stage of the backbone network, and N_i, S_i ∈ R the values at the i-th position of the features, where R^(W×H×C) denotes the three-dimensional feature map with width W, height H and C channels; after the student feature used for distillation is converted by a 1 × 1 convolution and batch normalization to match the dimensions of the teacher feature used for distillation, the distance between the student and teacher features is calculated, as shown in formula (1);
d_p(N, S) = Σ_{i=1}^{W×H×C} { 0, if N_i ≤ 0 and S_i ≤ 0; (N_i − S_i)², otherwise }    (1)
in formula (1), N represents the teacher's features, S represents the student's features, and d_p(N, S) represents the distance function; the distance loss calculated by d_p(N, S) makes the output of the student network at the several backbone stages more and more similar to the output of the teacher network at the corresponding stages, so that the human body features extracted by the two networks also become more similar; r is the transformation function, a 1 × 1 convolution followed by batch normalization applied after the feature map is extracted by the student backbone network, and the first-part distillation loss function is defined as shown in formula (2):
L_distill1 = d_p(F_n, r(F_s))    (2)
in formula (2), F_n represents the teacher's features and F_s represents the student's features before the transformation;
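A minimal NumPy sketch of the distance described for formula (1), under the reading above that positions where both the teacher and student responses are negative contribute nothing (ReLU zeroes both anyway); the function name is our own:

```python
import numpy as np

def partial_l2(n, s):
    # Distance in the spirit of formula (1): positions where both the
    # teacher response n and the student response s are negative are
    # skipped, since ReLU maps both to 0; all other positions are
    # penalized by the squared difference.
    n = np.asarray(n, dtype=float)
    s = np.asarray(s, dtype=float)
    keep = ~((n <= 0) & (s <= 0))
    return float(((n - s) ** 2 * keep).sum())
```

For example, two negative responses incur no penalty, while a positive teacher response against a smaller student response is penalized by the squared gap.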
the second-part distillation in step 2 distills the extracted human body features, i.e. the network features after the fully connected layer; the modified Softmax loss function proposed by Hinton is shown in formula (3):
q_i = exp(z_i / T) / Σ_j exp(z_j / T)    (3)
in formula (3), T is the temperature parameter; when T = 1 this is the standard softmax function, and as T increases the probability distribution output by the softmax becomes smoother, so that more of the teacher network's information can be used;
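The smoothing effect of the temperature can be checked with a small NumPy sketch of formula (3); the logits here are arbitrary illustrative values:

```python
import numpy as np

def softmax(z, T=1.0):
    # Formula (3): temperature-scaled softmax (max subtracted for stability).
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([6.0, 2.0, 1.0])  # arbitrary illustrative logits
p1 = softmax(logits, T=1.0)         # sharp: standard softmax
p4 = softmax(logits, T=4.0)         # smoother: secondary classes gain mass
```

At T = 4 the dominant class loses probability mass to the secondary classes, which is exactly the extra "dark knowledge" the student can learn from.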
when the student network is trained, its softmax function uses the same T as the teacher network, and the loss function takes the soft label output by the teacher network as the target; such a loss function is called the "distillation loss"; the effect is better when the correct data labels are also used during training, specifically by calculating the distillation loss while simultaneously calculating the standard loss with T = 1 using the hard label, which is called the "student loss"; combining the two losses gives the second-part distillation loss function, as shown in formula (4):
L_distill2(x; θ) = α · M(y, σ(z_s; T = 1)) + β · M(σ(z_t; T = τ), σ(z_s; T = τ))    (4)
in formula (4), x is the input, θ are the parameters of the student model, M is the cross-entropy loss function, y is the true label, σ is the softmax function parameterized by the temperature T, α and β are hyper-parameters, and z_s, z_t are the logits output by the student and the teacher, respectively;
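Formula (4) can be sketched in NumPy as follows, with M as cross-entropy and σ as the temperature softmax; the default values of α, β and τ are illustrative, not those of the patent:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    # M(p, q) = -sum_i p_i * log(q_i)
    return float(-(np.asarray(p) * np.log(np.asarray(q) + eps)).sum())

def distill_loss2(y, z_s, z_t, alpha=0.5, beta=0.5, tau=4.0):
    # Formula (4): hard-label "student loss" at T = 1 plus the soft-label
    # loss against the teacher's temperature-smoothed output at T = tau.
    student = cross_entropy(y, softmax(z_s, T=1.0))
    soft = cross_entropy(softmax(z_t, T=tau), softmax(z_s, T=tau))
    return alpha * student + beta * soft
```

A student whose logits match the teacher's and point at the correct class incurs a smaller loss than one that contradicts both targets.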
the loss function of knowledge-distillation pedestrian re-identification finally obtained in step 3 is shown in formula (5):
L_distill = λ · L_distill1 + μ · L_distill2    (5)
in the formula (5), λ and μ are constants.
2. The knowledge distillation-based pedestrian re-identification method according to claim 1, wherein in step 4 the feature vector of the image to be identified is compared with the pedestrian feature vectors of the image set, and the pedestrian target image with the highest similarity is retrieved.
CN202011431855.1A 2020-12-09 2020-12-09 Knowledge distillation-based pedestrian re-identification method Active CN112560631B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011431855.1A CN112560631B (en) 2020-12-09 2020-12-09 Knowledge distillation-based pedestrian re-identification method

Publications (2)

Publication Number Publication Date
CN112560631A CN112560631A (en) 2021-03-26
CN112560631B true CN112560631B (en) 2022-06-21

Family

ID=75060078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011431855.1A Active CN112560631B (en) 2020-12-09 2020-12-09 Knowledge distillation-based pedestrian re-identification method

Country Status (1)

Country Link
CN (1) CN112560631B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297906B (en) * 2021-04-20 2022-09-09 之江实验室 Knowledge distillation-based pedestrian re-recognition model compression method and evaluation method
CN113128460B (en) * 2021-05-06 2022-11-08 东南大学 Knowledge distillation-based multi-resolution pedestrian re-identification method
CN113344213A (en) * 2021-05-25 2021-09-03 北京百度网讯科技有限公司 Knowledge distillation method, knowledge distillation device, electronic equipment and computer readable storage medium
CN113269117B (en) * 2021-06-04 2022-12-13 重庆大学 Knowledge distillation-based pedestrian re-identification method
CN113281048B (en) * 2021-06-25 2022-03-29 华中科技大学 Rolling bearing fault diagnosis method and system based on relational knowledge distillation
CN113515656B (en) * 2021-07-06 2022-10-11 天津大学 Multi-view target identification and retrieval method and device based on incremental learning
CN113505719B (en) * 2021-07-21 2023-11-24 山东科技大学 Gait recognition model compression system and method based on local-integral combined knowledge distillation algorithm
CN113360701B (en) * 2021-08-09 2021-11-02 成都考拉悠然科技有限公司 Sketch processing method and system based on knowledge distillation
CN113673254B (en) * 2021-08-23 2022-06-07 东北林业大学 Knowledge distillation position detection method based on similarity maintenance
CN113487614B (en) * 2021-09-08 2021-11-30 四川大学 Training method and device for fetus ultrasonic standard section image recognition network model
CN113920540A (en) * 2021-11-04 2022-01-11 厦门市美亚柏科信息股份有限公司 Knowledge distillation-based pedestrian re-identification method, device, equipment and storage medium
CN114299442A (en) * 2021-11-15 2022-04-08 苏州浪潮智能科技有限公司 Pedestrian re-identification method and system, electronic equipment and storage medium
CN114549901B (en) * 2022-02-24 2024-05-14 杭州电子科技大学 Multi-network combined auxiliary generation type knowledge distillation method
CN115223117B (en) * 2022-05-30 2023-05-30 九识智行(北京)科技有限公司 Training and using method, device, medium and equipment of three-dimensional target detection model
CN115204394A (en) * 2022-07-05 2022-10-18 上海人工智能创新中心 Knowledge distillation method for target detection
CN116563642B (en) * 2023-05-30 2024-02-27 智慧眼科技股份有限公司 Image classification model credible training and image classification method, device and equipment
CN117612214B (en) * 2024-01-23 2024-04-12 南京航空航天大学 Pedestrian search model compression method based on knowledge distillation

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180268292A1 (en) * 2017-03-17 2018-09-20 Nec Laboratories America, Inc. Learning efficient object detection models with knowledge distillation
CN108764462A (en) * 2018-05-29 2018-11-06 成都视观天下科技有限公司 A kind of convolutional neural networks optimization method of knowledge based distillation
CN110837761B (en) * 2018-08-17 2023-04-07 北京市商汤科技开发有限公司 Multi-model knowledge distillation method and device, electronic equipment and storage medium
US11636337B2 (en) * 2019-03-22 2023-04-25 Royal Bank Of Canada System and method for knowledge distillation between neural networks
CN110059740A (en) * 2019-04-12 2019-07-26 杭州电子科技大学 A kind of deep learning semantic segmentation model compression method for embedded mobile end
CN111160409A (en) * 2019-12-11 2020-05-15 浙江大学 Heterogeneous neural network knowledge reorganization method based on common feature learning
CN111126573B (en) * 2019-12-27 2023-06-09 深圳力维智联技术有限公司 Model distillation improvement method, device and storage medium based on individual learning
CN111626330B (en) * 2020-04-23 2022-07-26 南京邮电大学 Target detection method and system based on multi-scale characteristic diagram reconstruction and knowledge distillation
CN112001278A (en) * 2020-08-11 2020-11-27 中山大学 Crowd counting model based on structured knowledge distillation and method thereof


Similar Documents

Publication Publication Date Title
CN112560631B (en) Knowledge distillation-based pedestrian re-identification method
CN111325111A (en) Pedestrian re-identification method integrating inverse attention and multi-scale deep supervision
CN109784258A (en) A kind of pedestrian's recognition methods again cut and merged based on Analysis On Multi-scale Features
CN107169117B (en) Hand-drawn human motion retrieval method based on automatic encoder and DTW
CN110516095A (en) Weakly supervised depth Hash social activity image search method and system based on semanteme migration
Yang et al. Cross-domain visual representations via unsupervised graph alignment
CN113743544A (en) Cross-modal neural network construction method, pedestrian retrieval method and system
CN113610046B (en) Behavior recognition method based on depth video linkage characteristics
CN115830637B (en) Method for re-identifying blocked pedestrians based on attitude estimation and background suppression
CN112232134A (en) Human body posture estimation method based on hourglass network and attention mechanism
CN110490028A (en) Recognition of face network training method, equipment and storage medium based on deep learning
CN113065409A (en) Unsupervised pedestrian re-identification method based on camera distribution difference alignment constraint
CN114170659A (en) Facial emotion recognition method based on attention mechanism
Schoneveld et al. Towards a general deep feature extractor for facial expression recognition
CN113592008B (en) System, method, device and storage medium for classifying small sample images
CN111144469B (en) End-to-end multi-sequence text recognition method based on multi-dimensional associated time sequence classification neural network
CN115017366B (en) Unsupervised video hash retrieval method based on multi-granularity contextualization and multi-structure preservation
CN113887653B (en) Positioning method and system for tight coupling weak supervision learning based on ternary network
CN116311504A (en) Small sample behavior recognition method, system and equipment
CN115100694A (en) Fingerprint quick retrieval method based on self-supervision neural network
CN114821632A (en) Method for re-identifying blocked pedestrians
Zhang et al. Research On Face Image Clustering Based On Integrating Som And Spectral Clustering Algorithm
CN111401519B (en) Deep neural network unsupervised learning method based on similarity distance in object and between objects
Yang et al. Robust feature mining transformer for occluded person re-identification
LU102992B1 (en) Siamese network target tracking method based on channel and spatial attention mechanisms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No.727, Jingming South Road, Kunming, Yunnan 650500

Applicant after: Kunming University of Science and Technology

Address before: No.72, Jingming South Road, Chenggong District, Kunming, Yunnan 650000

Applicant before: Kunming University of Science and Technology

GR01 Patent grant