CN112633406A

CN112633406A - Knowledge distillation-based few-sample target detection method

Info

Publication number: CN112633406A
Application number: CN202011626826.0A
Authority: CN
Inventors: 杨嘉琛; 郭晓岚; 王晨光
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2021-04-09

Abstract

The invention provides a knowledge distillation-based few-sample target detection method, which is characterized by comprising the following steps of: constructing a picture database with labels and tags to meet the condition of target detection of few samples, and enhancing the diversity of the extended samples by certain data, namely, combining training data and fine-tuning training data; selecting a target detection framework and a backbone network, constructing a network model, and training the network model by using joint training data to obtain a weight model; fine-tuning the first weight model obtained in the second step by using fine-tuning training data to obtain a new weight model, which is called a second weight model; and taking the first weight model obtained in the second step as a student network, taking the second weight model obtained in the third step as a teacher network, using the fine tuning training data again, and performing knowledge distillation according to a test result of the teacher network on the fine tuning training data to realize fine tuning of the student network weight so as to obtain a final third weight model.

Description

Knowledge distillation-based few-sample target detection method

Technical Field

The invention belongs to the field of few-sample object detection, and relates to a few-sample target detection method.

Background

In the field of computer vision, object detection is a very popular research topic. Especially with the proposed Convolutional Neural Network (CNN) and its wide application in the field of image processing, object detection has been rapidly developed and achieved significant achievement^[1]. However, the general CNN-based target detection framework needs to be trained by a large data set. When a detection task with only rare samples is faced, the current target detection cannot achieve satisfactory effect, which is also a limitation of the general target detection technology. The main reason for this is that CNN is too powerful in its characterization capability, and when faced with too few samples, the samples are over-fitted, resulting in a lack of generalization capability on new samples, which in turn reduces the detection capability. Therefore, in order to improve the detection performance of target detection under the condition of few samples, a method for effectively improving the network overfitting must be considered.

The overfitting refers to a problem in the model parameter fitting process, and since the training data contains sampling errors, the sampling errors are also considered by the complex model during training, and the sampling errors are well fitted. The method has the main performance that the model has good effect on a training set, has poor effect on a test set and has weak generalization capability. In the case of a very small number of samples, the target detection network is susceptible to such overfitting. Some solutions to this problem are available. Some researchers use data enhancement-based methods to improve the over-fitting problem by increasing the sample complexity to create interference to the neural network model, but such methods are still not applicable for the case of few samples (e.g., only 1 or 2, etc.). Some researchers recently used meta-learning methods^[2]The method enables the neural network to have the ability of relearning through a task learning mode, and then well learns the knowledge of a small number of samples. Although some progress is made in the field of target detection of a few samples, the methods usually need to add an additional meta-learning module to enhance the characterization capability of the model on the few samples, and have no universality and a complex structure.

Knowledge steamerDistillation (KD) is the desire to migrate dark knowledge in complex models (teachers) to simple models (students), which are typically more compact than teacher models. Through knowledge distillation, it is desirable that student models approach or exceed teachers as much as possible, thereby achieving similar predictive results with less complexity^[3]. At present, there are two main modes of application of knowledge distillation in CNN: (1) the general distillation mode, i.e. distillation between different models, is to perform knowledge transfer between models. (2) Self-distillation mode, i.e. distillation between the same models, treats distillation as a regularization way to improve the network model. Therefore, the self-knowledge distillation can be used as a training method for a target detection task with few samples, the network is retrained by the self-knowledge distillation, the overfitting degree of the neural network under the condition of few samples is reduced, and an additional module is not required to be added.

[1] Liu dong, plum, Cao Shi Dong, deep learning and its application in image object classification and detection are reviewed [ J ] computer science 2016, (12):13-23.

[2] Panxingmu, zhang xulong, dong unknown, etc. current research situation of few sample object detection [ J ] proceedings of Nanjing university of information engineering (Nature science edition), 2019(6).

[3]Hinton G,Vinyals O,Dean J.Distilling the Knowledge in a Neural Network[J].Computer Science,2015,14(7):38-39.

Disclosure of Invention

Aiming at the problem of overfitting of a network in the detection of the target with less samples, the invention provides the method for detecting the target with less samples based on knowledge distillation, which can be used for conveniently, universally and accurately approximating the overfitting network to improve the overfitting. The general target detection framework has better low-sample detection performance. The technical scheme is as follows:

a few-sample target detection method based on knowledge distillation is characterized by comprising the following specific operation steps:

the first step is as follows: and constructing a picture database with labels and tags to meet the condition of target detection of few samples, and enhancing the diversity of the extended samples by certain data to divide the extended samples into joint training data D _ join and fine-tuning training data D _ ft.

The second step is that: selecting a target detection framework and a backbone network, constructing a network model, and training the network model by using joint training data to obtain a weight model; the method comprises the following steps:

selecting fast-RCNN as a target detection framework, adopting VGG16 with 13 convolutional layers, 3 full-link layers and 5 pooling layers as a backbone network, selecting an optimizer as SGD, selecting an Align method in an ROI pooling stage, training the above models by using joint training data D _ join, and finally obtaining the weight of the joint training model, namely a first weight model;

the third step: and D, fine-tuning the first weight model obtained in the second step by using fine-tuning training data to obtain a new weight model, namely a second weight model, wherein the method comprises the following steps:

training a first weight model by using fine tuning training data D _ ft, and setting an initial learning rate without setting an attenuation round; setting the training round as 5 rounds, selecting an optimizer as SGD, and selecting an Align method in the ROI pooling stage; freezing all feature layers of the VGG16, and only adjusting the classification layer; finally, the weight of the fine tuning training model is obtained and is called as a second weight model;

the fourth step: using the first weight model obtained in the second step as a student network, using the second weight model obtained in the third step as a teacher network, using the fine tuning training data again, performing knowledge distillation according to the test result of the teacher network on the fine tuning training data to realize fine tuning of the student network weight, obtaining a final third weight model, obtaining a detection result through the third model and outputting the detection result,

the fourth step comprises the following specific steps:

using the fine tuning training data D _ ft and the prior knowledge of the teacher network to perform knowledge distillation training on the student network, and setting an initial learning rate without setting an attenuation round; setting the training round as 8 rounds, selecting an optimizer as SGD, and selecting an Align method in the ROI pooling stage; freezing the first 10 feature layers of VGG16, and adjusting the next 3 feature layers and classifiers to improve the high-layer overfitting; and (3) calculating a total loss function when knowledge distillation is carried out on the student network according to the following formula by combining the prior knowledge of the teacher network and the real data label knowledge:

l _ cls is the classification difference between the teacher network and the student network; the L _ cls not only comprises the distribution difference with a teacher network, but also comprises the difference with a real label, and with the progress of distillation training, the distribution difference with the teacher network accounts for from high to low, and the distribution difference with the real label accounts for from low to high;

l _ reg is the difference of the prediction boxes between the teacher network and the student network, and the positioning capability of the teacher network and the student network is improved by regressing the distance between the prediction boxes and the true value of the student network;

l _ cpf is the inter-feature difference for the few sample classes; weighting the characteristic difference output by the teacher and student network models by generating an attention distribution map related to few sample classes so as to lead the student network to be more inclined to learn the characteristics of few sample data;

lambda and gamma are hyper-parameters for balancing different loss function terms;

and obtaining a final third weight model.

The invention is based on the idea of self-knowledge distillation and designs a training method suitable for a few-sample target detection framework. The idea is that the neural network model is retrained through knowledge distillation, the obtained model is more generalized than before, learning of few sample class data is highlighted, and the detection capability of the model on the few sample class data is comprehensively improved. Compared with other methods, the method has universality, can be expanded to different target detection frameworks and different data sets, and is simple and efficient.

Drawings

FIG. 1 is a schematic flow chart of a method for detecting a small-sample target based on self-knowledge distillation according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of data enhancement provided by an embodiment of the present invention;

FIG. 3 is a frame diagram of a knowledge distillation training process based on fast-RCNN according to an embodiment of the present invention;

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular structures, parameters, etc. in order to provide a more thorough understanding of the embodiments of the invention. However, it will be apparent to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In some general cases, detailed descriptions of well-known structures are omitted so as not to unnecessarily obscure the present invention.

In order to make the technical scheme of the invention clearer, the invention is described below with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a schematic flow chart of an implementation of a method for detecting a small sample target based on self-knowledge distillation according to an embodiment of the present invention, which is detailed as follows:

the first step of the embodiment is to prepare a data set. The invention selects and uses the PASCAL VOC2007 data set to produce a plurality of less-sample tasks, and the less-sample target detection task generally takes any 5 types of data in the VOC2007 data set 20 types as the less-sample data set and refers to the less-sample data set as C_novelIn this embodiment, C is selected_novel＝[bird,bus,cow,motorbike,sofa]K pictures are taken for each category (K is usually equal to 1, 2, 3, 5, 10). The other 15 classes are called base class data, denoted C_baseUsually training with 5 classes of few sample data, where the training data is D_joint. Then, in the fine tuning and distilling stage, the same number of 15 types of base class data are selected and the model is fine tuned and distilled together with less sample data, and the training data is D_ft. To increase sample diversity at the data level, D_jointAnd D_ftEach picture in (a) is flipped to achieve data enhancement, see fig. 2.

Example the second step is joint training. This example is based on the fast-RCNN detection framework, and selects VGG16 having 13 Convolutional layers (volumetric Layer), 3 Fully connected layers (full connected Layer), and 5 pooling layers (Pool Layer) as a backbone network, and implements code using a deep learning framework pytorch. Joint training is training with 15 classes of base class data along with 5 other classes of few sample data, i.e., D_joint. The reason for performing such joint training is that the 15-class base class data can enable the model to better learn the shallow universal feature expression capability, which is beneficial to improving the detection performance of few samples. Specifically, the initial learning rate is set to be 0.001, and the attenuation of each 4 rounds is set to be 0.1; setting the training round as 10 rounds, selecting an optimizer as SGD, and selecting an Align method at the ROI pooling stage; setting sizes of 9 prior anchor point frames; at the same time training is started based on the pre-training weights vgg16_ noise. The training loss function at this time is the same as the original fast-RCNN. The model obtained by training at this time is referred to as model 1.

Example the third step is fine tuning training. Since the data sets in the joint training phase are severely class unbalanced, the model will produce a severe overfitting for those few sample data. Trimming is a common method of ameliorating this problem. Specifically, the invention fine-tunes the model 1 through a small data set, and selects 15 types of base class data with the same quantity and few sample data to fine-tune the model so as not to reinitialize the classification layer of the model, namely D_ft. During fine adjustment, a smaller learning rate is often selected, the initial learning rate is set to be 0.0001, and attenuation turns are not set; setting the training round as 5 rounds, selecting an optimizer as SGD, and selecting an Align method in the ROI pooling stage; freezing the characteristic layer of VGG16, and only adjusting the classification layer, which is beneficial for the classifier of the model to have better characteristic expression capability to less sample data; the fine tuning training model weights are obtained and are called as model 2.

Example the fourth step is knowledge distillation training. The invention uses fine-tuning training data D_ftAnd prior knowledge of the model 2 (teacher network), and knowledge distillation training is carried out on the model 1 (student network). This step is analyzed with reference to fig. 3. FIG. 3 shows the general detection process of fast-RCNN, which uses a branch to generate some propusals that may contain samples, and then performs classification and regression to improve detection performance. The main components of the device comprise: 1) a backbone for generating a base feature map, with the output f_base(x | W), where W represents a network parameter; 2) region for generating proposalsProposal Networks (RPN); 3) the network head (RCN) for classifying and regressing the region of interest (ROI) is set as p and R respectively, wherein p is the predicted logits of each class by the network. The invention sets the classification and regression result of teacher network output as p_tAnd R_tThe output of the student network is p_s、R_s. Since it is essentially the same model, it is called self-distillation. In the knowledge distillation training stage, the student network performs knowledge distillation on the teacher network through loss functions of three aspects, which is expressed by the invention as follows:

where λ and γ are hyper-parameters for balancing different loss function terms, λ is 1 and γ is 0.01 in this embodiment.

In the knowledge distillation, the invention uses a broader softmax function, as shown in equation (2).

Where p is the logits value output by the classifier. The general softmax obtains a distribution which is a vector close to the argmax output one-hot, and the output is too hard, so that the probability that too much attention is paid to the network learning is high, and the probability that the value is low is ignored. By the aid of the generalized softmax function, probability distribution of output classification results is softer, students learn similar mapping by approximating probability distribution of networks of the students and teachers, and network fitting degree is reduced. T is called a temperature coefficient, and the larger the value of T, the higher the output softening degree. When the knowledge transfer between models is carried out, only the T value needs to be properly increased, and the Cross entropy (Cross-entropy) of the distribution between networks is minimized, which is expressed as formula (3):

is the classification difference between models. The invention properly increases the value of the temperature T to soften the output in the training stage, so that the student network learns more knowledge and improves the network generalization capability. And then carrying out inter-model knowledge transfer by reducing output distribution difference among networks. Through the formula (3), the student model can well learn the classification knowledge of the teacher model. But the teacher's knowledge is not necessarily correct, and in order to ensure that the students learned correct and generalized knowledge, they need to incorporate the correct label provided by the group truth.

The formula (4):

mu is a hyperparameter representing the ratio of knowledge sources, CE represents the cross entropy function, y_clsRepresenting the true label value of the classification. At the start of the distillation training, the present experiment set μ to 0.5, and as the distillation training proceeded, the value of μ increased by 0.1 every two rounds, so that the distribution difference from the teacher network went from high to low. The trained model not only has prior knowledge obtained by distillation, reduces overfitting, but also approaches to real data, and the model effect is better

Is the difference in the prediction box between the models. The specificity of the object detection task is that besides classification, bounding boxes are used to mark out objects in the picture and adjust the position and size of the frame. Therefore, knowledge transfer of the capabilities of the prediction box is also required. However, the prediction of the bounding box is regressed according to the value of the exact position, and a soft label like softmax cannot be used. Since the teacher's prediction is difficult to match the ground routeAccurate, the results of using it to regress student networks can result in a poor effect. The invention provides that when the prediction effect of the student network is worse than that of the teacher network, an extra gradient is generated, so that the student network approaches to a ground route more quickly, and otherwise, no extra punishment exists. Expressed as formulas (5) and (6):

sL1 represents the Smooth L1 function, y_bbThe label truth value of the prediction box, sigma, is the weight value of the penalty item, is set to 0.5 in the invention, and m represents the edge over parameter, and is set to 0 in the invention.

Is for inter-feature differences in few sample classes. The first two losses do not perform special processing on less sample data, only the fitting degree of the network is improved, and the difference between the characteristics of the less sample class is added for better detecting the less sample target. The method is equivalent to an attention mechanism, and enables a student network to learn the characteristics of less sample data more preferentially. The invention calculates the IOU of boundary box truth value of anchorages and few samples generated by sliding on the characteristic diagram, selects the anchorages surrounding the few sample targets which meet the condition by setting a threshold value proportional to the maximum IOU, and then adds the anchorages to obtain an attention distribution diagram which is set as theta_ij. Finally, the feature differences between the models are weighted using this profile. Expressed as formula (7):

wherein f is_baseRepresenting backbone network trafficThe characteristic diagram is shown, i, j represents the width and height of the characteristic diagram, and c represents the number of channels.

Is a decision function defined to be valid only if the input class belongs to a few sample classes. N is a normalization parameter, the value of which can be represented by equation (8):

N＝∑_i∑_jΘ_ij (8)

has the advantage that it encourages student networks to pay more attention to knowledge of a few sample classes during distillation training. While the background can be suppressed to some extent. After knowledge distillation training is completed, the obtained model 3 has stronger Lupont performance, and the process of fitting less sample data is reduced. The model 3 is used for detecting less sample data to be detected, and the performance is superior to that of the model 2 and the model 1.

The above examples are merely illustrative of the technical solutions and limitations of the present invention, and the application of the present invention is not limited to the above examples, with many similar variations. Modifications and equivalents of the embodiments of the invention described herein will occur to those skilled in the art and are intended to be included within the scope of the claims appended hereto.

Claims

1. A few-sample target detection method based on knowledge distillation is characterized by comprising the following specific operation steps:

the first step is as follows: constructing a picture database with labels and tags to meet the condition of target detection of few samples, enhancing the diversity of the extended samples by certain data, and dividing the extended samples into combined training data D_jointAnd fine tuning training data D_ft。

selecting fast-RCNN as a target detection frameThe frame adopts VGG16 with 13 convolutional layers, 3 full-link layers and 5 pooling layers as a backhaul network, selects an optimizer as SGD, selects an Align method in the ROI pooling stage, and uses joint training data D_jointTraining the models to finally obtain the weight of the combined training model, which is called as a first weight model;

using Fine-tuning training data D_ftTraining a first weight model, setting an initial learning rate and not setting an attenuation round; setting the training round as 5 rounds, selecting an optimizer as SGD, and selecting an Align method in the ROI pooling stage; freezing all feature layers of the VGG16, and only adjusting the classification layer; finally, the weight of the fine tuning training model is obtained and is called as a second weight model;

the fourth step: and taking the first weight model obtained in the second step as a student network, taking the second weight model obtained in the third step as a teacher network, using the fine tuning training data again, carrying out knowledge distillation according to a test result of the teacher network on the fine tuning training data, realizing fine tuning of the student network weight, obtaining a final third weight model, and obtaining and outputting a detection result through the third model.

2. The method according to claim 1, wherein the fourth step is as follows:

using Fine-tuning training data D_ftAnd prior knowledge of a teacher network, performing knowledge distillation training on the student network, and setting an initial learning rate without setting an attenuation round; setting the training round as 8 rounds, selecting an optimizer as SGD, and selecting an Align method in the ROI pooling stage; freezing the first 10 feature layers of VGG16, and adjusting the next 3 feature layers and classifiers to improve the high-layer overfitting; and (3) calculating a total loss function when knowledge distillation is carried out on the student network according to the following formula by combining the prior knowledge of the teacher network and the real data label knowledge:

is the classification difference between the teacher network and the student network;

the distribution difference with the teacher network and the distribution difference with the real label are included, and along with the distillation training, the distribution difference with the teacher network is from high to low, and the distribution difference with the real label is from low to high;

the difference of the prediction boxes between the teacher network and the student network improves the positioning capability of the teacher network and the student network by regressing the distance between the student network prediction box and the true value;

is directed to inter-feature differences in few sample classes; weighting the characteristic difference output by the teacher and student network models by generating an attention distribution map related to few sample classes so as to lead the student network to be more inclined to learn the characteristics of few sample data;

and obtaining a final third weight model.