CN117351279A

CN117351279A - Self-distillation realization method for space-time distillation fusion

Info

Publication number: CN117351279A
Application number: CN202311305326.0A
Authority: CN
Inventors: 朱隆熙; 徐锋; 刘宁钟; 汪俊杰; 谭健; 王淑君
Original assignee: Jiangsu Lemote Information Technology Co ltd; Nanjing University of Aeronautics and Astronautics
Current assignee: Jiangsu Lemote Information Technology Co ltd; Nanjing University of Aeronautics and Astronautics
Priority date: 2023-10-10
Filing date: 2023-10-10
Publication date: 2024-01-05

Abstract

The invention relates to the technical field of computer vision and provides a self-distillation realization method for space-time distillation fusion, comprising the following steps: S1, acquiring a CIFAR data set, dividing it into a training set and a test set, and performing data augmentation; S2, constructing a distillation framework neural network with a residual network as the backbone network, wherein the features of four stages are taken as branches, a bottleneck layer and an FC layer are added to each branch to produce the predictions of the student networks, and the last layer is used as the teacher network for distillation; meanwhile, in the time dimension, the prediction result of the model from the previous round is used as guidance, so that the model is trained under the guidance of softened hard labels; S3, feeding the augmented and divided CIFAR data set into the distillation framework neural network for training until the network converges, and obtaining a weight file; and S4, measuring the classification accuracy on test images using the trained distillation framework neural network and the weight file. The invention improves the accuracy of the model under distillation.

Description

Self-distillation realization method for space-time distillation fusion

Technical Field

The invention relates to the technical field of computer vision, in particular to a self-distillation realization method for space-time distillation fusion.

Background

Deep learning has made great progress in recent years, but its huge computation and parameter counts make it difficult to deploy on resource-constrained devices. To make deep models more efficient, researchers have explored the field of knowledge distillation. In 2006, Buciluă et al. first proposed the idea of transferring the knowledge of a large model to a small model. In 2015, Hinton formally proposed the well-known concept of knowledge distillation. The main idea of knowledge distillation is that the student model reaches an accuracy comparable to that of the teacher model by imitating it; a key problem is how to transfer the knowledge of the teacher model to the student model.

Traditional knowledge distillation can be categorized into response-based knowledge distillation and feature-based knowledge distillation. Response-based knowledge generally refers to the neural response of the last output layer of the teacher model; the main idea is to imitate the final prediction of the teacher model directly. Response-based knowledge distillation is a simple and effective model compression method and is widely applied to different tasks and applications.

Feature-based knowledge distillation draws on the intermediate layers and is a natural extension of response-based knowledge: the feature maps of the intermediate layers are used as knowledge to supervise the training of the student model. The most direct idea is to match the activation values of intermediate features. In particular, Zagoruyko and Komodakis (2017) propose representing knowledge with attention maps; to match semantic information between teacher and student, Chen et al. (2021) propose cross-layer KD, which adaptively assigns layers of the teacher network to each layer of the student network through attention allocation. However, the two classical approaches described above have two disadvantages. The first is that knowledge transfer is inefficient: the student model uses little or none of the knowledge in the teacher model, and it is still rare for a student model to outperform its teacher model. The second is the difficulty of designing and training a suitable teacher model. Existing distillation frameworks require considerable effort and experimentation to find the best teacher architecture, which takes a relatively long time; for example, the conventional distillation method takes 14.67 hours to train the teacher network ResNet152 on CIFAR100 and a further 12.31 hours to train the student network ResNet50 in the second step.

Disclosure of Invention

The invention aims to provide a self-distillation realization method for space-time distillation fusion.

The invention aims to solve two problems of the prior art: pre-training the teacher network in a distillation framework neural network takes a long time, and the large difference in scale between the teacher network and the student network leads to poor student accuracy.

Compared with the prior art, the technical scheme of the invention has the following beneficial effects:

A self-distillation realization method for space-time distillation fusion, comprising: S1, acquiring a CIFAR data set, dividing it into a training set and a test set, and performing data augmentation; S2, constructing a distillation framework neural network with a residual network as the backbone network, wherein the features of four stages are taken as branches, a bottleneck layer and an FC layer are added to each branch to produce the predictions of the student networks, and the last layer is used as the teacher network for distillation; meanwhile, in the time dimension, the prediction result of the model from the previous round is used as guidance, so that the model is trained under the guidance of softened hard labels; S3, feeding the augmented and divided CIFAR data set into the distillation framework neural network for training until the network converges, and obtaining a weight file; and S4, measuring the classification accuracy on test images using the trained distillation framework neural network and the weight file.

As a further improvement, in the step S1, the CIFAR10 and CIFAR100 data sets are used, and the training set and the test set are divided according to a ratio of five to one.

As a further improvement, in the step S2, a residual network is used as a backbone network, and the target convolutional neural network is first divided into several shallow segments according to the depth and the original structure thereof, the shallow layer network is regarded as a student model, and the deep layer network is regarded as a teacher model in concept.

As a further improvement, the step S2 includes: S21, regarding the prediction results of the different shallow networks in the residual network as student networks, and placing after each shallow block a bottleneck layer and a fully connected layer that are used only for training and can be removed at inference; S22, sorting the feature predictions of the last layer for each sample, treating the top-1 category at each position as one vote, aggregating the votes into a histogram, ranking the categories by how often they occur in the histogram, and using this frequency information to guide the learning of the shallow layers.
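The voting scheme of step S22 can be illustrated with a minimal sketch. It assumes the deep layer yields one list of class scores per spatial position; the function name `frequency_soft_label` and the choice to normalize the histogram into a probability distribution are illustrative assumptions, not details given in the patent.

```python
from collections import Counter


def frequency_soft_label(position_scores, num_classes):
    """Build a frequency-based soft label from per-position class scores.

    position_scores: list of lists, one list of class scores per spatial
    position of the deepest feature map (hypothetical input layout).
    """
    # Each position casts one vote for its top-1 category.
    votes = [
        max(range(num_classes), key=lambda c: scores[c])
        for scores in position_scores
    ]
    histogram = Counter(votes)
    total = len(votes)
    # Assumption: the histogram is normalized so the soft label sums to 1.
    soft_label = [histogram.get(c, 0) / total for c in range(num_classes)]
    # Rank categories by vote count (ties broken by class index).
    ranking = sorted(range(num_classes), key=lambda c: (-histogram.get(c, 0), c))
    return soft_label, ranking


scores = [
    [0.1, 0.7, 0.2],
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
]
soft_label, ranking = frequency_soft_label(scores, num_classes=3)
```

In this toy input three of the four positions vote for class 1, so class 1 ranks first and receives the largest share of the soft label that would guide the shallow branches.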

As a further improvement, the step S3 includes: S31, in view of the size of the targets in the CIFAR data set, processing the data set with the data augmentation methods of random cropping and random horizontal flipping; S32, optimizing with stochastic gradient descent and decaying the learning rate twice from its initial value, so that the neural network achieves a better distillation result; S33, trying different training hyperparameters on the neural network, augmenting the input images and training, and stopping training when the loss function converges or the maximum number of iterations is reached, to obtain the self-distilled network file and weight file.

As a further improvement, the step S31 includes: before training the distillation framework neural network, assuming that the CIFAR100 data set contains n samples, denoted $x_1, x_2, \dots, x_n$, the mean of the data set is the sum of the values of all samples divided by the number of samples, i.e. $\text{mean} = (x_1 + x_2 + \dots + x_n)/n$; calculating the difference between each data point and the mean: $(x_1 - \text{mean}), (x_2 - \text{mean}), \dots, (x_n - \text{mean})$; calculating the square of each difference: $(x_1 - \text{mean})^2, (x_2 - \text{mean})^2, \dots, (x_n - \text{mean})^2$; calculating the mean of the squared differences: $[(x_1 - \text{mean})^2 + (x_2 - \text{mean})^2 + \dots + (x_n - \text{mean})^2]/n$; and normalizing the data into a distribution with a mean of 0 and a standard deviation of 1, that is, normalized value = (original value - mean) / standard deviation.

As a further improvement, the step S32 includes: using random weights as the initial weights, and setting the learning rate, the number of iterations, batch_size, and the like; at rounds 100 and 150 the learning rate is decayed from its initial value, so that the distillation framework neural network achieves a better detection result.

As a further improvement, the step S33 includes: augmenting the input images and training, and stopping training when the loss function converges or the maximum number of iterations is reached, to obtain the weight file after distillation.

As a further improvement, the step S4 includes: S41, feeding the test image into the improved residual network backbone and obtaining the convolutional features of the four stages; S42, performing weighted averaging and prediction on the convolutional features of each of the four stages; S43, obtaining the ensemble prediction of the four stages by a simple weighted average, comparing the results of the four stages with the results of the five stages, and selecting the result with the highest prediction accuracy as the final result.

The beneficial effects of the invention are as follows:

According to the invention, on the basis of the residual network backbone, the deep network is used as a teacher network to distill the shallow student networks, so that the shallow student networks learn deeper semantic information and the classification accuracy of the model is enhanced.

The distillation loss is improved by using decoupled knowledge distillation, so that the dark knowledge contained in the non-target categories is exploited more effectively and the accuracy of target image classification is improved.

The method addresses the problems that pre-training the teacher network in existing distillation frameworks is time-consuming and that the accuracy of small models falls short, and it improves the accuracy of the model under distillation.

Drawings

FIG. 1 is a schematic diagram of a method for realizing self-distillation of space-time distillation fusion according to an embodiment of the invention.

Fig. 2 is a schematic diagram of test results of a self-distillation implementation method of space-time distillation fusion according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, are intended to fall within the scope of the present invention.

In the description of the present invention, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.

Referring to Fig. 1, a self-distillation realization method for space-time distillation fusion comprises the following steps: S1, acquiring a CIFAR data set, dividing it into a training set and a test set, and performing data augmentation; S2, constructing a distillation framework neural network with a residual network as the backbone network, wherein the features of four stages are taken as branches, a bottleneck layer and an FC layer are added to each branch to produce the predictions of the student networks, and the last layer is used as the teacher network for distillation; meanwhile, in the time dimension, the prediction result of the model from the previous round is used as guidance, so that the model is trained under the guidance of softened hard labels; S3, feeding the augmented and divided CIFAR data set into the distillation framework neural network for training until the network converges, and obtaining a weight file; and S4, measuring the classification accuracy on test images using the trained distillation framework neural network and the weight file.

In the step S1, the data sets of CIFAR10 and CIFAR100 are used, and the training set and the test set are divided according to a ratio of five to one.

In the step S2, a residual network is used as the backbone network. The target convolutional neural network is first divided into several shallow segments according to its depth and original structure; conceptually, the shallow networks can be regarded as student models and the deep network as the teacher model. Meanwhile, the prediction result of the model from the previous round serves as a teacher model in the time dimension, and the valuable information it contains is used to soften the hard labels. On the basis of the residual network structure, the features of the four stages are used to construct the spatial-dimension teacher, and the prediction result of the previous round's model is used as the time-dimension teacher, to construct a new distillation framework neural network.

The step S2 includes: S21, regarding the prediction results of the different shallow networks in the residual network as student networks, and placing after each shallow block a bottleneck layer and a fully connected layer that are used only for training and can be removed at inference; S22, using the prediction result of the previous round's model as guidance in the time dimension to capture the useful information it contains, and adjusting the learning targets and labels of the model through soft guidance during training, thereby optimizing the performance and generalization ability of the model.

In the step S2, four branches are added to extract features, and then the bottleneck layer is utilized to extract features more effectively, and finally, prediction is performed through the FC layer.

The step S2 includes: extracting the features from the first to the third layer of the residual network, and adding attention so that the network learns the important features; extracting features with a bottleneck layer; predicting from the extracted features with an FC layer; and finally mixing the sample prediction of the previous round with the current hard label to guide the learning of richer information.
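The mixing of the previous round's prediction with the current hard label can be sketched as a convex combination. The patent does not state the mixing rule or its weight, so the linear blend and the parameter `alpha` below are assumptions made only for illustration.

```python
def soften_label(hard_label, prev_probs, alpha=0.3):
    """Blend a one-hot hard label with last round's predicted distribution.

    hard_label: integer class index of the ground truth.
    prev_probs: class probabilities predicted for this sample in the
                previous round (the time-dimension teacher).
    alpha:      hypothetical mixing weight given to the previous prediction.
    """
    num_classes = len(prev_probs)
    one_hot = [1.0 if c == hard_label else 0.0 for c in range(num_classes)]
    return [
        (1.0 - alpha) * h + alpha * p
        for h, p in zip(one_hot, prev_probs)
    ]


prev_epoch_probs = [0.1, 0.6, 0.3]
target = soften_label(hard_label=1, prev_probs=prev_epoch_probs, alpha=0.5)
```

With `alpha=0.5` the target becomes roughly `[0.05, 0.8, 0.15]`: the true class still dominates, while the other classes keep some of the probability mass the previous round assigned to them.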

In the step S2, a residual network is constructed, which is composed of the following parts: 1. convolution layers: the residual network starts with three convolution layers, which are responsible for extracting features from the input image; 2. residual blocks: each residual block consists of three convolution layers and realizes the residual learning of the network through skip connections; 3. pooling layer: the pooling layer in the residual network reduces the size of the feature maps and the number of parameters, thereby improving computational efficiency; 4. fully connected layer: the output of the average pooling layer is connected to a predefined number of categories to perform the final classification task. The neural network is divided into features of four different depths according to the four residual blocks; the more residual blocks the features pass through, the deeper they are. The features of the four stages are taken as branches, a bottleneck layer and an FC layer are added to each branch to produce the predictions of the student networks, and the last layer is used as the teacher network for distillation; meanwhile, the deep features are sorted and averaged to generate frequency soft labels that guide the learning of the shallow layers.

The step S3 includes: S31, in view of the size of the targets in the CIFAR data set, processing the data set with the data augmentation methods of random cropping and random horizontal flipping; S32, optimizing with stochastic gradient descent and decaying the learning rate twice from its initial value, so that the neural network achieves a better distillation result; S33, trying different training hyperparameters on the neural network, augmenting the input images and training, and stopping training when the loss function converges or the maximum number of iterations is reached, to obtain the self-distilled network file and weight file.

In the step S31, the original image is randomly cropped with a padding size of 4.
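A minimal sketch of the augmentation in step S31, written for a single-channel image stored as a list of rows. The zero fill for the padding, the flip probability of 0.5, and the function name `pad_crop_flip` are assumptions; the patent only specifies random cropping with a padding size of 4 and random horizontal flipping.

```python
import random


def pad_crop_flip(image, padding=4, flip_prob=0.5, rng=random):
    """Pad an image, crop it back to its original size at a random offset,
    then flip it horizontally with probability flip_prob."""
    height, width = len(image), len(image[0])
    blank_row = [0] * (width + 2 * padding)
    # Assumption: padding pixels are filled with zeros.
    padded = (
        [blank_row[:] for _ in range(padding)]
        + [[0] * padding + list(row) + [0] * padding for row in image]
        + [blank_row[:] for _ in range(padding)]
    )
    # Random top-left corner of the crop window inside the padded image.
    top = rng.randint(0, 2 * padding)
    left = rng.randint(0, 2 * padding)
    crop = [row[left:left + width] for row in padded[top:top + height]]
    if rng.random() < flip_prob:
        crop = [row[::-1] for row in crop]
    return crop


# A 32x32 dummy image, the spatial size of CIFAR samples.
image = [[r * 32 + c for c in range(32)] for r in range(32)]
augmented = pad_crop_flip(image, padding=4, rng=random.Random(0))
```

The output keeps the 32x32 shape of the input, so the augmented samples can be fed to the network without any change to its input layer.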

The step S31 includes:

before training the distillation framework neural network, assuming that the CIFAR100 data set contains n samples, denoted $x_1, x_2, \dots, x_n$, the mean of the data set is the sum of the values of all samples divided by the number of samples,

$\text{mean} = (x_1 + x_2 + \dots + x_n)/n$;

calculate the difference between each data point and the mean: $(x_1 - \text{mean}), (x_2 - \text{mean}), \dots, (x_n - \text{mean})$;

calculate the square of each difference: $(x_1 - \text{mean})^2, (x_2 - \text{mean})^2, \dots, (x_n - \text{mean})^2$;

calculate the mean of the squared differences: $[(x_1 - \text{mean})^2 + (x_2 - \text{mean})^2 + \dots + (x_n - \text{mean})^2]/n$;

the data is normalized by converting it into a distribution with a mean of 0 and a standard deviation of 1, that is, normalized value = (original value - mean) / standard deviation.
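The normalization steps above can be traced in a short sketch. It treats the data as a flat list of values and takes the standard deviation as the square root of the mean squared difference; the function name `standardize` and the sample values are illustrative only.

```python
import math


def standardize(values):
    """Normalize values to zero mean and unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    # Mean of the squared differences from the mean (population variance).
    variance = sum((x - mean) ** 2 for x in values) / n
    std = math.sqrt(variance)
    normalized = [(x - mean) / std for x in values]
    return normalized, mean, std


pixels = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized, mean, std = standardize(pixels)
```

For these eight values the mean is 5 and the standard deviation is 2, so the value 2.0 maps to -1.5 and the value 9.0 maps to 2.0.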

The step S32 includes: using random weights as the initial weights, and setting the learning rate, the number of iterations, batch_size, and the like; at rounds 100 and 150 the learning rate is decayed from its initial value, so that the distillation framework neural network achieves a better detection result.
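The two-step decay can be sketched as a piecewise-constant schedule. The patent gives only the decay rounds (100 and 150); the initial learning rate of 0.1 and the decay factor of 0.1 below are assumed values chosen for illustration.

```python
def learning_rate(epoch, initial_lr=0.1, milestones=(100, 150), gamma=0.1):
    """Return the learning rate for a given round under step decay.

    initial_lr and gamma are hypothetical values; only the milestones
    (rounds 100 and 150) come from the description.
    """
    lr = initial_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= gamma  # decay once for every milestone already reached
    return lr


schedule = [learning_rate(e) for e in (0, 99, 100, 149, 150, 199)]
```

Under these assumptions the rate stays at 0.1 until round 99, drops to about 0.01 at round 100, and to about 0.001 at round 150.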

The step S33 includes: augmenting the input images and training, and stopping training when the loss function converges or the maximum number of iterations is reached, to obtain the weight file after distillation.

The step S4 includes: S41, feeding the test image into the improved residual network backbone and obtaining the convolutional features of the four stages; S42, performing weighted averaging and prediction on the convolutional features of each of the four stages; S43, obtaining the ensemble prediction of the four stages by a simple weighted average, comparing the results of the four stages with the results of the five stages, and selecting the result with the highest prediction accuracy as the final result.
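The ensemble step of S43 can be sketched as a weighted average of the class probabilities produced by the four branches. The equal default weights and the helper names `weighted_average` and `argmax` are assumptions; the patent does not specify the weights.

```python
def weighted_average(branch_probs, weights=None):
    """Average per-branch class probabilities into one ensemble prediction."""
    num_branches = len(branch_probs)
    if weights is None:
        # Assumption: every branch contributes equally.
        weights = [1.0 / num_branches] * num_branches
    num_classes = len(branch_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, branch_probs))
        for c in range(num_classes)
    ]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical outputs of the four stage branches for one test image.
branches = [
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.7, 0.2],
]
ensemble = weighted_average(branches)
prediction = argmax(ensemble)
```

Here the shallowest branch alone would pick class 0, while the ensemble follows the deeper branches and picks class 1; comparing such per-branch and ensemble accuracies on the test set is what allows the better result to be selected.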

Fig. 2 shows the test results of the method of the invention. Training and testing are performed on a TITAN Xp graphics card. During distillation the distillation temperature is set to 4.0, and the weight decay in the stochastic gradient descent algorithm is set to 0.0001. The value of the loss function is printed to the terminal in each round of training so that the overall convergence is easy to observe, and the test set is used for validation at the end of each round. The prediction result of each branch is also output during training: for example, Acc1 to Acc4 denote the prediction results of the respective branches of the four layers, and Ensemble denotes the result of averaging after weighting the different branches. When validating accuracy, the classification result of the fourth layer of the residual network is compared; if the current validation accuracy is greater than the best accuracy so far, the weights are updated. Validation shows that a classification accuracy of 78.94% can be achieved on CIFAR100.
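The role of the distillation temperature can be illustrated with a temperature-scaled softmax and a KL-divergence loss between teacher and student. This is a generic sketch of temperature-based distillation, not the patent's exact loss: the multiplication by the squared temperature follows the common convention and is an assumption here, as are the function names and the example logits.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        t * math.log(t / s)
        for t, s in zip(p_teacher, p_student)
        if t > 0
    )
    # Assumption: scale by T^2 as is commonly done in distillation losses.
    return temperature ** 2 * kl


teacher = [6.0, 2.0, 1.0]
student = [3.0, 2.5, 1.0]
loss = kd_loss(student, teacher, temperature=4.0)
soft_teacher = softmax(teacher, temperature=4.0)
sharp_teacher = softmax(teacher, temperature=1.0)
```

A temperature of 4.0 flattens the teacher distribution compared with a temperature of 1.0, which gives the non-target classes more weight in the loss the student minimizes.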

The above examples are only for illustrating the technical scheme of the present invention and are not limiting. It will be understood by those skilled in the art that any modifications and equivalents that do not depart from the spirit and scope of the invention are intended to be within the scope of the appended claims.

Claims

1. A method for implementing self-distillation of temporal-spatial distillation fusion, comprising:

S1, acquiring a CIFAR data set, dividing it into a training set and a test set, and performing data augmentation;

S2, constructing a distillation framework neural network with a residual network as the backbone network, wherein the features of four stages are taken as branches, a bottleneck layer and an FC layer are added to each branch to produce the predictions of the student networks, and the last layer is used as the teacher network for distillation; meanwhile, in the time dimension, the prediction result of the model from the previous round is used as guidance, so that the model is trained under the guidance of softened hard labels;

S3, feeding the augmented and divided CIFAR data set into the distillation framework neural network for training until the network converges, and obtaining a weight file;

and S4, measuring the classification accuracy on test images using the trained distillation framework neural network and the weight file.

2. A method for realizing self-distillation of space-time distillation fusion according to claim 1, wherein in step S1,

the CIFAR10 and CIFAR100 data sets are used and the training and testing sets are partitioned according to a five to one ratio.

3. A method for realizing self-distillation of space-time distillation fusion according to claim 1, wherein in step S2,

the residual network is used as a backbone network, the target convolutional neural network is firstly divided into a plurality of shallow sections according to the depth and the original structure of the target convolutional neural network, the shallow layer network is regarded as a student model, and the deep layer network is regarded as a teacher model in concept.

4. The method for realizing self-distillation of space-time distillation fusion according to claim 1, wherein said step S2 comprises:

S21, regarding the prediction results of the different shallow networks in the residual network as student networks, and placing after each shallow block a bottleneck layer and a fully connected layer that are used only for training and can be removed at inference;

S22, sorting the feature predictions of the last layer for each sample, treating the top-1 category at each position as one vote, aggregating the votes into a histogram, ranking the categories by how often they occur in the histogram, and using this frequency information to guide the learning of the shallow layers.

5. The method for realizing self-distillation of space-time distillation fusion according to claim 1, wherein said step S3 comprises:

S31, in view of the size of the targets in the CIFAR data set, processing the data set with the data augmentation methods of random cropping and random horizontal flipping;

S32, optimizing with stochastic gradient descent and decaying the learning rate twice from its initial value, so that the neural network achieves a better distillation result;

S33, trying different training hyperparameters on the neural network, augmenting the input images and training, and stopping training when the loss function converges or the maximum number of iterations is reached, to obtain the self-distilled network file and weight file.

6. The method for realizing self-distillation of temporal-spatial distillation fusion according to claim 5, wherein said step S31 comprises:

$\text{mean} = (x_1 + x_2 + \dots + x_n)/n$;

7. The method for realizing self-distillation of temporal-spatial distillation fusion according to claim 5, wherein said step S32 comprises:

using random weights as the initial weights, and setting the learning rate, the number of iterations, batch_size, and the like; at rounds 100 and 150 the learning rate is decayed from its initial value, so that the distillation framework neural network achieves a better detection result.

8. The method for realizing self-distillation of temporal-spatial distillation fusion according to claim 5, wherein said step S33 comprises:

augmenting the input images and training, and stopping training when the loss function converges or the maximum number of iterations is reached, to obtain the weight file after distillation.

9. The method for realizing self-distillation of space-time distillation fusion according to claim 1, wherein said step S4 comprises:

S41, feeding the test image into the improved residual network backbone and obtaining the convolutional features of the four stages;

S42, performing weighted averaging and prediction on the convolutional features of each of the four stages;

S43, obtaining the ensemble prediction of the four stages by a simple weighted average, comparing the results of the four stages with the results of the five stages, and selecting the result with the highest prediction accuracy as the final result.