CN113822339A - Natural image classification method combining self-knowledge distillation and unsupervised method - Google Patents

Natural image classification method combining self-knowledge distillation and unsupervised method

Info

Publication number
CN113822339A
Authority
CN
China
Prior art keywords
model
unsupervised
self
loss
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110992616.1A
Other languages
Chinese (zh)
Inventor
杨新武
刘伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202110992616.1A priority Critical patent/CN113822339A/en
Publication of CN113822339A publication Critical patent/CN113822339A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Abstract

The invention discloses a natural image classification method that combines self-knowledge distillation with an unsupervised method. Unsupervised learning aims to discover the intrinsic characteristics of the data, so that similar samples obtain similar representations after feature extraction. Introducing this unsupervised component into an existing self-knowledge distillation method strengthens the feature extraction capacity of each branch and thereby improves the classification accuracy of the model. When designing the branch structure, grouped convolution is adopted to further reduce the number of parameters and to speed up model inference.

Description

Natural image classification method combining self-knowledge distillation and unsupervised method
Technical Field
The invention relates to the fields of neural network model compression, unsupervised learning, and image classification, and more particularly to a natural image classification method that combines an unsupervised autoencoder with self-knowledge distillation.
Background
In deep neural networks with huge numbers of parameters, not all parameters contribute to the model: some have limited effect, are redundant, and may even degrade performance. The large parameter count also makes deployment costly. Model compression aims to obtain a small network that uses fewer parameters and fewer resources than a large network while retaining good accuracy.
The advent of convolutional neural networks has greatly improved performance on computer vision and natural language processing tasks such as image classification, object detection, and text classification. Deep networks usually outperform shallow ones and capture features well, but they also bring problems: the number of parameters grows, and a large amount of computation and memory is required. If a deep neural network with tens of millions of parameters is deployed unchanged on resource-limited devices such as mobile devices, the devices lack the resources needed to run the corresponding inference tasks normally.
Knowledge distillation is a common model compression method that transfers the knowledge of a complex model, or of several models, into a lightweight model, reducing model size while minimizing the loss in performance. Existing distillation methods can be divided into those based on the final output and those based on intermediate feature layers. The traditional idea in knowledge distillation is to transfer the teacher's knowledge to the student and thereby improve the student's ability.
Among existing model compression techniques, self-knowledge distillation is an improvement on conventional knowledge distillation. Conventional distillation requires introducing a large teacher network for supervision; the teacher network consumes a large amount of memory when loaded and needs a GPU for forward inference. Self-knowledge distillation requires no additional teacher structure: the model serves as its own teacher, with the deep layers training the shallow layers. The present method combines an unsupervised method with knowledge distillation. Unsupervised learning aims to discover the intrinsic characteristics of the data, so that similar samples obtain similar representations after feature extraction. Introducing this into an existing self-knowledge distillation method improves the similarity of features across branches and thereby improves the classification accuracy of the model. When designing the branch structure, grouped convolution is adopted to further reduce the number of parameters and to improve the model inference speed.
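As a concrete illustration of why grouped convolution reduces the parameter count, the following PyTorch sketch compares a standard 3x3 convolution with a grouped one; the channel sizes and the group count of 4 are hypothetical values chosen for the example, not taken from the patent.

```python
import torch
import torch.nn as nn

# Standard 3x3 convolution, 256 -> 256 channels: 256*256*3*3 = 589,824 weights.
standard = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)

# Grouped 3x3 convolution with 4 groups: each group maps 64 -> 64 channels,
# so the weight count drops to 4*(64*64*3*3) = 147,456, a 4x reduction.
grouped = nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=4, bias=False)

x = torch.randn(1, 256, 32, 32)
print(standard(x).shape, grouped(x).shape)  # identical output shapes
print(sum(p.numel() for p in standard.parameters()),
      sum(p.numel() for p in grouped.parameters()))
```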
Disclosure of Invention
Existing deep learning models have large numbers of parameters and are inflexible to deploy and use. To address this problem, the invention combines self-knowledge distillation with an unsupervised autoencoder. The method improves the accuracy of each branch by strengthening its feature extraction capability and by increasing the similarity of features within the same class. At deployment time, unnecessary parts can be pruned to reduce the parameter count, or, when the parameter count is not a concern, multiple branches can be combined to further improve the model accuracy.
A natural image classification method combining self-knowledge distillation and unsupervised methods mainly comprises the following steps:
S1 data processing procedure
S1.1 Preprocess the training data set using simple data augmentation.
S1.2 Randomly shuffle the data set, divide it into batches, and feed the batches into the designed network.
S2 training procedure
S2.1 Input the preprocessed data into the designed network model and obtain, for each branch structure, the feature values before the fully connected layer.
S2.2 Input the feature values into the designed decoder, which outputs a feature map of the same size as the input data; in essence the decoder reconstructs the original picture, and the MSE loss is computed. Each encoder-decoder structure is assigned its own MSE loss weight.
S2.3 Feed the pre-fully-connected features from the previous step into the fully connected layer and obtain predictions through softmax. Compute the cross-entropy loss between the predictions and the ground-truth labels.
S2.4 Back-propagate the loss and repeat until the model converges.
S3 prediction process
S3.1 Remove the decoder part, keeping only the trunk and branch structures.
S3.2 Input the pictures to be classified into the network for prediction.
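To make steps S2.1-S2.3 of the above procedure concrete, the following sketch shows a minimal multi-exit (multi-branch) classifier in PyTorch in which every branch returns both its pre-fully-connected features and its logits. The class name BranchedClassifier, the three-stage backbone, the channel sizes, and the use of groups=4 in the branches are illustrative assumptions rather than the patent's exact architecture.

```python
import torch
import torch.nn as nn

class BranchedClassifier(nn.Module):
    """Backbone split by depth, with a lightweight exit (branch) after each stage.

    Each exit returns (features, logits): the features feed a decoder for the
    unsupervised reconstruction loss, and the logits feed the cross-entropy and
    distillation losses against the deepest exit, which acts as the teacher.
    """
    def __init__(self, num_classes=100, channels=(32, 64, 128)):
        super().__init__()
        in_ch = 3
        self.stages, self.branches, self.heads = (
            nn.ModuleList(), nn.ModuleList(), nn.ModuleList())
        for out_ch in channels:
            # Backbone stage: plain conv that halves the spatial resolution.
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)))
            # Branch: grouped convolution keeps the exit's parameter count low.
            self.branches.append(nn.Sequential(
                nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=4),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)))
            self.heads.append(nn.Linear(out_ch, num_classes))
            in_ch = out_ch

    def forward(self, x):
        feats, logits = [], []
        for stage, branch, head in zip(self.stages, self.branches, self.heads):
            x = stage(x)                            # shared trunk
            f = branch(x)                           # features before the FC layer
            feats.append(f)                         # fed to the decoder (MSE loss)
            logits.append(head(f.mean(dim=(2, 3)))) # branch prediction (pre-softmax)
        return feats, logits                        # logits[-1] is the teacher exit
```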
For the training process of step S2, the total loss function is:

$$L = \sum_{i}\Big[\mathrm{CrossEntropy}(p_i, y) + \alpha \cdot \mathrm{KL}\big(p_i^{\tau}\,\|\,p_t^{\tau}\big) + \beta_i \cdot \mathrm{MSE}(m_i, x)\Big]$$

Here, α and β balance the respective losses; β stores the loss weight of the decoding result of each branch.
CrossEntropy(p_i, y) is the cross-entropy loss, where p_i is the network's final prediction for the sample and the index i ranges over the branches. KL() is the knowledge distillation loss, namely the Kullback-Leibler divergence between the two outputs after a temperature coefficient has softened their distributions; through this loss the teacher's information is passed to the small student networks. MSE is the loss between the decoded output m_i and the original picture x. β = [β_1, β_2, …, β_i] stores a different decoding loss weight for each branch.
During training, the learning rate is decayed at different epochs to accelerate convergence of the model, and L2 regularization is introduced.
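A minimal PyTorch sketch of this combined loss is given below. The function name combined_loss, the weights alpha and betas, and the temperature T are hypothetical names; the T*T scaling of the KL term is standard practice in distillation and is assumed here rather than stated in the patent.

```python
import torch
import torch.nn.functional as F

def combined_loss(branch_logits, teacher_logits, decoded, images, labels,
                  alpha=0.5, betas=None, T=3.0):
    """Cross-entropy + temperature-softened KL distillation + decoder MSE.

    branch_logits: list of logits, one per branch (the exits)
    teacher_logits: logits of the deepest branch (the teacher)
    decoded: list of decoder reconstructions, one per branch
    images: original input batch (reconstruction target)
    """
    betas = betas or [1.0] * len(branch_logits)
    soft_teacher = F.softmax(teacher_logits.detach() / T, dim=1)
    total = 0.0
    for logits, recon, beta in zip(branch_logits, decoded, betas):
        ce = F.cross_entropy(logits, labels)
        kl = F.kl_div(F.log_softmax(logits / T, dim=1),
                      soft_teacher, reduction="batchmean") * (T * T)
        mse = F.mse_loss(recon, images)
        total = total + ce + alpha * kl + beta * mse
    return total
```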
Drawings
Fig. 1 is a model structure diagram according to the present invention.
Fig. 2 is a block diagram of the unsupervised autoencoder according to the present invention.
Fig. 3 is a flow chart according to the present invention.
Fig. 4 is a graph of extracted feature similarity in accordance with the present invention.
Detailed Description
For the purpose of promoting a better understanding of the objects, features and advantages of the invention, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
S1 data part
This embodiment uses the Cifar100 dataset as the image classification training dataset. Cifar100 contains 60,000 pictures, of which 50,000 form the training set and 10,000 form the test set, covering 100 categories in total.
S1.1 Data augmentation is performed using simple random cropping and horizontal flipping.
S1.2 Normalize the data, shuffle it randomly, and divide it into batches.
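A minimal sketch of this preprocessing with torchvision is shown below; the normalization statistics and the batch size are assumed values commonly used for Cifar100 rather than values specified by the patent.

```python
import torchvision.transforms as T
from torchvision.datasets import CIFAR100
from torch.utils.data import DataLoader

# S1.1: random crop with padding + horizontal flip; S1.2: normalize.
train_tf = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.5071, 0.4865, 0.4409), (0.2673, 0.2564, 0.2762)),
])

train_set = CIFAR100(root="./data", train=True, download=True, transform=train_tf)
# shuffle=True randomly reorders the data each epoch before batching.
train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
```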
S2 training part
S2.1 The deepest branch of the model, which is divided by depth, serves as the teacher network, and its output supervises the other, shallower branches in a knowledge distillation manner.
S2.2 The depth-wise divided model and the designed decoder part are combined to construct the whole network.
S2.3 The preprocessed data are input into the designed network model to obtain, for each branch structure, the feature values before the fully connected layer.
S2.4 The feature values are input into the designed decoder, which outputs a feature map of the same size as the input data; in essence the decoder restores the original picture. Each encoder-decoder structure is assigned its own MSE loss weight, as sketched below.
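A minimal sketch of such a branch decoder is given below, assuming 32x32 input images: it upsamples a branch's feature map back to the input resolution so that the MSE reconstruction loss can be computed. The class name BranchDecoder, the hidden channel count, and the number of upsampling steps are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchDecoder(nn.Module):
    """Upsamples a branch's feature map back to an image of the input size."""
    def __init__(self, in_channels, num_upsamples, hidden=64):
        super().__init__()
        layers, ch = [], in_channels
        for _ in range(num_upsamples):             # each step doubles H and W
            layers += [nn.ConvTranspose2d(ch, hidden, 4, stride=2, padding=1),
                       nn.BatchNorm2d(hidden), nn.ReLU(inplace=True)]
            ch = hidden
        layers.append(nn.Conv2d(ch, 3, 3, padding=1))  # back to 3 image channels
        self.net = nn.Sequential(*layers)

    def forward(self, feat, target_size=(32, 32)):
        out = self.net(feat)
        # Guard against size mismatches: force the exact input resolution
        # so that the MSE reconstruction loss can be computed directly.
        return F.interpolate(out, size=target_size, mode="bilinear",
                             align_corners=False)

# Example (assumed sizes for 32x32 inputs and exits at 16/8/4 pixels):
#   decoders = [BranchDecoder(c, n) for c, n in zip((32, 64, 128), (1, 2, 3))]
#   recon_i = decoders[i](feats[i]); mse_i = F.mse_loss(recon_i, images)
```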
S2.5 The pre-fully-connected features are input into the fully connected layer, and predictions are obtained through softmax. The cross-entropy loss between the predictions and the ground-truth labels is computed.
S2.6 A temperature hyperparameter is set during training; the temperature softens the final predicted outputs of the teacher and student networks. The KL divergence between the softened distributions transfers the deep knowledge to the shallow branches.
S2.7 During training, the learning rate is decayed at different epochs to accelerate convergence of the model, and L2 regularization is introduced. The total loss is:
$$L = \sum_{i}\Big[\mathrm{CrossEntropy}(p_i, y) + \alpha \cdot \mathrm{KL}\big(p_i^{\tau}\,\|\,p_t^{\tau}\big) + \beta_i \cdot \mathrm{MSE}(m_i, x)\Big]$$

Here, α and β balance the respective losses; β stores a different weight for each branch.
CrossEntropy(p_i, y) is the cross-entropy loss, where p_i is the network's final prediction for the sample and the index i ranges over the branches. KL() is the knowledge distillation loss, namely the Kullback-Leibler divergence between the two softened outputs; through this loss the deep teacher information is passed to the shallow, small student networks. MSE is the loss between the decoded output and the original picture. β = [β_1, β_2, …, β_i] stores a different decoding loss weight for each branch; autoencoders at different depths use different values of β.
After forward propagation is completed, m_i and p_i are obtained, and the loss value of the batch is computed according to the loss function above. The entire network is trained with stochastic gradient descent. A round (epoch) is completed once every batch has been back-propagated, as sketched below.
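The paragraph above corresponds to a standard epoch loop. A minimal sketch is given below, reusing the hypothetical BranchedClassifier, BranchDecoder, combined_loss, and train_loader from the earlier sketches; the learning rate, milestones, weight decay, and epoch count are assumed hyperparameters, with weight_decay providing the L2 regularization and MultiStepLR providing the stepwise learning-rate decay.

```python
import torch

# Reusing the hypothetical components sketched earlier in this description.
model = BranchedClassifier(num_classes=100)
decoders = [BranchDecoder(c, n) for c, n in zip((32, 64, 128), (1, 2, 3))]

params = [p for m in [model, *decoders] for p in m.parameters()]
optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9,
                            weight_decay=5e-4)       # weight_decay acts as the L2 term
scheduler = torch.optim.lr_scheduler.MultiStepLR(    # decay the LR at set epochs
    optimizer, milestones=[80, 120, 160], gamma=0.1)

for epoch in range(200):
    for images, labels in train_loader:               # batches from step S1.2
        feats, logits = model(images)                  # forward pass, all branches
        recons = [dec(f) for dec, f in zip(decoders, feats)]
        # The deepest exit is the teacher; its own KL term is effectively zero.
        loss = combined_loss(logits, logits[-1], recons, images, labels)
        optimizer.zero_grad()
        loss.backward()                                # back-propagate the batch loss
        optimizer.step()                               # stochastic gradient descent step
    scheduler.step()                                   # one round (epoch) completed
```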
S2.8 Repeat the training according to the procedure described above until the model converges.
S3 test procedure
S3.1 The designed model combines the unsupervised autoencoder with the original structure. When model prediction is required, the autoencoder part is removed and only the trunk and branch structures are loaded.
S3.2 The pictures to be predicted are input into the model to obtain the classification results, as sketched below.
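A minimal sketch of the prediction stage under the same assumptions is shown below: the decoders are simply not constructed at test time, and only the trunk-and-branch classifier is used. Taking the deepest exit as the final predictor is one choice; a shallower exit or an ensemble of exits could be used instead, as the disclosure notes.

```python
import torch
import torchvision.transforms as T
from torchvision.datasets import CIFAR100
from torch.utils.data import DataLoader

# Test data: normalization only, no augmentation, no shuffling (assumed setup).
test_tf = T.Compose([T.ToTensor(),
                     T.Normalize((0.5071, 0.4865, 0.4409),
                                 (0.2673, 0.2564, 0.2762))])
test_loader = DataLoader(CIFAR100("./data", train=False, download=True,
                                  transform=test_tf), batch_size=256)

model.eval()                       # trunk and branches only; decoders are not loaded
with torch.no_grad():
    for images, _ in test_loader:
        _, logits = model(images)
        preds = torch.softmax(logits[-1], dim=1).argmax(dim=1)  # deepest exit
```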
In summary, the invention provides a natural image classification method that combines an unsupervised method with knowledge distillation. Unsupervised learning aims to discover the intrinsic characteristics of the data, so that similar samples obtain similar representations after feature extraction. Introducing this unsupervised component into an existing self-knowledge distillation method strengthens the feature extraction capacity of each branch and thereby improves the classification accuracy of the model. When designing the branch structure, grouped convolution is adopted to further reduce the number of parameters and to speed up model inference.
The above description is only one embodiment of the present invention, and the scope of the invention is not limited thereto. Modifications or substitutions that would readily occur to any person skilled in the art fall within the scope of the invention; therefore, the protection scope of the invention shall be determined by the claims.

Claims (4)

1. A natural image classification method combining self-knowledge distillation and an unsupervised method, characterized in that the unsupervised method is used to improve the extraction capacity of image features, the method comprising:
S1 data part
using the Cifar100 dataset as the image classification training dataset; the Cifar100 dataset comprises 60,000 pictures, of which 50,000 form the training set and 10,000 form the test set, covering 100 categories in total;
S1.1 data augmentation is performed using simple random cropping and horizontal flipping;
S1.2 the data are normalized, randomly shuffled, and divided into batches;
S2 training part
S2.1 the deepest branch of the depth-wise divided model is taken as the teacher network, and its output supervises the other, shallower branches in a knowledge distillation manner;
S2.2 the depth-wise divided model and the designed decoder part are combined to construct the whole network;
S2.3 the preprocessed data are input into the designed network model to obtain, for each branch structure, the feature values before the fully connected layer;
S2.4 the feature values are input into the designed decoder, which outputs a feature map of the same size as the input data, in essence restoring the original picture; each encoder-decoder structure is assigned its own MSE loss weight;
S2.5 the pre-fully-connected features are input into the fully connected layer, and predictions are obtained through softmax; the cross-entropy loss between the predictions and the ground-truth labels is computed;
S2.6 a temperature hyperparameter is set during training; the temperature softens the final predicted outputs of the teacher and student networks, and the KL divergence between the softened distributions transfers the deep knowledge to the shallow branches;
S2.7 during training, the learning rate is decayed at different epochs to accelerate convergence, and L2 regularization is introduced; the total loss is:
$$L = \sum_{i}\Big[\mathrm{CrossEntropy}(p_i, y) + \alpha \cdot \mathrm{KL}\big(p_i^{\tau}\,\|\,p_t^{\tau}\big) + \beta_i \cdot \mathrm{MSE}(m_i, x)\Big]$$

α and β balance the respective losses; β stores a different weight for each branch;
CrossEntropy(p_i, y) is the cross-entropy loss, where p_i is the network's final prediction for the sample and the index i ranges over the branches; KL() is the knowledge distillation loss, namely the Kullback-Leibler divergence between the two softened outputs, through which deep teacher information is passed to the shallow, small student networks; MSE is the loss between the decoded output and the original picture; β = [β_1, β_2, …, β_i] stores a different decoding loss weight for each branch; autoencoders at different depths use different values of β;
after forward propagation is completed, m_i and p_i are obtained, and the loss value of the batch is computed according to the loss function above; the entire network is trained with stochastic gradient descent; a round is completed once every batch has been back-propagated;
S2.8 the training is repeated according to the above procedure until the model finally converges;
S3 test procedure
S3.1 the designed model is formed by combining the unsupervised autoencoder with the original structure; when model prediction is needed, the autoencoder part is removed and only the trunk and branch structures are loaded;
S3.2 the pictures to be predicted are input into the model to obtain the classification results.
2. The natural image classification method combining self-knowledge distillation and an unsupervised method according to claim 1, characterized in that: the loss generated by the unsupervised autoencoder is added to the total loss.
3. The natural image classification method combining self-knowledge distillation and an unsupervised method according to claim 1, characterized in that: the decoder is designed to match the feature outputs of the branch structures so that they can be connected together.
4. The natural image classification method combining self-knowledge distillation and an unsupervised method according to claim 1, characterized in that: the multiple exits (branch structures) share a decoder, which increases the similarity between features.
CN202110992616.1A 2021-08-27 2021-08-27 Natural image classification method combining self-knowledge distillation and unsupervised method Pending CN113822339A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110992616.1A CN113822339A (en) 2021-08-27 2021-08-27 Natural image classification method combining self-knowledge distillation and unsupervised method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110992616.1A CN113822339A (en) 2021-08-27 2021-08-27 Natural image classification method combining self-knowledge distillation and unsupervised method

Publications (1)

Publication Number Publication Date
CN113822339A true CN113822339A (en) 2021-12-21

Family

ID=78913667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110992616.1A Pending CN113822339A (en) 2021-08-27 2021-08-27 Natural image classification method combining self-knowledge distillation and unsupervised method

Country Status (1)

Country Link
CN (1) CN113822339A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180053057A1 (en) * 2016-08-18 2018-02-22 Xerox Corporation System and method for video classification using a hybrid unsupervised and supervised multi-layer architecture
CN110414368A (en) * 2019-07-04 2019-11-05 华中科技大学 A kind of unsupervised pedestrian recognition methods again of knowledge based distillation
CN112116030A (en) * 2020-10-13 2020-12-22 浙江大学 Image classification method based on vector standardization and knowledge distillation
CN112465111A (en) * 2020-11-17 2021-03-09 大连理工大学 Three-dimensional voxel image segmentation method based on knowledge distillation and countertraining
CN112906747A (en) * 2021-01-25 2021-06-04 北京工业大学 Knowledge distillation-based image classification method
CN112949786A (en) * 2021-05-17 2021-06-11 腾讯科技(深圳)有限公司 Data classification identification method, device, equipment and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王金甲; 杨倩; 崔琳; 纪绍男: "Weakly labeled semi-supervised sound event detection based on a mean teacher model" (基于平均教师模型的弱标记半监督声音事件检测), Journal of Fudan University (Natural Science) (复旦学报(自然科学版)), no. 05, 15 October 2020 (2020-10-15) *
赵胜伟; 葛仕明; 叶奇挺; 罗朝; 李强: "Traffic sign classification based on enhanced supervision knowledge distillation" (基于增强监督知识蒸馏的交通标识分类), China Sciencepaper (中国科技论文), no. 20, 23 October 2017 (2017-10-23) *

Similar Documents

Publication Publication Date Title
CN108875807B (en) Image description method based on multiple attention and multiple scales
CN112101190B (en) Remote sensing image classification method, storage medium and computing device
CN111145116B (en) Sea surface rainy day image sample augmentation method based on generation of countermeasure network
CN113159051B (en) Remote sensing image lightweight semantic segmentation method based on edge decoupling
CN108648188B (en) No-reference image quality evaluation method based on generation countermeasure network
CN110751698B (en) Text-to-image generation method based on hybrid network model
CN107506823B (en) Construction method of hybrid neural network model for dialog generation
CN109859288B (en) Image coloring method and device based on generation countermeasure network
CN109857871B (en) User relationship discovery method based on social network mass contextual data
CN109242090B (en) Video description and description consistency judgment method based on GAN network
CN113705811B (en) Model training method, device, computer program product and equipment
WO2021042857A1 (en) Processing method and processing apparatus for image segmentation model
CN110009700B (en) Convolutional neural network visual depth estimation method based on RGB (red, green and blue) graph and gradient graph
CN112784929A (en) Small sample image classification method and device based on double-element group expansion
CN113807222A (en) Video question-answering method and system for end-to-end training based on sparse sampling
CN115170874A (en) Self-distillation implementation method based on decoupling distillation loss
CN114912419A (en) Unified machine reading understanding method based on reorganization confrontation
CN113743277A (en) Method, system, equipment and storage medium for short video frequency classification
CN115829029A (en) Channel attention-based self-distillation implementation method
CN113822339A (en) Natural image classification method combining self-knowledge distillation and unsupervised method
CN115471576A (en) Point cloud lossless compression method and device based on deep learning
CN113747480B (en) Processing method and device for 5G slice faults and computing equipment
CN115660882A (en) Method for predicting user-to-user relationship in social network and multi-head mixed aggregation graph convolutional network
CN114139674A (en) Behavior cloning method, electronic device, storage medium, and program product
CN113518229B (en) Method and device for training loop filter network, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination