CN112784964A - Image classification method based on bridging knowledge distillation convolution neural network - Google Patents

Image classification method based on bridging knowledge distillation convolution neural network

Info

Publication number
CN112784964A
CN112784964A (application CN202110107120.1A)
Authority
CN
China
Prior art keywords
network
layer
teacher
student
student network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110107120.1A
Other languages
Chinese (zh)
Inventor
杜兰
王震
宋佳伦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202110107120.1A
Publication of CN112784964A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image classification method based on a bridging knowledge distillation convolution neural network, which mainly solves the prior-art problem of low student-network image classification accuracy caused by information loss during knowledge distillation. The implementation steps are: (1) construct a teacher network and a student network; (2) generate a training set; (3) train the teacher network; (4) construct a bridging structure; (5) train the student network; (6) classify the images to be classified. By constructing a bridging structure between the teacher network and the student network and training the student network according to a KL divergence loss function and a cross entropy loss function, the invention gives the student network both high image classification accuracy and low terminal deployment requirements, so it can be used to classify and recognize images on terminal devices with low computing power and low storage.

Description

Image classification method based on bridging knowledge distillation convolution neural network
Technical Field
The invention belongs to the technical field of image processing, and further relates to an image classification method based on a bridging knowledge distillation convolution neural network within the technical field of image classification. The method can be used to classify and recognize images on terminal devices with low computing power and low storage.
Background
The most classical convolutional neural network is ResNet. By introducing skip connections, ResNet effectively alleviates the vanishing-gradient problem in neural network training and makes it possible to train convolutional neural networks with hundreds or even thousands of layers. In general, the more layers and parameters a network has, the stronger its expressive power and the higher the accuracy it can achieve on image classification tasks. However, large-scale convolutional neural networks have long inference times and high storage costs, and in application scenarios such as safe production, industrial quality inspection, and intelligent hardware, the limited memory capacity and computing power of terminal devices greatly hinder their deployment. Directly designing a small-scale convolutional neural network eases terminal deployment but suffers from low classification accuracy. High classification accuracy and high efficiency are therefore difficult to obtain from a convolutional neural network at the same time.
Knowledge distillation is a common model compression algorithm. It typically selects a large, high-accuracy network as the teacher network and a small, lower-accuracy network as the student network, and transfers knowledge from the teacher network to the student network by guiding the student network to imitate the teacher network's outputs, so that the student network approaches the accuracy of the teacher network.
Zagoruyko proposed an image classification method based on attention knowledge distillation in the paper "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer" (In International Conference on Learning Representations, 2017). The method converts the intermediate-layer output features of the teacher and student networks into attention features and, by minimizing the difference between the teacher's and student's attention features, lets the student network learn which image regions the teacher network's intermediate layers attend to, completing the knowledge transfer from teacher to student and improving the image classification accuracy of the small-scale student network. The method has the following shortcoming: the intermediate-layer outputs contain information in both the spatial dimension and the channel dimension, but converting them into attention features collapses the channel dimension, so the channel-dimension information is lost during distillation.
Byeongho Heo proposed an image classification method using knowledge distillation of neuron activation boundaries in the paper "Knowledge transfer via distillation of activation boundaries formed by hidden neurons" (In AAAI Conference on Artificial Intelligence, 2019). The method sets a threshold to judge whether the hidden neurons in the intermediate layers of the teacher network are activated: a neuron whose activation value is larger than the threshold is considered activated, and one whose activation value is smaller than the threshold is considered not activated. This on/off information is transferred to the student network to guide whether the corresponding intermediate-layer neurons of the student network are activated, improving the image classification accuracy of the small-scale student network. The method has the following shortcoming: because it only teaches whether the corresponding student-network neurons should be activated, and not the magnitude of their activations after activation, it does not transfer complete neuron activation information to the student network, which slows the convergence of the student network.
Disclosure of Invention
The purpose of the invention is to provide an image classification method based on a bridging knowledge distillation convolution neural network that overcomes the shortcomings of the prior art described above, solving the problem of low student-network image classification accuracy caused by information loss during knowledge distillation.
The technical idea for realizing the purpose of the invention is as follows: a bridging structure is established between the teacher network and the student network so that the teacher network maps the student network's intermediate-layer information into class probability features, thereby extracting the key information of the intermediate layer; the student network is then trained with a KL divergence loss function and a cross entropy loss function to learn the knowledge of the teacher network, whose image classification accuracy is high, so that the student network reaches an image classification accuracy close to that of the teacher network.
The method comprises the following specific steps:
(1) constructing a teacher network and a student network:
(1a) build a 14-layer teacher network and a 14-layer student network with identical structures, consisting in order of: an input layer, a first convolutional layer, a first activation layer, a first max pooling layer, a second convolutional layer, a second activation layer, a second max pooling layer, a third convolutional layer, a third activation layer, a third max pooling layer, a fourth convolutional layer, a fourth activation layer, a fifth convolutional layer, and an output layer;
(1b) the parameters of each layer of the teacher network are set as follows:
setting the numbers of feature maps of the first to fifth convolutional layers to 16, 32, 64, 128, and 10, respectively, and the convolution kernel sizes to 5 × 5, 6 × 6, 5 × 5, and 3 × 3, respectively;
setting the pooling windows of the first to third max pooling layers to 2 × 2 and the strides to 2;
setting the activation functions of the first to fourth activation layers to the ReLU activation function;
(1c) the parameters of each layer of the student network are set as follows:
setting the numbers of feature maps of the first to fifth convolutional layers to 9, 10, 31, 8, and 10, respectively, and the convolution kernel sizes to 5 × 5, 6 × 6, 5 × 5, and 3 × 3, respectively;
setting the pooling windows of the first to third max pooling layers to 2 × 2 and the strides to 2;
setting the activation functions of the first to fourth activation layers to the ReLU activation function;
(2) generating a training set:
selecting images of at least 2 classes, with at least 200 images per class, to form a training set;
(3) training a teacher network:
inputting the training set into the teacher network to obtain the predicted class probability of each training image, using a cross entropy loss function to calculate the loss between each image's predicted class probability and its corresponding class label, and iteratively updating the teacher network parameters with a back propagation algorithm until the cross entropy loss function converges, obtaining the trained teacher network;
(4) constructing a bridging structure:
connecting the fourth convolutional layer of the trained teacher network with the fourth convolutional layer of the student network to obtain the bridging structure;
(5) training a student network:
(5a) simultaneously inputting the training set into a student network and a trained teacher network to obtain the output of the student network, the output of the teacher network and the output of the bridging structure;
(5b) calculating a KL divergence loss value between the output of the teacher network and the output of the bridging structure by using the KL divergence loss function;
(5c) calculating a cross entropy loss value between the output of the student network and the class label of the training image by using a cross entropy loss function;
(5d) taking the sum of the KL divergence loss value and the cross entropy loss value as the total loss value, and iteratively updating the parameters of the student network through a back propagation algorithm until the total loss value converges, obtaining the trained student network.
(6) Classifying the images to be classified:
inputting the image to be classified into the trained student network to obtain the student network's predicted class probabilities for the image, and selecting the class corresponding to the highest value among the predicted class probabilities as the classification result of the image.
Compared with the prior art, the invention has the following advantages:
Firstly, the invention constructs a bridging structure between the teacher network and the student network. This structure can simultaneously use both the spatial-dimension and the channel-dimension information of the intermediate layers of the teacher and student networks, overcoming the loss of channel-dimension features from the intermediate-layer features extracted in the prior art and improving the image classification accuracy of the student network.
Secondly, the invention trains the student network with a KL divergence loss function and a cross entropy loss function. These loss functions transfer the intermediate-layer knowledge of the teacher network to the student network, overcoming the slow student-network convergence caused by the prior art's neglect of the magnitude information in the teacher network's intermediate layers and effectively accelerating the convergence of the student network.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the bridging structure in the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
With reference to fig. 1, the specific steps of the implementation of the present invention are described in detail.
Step 1, a teacher network and a student network are constructed.
Build a 14-layer teacher network and a 14-layer student network with identical structures, consisting in order of: an input layer, a first convolutional layer, a first activation layer, a first max pooling layer, a second convolutional layer, a second activation layer, a second max pooling layer, a third convolutional layer, a third activation layer, a third max pooling layer, a fourth convolutional layer, a fourth activation layer, a fifth convolutional layer, and an output layer.
The parameters of each layer of the teacher network are set as follows:
the numbers of the first to fifth convolution layer feature maps are set to 16, 32, 64, 128, 10, respectively, and the convolution kernel sizes are set to 5 × 5, 6 × 6, 5 × 5, 3 × 3, respectively.
The pooling windows of the first to third largest pooling layers are all set to 2 × 2, and the step sizes are all set to 2.
The activation functions of the first to fourth activation layers are all set as the ReLU activation functions.
The parameters of each layer of the student network are set as follows:
the numbers of the first to fifth convolution layer feature maps are set to 9, 10, 31, 8, 10, respectively, and the convolution kernel sizes are set to 5 × 5, 6 × 6, 5 × 5, 3 × 3, respectively.
The pooling windows of the first to third largest pooling layers are all set to 2 × 2, and the step sizes are all set to 2.
The activation functions of the first to fourth activation layers are all set as the ReLU activation functions.
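To make the construction concrete, the following is a minimal PyTorch sketch of this 14-layer architecture; a single class serves both networks since their structures are identical. The channel counts and the first four kernel sizes follow the text above, while the fifth layer's 1 × 1 kernel, the padding values, and the global-average-pooling output layer are assumptions the text does not specify.

```python
import torch
import torch.nn as nn

class DistillNet(nn.Module):
    """The 14-layer structure of step 1: input, five convolutional layers,
    four ReLU activation layers, three max pooling layers, and an output layer."""
    def __init__(self, channels=(16, 32, 64, 128, 10), in_channels=3):
        super().__init__()
        c1, c2, c3, c4, c5 = channels
        # Kernel sizes 5x5, 6x6, 5x5, 3x3 follow the text; the 1x1 kernel of
        # the fifth convolutional layer and the padding values are assumptions.
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, c1, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(c1, c2, 6, padding=3),          nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(c2, c3, 5, padding=2),          nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        self.conv4 = nn.Sequential(nn.Conv2d(c3, c4, 3, padding=1), nn.ReLU())
        self.conv5 = nn.Conv2d(c4, c5, 1)       # 10 feature maps = 10 classes
        self.output = nn.AdaptiveAvgPool2d(1)   # assumed global-average-pool output layer

    def forward(self, x, return_mid=False):
        mid = self.conv4(self.features(x))      # output of the fourth convolutional layer
        logits = self.output(self.conv5(mid)).flatten(1)
        return (logits, mid) if return_mid else logits

teacher = DistillNet(channels=(16, 32, 64, 128, 10))  # teacher feature-map counts
student = DistillNet(channels=(9, 10, 31, 8, 10))     # student feature-map counts
```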
Step 2, generating a training set.
Images of at least 2 classes, with at least 200 images per class, are selected to form the training set.
Step 3, training the teacher network.
The training set is input into the teacher network to obtain the predicted class probability of each training image; a cross entropy loss function is used to calculate the loss between each image's predicted class probability and its corresponding class label, and the teacher network parameters are iteratively updated with a back propagation algorithm until the cross entropy loss function converges, yielding the trained teacher network.
The cross entropy loss function is as follows:
J = -\frac{1}{N} \sum_{i=1}^{N} Y_i \log(P_i)

where J represents the cross entropy loss function, N represents the total number of images in the training set, Σ represents the summation operation, i represents the index of an image in the training set, Y_i represents the class label corresponding to the ith image in the training set, log represents the base-2 logarithm operation, and P_i represents the predicted class probability obtained by inputting the ith image in the training set into the teacher network.
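A minimal training-loop sketch for this step is shown below. The Adam optimizer, epoch count, and learning rate are assumptions; the patent calls only for back propagation until the loss converges. Note that F.cross_entropy uses the natural logarithm, which differs from the base-2 logarithm above only by a constant factor and does not change the optimum.

```python
import torch
import torch.nn.functional as F

def train_teacher(teacher, loader, epochs=50, lr=1e-3):
    """Step 3: cross-entropy training of the teacher network."""
    optimizer = torch.optim.Adam(teacher.parameters(), lr=lr)
    teacher.train()
    for _ in range(epochs):
        for images, labels in loader:
            loss = F.cross_entropy(teacher(images), labels)  # loss between P_i and Y_i
            optimizer.zero_grad()
            loss.backward()    # back propagation
            optimizer.step()   # iterative parameter update
    return teacher
```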
Step 4, constructing the bridging structure.
The fourth convolutional layer of the trained teacher network is connected with the fourth convolutional layer of the student network to obtain the bridging structure.
The bridging structure constructed in the present invention is further described with reference to fig. 2.
In fig. 2, the left side is a schematic diagram of the teacher network structure and the right side is a schematic diagram of the student network structure; the bridging structure, shown in the middle part of fig. 2, is obtained by connecting the fourth convolutional layer of the trained teacher network with the fourth convolutional layer of the student network. As can be seen in fig. 2, the bridging structure consists of the student network's layers up to and including its fourth convolutional layer, followed by the teacher network's layers above its fourth convolutional layer.
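The sketch below expresses the bridging structure in terms of the DistillNet class above. The 1 × 1 adapter convolution that reconciles the student's 8 fourth-layer feature maps with the teacher's 128 is an assumption; the patent states only that the two fourth convolutional layers are connected.

```python
import torch.nn as nn

class Bridge(nn.Module):
    """Step 4 bridging structure: the student network's layers up to and
    including its fourth convolutional layer, followed by the trained
    teacher network's layers above its fourth convolutional layer."""
    def __init__(self, student, teacher, student_ch=8, teacher_ch=128):
        super().__init__()
        self.student, self.teacher = student, teacher
        self.adapter = nn.Conv2d(student_ch, teacher_ch, 1)  # assumed channel adapter
        for p in self.teacher.parameters():
            p.requires_grad = False   # the trained teacher stays frozen

    def forward(self, x):
        _, mid = self.student(x, return_mid=True)      # student conv1..conv4
        out = self.teacher.conv5(self.adapter(mid))    # teacher layers above conv4
        return self.teacher.output(out).flatten(1)     # bridge class scores B

bridge = Bridge(student, teacher)
```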
Step 5, training the student network.
The training set is input simultaneously into the student network and the trained teacher network to obtain the output P of the student network, the output Q of the teacher network, and the output B of the bridging structure.
A KL divergence loss value between the output of the teacher network and the output of the bridging structure is calculated using the KL divergence loss function.
The KL divergence loss function is as follows:
L_{KL} = \sum_{i=1}^{N} Q_i \log\frac{Q_i}{B_i}

where L_{KL} represents the KL divergence loss function, Q_i represents the predicted class probability obtained by inputting the ith image in the training set into the teacher network, and B_i represents the predicted class probability obtained by inputting the ith image in the training set into the bridging structure.
A cross entropy loss value between the output of the student network and the class labels of the training images is calculated using the cross entropy loss function.
The cross entropy loss function is as follows:
J = -\frac{1}{N} \sum_{i=1}^{N} Y_i \log(P_i)

where J represents the cross entropy loss function, Y_i represents the class label corresponding to the ith image in the training set, and P_i represents the predicted class probability obtained by inputting the ith image in the training set into the student network.
The sum of the KL divergence loss value and the cross entropy loss value is taken as the total loss value, and the parameters of the student network are iteratively updated through a back propagation algorithm until the total loss value converges, yielding the trained student network.
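A sketch of this training step, built on the DistillNet and Bridge classes above, might look as follows. Training the bridge adapter jointly with the student, and the optimizer settings, are assumptions beyond the text; F.kl_div applied to the log-probabilities of B and the probabilities of Q computes exactly Σ Q_i log(Q_i / B_i).

```python
import torch
import torch.nn.functional as F

def train_student(student, teacher, bridge, loader, epochs=50, lr=1e-3):
    """Step 5: total loss = KL(Q || B) + cross entropy between P and the labels Y."""
    params = list(student.parameters()) + list(bridge.adapter.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    teacher.eval()
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                q = F.softmax(teacher(images), dim=1)        # teacher output Q
            log_b = F.log_softmax(bridge(images), dim=1)     # bridge output log(B)
            kl = F.kl_div(log_b, q, reduction='batchmean')   # sum_i Q_i log(Q_i / B_i)
            ce = F.cross_entropy(student(images), labels)    # student output P vs labels Y
            loss = kl + ce                                   # total loss of step (5d)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```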
Step 6, classifying the images to be classified.
The image to be classified is input into the trained student network to obtain the student network's predicted class probabilities for the image, and the class corresponding to the highest value among the predicted class probabilities is selected as the classification result of the image.
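A corresponding inference sketch; the single-image preprocessing is assumed to match that of the training images.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify(student, image):
    """Step 6: return the index of the class with the highest predicted probability."""
    student.eval()
    probs = F.softmax(student(image.unsqueeze(0)), dim=1)  # predicted class probabilities
    return probs.argmax(dim=1).item()                      # classification result
```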

Claims (3)

1. An image classification method based on a bridging knowledge distillation convolution neural network, characterized in that a bridging structure is constructed between a teacher network and a student network, and the student network is trained according to a KL divergence loss function and a cross entropy loss function, the method comprising the following steps:
(1) constructing a teacher network and a student network:
(1a) build a 14-layer teacher network and a 14-layer student network with identical structures, consisting in order of: an input layer, a first convolutional layer, a first activation layer, a first max pooling layer, a second convolutional layer, a second activation layer, a second max pooling layer, a third convolutional layer, a third activation layer, a third max pooling layer, a fourth convolutional layer, a fourth activation layer, a fifth convolutional layer, and an output layer;
(1b) the parameters of each layer of the teacher network are set as follows:
setting the numbers of feature maps of the first to fifth convolutional layers to 16, 32, 64, 128, and 10, respectively, and the convolution kernel sizes to 5 × 5, 6 × 6, 5 × 5, and 3 × 3, respectively;
setting the pooling windows of the first to third max pooling layers to 2 × 2 and the strides to 2;
setting the activation functions of the first to fourth activation layers to the ReLU activation function;
(1c) the parameters of each layer of the student network are set as follows:
setting the numbers of feature maps of the first to fifth convolutional layers to 9, 10, 31, 8, and 10, respectively, and the convolution kernel sizes to 5 × 5, 6 × 6, 5 × 5, and 3 × 3, respectively;
setting the pooling windows of the first to third max pooling layers to 2 × 2 and the strides to 2;
setting the activation functions of the first to fourth activation layers to the ReLU activation function;
(2) generating a training set:
selecting images of at least 2 classes, with at least 200 images per class, to form a training set;
(3) training a teacher network:
inputting the training set into the teacher network to obtain the predicted class probability of each training image, using a cross entropy loss function to calculate the loss between each image's predicted class probability and its corresponding class label, and iteratively updating the teacher network parameters with a back propagation algorithm until the cross entropy loss function converges, obtaining the trained teacher network;
(4) constructing a bridging structure:
connecting the fourth convolutional layer of the trained teacher network with the fourth convolutional layer of the student network to obtain the bridging structure;
(5) training a student network:
(5a) simultaneously inputting the training set into a student network and a trained teacher network to obtain the output of the student network, the output of the teacher network and the output of the bridging structure;
(5b) calculating a KL divergence loss value between the output of the teacher network and the output of the bridging structure by using the KL divergence loss function;
(5c) calculating a cross entropy loss value between the output of the student network and the class label of the training image by using a cross entropy loss function;
(5d) taking the sum of the KL divergence loss value and the cross entropy loss value as the total loss value, and iteratively updating the parameters of the student network through a back propagation algorithm until the total loss value converges, obtaining the trained student network;
(6) classifying the images to be classified:
inputting the image to be classified into the trained student network to obtain the student network's predicted class probabilities for the image, and selecting the class corresponding to the highest value among the predicted class probabilities as the classification result of the image.
2. The image classification method based on the bridging knowledge distillation convolution neural network of claim 1, wherein the cross entropy loss function in steps (3) and (5c) is as follows:
J = -\frac{1}{N} \sum_{i=1}^{N} Y_i \log(P_i)

where J represents the cross entropy loss function, N represents the total number of images in the training set, Σ represents the summation operation, i represents the index of an image in the training set, Y_i represents the class label corresponding to the ith image in the training set, log represents the base-2 logarithm operation, and P_i represents the predicted class probability obtained by inputting the ith image in the training set into the network.
3. The image classification method based on the bridging knowledge distillation convolution neural network of claim 2, wherein the KL divergence loss function in step (5b) is as follows:
L_{KL} = \sum_{i=1}^{N} Q_i \log\frac{Q_i}{B_i}

where L_{KL} represents the KL divergence loss function, Q_i represents the predicted class probability obtained by inputting the ith image in the training set into the teacher network, and B_i represents the predicted class probability obtained by inputting the ith image in the training set into the bridging structure.
CN202110107120.1A 2021-01-27 2021-01-27 Image classification method based on bridging knowledge distillation convolution neural network Pending CN112784964A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110107120.1A CN112784964A (en) 2021-01-27 2021-01-27 Image classification method based on bridging knowledge distillation convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110107120.1A CN112784964A (en) 2021-01-27 2021-01-27 Image classification method based on bridging knowledge distillation convolution neural network

Publications (1)

Publication Number Publication Date
CN112784964A true CN112784964A (en) 2021-05-11

Family

ID=75757981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110107120.1A Pending CN112784964A (en) 2021-01-27 2021-01-27 Image classification method based on bridging knowledge distillation convolution neural network

Country Status (1)

Country Link
CN (1) CN112784964A (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449776A (en) * 2021-06-04 2021-09-28 中南民族大学 Chinese herbal medicine identification method and device based on deep learning and storage medium
CN113222123A (en) * 2021-06-15 2021-08-06 深圳市商汤科技有限公司 Model training method, device, equipment and computer storage medium
CN113421243A (en) * 2021-06-23 2021-09-21 深圳大学 Method and device for detecting type of fundus image based on knowledge distillation network
CN113505719A (en) * 2021-07-21 2021-10-15 山东科技大学 Gait recognition model compression system and method based on local-integral joint knowledge distillation algorithm
CN113505719B (en) * 2021-07-21 2023-11-24 山东科技大学 Gait recognition model compression system and method based on local-integral combined knowledge distillation algorithm
CN113392938A (en) * 2021-07-30 2021-09-14 广东工业大学 Classification model training method, Alzheimer disease classification method and device
CN113591978A (en) * 2021-07-30 2021-11-02 山东大学 Image classification method, device and storage medium based on confidence penalty regularization self-knowledge distillation
CN113591978B (en) * 2021-07-30 2023-10-20 山东大学 Confidence penalty regularization-based self-knowledge distillation image classification method, device and storage medium
CN113610146B (en) * 2021-08-03 2023-08-04 江西鑫铂瑞科技有限公司 Method for realizing image classification based on knowledge distillation with enhanced intermediate layer feature extraction
CN113610146A (en) * 2021-08-03 2021-11-05 江西鑫铂瑞科技有限公司 Method for realizing image classification based on knowledge distillation enhanced by interlayer feature extraction
CN113673591B (en) * 2021-08-13 2023-12-01 上海交通大学 Self-adjusting sampling optimization image classification method, device and medium
CN113673591A (en) * 2021-08-13 2021-11-19 上海交通大学 Image classification method, device and medium for self-adjusting sampling optimization
CN113610173B (en) * 2021-08-13 2022-10-04 天津大学 Knowledge distillation-based multi-span domain few-sample classification method
CN113610173A (en) * 2021-08-13 2021-11-05 天津大学 Knowledge distillation-based multi-span domain few-sample classification method
CN113807214B (en) * 2021-08-31 2024-01-05 中国科学院上海微系统与信息技术研究所 Small target face recognition method based on deit affiliated network knowledge distillation
CN113807214A (en) * 2021-08-31 2021-12-17 中国科学院上海微系统与信息技术研究所 Small target face recognition method based on deit attached network knowledge distillation
CN114095447A (en) * 2021-11-22 2022-02-25 成都中科微信息技术研究院有限公司 Communication network encrypted flow classification method based on knowledge distillation and self-distillation
CN114095447B (en) * 2021-11-22 2024-03-12 成都中科微信息技术研究院有限公司 Communication network encryption flow classification method based on knowledge distillation and self-distillation
CN114373133A (en) * 2022-01-10 2022-04-19 中国人民解放军国防科技大学 Missing modal terrain classification method based on dense feature group distillation
CN114936567A (en) * 2022-05-26 2022-08-23 清华大学 Unsupervised machine translation quality estimation method and device based on knowledge distillation
CN115965964A (en) * 2023-01-29 2023-04-14 中国农业大学 Egg freshness identification method, system and equipment
CN115965964B (en) * 2023-01-29 2024-01-23 中国农业大学 Egg freshness identification method, system and equipment
CN115908955B (en) * 2023-03-06 2023-06-20 之江实验室 Gradient distillation-based bird classification system, method and device with less sample learning
CN115908955A (en) * 2023-03-06 2023-04-04 之江实验室 Bird classification system, method and device for small-sample learning based on gradient distillation
CN116091849A (en) * 2023-04-11 2023-05-09 山东建筑大学 Tire pattern classification method, system, medium and equipment based on grouping decoder


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210511