CN114022727B

CN114022727B - Depth convolution neural network self-distillation method based on image knowledge review

Info

Publication number: CN114022727B
Application number: CN202111221950.3A
Authority: CN
Inventors: 张逸; 王军; 徐晓刚; 何鹏飞; 虞舒敏; 徐凯
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2021-10-20
Filing date: 2021-10-20
Publication date: 2024-04-26
Anticipated expiration: 2041-10-20
Also published as: CN114022727A

Abstract

The invention discloses a deep convolutional neural network self-distillation method based on image knowledge review, which is characterized in that an auxiliary network is firstly arranged for a target network, branches are led out from a downsampling layer of the target network, knowledge review thought is adopted to sequentially fuse and connect all branches, and in the training process, the self-distillation purpose is achieved by supervised learning and a mode of learning from the downsampling layer of the target network to the led-out branching layer. According to the invention, a knowledge review thought is introduced in the self-distillation field of the deep convolutional neural network, so that the training precision of the deep convolutional neural network is improved; the self-distillation is carried out in an auxiliary network mode, and compared with a self-distillation method using data enhancement to pull in the intra-class distance, the method is simpler and more convenient in practical application.

Description

Depth convolution neural network self-distillation method based on image knowledge review

Technical Field

The invention relates to the technical field of artificial intelligence and computer vision, in particular to a deep convolutional neural network self-distillation method based on image knowledge review.

Background

Along with the development of intellectualization, a large-scale camera generates massive videos, and in order to improve the efficiency of video structuring, the actual scene has set high requirements on the processing capacity of the end side of the camera.

In view of this, compact small network models like MobileNet, shuffleNet, etc., greatly advance the development of end-side reasoning performance with the advantage of low resource consumption. MobileNet, shuffleNet and other compact small networks, so that the time consumption of end side reasoning and the consumption of hardware resources are effectively reduced, however, the reasoning precision has certain disadvantages relative to a large-scale network model, and the method has important significance in further improving the precision performance after training under the condition of network structure limitation.

Under the condition of limiting a network structure, a common method for improving the model training precision performance is knowledge distillation, namely, a teacher network with a large heterogeneous structure scale and a student network with a compact structure are set, and information learned by the teacher network is transmitted to the student network. However, in an actual scene, the teacher model may not be successfully acquired in all tasks, and some large networks are difficult to train normally due to the lack of data. Based on this, self-learning self-distillation training strategies have been developed.

At present, the self-distillation method is mainly divided into a data enhancement-based method and an auxiliary network-based method, wherein the data enhancement-based method is complex in the training process, lacks a certain simplicity in practical application, is simple in the auxiliary network-based method, but the precision improvement and the network complexity are often a pair of contradictions, and an auxiliary network with larger precision improvement and no complexity in the self-distillation training is needed to be sought, so that the deep convolutional neural network self-distillation training is applied in practical engineering, and the method is simple, convenient and has good precision improvement.

Disclosure of Invention

In order to solve the defects in the prior art, the semantic information of a deep network and the information of a shallow network are reviewed and fused by using a knowledge review method, so that the purposes of keeping the simplicity of an auxiliary network and improving the self-distillation training precision are achieved, and the invention adopts the following technical scheme:

A depth convolution neural network self-distillation method based on image knowledge review comprises the following steps:

S1, constructing an auxiliary network according to an original network structure of a target convolutional neural network, and leading out branch characteristics from a previous layer and a last full-connection layer of each downsampling layer;

S2, merging the branch feature A to be merged in the auxiliary network with the branch feature B in a shallower layer in the original network through an attention merging module, inputting the merged feature into the auxiliary network, and connecting the layers of the auxiliary network in sequence, wherein in order to shorten the intra-class distance, an attention mechanism is introduced to perform feature merging, and the merging comprises the following steps:

S21, the branch characteristics A to be fused in the auxiliary network are subjected to up-sampling operation and convolution operation, so that the width and the height of the branch characteristics A to be fused in the auxiliary network are the same as those of the branch characteristics B in a shallower layer in the original network;

S22, carrying out channel attention operation on A and B respectively, carrying out global average pooling and global maximum pooling operation in the width-height direction, adding the obtained features, multiplying the added features with the original branch features in the channel direction through sigmoid operation, and respectively obtaining a branch feature A1 to be fused in an auxiliary network after the channel attention operation and a branch feature B1 at a shallower layer in the original network;

S23, spatial attention fusion, namely performing splicing operation on A1 and B1 in a channel direction, performing global average pooling operation and global maximum pooling operation in the channel direction respectively, performing splicing operation and convolution operation in the channel direction again, and multiplying the two with A1 and B1 on a wide-high scale respectively after sigmoid operation to obtain a branch characteristic A2 to be fused in an auxiliary network after spatial attention fusion and a branch characteristic B2 of a shallower layer in an original network;

s24, adding the A2 and the B2 to obtain the fused characteristic;

S3, training the full connection layer in the original network and the full connection layer of the auxiliary network led out by the branches through respective softmax layers by taking a final truth value label as a target, simultaneously leading out the characteristics of the branches in the original network, performing distillation learning on each fused branch characteristic, and fusing shallow morphological characteristics and deep semantic characteristics with each fused branch characteristic, wherein the full connection layer has higher information content than the characteristics in the original network, so the full connection layer can be used as an object of distillation learning, and comprises the following steps:

S31, training is carried out on a full connection layer in an original network and a full connection layer of an auxiliary network led out by a branch respectively through respective softmax layers by taking a final truth value label as a target through a loss function L _k:

Wherein Y ^p is a predicted probability value, Y ^t is a true probability value corresponding to a true value label, and alpha is an adjustment parameter, so as to obtain a loss function L _k1 of the original network and a loss function L _k2 of the auxiliary network respectively;

S32, binding the branch characteristics led out from the original network to each fused branch characteristic, and performing distillation learning by adopting the following loss function:

wherein x represents an image input by an original network, f (-) represents a deep convolutional neural network, W _A,i represents weights required by calculating an ith feature map of a leading-out branch feature in the original network, and W _B,i represents weights required by calculating an ith feature map of each fused branch feature in an auxiliary network;

s33, performing self-distillation training through a total loss function:

wherein, gamma and theta are respectively adjusting coefficients;

and S4, after training is completed, removing the auxiliary network, and reasoning through the original network.

Further, in S2, the fused features are input into a layer corresponding to the auxiliary network, the output of the layer is downsampled by convolution, and added with the branch features to be fused in the deeper layer in the auxiliary network, so as to be used as the input of the deeper layer of the auxiliary network.

Further, in S2, each branch feature to be fused in the auxiliary network is sequentially fused with a branch feature of a shallower layer in the original network.

Further, in S2, the branching feature of the shallower layer is fused with the branching feature of the shallower layer in the original network.

Further, in S21, when the branch features of the full connection layer are fused with the branch features of the shallower layer in the original network, the two are remolded, so that the two are 0.5 times of the branch features of the shallower layer in the original network before the fusion in the width-height direction.

Further, in S22, the two features with sizes of 1×1×c after the pooling operation, where C is the number of channels of the branch feature a.

Further, in S23, the number of channels is changed to 2 by the splicing operation, and then the convolution operation of the convolution kernel with the size 1*1 and the step length of 1 is performed, so as to compress the number of channels from 2 to 1.

Further, the loss function L _k in S31 adopts a Focal loss.

Further, the loss function L _f in S32 adopts L1 loss.

The invention has the advantages that:

According to the invention, in the field of self-distillation of the deep convolutional neural network, the thought of knowledge review is introduced, so that the training precision of the deep convolutional neural network is improved; the invention adopts the auxiliary network mode to carry out self-distillation, and is simpler and more convenient in practical application compared with a self-distillation method using data enhancement to pull in the intra-class distance.

Drawings

Fig. 1 is a general schematic of the present invention.

Fig. 2 is a schematic structural diagram of an attention fusion module in the present invention.

Detailed Description

The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.

The ship identification is a task of classifying and identifying ships based on ship images in a horizontal view angle, and has important significance in identifying enemy in military scenes and identifying ships in various civil maritime scenes.

Specifically, pytorch 1.6.6 is adopted for network training and test verification, the system environment is Ubuntu18.04, 2080Ti is adopted for training display card hardware, resnet is taken as an example of an original network structure, a data set adopts network collected ship images, the identified ship categories are divided into 10 categories of amphibious landing ships, dock landing ships, expelling ships, guard ships, aircraft carriers, replenishment ships, fishery ships, transportation ships and the like, 6000 pictures are adopted for training sets, and 1000 pictures are adopted for testing sets.

As shown in fig. 1, the deep convolutional neural network self-distillation method based on ship image knowledge review of the invention comprises the following steps:

Step 1: branches are led out at the previous layer and the last full-connection layer of each downsampling layer according to the original structure of the target convolutional neural network.

Specifically, as shown in fig. 1, S2, S3 are the downsampled previous layer in Resnet, S4 is the last fully connected layer, and the branches are fused to form an auxiliary network to distill the original network.

Step 2: each branch feature is sequentially fused with the branch features of the shallower layer in the original structure through an attention fusion module, the size of the fused features is the same as the size of the branch features of the shallower layer, and the fused layers are sequentially connected;

because the intra-class difference of the ships in the embodiment is larger, such as the difference of the shapes of the expelling ships in different countries, the difference of the shapes and the volume scales of the transportation ships is also larger, the discrimination of the intra-class difference and the inter-class difference becomes a troublesome problem, and the attention mechanism introduced in the method is helpful for enhancing the information of the key areas reflecting the inter-class difference of the ship images, weakening the information of the areas reflecting the intra-class difference, such as the image information of the head areas of the expelling ships and the transportation ships, weakening the morphological characteristic information of the middle part of the ship, and introducing the attention mechanism to perform characteristic fusion in order to shorten the intra-class distance.

Specifically, as shown in fig. 1, each branch feature is sequentially fused with a branch feature of a shallower layer in the original structure through an attention fusion module, that is, S4 replicates T4, and is fused with S3 to obtain T3, and is fused with S2 to obtain T2, T2 and S1, and T1 is obtained by fusing the fused feature with the branch feature of the shallower layer, that is, S1, S2, S3, S4 and T1, T2, T3 and T4 are respectively corresponding and identical in size, and are used for distillation in the training process.

The fusion process of the attention fusion module is as follows, and a schematic diagram is shown in fig. 2:

Step 2.1: the deep branch features to be fused are subjected to up-sampling operation firstly, and then a 1*1-size convolution operation with a 1-convolution kernel is performed, so that the width and height of the features to be fused are the same as those of the branches in shallower layers in the original structure, wherein before the T4 features and the S3 features are fused, reshape operation is performed on the T4 features and the S3 features, so that the features are 0.5 times of the original S3 features in the width and height directions;

Step 2.2: let the feature to be fused at this moment be A, the branch feature of the shallower layer in the original structure be B, and the following channel attention operations are performed respectively: and carrying out Global average Pooling and Global max Pooling in the width-height direction to obtain two features with the size of 1 x C, wherein C is the number of channels of the features to be fused, and C corresponding to S1, S2 and S3 is 64, 128 and 256 respectively. After the two are added, the two are multiplied with the original characteristic in the channel direction through sigmoid operation. The A1 and the B1 are respectively obtained through the above operation;

step 2.3: spatial attention fusion is performed: performing concat operation on the features A1 and B1 in the channel direction, performing Global average Pooling and Global max Pooling in the channel direction respectively, performing concat operation again in the channel direction to enable the number of channels to be 2, performing convolution operation with 1*1 size step length of 1 convolution kernel, compressing the number of channels from 2 to 1, performing sigmoid operation, and multiplying the compressed number of channels with the features A1 and B1 in a wide-high scale to obtain A2 and B2 features;

Step 2.4: adding the features A2 and B2 to obtain fused features;

Each layer after fusion is connected in sequence, and the following connection modules are specifically adopted by T1, T2 and T3: firstly, performing 3*3 convolution kernel convolution with the step length of 1, then performing downsampling through 3*3 convolution kernel convolution with the step length of 2, and adding with the feature to be fused through 3*3 convolution kernel convolution with the step length of 1;

The T3 and T4 features specifically adopt the following connection modules: t3 is firstly subjected to 3*3 convolution kernel convolution with the step length of 1, then subjected to 3*3 convolution kernel convolution with the step length of 2, downsampled, and then subjected to 3*3 convolution kernel convolution with the step length of 1, and is added with T4 for connection after passing through the full connection layer.

Step 3: the full connection layer in the original structure and the full connection layer led out by the branches pass through the softmax layer respectively, and training is carried out by taking ground truth label of the final ship image as a target. Simultaneously, distilling and learning the features of the leading-out branches in the original structure to the fused branch features;

each fused branch fuses the shallow morphological characteristics and the deep semantic characteristics of the ship, has higher information content than the characteristic layer in the original Resnet network, and can be used as a distillation learning object.

Specifically, the specific operation of S3 is as follows:

Setting the training strategy batch number as 32, scaling ship images to 640 x 640 size, inputting Resnet images into a network, adopting a randomly generated image input sequence, setting the training iteration number as 200, multiplying the learning rate by 0.1 when the initial network learning rate is 0.01, and the iteration numbers are 100 and 150, respectively multiplying the learning rate by 0.1, adopting an SGD optimizer, wherein the training batch number is tested to be 8, 16, 32 and 64 respectively, and the result is that the batch number 32 is the optimal;

The full connection layer in the original structure and the full connection layer led out by the branches respectively pass through the softmax layer, the output category number is the preset ship image identification number (10 categories), as shown in table 1 is the predicted probability value of three example ship images from top to bottom in fig. 1, the output probability of the auxiliary structure led out by the branches at the position of the true value label is higher than that of the original structure, and therefore the effectiveness of the auxiliary structure led out by the branches for self-distillation is proved by the side surface. The output probability of the original structure and the output probability of the auxiliary structure led out by the branches are trained by taking ground truth label of the final ship image as targets, so that losses L _k1 and L _k2 are obtained.

TABLE 1

The loss function adopts a formula of Focal loss and L _k as follows, Y ^p is a predicted probability value of the ship type, Y ^t is a probability value of the real ship type corresponding to the image truth value tag ground truth label, alpha is an adjustment parameter, and alpha is 2;

Meanwhile, features of the leading-out branches in the original structure are bound to the fused branch features, distillation learning is performed, namely, feature graphs S1, S2, S3 and S4 are respectively subjected to distillation learning to T1, T2, T3 and T4, and a loss function L _f adopts L1 loss as shown in the following formula:

Wherein x represents the input of the network, namely, the ship image scaled to 640 x 640 size, f (·) represents the deep convolutional neural network, W _A,i represents the weight required for calculating the ith feature map of the leading-out branch in the original network structure, W _B,i represents the weight required for calculating the ith feature map of each fused branch feature of the network, f (·) is the process of extracting the ship image feature, and the output is the required feature map. The characteristic extracted from the original network structure for the ship image is represented by f (x, W _A,i), the characteristic of the ship image enhanced by the auxiliary network attention mechanism is represented by f (x, W _B,i), and the distillation learning of the former to the latter enhances the identification of the original network for the differences between ships.

The total loss function during the self-distillation training is shown below, where γ and θ are adjustment coefficients, set to 0.5 and 0.7, respectively:

。

Step 4: after training, removing the auxiliary network led out by each branch, and returning to the original structure.

Specifically, after training is completed, the performance of the original Resnet for ship identification is improved due to the self-distillation effect, the auxiliary networks led out by each branch are removed, and the original structure is returned, namely the original Resnet network of the S1, S2, S3 and S4 paths is only reserved for the reasoning deployment of the ship identification.

The test adopts top1 accuracy as a final evaluation index, and table 2 shows the prediction accuracy of each category in the original structure and the prediction accuracy of the auxiliary network:

TABLE 2

For table 2, it can be seen that the prediction accuracy of the auxiliary network in each category is significantly higher than that of the original structure, which indicates that the auxiliary structure led out by the branches has stronger discrimination capability after the knowledge review method and the attention mechanism are introduced, and can effectively guide the original network structure to perform self-distillation, thereby improving the performance of the original network structure.

Table 3 is the average accuracy of the baseline and the method after removal of the auxiliary network in the ship image test set:

TABLE 3 Table 3

	Baseline Resnet network 18	The method Resnet is carried out by network self-distillation
			Ship identification accuracy	87.63%	89.32%

For Table 3, it can be seen that, with respect to the baseline, the classification accuracy of Resnet network for ship identification can be effectively improved from 87.63% to 89.32% by adopting the method, which proves the effectiveness of the method.

The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the technical solutions according to the embodiments of the present invention.

Claims

1. The deep convolutional neural network self-distillation method based on image knowledge review is characterized by comprising the following steps of:

S2, merging the branch characteristic A to be merged in the auxiliary network with the branch characteristic B of the shallower layer in the original network, inputting the merged characteristic into the auxiliary network and connecting the layers of the auxiliary network in sequence, wherein the merging introduces a attention mechanism and comprises the following steps:

S21, the width and height of the branch characteristic A to be fused in the auxiliary network are the same as those of the branch characteristic B in the shallower layer in the original network;

s24, adding the A2 and the B2 to obtain the fused characteristic;

S3, training the full connection layer in the original network and the full connection layer of the auxiliary network led out by the branches by using the respective softmax layers and taking the final truth value label as a target, and simultaneously, leading out the characteristics of the branches in the original network and carrying out distillation learning on the characteristics of the branches after fusion, wherein the method comprises the following steps:

S31, training is carried out on a full connection layer in an original network and a full connection layer of an auxiliary network led out by a branch respectively through respective softmax layers by taking a truth value label of a final image as a target through a loss function L _k:

Wherein Y ^p is a predicted probability value, Y ^t is a true probability value corresponding to a true value label of the image, and alpha is an adjustment parameter, so as to obtain a loss function L _k1 of the original network and a loss function L _k2 of the auxiliary network respectively;

s33, performing self-distillation training through a total loss function:

wherein, gamma and theta are respectively adjusting coefficients;

2. The method for self-distilling deep convolutional neural network based on image knowledge review as claimed in claim 1, wherein in S2, the fused features are input into a layer corresponding to the auxiliary network, the output of the layer is downsampled by convolution and added with the branch features to be fused in a deeper layer in the auxiliary network as the input of the deeper layer of the auxiliary network.

3. The method for self-distillation of deep convolutional neural network based on image knowledge review as claimed in claim 1, wherein in S2, each branch feature to be fused in the auxiliary network is fused with a branch feature of a shallower layer in the original network in sequence.

4. The method for self-distilling a deep convolutional neural network based on image knowledge review as recited in claim 1, wherein in said S2, the branch features of the shallower layer are fused with the branch features of the shallower layer in the original network.

5. The image knowledge review-based deep convolutional neural network self-distillation method according to claim 1, wherein in the step S21, when the branch features of the full-connection layer are fused with the branch features of the shallower layer in the original network, the two are remodelled so that the two are 0.5 times as large as the branch features of the shallower layer in the original network before the fusion in the width-height direction.

6. The method of claim 1, wherein in S22, the pooling operation is performed on two features with a size of 1×1×c, where C is the number of channels of the branch feature a.

7. The method for self-distillation of deep convolutional neural network based on image knowledge review as claimed in claim 1, wherein in S23, the number of channels is changed to 2 by splicing operation, and then the convolution operation of 1*1 size and step length 1 convolution kernel is performed to compress the number of channels from 2 to 1.

8. The method of self-distillation of deep convolutional neural network based on knowledge review of images according to claim 1, wherein said loss function L _k in S31 employs Focal loss.

9. The method of self-distillation of deep convolutional neural network based on knowledge review of images according to claim 1, wherein said loss function L _f in S32 is L1 loss.