CN117454971A - Projection type knowledge distillation method based on self-adaptive mask weighting - Google Patents


Publication number
CN117454971A
Authority
CN
China
Prior art keywords: student, network, self-adaptive mask, teacher
Prior art date
Legal status
Pending
Application number
CN202311530381.XA
Other languages
Chinese (zh)
Inventor
王军
秦新芳
李玉莲
申政文
陈世海
Current Assignee
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT filed Critical China University of Mining and Technology CUMT
Priority to CN202311530381.XA
Publication of CN117454971A
Legal status: Pending

Classifications

    • G06N3/096 Transfer learning
    • G06N3/045 Combinations of networks
    • G06N3/0495 Quantised networks; Sparse networks; Compressed networks
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/806 Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82 Arrangements for image or video recognition or understanding using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a projection type knowledge distillation method based on self-adaptive mask weighting. First, relation matrices are constructed from the features extracted by the student modules of the student network, so that adjacent pixels carry richer and more varied information. Second, self-adaptive mask matrices are built from the relation matrices and from the student network's feature maps, and used to apply self-adaptive mask weighting. A projection layer is then constructed; guided by the teacher network, it projects the mask-weighted features into complete features that approach the teacher features. Finally, each feature layer of the teacher network supervises the corresponding feature layer of the student network, and the student model is updated. The invention improves the student network's ability to express the rich information it has learned, solves the limited student-network characterization capability and insufficient information utilization caused by randomly masking the student features and by the limited receptive fields of adjacent pixels in those features, and improves the robustness and generalization capability of the knowledge distillation model.

Description

Projection type knowledge distillation method based on self-adaptive mask weighting
Technical Field
The invention relates to the field of computer vision, in particular to a projection type knowledge distillation method based on self-adaptive mask weighting.
Background
Deep convolutional neural networks are widely used across computer vision tasks. In general, the larger the model, the better the performance, but the slower the inference, which makes such models hard to deploy where resources are limited. Knowledge distillation was proposed to overcome this problem. Current feature-based distillation methods typically make the student mimic the teacher's features as closely as possible, so that the student features gain stronger characterization capability.
In Masked Generative Distillation, Yang et al. propose that improving the student's characterization ability need not be achieved by directly mimicking the teacher. Starting from this observation, they recast the imitation task as a generation task: during distillation, the student's features are randomly masked, and the student must generate the stronger teacher features from its own weaker ones, thereby improving its characterization ability. However, random masking makes the masked regions of the feature map too arbitrary, which hurts the subsequent recovery of those regions from their neighboring pixels; moreover, masking the features directly limits the receptive fields of adjacent pixels in the student features, so the complete features cannot be recovered effectively, i.e. the characterization capability of the student network remains limited.
Disclosure of Invention
The invention aims to provide a projection type knowledge distillation method based on self-adaptive mask weighting that solves the limited student-network characterization capability and insufficient information utilization caused by randomly masking the student features and by the limited receptive fields of adjacent pixels in those features, and that improves the robustness and generalization capability of the knowledge distillation model.
The technical solution realizing the purpose of the invention is as follows: a projection type knowledge distillation method based on self-adaptive mask weighting, comprising the following steps:
Step 1: randomly acquire K labeled images from the CIFAR-100 data set, with 10000 < K ≤ 60000; normalize the K images and unify the pixel size to $h_0 \times w_0$, where $h_0$ is the image height and $w_0$ the image width. Randomly split the resized images into a training set and a test set at a ratio of 5:1, apply data enhancement to the training set to form the teacher-student training set, pre-train the teacher network on this set to obtain the pre-trained teacher network, and go to step 2.
Step 2: divide the teacher network into n teacher modules according to convolution-layer depth and feature-map size, divide the student network into n student modules in the same way, and go to step 3.
Step 3: construct n-1 relation matrices from the output features of the n student modules of the student network, and go to step 4.
Step 4: construct the corresponding self-adaptive mask matrices from the relation matrices built in step 3, and use them to apply self-adaptive mask relation weighting to the output features of the first n-1 student modules, obtaining the first n-1 self-adaptive mask relation weighted features; apply self-adaptive masking to the output feature of the n-th student module to obtain the self-adaptive mask feature, and go to step 5.
Step 5: construct projection layers guided by the teacher network, make the projections of the n-1 self-adaptive mask relation weighted features approach the output features of the corresponding n-1 teacher modules, and compute the self-adaptive mask relation weighted projection loss; make the projection of the n-th self-adaptive mask feature approach the output feature of the n-th teacher module, compute the projection loss of the self-adaptive mask feature, and go to step 6.
Step 6: compute the distillation loss of the conventional distillation method from the output features of the n-th teacher module and the n-th student module; combine the conventional distillation loss with the self-adaptive mask weighted projection losses into the total distillation loss, update the student network's parameters according to the total loss to obtain the trained student network, and go to step 7.
Step 7: input the test data set into the trained student network, output the prediction corresponding to each test sample, and evaluate the accuracy of the trained student network.
Compared with the prior art, the invention has the advantages that:
1) Compared with existing knowledge distillation methods, the invention focuses on improving the expressive power of the student model: under the guidance of the teacher model, the student fully mines and expresses the rich information contained in the dual features of the relation matrix and the feature map, while alleviating both the under-utilization of feature knowledge and the large gap in expressive power between the student and teacher models.
2) The invention is the first to construct a projection type distillation model that applies self-adaptive mask relation weighting and self-adaptive mask output-feature weighting to the features extracted at each stage of the student network, solving the limited student-network characterization capability and insufficient information utilization caused by randomly masking the student features and by the limited receptive fields of adjacent pixels, and improving the robustness and generalization capability of the knowledge distillation model.
Drawings
FIG. 1 is a model diagram of the projection type knowledge distillation method based on self-adaptive mask weighting of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.
Referring to fig. 1, a projection type knowledge distillation method based on self-adaptive mask weighting includes the following steps:
Step 1: randomly acquire K labeled images from the CIFAR-100 data set, with 10000 < K ≤ 60000; normalize the K images and unify the pixel size to $h_0 \times w_0$, where $h_0$ is the image height and $w_0$ the image width. Randomly split the resized images into a training set and a test set at a ratio of 5:1, apply data enhancement to the training set to form the teacher-student training set, pre-train the teacher network on this set to obtain the pre-trained teacher network, and go to step 2.
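The 5:1 random partition of step 1 can be sketched as follows; this is a minimal NumPy illustration of the split only (CIFAR-100 loading, normalization and augmentation are omitted, and the function name `make_split` is ours, not the patent's):

```python
import numpy as np

def make_split(num_images=60000, ratio=(5, 1), seed=0):
    """Randomly split image indices into training / test sets at 5:1,
    as in step 1 (only the split itself is sketched here)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_images)
    n_train = num_images * ratio[0] // (ratio[0] + ratio[1])
    return idx[:n_train], idx[n_train:]

train_idx, test_idx = make_split()
print(len(train_idx), len(test_idx))  # 50000 10000
```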
Step 2: divide the teacher network into n teacher modules according to convolution-layer depth and feature-map size, divide the student network into n student modules in the same way, extract the features of each stage, and go to step 3.
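The module division of step 2 can be illustrated with a toy sketch; the even split and the callable "layers" below are illustrative stand-ins (the patent divides by convolution-layer depth and feature-map size, which this sketch does not model):

```python
def split_into_modules(layers, n):
    """Divide an ordered list of layers into n consecutive modules,
    as the patent divides teacher and student networks into n stages."""
    k, r = divmod(len(layers), n)
    sizes = [k + (1 if i < r else 0) for i in range(n)]
    modules, start = [], 0
    for s in sizes:
        modules.append(layers[start:start + s])
        start += s
    return modules

def forward_stages(x, modules):
    # Run the input through each module, keeping every stage output
    # (these per-stage features feed the relation matrices in step 3).
    feats = []
    for module in modules:
        for layer in module:
            x = layer(x)
        feats.append(x)
    return feats

layers = [lambda v, i=i: v + i for i in range(10)]  # toy "layers"
mods = split_into_modules(layers, 4)
feats = forward_stages(0, mods)
print(len(mods), len(feats))  # 4 4
```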
Step 3, construct n-1 relation matrices from the output features of the n student modules of the student network, as follows:
First, define the output feature of the i-th student module as $F_{S_i} \in \mathbb{R}^{H \times W \times C}$, $1 \le i \le n$, where S denotes the student network and H, W and C denote the height, width and channel dimension of the output feature; define the output feature of the i-th teacher module as $F_{T_i}$, $1 \le i \le n$, where T denotes the teacher network. Then sparsely sample $F_{S_i}$ with dilated (atrous) convolution to obtain the feature map $\tilde{F}_{S_i}$, and fuse $F_{S_i}$ with $\tilde{F}_{S_i}$ to enlarge the common receptive field shared by adjacent pixels, which helps the projection layer project the masked feature pixels; the fused feature is denoted $\hat{F}_{S_i}$. Finally, build the relation matrix $G_N$ from the fused features $\hat{F}_{S_i}$ and $\hat{F}_{S_{i+1}}$.
That is, $G_N$ is the relation matrix constructed from the fused output feature of the i-th student module and the fused output feature of the (i+1)-th student module, with $1 \le i \le n-1$ and $1 \le N \le n-1$; h denotes a pixel position along the height dimension and w a pixel position along the width dimension. The relation matrix ties adjacent pixels more closely together and increases their overlapping receptive fields, which benefits the projection of the masked pixels.
Go to step 4.
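Since the patent's exact formula for $G_N$ appears only in the unreproduced figure, the following NumPy sketch shows one plausible reading of step 3: dilated sparse sampling, feature fusion, and a per-pixel channel inner product between the fused features of consecutive modules. The function names, the fixed-stride sampling, and the inner-product choice are all assumptions, not the patent's definitions:

```python
import numpy as np

def dilated_sample(feat, rate=2):
    """Sparsely sample a (C, H, W) feature map, as a stand-in for the
    dilated-convolution sampling in the patent (the real method uses a
    learned dilated conv; this fixed sampling is only illustrative)."""
    sparse = np.zeros_like(feat)
    sparse[:, ::rate, ::rate] = feat[:, ::rate, ::rate]
    return sparse

def fuse(feat, rate=2):
    # Fuse the original feature with its sparsely sampled version to
    # enlarge the receptive field shared by adjacent pixels.
    return feat + dilated_sample(feat, rate)

def relation_matrix(f_a, f_b):
    """Hypothetical relation matrix G_N between the fused features of
    student modules i and i+1: per-pixel inner product over channels."""
    return np.einsum('chw,chw->hw', f_a, f_b)

rng = np.random.default_rng(0)
f1 = rng.standard_normal((8, 4, 4))   # output of student module i
f2 = rng.standard_normal((8, 4, 4))   # module i+1, resized to match
G = relation_matrix(fuse(f1), fuse(f2))
print(G.shape)  # one relation value per spatial position
```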
Step 4, construct the corresponding self-adaptive mask matrices from the relation matrices built in step 3, use them to apply self-adaptive mask relation weighting to the output features of the first n-1 student modules of the student network, obtaining the first n-1 self-adaptive mask relation weighted features, and apply self-adaptive mask weighting separately to the output feature of the n-th student module, obtaining the self-adaptive mask weighted feature. The details are as follows:
First, score the relation matrix with a softmax function, sort the entries by score in descending order, and select the top $k_1$ values; these $k_1$ values, kept at their original, unsorted positions in the map, form the attention region of the self-adaptive mask relation, and the remaining positions are assigned 0. The self-adaptive mask relation matrix is therefore

$$M^{G_N}_{v,j} = \begin{cases} \mathrm{softmax}(G_N)_{v,j}, & (v,j) \text{ among the top-}k_1 \text{ positions} \\ 0, & \text{otherwise} \end{cases}$$

where $M^{G_N}$ is the self-adaptive mask matrix corresponding to the relation matrix $G_N$, the top-$k_1$ scores keep their original positions, and v and j are the horizontal and vertical coordinates of the relation matrix. Compared with the earlier random masking operation, this self-adaptive mask matrix, proposed here for the first time, is more targeted: retaining the high-scoring entries as weights gives the target features a higher proportion, so important features are projected better.
Then mask the corresponding relation matrix with the self-adaptive mask matrix to obtain the weight matrix used for self-adaptive mask relation weighting, $W^{G_N} = G_N \odot M^{G_N}$, where $\odot$ is the Hadamard product.
Finally, use the weight matrix $W^{G_N}$ to apply self-adaptive mask weighting to the feature $F_{S_i}$ extracted by the i-th student module, obtaining the self-adaptive mask relation weighted feature.
Similarly, pass the output feature $F_{S_n}$ of the n-th student module through a softmax function to obtain its score map, sort the entries by score in descending order, and select the top $k_2$ values; these keep their original, unsorted positions in the feature map as the attention region of the self-adaptive mask, and the remaining positions are assigned 0:

$$M^{F_{S_n}}_{v,j} = \begin{cases} \mathrm{softmax}(F_{S_n})_{v,j}, & (v,j) \text{ among the top-}k_2 \text{ positions} \\ 0, & \text{otherwise} \end{cases}$$

where $M^{F_{S_n}}$ is the self-adaptive mask matrix corresponding to the output feature $F_{S_n}$ of the n-th student module, and v and j are the horizontal and vertical coordinates of $F_{S_n}$. Masking $F_{S_n}$ with this self-adaptive mask matrix yields the self-adaptive mask feature.
Go to step 5.
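A minimal NumPy sketch of the self-adaptive mask of step 4, assuming the mask keeps the top-k softmax scores at their original positions and is combined with the relation matrix by a Hadamard product (function names are ours, and k plays the role of $k_1$ / $k_2$):

```python
import numpy as np

def adaptive_mask(score_map, k):
    """Self-adaptive mask: keep the top-k softmax scores at their
    original positions, zero elsewhere."""
    flat = score_map.ravel()
    probs = np.exp(flat - flat.max())
    probs /= probs.sum()               # softmax scores
    mask = np.zeros_like(probs)
    top = np.argsort(probs)[::-1][:k]  # indices of the k largest scores
    mask[top] = probs[top]             # scores stay at original positions
    return mask.reshape(score_map.shape)

rng = np.random.default_rng(1)
G = rng.standard_normal((4, 4))        # a relation matrix
M = adaptive_mask(G, k=5)              # self-adaptive mask matrix
W = G * M                              # Hadamard product -> weight matrix
print(int((M > 0).sum()))              # 5 positions survive the mask
```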
Step 5, construct projection layers, guide each projection layer with the teacher network so that the projections of the n-1 self-adaptive mask relation weighted features obtained by the student network approach the output features of the corresponding n-1 teacher modules, and compute the self-adaptive mask relation weighted projection loss; make the projection of the n-th self-adaptive mask weighted feature approach the output feature of the n-th teacher module and compute the projection loss of the self-adaptive mask feature. The details are as follows:
First, the projection layer is built from convolution blocks and ReLU functions; its structure is a 3×3 convolution block, a ReLU layer and a 3×3 convolution block connected in sequence. Then feed the self-adaptive mask relation weighted feature into the projection layer; under the guidance of the feature $F_{T_i}$ extracted by the corresponding teacher module, the student network is forced to project a relation projection feature $P_{S_i}$ whose shape and size approximate $F_{T_i}$. Finally, compute the self-adaptive mask relation weighted projection loss $L_{admp1}$ between the output feature $F_{T_i}$ of the corresponding teacher module and the projection feature $P_{S_i}$:

$$L_{admp1} = \sum_{c}\sum_{h}\sum_{w}\big(F_{T_i}(c,h,w) - P_{S_i}(c,h,w)\big)^2$$

where $F_{T_i}$ is the feature extracted by the i-th teacher module of the divided teacher network, $P_{S_i}$ is the feature obtained by projecting the self-adaptive mask relation weighted feature of the i-th student module, c indexes the channel, h the pixel position along the height dimension and w the pixel position along the width dimension.
Similarly, feed the self-adaptive mask weighted feature into the projection layer; under the guidance of the feature $F_{T_n}$ extracted by the corresponding teacher module, the student network is forced to project a mask projection feature $P_{S_n}$ whose shape and size approximate $F_{T_n}$. Finally, compute the projection loss of the self-adaptive mask feature between $F_{T_n}$ and $P_{S_n}$:

$$L_{admp2} = \sum_{c}\sum_{h}\sum_{w}\big(F_{T_n}(c,h,w) - P_{S_n}(c,h,w)\big)^2$$

The projection loss of the self-adaptive mask weighted projection type knowledge distillation method is then assembled as

$$L_{admp} = \alpha_1 L_{admp1} + \alpha_2 L_{admp2}$$

where $\alpha_1$ is a weight hyper-parameter adjusting the self-adaptive mask relation weighted projection loss and $\alpha_2$ a weight hyper-parameter adjusting the projection distillation loss of the self-adaptive mask feature. This loss sub-module corrects the deviation between the projections of the masked relation matrices and output features of the student modules and the output features of the corresponding teacher modules, so that the teacher network achieves a better guiding effect and the student network, under the teacher's guidance, better mines and fully exploits the information it has learned.
Go to step 6.
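The projection layer of step 5 (3×3 convolution, ReLU, 3×3 convolution) can be sketched with a naive NumPy convolution; the random weights below are placeholders for the learned parameters:

```python
import numpy as np

def conv3x3(x, w):
    """Naive 3x3 'same' convolution on a (C_in, H, W) tensor with
    weights of shape (C_out, C_in, 3, 3)."""
    c_out = w.shape[0]
    _, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(x.shape[0]):
            for dh in range(3):
                for dw in range(3):
                    out[o] += w[o, i, dh, dw] * xp[i, dh:dh+H, dw:dw+W]
    return out

def projection_layer(x, w1, w2):
    # 3x3 conv -> ReLU -> 3x3 conv, as described in the patent.
    return conv3x3(np.maximum(conv3x3(x, w1), 0.0), w2)

rng = np.random.default_rng(2)
feat = rng.standard_normal((4, 8, 8))         # masked student feature
w1 = rng.standard_normal((4, 4, 3, 3)) * 0.1  # hypothetical weights
w2 = rng.standard_normal((6, 4, 3, 3)) * 0.1  # maps to teacher width 6
proj = projection_layer(feat, w1, w2)
print(proj.shape)  # (6, 8, 8): teacher-shaped projection
```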
Step 6, compute the distillation loss of the conventional distillation method from the output features of the n-th teacher module of the teacher network and the n-th student module of the student network; combine the conventional distillation loss with the self-adaptive mask weighted projection loss into the total distillation loss, update the student network's parameters according to the total loss, and obtain the trained student network. The details are as follows:
The loss of the most traditional feature-based knowledge distillation method is

$$L_{classical} = \sum_{c}\sum_{h}\sum_{w}\big(F_{T_n}(c,h,w) - F_{S_n}(c,h,w)\big)^2$$

where $F_{T_n}$ is the output feature of the n-th (i.e. last) of the n teacher modules of the divided teacher network, and $F_{S_n}$ is the output feature of the n-th (i.e. last) of the n student modules of the divided student network.
The total loss is then

$$L_{total} = L_{admp} + L_{classical}$$

Go to step 7.
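Assuming the sum-of-squared-differences loss form used above, the total loss of step 6 can be sketched as (function names and signatures are ours):

```python
import numpy as np

def sq_loss(a, b):
    # Sum over (c, h, w) of squared differences -- the loss form
    # assumed here for every term.
    return float(((a - b) ** 2).sum())

def total_loss(teacher_feats, student_projs, student_last, a1=7e-6, a2=3e-7):
    """Total distillation loss: L_admp (alpha-weighted projection
    losses) plus L_classical (last teacher vs. last student feature).
    teacher_feats[i] guides student_projs[i]; the last entries belong
    to the n-th modules."""
    l_admp1 = sum(sq_loss(t, p)
                  for t, p in zip(teacher_feats[:-1], student_projs[:-1]))
    l_admp2 = sq_loss(teacher_feats[-1], student_projs[-1])
    l_classical = sq_loss(teacher_feats[-1], student_last)
    return a1 * l_admp1 + a2 * l_admp2 + l_classical

t = [np.ones((2, 3, 3)) for _ in range(4)]
p = [np.ones((2, 3, 3)) for _ in range(4)]   # perfect projections
s_last = np.zeros((2, 3, 3))                  # raw student output
print(total_loss(t, p, s_last))  # only the classical term: 18.0
```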
Step 7: input the test data set into the trained student network, output the prediction corresponding to each test sample, and evaluate the accuracy of the trained student network.
Example 1
Referring to fig. 1, the projection type knowledge distillation method based on self-adaptive mask weighting of the invention comprises the following steps:
Step 1: randomly collect 60000 labeled images from the CIFAR-100 data set, normalize them, and unify the pixel size to 32×32; randomly split the resized images into a training set and a test set at a ratio of 5:1 and apply data enhancement to the training set to form the teacher-student training set; pre-train the teacher network on this set to obtain the teacher network. The data enhancement comprises image scaling and random flipping: images are scaled inward and outward by 10% of the original size, the random flip angle lies between -20° and 20°, and the number of image classes is 100.
Step 2: divide the teacher network into 4 teacher modules according to convolution-layer depth and feature-map size, divide the student network into 4 student modules, and go to step 3.
Step 3: construct 3 relation matrices from the output features of the 4 student modules of the student network, and go to step 4.
Step 4: construct the corresponding self-adaptive mask matrices from the relation matrices built in step 3, and use them to apply self-adaptive mask relation weighting to the output features of the first 3 student modules, obtaining the first 3 self-adaptive mask relation weighted features; apply self-adaptive mask weighting to the output feature of the 4th student module to obtain the self-adaptive mask feature, and go to step 5.
Step 5: construct projection layers guided by the teacher network, make the projections of the 3 self-adaptive mask relation weighted features approach the output features of the 3 corresponding teacher modules, and compute the self-adaptive mask relation weighted projection loss; make the projection of the 4th self-adaptive mask weighted feature approach the output feature of the 4th teacher module, compute the projection loss of the self-adaptive mask feature, and go to step 6.
Step 6: compute the conventional distillation loss from the output features of the 4th teacher module and the 4th student module; combine it with the self-adaptive mask weighted projection loss into the total distillation loss, update the student network's parameters accordingly to obtain the trained student network, and go to step 7.
Step 7: input the test data set into the trained student network, output the prediction corresponding to each test sample, and evaluate the accuracy of the trained student network.
The method of the invention builds the network framework with the python programming language and the pytorch framework on an Nvidia 2080Ti GPU host and conducts the related experiments. For the classification task, the loss is the sum of the conventional knowledge distillation loss and the self-adaptive mask weighted projection loss. Two hyper-parameters $\alpha_1$ and $\alpha_2$ balance the distillation losses; they are set to $\alpha_1 = 0.000007$ and $\alpha_2 = 0.0000003$ for the classification experiments. All models are trained for 240 epochs with the SGD optimizer, momentum 0.9 and weight decay 0.0001. The learning rate is initialized to 0.025 and decayed every 30 epochs. Training multiple times on the training set yields the projection type knowledge distillation model based on self-adaptive mask weighting.
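The training schedule can be expressed as a step decay; note that the decay factor 0.1 below is an assumption, since the patent states only that the rate decays every 30 epochs:

```python
def learning_rate(epoch, base_lr=0.025, step=30, gamma=0.1):
    """Step learning-rate schedule matching the stated setup: start
    at 0.025 and decay every 30 epochs. gamma=0.1 is an assumption,
    not a value given in the patent."""
    return base_lr * gamma ** (epoch // step)

schedule = [learning_rate(e) for e in (0, 29, 30, 60, 239)]
print(schedule[0], schedule[2])  # 0.025 before the first decay step
```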
To demonstrate the superior performance of the algorithm, the invention selects knowledge distillation algorithms popular in recent years as comparison models; the results of the comparison experiments, with ResNet-32×4 as the teacher network and ResNet-8×4 as the student network under otherwise identical objective conditions (same data set, same equipment, etc.), are shown in Table 1:
Table 1. Results of the comparison experiments under identical objective conditions (same data set, etc.)
From the experimental results, the effectiveness of the method of the present invention can be seen.

Claims (8)

1. A projection type knowledge distillation method based on self-adaptive mask weighting, characterized by comprising the following steps:
step 1: randomly acquire K labeled images from the CIFAR-100 data set, with 10000 < K ≤ 60000; normalize the K images and unify the pixel size to $h_0 \times w_0$, where $h_0$ is the image height and $w_0$ the image width; randomly split the resized images into a training set and a test set at a ratio of 5:1, apply data enhancement to the training set to form the teacher-student training set, pre-train the teacher network on this set to obtain the pre-trained teacher network, and go to step 2;
step 2, dividing a teacher network into n teacher modules according to the depth of the convolution layer and the size of the feature map, dividing a student network into n student modules, and turning to step 3;
step 3, constructing n-1 relation matrixes based on output characteristics of n student modules of a student network, and turning to step 4;
step 4, constructing a corresponding self-adaptive mask matrix based on the relation matrix constructed in the step 3, and respectively carrying out self-adaptive mask relation weighting on the output characteristics of the first n-1 student modules of the student network by using the self-adaptive mask matrix to obtain first n-1 self-adaptive mask relation weighting characteristics; performing adaptive masking on the output characteristics of the nth student module of the student network to obtain adaptive masking characteristics, and turning to step 5;
step 5, constructing a projection layer, guiding the corresponding projection layer by using a teacher network, enabling projections of n-1 self-adaptive mask relation weighting characteristics obtained by a student network to approach output characteristics of corresponding n-1 teacher modules, and calculating the self-adaptive mask relation weighted projection loss; enabling the projection of the nth self-adaptive mask feature to approach to the output feature of the nth teacher module, calculating the projection loss of the self-adaptive mask feature, and turning to the step 6;
step 6, calculating distillation loss of the traditional distillation method by using the output characteristics of the nth teacher module of the teacher network and the output characteristics of the nth student module of the student network; calculating total distillation loss by using the traditional distillation loss and the projection loss weighted by the self-adaptive mask, updating network parameters of the student network according to the total distillation loss, finally obtaining a trained student network, and turning to step 7;
and 7, inputting the test data set into a trained student network, outputting a prediction result corresponding to each sample in the test set, and testing the accuracy of the trained student network.
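The data preparation in step 1 can be sketched as follows. This is a minimal NumPy sketch under illustrative assumptions (a toy K, images already at h_0 × w_0, and [0, 1] scaling standing in for full normalization and data enhancement), not the patented implementation:

```python
import numpy as np

def prepare_dataset(images, labels, h0=32, w0=32, seed=0):
    """Normalize images, unify the size to h0 x w0, and split 5:1."""
    rng = np.random.default_rng(seed)
    # Scale pixel values to [0, 1] (toy stand-in for full normalization).
    imgs = images.astype(np.float32) / 255.0
    # Here we assume the inputs already have shape (K, h0, w0, 3); a real
    # pipeline would resize every image to h0 x w0 first.
    assert imgs.shape[1:3] == (h0, w0)
    # Random 5:1 split into training and test sets.
    idx = rng.permutation(len(imgs))
    n_train = len(imgs) * 5 // 6
    tr, te = idx[:n_train], idx[n_train:]
    return (imgs[tr], labels[tr]), (imgs[te], labels[te])

K = 600  # toy value; the claim requires 10000 < K <= 60000
images = np.random.randint(0, 256, size=(K, 32, 32, 3), dtype=np.uint8)
labels = np.random.randint(0, 100, size=K)
(train_x, train_y), (test_x, test_y) = prepare_dataset(images, labels)
print(train_x.shape, test_x.shape)  # (500, 32, 32, 3) (100, 32, 32, 3)
```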
2. The method of claim 1, wherein in step 3, n-1 relation matrices are constructed based on output characteristics of n student modules of the student network, specifically as follows:
First, the output features of the i-th student module are defined as F_Si ∈ R^(H×W×C), where S denotes the student network and H, W and C denote the height, width and channel dimension of the output features, respectively; the output features of the i-th teacher module are defined as F_Ti, where T denotes the teacher network. Then dilated convolution is used to sparsely sample the features F_Si, obtaining a sparsely sampled feature map, which is fused with F_Si to enlarge the receptive field. Finally, the fused features of the i-th and (i+1)-th student modules are used to build the relation matrix G_N;
that is, G_N denotes the relation matrix constructed from the fused features of the i-th student module and the fused features of the (i+1)-th student module, where 1 ≤ i ≤ n-1 and 1 ≤ N ≤ n-1; h denotes a pixel position in the height dimension and w denotes a pixel position in the width dimension.
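The construction in claim 2 can be illustrated with a small NumPy sketch. The dilated-convolution sparse sampling is stood in for by strided subsampling with nearest-neighbour upsampling, the fusion by a simple sum, and the relation by a per-position channel inner product; all three choices are assumptions made for illustration, since the closed-form expression for G_N is not reproduced in this text:

```python
import numpy as np

def sparse_sample(feat, rate=2):
    """Toy stand-in for dilated-convolution sparse sampling: keep every
    `rate`-th pixel, then nearest-neighbour upsample back to the input size."""
    H, W, C = feat.shape
    sub = feat[::rate, ::rate, :]
    return np.repeat(np.repeat(sub, rate, axis=0), rate, axis=1)[:H, :W, :]

def fuse(feat):
    """Fuse the original and sparsely sampled features (here: a simple sum)."""
    return feat + sparse_sample(feat)

def relation_matrix(f_i, f_ip1):
    """Relation between the fused features of student modules i and i+1,
    taken here as a channel inner product at every spatial position (h, w)."""
    a, b = fuse(f_i), fuse(f_ip1)
    return np.einsum('hwc,hwc->hw', a, b)

f1 = np.random.rand(8, 8, 4)  # toy output features of student module i
f2 = np.random.rand(8, 8, 4)  # toy output features of student module i+1
G = relation_matrix(f1, f2)
print(G.shape)  # (8, 8)
```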
3. The projection-type knowledge distillation method based on self-adaptive mask weighting according to claim 2, wherein in step 4, a corresponding self-adaptive mask matrix is constructed based on the relation matrix constructed in step 3, and the self-adaptive mask matrix is used to perform self-adaptive mask relation weighting on the output features of the first n-1 student modules of the student network, obtaining the first n-1 self-adaptive mask relation weighted features; self-adaptive masking is performed on the output features of the n-th student module of the student network to obtain self-adaptive mask features, specifically as follows:
First, the relation matrix is passed through a softmax function to obtain the scores of the feature map; the scores are sorted from largest to smallest, and the top k_1 highest-scoring values are selected; the original, unsorted positions of these k_1 values in the feature map are taken as the attention region of the self-adaptive mask relation, and the remaining positions are assigned 0; the self-adaptive mask relation matrix is expressed by the following expression:
where the self-adaptive mask matrix corresponds to the relation matrix G_N, its non-zero entries are the original positions of the top k_1 highest-scoring values of the relation matrix, and v and j are the horizontal and vertical coordinates of the relation matrix, respectively;
then the self-adaptive mask matrix is used to mask the corresponding relation matrix, obtaining the weight matrix for self-adaptive mask relation weighting, where the masking is an element-wise (Hadamard) product;
finally, the weight matrix is used to perform self-adaptive mask relation weighting on the features F_Si extracted by the i-th student module, obtaining the self-adaptive mask relation weighted features.
Similarly, the output features of the n-th student module of the student network are passed through a softmax function to obtain the scores of the feature map; the scores are sorted from largest to smallest, and the top k_2 highest-scoring values are selected; the original, unsorted positions of these k_2 values in the feature map are taken as the attention region of the self-adaptive mask, and the remaining positions are assigned 0; the self-adaptive mask matrix is expressed by the following expression:
where the self-adaptive mask matrix corresponds to the output features F_Sn of the n-th student module of the student network, its non-zero entries are the original positions of the top k_2 highest-scoring values of F_Sn, and v and j are the horizontal and vertical coordinates of F_Sn; the self-adaptive mask matrix is used to mask the output features F_Sn of the n-th student module of the student network, obtaining the self-adaptive mask features.
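The top-k_1 / top-k_2 masking described above can be sketched in NumPy as follows; this is a simplified reading of the claim (with the softmax taken over all spatial positions), not the patented implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_mask(score_map, k):
    """Keep the top-k softmax scores at their ORIGINAL (unsorted) positions,
    assign 0 everywhere else."""
    s = softmax(score_map.ravel())
    top = np.argsort(s)[::-1][:k]       # indices of the k largest scores
    mask = np.zeros_like(s)
    mask[top] = 1.0
    return mask.reshape(score_map.shape)

def relation_weight(relation, k1):
    """Weight matrix: element-wise product of the self-adaptive mask matrix
    and the relation matrix it was derived from."""
    return adaptive_mask(relation, k1) * relation

rng = np.random.default_rng(0)
G = rng.random((4, 4))                  # toy relation matrix
W = relation_weight(G, k1=5)
print(int((adaptive_mask(G, 5) != 0).sum()))  # 5
```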
4. The method of claim 3, wherein the adaptive mask weighting based projective knowledge distillation method,
5. The projection-type knowledge distillation method based on self-adaptive mask weighting according to claim 3, wherein in step 5, a projection layer is constructed, and the teacher network is used to guide the corresponding projection layer so that the projections of the n-1 self-adaptive mask relation weighted features obtained by the student network approach the output features of the corresponding n-1 teacher modules, and the self-adaptive mask relation weighted projection loss is calculated; the projection of the n-th self-adaptive mask feature is made to approach the output features of the n-th teacher module, and the projection loss of the self-adaptive mask features is calculated, specifically as follows:
First, a projection layer is constructed from convolution blocks and a ReLU function; its structure is a 3×3 convolution block, a ReLU function layer and a 3×3 convolution block connected in sequence. Then the self-adaptive mask relation weighted features are input into the projection layer, and under the guidance of the features F_Ti extracted by the corresponding teacher module, the student network is forced to project relation projection features whose shape and size approximate F_Ti. Finally, the self-adaptive mask relation weighted projection loss L_admp1 between the output features F_Ti of the corresponding teacher module and the projection features is calculated by the following formula:
where F_Ti denotes the features extracted by the i-th teacher module divided from the teacher network, the projection features are obtained by self-adaptive mask relation weighting and then projecting the features F_Si extracted by the i-th student module divided from the student network, c denotes the channel index, h denotes a pixel position in the height dimension, and w denotes a pixel position in the width dimension;
Similarly, the self-adaptive mask features are input into the projection layer, and under the guidance of the features F_Tn extracted by the corresponding teacher module, the student network is forced to project mask projection features whose shape and size approximate F_Tn; finally, the projection loss L_admp2 of the self-adaptive mask features between the features F_Tn extracted by the corresponding teacher module and the mask projection features is calculated by the following formula:
reconstructing the adaptive mask projection loss of the adaptive mask relation matrix weighted projection type knowledge distillation method into:
L admp =α 1 L admp12 L admp2
alpha in the formula 1 Is a weight super-parameter, alpha, for adjusting the projection loss weighted by the adaptive mask relation 2 Is a weight super-parameter that adjusts the projected penalty of the adaptive mask feature.
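The projection layer named in claim 5 (3×3 convolution → ReLU → 3×3 convolution) can be sketched in plain NumPy. The loss below is an assumed mean-squared distance between teacher features and projected student features, since the closed-form expressions for L_admp1 and L_admp2 are not reproduced in this text:

```python
import numpy as np

def conv3x3(x, w):
    """'Same'-padded 3x3 convolution; x: (H, W, Cin), w: (3, 3, Cin, Cout)."""
    H, W_, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W_, w.shape[3]))
    for i in range(H):
        for j in range(W_):
            out[i, j] = np.einsum('abc,abcd->d', xp[i:i+3, j:j+3, :], w)
    return out

def projection(x, w1, w2):
    """Projection layer from the claim: 3x3 conv -> ReLU -> 3x3 conv."""
    return conv3x3(np.maximum(conv3x3(x, w1), 0.0), w2)

def mse_loss(teacher_feat, proj_feat):
    """Assumed mean-squared projection loss over channel and spatial dims."""
    return float(np.mean((teacher_feat - proj_feat) ** 2))

rng = np.random.default_rng(0)
x  = rng.standard_normal((6, 6, 4))          # toy student feature (H, W, C_s)
w1 = rng.standard_normal((3, 3, 4, 8)) * 0.1
w2 = rng.standard_normal((3, 3, 8, 8)) * 0.1
p  = projection(x, w1, w2)
print(p.shape)  # (6, 6, 8) -- matches a teacher feature of the same size
```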
6. The projection-type knowledge distillation method based on self-adaptive mask weighting according to claim 4, wherein α_1 = 0.000007 and α_2 = 0.0000003.
7. The projection-type knowledge distillation method based on self-adaptive mask weighting according to claim 5, wherein in step 6, the output features of the n-th teacher module of the teacher network and the output features of the n-th student module of the student network are used to calculate the distillation loss of the traditional distillation method; the total distillation loss is calculated from the traditional distillation loss and the self-adaptive mask weighted projection loss, the network parameters of the student network are updated according to the total distillation loss, and the trained student network is finally obtained, specifically as follows:
The loss L_classical of the most traditional feature-based knowledge distillation method is expressed as:
where F_Tn denotes the output features of the n-th, i.e. last, of the n teacher modules divided from the teacher network, and F_Sn denotes the output features of the n-th, i.e. last, of the n student modules divided from the student network;
the total loss can be expressed as:
L_totally = L_admp + L_classical
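The loss combination in claims 5-7 can be sketched as follows, using the α_1 and α_2 values from claim 6 as defaults and assuming L_classical is a mean-squared error between the last teacher and last student module features (the closed form of L_classical is not reproduced in this text):

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def total_loss(F_Tn, F_Sn, L_admp1, L_admp2, alpha1=0.000007, alpha2=0.0000003):
    """L_totally = L_admp + L_classical, with
    L_admp = alpha1 * L_admp1 + alpha2 * L_admp2 (claims 5 and 6) and
    L_classical assumed to be an MSE between last-module features."""
    L_admp = alpha1 * L_admp1 + alpha2 * L_admp2
    L_classical = mse(F_Tn, F_Sn)
    return L_admp + L_classical

t = np.ones((4, 4, 2))   # toy last teacher-module features F_Tn
s = np.zeros((4, 4, 2))  # toy last student-module features F_Sn
loss = total_loss(t, s, L_admp1=1.0, L_admp2=1.0)
print(round(loss, 7))  # 1.0000073
```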
8. the method of claim 1, wherein n=4.
CN202311530381.XA 2023-11-16 2023-11-16 Projection type knowledge distillation method based on self-adaptive mask weighting Pending CN117454971A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311530381.XA CN117454971A (en) 2023-11-16 2023-11-16 Projection type knowledge distillation method based on self-adaptive mask weighting

Publications (1)

Publication Number Publication Date
CN117454971A true CN117454971A (en) 2024-01-26

Family

ID=89579826


Country Status (1)

Country Link
CN (1) CN117454971A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118278501A (en) * 2024-05-31 2024-07-02 安徽农业大学 Feature distillation method based on teacher classifier sharing and projection integration



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination