CN117454971A - Projection type knowledge distillation method based on self-adaptive mask weighting - Google Patents


Publication number
CN117454971A
Authority
CN
China
Prior art keywords: student, network, self-adaptive mask, teacher
Prior art date
Legal status
Pending
Application number
CN202311530381.XA
Other languages
Chinese (zh)
Inventor
王军
秦新芳
李玉莲
申政文
陈世海
Current Assignee
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT filed Critical China University of Mining and Technology CUMT
Priority to CN202311530381.XA
Publication of CN117454971A
Legal status: Pending

Classifications

    • G06N3/096 Transfer learning
    • G06N3/045 Combinations of networks
    • G06N3/0495 Quantised networks; Sparse networks; Compressed networks
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/806 Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82 Arrangements for image or video recognition or understanding using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a projection type knowledge distillation method based on self-adaptive mask weighting. First, relation matrices are constructed from the features extracted by the student modules of the student network, so that adjacent pixels carry richer and more varied information. Second, self-adaptive mask matrices are built from the relation matrices and from the student network's feature maps, and used to apply self-adaptive mask weighting. A projection layer is then constructed; guided by the teacher network, it projects the mask-weighted features into complete features that approach the teacher features. Finally, each feature layer of the teacher network supervises the corresponding feature layer of the student network, and the student model is updated. The invention improves the student network's ability to express the rich information it has learned, solves the limited student-network characterization capability and insufficient information utilization caused by randomly masking the student features and by the limited receptive fields of adjacent pixels in those features, and improves the robustness and generalization capability of the knowledge distillation model.

Description

Projection type knowledge distillation method based on self-adaptive mask weighting
Technical Field
The invention relates to the field of computer vision, in particular to a projection type knowledge distillation method based on self-adaptive mask weighting.
Background
Deep convolutional neural networks are widely used across computer vision tasks. In general, the larger the model, the better the performance, but the slower the inference, which makes such models hard to deploy where resources are limited. Knowledge distillation was proposed to overcome this problem. Current feature-based distillation methods typically make the student mimic the teacher's features as closely as possible, so that the student features gain stronger characterization capability.
In Masked Generative Distillation, Yang et al. propose that improving the student's characterization ability need not be achieved by directly mimicking the teacher. Starting from this observation, they recast the imitation task as a generation task: during distillation, the student's features are randomly masked, and the student must generate the stronger teacher features from its own weaker ones, thereby improving its characterization ability. However, random masking makes the masked regions of the feature map too arbitrary, which hurts the subsequent recovery of those regions from their neighboring pixels; moreover, masking the features directly limits the receptive fields of adjacent pixels in the student features, so the complete features cannot be recovered effectively, i.e. the characterization capability of the student network remains limited.
Disclosure of Invention
The invention aims to provide a projection type knowledge distillation method based on self-adaptive mask weighting that solves the limited student-network characterization capability and insufficient information utilization caused by randomly masking the student features and by the limited receptive fields of adjacent pixels in those features, and that improves the robustness and generalization capability of the knowledge distillation model.
The technical solution realizing the purpose of the invention is as follows: a projection type knowledge distillation method based on self-adaptive mask weighting, comprising the following steps:
Step 1: randomly acquire K labeled images from the CIFAR-100 data set, with 10000 < K ≤ 60000; normalize the K images and unify the pixel size to $h_0 \times w_0$, where $h_0$ is the image height and $w_0$ the image width. Randomly split the resized images into a training set and a test set at a ratio of 5:1, apply data enhancement to the training set to form the teacher-student training set, pre-train the teacher network on this set to obtain the pre-trained teacher network, and go to step 2.
Step 2: divide the teacher network into n teacher modules according to convolution-layer depth and feature-map size, divide the student network into n student modules in the same way, and go to step 3.
Step 3: construct n-1 relation matrices from the output features of the n student modules of the student network, and go to step 4.
Step 4: construct the corresponding self-adaptive mask matrices from the relation matrices built in step 3, and use them to apply self-adaptive mask relation weighting to the output features of the first n-1 student modules, obtaining the first n-1 self-adaptive mask relation weighted features; apply self-adaptive masking to the output feature of the n-th student module to obtain the self-adaptive mask feature, and go to step 5.
Step 5: construct projection layers guided by the teacher network, make the projections of the n-1 self-adaptive mask relation weighted features approach the output features of the corresponding n-1 teacher modules, and compute the self-adaptive mask relation weighted projection loss; make the projection of the n-th self-adaptive mask feature approach the output feature of the n-th teacher module, compute the projection loss of the self-adaptive mask feature, and go to step 6.
Step 6: compute the distillation loss of the conventional distillation method from the output features of the n-th teacher module and the n-th student module; combine the conventional distillation loss with the self-adaptive mask weighted projection losses into the total distillation loss, update the student network's parameters according to the total loss to obtain the trained student network, and go to step 7.
Step 7: input the test data set into the trained student network, output the prediction corresponding to each test sample, and evaluate the accuracy of the trained student network.
Compared with the prior art, the invention has the advantages that:
1) Compared with existing knowledge distillation methods, the invention focuses on improving the expressive power of the student model: under the guidance of the teacher model, the student fully mines and expresses the rich information contained in the dual features of the relation matrix and the feature map, while alleviating both the under-utilization of feature knowledge and the large gap in expressive power between the student and teacher models.
2) The invention is the first to construct a projection type distillation model that applies self-adaptive mask relation weighting and self-adaptive mask output-feature weighting to the features extracted at each stage of the student network, solving the limited student-network characterization capability and insufficient information utilization caused by randomly masking the student features and by the limited receptive fields of adjacent pixels, and improving the robustness and generalization capability of the knowledge distillation model.
Drawings
FIG. 1 is a model diagram of the projection type knowledge distillation method based on self-adaptive mask weighting of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.
Referring to fig. 1, a projection type knowledge distillation method based on self-adaptive mask weighting includes the following steps:
Step 1: randomly acquire K labeled images from the CIFAR-100 data set, with 10000 < K ≤ 60000; normalize the K images and unify the pixel size to $h_0 \times w_0$, where $h_0$ is the image height and $w_0$ the image width. Randomly split the resized images into a training set and a test set at a ratio of 5:1, apply data enhancement to the training set to form the teacher-student training set, pre-train the teacher network on this set to obtain the pre-trained teacher network, and go to step 2.
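The 5:1 random partition of step 1 can be sketched as follows; this is a minimal NumPy illustration of the split only (CIFAR-100 loading, normalization and augmentation are omitted, and the function name `make_split` is ours, not the patent's):

```python
import numpy as np

def make_split(num_images=60000, ratio=(5, 1), seed=0):
    """Randomly split image indices into training / test sets at 5:1,
    as in step 1 (only the split itself is sketched here)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_images)
    n_train = num_images * ratio[0] // (ratio[0] + ratio[1])
    return idx[:n_train], idx[n_train:]

train_idx, test_idx = make_split()
print(len(train_idx), len(test_idx))  # 50000 10000
```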
Step 2: divide the teacher network into n teacher modules according to convolution-layer depth and feature-map size, divide the student network into n student modules in the same way, extract the features of each stage, and go to step 3.
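The module division of step 2 can be illustrated with a toy sketch; the even split and the callable "layers" below are illustrative stand-ins (the patent divides by convolution-layer depth and feature-map size, which this sketch does not model):

```python
def split_into_modules(layers, n):
    """Divide an ordered list of layers into n consecutive modules,
    as the patent divides teacher and student networks into n stages."""
    k, r = divmod(len(layers), n)
    sizes = [k + (1 if i < r else 0) for i in range(n)]
    modules, start = [], 0
    for s in sizes:
        modules.append(layers[start:start + s])
        start += s
    return modules

def forward_stages(x, modules):
    # Run the input through each module, keeping every stage output
    # (these per-stage features feed the relation matrices in step 3).
    feats = []
    for module in modules:
        for layer in module:
            x = layer(x)
        feats.append(x)
    return feats

layers = [lambda v, i=i: v + i for i in range(10)]  # toy "layers"
mods = split_into_modules(layers, 4)
feats = forward_stages(0, mods)
print(len(mods), len(feats))  # 4 4
```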
Step 3, construct n-1 relation matrices from the output features of the n student modules of the student network, as follows:
First, define the output feature of the i-th student module as $F_{S_i} \in \mathbb{R}^{H \times W \times C}$, $1 \le i \le n$, where S denotes the student network and H, W and C denote the height, width and channel dimension of the output feature; define the output feature of the i-th teacher module as $F_{T_i}$, $1 \le i \le n$, where T denotes the teacher network. Then sparsely sample $F_{S_i}$ with dilated (atrous) convolution to obtain the feature map $\tilde{F}_{S_i}$, and fuse $F_{S_i}$ with $\tilde{F}_{S_i}$ to enlarge the common receptive field shared by adjacent pixels, which helps the projection layer project the masked feature pixels; the fused feature is denoted $\hat{F}_{S_i}$. Finally, build the relation matrix $G_N$ from the fused features $\hat{F}_{S_i}$ and $\hat{F}_{S_{i+1}}$.
That is, $G_N$ is the relation matrix constructed from the fused output feature of the i-th student module and the fused output feature of the (i+1)-th student module, with $1 \le i \le n-1$ and $1 \le N \le n-1$; h denotes a pixel position along the height dimension and w a pixel position along the width dimension. The relation matrix ties adjacent pixels more closely together and increases their overlapping receptive fields, which benefits the projection of the masked pixels.
Go to step 4.
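Since the patent's exact formula for $G_N$ appears only in the unreproduced figure, the following NumPy sketch shows one plausible reading of step 3: dilated sparse sampling, feature fusion, and a per-pixel channel inner product between the fused features of consecutive modules. The function names, the fixed-stride sampling, and the inner-product choice are all assumptions, not the patent's definitions:

```python
import numpy as np

def dilated_sample(feat, rate=2):
    """Sparsely sample a (C, H, W) feature map, as a stand-in for the
    dilated-convolution sampling in the patent (the real method uses a
    learned dilated conv; this fixed sampling is only illustrative)."""
    sparse = np.zeros_like(feat)
    sparse[:, ::rate, ::rate] = feat[:, ::rate, ::rate]
    return sparse

def fuse(feat, rate=2):
    # Fuse the original feature with its sparsely sampled version to
    # enlarge the receptive field shared by adjacent pixels.
    return feat + dilated_sample(feat, rate)

def relation_matrix(f_a, f_b):
    """Hypothetical relation matrix G_N between the fused features of
    student modules i and i+1: per-pixel inner product over channels."""
    return np.einsum('chw,chw->hw', f_a, f_b)

rng = np.random.default_rng(0)
f1 = rng.standard_normal((8, 4, 4))   # output of student module i
f2 = rng.standard_normal((8, 4, 4))   # module i+1, resized to match
G = relation_matrix(fuse(f1), fuse(f2))
print(G.shape)  # one relation value per spatial position
```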
Step 4, construct the corresponding self-adaptive mask matrices from the relation matrices built in step 3, use them to apply self-adaptive mask relation weighting to the output features of the first n-1 student modules of the student network, obtaining the first n-1 self-adaptive mask relation weighted features, and apply self-adaptive mask weighting separately to the output feature of the n-th student module, obtaining the self-adaptive mask weighted feature. The details are as follows:
First, score the relation matrix with a softmax function, sort the entries by score in descending order, and select the top $k_1$ values; these $k_1$ values, kept at their original, unsorted positions in the map, form the attention region of the self-adaptive mask relation, and the remaining positions are assigned 0. The self-adaptive mask relation matrix is therefore

$$M^{G_N}_{v,j} = \begin{cases} \mathrm{softmax}(G_N)_{v,j}, & (v,j) \text{ among the top-}k_1 \text{ positions} \\ 0, & \text{otherwise} \end{cases}$$

where $M^{G_N}$ is the self-adaptive mask matrix corresponding to the relation matrix $G_N$, the top-$k_1$ scores keep their original positions, and v and j are the horizontal and vertical coordinates of the relation matrix. Compared with the earlier random masking operation, this self-adaptive mask matrix, proposed here for the first time, is more targeted: retaining the high-scoring entries as weights gives the target features a higher proportion, so important features are projected better.
Then mask the corresponding relation matrix with the self-adaptive mask matrix to obtain the weight matrix used for self-adaptive mask relation weighting, $W^{G_N} = G_N \odot M^{G_N}$, where $\odot$ is the Hadamard product.
Finally, use the weight matrix $W^{G_N}$ to apply self-adaptive mask weighting to the feature $F_{S_i}$ extracted by the i-th student module, obtaining the self-adaptive mask relation weighted feature.
Similarly, pass the output feature $F_{S_n}$ of the n-th student module through a softmax function to obtain its score map, sort the entries by score in descending order, and select the top $k_2$ values; these keep their original, unsorted positions in the feature map as the attention region of the self-adaptive mask, and the remaining positions are assigned 0:

$$M^{F_{S_n}}_{v,j} = \begin{cases} \mathrm{softmax}(F_{S_n})_{v,j}, & (v,j) \text{ among the top-}k_2 \text{ positions} \\ 0, & \text{otherwise} \end{cases}$$

where $M^{F_{S_n}}$ is the self-adaptive mask matrix corresponding to the output feature $F_{S_n}$ of the n-th student module, and v and j are the horizontal and vertical coordinates of $F_{S_n}$. Masking $F_{S_n}$ with this self-adaptive mask matrix yields the self-adaptive mask feature.
Go to step 5.
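A minimal NumPy sketch of the self-adaptive mask of step 4, assuming the mask keeps the top-k softmax scores at their original positions and is combined with the relation matrix by a Hadamard product (function names are ours, and k plays the role of $k_1$ / $k_2$):

```python
import numpy as np

def adaptive_mask(score_map, k):
    """Self-adaptive mask: keep the top-k softmax scores at their
    original positions, zero elsewhere."""
    flat = score_map.ravel()
    probs = np.exp(flat - flat.max())
    probs /= probs.sum()               # softmax scores
    mask = np.zeros_like(probs)
    top = np.argsort(probs)[::-1][:k]  # indices of the k largest scores
    mask[top] = probs[top]             # scores stay at original positions
    return mask.reshape(score_map.shape)

rng = np.random.default_rng(1)
G = rng.standard_normal((4, 4))        # a relation matrix
M = adaptive_mask(G, k=5)              # self-adaptive mask matrix
W = G * M                              # Hadamard product -> weight matrix
print(int((M > 0).sum()))              # 5 positions survive the mask
```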
Step 5, construct projection layers, guide each projection layer with the teacher network so that the projections of the n-1 self-adaptive mask relation weighted features obtained by the student network approach the output features of the corresponding n-1 teacher modules, and compute the self-adaptive mask relation weighted projection loss; make the projection of the n-th self-adaptive mask weighted feature approach the output feature of the n-th teacher module and compute the projection loss of the self-adaptive mask feature. The details are as follows:
First, the projection layer is built from convolution blocks and ReLU functions; its structure is a 3×3 convolution block, a ReLU layer and a 3×3 convolution block connected in sequence. Then feed the self-adaptive mask relation weighted feature into the projection layer; under the guidance of the feature $F_{T_i}$ extracted by the corresponding teacher module, the student network is forced to project a relation projection feature $P_{S_i}$ whose shape and size approximate $F_{T_i}$. Finally, compute the self-adaptive mask relation weighted projection loss $L_{admp1}$ between the output feature $F_{T_i}$ of the corresponding teacher module and the projection feature $P_{S_i}$:

$$L_{admp1} = \sum_{c}\sum_{h}\sum_{w}\big(F_{T_i}(c,h,w) - P_{S_i}(c,h,w)\big)^2$$

where $F_{T_i}$ is the feature extracted by the i-th teacher module of the divided teacher network, $P_{S_i}$ is the feature obtained by projecting the self-adaptive mask relation weighted feature of the i-th student module, c indexes the channel, h the pixel position along the height dimension and w the pixel position along the width dimension.
Similarly, feed the self-adaptive mask weighted feature into the projection layer; under the guidance of the feature $F_{T_n}$ extracted by the corresponding teacher module, the student network is forced to project a mask projection feature $P_{S_n}$ whose shape and size approximate $F_{T_n}$. Finally, compute the projection loss of the self-adaptive mask feature between $F_{T_n}$ and $P_{S_n}$:

$$L_{admp2} = \sum_{c}\sum_{h}\sum_{w}\big(F_{T_n}(c,h,w) - P_{S_n}(c,h,w)\big)^2$$

The projection loss of the self-adaptive mask weighted projection type knowledge distillation method is then assembled as

$$L_{admp} = \alpha_1 L_{admp1} + \alpha_2 L_{admp2}$$

where $\alpha_1$ is a weight hyper-parameter adjusting the self-adaptive mask relation weighted projection loss and $\alpha_2$ a weight hyper-parameter adjusting the projection distillation loss of the self-adaptive mask feature. This loss sub-module corrects the deviation between the projections of the masked relation matrices and output features of the student modules and the output features of the corresponding teacher modules, so that the teacher network achieves a better guiding effect and the student network, under the teacher's guidance, better mines and fully exploits the information it has learned.
Go to step 6.
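The projection layer of step 5 (3×3 convolution, ReLU, 3×3 convolution) can be sketched with a naive NumPy convolution; the random weights below are placeholders for the learned parameters:

```python
import numpy as np

def conv3x3(x, w):
    """Naive 3x3 'same' convolution on a (C_in, H, W) tensor with
    weights of shape (C_out, C_in, 3, 3)."""
    c_out = w.shape[0]
    _, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(x.shape[0]):
            for dh in range(3):
                for dw in range(3):
                    out[o] += w[o, i, dh, dw] * xp[i, dh:dh+H, dw:dw+W]
    return out

def projection_layer(x, w1, w2):
    # 3x3 conv -> ReLU -> 3x3 conv, as described in the patent.
    return conv3x3(np.maximum(conv3x3(x, w1), 0.0), w2)

rng = np.random.default_rng(2)
feat = rng.standard_normal((4, 8, 8))         # masked student feature
w1 = rng.standard_normal((4, 4, 3, 3)) * 0.1  # hypothetical weights
w2 = rng.standard_normal((6, 4, 3, 3)) * 0.1  # maps to teacher width 6
proj = projection_layer(feat, w1, w2)
print(proj.shape)  # (6, 8, 8): teacher-shaped projection
```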
Step 6, compute the distillation loss of the conventional distillation method from the output features of the n-th teacher module of the teacher network and the n-th student module of the student network; combine the conventional distillation loss with the self-adaptive mask weighted projection loss into the total distillation loss, update the student network's parameters according to the total loss, and obtain the trained student network. The details are as follows:
The loss of the most traditional feature-based knowledge distillation method is

$$L_{classical} = \sum_{c}\sum_{h}\sum_{w}\big(F_{T_n}(c,h,w) - F_{S_n}(c,h,w)\big)^2$$

where $F_{T_n}$ is the output feature of the n-th (i.e. last) of the n teacher modules of the divided teacher network, and $F_{S_n}$ is the output feature of the n-th (i.e. last) of the n student modules of the divided student network.
The total loss is then

$$L_{total} = L_{admp} + L_{classical}$$

Go to step 7.
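Assuming the sum-of-squared-differences loss form used above, the total loss of step 6 can be sketched as (function names and signatures are ours):

```python
import numpy as np

def sq_loss(a, b):
    # Sum over (c, h, w) of squared differences -- the loss form
    # assumed here for every term.
    return float(((a - b) ** 2).sum())

def total_loss(teacher_feats, student_projs, student_last, a1=7e-6, a2=3e-7):
    """Total distillation loss: L_admp (alpha-weighted projection
    losses) plus L_classical (last teacher vs. last student feature).
    teacher_feats[i] guides student_projs[i]; the last entries belong
    to the n-th modules."""
    l_admp1 = sum(sq_loss(t, p)
                  for t, p in zip(teacher_feats[:-1], student_projs[:-1]))
    l_admp2 = sq_loss(teacher_feats[-1], student_projs[-1])
    l_classical = sq_loss(teacher_feats[-1], student_last)
    return a1 * l_admp1 + a2 * l_admp2 + l_classical

t = [np.ones((2, 3, 3)) for _ in range(4)]
p = [np.ones((2, 3, 3)) for _ in range(4)]   # perfect projections
s_last = np.zeros((2, 3, 3))                  # raw student output
print(total_loss(t, p, s_last))  # only the classical term: 18.0
```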
Step 7: input the test data set into the trained student network, output the prediction corresponding to each test sample, and evaluate the accuracy of the trained student network.
Example 1
Referring to fig. 1, the projection type knowledge distillation method based on self-adaptive mask weighting of the invention comprises the following steps:
Step 1: randomly collect 60000 labeled images from the CIFAR-100 data set, normalize them, and unify the pixel size to 32×32; randomly split the resized images into a training set and a test set at a ratio of 5:1 and apply data enhancement to the training set to form the teacher-student training set; pre-train the teacher network on this set to obtain the teacher network. The data enhancement comprises image scaling and random flipping: images are scaled inward and outward by 10% of the original size, the random flip angle lies between -20° and 20°, and the number of image classes is 100.
Step 2: divide the teacher network into 4 teacher modules according to convolution-layer depth and feature-map size, divide the student network into 4 student modules, and go to step 3.
Step 3: construct 3 relation matrices from the output features of the 4 student modules of the student network, and go to step 4.
Step 4: construct the corresponding self-adaptive mask matrices from the relation matrices built in step 3, and use them to apply self-adaptive mask relation weighting to the output features of the first 3 student modules, obtaining the first 3 self-adaptive mask relation weighted features; apply self-adaptive mask weighting to the output feature of the 4th student module to obtain the self-adaptive mask feature, and go to step 5.
Step 5: construct projection layers guided by the teacher network, make the projections of the 3 self-adaptive mask relation weighted features approach the output features of the 3 corresponding teacher modules, and compute the self-adaptive mask relation weighted projection loss; make the projection of the 4th self-adaptive mask weighted feature approach the output feature of the 4th teacher module, compute the projection loss of the self-adaptive mask feature, and go to step 6.
Step 6: compute the conventional distillation loss from the output features of the 4th teacher module and the 4th student module; combine it with the self-adaptive mask weighted projection loss into the total distillation loss, update the student network's parameters accordingly to obtain the trained student network, and go to step 7.
Step 7: input the test data set into the trained student network, output the prediction corresponding to each test sample, and evaluate the accuracy of the trained student network.
The method of the invention builds the network framework with the python programming language and the pytorch framework on an Nvidia 2080Ti GPU host and conducts the related experiments. For the classification task, the loss is the sum of the conventional knowledge distillation loss and the self-adaptive mask weighted projection loss. Two hyper-parameters $\alpha_1$ and $\alpha_2$ balance the distillation losses; they are set to $\alpha_1 = 0.000007$ and $\alpha_2 = 0.0000003$ for the classification experiments. All models are trained for 240 epochs with the SGD optimizer, momentum 0.9 and weight decay 0.0001. The learning rate is initialized to 0.025 and decayed every 30 epochs. Training multiple times on the training set yields the projection type knowledge distillation model based on self-adaptive mask weighting.
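The training schedule can be expressed as a step decay; note that the decay factor 0.1 below is an assumption, since the patent states only that the rate decays every 30 epochs:

```python
def learning_rate(epoch, base_lr=0.025, step=30, gamma=0.1):
    """Step learning-rate schedule matching the stated setup: start
    at 0.025 and decay every 30 epochs. gamma=0.1 is an assumption,
    not a value given in the patent."""
    return base_lr * gamma ** (epoch // step)

schedule = [learning_rate(e) for e in (0, 29, 30, 60, 239)]
print(schedule[0], schedule[2])  # 0.025 before the first decay step
```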
To demonstrate the superior performance of the algorithm, the invention selects knowledge distillation algorithms popular in recent years as comparison models; the results of the comparison experiments, with ResNet-32×4 as the teacher network and ResNet-8×4 as the student network under otherwise identical objective conditions (same data set, same equipment, etc.), are shown in Table 1:
Table 1. Results of the comparison experiments under identical objective conditions (same data set, etc.)
From the experimental results, the effectiveness of the method of the present invention can be seen.

Claims (8)

1. A projection type knowledge distillation method based on self-adaptive mask weighting, characterized by comprising the following steps:
step 1: randomly acquire K labeled images from the CIFAR-100 data set, with 10000 < K ≤ 60000; normalize the K images and unify the pixel size to $h_0 \times w_0$, where $h_0$ is the image height and $w_0$ the image width; randomly split the resized images into a training set and a test set at a ratio of 5:1, apply data enhancement to the training set to form the teacher-student training set, pre-train the teacher network on this set to obtain the pre-trained teacher network, and go to step 2;
step 2, dividing a teacher network into n teacher modules according to the depth of the convolution layer and the size of the feature map, dividing a student network into n student modules, and turning to step 3;
step 3, constructing n-1 relation matrixes based on output characteristics of n student modules of a student network, and turning to step 4;
step 4, constructing a corresponding self-adaptive mask matrix based on the relation matrix constructed in the step 3, and respectively carrying out self-adaptive mask relation weighting on the output characteristics of the first n-1 student modules of the student network by using the self-adaptive mask matrix to obtain first n-1 self-adaptive mask relation weighting characteristics; performing adaptive masking on the output characteristics of the nth student module of the student network to obtain adaptive masking characteristics, and turning to step 5;
step 5, constructing a projection layer, guiding the corresponding projection layer by using a teacher network, enabling projections of n-1 self-adaptive mask relation weighting characteristics obtained by a student network to approach output characteristics of corresponding n-1 teacher modules, and calculating the self-adaptive mask relation weighted projection loss; enabling the projection of the nth self-adaptive mask feature to approach to the output feature of the nth teacher module, calculating the projection loss of the self-adaptive mask feature, and turning to the step 6;
step 6, calculating distillation loss of the traditional distillation method by using the output characteristics of the nth teacher module of the teacher network and the output characteristics of the nth student module of the student network; calculating total distillation loss by using the traditional distillation loss and the projection loss weighted by the self-adaptive mask, updating network parameters of the student network according to the total distillation loss, finally obtaining a trained student network, and turning to step 7;
and 7, inputting the test data set into a trained student network, outputting a prediction result corresponding to each sample in the test set, and testing the accuracy of the trained student network.
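The data preparation in step 1 can be sketched as follows. This is a minimal NumPy sketch under illustrative assumptions (a toy K, images already at h_0 × w_0, and [0, 1] scaling standing in for full normalization and data enhancement), not the patented implementation:

```python
import numpy as np

def prepare_dataset(images, labels, h0=32, w0=32, seed=0):
    """Normalize images, unify the size to h0 x w0, and split 5:1."""
    rng = np.random.default_rng(seed)
    # Scale pixel values to [0, 1] (toy stand-in for full normalization).
    imgs = images.astype(np.float32) / 255.0
    # Here we assume the inputs already have shape (K, h0, w0, 3); a real
    # pipeline would resize every image to h0 x w0 first.
    assert imgs.shape[1:3] == (h0, w0)
    # Random 5:1 split into training and test sets.
    idx = rng.permutation(len(imgs))
    n_train = len(imgs) * 5 // 6
    tr, te = idx[:n_train], idx[n_train:]
    return (imgs[tr], labels[tr]), (imgs[te], labels[te])

K = 600  # toy value; the claim requires 10000 < K <= 60000
images = np.random.randint(0, 256, size=(K, 32, 32, 3), dtype=np.uint8)
labels = np.random.randint(0, 100, size=K)
(train_x, train_y), (test_x, test_y) = prepare_dataset(images, labels)
print(train_x.shape, test_x.shape)  # (500, 32, 32, 3) (100, 32, 32, 3)
```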
2. The method of claim 1, wherein in step 3, n-1 relation matrices are constructed based on output characteristics of n student modules of the student network, specifically as follows:
First, the output features of the i-th student module are defined as F_Si ∈ R^(H×W×C), where S denotes the student network and H, W and C denote the height, width and channel dimension of the output features, respectively; the output features of the i-th teacher module are defined as F_Ti, where T denotes the teacher network. Then dilated convolution is used to sparsely sample the features F_Si, obtaining a sparsely sampled feature map, which is fused with F_Si to enlarge the receptive field. Finally, the fused features of the i-th and (i+1)-th student modules are used to build the relation matrix G_N;
that is, G_N denotes the relation matrix constructed from the fused features of the i-th student module and the fused features of the (i+1)-th student module, where 1 ≤ i ≤ n-1 and 1 ≤ N ≤ n-1; h denotes a pixel position in the height dimension and w denotes a pixel position in the width dimension.
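The construction in claim 2 can be illustrated with a small NumPy sketch. The dilated-convolution sparse sampling is stood in for by strided subsampling with nearest-neighbour upsampling, the fusion by a simple sum, and the relation by a per-position channel inner product; all three choices are assumptions made for illustration, since the closed-form expression for G_N is not reproduced in this text:

```python
import numpy as np

def sparse_sample(feat, rate=2):
    """Toy stand-in for dilated-convolution sparse sampling: keep every
    `rate`-th pixel, then nearest-neighbour upsample back to the input size."""
    H, W, C = feat.shape
    sub = feat[::rate, ::rate, :]
    return np.repeat(np.repeat(sub, rate, axis=0), rate, axis=1)[:H, :W, :]

def fuse(feat):
    """Fuse the original and sparsely sampled features (here: a simple sum)."""
    return feat + sparse_sample(feat)

def relation_matrix(f_i, f_ip1):
    """Relation between the fused features of student modules i and i+1,
    taken here as a channel inner product at every spatial position (h, w)."""
    a, b = fuse(f_i), fuse(f_ip1)
    return np.einsum('hwc,hwc->hw', a, b)

f1 = np.random.rand(8, 8, 4)  # toy output features of student module i
f2 = np.random.rand(8, 8, 4)  # toy output features of student module i+1
G = relation_matrix(f1, f2)
print(G.shape)  # (8, 8)
```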
3. The projection-type knowledge distillation method based on self-adaptive mask weighting according to claim 2, wherein in step 4, a corresponding self-adaptive mask matrix is constructed based on the relation matrix constructed in step 3, and the self-adaptive mask matrix is used to perform self-adaptive mask relation weighting on the output features of the first n-1 student modules of the student network, obtaining the first n-1 self-adaptive mask relation weighted features; self-adaptive masking is performed on the output features of the n-th student module of the student network to obtain self-adaptive mask features, specifically as follows:
First, the relation matrix is passed through a softmax function to obtain the scores of the feature map; the scores are sorted from largest to smallest, and the top k_1 highest-scoring values are selected; the original, unsorted positions of these k_1 values in the feature map are taken as the attention region of the self-adaptive mask relation, and the remaining positions are assigned 0; the self-adaptive mask relation matrix is expressed by the following expression:
where the self-adaptive mask matrix corresponds to the relation matrix G_N, its non-zero entries are the original positions of the top k_1 highest-scoring values of the relation matrix, and v and j are the horizontal and vertical coordinates of the relation matrix, respectively;
then the self-adaptive mask matrix is used to mask the corresponding relation matrix, obtaining the weight matrix for self-adaptive mask relation weighting, where the masking is an element-wise (Hadamard) product;
finally, the weight matrix is used to perform self-adaptive mask relation weighting on the features F_Si extracted by the i-th student module, obtaining the self-adaptive mask relation weighted features.
Similarly, the output features of the n-th student module of the student network are passed through a softmax function to obtain the scores of the feature map; the scores are sorted from largest to smallest, and the top k_2 highest-scoring values are selected; the original, unsorted positions of these k_2 values in the feature map are taken as the attention region of the self-adaptive mask, and the remaining positions are assigned 0; the self-adaptive mask matrix is expressed by the following expression:
where the self-adaptive mask matrix corresponds to the output features F_Sn of the n-th student module of the student network, its non-zero entries are the original positions of the top k_2 highest-scoring values of F_Sn, and v and j are the horizontal and vertical coordinates of F_Sn; the self-adaptive mask matrix is used to mask the output features F_Sn of the n-th student module of the student network, obtaining the self-adaptive mask features.
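The top-k_1 / top-k_2 masking described above can be sketched in NumPy as follows; this is a simplified reading of the claim (with the softmax taken over all spatial positions), not the patented implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_mask(score_map, k):
    """Keep the top-k softmax scores at their ORIGINAL (unsorted) positions,
    assign 0 everywhere else."""
    s = softmax(score_map.ravel())
    top = np.argsort(s)[::-1][:k]       # indices of the k largest scores
    mask = np.zeros_like(s)
    mask[top] = 1.0
    return mask.reshape(score_map.shape)

def relation_weight(relation, k1):
    """Weight matrix: element-wise product of the self-adaptive mask matrix
    and the relation matrix it was derived from."""
    return adaptive_mask(relation, k1) * relation

rng = np.random.default_rng(0)
G = rng.random((4, 4))                  # toy relation matrix
W = relation_weight(G, k1=5)
print(int((adaptive_mask(G, 5) != 0).sum()))  # 5
```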
4. The method of claim 3, wherein the adaptive mask weighting based projective knowledge distillation method,
5. The projection-type knowledge distillation method based on self-adaptive mask weighting according to claim 3, wherein in step 5, a projection layer is constructed, and the teacher network is used to guide the corresponding projection layer so that the projections of the n-1 self-adaptive mask relation weighted features obtained by the student network approach the output features of the corresponding n-1 teacher modules, and the self-adaptive mask relation weighted projection loss is calculated; the projection of the n-th self-adaptive mask feature is made to approach the output features of the n-th teacher module, and the projection loss of the self-adaptive mask features is calculated, specifically as follows:
First, a projection layer is constructed from convolution blocks and a ReLU function; its structure is a 3×3 convolution block, a ReLU function layer and a 3×3 convolution block connected in sequence. Then the self-adaptive mask relation weighted features are input into the projection layer, and under the guidance of the features F_Ti extracted by the corresponding teacher module, the student network is forced to project relation projection features whose shape and size approximate F_Ti. Finally, the self-adaptive mask relation weighted projection loss L_admp1 between the output features F_Ti of the corresponding teacher module and the projection features is calculated by the following formula:
where F_Ti denotes the features extracted by the i-th teacher module divided from the teacher network, the projection features are obtained by self-adaptive mask relation weighting and then projecting the features F_Si extracted by the i-th student module divided from the student network, c denotes the channel index, h denotes a pixel position in the height dimension, and w denotes a pixel position in the width dimension;
Similarly, the self-adaptive mask features are input into the projection layer, and under the guidance of the features F_Tn extracted by the corresponding teacher module, the student network is forced to project mask projection features whose shape and size approximate F_Tn; finally, the projection loss L_admp2 of the self-adaptive mask features between the features F_Tn extracted by the corresponding teacher module and the mask projection features is calculated by the following formula:
reconstructing the adaptive mask projection loss of the adaptive mask relation matrix weighted projection type knowledge distillation method into:
L admp =α 1 L admp12 L admp2
alpha in the formula 1 Is a weight super-parameter, alpha, for adjusting the projection loss weighted by the adaptive mask relation 2 Is a weight super-parameter that adjusts the projected penalty of the adaptive mask feature.
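The projection layer named in claim 5 (3×3 convolution → ReLU → 3×3 convolution) can be sketched in plain NumPy. The loss below is an assumed mean-squared distance between teacher features and projected student features, since the closed-form expressions for L_admp1 and L_admp2 are not reproduced in this text:

```python
import numpy as np

def conv3x3(x, w):
    """'Same'-padded 3x3 convolution; x: (H, W, Cin), w: (3, 3, Cin, Cout)."""
    H, W_, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W_, w.shape[3]))
    for i in range(H):
        for j in range(W_):
            out[i, j] = np.einsum('abc,abcd->d', xp[i:i+3, j:j+3, :], w)
    return out

def projection(x, w1, w2):
    """Projection layer from the claim: 3x3 conv -> ReLU -> 3x3 conv."""
    return conv3x3(np.maximum(conv3x3(x, w1), 0.0), w2)

def mse_loss(teacher_feat, proj_feat):
    """Assumed mean-squared projection loss over channel and spatial dims."""
    return float(np.mean((teacher_feat - proj_feat) ** 2))

rng = np.random.default_rng(0)
x  = rng.standard_normal((6, 6, 4))          # toy student feature (H, W, C_s)
w1 = rng.standard_normal((3, 3, 4, 8)) * 0.1
w2 = rng.standard_normal((3, 3, 8, 8)) * 0.1
p  = projection(x, w1, w2)
print(p.shape)  # (6, 6, 8) -- matches a teacher feature of the same size
```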
6. The projection-type knowledge distillation method based on self-adaptive mask weighting according to claim 4, wherein α_1 = 0.000007 and α_2 = 0.0000003.
7. The projection-type knowledge distillation method based on self-adaptive mask weighting according to claim 5, wherein in step 6, the output features of the n-th teacher module of the teacher network and the output features of the n-th student module of the student network are used to calculate the distillation loss of the traditional distillation method; the total distillation loss is calculated from the traditional distillation loss and the self-adaptive mask weighted projection loss, the network parameters of the student network are updated according to the total distillation loss, and the trained student network is finally obtained, specifically as follows:
The loss L_classical of the most traditional feature-based knowledge distillation method is expressed as:
where F_Tn denotes the output features of the n-th, i.e. last, of the n teacher modules divided from the teacher network, and F_Sn denotes the output features of the n-th, i.e. last, of the n student modules divided from the student network;
the total loss can be expressed as:
L_totally = L_admp + L_classical
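The loss combination in claims 5-7 can be sketched as follows, using the α_1 and α_2 values from claim 6 as defaults and assuming L_classical is a mean-squared error between the last teacher and last student module features (the closed form of L_classical is not reproduced in this text):

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def total_loss(F_Tn, F_Sn, L_admp1, L_admp2, alpha1=0.000007, alpha2=0.0000003):
    """L_totally = L_admp + L_classical, with
    L_admp = alpha1 * L_admp1 + alpha2 * L_admp2 (claims 5 and 6) and
    L_classical assumed to be an MSE between last-module features."""
    L_admp = alpha1 * L_admp1 + alpha2 * L_admp2
    L_classical = mse(F_Tn, F_Sn)
    return L_admp + L_classical

t = np.ones((4, 4, 2))   # toy last teacher-module features F_Tn
s = np.zeros((4, 4, 2))  # toy last student-module features F_Sn
loss = total_loss(t, s, L_admp1=1.0, L_admp2=1.0)
print(round(loss, 7))  # 1.0000073
```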
8. the method of claim 1, wherein n=4.
CN202311530381.XA 2023-11-16 2023-11-16 Projection type knowledge distillation method based on self-adaptive mask weighting Pending CN117454971A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311530381.XA CN117454971A (en) 2023-11-16 2023-11-16 Projection type knowledge distillation method based on self-adaptive mask weighting

Publications (1)

Publication Number Publication Date
CN117454971A true CN117454971A (en) 2024-01-26

Family

ID=89579826


Country Status (1)

Country Link
CN (1) CN117454971A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118278501A (en) * 2024-05-31 2024-07-02 安徽农业大学 Feature distillation method based on teacher classifier sharing and projection integration



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination