CN114626504A - Model compression method based on group relation knowledge distillation - Google Patents

Model compression method based on group relation knowledge distillation

Info

Publication number
CN114626504A
CN114626504A (application CN202210030247.2A)
Authority
CN
China
Prior art keywords
image
network
group
sample
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210030247.2A
Other languages
Chinese (zh)
Inventor
杨赛
杨慧
周伯俊
胡彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University filed Critical Nantong University
Priority to CN202210030247.2A
Publication of CN114626504A
Legal status: Withdrawn

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a model compression method based on group relation knowledge distillation. After a data set is preprocessed, a large-capacity convolutional neural network is randomly initialized as the teacher network and pre-trained with a cross-entropy loss function. In the knowledge distillation stage, a small-capacity convolutional neural network is randomly initialized as the student network; the teacher and student networks each cluster the image sample features with K-means, the relations among the groups are computed with the maximum mean discrepancy, and the student network is trained with a weighted sum of the cross entropy and a group relation loss function. Finally, the trained network makes classification decisions on test images. The method guides the student network to imitate the teacher's ability to group samples, so that the performance of the student network approaches that of the teacher network.

Description

Model compression method based on group relation knowledge distillation
Technical Field
The invention relates to a model compression method based on group relation knowledge distillation, and belongs to the technical field of computer vision.
Background
In recent years, deep convolutional neural networks have achieved unprecedented success in many artificial intelligence fields, such as computer vision, natural language processing, and speech recognition. However, these successes depend heavily on powerful computing capability and large memory resources, so deep convolutional networks cannot be widely deployed in embedded and mobile systems. To reduce computation cost while maintaining excellent performance, knowledge distillation was proposed to transfer the knowledge learned by a large-scale teacher network with many parameters to a small-scale student network with few parameters, so that a model of lower complexity approaches or even exceeds the performance of the complex network.
Knowledge distillation is an effective model compression technique that aims to improve the performance of a lightweight student network by transferring knowledge from a pre-trained large-capacity teacher network. The core idea is to train a complex teacher network in advance and then train a smaller student network using both the teacher's outputs and the true data labels. However, the output of the pre-trained teacher network contains information similar to the true labels, so the knowledge exploited during distillation is limited. Subsequently, several distillation methods based on the feature knowledge implicit in intermediate network layers were proposed. For example, Romero et al. propose directly matching the feature outputs of the student and teacher networks by minimizing the L2 distance between peer-level features. However, requiring the student network to imitate the teacher's features in full is too strict a condition and harms the model's performance and convergence. Follow-up work on feature knowledge therefore studies how to encode features efficiently so as to close the gap between the student and teacher networks. Zagoruyko et al. convert feature maps into spatial attention maps and guide the student network to learn the teacher network's attention regions; Yim et al. propose using the Gram matrix between adjacent-layer features within the same residual block as knowledge. Although these works achieve good performance, they ignore the relationships between different samples. Recent work has therefore focused on using the relationships between sample features for knowledge distillation.
For example, Tung et al. compute the similarity between different sample features to obtain a similarity matrix and, through training, keep the similarities between sample features consistent between the student and teacher networks. Zhu et al. propose using both the features and their gradient information to obtain better relational knowledge.
Knowledge distillation methods based on sample feature relationships achieve good performance when compressing models to small capacities, and several works have modeled the relationships between sample features. However, current work only establishes relationships between individual samples and neglects the group relationships among samples. The group relationships established according to the similarity between samples are an important form of knowledge and matter greatly for improving model performance.
Disclosure of Invention
In view of the problems in the prior art, the present invention provides a model compression method based on group relation knowledge distillation, so as to solve the above technical problems.
In order to achieve the purpose, the invention adopts the technical scheme that: a model compression method based on group relation knowledge distillation comprises the following steps:
step one: processing image data;
step two: training a teacher network;
step three: constructing teacher network group relation knowledge;
step four: constructing student network group relation knowledge;
step five: training a student network;
step six: testing the student network.
Further, the specific steps of processing the image data in the first step are as follows:
s11: randomly dividing a given image data set into three subsets, namely a training set, a verification set and a test set, used respectively for training the model, validating the hyper-parameters and testing the model's performance;
s12: the number of image classification classes is C; the m-th image sample in the training set is denoted x_m, with corresponding label y_m; the i-th image sample in the test set is denoted x_i.
Further, the training of the teacher network in the second step includes the specific steps of:
s21: randomly initializing a large-capacity convolutional neural network consisting of a plurality of convolutional layers, a plurality of fully-connected layers and a Softmax layer; the convolutional layers are represented as f^t(·; θ^t) and extract the features of image samples, where θ^t denotes the convolutional-layer parameters; the fully-connected layers are represented as g^t(·; W^t) and transform and classify the features, where W^t is the parameter matrix of the fully-connected layer; the Softmax layer converts the classification scores into classification probability output values;
S22: randomly taking M image samples from the training set and inputting them into the teacher network, the feature of the m-th image sample after the convolutional layers is:

$$F_m^t = f^t(x_m;\, \theta^t)$$
s23: the classification score obtained after the image sample's features pass through the fully-connected layer is expressed as:

$$z_m^t = g^t(F_m^t;\, W^t) = (W^t)^{\mathrm T} F_m^t$$

where W^t = [w_1^t, …, w_C^t] collects the weight vectors of the teacher network's fully-connected layer; there are C of them, consistent with the number of image classification categories;
s24: the Softmax layer converts the classification score into the probability output value of belonging to the c-th class as follows:

$$p^t(y_m = c \mid x_m) = \frac{\exp(z_{m,c}^t)}{\sum_{j=1}^{C} \exp(z_{m,j}^t)}$$

where z_{m,j}^t is the j-th component of the classification score z_m^t, produced by the j-th weight vector w_j^t of the classifier parameter matrix W^t;
S25: the cross-entropy loss function between the classification probability outputs and the true labels is:

$$L_{\mathrm{CE}}^t = -\frac{1}{M}\sum_{m=1}^{M} \log p^t(y_m \mid x_m)$$

where p^t(y_m | x_m) is the classification probability output for the true label y_m, w_j^t is the j-th weight vector of the fully-connected layer parameter matrix W^t, and M is the number of image samples.
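The forward pass and cross-entropy loss of steps S22 to S25 can be sketched in plain NumPy. This is only an illustration: the convolutional feature extractor is stood in for by random features, and all array names (F_t, W_t, z_t) are hypothetical.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability before exponentiating.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-probability of the true class over the batch.
    m = len(labels)
    return -np.log(probs[np.arange(m), labels] + 1e-12).mean()

rng = np.random.default_rng(0)
M, D, C = 8, 16, 5             # batch size, feature dimension, number of classes
F_t = rng.normal(size=(M, D))  # stand-in for convolutional features f^t(x_m; theta^t)
W_t = rng.normal(size=(D, C))  # fully-connected layer parameter matrix
z_t = F_t @ W_t                # classification scores (S23)
p_t = softmax(z_t)             # class probabilities (S24)
labels = rng.integers(0, C, size=M)
loss = cross_entropy(p_t, labels)  # teacher pre-training loss (S25)
```

The same two helper functions also serve the student network's forward pass in step five.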
Further, the specific steps of the construction of teacher network group relation knowledge in the third step are as follows:
s31: randomly extracting M image samples in a training set, inputting the M image samples into a teacher network to extract features, and clustering the features of the M image samples into K groups by using a K-means algorithm;
s32: suppose the k-th group contains N_k samples, the n-th of which has feature F_{k,n}^t, and the l-th group contains N_l samples, the n-th of which has feature F_{l,n}^t; then the relation coefficient between the k-th and l-th groups of sample features, based on the maximum mean discrepancy, is computed as:

$$R_{kl}^t = \left\| \frac{1}{N_k}\sum_{n=1}^{N_k} F_{k,n}^t \;-\; \frac{1}{N_l}\sum_{n=1}^{N_l} F_{l,n}^t \right\|_2^2$$
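Steps S31 and S32 can be sketched as follows, assuming the linear-kernel empirical form of the maximum mean discrepancy (the squared distance between group feature means). The small K-means routine and every name here (kmeans, relation_matrix, F_t) are illustrative stand-ins, not the patent's implementation.

```python
import numpy as np

def kmeans(X, K, iters=20, seed=0):
    # Plain Lloyd's algorithm: assign each point to its nearest centroid,
    # then recompute each non-empty cluster's centroid as the mean.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        for k in range(K):
            if (assign == k).any():
                centroids[k] = X[assign == k].mean(axis=0)
    return assign, centroids

def relation_matrix(group_means):
    # R[k, l] = || mean_k - mean_l ||^2, the linear-kernel empirical
    # maximum mean discrepancy between the features of groups k and l.
    diff = group_means[:, None, :] - group_means[None, :, :]
    return (diff ** 2).sum(-1)

rng = np.random.default_rng(1)
F_t = rng.normal(size=(32, 8))          # stand-in teacher features of M = 32 samples
K = 4
assign_t, means_t = kmeans(F_t, K)      # group the features into K clusters (S31)
R_t = relation_matrix(means_t)          # K x K group relation matrix (S32)
```

After the final assignment step, each centroid equals the mean feature of its group, so `R_t` matches the formula above; the matrix is symmetric with a zero diagonal.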
further, the concrete steps of the construction of the student network group relation knowledge in the fourth step are as follows:
s41: randomly initializing a small-capacity convolutional neural network;
s42: the same batch of M image samples randomly extracted from the training set is input into the student network to extract features, and the features of the M image samples are clustered into K groups using the K-means algorithm;
s43: suppose the k-th group contains N_k samples, the n-th of which has feature F_{k,n}^s, and the l-th group contains N_l samples, the n-th of which has feature F_{l,n}^s; then the relation coefficient between the k-th and l-th groups of sample features, based on the maximum mean discrepancy, is computed as:

$$R_{kl}^s = \left\| \frac{1}{N_k}\sum_{n=1}^{N_k} F_{k,n}^s \;-\; \frac{1}{N_l}\sum_{n=1}^{N_l} F_{l,n}^s \right\|_2^2$$
further, the concrete steps of training the student network in the fifth step are as follows:
s51: the same batch of M image samples randomly drawn from the training set is input into the student network; the feature of the m-th image sample after the convolutional layers is:

$$F_m^s = f^s(x_m;\, \theta^s)$$
s52: the classification score obtained after the image sample's features pass through the fully-connected layer is expressed as:

$$z_m^s = g^s(F_m^s;\, W^s) = (W^s)^{\mathrm T} F_m^s$$

where W^s = [w_1^s, …, w_C^s] collects the weight vectors of the fully-connected layer; there are C of them, consistent with the number of image classification categories;
s53: the Softmax layer converts the classification score into the probability output value of belonging to the c-th class as follows:

$$p^s(y_m = c \mid x_m) = \frac{\exp(z_{m,c}^s)}{\sum_{j=1}^{C} \exp(z_{m,j}^s)}$$

where z_{m,j}^s is the j-th component of the classification score z_m^s, produced by the j-th weight vector w_j^s of the classifier parameter matrix W^s;
s54: the cross-entropy loss function between the classification probability outputs and the true labels is:

$$L_{\mathrm{CE}}^s = -\frac{1}{M}\sum_{m=1}^{M} \log p^s(y_m \mid x_m)$$

where p^s(y_m | x_m) is the student network's classification probability output, w_j^s is the j-th weight vector of the fully-connected layer parameter matrix W^s, and M is the number of image samples;
s55: the group relation distillation loss function is obtained from the F-norm between the relation matrices R^t and R^s:

$$L_{\mathrm{GR}} = \left\| R^t - R^s \right\|_F^2$$
s56: the total loss function for optimizing the student network is:

$$L = \alpha L_{\mathrm{CE}}^s + \beta L_{\mathrm{GR}}$$

where α and β are adjustable parameters; optimizing the parameters θ^s and W^s of the student network with this total loss completes the model compression process.
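Steps S55 and S56 reduce to a squared Frobenius norm between the two relation matrices plus a weighted sum with the student's cross entropy. A minimal numeric sketch, where the matrices R_t and R_s and the cross-entropy value are illustrative stand-ins:

```python
import numpy as np

def group_relation_loss(R_t, R_s):
    # Squared Frobenius norm of the difference between the teacher's and
    # the student's group relation matrices (S55).
    return np.linalg.norm(R_t - R_s, ord="fro") ** 2

rng = np.random.default_rng(2)
K = 4
R_t = rng.normal(size=(K, K)); R_t = R_t + R_t.T  # symmetric stand-in matrices
R_s = rng.normal(size=(K, K)); R_s = R_s + R_s.T

alpha, beta = 1.0, 0.5     # adjustable weights alpha and beta from S56 (example values)
L_ce_s = 1.7               # stand-in value of the student cross-entropy loss
L_gr = group_relation_loss(R_t, R_s)
L_total = alpha * L_ce_s + beta * L_gr  # total loss optimized w.r.t. theta^s and W^s
```

In training, `L_total` would be minimized by gradient descent over the student parameters while the teacher's relation matrix is held fixed.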
Further, the specific steps of testing the student network in step six are as follows:
s61: fix the parameters θ^s and W^s of the student network; randomly extract any image from the test set and input it into the student network; the i-th image sample, denoted x_i, has convolutional-layer feature:

$$F_i^s = f^s(x_i;\, \theta^s)$$
s62: the classification score obtained after the image sample's features pass through the fully-connected layer is expressed as:

$$z_i^s = g^s(F_i^s;\, W^s) = (W^s)^{\mathrm T} F_i^s$$

where W^s = [w_1^s, …, w_C^s] collects the weight vectors of the fully-connected layer; there are C of them, consistent with the number of image classification categories;
s63: the Softmax layer converts the classification score into the probability output value of belonging to the c-th class as follows:

$$p^s(y_i = c \mid x_i) = \frac{\exp(z_{i,c}^s)}{\sum_{j=1}^{C} \exp(z_{i,j}^s)}$$

where z_{i,j}^s is the j-th component of the classification score z_i^s, produced by the j-th weight vector w_j^s of the classifier parameter matrix W^s; the test image is assigned to the class with the largest probability output value.
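At test time (S61 to S63), the frozen student simply performs a forward pass and picks the class with the highest Softmax probability. A small sketch with stand-in features and weights (F_i, W_s, and predict are illustrative names):

```python
import numpy as np

def predict(F, W_s):
    # Scores via the fully-connected layer, probabilities via Softmax,
    # then an argmax classification decision.
    z = F @ W_s
    z = z - z.max(axis=1, keepdims=True)   # stabilize before exponentiating
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return p, p.argmax(axis=1)

rng = np.random.default_rng(3)
D, C = 16, 5
F_i = rng.normal(size=(1, D))   # stand-in feature of one test image x_i
W_s = rng.normal(size=(D, C))   # frozen student classifier parameters W^s
p, pred = predict(F_i, W_s)     # class probabilities and predicted label
```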
The beneficial effects of the invention are as follows: the invention discloses a model compression method based on group relation knowledge distillation, in which the features the teacher network extracts from samples are clustered with the K-means algorithm and the student network is then guided to imitate the teacher's ability to group samples, so that the performance of the student network approaches that of the teacher network.
Drawings
FIG. 1 is a schematic structural diagram of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood, however, that the description herein of specific embodiments is only intended to illustrate the invention and not to limit the scope of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs, and the terms used herein in the specification of the present invention are for the purpose of describing particular embodiments only and are not intended to limit the present invention.
As shown in fig. 1, the present invention is a method for compressing a model based on group relation knowledge distillation, comprising the following steps:
step one: processing image data;
s11: randomly dividing a given image data set into three subsets, namely a training set, a verification set and a test set, used respectively for training the model, validating the hyper-parameters and testing the model's performance;
s12: the number of image classification classes is C; the m-th image sample in the training set is denoted x_m, with corresponding label y_m; the i-th image sample in the test set is denoted x_i.
Step two: training a teacher network;
s21: randomly initializing a large-capacity convolutional neural network consisting of a plurality of convolutional layers, a plurality of fully-connected layers and a Softmax layer; the convolutional layers are represented as f^t(·; θ^t) and extract the features of image samples, where θ^t denotes the convolutional-layer parameters; the fully-connected layers are represented as g^t(·; W^t) and transform and classify the features, where W^t is the parameter matrix of the fully-connected layer; the Softmax layer converts the classification scores into classification probability output values;
S22: randomly taking M image samples from the training set and inputting them into the teacher network, the feature of the m-th image sample after the convolutional layers is:

$$F_m^t = f^t(x_m;\, \theta^t)$$
s23: the classification score obtained after the image sample's features pass through the fully-connected layer is expressed as:

$$z_m^t = g^t(F_m^t;\, W^t) = (W^t)^{\mathrm T} F_m^t$$

where W^t = [w_1^t, …, w_C^t] collects the weight vectors of the teacher network's fully-connected layer; there are C of them, consistent with the number of image classification categories;
s24: the Softmax layer converts the classification score into the probability output value of belonging to the c-th class as follows:

$$p^t(y_m = c \mid x_m) = \frac{\exp(z_{m,c}^t)}{\sum_{j=1}^{C} \exp(z_{m,j}^t)}$$

where z_{m,j}^t is the j-th component of the classification score z_m^t, produced by the j-th weight vector w_j^t of the classifier parameter matrix W^t;
S25: the cross-entropy loss function between the classification probability outputs and the true labels is:

$$L_{\mathrm{CE}}^t = -\frac{1}{M}\sum_{m=1}^{M} \log p^t(y_m \mid x_m)$$

where p^t(y_m | x_m) is the classification probability output for the true label y_m, w_j^t is the j-th weight vector of the fully-connected layer parameter matrix W^t, and M is the number of image samples.
Step three: constructing teacher network group relation knowledge;
s31: randomly extracting M image samples in a training set, inputting the M image samples into a teacher network to extract features, and clustering the features of the M image samples into K groups by using a K-means algorithm;
s32: suppose the k-th group contains N_k samples, the n-th of which has feature F_{k,n}^t, and the l-th group contains N_l samples, the n-th of which has feature F_{l,n}^t; then the relation coefficient between the k-th and l-th groups of sample features, based on the maximum mean discrepancy, is computed as:

$$R_{kl}^t = \left\| \frac{1}{N_k}\sum_{n=1}^{N_k} F_{k,n}^t \;-\; \frac{1}{N_l}\sum_{n=1}^{N_l} F_{l,n}^t \right\|_2^2$$
step four: constructing student network group relation knowledge;
s41: randomly initializing a small-capacity convolutional neural network;
s42: the same batch of M image samples randomly extracted from the training set is input into the student network to extract features, and the features of the M image samples are clustered into K groups using the K-means algorithm;
s43: suppose the k-th group contains N_k samples, the n-th of which has feature F_{k,n}^s, and the l-th group contains N_l samples, the n-th of which has feature F_{l,n}^s; then the relation coefficient between the k-th and l-th groups of sample features, based on the maximum mean discrepancy, is computed as:

$$R_{kl}^s = \left\| \frac{1}{N_k}\sum_{n=1}^{N_k} F_{k,n}^s \;-\; \frac{1}{N_l}\sum_{n=1}^{N_l} F_{l,n}^s \right\|_2^2$$
step five: training a student network;
s51: the same batch of M image samples randomly drawn from the training set is input into the student network; the feature of the m-th image sample after the convolutional layers is:

$$F_m^s = f^s(x_m;\, \theta^s)$$
s52: the classification score obtained after the image sample's features pass through the fully-connected layer is expressed as:

$$z_m^s = g^s(F_m^s;\, W^s) = (W^s)^{\mathrm T} F_m^s$$

where W^s = [w_1^s, …, w_C^s] collects the weight vectors of the fully-connected layer; there are C of them, consistent with the number of image classification categories;
s53: the Softmax layer converts the classification score into the probability output value of belonging to the c-th class as follows:

$$p^s(y_m = c \mid x_m) = \frac{\exp(z_{m,c}^s)}{\sum_{j=1}^{C} \exp(z_{m,j}^s)}$$

where z_{m,j}^s is the j-th component of the classification score z_m^s, produced by the j-th weight vector w_j^s of the classifier parameter matrix W^s;
s54: the cross-entropy loss function between the classification probability outputs and the true labels is:

$$L_{\mathrm{CE}}^s = -\frac{1}{M}\sum_{m=1}^{M} \log p^s(y_m \mid x_m)$$

where p^s(y_m | x_m) is the student network's classification probability output, w_j^s is the j-th weight vector of the fully-connected layer parameter matrix W^s, and M is the number of image samples;
s55: the group relation distillation loss function is obtained from the F-norm between the relation matrices R^t and R^s:

$$L_{\mathrm{GR}} = \left\| R^t - R^s \right\|_F^2$$
s56: the total loss function for optimizing the student network is:

$$L = \alpha L_{\mathrm{CE}}^s + \beta L_{\mathrm{GR}}$$

where α and β are adjustable parameters; optimizing the parameters θ^s and W^s of the student network with this total loss completes the model compression process.
Step six: and (5) testing the student network.
S61: parameter θ in fixed student networksAnd Ws(ii) a Randomly extracting any image in the test set and inputting the image into the student network, wherein the ith image sample is represented as xi, and the characteristic expression of passing through the convolution layer in the student network is as follows:
Figure BDA00034642084000000911
s62: the classification score obtained after the image sample's features pass through the fully-connected layer is expressed as:

$$z_i^s = g^s(F_i^s;\, W^s) = (W^s)^{\mathrm T} F_i^s$$

where W^s = [w_1^s, …, w_C^s] collects the weight vectors of the fully-connected layer; there are C of them, consistent with the number of image classification categories;
s63: the Softmax layer converts the classification score into the probability output value of belonging to the c-th class as follows:

$$p^s(y_i = c \mid x_i) = \frac{\exp(z_{i,c}^s)}{\sum_{j=1}^{C} \exp(z_{i,j}^s)}$$

where z_{i,j}^s is the j-th component of the classification score z_i^s, produced by the j-th weight vector w_j^s of the classifier parameter matrix W^s; the test image is assigned to the class with the largest probability output value.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents or improvements made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (7)

1. A model compression method based on group relation knowledge distillation is characterized by comprising the following steps:
step one: processing image data;
step two: training a teacher network;
randomly initializing a large-capacity convolutional neural network consisting of a plurality of convolutional layers, a plurality of fully-connected layers and a Softmax layer, wherein the convolutional layers and the fully-connected layers are represented as f^t(·; θ^t) and g^t(·; W^t) respectively, θ^t being the convolutional-layer parameters and W^t the parameter matrix of the fully-connected layer; randomly extracting M image samples from the training set and inputting them into the teacher network; the m-th image sample is represented after the convolutional layers as F_m^t = f^t(x_m; θ^t) and after the fully-connected layer as z_m^t = g^t(F_m^t; W^t); finally, the Softmax layer converts the score into the probability output value of belonging to the c-th category, p^t(y_m = c | x_m); the parameters of the teacher network are then optimized with the cross-entropy objective function L_CE^t between the probability output value p^t(y_m = c | x_m) and the true label y_m of the sample;
step three: constructing teacher network group relation knowledge;
randomly extracting M image samples from the training set, inputting them into the teacher network to extract features, and clustering the features of the M image samples into K groups using the K-means algorithm, wherein the k-th group contains N_k samples, the n-th of which has feature F_{k,n}^t, and the l-th group contains N_l samples, the n-th of which has feature F_{l,n}^t; the relation coefficient R_{kl}^t between any two groups is calculated using the maximum mean discrepancy, finally yielding the relation matrix R^t;
Step four: constructing student network group relation knowledge;
randomly initializing a small-capacity convolutional neural network consisting of a small number of convolutional layers, a small number of fully-connected layers and a Softmax layer, wherein the convolutional layers and the fully-connected layers are represented as f^s(·; θ^s) and g^s(·; W^s) respectively, θ^s being the convolutional-layer parameters and W^s the parameter matrix of the fully-connected layer; the same batch of M image samples randomly extracted from the training set is input into the student network to extract features, and the features of the M image samples are clustered into K groups using the K-means algorithm, wherein the k-th group contains N_k samples, the n-th of which has feature F_{k,n}^s, and the l-th group contains N_l samples, the n-th of which has feature F_{l,n}^s; the relation coefficient R_{kl}^s between any two groups is calculated using the maximum mean discrepancy, finally yielding the relation matrix R^s;
Step five: training a student network;
for the same batch of M image samples randomly extracted from the training set, the m-th image sample is input into the student network to obtain the probability output value of belonging to the c-th category, denoted p^s(y_m = c | x_m); the cross-entropy objective function L_CE^s between the probability output value p^s(y_m = c | x_m) and the true label y_m of the sample is calculated, and at the same time the F-norm between the relation matrices R^t and R^s is calculated as the group relation distillation loss function L_GR; the weighted sum of L_CE^s and L_GR is taken as the total loss function to optimize the parameters of the student network;
step six: testing a student network;
fixing the parameters $\theta_s$ and $W_s$ in the student network, any image is randomly extracted from the test set and input into the student network, the i-th image sample being denoted $x_i$; through the convolutional layers, the fully-connected layers and the Softmax layer of the student network, the probability output value of belonging to the c-th category is finally obtained, denoted $p_s(y_i=c\,|\,x_i)$.
2. The method for compressing a model based on knowledge distillation of group relationships according to claim 1, wherein the image data in the first step is processed by the following specific steps:
s11: for a given image data set, randomly dividing the given image data set into three subsets, namely a training set, a verification set and a test set, and respectively using the three subsets for training a model, verifying a hyper-parameter and testing the performance of the model;
s12: the number of image classification categories is C; the m-th image sample in the training set is denoted $x_m$ and its corresponding label $y_m$; the i-th image sample in the test set is denoted $x_i$.
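The split in s11 can be sketched as follows. This is a minimal NumPy sketch; the 80/10/10 ratio, the seed, and the function name are illustrative assumptions, not part of the claim:

```python
import numpy as np

def split_dataset(n_samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Randomly split sample indices into train / validation / test subsets.

    Ratios are illustrative; the claim only requires three random subsets.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)          # random ordering of all samples
    n_train = int(ratios[0] * n_samples)
    n_val = int(ratios[1] * n_samples)
    return (idx[:n_train],                    # training set indices
            idx[n_train:n_train + n_val],     # validation set indices
            idx[n_train + n_val:])            # test set indices

train_idx, val_idx, test_idx = split_dataset(1000)
```

The three subsets are disjoint and together cover the whole dataset, matching the roles named in s11 (model training, hyper-parameter validation, performance testing).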
3. The group-relationship-knowledge-distillation-based model compression method as claimed in claim 1, wherein the training of the teacher network in the second step comprises the specific steps of:
s21: randomly initializing a large-capacity convolutional neural network consisting of several convolutional layers, several fully-connected layers and a Softmax layer; the convolutional layers, denoted $f_t(\cdot;\theta_t)$, extract the features of the image samples, where $\theta_t$ is the parameter of the convolutional layers; the fully-connected layers, denoted $g_t(\cdot;W_t)$, transform and classify the features, where $W_t$ is the parameter matrix of the fully-connected layers; the Softmax layer converts the classification scores into classification probability output values;
s22: M image samples are randomly taken from the training set and input into the teacher network; the feature of the m-th image sample after the convolutional layers is expressed as $F_t^{m}=f_t(x_m;\theta_t)$;
s23: the classification score obtained after the feature of the image sample passes through the fully-connected layers is expressed as $z_t^{m}=W_t^{\mathrm{T}}F_t^{m}$, where the weight vectors $w_t^{j}$ of the teacher network fully-connected layer number C in total, consistent with the number of image classification categories;
s24: the calculation formula by which the Softmax layer converts the classification score into the probability output value of belonging to the c-th category is $p_t(y_m=c\,|\,x_m)=\frac{\exp(z_{t,c}^{m})}{\sum_{j=1}^{C}\exp(z_{t,j}^{m})}$, where $z_{t,j}^{m}=(w_t^{j})^{\mathrm{T}}F_t^{m}$ is the j-th component of the classification score $z_t^{m}$ and $w_t^{j}$ is the j-th weight vector of the classifier parameter matrix $W_t$;
s25: the cross-entropy loss function between the classification output probability values and the ground-truth labels is $\mathcal{L}_{CE}^{t}=-\frac{1}{M}\sum_{m=1}^{M}\sum_{c=1}^{C}\mathbb{1}[y_m=c]\log p_t(y_m=c\,|\,x_m)$, where $p_t(y_m=c\,|\,x_m)$ is the classification output probability value, $w_t^{j}$ is the j-th weight vector of the fully-connected layer parameter matrix $W_t$, and M is the number of image samples.
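Steps s24 and s25 amount to a standard softmax followed by cross-entropy. A minimal NumPy sketch, where `scores` stands in for the fully-connected outputs $W_t^{\mathrm{T}}F_t^{m}$ and the function names are illustrative:

```python
import numpy as np

def softmax(scores):
    """Convert per-sample classification scores into class probabilities (s24)."""
    z = scores - scores.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability of the true class over M samples (s25)."""
    m = probs.shape[0]
    return -np.log(probs[np.arange(m), labels]).mean()

# Toy batch: M=2 samples, C=3 categories, true labels 0 and 1.
scores = np.array([[2.0, 0.5, 0.1],
                   [0.2, 3.0, 0.4]])
probs = softmax(scores)
loss = cross_entropy(probs, np.array([0, 1]))
```

Each row of `probs` sums to 1, and the loss decreases as the probability assigned to the true class grows, which is exactly what the teacher-network training objective in s25 rewards.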
4. The group relationship knowledge distillation-based model compression method as claimed in claim 1, wherein the construction of teacher network group relationship knowledge in the third step comprises the following specific steps:
s31: randomly extracting M image samples in a training set, inputting the M image samples into a teacher network to extract features, and clustering the features of the M image samples into K groups by using a K-means algorithm;
s32: the k-th group contains $N_k$ samples, the feature of its i-th sample being denoted $F_t^{k,i}$; the l-th group contains $N_l$ samples, the feature of its j-th sample being denoted $F_t^{l,j}$; the calculation formula of the relation coefficient between the k-th and l-th groups of sample features based on the maximum mean difference is then $r_{kl}^{t}=\left\|\frac{1}{N_k}\sum_{i=1}^{N_k}F_t^{k,i}-\frac{1}{N_l}\sum_{j=1}^{N_l}F_t^{l,j}\right\|_{2}^{2}$
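Under the formula in s32, the relation coefficient between two groups is the squared distance between their feature means, and collecting all pairs gives the relation matrix. A minimal NumPy sketch, assuming the group assignments have already been produced by K-means as in s31 (the function and variable names are illustrative):

```python
import numpy as np

def relation_matrix(features, groups, K):
    """R[k, l] = || mean(group k) - mean(group l) ||_2^2 (maximum mean difference)."""
    means = np.stack([features[groups == k].mean(axis=0) for k in range(K)])
    diff = means[:, None, :] - means[None, :, :]   # (K, K, D) pairwise mean gaps
    return (diff ** 2).sum(axis=-1)                # squared L2 norm per group pair

# Toy example: 4 two-dimensional features, already clustered into K=2 groups.
feats = np.array([[0.0, 0.0],
                  [0.0, 2.0],
                  [3.0, 0.0],
                  [3.0, 2.0]])
grp = np.array([0, 0, 1, 1])
R_t = relation_matrix(feats, grp, K=2)
```

Group 0 has mean (0, 1) and group 1 has mean (3, 1), so the off-diagonal coefficients are 9 and the diagonal is 0; the same routine applied to student features would give the $R_s$ of the fourth step.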
5. the group relationship knowledge distillation-based model compression method as claimed in claim 1, wherein the concrete steps of the construction of the group relationship knowledge of the student network in the fourth step are as follows:
s41: randomly initializing a small-capacity convolutional neural network;
s42: the same batch of M image samples randomly extracted from the training set is fed into the student network to extract features, and the features of the M image samples are clustered into K groups using the K-means algorithm;
s43: the k-th group contains $N_k$ samples, the feature of its i-th sample being denoted $F_s^{k,i}$; the l-th group contains $N_l$ samples, the feature of its j-th sample being denoted $F_s^{l,j}$; the calculation formula of the relation coefficient between the k-th and l-th groups of sample features based on the maximum mean difference is then $r_{kl}^{s}=\left\|\frac{1}{N_k}\sum_{i=1}^{N_k}F_s^{k,i}-\frac{1}{N_l}\sum_{j=1}^{N_l}F_s^{l,j}\right\|_{2}^{2}$
6. the group relationship knowledge distillation-based model compression method according to claim 1, wherein the training of the student network in the fifth step comprises the following specific steps:
s51: for the same batch of M image samples randomly drawn from the training set and input into the student network, the feature of the m-th image sample after the convolutional layers is expressed as $F_s^{m}=f_s(x_m;\theta_s)$;
s52: the classification score obtained after the feature of the image sample passes through the fully-connected layers is expressed as $z_s^{m}=W_s^{\mathrm{T}}F_s^{m}$, where the weight vectors $w_s^{j}$ of the fully-connected layer number C in total, consistent with the number of image classification categories;
s53: the calculation formula by which the Softmax layer converts the classification score into the probability output value of belonging to the c-th category is $p_s(y_m=c\,|\,x_m)=\frac{\exp(z_{s,c}^{m})}{\sum_{j=1}^{C}\exp(z_{s,j}^{m})}$, where $z_{s,j}^{m}=(w_s^{j})^{\mathrm{T}}F_s^{m}$ is the j-th component of the classification score $z_s^{m}$ and $w_s^{j}$ is the j-th weight vector of the classifier parameter matrix $W_s$;
s54: the cross-entropy loss function between the classification output probability values and the ground-truth labels is $\mathcal{L}_{CE}^{s}=-\frac{1}{M}\sum_{m=1}^{M}\sum_{c=1}^{C}\mathbb{1}[y_m=c]\log p_s(y_m=c\,|\,x_m)$, where $p_s(y_m=c\,|\,x_m)$ is the classification output probability value of the student network, $w_s^{j}$ is the j-th weight vector of the fully-connected layer parameter matrix $W_s$, and M is the number of image samples;
s55: the calculation formula of the group relation distillation loss function $\mathcal{L}_{GR}$, obtained from the F norm between the relation matrices $R_t$ and $R_s$, is $\mathcal{L}_{GR}=\left\|R_t-R_s\right\|_{F}^{2}$;
s56: the total loss function for the student network optimization is $\mathcal{L}=\alpha\,\mathcal{L}_{CE}^{s}+\beta\,\mathcal{L}_{GR}$, where α and β are adjustable parameters; the parameters $\theta_s$ and $W_s$ in the student network are optimized with this total loss function, completing the model compression process.
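The objective in s55 and s56 combines the student's cross-entropy with the squared Frobenius norm of the gap between the two relation matrices. A minimal NumPy sketch with illustrative α and β values (real training would backpropagate through both terms; here they are only evaluated):

```python
import numpy as np

def group_relation_loss(R_t, R_s):
    """L_GR = || R_t - R_s ||_F^2, the group relation distillation loss (s55)."""
    return float(((R_t - R_s) ** 2).sum())

def total_loss(ce_loss, R_t, R_s, alpha=1.0, beta=0.1):
    """L = alpha * L_CE + beta * L_GR, the student objective (s56).

    alpha and beta are the adjustable weighting parameters from the claim;
    the values here are arbitrary examples.
    """
    return alpha * ce_loss + beta * group_relation_loss(R_t, R_s)

R_t = np.array([[0.0, 9.0], [9.0, 0.0]])   # teacher relation matrix (toy values)
R_s = np.array([[0.0, 8.0], [8.0, 0.0]])   # student relation matrix (toy values)
L = total_loss(ce_loss=0.5, R_t=R_t, R_s=R_s)
```

With these toy matrices the distillation term is (9−8)² + (9−8)² = 2, so the total is 1.0·0.5 + 0.1·2 = 0.7; driving this term to zero forces the student's group structure toward the teacher's.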
7. The group relationship knowledge distillation-based model compression method as claimed in claim 1, wherein the concrete steps of the test of the student network in the sixth step are as follows:
s61: fixing the parameters $\theta_s$ and $W_s$ in the student network, any image is randomly extracted from the test set and input into the student network; the i-th image sample is denoted $x_i$, and its feature after the convolutional layers of the student network is expressed as $F_s^{i}=f_s(x_i;\theta_s)$;
s62: the classification score obtained after the feature of the image sample passes through the fully-connected layers is expressed as $z_s^{i}=W_s^{\mathrm{T}}F_s^{i}$, where the weight vectors $w_s^{j}$ of the fully-connected layer number C in total, consistent with the number of image classification categories;
s63: the calculation formula by which the Softmax layer converts the classification score into the probability output value of belonging to the c-th category is $p_s(y_i=c\,|\,x_i)=\frac{\exp(z_{s,c}^{i})}{\sum_{j=1}^{C}\exp(z_{s,j}^{i})}$, where $z_{s,j}^{i}=(w_s^{j})^{\mathrm{T}}F_s^{i}$ is the j-th component of the classification score $z_s^{i}$ and $w_s^{j}$ is the j-th weight vector of the classifier parameter matrix $W_s$.
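At test time (s61 to s63) the parameters stay fixed and the predicted category is simply the most probable class under the Softmax output. A minimal NumPy sketch, with an illustrative score vector standing in for $W_s^{\mathrm{T}}F_s^{i}$:

```python
import numpy as np

def predict(scores):
    """Softmax over a single sample's classification scores, then the argmax (s63)."""
    e = np.exp(scores - scores.max())   # shift for numerical stability
    probs = e / e.sum()
    return probs, int(probs.argmax())

probs, pred = predict(np.array([0.2, 2.5, 0.3]))   # toy scores for C=3 categories
```

Here the second category gets the highest score, so it is the predicted class; no gradients or relation matrices are involved at inference.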
CN202210030247.2A 2022-01-11 2022-01-11 Model compression method based on group relation knowledge distillation Withdrawn CN114626504A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210030247.2A CN114626504A (en) 2022-01-11 2022-01-11 Model compression method based on group relation knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210030247.2A CN114626504A (en) 2022-01-11 2022-01-11 Model compression method based on group relation knowledge distillation

Publications (1)

Publication Number Publication Date
CN114626504A true CN114626504A (en) 2022-06-14

Family

ID=81899061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210030247.2A Withdrawn CN114626504A (en) 2022-01-11 2022-01-11 Model compression method based on group relation knowledge distillation

Country Status (1)

Country Link
CN (1) CN114626504A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115511059A (en) * 2022-10-12 2022-12-23 北华航天工业学院 Network lightweight method based on convolutional neural network channel decoupling
CN115511059B (en) * 2022-10-12 2024-02-09 北华航天工业学院 Network light-weight method based on convolutional neural network channel decoupling
CN117058437A (en) * 2023-06-16 2023-11-14 江苏大学 Flower classification method, system, equipment and medium based on knowledge distillation
CN117058437B (en) * 2023-06-16 2024-03-08 江苏大学 Flower classification method, system, equipment and medium based on knowledge distillation

Similar Documents

Publication Publication Date Title
CN114626504A (en) Model compression method based on group relation knowledge distillation
CN109801621A (en) A kind of audio recognition method based on residual error gating cycle unit
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN108345866B (en) Pedestrian re-identification method based on deep feature learning
CN111985581A (en) Sample-level attention network-based few-sample learning method
CN111695456A (en) Low-resolution face recognition method based on active discriminability cross-domain alignment
CN114157539B (en) Data-aware dual-drive modulation intelligent identification method
CN110598552A (en) Expression recognition method based on improved particle swarm optimization convolutional neural network optimization
CN112766378B (en) Cross-domain small sample image classification model method focusing on fine granularity recognition
CN106971180A (en) A kind of micro- expression recognition method based on the sparse transfer learning of voice dictionary
Chen et al. A semisupervised deep learning framework for tropical cyclone intensity estimation
CN112633154A (en) Method and system for converting heterogeneous face feature vectors
Huang et al. Design and Application of Face Recognition Algorithm Based on Improved Backpropagation Neural Network.
CN116452862A (en) Image classification method based on domain generalization learning
CN115731595A (en) Fuzzy rule-based multi-level decision fusion emotion recognition method
CN109034192B (en) Track-vehicle body vibration state prediction method based on deep learning
CN114387474A (en) Small sample image classification method based on Gaussian prototype classifier
CN114329031A (en) Fine-grained bird image retrieval method based on graph neural network and deep hash
CN116543269B (en) Cross-domain small sample fine granularity image recognition method based on self-supervision and model thereof
CN109165576A (en) A kind of moving state identification method and device
CN116246305A (en) Pedestrian retrieval method based on hybrid component transformation network
CN112784800B (en) Face key point detection method based on neural network and shape constraint
CN114997331A (en) Small sample relation classification method and system based on metric learning
CN112001222B (en) Student expression prediction method based on semi-supervised learning
CN114357166A (en) Text classification method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220614