CN115409157A - Data-free knowledge distillation method based on student feedback - Google Patents

Data-free knowledge distillation method based on student feedback

Info

Publication number
CN115409157A
Authority
CN
China
Prior art keywords
student
model
task
student model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211028120.3A
Other languages
Chinese (zh)
Inventor
王灿 (Wang Can)
罗诗雅 (Luo Shiya)
陈德仿 (Chen Defang)
冯雁 (Feng Yan)
史麒豪 (Shi Qihao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202211028120.3A priority Critical patent/CN115409157A/en
Publication of CN115409157A publication Critical patent/CN115409157A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N 3/088: Non-supervised learning, e.g. competitive learning
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/764: Recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/776: Validation; performance evaluation
    • G06V 10/778: Active pattern-learning, e.g. online learning of image or video features
    • G06V 10/82: Recognition or understanding using pattern recognition or machine learning, using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

A data-free knowledge distillation method based on student feedback, and in particular a data-free knowledge distillation method based on student feedback for image classification. The method comprises the following steps. S1: initialize a student model and add an auxiliary classifier after the feature extractor of the student model. S2: use the auxiliary classifier to feed back the current learning ability of the student model, and jointly train a noise vector and a generator according to the student-feedback and teacher-feedback loss functions, thereby obtaining optimal synthetic pictures. S3: train the student model by knowledge distillation using the synthetic pictures obtained in S2, while independently training the auxiliary classifier to learn an auxiliary task. S4: repeat S2 and S3 until the student model is trained to convergence. Without any original training data, the invention adaptively adjusts the content of the synthetic pictures according to the current state of the student model and tailors them to it, thereby training the student model more effectively and improving its final performance.

Description

Data-free knowledge distillation method based on student feedback
Technical Field
The invention relates to the technical field of knowledge distillation, and in particular to a data-free knowledge distillation method based on student feedback.
Background
Convolutional neural networks have achieved remarkable success in a variety of practical applications in recent years, but their expensive storage and computational costs make it difficult to deploy such models on mobile devices. Hinton et al. therefore proposed knowledge distillation to achieve model compression, the main idea being to transfer dark knowledge from a pre-trained heavyweight teacher model to a lightweight student model.
Typical knowledge distillation methods rest on a strong premise: the original data used to train the teacher model can be used directly to train the student model. In some practical scenarios, however, data are not shared publicly for reasons of privacy, intellectual property, or the sheer size of the data sets, and data-free knowledge distillation has been proposed to address this problem. Existing related work mainly uses feedback from the teacher model to synthesize pictures, then uses the synthesized pictures in place of the original ones to carry out the knowledge distillation process.
However, existing work does not explicitly account for the student's learning ability during picture synthesis. The synthesized pictures may become too simple relative to the student's current ability, so the student model learns no new knowledge, impairing its final performance.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provides a data-free knowledge distillation method based on student feedback, which estimates the student's current learning ability using a self-supervision-augmented auxiliary task, adaptively adjusts the content of the synthetic pictures accordingly, and generates samples that are difficult for the student model, ensuring that the student model continuously acquires new knowledge and improving its final performance.
The invention adopts the following technical scheme:
a distillation method without data knowledge based on student feedback comprises the following steps:
s1: initializing a student model, and adding an auxiliary classifier behind a feature extractor of the student model;
s2: feeding back the current learning ability of the student model by using the auxiliary classifier, and simultaneously training a noise vector and a generator in a combined manner according to loss functions fed back by students and teachers so as to obtain an optimal synthetic picture;
s3: training a student model by knowledge distillation by using the synthetic picture obtained in the S2, and simultaneously independently training an auxiliary classifier to learn an auxiliary task;
s4: s2 and S3 are repeated until the student model is trained to converge.
Specifically, in S2, the auxiliary classifier is used to feed back the current learning ability of the student model, and the specific process comprises:
A noise vector $z$ is randomly generated and input into the generator network $G$ to obtain a synthetic picture $\hat{x} = G(z)$. The synthetic picture $\hat{x}$ is then rotated by a certain angle to obtain the rotated picture $\hat{x}_k$, which is input into the student model feature extractor $f_S$. The resulting feature representation $f_S(\hat{x}_k)$ is input into the auxiliary classifier $c$, and the output $c(f_S(\hat{x}_k))$ of the auxiliary classifier is used to calculate a loss function that quantifies the current learning ability of the student model, i.e., the student-feedback loss:
$$\mathcal{L}_{sf} = \mathrm{CE}\big(c(f_S(\hat{x}_k)),\, k\big)$$
where $\mathrm{CE}$ denotes cross entropy and $k$ denotes the class label of the self-supervision-augmented task, which treats the self-supervised rotation task and the original image classification task as one joint task.
Specifically, the classes of the self-supervision-augmented task are defined as follows:
Let the total number of classes of the original image classification task be $N$ and the total number of classes of the self-supervised rotation task be $M$. Suppose a synthetic picture $\hat{x}$ belongs to class $n$ in the image classification task and class $m$ in the self-supervised rotation task; then in the self-supervision-augmented task it belongs to class $k = n \times M + m$, out of $N \times M$ joint classes.
Specifically, the teacher-feedback loss function in S2 is:
$$\mathcal{L}_{tf} = \mathcal{L}_{cls} + \mathcal{L}_{bn}$$
where $\mathcal{L}_{cls}$ is the cross entropy between the output $f_T(\hat{x})$ of the teacher model $f_T$ and the predefined label $\hat{y}$ of the image classification task, formulated as:
$$\mathcal{L}_{cls} = \mathrm{CE}\big(f_T(\hat{x}),\, \hat{y}\big)$$
and $\mathcal{L}_{bn}$ is the $\ell_2$-norm distance between the feature statistics of the synthetic images and those of the real images, formulated as:
$$\mathcal{L}_{bn} = \sum_{l} \Big( \big\| \mu_l(\hat{x}) - \mu_l \big\|_2 + \big\| \sigma_l^2(\hat{x}) - \sigma_l^2 \big\|_2 \Big)$$
where $\mu_l(\hat{x})$ and $\sigma_l^2(\hat{x})$ are respectively the mean and variance of the feature map of the synthetic image $\hat{x}$ at the $l$-th layer of the teacher model, and $\mu_l$ and $\sigma_l^2$ are the mean and variance stored in the $l$-th layer of the teacher model, representing the feature statistics of the real images.
Specifically, in S2 the noise vector and the generator are jointly trained according to the student-feedback and teacher-feedback loss functions, where the total loss function is:
$$\mathcal{L}_{gen} = \mathcal{L}_{tf} - \alpha\, \mathcal{L}_{sf}$$
where $\alpha$ is a hyperparameter weight used to balance the two loss terms; the student-feedback term carries a negative sign because $\mathcal{L}_{sf}$ is to be enlarged so as to produce difficult samples.
Specifically, in S3 the overall loss function for training the student model by knowledge distillation is:
$$\mathcal{L}_{kd} = \mathcal{L}_{ce} + \mathcal{L}_{kl} + \beta\, \mathcal{L}_{fm}$$
where $\beta$ is a hyperparameter weight used to balance the three loss terms; $\mathcal{L}_{ce}$ is the conventional loss term of the original image classification task, the cross entropy between the student model output and the predefined label; $\mathcal{L}_{kl}$ is the KL divergence between the teacher and student outputs, formulated as:
$$\mathcal{L}_{kl} = \tau^2\, \mathrm{KL}\Big( \sigma\big(f_T(\hat{x})/\tau\big) \,\Big\|\, \sigma\big(f_S(\hat{x})/\tau\big) \Big)$$
where $\sigma(\cdot)$ is the softmax function and $\tau$ is a hyperparameter that smooths the output distributions; and $\mathcal{L}_{fm}$ is the mean square error between the feature map $F_T$ of the last layer of the teacher model and the feature map $F_S$ of the last layer of the student model, formulated as:
$$\mathcal{L}_{fm} = \big\| F_T - r(F_S) \big\|_2^2$$
where $r(\cdot)$ is a mapping operation that aligns the dimensions of the feature maps.
Specifically, in S3, independently training the auxiliary classifier comprises: after the student completes each training iteration, the parameters of the student model are fixed, and the parameters of the auxiliary classifier are updated by training with the loss function $\mathcal{L}_{sf}$.
From the above description, the advantages of the invention over the prior art are as follows:
In the picture synthesis process, the student model also acts as a contributor: the content of the synthesized pictures is adaptively adjusted according to the current ability fed back by the student, generating samples that are difficult relative to that ability. This prevents the student model from failing to learn new knowledge because the samples are too simple, and trains the student more effectively, improving the final performance.
Without any original training data, the invention adaptively adjusts the content of the synthetic pictures according to the current state of the student model and tailors them to it, thereby training the student model more effectively and improving its final performance.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The process of the present invention is further described below with reference to the accompanying drawings:
As shown in FIG. 1, a data-free knowledge distillation method based on student feedback comprises the following steps:
S1: initializing a student model, and adding an auxiliary classifier after the feature extractor of the student model;
in a specific implementation, the auxiliary classifier is composed of two fully connected layers.
S2: using the auxiliary classifier to feed back the current learning ability of the student model, and jointly training a noise vector and a generator according to the student-feedback and teacher-feedback loss functions, thereby obtaining optimal synthetic pictures;
The specific process of using the auxiliary classifier to feed back the current learning ability of the student model is as follows:
A noise vector $z$ is randomly generated and input into the generator network $G$ to obtain a synthetic picture $\hat{x} = G(z)$. The synthetic picture $\hat{x}$ is then rotated by a certain angle to obtain the rotated picture $\hat{x}_k$, which is input into the student model feature extractor $f_S$. The resulting feature representation $f_S(\hat{x}_k)$ is input into the auxiliary classifier $c$, and the output $c(f_S(\hat{x}_k))$ is used to calculate a loss function that quantifies the current learning ability of the student model, i.e., the student-feedback loss:
$$\mathcal{L}_{sf} = \mathrm{CE}\big(c(f_S(\hat{x}_k)),\, k\big)$$
where $k$ denotes the class label of the self-supervision-augmented task, which treats the self-supervised rotation task and the original image classification task as one joint task.
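A minimal PyTorch sketch of this student-feedback computation follows, assuming NCHW image batches, a rotation task with M = 4 angles (0°, 90°, 180°, 270°), and averaging of the loss over the angles; none of these details is fixed by the patent.

```python
import torch
import torch.nn.functional as F

def student_feedback_loss(x_hat, labels, student_features, aux_classifier,
                          num_rotations=4):
    """L_sf = CE(c(f_S(x_hat_k)), k) for synthetic pictures rotated by
    0/90/180/270 degrees (a sketch, not the patent's exact code)."""
    losses = []
    for m in range(num_rotations):
        x_rot = torch.rot90(x_hat, k=m, dims=(2, 3))   # rotate by m * 90 degrees
        feats = student_features(x_rot)                # student feature representation
        logits = aux_classifier(feats.flatten(1))      # joint-task prediction
        k = labels * num_rotations + m                 # joint label k = n * M + m
        losses.append(F.cross_entropy(logits, k))
    return torch.stack(losses).mean()

# usage: z = torch.randn(B, latent_dim); x_hat = generator(z)
# loss = student_feedback_loss(x_hat, labels, f_S, c)
```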
Pictures need to be synthesized adaptively according to the current ability of the student model. A model's ability to capture semantic information serves well as an index of the student model's ability, and an auxiliary task can indirectly reflect how well the student model understands that semantic information. If the self-supervised rotation task alone were used as the auxiliary task, the ability evaluation could be inaccurate: for example, the digit "6" rotated by 180° is indistinguishable from the digit "9" rotated by 0°. The method therefore adopts the self-supervision-augmented task as its auxiliary task, so that the model identifies the class while identifying the rotation angle.
In the picture synthesis process, the optimization objective is to enlarge $\mathcal{L}_{sf}$ so as to generate difficult samples, i.e., samples whose semantic information the student model has difficulty understanding.
The classes of the self-supervision-augmented task are defined as follows:
Let the total number of classes of the original image classification task be $N$ and that of the self-supervised rotation task be $M$. Suppose a synthetic picture $\hat{x}$ belongs to class $n$ in the image classification task and class $m$ in the self-supervised rotation task; then in the self-supervision-augmented task it belongs to class $k = n \times M + m$, out of $N \times M$ joint classes.
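For instance, with N = 10 image classes and M = 4 rotation angles, the joint label can be computed as in the following trivial sketch of the assumed mapping:

```python
def joint_label(n: int, m: int, num_rotations: int = 4) -> int:
    """Map an image class n and a rotation class m to the joint
    self-supervision-augmented label k = n * M + m."""
    return n * num_rotations + m

# e.g. class 9 rotated by 180 degrees (rotation class 2) -> joint class 38
assert joint_label(9, 2) == 38
```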
However, if picture synthesis were based on student feedback alone, the distribution of the synthesized pictures would stray far from that of the real pictures for lack of prior knowledge of the original data. To account for the quality of the synthesized pictures, synthesis must therefore also be based on teacher feedback, whose loss function is:
$$\mathcal{L}_{tf} = \mathcal{L}_{cls} + \mathcal{L}_{bn}$$
$\mathcal{L}_{cls}$ expresses a one-hot assumption: if a synthetic picture follows the same distribution as the original training pictures, the teacher model's output for it will resemble a one-hot vector. $\mathcal{L}_{cls}$ is therefore defined as the cross entropy between the output $f_T(\hat{x})$ of the teacher model $f_T$ and the predefined label $\hat{y}$ of the image classification task:
$$\mathcal{L}_{cls} = \mathrm{CE}\big(f_T(\hat{x}),\, \hat{y}\big)$$
$\mathcal{L}_{bn}$ makes effective use of the statistics stored in the batch-normalization layers of the teacher model as prior information about the data. $\mathcal{L}_{bn}$ is therefore defined as the $\ell_2$-norm distance between the feature statistics of the synthetic images and those of the real images:
$$\mathcal{L}_{bn} = \sum_{l} \Big( \big\| \mu_l(\hat{x}) - \mu_l \big\|_2 + \big\| \sigma_l^2(\hat{x}) - \sigma_l^2 \big\|_2 \Big)$$
where $\mu_l(\hat{x})$ and $\sigma_l^2(\hat{x})$ are respectively the mean and variance of the feature map of the synthetic image $\hat{x}$ at the $l$-th layer of the teacher model, and $\mu_l$ and $\sigma_l^2$ are the mean and variance stored in the $l$-th layer of the teacher model, representing the feature statistics of the real images.
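One plausible way to compute $\mathcal{L}_{bn}$ is with forward hooks on the teacher's batch-normalization layers, as sketched below; the hook mechanics are an illustrative assumption rather than the patent's stated implementation.

```python
import torch
import torch.nn as nn

def bn_statistics_loss(teacher: nn.Module, x_hat: torch.Tensor) -> torch.Tensor:
    """L_bn: l2 distance between the batch statistics of the synthetic
    images and the running statistics stored in the teacher's
    BatchNorm layers (a sketch)."""
    losses = []

    def make_hook(bn):
        def hook(module, inputs, output):
            feat = inputs[0]
            mu = feat.mean(dim=(0, 2, 3))                  # batch mean per channel
            var = feat.var(dim=(0, 2, 3), unbiased=False)  # batch variance per channel
            losses.append(torch.norm(mu - bn.running_mean, 2)
                          + torch.norm(var - bn.running_var, 2))
        return hook

    handles = [m.register_forward_hook(make_hook(m))
               for m in teacher.modules() if isinstance(m, nn.BatchNorm2d)]
    teacher(x_hat)
    for h in handles:
        h.remove()
    return torch.stack(losses).sum()
```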
Therefore, in the picture synthesis process, the noise vector and the generator are jointly trained according to the student-feedback and teacher-feedback loss functions, with total loss:
$$\mathcal{L}_{gen} = \mathcal{L}_{tf} - \alpha\, \mathcal{L}_{sf}$$
where $\alpha$ is a hyperparameter weight used to balance the two loss terms, the student-feedback term carrying a negative sign because it is to be enlarged. In the specific implementation, $\alpha$ is set to 10.
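Putting the pieces together, one synthesis step might be sketched as follows, reusing student_feedback_loss and bn_statistics_loss from the sketches above; the optimizer choice, learning rate, step count and label sampling are all assumptions.

```python
import torch
import torch.nn.functional as F

def synthesis_step(generator, teacher, student_features, aux_classifier,
                   batch_size=128, latent_dim=100, num_classes=10,
                   steps=200, alpha=10.0):
    """Jointly optimize a noise vector and the generator so that the
    synthetic pictures satisfy teacher feedback while remaining hard
    for the student: L_gen = L_tf - alpha * L_sf (a sketch)."""
    z = torch.randn(batch_size, latent_dim, requires_grad=True)
    y_hat = torch.randint(0, num_classes, (batch_size,))  # predefined labels
    opt = torch.optim.Adam([z] + list(generator.parameters()), lr=1e-3)
    for _ in range(steps):
        x_hat = generator(z)
        l_cls = F.cross_entropy(teacher(x_hat), y_hat)    # one-hot assumption term
        l_bn = bn_statistics_loss(teacher, x_hat)         # BN statistics term (sketched above)
        l_sf = student_feedback_loss(x_hat, y_hat,
                                     student_features, aux_classifier)
        loss = l_cls + l_bn - alpha * l_sf                # enlarge the student-feedback loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator(z).detach(), y_hat
```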
S3: training the student model by knowledge distillation using the synthetic pictures obtained in S2, and simultaneously independently training the auxiliary classifier to learn the auxiliary task;
The overall loss function for training the student model by knowledge distillation is:
$$\mathcal{L}_{kd} = \mathcal{L}_{ce} + \mathcal{L}_{kl} + \beta\, \mathcal{L}_{fm}$$
where $\beta$ is a hyperparameter weight used to balance the three loss terms; in the specific implementation, $\beta$ is set to 30. $\mathcal{L}_{ce}$ is the conventional loss term of the original image classification task, the cross entropy between the student model output and the predefined label. $\mathcal{L}_{kl}$ is the KL divergence between the teacher and student outputs, formulated as:
$$\mathcal{L}_{kl} = \tau^2\, \mathrm{KL}\Big( \sigma\big(f_T(\hat{x})/\tau\big) \,\Big\|\, \sigma\big(f_S(\hat{x})/\tau\big) \Big)$$
where $\sigma(\cdot)$ is the softmax function and $\tau$ is a hyperparameter that smooths the output distributions; in the specific implementation, $\tau$ is set to 20. $\mathcal{L}_{fm}$ is the mean square error between the feature map $F_T$ of the last layer of the teacher model and the feature map $F_S$ of the last layer of the student model, formulated as:
$$\mathcal{L}_{fm} = \big\| F_T - r(F_S) \big\|_2^2$$
where $r(\cdot)$ is a mapping operation that aligns the dimensions of the feature maps; in the specific implementation, it consists of three convolution blocks of sizes 1 × 1, 3 × 3 and 1 × 1.
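A hedged PyTorch sketch of this distillation loss follows; the reductions used for each term and the placement of the single weight β are assumptions, as is the channel layout handled by the mapping r(·).

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, student_feat, teacher_feat,
            labels, mapper, beta=30.0, tau=20.0):
    """L_kd = L_ce + L_kl + beta * L_fm (a sketch of the overall
    distillation objective)."""
    l_ce = F.cross_entropy(student_logits, labels)        # classification term
    # temperature-smoothed KL(teacher || student), scaled by tau^2
    l_kl = F.kl_div(F.log_softmax(student_logits / tau, dim=1),
                    F.softmax(teacher_logits / tau, dim=1),
                    reduction="batchmean") * tau ** 2
    # feature matching after mapping the student features with r(.)
    l_fm = F.mse_loss(mapper(student_feat), teacher_feat)
    return l_ce + l_kl + beta * l_fm

def make_mapper(c_in, c_out):
    """r(.): three convolution blocks (1x1, 3x3, 1x1) aligning the
    student's last feature map with the teacher's dimensions."""
    return torch.nn.Sequential(
        torch.nn.Conv2d(c_in, c_out, kernel_size=1),
        torch.nn.Conv2d(c_out, c_out, kernel_size=3, padding=1),
        torch.nn.Conv2d(c_out, c_out, kernel_size=1),
    )
```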
Independently training the auxiliary classifier specifically comprises: after the student completes each training iteration, the parameters of the student model are fixed, and the parameters of the auxiliary classifier are trained and updated with the loss function $\mathcal{L}_{sf}$, improving the evaluation ability of the auxiliary classifier so that the student's learning ability is estimated more accurately during picture synthesis.
S4: s2 and S3 are repeated until the student model is trained to converge.
Experiments with the student-feedback-based data-free knowledge distillation method were performed on two public image classification data sets, CIFAR10 and CIFAR100, whose pictures are all 32 × 32. CIFAR10 comprises 10 classes, each with 5000 training pictures and 1000 test pictures; CIFAR100 comprises 100 classes, each with 500 training pictures and 100 test pictures. The training pictures are used only to train the teacher model, yielding a pre-trained teacher that is invisible to the student model; the test pictures are used to evaluate prediction accuracy. The teacher model uses the WRN-40-2 network structure and the student model uses WRN-16-1.
First, to demonstrate the superiority of the invention, it was compared with other prior-art methods. The experimental results are shown in Table 1: the invention clearly outperforms the other methods and obtains a better-performing student model.
TABLE 1 Prediction accuracy of each method's model
[Table 1 is provided as an image in the original publication; it reports the prediction accuracy of DFAL, ZSKT, ADI, CMI and the present invention.]
Here, DFAL is from Chen, Hanting, Yunhe Wang, Chang Xu, Zhaohui Yang, Chuanjian Liu, Boxin Shi, Chunjing Xu, Chao Xu, Qi Tian, Data-Free Learning of Student Networks;
ZSKT is from Micaelli, Paul, Amos J. Storkey, Zero-Shot Knowledge Transfer via Adversarial Belief Matching;
ADI is from Yin, Hongxu, Pavlo Molchanov, Zhizhong Li, José Manuel Álvarez, Arun Mallya, Derek Hoiem, Niraj Kumar Jha, Jan Kautz, Dreaming to Distill: Data-Free Knowledge Transfer via DeepInversion;
CMI is from Fang, Gongfan, Jie Song, Xinchao Wang, Chen Shen, Xingen Wang, Mingli Song, Contrastive Model Inversion for Data-Free Knowledge Distillation.
To further demonstrate the importance of student feedback in the picture synthesis process, an ablation experiment was conducted: during data synthesis, when training the noise vector and the generator, the student-feedback loss $\mathcal{L}_{sf}$ was removed so that synthesis relied on teacher feedback alone. The experimental results are shown in Table 2, which indicates that student feedback yields a better-performing student model.
Table 2 Ablation experimental results
[Table 2 is provided as an image in the original publication; it compares synthesis with and without the student-feedback loss.]
Another embodiment of the invention provides a method for image classification using the student-feedback-based data-free knowledge distillation method, comprising:
Obtaining a pre-trained teacher model: taken from a model already trained on the training set of an image classification data set.
Picture synthesis: jointly training a noise vector and a generator according to feedback from the student model and the teacher model; after training, the noise vector is input into the generator, whose output is the required synthetic pictures.
Knowledge distillation: training the student model by knowledge distillation using the synthesized pictures, while independently training the auxiliary classifier used to feed back the student's state in the picture synthesis stage. The picture synthesis and knowledge distillation processes alternate until the student model converges.
Image classification: inputting the picture to be predicted into the trained student model; the output is the class of the picture.
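A short usage sketch of this final classification step, assuming the trained student follows the feature-extractor-plus-head convention used in the sketches above:

```python
import torch

@torch.no_grad()
def classify(student_features, student_head, picture: torch.Tensor) -> int:
    """Predict the class of a single 32x32 picture with the trained
    student model (a usage sketch)."""
    logits = student_head(student_features(picture.unsqueeze(0)).flatten(1))
    return int(logits.argmax(dim=1).item())
```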

Claims (7)

1. A data-free knowledge distillation method based on student feedback, comprising the following steps:
S1: initializing a student model, and adding an auxiliary classifier after the feature extractor of the student model;
S2: using the auxiliary classifier to feed back the current learning ability of the student model, and jointly training a noise vector and a generator according to the student-feedback and teacher-feedback loss functions, thereby obtaining optimal synthetic pictures;
S3: training the student model by knowledge distillation using the synthetic pictures obtained in S2, and simultaneously independently training the auxiliary classifier to learn an auxiliary task;
S4: repeating S2 and S3 until the student model is trained to convergence.
2. The data-free knowledge distillation method based on student feedback according to claim 1, wherein using the auxiliary classifier in S2 to feed back the current learning ability of the student model comprises:
randomly generating a noise vector $z$ and inputting it into the generator network $G$ to obtain a synthetic picture $\hat{x} = G(z)$; rotating the synthetic picture $\hat{x}$ by a certain angle to obtain the rotated picture $\hat{x}_k$; inputting $\hat{x}_k$ into the student model feature extractor $f_S$; inputting the resulting feature representation $f_S(\hat{x}_k)$ into the auxiliary classifier $c$; and using the output $c(f_S(\hat{x}_k))$ of the auxiliary classifier to calculate a loss function quantifying the current learning ability of the student model, i.e., the student-feedback loss:
$$\mathcal{L}_{sf} = \mathrm{CE}\big(c(f_S(\hat{x}_k)),\, k\big)$$
wherein $k$ denotes the class label of the self-supervision-augmented task, which treats the self-supervised rotation task and the original image classification task as one joint task.
3. The data-free knowledge distillation method based on student feedback according to claim 2, wherein the classes of the self-supervision-augmented task are defined as follows:
the total number of classes of the original image classification task is given as $N$ and that of the self-supervised rotation task as $M$; supposing a synthetic picture $\hat{x}$ belongs to class $n$ in the image classification task and class $m$ in the self-supervised rotation task, it belongs to class $k = n \times M + m$ in the self-supervision-augmented task.
4. The data-free knowledge distillation method based on student feedback according to claim 3, wherein the teacher-feedback loss function in S2 is specifically:
$$\mathcal{L}_{tf} = \mathcal{L}_{cls} + \mathcal{L}_{bn}$$
wherein $\mathcal{L}_{cls}$ is the cross entropy between the output $f_T(\hat{x})$ of the teacher model $f_T$ and the predefined label $\hat{y}$ of the image classification task, formulated as:
$$\mathcal{L}_{cls} = \mathrm{CE}\big(f_T(\hat{x}),\, \hat{y}\big)$$
and $\mathcal{L}_{bn}$ is the $\ell_2$-norm distance between the feature statistics of the synthetic images and those of the real images, formulated as:
$$\mathcal{L}_{bn} = \sum_{l} \Big( \big\| \mu_l(\hat{x}) - \mu_l \big\|_2 + \big\| \sigma_l^2(\hat{x}) - \sigma_l^2 \big\|_2 \Big)$$
wherein $\mu_l(\hat{x})$ and $\sigma_l^2(\hat{x})$ are respectively the mean and variance of the feature map of the synthetic image $\hat{x}$ at the $l$-th layer of the teacher model, and $\mu_l$ and $\sigma_l^2$ are the mean and variance stored in the $l$-th layer of the teacher model, representing the feature statistics of the real images.
5. The data-free knowledge distillation method based on student feedback according to claim 4, wherein in S2 the noise vector and the generator are jointly trained according to the student-feedback and teacher-feedback loss functions, the total loss function being:
$$\mathcal{L}_{gen} = \mathcal{L}_{tf} - \alpha\, \mathcal{L}_{sf}$$
wherein $\alpha$ is a hyperparameter weight used to balance the two loss terms.
6. The data-free knowledge distillation method based on student feedback according to claim 5, wherein in S3 the overall loss function for training the student model by knowledge distillation is:
$$\mathcal{L}_{kd} = \mathcal{L}_{ce} + \mathcal{L}_{kl} + \beta\, \mathcal{L}_{fm}$$
wherein $\beta$ is a hyperparameter weight used to balance the three loss terms; $\mathcal{L}_{ce}$ is the conventional loss term of the original image classification task, the cross entropy between the student model output and the predefined label; $\mathcal{L}_{kl}$ is the KL divergence between the teacher and student outputs, formulated as:
$$\mathcal{L}_{kl} = \tau^2\, \mathrm{KL}\Big( \sigma\big(f_T(\hat{x})/\tau\big) \,\Big\|\, \sigma\big(f_S(\hat{x})/\tau\big) \Big)$$
wherein $\sigma(\cdot)$ is the softmax function and $\tau$ is a hyperparameter smoothing the output distributions; and $\mathcal{L}_{fm}$ is the mean square error between the feature map $F_T$ of the last layer of the teacher model and the feature map $F_S$ of the last layer of the student model, formulated as:
$$\mathcal{L}_{fm} = \big\| F_T - r(F_S) \big\|_2^2$$
wherein $r(\cdot)$ is a mapping operation that aligns the dimensions of the feature maps.
7. The data-free knowledge distillation method based on student feedback according to claim 6, wherein in S3 the independent training of the auxiliary classifier specifically comprises: after the student completes each training iteration, fixing the parameters of the student model and then training and updating the parameters of the auxiliary classifier according to the loss function $\mathcal{L}_{sf}$.
CN202211028120.3A 2022-08-25 2022-08-25 Data-free knowledge distillation method based on student feedback Pending CN115409157A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211028120.3A CN115409157A (en) Data-free knowledge distillation method based on student feedback

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211028120.3A CN115409157A (en) Data-free knowledge distillation method based on student feedback

Publications (1)

Publication Number Publication Date
CN115409157A true CN115409157A (en) 2022-11-29

Family

ID=84162036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211028120.3A Pending CN115409157A (en) Data-free knowledge distillation method based on student feedback

Country Status (1)

Country Link
CN (1) CN115409157A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116630724A (en) * 2023-07-24 2023-08-22 美智纵横科技有限责任公司 Data model generation method, image processing method, device and chip
CN116630724B (en) * 2023-07-24 2023-10-10 美智纵横科技有限责任公司 Data model generation method, image processing method, device and chip
CN117576518A (en) * 2024-01-15 2024-02-20 第六镜科技(成都)有限公司 Image distillation method, apparatus, electronic device, and computer-readable storage medium
CN117576518B (en) * 2024-01-15 2024-04-23 第六镜科技(成都)有限公司 Image distillation method, apparatus, electronic device, and computer-readable storage medium

Similar Documents

Publication Publication Date Title
CN110837870B (en) Sonar image target recognition method based on active learning
CN111695467B (en) Spatial spectrum full convolution hyperspectral image classification method based on super-pixel sample expansion
CN115409157A (en) Data-free knowledge distillation method based on student feedback
CN111340738B (en) Image rain removing method based on multi-scale progressive fusion
CN108288035A (en) The human motion recognition method of multichannel image Fusion Features based on deep learning
Xu et al. Open-ended visual question answering by multi-modal domain adaptation
CN110705591A (en) Heterogeneous transfer learning method based on optimal subspace learning
CN111008650B (en) Metallographic structure automatic grading method based on deep convolution antagonistic neural network
CN114357221B (en) Self-supervision active learning method based on image classification
CN112200797B (en) Effective training method based on PCB noise labeling data
CN114049515A (en) Image classification method, system, electronic device and storage medium
CN113591978A (en) Image classification method, device and storage medium based on confidence penalty regularization self-knowledge distillation
CN114998602A (en) Domain adaptive learning method and system based on low confidence sample contrast loss
CN114329031A (en) Fine-grained bird image retrieval method based on graph neural network and deep hash
Ye et al. Learning cross-domain representations by vision transformer for unsupervised domain adaptation
CN117078656A (en) Novel unsupervised image quality assessment method based on multi-mode prompt learning
CN117011515A (en) Interactive image segmentation model based on attention mechanism and segmentation method thereof
CN115457299B (en) Matching method of sensor chip projection photoetching machine
CN116935438A (en) Pedestrian image re-recognition method based on autonomous evolution of model structure
CN116720106A (en) Self-adaptive motor imagery electroencephalogram signal classification method based on transfer learning field
CN109145749B (en) Cross-data-set facial expression recognition model construction and recognition method
CN116188428A (en) Bridging multi-source domain self-adaptive cross-domain histopathological image recognition method
CN115063374A (en) Model training method, face image quality scoring method, electronic device and storage medium
CN115273135A (en) Gesture image classification method based on DC-Res2Net and feature fusion attention module
CN114821219A (en) Unsupervised multi-source field self-adaption method based on deep joint semantics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination