CN113112020B

CN113112020B - Model network extraction and compression method based on generation network and knowledge distillation

Info

Publication number: CN113112020B
Application number: CN202110320646.8A
Authority: CN
Inventors: 曾一锋; 林晓晴; 杨帆
Original assignee: Xiamen University
Current assignee: Xiamen University
Priority date: 2021-03-25
Filing date: 2021-03-25
Publication date: 2022-06-28
Anticipated expiration: 2041-03-25
Also published as: CN113112020A

Abstract

The invention discloses a model network extraction and compression method based on a generation network and knowledge distillation, which comprises the following steps: training a loss function of the generated network by using the trained teacher network to obtain a trained generated network; generating a plurality of generated pictures according to the generation network; inputting the generated pictures into a trained teacher network and a trained student network, and carrying out knowledge distillation on the student network; updating the student network; when facing a large network, the method provided by the invention can only learn the classification knowledge of specific categories in the large network according to different tasks and migrate the classification knowledge to a smaller network. Meanwhile, the method of the invention can rely on the data less, and the knowledge distillation is carried out under the condition of no data, thereby reducing the dependence of the original knowledge distillation on the real data.

Description

Model network extraction and compression method based on generation network and knowledge distillation

Technical Field

The invention relates to the field of error compensation, in particular to a model network extraction and compression method based on a generation network and knowledge distillation.

Background

In the field of artificial intelligence, in order to solve different problems, people propose more and more complex network structures, the network scale is larger and larger, and meanwhile, the problem is that in practical projects, due to the limitation of hardware resource computing capacity and the like, a large-scale network with excellent performance is difficult to apply, so that a plurality of methods such as knowledge distillation and the like can compress and accelerate the trained large-scale network. Meanwhile, for a trained network, according to the actual requirements, in many cases, the requirements of the desired network may not be all task targets of the original network, but only some task targets thereof, for example, a large network is available to implement 1000 classes of classification tasks on ImageNet, however, in actual applications, the task targets may not need the above 1000 classes but 10 classes of task targets.

For network compression and speed-up technologies, some classical methods are researched and improved at present. Related researchers put forward a new hash coding mode for the neural network to accelerate the operation of the network, and carry out hash mapping on parameters, wherein the parameters in the same hash bucket contribute to a weight value. The positions of a model pruning filter and a full connection layer are determined by evaluating the filter and the full connection neurons in the trained network. In addition, regularization induction updating is carried out on the weights through kernel sparsification, so that the weights of the kernels are more sparse and then the kernels are conveniently cut. Besides reducing the capacity of the models by pruning in the trained models through various methods, Hinton proposes the concept of knowledge distillation, and the aim of transferring the knowledge learned by the teacher network to a smaller student network is fulfilled by fitting the teacher model by enabling the output labels of the student models and the output labels of the teacher model to be as close as possible. Compared with the method, the method can be separated from the limit of the model structure between model compression.

At present, some classical network compression methods are researched, some methods are pruning operations based on an original model, some methods are network acceleration by using kernel sparseness, and some methods are compression by redesigning a smaller network by using distillation.

Although the model after network compression can be free from the constraint of the original trained model directly through technologies such as distillation, the distillation process has strong dependence on the original data set. And the task targets of the network before and after distillation are not changed, and only the knowledge of the network part can not be migrated.

Disclosure of Invention

The invention mainly aims to overcome the defects in the prior art, and provides a model network extraction and compression method based on a generated network and knowledge distillation, so that when a large network is faced, only specific classification knowledge in the large network can be learned according to different tasks and transferred to a smaller network; meanwhile, the data can be relied on less, knowledge distillation is carried out under the condition of no data, and the dependence of original knowledge distillation on real data is reduced.

The invention adopts the following technical scheme:

A model network extraction and compression method based on generation network and knowledge distillation comprises the following steps:

training a loss function of the generated network by using the trained teacher network to obtain a trained generated network;

generating a plurality of generated pictures according to the generation network;

inputting the generated pictures into a trained teacher network and a trained student network, and carrying out knowledge distillation on the student network;

and updating the student network.

Specifically, the method for obtaining the trained generated network by using the trained teacher network to train the loss function of the generated network specifically includes:

outputting the classification result of the teacher network generating the network generated picture by using the trained teacher network as feedback;

generating a loss function of the network by using feedback calculation;

the gradient of the loss function is calculated and the parameters of the generator network are updated. And when the output of the picture generated by the generating network to the teacher network and the classification result of the teacher network to the real picture output meet the set requirements, obtaining the trained generating network.

Specifically, a trained teacher network is used for training a loss function of a generated network to obtain the trained generated network, where the loss function specifically is as follows:

wherein,

The teacher network generates cross entropy loss of the picture to the generator;

outputting the information entropy of the target task;

generating the probability that the image is judged as the target category by the teacher network;

distance of the network output feature map; alpha, beta, gamma and delta are the weights of the three loss functions, and the value range is 0 to 1.

In particular, in the loss function

The method comprises the following specific steps:

wherein,

for the output of the teacher network for generating the picture,

outputting the obtained pseudo label by the teacher network for generating the picture; n is the number of pictures a generator generates a batch.

In particular, in the loss function

And

the method specifically comprises the following steps:

wherein N is the total number of the trained model task categories; m is the number of task categories of the target part, M<N；p_iThe n pictures are distinguished as the frequency of the ith category for the teacher network.

In particular, in the loss function

The method specifically comprises the following steps:

wherein the real image is defined as

The image generated by the generator is defined as

Generating a mean value of the pictures;

the variance of the picture is generated, l is the first of the network.

Specifically, the generated pictures are input into a teacher network and a student network, and knowledge distillation is carried out on the student network, and the method specifically comprises the following steps:

a set of n random vectors z ¹,z²,…,zⁿInputting the result into the generated network, and obtaining the output result of the generated network as follows:

respectively inputting the generated pictures into a teacher network and a student network to obtain the output of the teacher network

And output of student network

With knowledge distillation, the optimization objective function of the student network is:

wherein W_SIs a parameter of the student network.

As can be seen from the above description of the present invention, compared with the prior art, the present invention has the following advantages:

(1) the model network extraction and compression method based on the generated network and knowledge distillation provided by the invention utilizes the trained teacher network to train the loss function of the generated network to obtain the trained generated network; generating a plurality of generated pictures according to the generation network; inputting the generated pictures into a trained teacher network and a trained student network, and carrying out knowledge distillation on the student network; updating the student network; the invention combines the picture generation technology for generating the network and the knowledge distillation technology in network compression, can purposefully distill all kinds of knowledge learned by a large network, and only extracts a part of interested target knowledge to be in a smaller network. Meanwhile, the technology utilizes the generation network to design the loss function which accords with the classification distribution in the middle of the original teacher network, and reduces the dependence on real data in the small network training process.

Drawings

FIG. 1 is a model network extraction and compression method based on a generation network and knowledge distillation provided by an embodiment of the invention.

The invention is described in further detail below with reference to the figures and specific examples.

Detailed Description

The invention is further described below by means of specific embodiments.

Knowledge distillation is a process proposed by Hinton et al that enables knowledge learning between networks, which may be of different structures or of similar structures but of different capacities. Conventional knowledge distillation requires a trained, well-behaved network as a teacher network, which is typically complex, and a smaller network designed based on task requirements as a student network. The knowledge distillation considers that the output result of the last layer of the teacher network contains rich knowledge learned by the model, and the knowledge is reflected by the output distribution, so that the output of the student network is expected to learn the distribution output by the last layer of the teacher network, and the process of transferring the knowledge of the teacher network to a smaller student network is realized by the way.

The invention combines the picture generation technology for generating the network and the knowledge distillation technology in network compression, can purposefully distill all kinds of knowledge learned by a large network, and only extracts a part of interested target knowledge to be in a smaller network. Meanwhile, the technology utilizes the generation network to design the loss function which accords with the classification distribution in the middle of the original teacher network, and reduces the dependence on real data in the small network training process.

Referring to fig. 1, a flowchart of a model network extraction and compression method based on a generated network and knowledge distillation provided in an embodiment of the present invention specifically includes the following steps:

s101: training a loss function of the generated network by using the trained teacher network to obtain a trained generated network;

when a trained teacher network is owned, the teacher network is supposed to contain valuable information in the training process, and the valuable information of the knowledge is represented in the output result of the teacher network for the calculation of input data in the network. The goal is to let the generator learn the output expression of the teacher's network, let the picture generated by the generator be more likely to be considered as a "normal" image by the teacher's network, and be able to successfully identify the category result of the small task object, thereby completing the process of knowledge extraction and transfer. Therefore, the output of the teacher network on the synthesized image can be used as an important index for the learning of the generator, so that the output result of the image generated by the generator on the teacher network, even the result of the middle layer, can approach the result of the flow of the real image in the teacher network;

the learning process of generating the network is as follows: and training parameters of the generated network by using the output result of the teacher network on the generated network generated picture as feedback, so that the output result of the generated picture on the teacher network by the generator is as close as possible to the output result of the real picture on the teacher network. The label for generating the network generated picture is obtained from the result of the teacher network.

Training a loss function of the generated network by using the trained teacher network to obtain the trained generated network, wherein the loss function specifically comprises the following steps:

wherein,

outputting the information entropy of the target task;

Said loss boxIn counting

The method specifically comprises the following steps:

wherein,

for the output of the teacher network for generating the picture,

In the loss function

And

the method specifically comprises the following steps:

wherein N is the total number of the trained model task categories; m is the number of task categories of the target part, M<N；p_iThe frequency for distinguishing n pictures as the ith category for the teacher network.

In the loss function

The method specifically comprises the following steps:

wherein the real image is defined as

The image generated by the generator is defined as

Generating a mean value of the pictures;

generating a variance of the picture, l being the first of the network;

specifically; the loss function for the design generator is:

the three components of the loss function have their respective optimization objectives:

It is possible to make the generated image a composite image in the output layer that is fully compatible with the teacher's network, in other words, by

The generator can be made to learn how to make the generated picture more recognizable by the teacher network to success. Considering the pseudo label given to the generator by the teacher network as a real label, then

And

from the angle of the information entropy, the information entropy of the target category is obtained by calculating the output distribution of the teacher network to the target category: in information theory, how much a network output value contains the required information is expressed by quantifying the probability distribution of the output. If the probability that the image generated by the generator is judged to be the target category is higher after the image enters the teacher network, according to the information entropy theory, the uncertainty of the network for outputting the generated image to be the target category is smaller, the information quantity is smaller, the obtained result is also smaller,

the smaller the size, but the entropy of the learning target class is still insufficient, and when the class imbalance occurs, the minimum value may still be obtained, but in this case, the balance of the number of samples of the generated pictures cannot be guaranteed. In view of this, introduce

When the frequency distribution of each category is equal, the loss value is the minimum, namely, the generation network can generate the image of each category in the task target according to the average probability, so that the purpose of generating the image categories in a balanced manner is achieved.

And

the specific expression form is as follows:

to take into account the quality of the generated picture, a regularization term for the image is added to the loss function of the generator

Suppose a true image is defined as

The image generated by the generator is defined as

In order to ensure the similarity of the extracted features of the generated image and the real image in the middle layer of the teacher network, the target problem is converted into minimizing the distance between the feature maps of the generated image and the real image in the middle layer. Assuming that the features extracted by the intermediate layer follow a Gaussian distribution, the regularization of the distances between feature maps can be defined as

When the calculation of the real image is lacked, the mean and variance of the data distribution of the real image can be obtained by using the output of a BN layer in the teacher network, and the formula is expressed as follows:

although it is unknown how the teacher network is trained, it can be known that when the network introduces batch processing, it can capture the mean and variance of the network after batch processing the input, so the mean and variance of the network with respect to the real data can be obtained approximately, and therefore the distance of the network output feature map can be defined as:

s102: generating a plurality of generated pictures according to the generation network;

s103: inputting the generated pictures into a trained teacher network and a trained student network, and carrying out knowledge distillation on the student network;

Knowledge distillation is a process proposed by Hinton et al to achieve knowledge learning between networks, which may be of different structures or of similar structures but of different capacities. Traditional knowledge distillation requires a trained, well-behaved network as the teacher's network, which is typically complex, while a smaller network designed based on task requirements is the student's network. The knowledge distillation considers that the output result of the last layer of the teacher network contains rich knowledge learned by the model, and the knowledge is reflected by the output distribution, so that the output of the student network is expected to learn the output distribution of the last layer of the teacher network.

In the technology of the invention, the distillation process comprises the following specific steps: for a set of n random vectors z¹,z²,…,zⁿInputting the set into the generation network, and obtaining the output result of the generation network as:

the generated pictures are respectively input into a teacher network and a student network, so that the output of the teacher network can be obtained

And output of student network

Wherein W_SIs a parameter of the student network.

S104: and updating the student network.

Experiments were performed on three general Image type datasets using model network knowledge extraction and compression techniques based on a combination of generative networks and knowledge distillation, with cifar10, cifar100 and Natural Scene Image Classification, with cifar10 and cifar100 Image data each being 32 x 3 in size and Natural Scene Image Classification Image data being 112 x 3 in size. The task goal of the trained model is image classification, and the task goal of the smaller network model is to achieve classification of some of the image classes in the dataset. The teacher network selects a trained Resnet34 network structure and the student network selects a Resnet18 network structure. The results are shown in the following table:

the method has the advantages that the generated network is utilized to directly transfer part of the task knowledge needed by the trained network knowledge of the large-scale teacher, so that a good partial knowledge distillation process can be performed on the accuracy of part of the task targets of the original model. Moreover, partial task knowledge migration with different classification quantities on the original model has a better effect.

The network knowledge extraction and compression technology based on the combination of the generation network and the knowledge distillation, which is provided by the invention, can only learn the classification knowledge of a specific class in a large network according to different tasks when facing the large network, and migrate the classification knowledge to a smaller network. Meanwhile, the method can rely on the data less, knowledge distillation is carried out under the condition of no data, and the dependence of original knowledge distillation on real data is reduced.

In addition, the method mainly focuses on the directions of model compression and task object category extraction, and due to the improvement of computing capacity of high-speed computing equipment and the sharing convenience of network resources, people can obtain a trained network more and more easily, however, how to extract part of task knowledge in the network and move the task knowledge to a smaller network is a practical problem, the two problems can be solved simultaneously by the method, and the method can flexibly change different practical application requirements. On the basis of possessing a large-scale network which is trained, the method can reduce the possibility that the training of a small-scale network has a better classification effect on part of task targets, and can be more conveniently applied to various systems.

The above description is only an embodiment of the present invention, but the design concept of the present invention is not limited thereto, and any insubstantial modifications made by using the design concept should fall within the scope of the invention.

Claims

1. A model network extraction and compression method based on generation network and knowledge distillation is characterized by comprising the following steps:

training the trained teacher network to generate a loss function of the network by using a cifar10, a cifar100 and a Natural Scene Image Classification Image data set to obtain a trained generated network, wherein the trained teacher network task target is Image Classification;

updating the student network;

wherein,

outputting the information entropy of the target task;

distance of the network output feature map; α, β, γ, δ are

And

weights of the four loss functions in the loss function of the generator range from 0 to 1;

the generated pictures are input into a teacher network and a student network, knowledge distillation is carried out on the student network, and the method specifically comprises the following steps:

a set of n random vectors z¹,z²,…,zⁿAnd inputting the data into the generated network, wherein the output result of the generated network is as follows:

And output of student network

Optimization objective using knowledge distillation, student networkThe standard function is:

wherein W_SIs a parameter of the student network.

2. The model network extraction and compression method based on generation network and knowledge distillation as claimed in claim 1, wherein the trained teacher network is used to train the loss function of the generation network to obtain the trained generation network, and the method specifically comprises:

generating a loss function of the network by using feedback calculation;

calculating the gradient of the loss function, and updating the parameters of the generator network; and when the output of the picture generated by the generating network to the teacher network and the classification result of the teacher network to the real picture output meet the set requirements, obtaining the trained generating network.

3. The method of claim 1, wherein the loss function is a function of model network extraction and compression based on a generation network and knowledge distillation

The method specifically comprises the following steps:

wherein,

for the output of the teacher network for generating the picture,

outputting the obtained pseudo label by the teacher network for generating the picture; m is the number of pictures that the generator generates a batch.

4. The method of claim 3, wherein the loss function is a function of model network extraction and compression based on a generation network and knowledge distillation

And

the method specifically comprises the following steps:

wherein N is the total number of the trained model task categories; m is the number of task categories of the target part, M<N；p_iThe m pictures are distinguished as the frequency of the ith category for the teacher network.

5. The method of claim 3, wherein the loss function is a function of model network extraction and compression based on a generation network and knowledge distillation

The method specifically comprises the following steps:

wherein, the real image is defined as x ∈ χ, and the image generated by the generator is defined as

Generating a mean value of the picture;

to generate the variance of the picture, l is the ith layer of the network.