CN114898165A - Deep learning knowledge distillation method based on model channel cutting - Google Patents
Deep learning knowledge distillation method based on model channel cutting
- Publication number
- CN114898165A (application CN202210697905.3A)
- Authority
- CN
- China
- Prior art keywords
- model
- convolution
- channel
- teacher
- teacher model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a deep learning knowledge distillation method based on model channel clipping, and in particular a deep learning knowledge distillation method based on model channel clipping for image classification. Images to be classified are input into a teacher model, and the convolution channels in each convolution layer of the teacher model are sorted from largest to smallest by the average rank of their output feature maps. The mean of the teacher model's and the student model's parameter counts is computed, and the proportion by which this mean changes the teacher model's parameter count is taken as the total compression rate for channel clipping. Redundant convolution channels are clipped with a channel-clipping technique to obtain an intermediate model. Knowledge distillation is then performed on the student model using the intermediate model to obtain a new knowledge distillation objective function, and the student model is trained with it to obtain a trained student model. The invention belongs to the field of knowledge distillation.
Description
Technical Field
The invention relates to a knowledge distillation method, in particular to a deep learning knowledge distillation method based on model channel clipping for image classification, and belongs to the field of knowledge distillation.
Background
In deep learning research, models represented by convolutional neural networks have gradually shown strong performance and excellent development potential in application fields such as image classification, object detection, and autonomous driving. In practical applications, however, the parameter counts and computation costs of deep learning models are excessive while the computing capacity of hardware devices is limited, so it is difficult for a deployed deep learning model to satisfy both the real-time and the accuracy requirements of a computing task. Compressing a model while keeping its accuracy unchanged is therefore a prominent challenge in applying deep learning models.
Traditional deep-model compression methods generally clip connections according to the importance of neural-network connection weights. As deep neural networks have become increasingly structured, these methods show obvious limitations during implementation and training. For structured network models, researchers have proposed structured clipping schemes based on convolution channels, as well as compression schemes such as knowledge distillation introduced from transfer learning. These methods can compress a deep learning model to a certain extent while largely preserving its accuracy. In specific applications such as image classification, however, when the network structures, parameter counts, or computation costs of the teacher model (the original model) and the student model (the target model) differ too much, the adopted knowledge distillation compression method struggles to achieve good results: the student model's image classification accuracy drops rapidly, and the final classification accuracy is low.
Disclosure of Invention
The invention provides a deep learning knowledge distillation method based on model channel clipping, aiming to solve the problem that image classification accuracy is low when a knowledge distillation compression method is used with a teacher model and student models whose parameter counts differ greatly.
The technical scheme adopted by the invention is as follows:
The method comprises the following steps:
s1, obtaining images to be classified, inputting the images to be classified into the teacher model to obtain the average rank of the feature map output by each convolution channel in each convolution layer of the teacher model, and sequencing the convolution channels in each convolution layer from large to small according to the corresponding average rank to obtain the sequenced convolution channels of each convolution layer of the teacher model;
s2, calculating parameter quantity average values of the teacher model parameter quantity and the student model parameter quantity according to the parameter quantity of the teacher model and the parameter quantity of the student model, and taking the change proportion of the parameter quantity average values and the teacher model parameter quantity as the channel cutting total compression rate;
s3, channel cutting is carried out on the redundant convolution channels in each convolution layer of the teacher model according to the convolution channels obtained in the S1 after each convolution layer is sequenced and the overall compression rate of channel cutting in the S2 by utilizing a channel cutting technology, and an intermediate model is obtained, wherein the intermediate model is a deep convolution neural network model;
s4, carrying out knowledge distillation on the student model by using the intermediate model, summing the feature maps output by each stage of the intermediate model in the knowledge distillation process, adding the summed feature maps and the original knowledge distillation objective function to obtain a new knowledge distillation objective function, and training the student model by using the new knowledge distillation objective function to obtain a trained student model;
and S5, inputting the images to be classified into the trained student model, and outputting the classification results of the images.
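The parameter arithmetic in S2 can be sketched as follows. This is a minimal Python illustration; the helper name is hypothetical, and taking the total compression rate as the mean divided by the teacher's parameter count is an assumed reading of the "change proportion".

```python
def channel_clipping_targets(teacher_params, student_params, bits_per_param=32):
    """Parameter-count mean (S2) and a total channel-clipping compression rate.

    The rate is assumed here to be mean / teacher; parameter counts are in
    millions. Also returns the FP32 storage size in millions of bits.
    """
    mean_params = (teacher_params + student_params) / 2
    total_compression_rate = mean_params / teacher_params
    # default FP32 storage: each parameter occupies 32 bits
    storage_bits = teacher_params * bits_per_param
    return mean_params, total_compression_rate, storage_bits
```

With a 400M-parameter teacher and a 200M-parameter student this gives a 300M mean; the exact form of the "change proportion" is not fixed by the text, so this reading is an assumption.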
Preferably, in S1, the images to be classified are obtained and input into the teacher model, the average rank of the feature map output by each convolution channel in each convolution layer of the teacher model is obtained, and the convolution channels in each convolution layer are sorted from largest to smallest by their corresponding average rank, obtaining the sorted convolution channels of each convolution layer of the teacher model. The specific process is as follows:
s11, C convolution channels are arranged in a certain convolution layer in the assumed teacher model;
and S12, obtaining A images to be classified, inputting the A images to be classified into the teacher model to obtain the average rank of the feature maps output by each convolution channel in the convolution layer in S11, sequencing the convolution channels according to the average rank of the feature maps output by each convolution channel from large to small to obtain the sequenced convolution channels of the convolution layers, and repeating the steps until the sequenced convolution channels of each convolution layer of the teacher model are obtained.
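The rank ordering in S11–S12 can be sketched as follows. This is a minimal NumPy illustration; the function name and the use of `numpy.linalg.matrix_rank` on each channel's 2-D feature map are assumptions, not the patent's implementation.

```python
import numpy as np

def average_channel_ranks(feature_maps):
    """Average rank of each convolution channel's feature map over A images.

    feature_maps: array of shape (A, C, H, W) -- one conv layer's outputs for
    A images to be classified and C convolution channels.
    Returns (avg_ranks, order) where order lists channel indices sorted from
    largest to smallest average rank.
    """
    A, C, H, W = feature_maps.shape
    ranks = np.zeros((A, C))
    for a in range(A):
        for c in range(C):
            # each channel's output is an H x W two-dimensional matrix
            ranks[a, c] = np.linalg.matrix_rank(feature_maps[a, c])
    avg_ranks = ranks.mean(axis=0)
    order = np.argsort(-avg_ranks)  # descending: higher rank = more informative
    return avg_ranks, order
```

Channels at the end of `order` (lowest average rank) are the candidates removed during channel clipping.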
Preferably, in S3, the channel-clipping technique is used to clip the redundant convolution channels in each convolution layer of the teacher model according to the sorted convolution channels of each layer obtained in S1 and the total channel-clipping compression rate in S2, obtaining an intermediate model. The specific process is as follows:
According to the total channel-clipping compression rate in S2, the compression rate of each convolution layer of the teacher model is allocated through hyper-parameter adjustment; the channel-clipping technique is then used to clip the redundant convolution channels in each convolution layer according to the sorted convolution channels from S1 and the allocated per-layer compression rates, so that the parameter count of the clipped teacher model equals the parameter-count mean from S2, obtaining the intermediate model.
Preferably, the specific process of the hyper-parameter adjustment is as follows:
Set a hyper-parameter p = {p_1, p_2, …, p_i, …, p_n}; the hyper-parameter p needs to satisfy the formula
∑_{i=1}^{n} p_i · Pr(C_i) = q · M    (1)
wherein n represents the total number of convolution layers, i = 1, 2, …, n;
Pr(C_i) represents the parameter count of the i-th convolution layer;
M represents the parameter count of the teacher model;
q represents the global compression rate;
each hyper-parameter p_i ∈ [0, 1] is calculated from the total channel-clipping compression rate in S2 and formula (1).
Preferably, the compression rate of each convolution layer of the teacher model is allocated through hyper-parameter adjustment according to the total channel-clipping compression rate in S2; the channel-clipping technique clips the redundant convolution channels in each convolution layer according to the sorted convolution channels from S1 and the allocated per-layer compression rates, so that the parameter count of the clipped teacher model equals the parameter-count mean from S2, obtaining the intermediate model. The specific process is as follows:
The hyper-parameter p_i is taken as the compression rate of the i-th convolution layer of the teacher model. Using the channel-clipping technique, the redundant convolution channels in each convolution layer are clipped according to the sorted convolution channels from S1 and the hyper-parameters p_i, and the parameter count of the clipped teacher model equals the parameter-count mean from S2, yielding the intermediate model.
Preferably, the new knowledge distillation objective function in S4 is:
L = t · L_f + (1 − t) · L_s    (2)
wherein L represents the new knowledge distillation objective function;
t represents a balance parameter;
L_f represents the original knowledge distillation objective function;
L_s represents the summed feature-map term.
Preferably, the original knowledge distillation objective function is:
L_f = (1/m) ∑_{j=1}^{m} [ α · D(y_j, σ(z_j^S)) + (1 − α) · D(σ(z_j^T / T), σ(z_j^S / T)) ]
wherein m represents the total number of input data;
α represents a parameter that balances the "soft target" and the "hard target";
D(·) represents cross entropy, and σ(·) the softmax function;
y_j represents the true label of the j-th input;
T represents the softening parameter of the logits;
z_j^S and z_j^T represent the logits output generated for the j-th input by the student model and the teacher model, respectively.
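A minimal NumPy sketch of such a hard/soft objective follows. The helper names are hypothetical, and since the patent's formula image is not shown, this follows the standard Hinton-style distillation loss as an assumed form.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())          # numerically stable softmax
    return e / e.sum()

def cross_entropy(target, pred, eps=1e-12):
    return -float(np.sum(np.asarray(target) * np.log(np.asarray(pred) + eps)))

def kd_loss(student_logits, teacher_logits, labels, alpha=0.5, T=4.0):
    """Hedged sketch of the original KD objective L_f: a 'hard' term against
    the true one-hot label and a 'soft' term against the temperature-softened
    teacher distribution, averaged over the m inputs."""
    m = len(labels)
    total = 0.0
    for zs, zt, y in zip(student_logits, teacher_logits, labels):
        hard = cross_entropy(y, softmax(zs))
        soft = cross_entropy(softmax(zt, T), softmax(zs, T))
        total += alpha * hard + (1.0 - alpha) * soft
    return total / m
```

When the student already matches the hard labels, the hard term is near zero; the soft term still transfers the teacher's inter-class similarity structure.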
Preferably, the summed feature-map term is:
L_s = ∑_{k=1}^{s} ‖ F_k^T − F_k^S ‖²
wherein s represents the number of stages of the intermediate model; F_k^T and F_k^S represent the feature maps output at the k-th stage by the intermediate model and the student model, respectively.
Preferably, a deep learning knowledge distillation compression system based on model channel clipping comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of any one of the above deep learning knowledge distillation methods based on model channel clipping.
Preferably, a computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of any one of the above deep learning knowledge distillation methods based on model channel clipping.
Advantageous effects:
the invention fuses a channel cutting method provided by HRank Filter Pruning High-Rank Feature Map and a multilayer Distillation method provided by Improved Knowledge Distillation via Teacher Assistant to obtain a Knowledge Distillation compression method with channel cutting, which specifically comprises the following steps: acquiring images to be classified, inputting the images to be classified into a teacher model, and sequencing convolution channels according to the average rank of a feature map output by each convolution channel in each convolution layer of the teacher model (original model) from big to small to obtain the sequenced convolution channels of each convolution layer of the teacher model; calculating parameter quantity mean values of the known teacher model parameter quantity and the student model parameter quantity according to the known teacher model parameter quantity and the known student model parameter quantity, and taking the change proportion of the parameter quantity mean values and the teacher model parameter quantity as a channel cutting total compression rate; the compression ratio of each layer of the teacher model is reasonably distributed by utilizing hyper-parameter adjustment according to the channel cutting total compression ratio, the channel cutting technology is utilized to carry out channel cutting on the redundant convolution channel of each layer of the teacher model according to the sequenced convolution channel of each layer of the teacher model and the compression ratio distributed by each layer of the teacher model, the main convolution channel of the teacher model is reserved, the parameter quantity of the teacher model after channel cutting is equal to the parameter quantity mean value, namely, the intermediate position of the parameter quantity of the teacher model after channel cutting in the teacher model (original model) and the parameter quantity of the student model (target model) is reserved as far as possible 
in the process, and therefore the compression ratio of each layer of the teacher model is convenient to balance. And restoring the precision of the teacher model by fine tuning to obtain an intermediate model, wherein the intermediate model is a deep convolutional neural network model, knowledge distillation is performed on the student models by using the intermediate model, the feature maps output by each stage of the intermediate model are added in the knowledge distillation process, the added feature maps are added with the original knowledge distillation objective function to obtain a new knowledge distillation objective function, the student models are trained by using the new knowledge distillation objective function to obtain trained student models (namely the objective models) so as to conveniently recognize pictures, and finally, the images to be classified are input into the trained student models to output the classification results of the images.
The invention integrates the advantages of channel-clipping compression and teacher-student knowledge distillation compression of deep learning models, and improves the knowledge distillation compression method. When applied, the student model (target model) can refer simultaneously to the final output of the teacher model (original model) and to the output of the intermediate model during knowledge distillation training, learning more accurate teacher knowledge. This improves the student model's image classification accuracy and yields better student performance even when the parameter counts of the teacher model (original model) and the student model (target model) differ greatly, so the final image classification results are more accurate and classification accuracy is improved. This was successfully verified experimentally, as in example 2.
Drawings
FIG. 1 is a structural flow diagram of the knowledge distillation of the present invention;
FIG. 2 is a block diagram of an intermediate model;
FIG. 3 is a diagram of a new knowledge distillation objective function;
Detailed Description
The first embodiment: this embodiment is described with reference to FIG. 1 to FIG. 3, and describes a deep learning knowledge distillation method based on model channel clipping, which includes the following steps:
s1, obtaining images to be classified, inputting the images to be classified into the teacher model, obtaining an average rank of a feature map output by each convolution channel in each convolution layer of the teacher model, sequencing the convolution channels in each convolution layer from large to small according to the corresponding average rank, and obtaining the sequenced convolution channels of each convolution layer of the teacher model, wherein the specific process is as follows:
the images are forest image maps (the classification result is the type of forest creatures), underwater images (the classification result is the type of underwater creatures), and facial expression images (the classification result is the expression of human faces).
S11, C convolution layers in the teacher model are assumedA convolution channel, the dimension of the characteristic spectrum output by the convolution layer is C x N h *N w If the size of the feature map output by any convolution channel in the convolution layer is N h *N w A two-dimensional matrix of (a);
and S12, obtaining A images to be classified, inputting the A images to be classified into the teacher model to obtain the average rank of the feature map output by each convolution channel in the convolution layer in S11, sequencing the C convolution channels according to the average rank of the feature map output by each convolution channel from large to small, determining which convolution channels are removed when the channels are cut according to the sequencing to obtain the sequenced convolution channels of the convolution layers, and repeating the steps until the sequenced convolution channels of each convolution layer of the teacher model are obtained.
S2, compute the mean of the teacher model's parameter count and the student model's parameter count from the known parameter counts, and take the proportion by which this mean changes the teacher model's parameter count as the total channel-clipping compression rate.
A model's parameter quantity is measured by count; for example, a parameter quantity of 600M means the model has 600M parameters. Models are commonly stored at FP32 precision by default, i.e. each parameter occupies 32 bits, so 600M parameters require 600M × 32 = 19200M bits of storage. Assuming the teacher model has 400M parameters and the student model has 200M, the parameter-count mean is (400M + 200M)/2 = 300M, and the total channel-clipping compression rate is then obtained from the teacher model's 400M parameters and the 300M mean.
S3, using a channel-clipping technique, clip the redundant convolution channels in each convolution layer of the teacher model according to the sorted convolution channels obtained in S1 and the total channel-clipping compression rate from S2, obtaining an intermediate model, which is a deep convolutional neural network model. The specific process is as follows:
According to the total channel-clipping compression rate in S2, the compression rate of each convolution layer of the teacher model is reasonably allocated through hyper-parameter adjustment; the redundant convolution channels in each convolution layer are clipped according to the sorted convolution channels from S1 and the allocated per-layer compression rates, the parameter count of the clipped teacher model equals the parameter-count mean from S2, and the teacher model's precision is restored by fine-tuning, obtaining the intermediate model. The specific process is as follows:
Set a hyper-parameter p = {p_1, p_2, …, p_i, …, p_n}; the hyper-parameter p needs to satisfy the formula
∑_{i=1}^{n} p_i · Pr(C_i) = q · M    (1)
wherein n represents the total number of convolution layers, i = 1, 2, …, n;
Pr(C_i) represents the parameter count of the i-th convolution layer;
M represents the parameter count of the teacher model;
q represents the global compression rate, i.e. the ratio of model parameter counts before and after compression;
the process of the hyper-parameter adjustment is to set a series of p values on the premise of meeting the formula (1), then try all the p values to obtain the corresponding model accuracy, and select a group of p values with the best accuracy to establish an intermediate model. The formula (1) can ensure that the sum of the parameter quantities of all the trimmed convolution layers meets the overall trimming rate.
The invention calculates each hyper-parameter p_i ∈ [0, 1] from the total channel-clipping compression rate in S2 and formula (1), and takes p_i as the compression rate of the i-th convolution layer of the teacher model. Using the channel-clipping technique, the redundant convolution channels in each convolution layer are clipped according to the sorted convolution channels from S1 and the hyper-parameters p_i, retaining the teacher model's main convolution channels; the parameter count of the clipped teacher model equals the parameter-count mean from S2, i.e. the intermediate model's parameter count is kept at the midpoint between the original model's and the target model's parameter counts. For example, with an 800M-parameter teacher model and a 400M-parameter student model, the intermediate model has 600M parameters. This arrangement facilitates balancing the compression rates and ensures an even distribution of per-layer compression rates. The teacher model's precision is restored by fine-tuning, giving the intermediate model. Through structured channel clipping of the deep convolutional neural network, the teacher model becomes a teacher model with a simplified convolution-channel structure, which serves as the intermediate model; i.e., the intermediate model is a deep convolutional neural network model, and the feature map output by each stage of the intermediate model can be obtained from the inherent multi-stage structure of the deep convolutional network.
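Clipping one layer's channels by average rank, given its allocated rate p_i, might look like this. The rounding rule and the helper name are illustrative assumptions.

```python
import numpy as np

def clip_layer_channels(avg_ranks, p_i):
    """Given the average ranks of a layer's C convolution channels and the
    layer's clipping rate p_i in [0, 1], drop the round(p_i * C) channels with
    the smallest average rank and keep the rest (the 'main' channels),
    returned in their original channel order."""
    C = len(avg_ranks)
    n_clip = int(round(p_i * C))
    order = np.argsort(-np.asarray(avg_ranks))   # descending by average rank
    kept = sorted(order[:C - n_clip].tolist())
    return kept
```

The kept indices then select which convolution filters (and the matching input channels of the next layer) survive into the intermediate model.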
S4, perform knowledge distillation on the student model using the intermediate model: sum the feature maps output by each stage of the intermediate model during distillation, add the summed feature-map term to the original knowledge distillation objective function to obtain a new knowledge distillation objective function, and train the student model with the new objective function to obtain the trained student model, i.e. the final student model (target model). Both the teacher model and the student model are deep convolutional neural network models. The specific process is as follows:
The new knowledge distillation objective function, obtained by adding the summed feature-map term to the original knowledge distillation objective function, considers not only the soft targets and hard targets of the teacher model's final output layer, as in the original knowledge distillation method, but also the feature map output by each stage of the intermediate model. When this knowledge distillation compression method is applied, the student model (target model) can refer simultaneously to the final output of the teacher model (original model) and to the output of the intermediate model during knowledge distillation training, learn more accurate teacher knowledge, and improve its image classification accuracy; better student performance can be obtained even when the parameter counts of the teacher model (original model) and the student model (target model) differ greatly.
The original knowledge distillation objective function is:
L_f = (1/m) ∑_{j=1}^{m} [ α · D(y_j, σ(z_j^S)) + (1 − α) · D(σ(z_j^T / T), σ(z_j^S / T)) ]
The original knowledge distillation objective function considers both the difference between the student model's output and the true labels of the data (the left, "hard target" term) and the difference between the student model's output and the teacher model's final output (the right, "soft target" term), wherein m represents the total amount of input data; α represents a parameter weighing the teacher model's "soft target" and "hard target"; D(·) represents cross entropy and σ(·) the softmax function; z_j^S represents the logits output generated by the student model for the j-th input; y_j represents the true label of the j-th input; T represents the softening parameter of the logits; z_j^T represents the logits output generated by the teacher model for the j-th input.
The summed feature-map term of the intermediate model is:
L_s = ∑_{k=1}^{s} ‖ F_k^T − F_k^S ‖²    (3)
Formula (3) measures the difference between the feature maps of the teacher-student models' intermediate outputs (via the intermediate model), where s is the number of stages of the intermediate model (typically 3 or 4), as shown in FIG. 2; F_k^S represents the feature map output by the student model at the k-th stage, and F_k^T represents the feature map output by the teacher model (intermediate model) at the k-th stage.
L = t · L_f + (1 − t) · L_s    (4)
L is the final global objective function, i.e. the new knowledge distillation objective function obtained by adding the summed feature-map term to the original knowledge distillation objective function. It takes into account both the final output of the teacher model and the feature map output by each stage of the intermediate model, with t serving as a balance parameter that adjusts the relative importance of the global target L_f and the stage target L_s.
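Formula (4) itself is a one-line combination; a trivial sketch (function name assumed):

```python
def total_objective(L_f, L_s, t=0.5):
    """Formula (4): the final global objective, balancing the original KD
    objective L_f against the stage feature-map term L_s via parameter t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("balance parameter t must lie in [0, 1]")
    return t * L_f + (1.0 - t) * L_s
```

Setting t = 1 recovers plain knowledge distillation, while smaller t emphasizes matching the intermediate model's stage features.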
And S5, inputting the images to be classified into the trained student model, and outputting the classification results of the images.
The second embodiment: this embodiment is described with reference to FIG. 1 to FIG. 3. The deep learning knowledge distillation compression system based on model channel clipping of this embodiment includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements the steps of any one of the above deep learning knowledge distillation methods based on model channel clipping.
The third embodiment: this embodiment is described with reference to FIG. 1 to FIG. 3. The computer-readable storage medium of this embodiment stores a computer program which, when executed by a processor, implements the steps of any one of the above deep learning knowledge distillation methods based on model channel clipping.
Example 1
The invention combines the multi-stage distillation scheme proposed in "Improved Knowledge Distillation via Teacher Assistant" with the structured channel-clipping technique for deep convolutional neural networks proposed in "HRank: Filter Pruning using High-Rank Feature Map". The teacher model (original model) is clipped via structured channel clipping of the deep convolutional neural network to obtain a teacher model with a simplified convolution-channel structure; this clipped teacher model serves as the intermediate model. Then, drawing on the multi-stage knowledge distillation scheme of "Improved Knowledge Distillation via Teacher Assistant", the size (parameter count and computation cost) of the intermediate model is set close to that of the assistant model, and the intermediate model trains the student model (target model) in the knowledge distillation stage. Because the intermediate model is closer to the original model, more original knowledge can be retained while training the student model; and since the student model has a small parameter count, its optimization function is easier to train to an optimal solution, so theoretically a better distillation effect can be obtained. In the final knowledge distillation stage, owing to the inherent multi-stage convolution-channel structure of deep convolutional networks, the teacher model and the student model have the same number of stages, but the depth and number of convolution channels at each stage differ; this depth difference introduces a certain optimization difficulty into the distillation process.
The invention therefore improves the objective function of knowledge distillation: the feature maps output after each stage of the intermediate model are summed and added to the original knowledge distillation objective function, so that during training the student model (the target model) can refer simultaneously to the final output of the teacher model (the original model) and to the stage outputs of the intermediate model, learning more accurate teacher knowledge.
Traditional deep learning model compression is represented by channel-clipping compression and knowledge-distillation compression, each with its own advantages and disadvantages. Channel-clipping compression can achieve a large compression ratio and preserves the depth of the model, but during compression, especially at large compression ratios, the per-layer compression ratios are difficult to allocate and many rounds of iterative fine-tuning are needed. Knowledge-distillation compression yields a more balanced sub-model structure, but the depth of the original model is lost during distillation, and it is difficult to obtain satisfactory results on distillation problems with a large size span.
The present invention uses a two-stage process. In the first stage, a channel clipping technique clips redundant convolution channels in the teacher model (the original model), retaining its main convolution channels; the teacher model is then fine-tuned to recover its accuracy, and the clipped teacher is used as an intermediate model. The intermediate model's parameter count is set to the mean of the teacher and student (target) model parameter counts, which balances the compression ratio between the two stages and ensures a balanced allocation of per-layer compression ratios. In the second stage, the intermediate model is compressed into the target model by knowledge distillation. An improved multi-stage objective function is introduced in this process: the feature maps output after each stage of the intermediate model are summed and added to the original knowledge distillation objective function to obtain a new knowledge distillation objective function, which is used to train the student model. The trained student model (target model) is then applied in subsequent concrete applications to improve application effect and accuracy.
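The first-stage sizing rule — the intermediate model's parameter count set to the mean of the teacher and student parameter counts (cf. step S2 of claim 1) — can be sketched as follows; the function name is illustrative, not from the patent:

```python
def clipping_compression_rate(teacher_params: int, student_params: int) -> float:
    """Total compression rate for the channel-clipping stage: the intermediate
    model is sized at the mean of the teacher and student parameter counts,
    so the rate is the relative reduction from the teacher."""
    intermediate_params = (teacher_params + student_params) / 2.0
    return (teacher_params - intermediate_params) / teacher_params
```

For example, a teacher twice the size of the student yields an intermediate model three quarters the teacher's size, i.e. a clipping compression rate of 0.25.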
As shown in fig. 2, a structured convolutional neural network (such as the intermediate model) is generally divided into 3-4 stages; at the end of each stage, downsampling reduces the feature-map resolution and the number of convolution channels is doubled. The difference between the teacher model (original model) and the student model (target model) lies in the depth and width of the convolution channels inside each stage. To make the second-stage distillation of the invention easier to optimize, the feature map output by each stage of the intermediate model is added into the original knowledge distillation objective function, so that besides learning the "soft targets" and "hard targets" of the teacher model, the student model can also learn intermediate information from the teacher model's reasoning process.
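One plausible reading of the per-stage feature-map term is a summed distance between matching stage outputs. The mean-squared form and the assumption of same-shaped stage outputs are illustrative choices, not stated in the patent:

```python
import numpy as np

def stage_feature_term(teacher_stages, student_stages):
    """Sum a per-stage distance between feature maps of the intermediate
    (clipped teacher) model and the student model; this scalar is added
    to the knowledge distillation objective."""
    return sum(float(np.mean((t - s) ** 2))
               for t, s in zip(teacher_stages, student_stages))
```

When the student reproduces every stage output exactly, the term vanishes, so it only penalizes deviations from the intermediate model's reasoning trace.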
Example 2
A specific experiment, taking ResNet as an example:
To verify the effectiveness of the invention, experiments were conducted with ResNet as the teacher model. The data set used was CIFAR-10, and the experimental goal was to compress the original model ResNet56 down to the parameter size of the target model ResNet32. For comparison, the following reference schemes were introduced:
1. The pre-trained original ResNet56 attains a model accuracy of 93.26;
2. Directly compressing ResNet56 into ResNet32 with the traditional knowledge distillation method, training for 150 rounds, attains a model accuracy of 92.474;
3. Compressing ResNet56 into ResNet44 and then ResNet44 into ResNet32 with the traditional knowledge distillation method, training for 150+150 rounds, attains a model accuracy of 93.018;
4. Directly using channel clipping to obtain a model with approximately the parameter count of ResNet32, training for 300 rounds, attains a model accuracy of 92.32, which shows that channel clipping suffers a large accuracy loss when the parameter span is large.
Using the two-stage processing method of the invention, ResNet56 is compressed into ResNet44 and ResNet44 into ResNet32; training for 150+150 rounds attains a model accuracy of 93.265.
The above experiments show that adding an intermediate model between the original model and the target model is indeed effective. Compared with the other schemes, the invention achieves the best knowledge distillation effect, as shown in Table 1:
TABLE 1

ResNet56 | ResNet56-32 KD | ResNet56-44-32 KD | This scheme | Channel clipping |
---|---|---|---|---|
93.26 | 92.474 | 93.018 | 93.265 | 92.32 |
Claims (10)
1. A deep learning knowledge distillation method based on model channel cutting is characterized in that: it comprises the following steps:
s1, obtaining images to be classified, inputting the images to be classified into the teacher model to obtain the average rank of the feature map output by each convolution channel in each convolution layer of the teacher model, and sequencing the convolution channels in each convolution layer from large to small according to the corresponding average rank to obtain the sequenced convolution channels of each convolution layer of the teacher model;
s2, calculating parameter quantity average values of the teacher model parameter quantity and the student model parameter quantity according to the parameter quantity of the teacher model and the parameter quantity of the student model, and taking the change proportion of the parameter quantity average values and the teacher model parameter quantity as the channel cutting total compression rate;
s3, channel clipping is carried out on the redundant convolution channels in each convolution layer of the teacher model according to the convolution channels obtained in S1 after each convolution layer of the teacher model is sequenced and the overall compression rate of channel clipping in S2 by means of a channel clipping technology, and an intermediate model is obtained;
s4, carrying out knowledge distillation on the student model by using the intermediate model, summing the feature maps output by each stage of the intermediate model in the knowledge distillation process, adding the summed feature maps and the original knowledge distillation objective function to obtain a new knowledge distillation objective function, and training the student model by using the new knowledge distillation objective function to obtain a trained student model;
and S5, inputting the images to be classified into the trained student model, and outputting the classification results of the images.
2. The method of distilling knowledge learned based on model channel clipping as set forth in claim 1, wherein: the method comprises the steps of obtaining images to be classified in S1, inputting the images to be classified into a teacher model, obtaining an average rank of a feature map output by each convolution channel in each convolution layer of the teacher model, sequencing the convolution channels in each convolution layer from large to small according to the corresponding average rank, and obtaining the sequenced convolution channels of each convolution layer of the teacher model, wherein the specific process is as follows:
s11, C convolution channels are arranged in a certain convolution layer in the assumed teacher model;
S12, obtaining A images to be classified, inputting the A images into the teacher model to obtain the average rank of the feature maps output by each convolution channel in the convolutional layer of S11, sorting the convolution channels from largest to smallest average rank to obtain the sorted convolution channels of that convolutional layer, and repeating until the sorted convolution channels of each convolutional layer of the teacher model are obtained.
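The average-rank computation of S11-S12 can be sketched as a minimal NumPy routine (an illustrative sketch; a practical implementation would compute ranks on batches inside the network rather than on a stored array):

```python
import numpy as np

def average_channel_ranks(feature_maps: np.ndarray) -> np.ndarray:
    """feature_maps: array of shape (A, C, H, W) — A input images, C channels.
    Returns the average matrix rank of each channel's H x W feature map."""
    A, C, _, _ = feature_maps.shape
    ranks = np.zeros(C)
    for c in range(C):
        ranks[c] = np.mean([np.linalg.matrix_rank(feature_maps[a, c])
                            for a in range(A)])
    return ranks

def sort_channels_by_rank(ranks: np.ndarray) -> np.ndarray:
    # Channel indices ordered from largest to smallest average rank;
    # the lowest-rank (most redundant) channels are clipped first.
    return np.argsort(-ranks)
```

A full-rank channel (e.g. an identity-like response) sorts ahead of a rank-one channel, matching the intuition that low-rank feature maps carry less information.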
3. The method of distilling knowledge learned deeply based on model channel clipping as set forth in claim 2, wherein: in S3, channel clipping is performed on the redundant convolution channels in each convolution layer of the teacher model according to the ordered convolution channels in each convolution layer of the teacher model obtained in S1 and the total compression rate of channel clipping in S2 by using a channel clipping technique, so as to obtain an intermediate model, and the specific process is as follows:
and distributing the compression rate of each convolution layer of the teacher model by using hyper-parameter adjustment according to the channel cutting total compression rate in S2, performing channel cutting on redundant convolution channels in each convolution layer of the teacher model according to the sequenced convolution channels of each convolution layer of the teacher model obtained in S1 and the distributed compression rate of each convolution layer of the teacher model by using a channel cutting technology, and enabling the parameter quantity of the teacher model after channel cutting to be equal to the parameter quantity average value in S2 to obtain an intermediate model.
4. The method of claim 3, wherein the model channel clipping-based deep learning knowledge distillation method comprises: the specific process of the super-parameter adjustment is as follows:
setting a hyper-parameter p, p ═ p 1 ,p 2 ,…,p i ,…,p n The hyperparameter p satisfies the requirement formula:
wherein n represents the total number of convolutional layers, i is 1,2, …, n;
Pr(C i ) Each layer is expressedParameter quantities of convolutional layers;
m represents the number of parameters of the teacher model;
q represents the global compression ratio;
calculating the hyperparameter p according to the channel clipping total compression ratio in S2 and the formula (1) i =[0,1]。
5. The method of distilling knowledge learned deeply based on model channel clipping as set forth in claim 4, wherein: the method comprises the following steps of distributing the compression rate of each convolution layer of the teacher model by utilizing hyper-parameter adjustment according to the channel cutting total compression rate in S2, carrying out channel cutting on a redundant convolution channel in each convolution layer of the teacher model according to the sequenced convolution channel of each convolution layer of the teacher model obtained in S1 and the distributed compression rate of each convolution layer of the teacher model by utilizing a channel cutting technology, and obtaining an intermediate model by enabling the parameter quantity of the teacher model after channel cutting to be equal to the parameter quantity mean value in S2, wherein the specific process is as follows:
The hyper-parameter p_i is taken as the compression rate of the i-th convolutional layer of the teacher model; using the channel clipping technique, redundant convolution channels in each convolutional layer of the teacher model are clipped according to the sorted convolution channels of each convolutional layer obtained in S1 and the hyper-parameter p_i, such that the parameter quantity of the teacher model after channel clipping equals the parameter quantity mean in S2, obtaining the intermediate model.
6. The method of distilling knowledge learned deeply based on model channel clipping as set forth in claim 5, wherein: the new knowledge distillation objective function in S4:
L = t·L_f + (1 − t)·L_s (2)

wherein L represents the new knowledge distillation objective function;
t represents a balance parameter;
L_f represents the original knowledge distillation objective function;
L_s represents the summed stage feature-map term.
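Equation (2) is a convex combination of the two loss terms; a direct sketch (names illustrative):

```python
def combined_objective(L_f: float, L_s: float, t: float) -> float:
    """Equation (2): new KD objective balancing the original distillation
    loss L_f against the summed stage feature-map term L_s via t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("balance parameter t must lie in [0, 1]")
    return t * L_f + (1.0 - t) * L_s
```

Setting t = 1 recovers plain knowledge distillation; smaller t weights the intermediate model's stage outputs more heavily.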
7. The method of claim 6, wherein the original knowledge distillation objective function is:

L_f = (1/m) · Σ_{j=1}^{m} [ α·d(σ(z_j), y_j) + (1 − α)·T²·d(σ(z_j/T), σ(v_j/T)) ] (3)

wherein m represents the total number of input data;
α represents a parameter that balances the "soft target" and the "hard target";
d(·) represents the cross entropy;
y_j represents the true label of the j-th input;
z_j and v_j represent the student and teacher logits of the j-th input, and σ(·) the softmax function;
T represents the softening parameter (temperature) of the logits;
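The variables above match the standard Hinton-style distillation objective; the sketch below implements that standard form for a single input. Since the patent's formula is an image not reproduced in the text, this is the conventional loss, not a verbatim transcription:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, true_label, alpha=0.5, T=4.0):
    """Hard-target cross entropy weighted by alpha, plus temperature-softened
    teacher/student cross entropy scaled by T^2 (keeps gradient magnitudes
    comparable across temperatures)."""
    ps = softmax(student_logits)
    hard = -np.log(ps[true_label])          # cross entropy with one-hot label
    ps_T = softmax(student_logits, T)
    pt_T = softmax(teacher_logits, T)
    soft = -np.sum(pt_T * np.log(ps_T))     # cross entropy with soft targets
    return alpha * hard + (1 - alpha) * T * T * soft
```

With alpha = 1 only the hard target contributes; lowering alpha shifts weight onto the teacher's softened distribution.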
8. The method of claim 7, wherein the summed stage feature-map term is:

L_s = Σ_{k=1}^{s} ‖F_k^T − F_k^S‖² (4)

wherein s represents the number of stages of the intermediate model, and F_k^T and F_k^S represent the feature maps output by the k-th stage of the intermediate model and of the student model, respectively;
9. A deep learning knowledge distillation system based on model channel clipping, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that: the processor, when executing the computer program, implements the steps of the method according to any one of claims 1-8.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210697905.3A CN114898165A (en) | 2022-06-20 | 2022-06-20 | Deep learning knowledge distillation method based on model channel cutting |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114898165A true CN114898165A (en) | 2022-08-12 |
Family
ID=82727746
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210697905.3A Pending CN114898165A (en) | 2022-06-20 | 2022-06-20 | Deep learning knowledge distillation method based on model channel cutting |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114898165A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115294332A (en) * | 2022-10-09 | 2022-11-04 | 浙江啄云智能科技有限公司 | Image processing method, device, equipment and storage medium |
CN115965964A (en) * | 2023-01-29 | 2023-04-14 | 中国农业大学 | Egg freshness identification method, system and equipment |
CN116304677A (en) * | 2023-01-30 | 2023-06-23 | 格兰菲智能科技有限公司 | Channel pruning method and device for model, computer equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114898165A (en) | Deep learning knowledge distillation method based on model channel cutting | |
CN104077595B (en) | Deep learning network image recognition methods based on Bayesian regularization | |
CN107229904A (en) | A kind of object detection and recognition method based on deep learning | |
CN112230675B (en) | Unmanned aerial vehicle task allocation method considering operation environment and performance in collaborative search and rescue | |
CN110473592B (en) | Multi-view human synthetic lethal gene prediction method | |
CN112699958A (en) | Target detection model compression and acceleration method based on pruning and knowledge distillation | |
CN110097178A (en) | It is a kind of paid attention to based on entropy neural network model compression and accelerated method | |
CN108235003B (en) | Three-dimensional video quality evaluation method based on 3D convolutional neural network | |
CN112364719A (en) | Method for rapidly detecting remote sensing image target | |
CN111523546A (en) | Image semantic segmentation method, system and computer storage medium | |
CN114943345B (en) | Active learning and model compression-based federal learning global model training method | |
CN115100238A (en) | Knowledge distillation-based light single-target tracker training method | |
CN110889450A (en) | Method and device for super-parameter tuning and model building | |
CN107577736A (en) | A kind of file recommendation method and system based on BP neural network | |
CN111967971A (en) | Bank client data processing method and device | |
CN106708044A (en) | Full-hovering hovercraft course control method based on grey prediction hybrid genetic algorithm-PID | |
CN112580662A (en) | Method and system for recognizing fish body direction based on image features | |
CN115063274A (en) | Virtual reality flight training scheme generation method based on object technology capability | |
CN117236421A (en) | Large model training method based on federal knowledge distillation | |
CN114819091A (en) | Multi-task network model training method and system based on self-adaptive task weight | |
CN107492129A (en) | Non-convex compressed sensing optimal reconfiguration method with structuring cluster is represented based on sketch | |
CN113516163B (en) | Vehicle classification model compression method, device and storage medium based on network pruning | |
CN115937693A (en) | Road identification method and system based on remote sensing image | |
CN113361570B (en) | 3D human body posture estimation method based on joint data enhancement and network training model | |
CN115457269A (en) | Semantic segmentation method based on improved DenseNAS |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||