CN114898165A - Deep learning knowledge distillation method based on model channel cutting

Deep learning knowledge distillation method based on model channel cutting

Info

Publication number
CN114898165A
Authority
CN
China
Prior art keywords
model
convolution
channel
teacher
teacher model
Prior art date
Legal status
Pending
Application number
CN202210697905.3A
Other languages
Chinese (zh)
Inventor
张翀
王宏志
刘宏伟
丁小欧
Current Assignee
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202210697905.3A priority Critical patent/CN114898165A/en
Publication of CN114898165A publication Critical patent/CN114898165A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep learning knowledge distillation method based on model channel clipping, and in particular a deep learning knowledge distillation method based on model channel clipping for image classification. The images to be classified are input into a teacher model, and the convolution channels in each convolutional layer of the teacher model are sorted in descending order of the average rank of their output feature maps. The mean of the teacher-model and student-model parameter counts is computed, and the ratio between this mean and the teacher-model parameter count is taken as the total compression rate for channel clipping. Redundant convolution channels are then clipped with a channel clipping technique to obtain an intermediate model. Knowledge distillation is performed on the student model using the intermediate model under a new knowledge distillation objective function, and the student model is trained with this objective to obtain the trained student model. The invention belongs to the field of knowledge distillation.

Description

Deep learning knowledge distillation method based on model channel cutting
Technical Field
The invention relates to a knowledge distillation method, in particular to a deep learning knowledge distillation method based on model channel cutting for image classification, and belongs to the field of knowledge distillation.
Background
In deep learning research, models represented by convolutional neural networks have shown strong performance and great development potential in application fields such as image classification, object detection and autonomous driving. In practical applications, however, the parameter count and computational cost of deep learning models are excessive while the computing power of hardware devices is limited, so it is difficult to satisfy the real-time and accuracy requirements of a computing task at the same time. Compressing the model while keeping its accuracy unchanged is therefore a prominent challenge in the deployment of deep learning models.
Traditional deep-model compression methods generally clip connections according to the importance of the neural network's connection weights. As deep neural networks have become increasingly structured, these methods show clear limitations in implementation and training. For structured network models, researchers have proposed structured clipping schemes based on convolution channels, as well as compression schemes such as knowledge distillation borrowed from transfer learning. These methods can compress a deep learning model to some extent while largely preserving its accuracy. In specific applications such as image classification, however, when the network structures, parameter counts or computational costs of the teacher model (the original model) and the student model (the target model) used in the knowledge distillation compression method differ too much, it is difficult to obtain good results: the accuracy of the student model for image classification drops rapidly, and the final classification accuracy is low.
Disclosure of Invention
The invention provides a deep learning knowledge distillation method based on model channel clipping, and aims to solve the problem that, in existing image classification, the classification accuracy is low when a knowledge distillation compression method is used and the parameter counts of the teacher model and the student model differ greatly.
The technical scheme adopted by the invention is as follows:
it comprises the following steps:
s1, obtaining images to be classified, inputting the images to be classified into the teacher model to obtain the average rank of the feature map output by each convolution channel in each convolution layer of the teacher model, and sequencing the convolution channels in each convolution layer from large to small according to the corresponding average rank to obtain the sequenced convolution channels of each convolution layer of the teacher model;
s2, calculating parameter quantity average values of the teacher model parameter quantity and the student model parameter quantity according to the parameter quantity of the teacher model and the parameter quantity of the student model, and taking the change proportion of the parameter quantity average values and the teacher model parameter quantity as the channel cutting total compression rate;
s3, channel cutting is carried out on the redundant convolution channels in each convolution layer of the teacher model according to the convolution channels obtained in the S1 after each convolution layer is sequenced and the overall compression rate of channel cutting in the S2 by utilizing a channel cutting technology, and an intermediate model is obtained, wherein the intermediate model is a deep convolution neural network model;
s4, carrying out knowledge distillation on the student model by using the intermediate model, summing the feature maps output by each stage of the intermediate model in the knowledge distillation process, adding the summed feature maps and the original knowledge distillation objective function to obtain a new knowledge distillation objective function, and training the student model by using the new knowledge distillation objective function to obtain a trained student model;
and S5, inputting the images to be classified into the trained student model, and outputting the classification results of the images.
Preferably, in S1, the image to be classified is obtained, the image to be classified is input into the teacher model, an average rank of the feature map output by each convolution channel in each convolution layer of the teacher model is obtained, the convolution channels in each convolution layer are sorted from large to small according to the corresponding average rank, and the sorted convolution channels of each convolution layer of the teacher model are obtained, which specifically includes:
s11, C convolution channels are arranged in a certain convolution layer in the assumed teacher model;
and S12, obtaining A images to be classified, inputting the A images to be classified into the teacher model to obtain the average rank of the feature maps output by each convolution channel in the convolution layer in S11, sequencing the convolution channels according to the average rank of the feature maps output by each convolution channel from large to small to obtain the sequenced convolution channels of the convolution layers, and repeating the steps until the sequenced convolution channels of each convolution layer of the teacher model are obtained.
Preferably, in S3, channel clipping is performed on the redundant convolution channels in each convolutional layer of the teacher model by means of the channel clipping technique, according to the sorted convolution channels of each convolutional layer obtained in S1 and the total channel-clipping compression rate of S2, to obtain an intermediate model; the specific process is as follows:
the compression rate of each convolutional layer of the teacher model is allocated by hyper-parameter adjustment according to the total channel-clipping compression rate of S2; the redundant convolution channels in each convolutional layer are then clipped by the channel clipping technique according to the sorted convolution channels of each convolutional layer obtained in S1 and the allocated per-layer compression rates, such that the parameter count of the clipped teacher model equals the parameter mean of S2, giving the intermediate model.
Preferably, the specific process of the hyper-parameter adjustment is as follows:
a hyper-parameter vector p = {p_1, p_2, …, p_i, …, p_n} is set, and p must satisfy the constraint

Σ_{i=1}^{n} p_i * Pr(C_i) = Q * M    (1)

where n represents the total number of convolutional layers and i = 1, 2, …, n;
Pr(C_i) represents the parameter count of the i-th convolutional layer;
M represents the parameter count of the teacher model;
Q represents the global compression rate;
the per-layer hyper-parameters p_i ∈ [0, 1] are then calculated from the total channel-clipping compression rate of S2 and formula (1).
Preferably, the compression rate of each convolutional layer of the teacher model is allocated by hyper-parameter adjustment according to the total channel-clipping compression rate of S2; channel clipping is performed on the redundant convolution channels in each convolutional layer by the channel clipping technique, according to the sorted convolution channels of each convolutional layer obtained in S1 and the allocated per-layer compression rates, such that the parameter count of the clipped teacher model equals the parameter mean of S2, giving an intermediate model. The specific process is as follows:
the hyper-parameter p_i is taken as the compression rate of the i-th convolutional layer of the teacher model; using the channel clipping technique, the redundant convolution channels in each convolutional layer are clipped according to the sorted convolution channels obtained in S1 and the hyper-parameter p_i, such that the parameter count of the clipped teacher model equals the parameter mean of S2, giving the intermediate model.
Preferably, the new knowledge distillation objective function in S4 is:

L = t*L_f + (1 - t)*L_s    (2)

where L represents the new knowledge distillation objective function;
t represents a balance parameter;
L_f represents the original knowledge distillation objective function;
L_s represents the summed feature-map term given in formula (4).
Preferably, the original knowledge distillation objective function is:

L_f = (1/m) * Σ_{j=1}^{m} [ α*D(q_j^S, y_j) + (1 - α)*D(q_j^S / T, q_j^T / T) ]    (3)

where m represents the total number of inputs;
α represents a parameter that balances the "soft target" and the "hard target";
D(·) represents the cross entropy;
q_j^S represents the logits output produced by the student model for the j-th input;
y_j represents the true label of the j-th input;
T represents the softening (temperature) parameter of the logits;
q_j^T represents the logits output produced by the teacher model for the j-th input.
Preferably, the summed feature-map term is:

L_s = Σ_{k=1}^{s} || F_k^S - F_k^T ||^2    (4)

where s represents the number of stages of the intermediate model;
F_k^S represents the feature map output by the student model at the k-th stage;
F_k^T represents the feature map output by the teacher (intermediate) model at the k-th stage.
Preferably, a deep learning knowledge distillation compression system based on model channel clipping comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the deep learning knowledge distillation method based on model channel clipping described above.
Preferably, a computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the deep learning knowledge distillation method based on model channel clipping described above.
Advantageous effects:
the invention fuses a channel cutting method provided by HRank Filter Pruning High-Rank Feature Map and a multilayer Distillation method provided by Improved Knowledge Distillation via Teacher Assistant to obtain a Knowledge Distillation compression method with channel cutting, which specifically comprises the following steps: acquiring images to be classified, inputting the images to be classified into a teacher model, and sequencing convolution channels according to the average rank of a feature map output by each convolution channel in each convolution layer of the teacher model (original model) from big to small to obtain the sequenced convolution channels of each convolution layer of the teacher model; calculating parameter quantity mean values of the known teacher model parameter quantity and the student model parameter quantity according to the known teacher model parameter quantity and the known student model parameter quantity, and taking the change proportion of the parameter quantity mean values and the teacher model parameter quantity as a channel cutting total compression rate; the compression ratio of each layer of the teacher model is reasonably distributed by utilizing hyper-parameter adjustment according to the channel cutting total compression ratio, the channel cutting technology is utilized to carry out channel cutting on the redundant convolution channel of each layer of the teacher model according to the sequenced convolution channel of each layer of the teacher model and the compression ratio distributed by each layer of the teacher model, the main convolution channel of the teacher model is reserved, the parameter quantity of the teacher model after channel cutting is equal to the parameter quantity mean value, namely, the intermediate position of the parameter quantity of the teacher model after channel cutting in the teacher model (original model) and the parameter quantity of the student model (target model) is reserved as far as possible in the process, and therefore the compression ratio of each layer of the teacher model is convenient to balance. And restoring the precision of the teacher model by fine tuning to obtain an intermediate model, wherein the intermediate model is a deep convolutional neural network model, knowledge distillation is performed on the student models by using the intermediate model, the feature maps output by each stage of the intermediate model are added in the knowledge distillation process, the added feature maps are added with the original knowledge distillation objective function to obtain a new knowledge distillation objective function, the student models are trained by using the new knowledge distillation objective function to obtain trained student models (namely the objective models) so as to conveniently recognize pictures, and finally, the images to be classified are input into the trained student models to output the classification results of the images.
The invention combines the advantages of the channel-clipping compression method for deep learning models and the teacher-student knowledge distillation compression method, and improves the knowledge distillation compression method. When the knowledge distillation compression method is applied, the student model (target model) can refer simultaneously to the final output of the teacher model (original model) and to the outputs of the intermediate model during distillation training, and thus learns more accurate teacher knowledge. This improves the image-classification accuracy of the student model and yields better student performance when the parameter counts of the teacher model (original model) and the student model (target model) differ too much, so the final classification results are more accurate and the classification accuracy is improved. This has been verified experimentally, as in example 2.
Drawings
FIG. 1 is a structural flow diagram of the knowledge distillation of the present invention;
FIG. 2 is a block diagram of an intermediate model;
FIG. 3 is a diagram of a new knowledge distillation objective function;
Detailed Description
The first embodiment is as follows: the present embodiment is described with reference to fig. 1 to fig. 3, and the present embodiment describes a deep learning knowledge distillation method based on model channel clipping, which includes the following steps:
s1, obtaining images to be classified, inputting the images to be classified into the teacher model, obtaining an average rank of a feature map output by each convolution channel in each convolution layer of the teacher model, sequencing the convolution channels in each convolution layer from large to small according to the corresponding average rank, and obtaining the sequenced convolution channels of each convolution layer of the teacher model, wherein the specific process is as follows:
the images are forest image maps (the classification result is the type of forest creatures), underwater images (the classification result is the type of underwater creatures), and facial expression images (the classification result is the expression of human faces).
S11, assume that a certain convolutional layer of the teacher model has C convolution channels, so that the feature map output by this layer has dimensions C × N_h × N_w; the feature map output by any single convolution channel of this layer is then a two-dimensional matrix of size N_h × N_w;
and S12, obtain A images to be classified and input them into the teacher model to obtain the average rank of the feature map output by each convolution channel of the convolutional layer in S11; sort the C convolution channels in descending order of their average rank, which determines which convolution channels are removed when the channels are clipped, and obtain the sorted convolution channels of this convolutional layer; repeat the above until the sorted convolution channels of every convolutional layer of the teacher model are obtained.
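For illustration only, the rank computation and sorting described in S11-S12 can be sketched in PyTorch as follows; the function names and the use of torch.linalg.matrix_rank are illustrative choices and are not prescribed by the method itself:

```python
import torch

def average_channel_ranks(feature_maps: torch.Tensor) -> torch.Tensor:
    """feature_maps: tensor of shape (A, C, N_h, N_w) collected from one
    convolutional layer over A input images. Returns a length-C tensor with
    the average rank of each channel's N_h x N_w feature map."""
    # matrix_rank operates on the trailing two dimensions, giving an (A, C) tensor of ranks
    ranks = torch.linalg.matrix_rank(feature_maps).float()
    return ranks.mean(dim=0)  # average over the A images -> shape (C,)

def sort_channels_by_rank(avg_ranks: torch.Tensor):
    """Sort channel indices in descending order of average rank; channels at
    the end of the list are the first candidates for clipping."""
    return torch.argsort(avg_ranks, descending=True).tolist()

if __name__ == "__main__":
    # Example: A = 8 images, C = 16 channels, 32x32 feature maps from one layer.
    fmaps = torch.randn(8, 16, 32, 32)
    order = sort_channels_by_rank(average_channel_ranks(fmaps))
    print(order)  # channel indices, highest average rank first
```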
S2, compute the mean of the known teacher-model parameter count and student-model parameter count, and take the ratio between this parameter mean and the teacher-model parameter count as the total channel-clipping compression rate;
the model parameter number is measured by number, for example, the parameter number of the model is 600M, which indicates that the model has 600M parameters. The common default model is stored with the precision of FP32, that is, 1 parameter is stored with 32 bits (bit), and then 600M parameters need to be stored with bits of 600M × 32 ═ 19200M. Assuming that the parameter quantity of the teacher model is 400M and the parameter quantity of the student model is 200M, the parameter quantity average value of the parameter quantity of the teacher model and the parameter quantity average value of the student model is (400M +200M)/2, which is 300M, and then dividing the parameter quantity 400M of the teacher model and the parameter quantity average value 300M to obtain the channel cutting total compression rate. S3, channel clipping is carried out on the redundant convolution channels in each convolution layer of the teacher model according to the convolution channels obtained in S1 after each convolution layer is sequenced and the overall compression rate of channel clipping in S2 by utilizing a channel clipping technology, an intermediate model is obtained, the intermediate model is a deep convolution neural network model, and the specific process is as follows:
S3, channel clipping is performed on the redundant convolution channels in each convolutional layer of the teacher model by means of the channel clipping technique, according to the sorted convolution channels of each convolutional layer obtained in S1 and the total channel-clipping compression rate of S2, to obtain an intermediate model, which is a deep convolutional neural network model. The specific process is as follows: the compression rate of each convolutional layer of the teacher model is allocated reasonably by hyper-parameter adjustment according to the total channel-clipping compression rate of S2; the redundant convolution channels in each convolutional layer are clipped according to the sorted convolution channels obtained in S1 and the allocated per-layer compression rates, such that the parameter count of the clipped teacher model equals the parameter mean of S2; and the accuracy of the teacher model is restored by fine-tuning, giving the intermediate model. In detail:
a hyper-parameter vector p = {p_1, p_2, …, p_i, …, p_n} is set, and p must satisfy the constraint

Σ_{i=1}^{n} p_i * Pr(C_i) = Q * M    (1)

where n represents the total number of convolutional layers and i = 1, 2, …, n;
Pr(C_i) represents the parameter count of the i-th convolutional layer;
M represents the parameter count of the teacher model;
Q represents the global compression rate, i.e. the ratio between the model parameter counts after and before compression;
the process of the hyper-parameter adjustment is to set a series of p values on the premise of meeting the formula (1), then try all the p values to obtain the corresponding model accuracy, and select a group of p values with the best accuracy to establish an intermediate model. The formula (1) can ensure that the sum of the parameter quantities of all the trimmed convolution layers meets the overall trimming rate.
The invention calculates the hyper-parameters p_i ∈ [0, 1] from the total channel-clipping compression rate of S2 and formula (1), and takes p_i as the compression rate of the i-th convolutional layer of the teacher model. Using the channel clipping technique, the redundant convolution channels in each convolutional layer of the teacher model are clipped according to the sorted convolution channels obtained in S1 and the hyper-parameters p_i, while the main convolution channels of the teacher model are retained, so that the parameter count of the clipped teacher model equals the parameter mean of S2. In other words, the parameter count of the intermediate model is kept at the midpoint between the parameter counts of the original model and the target model; for example, if the teacher model has 800M parameters and the student model has 400M parameters, the intermediate model has 600M parameters. This arrangement facilitates balancing the compression rate and ensures an even distribution of compression rates across layers. The accuracy of the teacher model is then restored by fine-tuning to obtain the intermediate model. Through structured channel clipping of the deep convolutional neural network, the teacher model becomes a teacher model with a simplified convolution-channel structure, which serves as the intermediate model; the intermediate model is therefore a deep convolutional neural network model, and according to the inherent multi-stage structure of deep convolutional networks, the feature map output by each stage of the intermediate model can be obtained.
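A minimal sketch of how the per-layer rates p_i and the rank ordering could be combined to select the channels kept in each layer is given below; the constraint check mirrors formula (1) under the assumption that p_i is the kept fraction of layer i, all names and numbers are illustrative, and the fine-tuning that restores accuracy afterwards is omitted:

```python
def check_constraint(p, layer_params, teacher_params, q, tol=0.01):
    """Formula (1), read as: the kept parameters summed over layers should
    equal Q * M (assumption: p_i is the fraction of layer i that is kept)."""
    kept = sum(p_i * n_i for p_i, n_i in zip(p, layer_params))
    return abs(kept - q * teacher_params) / teacher_params < tol

def channels_to_keep(sorted_channels, p_i):
    """Keep the first round(p_i * C) channels of the rank-sorted list; the
    remaining low-average-rank channels are the ones clipped."""
    c_keep = max(1, round(p_i * len(sorted_channels)))
    return sorted(sorted_channels[:c_keep])

# Illustrative use with three layers (channel orders come from the rank sorting of S1).
layer_orders = {
    "conv1": [3, 0, 5, 1, 2, 4],
    "conv2": [7, 2, 0, 4, 6, 1, 3, 5],
    "conv3": [1, 0, 3, 2],
}
layer_params = [54, 72, 36]                      # toy per-layer parameter counts
p = {"conv1": 0.8, "conv2": 0.75, "conv3": 0.5}  # toy per-layer kept fractions
assert check_constraint(list(p.values()), layer_params, sum(layer_params), q=0.72)
keep = {name: channels_to_keep(order, p[name]) for name, order in layer_orders.items()}
print(keep)  # channel indices retained in each layer of the intermediate model
```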
S4, knowledge distillation is performed on the student model using the intermediate model. During distillation, the feature-map difference terms of the stages of the intermediate model are summed, and the summed feature-map term is added to the original knowledge distillation objective function to obtain a new knowledge distillation objective function; the student model is trained with the new objective function to obtain the trained (final) student model, i.e. the target model. The teacher model and the student model are both deep convolutional neural network models. The specific process is as follows:
the new knowledge distillation objective function obtained by adding the added feature map and the original knowledge distillation objective function not only considers the soft target and the hard target of the final output layer of the teacher model in the original knowledge distillation method, but also considers the output feature map of each stage of the intermediate model. When the knowledge distillation compression method is applied, the student model (target model) can simultaneously refer to the final output of the teacher model (original model) and the output result of the intermediate model in the knowledge distillation training process, more accurate teacher knowledge can be learned, the accuracy of image classification of the student model is improved, and better performance of the student model can be obtained when the parameter quantity difference between the teacher model (original model) and the student model (target model) is too large.
The original knowledge distillation objective function is:

L_f = (1/m) * Σ_{j=1}^{m} [ α*D(q_j^S, y_j) + (1 - α)*D(q_j^S / T, q_j^T / T) ]    (2)

The original knowledge distillation objective function considers both the difference between the output of the student model and the true label of the data (the left term) and the difference between the output of the student model and the final output of the teacher model (the right term). Here m represents the total number of inputs; α represents a parameter that weighs the teacher model's "soft target" against the "hard target"; D(·) represents the cross entropy; q_j^S represents the logits output produced by the student model for the j-th input; y_j represents the true label of the j-th input; T represents the softening (temperature) parameter of the logits; q_j^T represents the logits output produced by the teacher model for the j-th input.
The summed feature-map term of the intermediate model is:

L_s = Σ_{k=1}^{s} || F_k^S - F_k^T ||^2    (3)

Formula (3) measures the difference between the feature maps of the intermediate outputs of the teacher (intermediate) model and the student model. Here s is the number of stages of the intermediate model (typically 3 or 4), as shown in FIG. 2; F_k^S represents the feature map output by the student model at the k-th stage, and F_k^T represents the feature map output by the teacher (intermediate) model at the k-th stage.
L = t*L_f + (1 - t)*L_s    (4)

L is the final global objective function, i.e. the new knowledge distillation objective function obtained by adding the summed feature-map term to the original knowledge distillation objective function. It takes into account both the final output of the teacher model and the feature maps output by each stage of the intermediate model; t acts as a balance parameter that adjusts the relative importance of the global term L_f and the stage term L_s.
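A minimal PyTorch sketch of the combined objective L = t*L_f + (1 - t)*L_s is given below; it assumes a KL-divergence form for the softened term of L_f and a mean-squared error for the stage term L_s, which are common choices rather than requirements stated by the method, and it assumes the stage feature maps of the student and the intermediate model already have matching shapes:

```python
import torch
import torch.nn.functional as F

def kd_objective(student_logits, teacher_logits, labels, alpha=0.5, T=4.0):
    """Original KD objective L_f: hard-target cross entropy plus a softened
    student/teacher term (here via KL divergence, a common choice)."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft

def stage_feature_loss(student_stages, teacher_stages):
    """L_s: summed difference of the per-stage feature maps of the student and
    the intermediate (clipped teacher) model; MSE is assumed here."""
    return sum(F.mse_loss(fs, ft) for fs, ft in zip(student_stages, teacher_stages))

def total_loss(student_logits, teacher_logits, labels,
               student_stages, teacher_stages, t=0.7):
    """Formula L = t * L_f + (1 - t) * L_s."""
    return t * kd_objective(student_logits, teacher_logits, labels) \
        + (1.0 - t) * stage_feature_loss(student_stages, teacher_stages)
```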
And S5, inputting the images to be classified into the trained student model, and outputting the classification results of the images.
The second embodiment is as follows: the present embodiment is described with reference to fig. 1 to fig. 3, and the deep learning knowledge distillation compression system based on model channel clipping in the present embodiment includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements any one of the steps in a deep learning knowledge distillation method based on model channel clipping.
The third concrete implementation mode: the present embodiment is described with reference to fig. 1 to 3, and the computer readable storage medium of the present embodiment stores a computer program which, when executed by a processor, implements any one of the steps of a deep learning knowledge distillation method based on model channel clipping.
Example 1
The invention combines the multi-level distillation scheme proposed in "Improved Knowledge Distillation via Teacher Assistant" with the structured channel clipping technique for deep convolutional neural networks proposed in "HRank: Filter Pruning using High-Rank Feature Map". The teacher model (original model) is clipped by structured channel clipping of the deep convolutional neural network to obtain a teacher model with a simplified convolution-channel structure, which serves as the intermediate model. Then, drawing on the multi-level knowledge distillation scheme of "Improved Knowledge Distillation via Teacher Assistant", the size (parameter count and computational cost) of the intermediate model is set close to that of an assistant model, and the intermediate model trains the student model (target model) in the knowledge distillation stage. In this process the intermediate model is closer to the original model, so more of the original knowledge can be retained while training the student model; and because the student model has few parameters, its optimization objective is easier to train to a good solution, so a better distillation effect can be obtained in theory. In the final knowledge distillation stage, according to the inherent multi-stage convolution-channel structure of deep convolutional networks, the teacher model and the student model have the same number of stages, but the depth and the number of convolution channels in each stage differ; this depth difference makes the distillation process somewhat difficult to optimize. The invention therefore improves the knowledge distillation objective function: the feature-map terms after each stage of the intermediate model are summed and added to the original knowledge distillation objective function, so that the student model (target model) can refer simultaneously to the final output of the teacher model (original model) and to the outputs of the intermediate model during training and learn more accurate teacher knowledge.
Traditional deep learning model compression is represented by channel-clipping compression methods and knowledge-distillation compression methods, each with its own advantages and disadvantages. Compression based on channel clipping can achieve a large compression ratio and preserve the depth of the model, but during compression, especially at large compression ratios, it is difficult to allocate the per-layer compression rates, and several rounds of iterative fine-tuning are needed. Compression based on knowledge distillation yields a more balanced sub-model structure, but the depth of the original model is lost during distillation, and it is difficult to obtain satisfactory results on knowledge distillation problems with a large span.
The present invention uses a two-stage process. In the first stage, the channel clipping technique is used to clip the redundant convolution channels of the teacher model (original model) while retaining its main convolution channels; the accuracy of the teacher model is restored by fine-tuning, and the clipped teacher model is taken as the intermediate model. The parameter count of the intermediate model is kept midway between the parameter count of the teacher model and that of the student model (target model), which makes it convenient to set balanced compression ratios for the two stages and ensures an even distribution of the per-layer compression rates. In the second stage, the intermediate model is compressed into the target model by knowledge distillation. An improved multi-stage objective function is introduced in this process: the feature-map terms of the intermediate model after each stage are summed and added to the original knowledge distillation objective function to obtain a new knowledge distillation objective function, and the student model is trained with the new objective function to obtain the trained student model (target model), which is then applied in subsequent concrete applications to improve their effectiveness and accuracy.
The structured convolutional neural network (the intermediate model) is generally divided into 3-4 stages, as shown in FIG. 2; after each stage the feature map is down-sampled and the number of convolution channels is doubled. The difference between the teacher model (original model) and the student model (target model) lies in the depth and width of the convolution channels within each stage. To make the second-stage distillation easier to optimize, the invention adds the feature map output by each stage of the intermediate model into the original knowledge distillation objective function, so that in addition to learning the 'soft targets' and 'hard targets' of the teacher model, the student model also learns intermediate information of the teacher model's reasoning process (via the intermediate model).
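As an illustration of how the per-stage feature maps needed for L_s could be collected from a ResNet-style network, the following PyTorch sketch uses forward hooks; the stage names "layer1"-"layer3" follow the torchvision ResNet convention and are an assumption about the concrete backbone:

```python
import torch
import torch.nn as nn

def collect_stage_outputs(model: nn.Module, stage_names, x: torch.Tensor):
    """Run one forward pass and return the feature map produced after each
    named stage, in order. Forward hooks are used so the model code itself
    does not have to be modified."""
    outputs = {}
    handles = []
    for name, module in model.named_modules():
        if name in stage_names:
            handles.append(module.register_forward_hook(
                lambda _m, _inp, out, key=name: outputs.__setitem__(key, out)))
    try:
        model(x)
    finally:
        for h in handles:
            h.remove()
    return [outputs[name] for name in stage_names]

# Example with a torchvision ResNet (stage names are an assumption about the model used):
# from torchvision.models import resnet18
# stages = ["layer1", "layer2", "layer3"]
# feats = collect_stage_outputs(resnet18(), stages, torch.randn(1, 3, 32, 32))
```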
Example 2
Specific experiments, as exemplified by Resnet:
to verify the effectiveness of the present invention, experiments were conducted with Resnet as the teacher model. The data set used in this experiment was cfar-10, and the experimental goal was to compress the original model Resnet56 to a parameter size of the target model Resnet 32. By way of comparison, we have introduced the following reference schemes, respectively:
1. The pre-trained original Resnet56 achieves a model accuracy of 93.26;
2. Directly compressing Resnet56 into Resnet32 with the traditional knowledge distillation method, with 150 training rounds, gives a model accuracy of 92.474;
3. Compressing Resnet56 into Resnet44 and then Resnet44 into Resnet32 with the traditional knowledge distillation method, with 150+150 training rounds, gives a model accuracy of 93.018;
4. Directly using channel clipping to obtain a model with approximately the parameter count of Resnet32, with 300 training rounds, gives a model accuracy of 92.32, which shows that channel clipping alone loses considerable accuracy when the parameter span is large.
With the two-stage processing method of the invention, Resnet56 is compressed into Resnet44 and Resnet44 into Resnet32; with 150+150 training rounds the model accuracy is 93.265.
It can be shown from the above experiments that it is indeed effective to add an intermediate link model between the original model and the target model. Compared with other schemes, the invention achieves the best knowledge distillation effect. As shown in table 1:
TABLE 1
Model    | Resnet56 | Resnet56-32KD | Resnet56-44-32KD | This scheme | Channel clipping
Accuracy | 93.26    | 92.474        | 93.018           | 93.265      | 92.32

Claims (10)

1. A deep learning knowledge distillation method based on model channel cutting is characterized in that: it comprises the following steps:
s1, obtaining images to be classified, inputting the images to be classified into the teacher model to obtain the average rank of the feature map output by each convolution channel in each convolution layer of the teacher model, and sequencing the convolution channels in each convolution layer from large to small according to the corresponding average rank to obtain the sequenced convolution channels of each convolution layer of the teacher model;
s2, calculating parameter quantity average values of the teacher model parameter quantity and the student model parameter quantity according to the parameter quantity of the teacher model and the parameter quantity of the student model, and taking the change proportion of the parameter quantity average values and the teacher model parameter quantity as the channel cutting total compression rate;
s3, channel clipping is carried out on the redundant convolution channels in each convolution layer of the teacher model according to the convolution channels obtained in S1 after each convolution layer of the teacher model is sequenced and the overall compression rate of channel clipping in S2 by means of a channel clipping technology, and an intermediate model is obtained;
s4, carrying out knowledge distillation on the student model by using the intermediate model, summing the feature maps output by each stage of the intermediate model in the knowledge distillation process, adding the summed feature maps and the original knowledge distillation objective function to obtain a new knowledge distillation objective function, and training the student model by using the new knowledge distillation objective function to obtain a trained student model;
and S5, inputting the images to be classified into the trained student model, and outputting the classification results of the images.
2. The deep learning knowledge distillation method based on model channel clipping of claim 1, wherein in S1 the images to be classified are obtained and input into the teacher model, the average rank of the feature map output by each convolution channel in each convolutional layer of the teacher model is obtained, and the convolution channels in each convolutional layer are sorted in descending order of their average rank to obtain the sorted convolution channels of each convolutional layer of the teacher model; the specific process is as follows:
S11, assuming that a certain convolutional layer of the teacher model has C convolution channels;
and S12, obtaining A images to be classified, inputting them into the teacher model to obtain the average rank of the feature map output by each convolution channel of the convolutional layer in S11, sorting the convolution channels in descending order of their average rank to obtain the sorted convolution channels of this convolutional layer, and repeating the above until the sorted convolution channels of every convolutional layer of the teacher model are obtained.
3. The deep learning knowledge distillation method based on model channel clipping of claim 2, wherein in S3 channel clipping is performed on the redundant convolution channels in each convolutional layer of the teacher model by means of the channel clipping technique, according to the sorted convolution channels of each convolutional layer obtained in S1 and the total channel-clipping compression rate of S2, to obtain an intermediate model; the specific process is as follows:
the compression rate of each convolutional layer of the teacher model is allocated by hyper-parameter adjustment according to the total channel-clipping compression rate of S2; the redundant convolution channels in each convolutional layer are then clipped by the channel clipping technique according to the sorted convolution channels of each convolutional layer obtained in S1 and the allocated per-layer compression rates, such that the parameter count of the clipped teacher model equals the parameter mean of S2, giving the intermediate model.
4. The deep learning knowledge distillation method based on model channel clipping of claim 3, wherein the specific process of the hyper-parameter adjustment is as follows:
a hyper-parameter vector p = {p_1, p_2, …, p_i, …, p_n} is set, and p must satisfy the constraint

Σ_{i=1}^{n} p_i * Pr(C_i) = Q * M    (1)

where n represents the total number of convolutional layers and i = 1, 2, …, n;
Pr(C_i) represents the parameter count of the i-th convolutional layer;
M represents the parameter count of the teacher model;
Q represents the global compression rate;
the per-layer hyper-parameters p_i ∈ [0, 1] are calculated from the total channel-clipping compression rate of S2 and formula (1).
5. The deep learning knowledge distillation method based on model channel clipping of claim 4, wherein the compression rate of each convolutional layer of the teacher model is allocated by hyper-parameter adjustment according to the total channel-clipping compression rate of S2; channel clipping is performed on the redundant convolution channels in each convolutional layer by the channel clipping technique, according to the sorted convolution channels of each convolutional layer obtained in S1 and the allocated per-layer compression rates, such that the parameter count of the clipped teacher model equals the parameter mean of S2, giving the intermediate model; the specific process is as follows:
the hyper-parameter p_i is taken as the compression rate of the i-th convolutional layer of the teacher model; using the channel clipping technique, the redundant convolution channels in each convolutional layer are clipped according to the sorted convolution channels obtained in S1 and the hyper-parameter p_i, such that the parameter count of the clipped teacher model equals the parameter mean of S2, giving the intermediate model.
6. The deep learning knowledge distillation method based on model channel clipping of claim 5, wherein the new knowledge distillation objective function in S4 is:

L = t*L_f + (1 - t)*L_s    (2)

where L represents the new knowledge distillation objective function;
t represents a balance parameter;
L_f represents the original knowledge distillation objective function;
L_s represents the summed feature-map term.
7. The deep learning knowledge distillation method based on model channel clipping of claim 6, wherein the original knowledge distillation objective function is:

L_f = (1/m) * Σ_{j=1}^{m} [ α*D(q_j^S, y_j) + (1 - α)*D(q_j^S / T, q_j^T / T) ]    (3)

where m represents the total number of inputs;
α represents a parameter that balances the "soft target" and the "hard target";
D(·) represents the cross entropy;
q_j^S represents the logits output produced by the student model for the j-th input;
y_j represents the true label of the j-th input;
T represents the softening (temperature) parameter of the logits;
q_j^T represents the logits output produced by the teacher model for the j-th input.
8. The deep learning knowledge distillation method based on model channel clipping of claim 7, wherein the summed feature-map term is:

L_s = Σ_{k=1}^{s} || F_k^S - F_k^T ||^2    (4)

where s represents the number of stages of the intermediate model;
F_k^S represents the feature map output by the student model at the k-th stage;
F_k^T represents the feature map output by the teacher (intermediate) model at the k-th stage.
9. A model channel clipping based deep learning knowledge distillation system comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that: the processor, when executing the computer program, realizes the steps of the method according to any of claims 1-8.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202210697905.3A 2022-06-20 2022-06-20 Deep learning knowledge distillation method based on model channel cutting Pending CN114898165A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210697905.3A CN114898165A (en) 2022-06-20 2022-06-20 Deep learning knowledge distillation method based on model channel cutting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210697905.3A CN114898165A (en) 2022-06-20 2022-06-20 Deep learning knowledge distillation method based on model channel cutting

Publications (1)

Publication Number Publication Date
CN114898165A true CN114898165A (en) 2022-08-12

Family

ID=82727746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210697905.3A Pending CN114898165A (en) 2022-06-20 2022-06-20 Deep learning knowledge distillation method based on model channel cutting

Country Status (1)

Country Link
CN (1) CN114898165A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294332A (en) * 2022-10-09 2022-11-04 浙江啄云智能科技有限公司 Image processing method, device, equipment and storage medium
CN115965964A (en) * 2023-01-29 2023-04-14 中国农业大学 Egg freshness identification method, system and equipment
CN115965964B (en) * 2023-01-29 2024-01-23 中国农业大学 Egg freshness identification method, system and equipment
CN116304677A (en) * 2023-01-30 2023-06-23 格兰菲智能科技有限公司 Channel pruning method and device for model, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN114898165A (en) Deep learning knowledge distillation method based on model channel cutting
CN104077595B (en) Deep learning network image recognition methods based on Bayesian regularization
CN107229904A (en) A kind of object detection and recognition method based on deep learning
CN112230675B (en) Unmanned aerial vehicle task allocation method considering operation environment and performance in collaborative search and rescue
CN110473592B (en) Multi-view human synthetic lethal gene prediction method
CN112699958A (en) Target detection model compression and acceleration method based on pruning and knowledge distillation
CN110097178A (en) It is a kind of paid attention to based on entropy neural network model compression and accelerated method
CN108235003B (en) Three-dimensional video quality evaluation method based on 3D convolutional neural network
CN112364719A (en) Method for rapidly detecting remote sensing image target
CN111523546A (en) Image semantic segmentation method, system and computer storage medium
CN114943345B (en) Active learning and model compression-based federal learning global model training method
CN115100238A (en) Knowledge distillation-based light single-target tracker training method
CN110889450A (en) Method and device for super-parameter tuning and model building
CN107577736A (en) A kind of file recommendation method and system based on BP neural network
CN111967971A (en) Bank client data processing method and device
CN106708044A (en) Full-hovering hovercraft course control method based on grey prediction hybrid genetic algorithm-PID
CN112580662A (en) Method and system for recognizing fish body direction based on image features
CN115063274A (en) Virtual reality flight training scheme generation method based on object technology capability
CN117236421A (en) Large model training method based on federal knowledge distillation
CN114819091A (en) Multi-task network model training method and system based on self-adaptive task weight
CN107492129A (en) Non-convex compressed sensing optimal reconfiguration method with structuring cluster is represented based on sketch
CN113516163B (en) Vehicle classification model compression method, device and storage medium based on network pruning
CN115937693A (en) Road identification method and system based on remote sensing image
CN113361570B (en) 3D human body posture estimation method based on joint data enhancement and network training model
CN115457269A (en) Semantic segmentation method based on improved DenseNAS

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination