CN115294407B - Model compression method and system based on preview mechanism knowledge distillation - Google Patents

Model compression method and system based on preview mechanism knowledge distillation

Info

Publication number
CN115294407B
CN115294407B (application CN202211206057.8A)
Authority
CN
China
Prior art keywords
network
student network
student
teacher
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211206057.8A
Other languages
Chinese (zh)
Other versions
CN115294407A (en)
Inventor
吴建龙
丁沐河
聂礼强
董雪
甘甜
丁宁
姜飞俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Maojing Artificial Intelligence Technology Co ltd
Shandong University
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Zhejiang Maojing Artificial Intelligence Technology Co ltd
Shandong University
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Maojing Artificial Intelligence Technology Co ltd, Shandong University, Shenzhen Graduate School Harbin Institute of Technology filed Critical Zhejiang Maojing Artificial Intelligence Technology Co ltd
Priority to CN202211206057.8A priority Critical patent/CN115294407B/en
Publication of CN115294407A publication Critical patent/CN115294407A/en
Application granted granted Critical
Publication of CN115294407B publication Critical patent/CN115294407B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/778Active pattern-learning, e.g. online learning of image or video features
    • G06V10/7784Active pattern-learning, e.g. online learning of image or video features based on feedback from supervisors
    • G06V10/7788Active pattern-learning, e.g. online learning of image or video features based on feedback from supervisors the supervisor being a human, e.g. interactive learning with a human teacher
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention belongs to the field of computer vision image classification and provides a model compression method and system based on preview mechanism knowledge distillation, aiming to solve the problems of poor accuracy and instability in image classification and recognition. An image sample is acquired, its label is annotated, and a student network is trained under supervision; the student network and a pre-trained teacher network perform output alignment, feature alignment, class center alignment and class center contrastive learning; difficulty scores of the image samples are calculated, and weights are dynamically assigned to different image samples; a total loss function is obtained from the loss functions of supervised training, output alignment, feature alignment, class center alignment and class center contrastive learning together with the weights of the different image samples; and the training of the student network is guided by the total loss function to obtain a trained student network, which is used as an image classification model for predicting the class distribution of an input image. This improves the accuracy of image recognition and classification.

Description

Model compression method and system based on preview mechanism knowledge distillation
Technical Field
The invention belongs to the field of computer vision image classification, and particularly relates to a model compression method and system based on preview mechanism knowledge distillation.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Convolutional neural networks excel at image classification, and complex models can significantly improve classification performance, but they bring more parameters and computation, making them difficult to deploy in practical applications; compressing the network model is therefore necessary. Knowledge distillation is a mainstream model compression algorithm whose core idea is to distill the knowledge of a complex network (the teacher) into a lightweight network (the student). Under the guidance of the teacher network, a student network with a simple structure and few parameters improves its accuracy on the image classification task and achieves performance similar to that of the teacher network.
Generally, the key to knowledge distillation is extracting knowledge from the teacher network to guide the student network. For example, KD treats the output (logits) of the teacher network as knowledge and transfers it by narrowing the KL divergence between the two networks' outputs, letting the student network mimic the teacher's output. FitNet and Attention further treat the feature maps of the networks' intermediate layers as knowledge and convey it by narrowing the Euclidean distance between the feature maps. However, these methods only extract instance-level consistency information and ignore the structured information in the teacher network's feature space. CRD and CRCD introduce contrastive learning into knowledge distillation to extract discriminative representational knowledge. Despite the success of the above methods, the following problems remain. First, existing methods mainly make the student network imitate the results of the teacher network but ignore teaching the student network the operation that produces those results. Specifically, they guide the student with the teacher network's results (e.g., features or outputs) while ignoring the teacher's intrinsic knowledge (e.g., structure and parameters). The output (logits) of an instance is produced by the fully connected layer, whose weight matrix perceives the similarity between the instance and each class and outputs the probability that the instance belongs to each class. We therefore regard the weight matrix that classifies instances as a set of class centers; it embodies the operation by which the teacher network obtains its results and can be taught to the student network as knowledge. Second, existing methods do not distinguish between easy and difficult samples, so the student network cannot effectively learn from difficult samples in the initial stage. Some images contain a single object with a clean background and are easy to distinguish, while others contain multiple objects with a cluttered background and are difficult to distinguish. The student network, with its simple structure and few parameters, cannot at first absorb all the knowledge taught by the teacher network, especially the knowledge of difficult samples.
In summary, the inventors found that although existing model compression techniques enable the student network for image classification to imitate the results of the teacher network, the accuracy of the student network in identifying image categories is unstable because the intrinsic knowledge of the teacher network is ignored; moreover, the difficulty level of the image samples is not distinguished, so the student network cannot effectively learn difficult samples in the initial stage, and the final accuracy of image category identification is poor.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides a model compression method and system based on preview mechanism knowledge distillation, which extract the relationship between the class centers of the fully connected layer and instance features, explicitly optimize the class representations, and explore the correlation between feature representations and classes, thereby enhancing discriminability and improving the accuracy of image recognition and classification.
In order to achieve the purpose, the invention adopts the following technical scheme:
the first aspect of the invention provides a model compression method based on the distillation of the knowledge of the preview mechanism.
A model compression method based on preview mechanism knowledge distillation, comprising:
acquiring an image sample, labeling the image sample, and performing supervised training on a student network;
performing output alignment, feature alignment, class center alignment and class center contrastive learning between the student network and a pre-trained teacher network based on a class-contrastive learning knowledge distillation method;
calculating difficulty scores of the image samples by adopting a learning strategy of a preview mechanism, and dynamically assigning weights to different image samples based on the difficulty scores;
obtaining a total loss function based on the loss functions of supervised training, output alignment, feature alignment, class center alignment and class center contrastive learning, together with the weights of the different image samples;
and guiding the training of the student network according to the total loss function to obtain a student network compressed and trained on the basis of the teacher network model, the student network being used as an image classification model for predicting the class distribution of an input image.
In the process of performing supervised training on the student network, the cross entropy between the student network's prediction distribution and the labels is minimized based on the image samples and their labels.
As an embodiment, in the process of aligning the output of the student network with the output of the pre-trained teacher network, the KL divergence between the output of the teacher network and the output of the student network is minimized so that the two outputs become similar.
In the process of aligning the features of the student network and the pre-trained teacher network, the features of the student network are mapped to the feature dimension of the teacher network through a multi-layer perceptron, and the Euclidean distance between the features of the student network and the features of the teacher network is minimized so that the two become similar.
In one embodiment, in the process of aligning the class centers of the student network and the pre-trained teacher network, the Euclidean distance between the weight matrices of the fully connected layers of the teacher network and the student network is minimized, aligning the class centers of the two networks.
The above technical scheme has the advantage that, by training the student network and the teacher network for output alignment, feature alignment and class center alignment, the fully connected layers of the student network and the teacher network are kept consistent, class-level information is mined, and knowledge transfer is facilitated.
As an implementation manner, in the process of calculating the difficulty scores of the image samples with the learning strategy of the preview mechanism, when the difficulty score of an image sample is not greater than a dynamic threshold, the weight of the image sample is assigned to 1; otherwise, the weight of the image sample is assigned as the exponential of the negative square of its difficulty score (i.e., exp(-d^2), where d is the difficulty score).
In one embodiment, the dynamic threshold is a power function whose exponent is the number of training epochs and whose base is the sum of 1 and a hyper-parameter controlling the growth rate.
The above technical scheme has the advantage that, by adopting the learning strategy based on the preview mechanism and training the network according to the difficulty level of the samples, the accuracy and efficiency of the knowledge distillation method are improved, and the accuracy of image classification is ultimately improved.
A second aspect of the invention provides a model compression system based on preview mechanism knowledge distillation.
A model compression system based on preview mechanism knowledge distillation, comprising:
the supervised training module is used for acquiring the image sample, labeling the label of the image sample and carrying out supervised training on the student network;
the knowledge distillation module is used for performing output alignment, feature alignment, class center alignment and class center contrastive learning between a student network and a pre-trained teacher network based on a class-contrastive learning knowledge distillation method;
the preview mechanism learning module is used for calculating difficulty scores of the image samples by adopting a learning strategy of a preview mechanism and dynamically distributing weights of different image samples based on the difficulty scores;
the total loss function determining module is used for obtaining a total loss function based on the loss functions of supervised training, output alignment, feature alignment, class center alignment and class center contrastive learning, together with the weights of the different image samples;
and the model compression module is used for guiding the training of the student network according to the total loss function to obtain the student network which is compressed and trained based on the teacher network model, and the student network is used as an image classification model and used for carrying out class distribution prediction on the input image.
A third aspect of the invention provides a computer-readable storage medium.
A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, carries out the steps of the model compression method based on preview mechanism knowledge distillation as described above.
A fourth aspect of the invention provides an electronic device.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the model compression method based on preview mechanism knowledge distillation as described above when executing the program.
Compared with the prior art, the invention has the beneficial effects that:
(1) According to the invention, the student network and the teacher network are trained for output alignment, feature alignment and class center alignment; the process of passing features through the fully connected layer is studied as the operation by which the network obtains its results, and the weight matrix of the fully connected layer is regarded as the class centers. Keeping the fully connected layers of the student network and the teacher network consistent mines class-level information, facilitates knowledge transfer, and greatly improves the accuracy of image classification.
(2) Based on the class-contrastive learning knowledge distillation method, the invention contrasts the features of the student network with the class centers of the teacher network and of the student network itself, learning the structured knowledge of the relationship between instance features and class centers, explicitly optimizing the class representations of the student network and the teacher network, and transferring the correlation between feature representations and classes. The method can therefore transfer the structured relationship between the class centers of the fully connected layer and the instance features, explicitly optimize the class representations, and explore the correlation between feature representations and classes, enhancing discriminability and improving the accuracy of image classification.
(3) Based on the learning strategy of the preview mechanism, the invention trains the network according to the difficulty level of the samples, improving the accuracy and efficiency of the knowledge distillation method and ultimately improving the accuracy of image classification.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and do not limit the invention.
FIG. 1 is a flow chart of the model compression method based on preview mechanism knowledge distillation according to an embodiment of the present invention;
FIG. 2(a) is a schematic diagram of the class-contrastive learning knowledge distillation method proposed by an embodiment of the present invention;
FIG. 2(b) is a schematic diagram of class center contrastive learning according to an embodiment of the present invention;
FIG. 2(c) is a schematic diagram of the learning strategy based on the preview mechanism according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of features being multiplied by the class centers of the fully connected layer in an embodiment of the present invention;
FIG. 4(a) is a training example of the learning strategy based on the preview mechanism proposed by an embodiment of the present invention;
FIG. 4(b) shows the ratio of the loss functions of simple samples and difficult samples at the $p_1$-th training epoch according to an embodiment of the present invention;
FIG. 4(c) shows the ratio of the numbers of simple samples and difficult samples at the $p_1$-th training epoch according to an embodiment of the present invention;
FIG. 4(d) shows the ratio of the loss functions of simple samples and difficult samples at the $p_k$-th training epoch according to an embodiment of the present invention;
FIG. 4(e) shows the ratio of the numbers of simple samples and difficult samples at the $p_k$-th training epoch according to an embodiment of the present invention;
FIG. 4(f) shows the ratio of the loss functions of simple samples and difficult samples at the $p_n$-th training epoch according to an embodiment of the present invention;
FIG. 4(g) shows the ratio of the numbers of simple samples and difficult samples at the $p_n$-th training epoch according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Example one
The embodiment provides a model compression process based on preview mechanism knowledge distillation. Specifically, a teacher network is pre-trained and then used to guide the training of the student network. The student network first undergoes supervised training with the image labels. Output alignment then lets the student network imitate the output of the teacher network; feature alignment keeps the student network's features similar to those of the teacher network; class center alignment keeps the class centers of the teacher network and the student network consistent; and finally, class center contrastive learning contrasts the student network's features with the class centers of both the student network and the teacher network, pulling features close to the class center of the same class and pushing them away from the class centers of different classes. In addition, a learning strategy based on the preview mechanism computes a difficulty score for each sample and dynamically assigns weights as training proceeds; the sample weights are then multiplied by the corresponding losses to obtain the final loss function that guides the learning of the student network.
As shown in FIG. 1, the present embodiment provides a model compression method based on preview mechanism knowledge distillation, which includes:
step a: and acquiring an image sample, labeling a label of the image sample, and performing supervision training on a student network.
Supervision and training: in the process of carrying out supervision training on the student network, the prediction distribution of the student network and the cross entropy of the labels are minimized based on the image training samples and the labels.
Denote the training dataset as $X=\{x_1, x_2, \ldots, x_N\}$ and the corresponding label set as $Y=\{y_1, y_2, \ldots, y_N\}$, where $x_i$ is the $i$-th image, $y_i$ is its label, and $N$ is the total number of images. The training data are augmented as follows: each image is cropped, rotated and normalized with the mean and variance, and the augmented image is further rotated by 90 degrees, 180 degrees and 270 degrees in turn, so that each sample is expanded into four copies.

For each input image $x_i$, let $z_i^{s}$ denote the output (logits) of the student network. Following the supervised learning approach for image classification, we minimize the cross entropy between the prediction distribution and the image label. The loss function is:

$$\mathcal{L}_{ce} = \mathrm{CE}\big(\sigma(z_i^{s}),\, y_i\big),$$

where $\sigma(\cdot)$ is the softmax function and $\mathrm{CE}(\cdot,\cdot)$ denotes the cross entropy.
Step b: Based on the class-contrastive learning knowledge distillation method, the student network performs output alignment, feature alignment, class center alignment and class center contrastive learning with a pre-trained teacher network, as shown in FIG. 2(a) and FIG. 2(b).
The specific processes of output alignment, feature alignment and class center alignment are described below with reference to FIG. 2(a).
Output alignment:
in the process of carrying out output alignment training on the student network, the KL divergence of the output of the teacher network and the output of the student network are minimized, and the outputs of the teacher network and the student network are similar.
Using the KD method, knowledge is transferred by minimizing the KL divergence between the outputs of the teacher network and the student network, allowing the student network to mimic the teacher network's outputs. The loss function is:

$$\mathcal{L}_{kd} = \mathrm{KL}\big(\sigma(z_i^{t}/\tau)\,\|\,\sigma(z_i^{s}/\tau)\big),$$

where $z_i^{t}$ denotes the output (logits) of the teacher network and $\tau$ is the temperature coefficient.
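A minimal sketch of this output-alignment (KD) loss in a PyTorch setting is given below; the temperature value `tau=4.0` is an illustrative choice rather than a value specified by the patent.

```python
import torch.nn.functional as F

def output_alignment_loss(student_logits, teacher_logits, tau=4.0):
    """Output alignment: KL divergence between the softened teacher and student
    output distributions, as in standard KD. tau is an assumed temperature value."""
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    # 'batchmean' averages the KL over the batch; tau**2 keeps gradient magnitudes
    # comparable across temperatures (the usual KD scaling).
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2
```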
Feature alignment:
in the process of carrying out feature alignment training on the student network, aligning the features of the student network with the feature dimensions of the teacher network through the multilayer perceptron, and minimizing the Euclidean distance between the features to enable the features of the student network and the teacher network to be similar.
Feature alignment is performed on the augmented data. We first denote the augmented data as $\tilde{x}_i$, and let $f_i^{t}$ and $f_i^{s}$ denote the features that the teacher network and the student network, respectively, extract from the $i$-th augmented sample. Feature alignment lets the student features $f_i^{s}$ imitate the teacher features $f_i^{t}$. Because the two features have different dimensions, the student features $f_i^{s}$ are first mapped to the teacher's dimension by a multi-layer perceptron $\phi(\cdot)$ and normalized, and the $\ell_2$ distance is then computed to achieve feature alignment. The feature alignment loss function is:

$$\mathcal{L}_{fa} = \left\| \bar{f}_i^{\,s} - \bar{f}_i^{\,t} \right\|_2^2,$$

where $\bar{f}_i^{\,s}$ and $\bar{f}_i^{\,t}$ denote the normalized features of the student network and the teacher network, respectively, and $\bar{v}$ denotes the normalization of a vector $v$. Specifically, $\bar{f}_i^{\,s} = \phi(f_i^{\,s}) / \|\phi(f_i^{\,s})\|_2$ and $\bar{f}_i^{\,t} = f_i^{\,t} / \|f_i^{\,t}\|_2$.
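The feature-alignment step could be sketched as follows; the projector architecture (`hidden_dim=512`, two linear layers with a ReLU) is an assumed configuration, since the patent does not fix the structure of the multi-layer perceptron.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignment(nn.Module):
    """Map student features to the teacher's dimension with an MLP, normalize both
    feature vectors, and penalize their squared Euclidean distance."""
    def __init__(self, student_dim, teacher_dim, hidden_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(student_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, teacher_dim),
        )

    def forward(self, f_student, f_teacher):
        f_s = F.normalize(self.mlp(f_student), dim=1)   # normalized, projected f^s
        f_t = F.normalize(f_teacher, dim=1)             # normalized f^t
        return ((f_s - f_t) ** 2).sum(dim=1).mean()     # squared L2 distance
```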
Class center alignment:
and in the process of carrying out class center alignment training on the student network, minimizing the Euclidean distance of the weight matrixes of the full connection layers of the teacher network and the student network, and aligning the class centers of the teacher network and the student network.
In convolutional neural networks, a teacher network extracts features of an input image through a plurality of convolutional and pooling layers, and then maps the features to outputs (logits) through fully-connected layers for classification. Specifically, as shown in fig. 3, the weight matrix of the fully-connected layer operates by perceiving the similarity between the image features and each class, thereby outputting the probability that the image belongs to each class. We refer to each column of the weight matrix as a class center, representing the attributes of a particular class. Therefore, we teach the knowledge in the class center to the student network, enabling it to understand how the teacher network classifies the images. Technically, we let students learn class centers of teacher networks over the network by the following loss function:
$$\mathcal{L}_{ca} = \left\| W^{t} - W^{s} \right\|_2^2,$$

where $W^{t}$ and $W^{s}$ denote the class centers (i.e., the weight matrices of the fully connected layers) of the teacher network and the student network, respectively.
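A small sketch of this class-center alignment loss is given below, under the assumption that the class centers are taken from the final fully connected layers' weight matrices (in PyTorch, each row of `linear.weight` corresponds to one class); the optional `projector` argument is a hypothetical helper for the case where the two weight matrices have different feature dimensions.

```python
def class_center_alignment_loss(teacher_fc_weight, student_fc_weight, projector=None):
    """Align the class centers, i.e. the weight matrices of the final fully connected
    layers. 'projector' is a hypothetical mapping used only if the student's feature
    dimension differs from the teacher's."""
    w_s = student_fc_weight if projector is None else projector(student_fc_weight)
    return ((teacher_fc_weight - w_s) ** 2).sum()   # squared Euclidean distance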
By training the student network and the teacher network for output alignment, feature alignment and class center alignment, the fully connected layers of the student network and the teacher network are kept consistent, class-level information is mined, and knowledge transfer is facilitated.
Specifically, as shown in FIG. 2(b), in the process of contrasting the class centers of the student network and the teacher network, the class-center contrastive learning knowledge distillation method contrasts the features of the student network with the class centers of the teacher network and with its own class centers, learning the structured knowledge of the relationship between features and class centers, explicitly optimizing the class representations of the student network and the teacher network, and transferring the correlation between feature representations and classes.
The class centers represent the attributes of each class, so each class center should be representative of its own class and distinguishable from the other classes. We pull the student features close to the class center of the same class in the teacher network, $w_{y_i}^{t}$, and to the corresponding class center in the student network, $w_{y_i}^{s}$, and push them away from the class centers of the other classes in both networks. The cosine distance is used to estimate the similarity between a sample feature and a class center. In summary, we define the following contrastive losses:

$$\mathcal{L}_{cc}^{t} = -\log \frac{\exp\!\big(\cos(f_i^{\,s}, w_{y_i}^{t})/\tau\big)}{\sum_{c=1}^{K} \exp\!\big(\cos(f_i^{\,s}, w_{c}^{t})/\tau\big)}, \qquad
\mathcal{L}_{cc}^{s} = -\log \frac{\exp\!\big(\cos(f_i^{\,s}, w_{y_i}^{s})/\tau\big)}{\sum_{c=1}^{K} \exp\!\big(\cos(f_i^{\,s}, w_{c}^{s})/\tau\big)},$$

where $f_i^{\,s}$ is the feature of the $i$-th augmented sample extracted by the student network, $w_{c}^{t}$ and $w_{c}^{s}$ denote the class center of class $c$ in the teacher network and the student network, respectively, $K$ is the total number of classes, and $\tau$ is the temperature coefficient. The total class center contrastive loss is $\mathcal{L}_{cc} = \mathcal{L}_{cc}^{t} + \mathcal{L}_{cc}^{s}$.
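One way to realize the class-center contrastive learning described above is an InfoNCE-style loss over cosine similarities, sketched below; the temperature `tau=0.1` is an assumed value, and the sketch assumes the student features have already been projected to the same dimension as the class centers.

```python
import torch.nn.functional as F

def class_center_contrastive_loss(f_student, labels, teacher_centers, student_centers, tau=0.1):
    """Pull each student feature toward the class center of its own class in both
    networks and push it away from the other class centers, using cosine similarity.
    f_student: (B, D); teacher_centers, student_centers: (K, D); labels: (B,)."""
    f = F.normalize(f_student, dim=1)
    loss = 0.0
    for centers in (teacher_centers, student_centers):
        c = F.normalize(centers, dim=1)
        sim = f @ c.t() / tau                       # cosine similarities, shape (B, K)
        loss = loss + F.cross_entropy(sim, labels)  # InfoNCE-style contrastive term
    return loss
```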
This embodiment, based on the class-center contrastive learning knowledge distillation method, transfers the structured knowledge of the relationship between instance features and class centers, i.e., the relationship between the class centers of the fully connected layer and the instance features, explicitly optimizes the class representations, and explores the correlation between feature representations and classes, so that discriminability is enhanced.
Step c: calculating difficulty scores of the image samples by adopting the learning strategy of the preview mechanism, and dynamically assigning weights to different image samples based on the difficulty scores, as shown in FIG. 2(c).
When the difficulty score of an image sample is not greater than a dynamic threshold, the weight of the image sample is assigned to 1; otherwise, the weight of the image sample is assigned as the exponential of the negative square of its difficulty score.
The dynamic threshold is a power function whose exponent is the training epoch and whose base is the sum of 1 and a hyper-parameter controlling the growth rate.
The learning strategy based on the preview mechanism trains the network according to the difficulty level of the samples, which improves the accuracy and efficiency of the knowledge distillation method and ultimately improves the accuracy of image classification.
In practical computer vision tasks, different images differ greatly: some images contain a single object with a clean background and are easy to recognize, while others contain multiple objects with a cluttered background and are difficult to recognize. Obviously, a student network with a simple structure and few parameters has difficulty learning all the knowledge from the teacher network, especially the knowledge of difficult samples. In a real classroom, a teacher usually teaches the knowledge of the current lesson and lets students preview the difficult knowledge of the next lesson after class so that they can understand it better. Inspired by this, a new preview-based learning strategy is proposed, as shown in FIG. 4(a)-4(g), so that the student network previews the difficulty of the samples in advance during training and dynamically assigns weights to different samples, thereby effectively receiving the guidance of the teacher network.
If the student network can correctly classify a sample, the sample is inferred to be simple; otherwise it is difficult. Technically, the cross-entropy loss of a sample indicates how similar its prediction is to its label and can be regarded as the sample's difficulty. Therefore, the difficulty score $d_i$ of sample $x_i$ is defined as:

$$d_i = \frac{\ell_{ce}(x_i)}{\tfrac{1}{|\mathcal{B}|}\sum_{x_j \in \mathcal{B}} \ell_{ce}(x_j)},$$

where $\ell_{ce}(x_i)$ is the cross-entropy value of sample $x_i$, $\mathcal{B}$ denotes all the samples within the batch, and $|\mathcal{B}|$ is the number of samples in the batch. That is, the cross-entropy value of the sample is divided by the average cross entropy of all samples in the batch to obtain the sample's difficulty score.
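A possible implementation of this batch-wise difficulty score, assuming per-sample cross-entropy values computed from the student's logits, is sketched below.

```python
import torch.nn.functional as F

def difficulty_scores(student_logits, labels):
    """Difficulty score of each sample: its cross entropy divided by the mean
    cross entropy of the mini-batch."""
    per_sample_ce = F.cross_entropy(student_logits, labels, reduction="none")  # (B,)
    return per_sample_ce / per_sample_ce.mean().clamp_min(1e-12)
```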
The learning strategy based on the preview mechanism lets the student network not only focus on simple samples but also preview difficult samples in advance. Specifically, within a training batch the student network concentrates on the simple samples taught by the teacher network while still devoting some attention to learning the difficult samples. To achieve this, we introduce a weight $w_i$ for each sample during training as its attention:

$$w_i = \begin{cases} 1, & d_i \le \lambda, \\ \exp\!\left(-d_i^{2}\right), & d_i > \lambda, \end{cases}$$

where $d_i$ is the difficulty score of the sample and $\lambda$ is a dynamically changing threshold, always greater than 1, used to divide the samples. Since $\exp(-d_i^{2}) < 1$, a difficult sample is always weighted less than a simple sample. In addition, as training progresses, the performance of the student network is gradually strengthened under the guidance of the teacher network, and it can learn the knowledge of more difficult samples. Therefore, we dynamically raise the threshold $\lambda$ in each training epoch so that more difficult samples are divided into the simple group and the student network learns more difficult knowledge. The threshold $\lambda$ is defined as:

$$\lambda = (1+\alpha)^{e},$$

where $\alpha$ is a hyper-parameter controlling the growth rate and $e$ is the training epoch. As the training epoch $e$ increases, the threshold $\lambda$ grows, and more and more difficult samples are classified as simple samples.
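The dynamic weighting of the preview mechanism could then be sketched as follows; `alpha=0.05` is an assumed value for the growth-rate hyper-parameter, not one specified by the patent.

```python
import torch

def preview_weights(scores, epoch, alpha=0.05):
    """Preview-mechanism weighting: samples at or below the dynamic threshold keep
    weight 1, harder samples get weight exp(-d^2); the threshold (1 + alpha)^epoch
    grows every epoch so that more samples are treated as simple."""
    threshold = (1.0 + alpha) ** epoch
    return torch.where(scores <= threshold,
                       torch.ones_like(scores),
                       torch.exp(-scores ** 2))
```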
Step d: obtaining a total loss function based on the loss functions of supervised training, output alignment, feature alignment, class center alignment and class center contrastive learning, together with the weights of different image samples.
The supervised training, output alignment, feature alignment, class center alignment and class center contrastive learning methods are integrated and combined with the weights given by the learning strategy based on the preview mechanism to form the final loss function that guides the training of the student network. The final objective function can be expressed as:
$$\mathcal{L} = w_i\left(\mathcal{L}_{ce} + \beta_1 \mathcal{L}_{kd} + \beta_2 \mathcal{L}_{fa} + \beta_3 \mathcal{L}_{ca} + \beta_4 \mathcal{L}_{cc}\right),$$

where $\mathcal{L}_{ce}$ is the cross-entropy loss function, and $\mathcal{L}_{kd}$, $\mathcal{L}_{fa}$, $\mathcal{L}_{ca}$ and $\mathcal{L}_{cc}$ are the previously defined KD loss function, feature alignment loss function, class center alignment loss function and class center contrastive loss function, respectively; $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$ are weight parameters, and $w_i$ is the weight assigned to sample $x_i$ by the learning strategy based on the preview mechanism. Based on this final loss function, the present embodiment can guide the training of the student network with the pre-trained teacher network, so that the performance of the student network approaches or even exceeds that of the teacher network, thereby achieving model compression.
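Combining the terms, a sketch of the final objective might look like the following; it assumes that each loss term is available as a per-sample tensor (so the preview weights can be applied sample by sample) and uses placeholder values for the beta weight parameters.

```python
def total_loss(l_ce, l_kd, l_fa, l_ca, l_cc, sample_weights, betas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted combination of all loss terms, scaled per sample by the
    preview-mechanism weights. l_ce, l_kd, l_fa, l_cc are per-sample tensors of
    shape (B,); l_ca may be a scalar shared by the batch (it broadcasts)."""
    b1, b2, b3, b4 = betas
    per_sample = l_ce + b1 * l_kd + b2 * l_fa + b3 * l_ca + b4 * l_cc
    return (sample_weights * per_sample).mean()
```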
Step e: guiding the training of the student network according to the total loss function to obtain a student network compressed and trained on the basis of the teacher network model, the student network being used as an image classification model for predicting the class distribution of an input image.
After the training of the student network is completed, image classification can be performed. Given an image $x$, it is input into the student network, which outputs its prediction distribution; the class with the highest probability in the prediction distribution is taken as the final classification result:

$$\hat{y} = \arg\max_{c}\ \sigma(z^{s})_{c}.$$

In this way, a higher classification accuracy is achieved under model compression.
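A short inference sketch for the compressed student network follows; `student` and the single-image input shape are assumptions made for this example.

```python
import torch

@torch.no_grad()
def classify(student, image):
    """Inference with the compressed student network: return the class with the
    highest predicted probability (argmax of softmax equals argmax of logits)."""
    logits = student(image.unsqueeze(0))   # add a batch dimension
    return logits.argmax(dim=1).item()
```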
in table 1, 7 network structures (respectively: teacher network, student network, KD, fitNet, AT, CRD, and the method of the present invention) selected from the CIFAR-100 image dataset compare the classification accuracy of the present invention with other knowledge distillation algorithms, and it can be observed that the method achieves the optimal results on all network structures.
TABLE 1 Top1 accuracy (%) comparison of different network structures on CIFAR-100 dataset for the present invention and prior knowledge distillation algorithms
In this embodiment, the process of passing features through the fully connected layer is studied as the operation by which the network obtains its results, the weight matrix of the fully connected layer is regarded as the class centers, the class centers of the two networks are kept consistent, and class-level information is mined. The embodiment also provides a class-center contrastive learning knowledge distillation method to transfer the structured knowledge of the relationship between instance features and class centers, as well as a learning strategy based on the preview mechanism that trains the network according to sample difficulty, which greatly improves the accuracy of image classification and outperforms existing methods.
Example two
The embodiment provides a model compression system based on preview mechanism knowledge distillation, which comprises:
(1) The supervised training module is used for acquiring the image sample, labeling the label of the image sample and carrying out supervised training on the student network;
(2) The knowledge distillation module is used for performing output alignment, feature alignment, class center alignment and class center contrastive learning between a student network and a pre-trained teacher network based on a class-contrastive learning knowledge distillation method;
(3) The preview mechanism learning module is used for calculating difficulty scores of the image samples by adopting a learning strategy of the preview mechanism and dynamically distributing the weights of different image samples based on the difficulty scores;
(4) The total loss function determining module is used for obtaining a total loss function based on the loss functions of supervised training, output alignment, feature alignment, class center alignment and class center contrastive learning, together with the weights of different image samples;
(5) And the model compression module is used for guiding the training of the student network according to the total loss function to obtain the student network which is compressed and trained based on the teacher network model, and the student network is used as an image classification model and used for carrying out class distribution prediction on the input image.
It should be noted that, each module in the present embodiment corresponds to each step in the first embodiment one to one, and the specific implementation process is the same, which is not described again here.
EXAMPLE III
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the model compression method based on preview mechanism knowledge distillation as described above.
Example four
The present embodiment provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the model compression method based on preview mechanism knowledge distillation as described above when executing the program.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A model compression method based on preview mechanism knowledge distillation is characterized by comprising the following steps:
acquiring an image sample, labeling the image sample, and performing supervised training on a student network;
performing output alignment, feature alignment, class center alignment and class center contrastive learning between the student network and a pre-trained teacher network based on a class-contrastive learning knowledge distillation method;
calculating difficulty scores of the image samples by adopting a learning strategy of a preview mechanism, and dynamically assigning weights to different image samples based on the difficulty scores;
obtaining a total loss function based on the loss functions of supervised training, output alignment, feature alignment, class center alignment and class center contrastive learning, together with the weights of different image samples;
guiding the training of the student network according to the total loss function to obtain the student network compressed and trained on the basis of the teacher network model, the student network being used as an image classification model for predicting the class distribution of input images;
in the process of performing supervised training on the student network, the cross entropy between the student network's prediction distribution and the labels is minimized based on the image samples and the labels;
in the process of aligning the output of the student network and the output of the pre-trained teacher network, the KL divergence between the output of the teacher network and the output of the student network is minimized so that the two outputs become similar;
in the process of aligning the features of the student network and the pre-trained teacher network, the features of the student network are mapped to the feature dimension of the teacher network through a multi-layer perceptron, and the Euclidean distance between the features of the student network and the teacher network is minimized so that the two sets of features become similar;
in the process of aligning the class centers of the student network and the pre-trained teacher network, the Euclidean distance between the weight matrices of the fully connected layers of the teacher network and the student network is minimized, aligning the class centers of the two networks.
2. The model compression method based on preview mechanism knowledge distillation of claim 1, wherein, in the process of calculating the difficulty scores of the image samples with the learning strategy of the preview mechanism, when the difficulty score of an image sample is not greater than a dynamic threshold, the weight of the image sample is assigned to 1; otherwise, the weight of the image sample is assigned as the exponential of the negative square of its difficulty score.
3. The model compression method based on preview mechanism knowledge distillation of claim 2, wherein the dynamic threshold is a power function whose exponent is the number of training epochs and whose base is the sum of 1 and a hyper-parameter controlling the growth rate.
4. A model compression system based on preview mechanism knowledge distillation, comprising:
the supervised training module is used for acquiring the image sample, labeling the label of the image sample and carrying out supervised training on the student network;
the knowledge distillation module is used for performing output alignment, feature alignment, class center alignment and class center contrastive learning between a student network and a pre-trained teacher network based on a class-contrastive learning knowledge distillation method;
the preview mechanism learning module is used for calculating difficulty scores of the image samples by adopting a learning strategy of the preview mechanism and dynamically distributing the weights of different image samples based on the difficulty scores;
the total loss function determining module is used for obtaining a total loss function based on the loss functions of supervised training, output alignment, feature alignment, class center alignment and class center contrastive learning, together with the weights of different image samples;
the model compression module is used for guiding the training of the student network according to the total loss function to obtain the student network which is compressed and trained based on the teacher network model, and the student network is used as an image classification model and used for carrying out class distribution prediction on the input image;
wherein:
in the process of performing supervised training on the student network, the cross entropy between the student network's prediction distribution and the labels is minimized based on the image samples and the labels;
in the process of aligning the output of the student network and the output of the pre-trained teacher network, the KL divergence between the output of the teacher network and the output of the student network is minimized so that the two outputs become similar;
in the process of aligning the features of the student network and the pre-trained teacher network, the features of the student network are mapped to the feature dimension of the teacher network through a multi-layer perceptron, and the Euclidean distance between the features of the student network and the teacher network is minimized so that the two sets of features become similar;
in the process of aligning the class centers of the student network and the pre-trained teacher network, the Euclidean distance between the weight matrices of the fully connected layers of the teacher network and the student network is minimized, aligning the class centers of the two networks.
5. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the model compression method based on preview mechanism knowledge distillation as set forth in any one of claims 1 to 3.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the model compression method based on preview mechanism knowledge distillation as set forth in any one of claims 1 to 3.
CN202211206057.8A 2022-09-30 2022-09-30 Model compression method and system based on preview mechanism knowledge distillation Active CN115294407B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211206057.8A CN115294407B (en) 2022-09-30 2022-09-30 Model compression method and system based on preview mechanism knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211206057.8A CN115294407B (en) 2022-09-30 2022-09-30 Model compression method and system based on preview mechanism knowledge distillation

Publications (2)

Publication Number Publication Date
CN115294407A CN115294407A (en) 2022-11-04
CN115294407B true CN115294407B (en) 2023-01-03

Family

ID=83834355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211206057.8A Active CN115294407B (en) 2022-09-30 2022-09-30 Model compression method and system based on preview mechanism knowledge distillation

Country Status (1)

Country Link
CN (1) CN115294407B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115908823B (en) * 2023-03-09 2023-05-12 南京航空航天大学 Semantic segmentation method based on difficulty distillation
CN116091849B (en) * 2023-04-11 2023-07-25 山东建筑大学 Tire pattern classification method, system, medium and equipment based on grouping decoder
CN116502621B (en) * 2023-06-26 2023-10-17 北京航空航天大学 Network compression method and device based on self-adaptive comparison knowledge distillation
CN116758618B (en) * 2023-08-16 2024-01-09 苏州浪潮智能科技有限公司 Image recognition method, training device, electronic equipment and storage medium
CN117009830B (en) * 2023-10-07 2024-02-13 之江实验室 Knowledge distillation method and system based on embedded feature regularization
CN117372785B (en) * 2023-12-04 2024-03-26 吉林大学 Image classification method based on feature cluster center compression

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241273A (en) * 2021-12-01 2022-03-25 电子科技大学 Multi-modal image processing method and system based on Transformer network and hypersphere space learning

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062951B (en) * 2019-12-11 2022-03-25 华中科技大学 Knowledge distillation method based on semantic segmentation intra-class feature difference
KR102191351B1 (en) * 2020-04-28 2020-12-15 아주대학교산학협력단 Method for semantic segmentation based on knowledge distillation
CN111950638B (en) * 2020-08-14 2024-02-06 厦门美图之家科技有限公司 Image classification method and device based on model distillation and electronic equipment
CN112287920B (en) * 2020-09-17 2022-06-14 昆明理工大学 Burma language OCR method based on knowledge distillation
CN112464981B (en) * 2020-10-27 2024-02-06 中科视语(句容)科技有限公司 Self-adaptive knowledge distillation method based on spatial attention mechanism
CN114444558A (en) * 2020-11-05 2022-05-06 佳能株式会社 Training method and training device for neural network for object recognition
US20220138633A1 (en) * 2020-11-05 2022-05-05 Samsung Electronics Co., Ltd. Method and apparatus for incremental learning
CN112561059B (en) * 2020-12-15 2023-08-01 北京百度网讯科技有限公司 Method and apparatus for model distillation
CN114170478A (en) * 2021-12-09 2022-03-11 中山大学 Defect detection and positioning method and system based on cross-image local feature alignment
CN114419369A (en) * 2022-01-04 2022-04-29 深圳市大数据研究院 Method, system, electronic device and storage medium for classifying polyps in image
CN115019123B (en) * 2022-05-20 2023-04-18 中南大学 Self-distillation contrast learning method for remote sensing image scene classification

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241273A (en) * 2021-12-01 2022-03-25 电子科技大学 Multi-modal image processing method and system based on Transformer network and hypersphere space learning

Also Published As

Publication number Publication date
CN115294407A (en) 2022-11-04

Similar Documents

Publication Publication Date Title
CN115294407B (en) Model compression method and system based on preview mechanism knowledge distillation
CN111554268B (en) Language identification method based on language model, text classification method and device
CN110334705B (en) Language identification method of scene text image combining global and local information
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
Islalm et al. Recognition bangla sign language using convolutional neural network
CN107644235A (en) Image automatic annotation method based on semi-supervised learning
CN107251059A (en) Sparse reasoning module for deep learning
CN110321967B (en) Image classification improvement method based on convolutional neural network
TW201841130A (en) Neural network compression via weak supervision
CN113887610A (en) Pollen image classification method based on cross attention distillation transducer
CN111754596A (en) Editing model generation method, editing model generation device, editing method, editing device, editing equipment and editing medium
CN105138973A (en) Face authentication method and device
CN112446423B (en) Fast hybrid high-order attention domain confrontation network method based on transfer learning
CN106778852A (en) A kind of picture material recognition methods for correcting erroneous judgement
CN109344884A (en) The method and device of media information classification method, training picture classification model
CN108154156B (en) Image set classification method and device based on neural topic model
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN109308319A (en) File classification method, document sorting apparatus and computer readable storage medium
CN109344898A (en) Convolutional neural networks image classification method based on sparse coding pre-training
CN115563327A (en) Zero sample cross-modal retrieval method based on Transformer network selective distillation
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN114299362A (en) Small sample image classification method based on k-means clustering
CN111310820A (en) Foundation meteorological cloud chart classification method based on cross validation depth CNN feature integration
CN105389588A (en) Multi-semantic-codebook-based image feature representation method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant