CN115294407B - Model compression method and system based on preview mechanism knowledge distillation - Google Patents

Model compression method and system based on preview mechanism knowledge distillation

Info

Publication number
CN115294407B
CN115294407B (application CN202211206057.8A)
Authority
CN
China
Prior art keywords
network
student network
student
teacher
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211206057.8A
Other languages
Chinese (zh)
Other versions
CN115294407A (en)
Inventor
吴建龙
丁沐河
聂礼强
董雪
甘甜
丁宁
姜飞俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Maojing Artificial Intelligence Technology Co ltd
Shandong University
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Zhejiang Maojing Artificial Intelligence Technology Co ltd
Shandong University
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Maojing Artificial Intelligence Technology Co ltd, Shandong University, Shenzhen Graduate School Harbin Institute of Technology filed Critical Zhejiang Maojing Artificial Intelligence Technology Co ltd
Priority to CN202211206057.8A priority Critical patent/CN115294407B/en
Publication of CN115294407A publication Critical patent/CN115294407A/en
Application granted granted Critical
Publication of CN115294407B publication Critical patent/CN115294407B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/778Active pattern-learning, e.g. online learning of image or video features
    • G06V10/7784Active pattern-learning, e.g. online learning of image or video features based on feedback from supervisors
    • G06V10/7788Active pattern-learning, e.g. online learning of image or video features based on feedback from supervisors the supervisor being a human, e.g. interactive learning with a human teacher
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention belongs to the field of computer vision image classification and provides a model compression method and system based on preview mechanism knowledge distillation, aiming to solve the problems of poor accuracy and instability in image classification and recognition. An image sample is acquired, its label is annotated, and a student network is trained under supervision; the student network and a pre-trained teacher network perform output alignment, feature alignment, class center alignment and class center contrastive learning; difficulty scores of the image samples are calculated, and weights are dynamically assigned to different image samples; a total loss function is obtained from the loss functions of supervised training, output alignment, feature alignment, class center alignment and class center contrastive learning together with the weights of the different image samples; and the training of the student network is guided by the total loss function to obtain a trained student network, which is used as an image classification model for predicting the class distribution of an input image. This improves the accuracy of image recognition and classification.

Description

Model compression method and system based on preview mechanism knowledge distillation
Technical Field
The invention belongs to the field of computer vision image classification, and particularly relates to a model compression method and system based on preview mechanism knowledge distillation.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Convolutional neural networks excel at image classification, and complex models can significantly improve classification performance, but they bring more parameters and computation, making them difficult to deploy in practical applications; compressing the network model is therefore necessary. Knowledge distillation is a mainstream model compression algorithm whose core idea is to distill the knowledge of a complex network (the teacher) into a lightweight network (the student). Under the guidance of the teacher network, a student network with a simple structure and few parameters improves its accuracy on the image classification task and achieves performance similar to that of the teacher network.
Generally, the key to knowledge distillation is extracting knowledge from the teacher network to guide the student network. For example, KD treats the output (logits) of the teacher network as knowledge and transfers it by narrowing the KL divergence between the two networks' outputs, letting the student network mimic the teacher's output. FitNet and Attention further treat the feature maps of the networks' intermediate layers as knowledge and convey it by narrowing the Euclidean distance between the feature maps. However, these methods only extract instance-level consistency information and ignore the structured information in the teacher network's feature space. CRD and CRCD introduce contrastive learning into knowledge distillation to extract discriminative representational knowledge. Despite the success of the above methods, the following problems remain. First, existing methods mainly make the student network imitate the results of the teacher network but ignore teaching the student network the operation that produces those results. Specifically, they guide the student with the teacher network's results (e.g., features or outputs) while ignoring the teacher's intrinsic knowledge (e.g., structure and parameters). The output (logits) of an instance is produced by the fully connected layer, whose weight matrix perceives the similarity between the instance and each class and outputs the probability that the instance belongs to each class. We therefore regard the weight matrix that classifies instances as a set of class centers; it embodies the operation by which the teacher network obtains its results and can be taught to the student network as knowledge. Second, existing methods do not distinguish between easy and difficult samples, so the student network cannot effectively learn from difficult samples in the initial stage. Some images contain a single object with a clean background and are easy to distinguish, while others contain multiple objects with a cluttered background and are difficult to distinguish. The student network, with its simple structure and few parameters, cannot at first absorb all the knowledge taught by the teacher network, especially the knowledge of difficult samples.
In summary, the inventors found that although existing model compression techniques enable the student network for image classification to imitate the results of the teacher network, the accuracy of the student network in identifying image categories is unstable because the intrinsic knowledge of the teacher network is ignored; moreover, the difficulty level of the image samples is not distinguished, so the student network cannot effectively learn difficult samples in the initial stage, and the final accuracy of image category identification is poor.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides a model compression method and system based on preview mechanism knowledge distillation, which extract the relationship between the class centers of the fully connected layer and instance features, explicitly optimize the class representations, and explore the correlation between feature representations and classes, thereby enhancing discriminability and improving the accuracy of image recognition and classification.
In order to achieve the purpose, the invention adopts the following technical scheme:
the first aspect of the invention provides a model compression method based on the distillation of the knowledge of the preview mechanism.
A model compression method based on preview mechanism knowledge distillation, comprising:
acquiring an image sample, labeling the image sample, and performing supervised training on a student network;
performing output alignment, feature alignment, class center alignment and class center contrastive learning between the student network and a pre-trained teacher network based on a class-contrastive learning knowledge distillation method;
calculating difficulty scores of the image samples by adopting a learning strategy of a preview mechanism, and dynamically assigning weights to different image samples based on the difficulty scores;
obtaining a total loss function based on the loss functions of supervised training, output alignment, feature alignment, class center alignment and class center contrastive learning, together with the weights of the different image samples;
and guiding the training of the student network according to the total loss function to obtain a student network compressed and trained on the basis of the teacher network model, the student network being used as an image classification model for predicting the class distribution of an input image.
In the process of performing supervised training on the student network, the cross entropy between the student network's prediction distribution and the labels is minimized based on the image samples and their labels.
As an embodiment, in the process of aligning the output of the student network with the output of the pre-trained teacher network, the KL divergence between the output of the teacher network and the output of the student network is minimized so that the two outputs become similar.
In the process of aligning the features of the student network and the pre-trained teacher network, the features of the student network are mapped to the feature dimension of the teacher network through a multi-layer perceptron, and the Euclidean distance between the features of the student network and the features of the teacher network is minimized so that the two become similar.
In one embodiment, in the process of aligning the class centers of the student network and the pre-trained teacher network, the Euclidean distance between the weight matrices of the fully connected layers of the teacher network and the student network is minimized, aligning the class centers of the two networks.
The above technical scheme has the advantage that, by training the student network and the teacher network for output alignment, feature alignment and class center alignment, the fully connected layers of the student network and the teacher network are kept consistent, class-level information is mined, and knowledge transfer is facilitated.
As an implementation manner, in the process of calculating the difficulty scores of the image samples with the learning strategy of the preview mechanism, when the difficulty score of an image sample is not greater than a dynamic threshold, the weight of the image sample is assigned to 1; otherwise, the weight of the image sample is assigned as the exponential of the negative square of its difficulty score (i.e., exp(-d^2), where d is the difficulty score).
In one embodiment, the dynamic threshold is a power function whose exponent is the number of training epochs and whose base is the sum of 1 and a hyper-parameter controlling the growth rate.
The above technical scheme has the advantage that, by adopting the learning strategy based on the preview mechanism and training the network according to the difficulty level of the samples, the accuracy and efficiency of the knowledge distillation method are improved, and the accuracy of image classification is ultimately improved.
A second aspect of the invention provides a model compression system based on preview mechanism knowledge distillation.
A model compression system based on preview mechanism knowledge distillation, comprising:
the supervised training module is used for acquiring the image sample, labeling the label of the image sample and carrying out supervised training on the student network;
the knowledge distillation module is used for performing output alignment, feature alignment, class center alignment and class center contrastive learning between a student network and a pre-trained teacher network based on a class-contrastive learning knowledge distillation method;
the preview mechanism learning module is used for calculating difficulty scores of the image samples by adopting a learning strategy of a preview mechanism and dynamically distributing weights of different image samples based on the difficulty scores;
the total loss function determining module is used for obtaining a total loss function based on the loss functions of supervised training, output alignment, feature alignment, class center alignment and class center contrastive learning, together with the weights of the different image samples;
and the model compression module is used for guiding the training of the student network according to the total loss function to obtain the student network which is compressed and trained based on the teacher network model, and the student network is used as an image classification model and used for carrying out class distribution prediction on the input image.
A third aspect of the invention provides a computer-readable storage medium.
A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, carries out the steps of the model compression method based on preview mechanism knowledge distillation as described above.
A fourth aspect of the invention provides an electronic device.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the model compression method based on preview mechanism knowledge distillation as described above when executing the program.
Compared with the prior art, the invention has the beneficial effects that:
(1) According to the invention, the student network and the teacher network are trained for output alignment, feature alignment and class center alignment; the process of passing features through the fully connected layer is studied as the operation by which the network obtains its results, and the weight matrix of the fully connected layer is regarded as the class centers. Keeping the fully connected layers of the student network and the teacher network consistent mines class-level information, facilitates knowledge transfer, and greatly improves the accuracy of image classification.
(2) Based on the class-contrastive learning knowledge distillation method, the invention contrasts the features of the student network with the class centers of the teacher network and of the student network itself, learning the structured knowledge of the relationship between instance features and class centers, explicitly optimizing the class representations of the student network and the teacher network, and transferring the correlation between feature representations and classes. The method can therefore transfer the structured relationship between the class centers of the fully connected layer and the instance features, explicitly optimize the class representations, and explore the correlation between feature representations and classes, enhancing discriminability and improving the accuracy of image classification.
(3) Based on the learning strategy of the preview mechanism, the invention trains the network according to the difficulty level of the samples, improving the accuracy and efficiency of the knowledge distillation method and ultimately improving the accuracy of image classification.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and do not limit the invention.
FIG. 1 is a flow chart of the model compression method based on preview mechanism knowledge distillation according to an embodiment of the present invention;
FIG. 2(a) is a schematic diagram of the class-contrastive learning knowledge distillation method proposed by an embodiment of the present invention;
FIG. 2(b) is a schematic diagram of class center contrastive learning according to an embodiment of the present invention;
FIG. 2(c) is a schematic diagram of the learning strategy based on the preview mechanism according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of features being multiplied by the class centers of the fully connected layer in an embodiment of the present invention;
FIG. 4(a) is a training example of the learning strategy based on the preview mechanism proposed by an embodiment of the present invention;
FIG. 4(b) shows the ratio of the loss functions of simple samples and difficult samples at the $p_1$-th training epoch according to an embodiment of the present invention;
FIG. 4(c) shows the ratio of the numbers of simple samples and difficult samples at the $p_1$-th training epoch according to an embodiment of the present invention;
FIG. 4(d) shows the ratio of the loss functions of simple samples and difficult samples at the $p_k$-th training epoch according to an embodiment of the present invention;
FIG. 4(e) shows the ratio of the numbers of simple samples and difficult samples at the $p_k$-th training epoch according to an embodiment of the present invention;
FIG. 4(f) shows the ratio of the loss functions of simple samples and difficult samples at the $p_n$-th training epoch according to an embodiment of the present invention;
FIG. 4(g) shows the ratio of the numbers of simple samples and difficult samples at the $p_n$-th training epoch according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Example one
The embodiment provides a model compression process based on preview mechanism knowledge distillation. Specifically, a teacher network is pre-trained and then used to guide the training of the student network. The student network first undergoes supervised training with the image labels. Output alignment then lets the student network imitate the output of the teacher network; feature alignment keeps the student network's features similar to those of the teacher network; class center alignment keeps the class centers of the teacher network and the student network consistent; and finally, class center contrastive learning contrasts the student network's features with the class centers of both the student network and the teacher network, pulling features close to the class center of the same class and pushing them away from the class centers of different classes. In addition, a learning strategy based on the preview mechanism computes a difficulty score for each sample and dynamically assigns weights as training proceeds; the sample weights are then multiplied by the corresponding losses to obtain the final loss function that guides the learning of the student network.
As shown in FIG. 1, the present embodiment provides a model compression method based on preview mechanism knowledge distillation, which includes:
step a: and acquiring an image sample, labeling a label of the image sample, and performing supervision training on a student network.
Supervision and training: in the process of carrying out supervision training on the student network, the prediction distribution of the student network and the cross entropy of the labels are minimized based on the image training samples and the labels.
Denote the training dataset as $X=\{x_1, x_2, \ldots, x_N\}$ and the corresponding label set as $Y=\{y_1, y_2, \ldots, y_N\}$, where $x_i$ is the $i$-th image, $y_i$ is its label, and $N$ is the total number of images. The training data are augmented as follows: each image is cropped, rotated and normalized with the mean and variance, and the augmented image is further rotated by 90 degrees, 180 degrees and 270 degrees in turn, so that each sample is expanded into four copies.

For each input image $x_i$, let $z_i^{s}$ denote the output (logits) of the student network. Following the supervised learning approach for image classification, we minimize the cross entropy between the prediction distribution and the image label. The loss function is:

$$\mathcal{L}_{ce} = \mathrm{CE}\big(\sigma(z_i^{s}),\, y_i\big),$$

where $\sigma(\cdot)$ is the softmax function and $\mathrm{CE}(\cdot,\cdot)$ denotes the cross entropy.
Step b: Based on the class-contrastive learning knowledge distillation method, the student network performs output alignment, feature alignment, class center alignment and class center contrastive learning with a pre-trained teacher network, as shown in FIG. 2(a) and FIG. 2(b).
The specific processes of output alignment, feature alignment and class center alignment are described below with reference to FIG. 2(a).
Output alignment:
in the process of carrying out output alignment training on the student network, the KL divergence of the output of the teacher network and the output of the student network are minimized, and the outputs of the teacher network and the student network are similar.
Using the KD method, knowledge is transferred by minimizing the KL divergence between the outputs of the teacher network and the student network, allowing the student network to mimic the teacher network's outputs. The loss function is:

$$\mathcal{L}_{kd} = \mathrm{KL}\big(\sigma(z_i^{t}/\tau)\,\|\,\sigma(z_i^{s}/\tau)\big),$$

where $z_i^{t}$ denotes the output (logits) of the teacher network and $\tau$ is the temperature coefficient.
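A minimal sketch of this output-alignment (KD) loss in a PyTorch setting is given below; the temperature value `tau=4.0` is an illustrative choice rather than a value specified by the patent.

```python
import torch.nn.functional as F

def output_alignment_loss(student_logits, teacher_logits, tau=4.0):
    """Output alignment: KL divergence between the softened teacher and student
    output distributions, as in standard KD. tau is an assumed temperature value."""
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    # 'batchmean' averages the KL over the batch; tau**2 keeps gradient magnitudes
    # comparable across temperatures (the usual KD scaling).
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2
```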
Feature alignment:
in the process of carrying out feature alignment training on the student network, aligning the features of the student network with the feature dimensions of the teacher network through the multilayer perceptron, and minimizing the Euclidean distance between the features to enable the features of the student network and the teacher network to be similar.
Feature alignment is performed on the augmented data. We first denote the augmented data as $\tilde{x}_i$, and let $f_i^{t}$ and $f_i^{s}$ denote the features that the teacher network and the student network, respectively, extract from the $i$-th augmented sample. Feature alignment lets the student features $f_i^{s}$ imitate the teacher features $f_i^{t}$. Because the two features have different dimensions, the student features $f_i^{s}$ are first mapped to the teacher's dimension by a multi-layer perceptron $\phi(\cdot)$ and normalized, and the $\ell_2$ distance is then computed to achieve feature alignment. The feature alignment loss function is:

$$\mathcal{L}_{fa} = \left\| \bar{f}_i^{\,s} - \bar{f}_i^{\,t} \right\|_2^2,$$

where $\bar{f}_i^{\,s}$ and $\bar{f}_i^{\,t}$ denote the normalized features of the student network and the teacher network, respectively, and $\bar{v}$ denotes the normalization of a vector $v$. Specifically, $\bar{f}_i^{\,s} = \phi(f_i^{\,s}) / \|\phi(f_i^{\,s})\|_2$ and $\bar{f}_i^{\,t} = f_i^{\,t} / \|f_i^{\,t}\|_2$.
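The feature-alignment step could be sketched as follows; the projector architecture (`hidden_dim=512`, two linear layers with a ReLU) is an assumed configuration, since the patent does not fix the structure of the multi-layer perceptron.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignment(nn.Module):
    """Map student features to the teacher's dimension with an MLP, normalize both
    feature vectors, and penalize their squared Euclidean distance."""
    def __init__(self, student_dim, teacher_dim, hidden_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(student_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, teacher_dim),
        )

    def forward(self, f_student, f_teacher):
        f_s = F.normalize(self.mlp(f_student), dim=1)   # normalized, projected f^s
        f_t = F.normalize(f_teacher, dim=1)             # normalized f^t
        return ((f_s - f_t) ** 2).sum(dim=1).mean()     # squared L2 distance
```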
Class center alignment:
and in the process of carrying out class center alignment training on the student network, minimizing the Euclidean distance of the weight matrixes of the full connection layers of the teacher network and the student network, and aligning the class centers of the teacher network and the student network.
In convolutional neural networks, a teacher network extracts features of an input image through a plurality of convolutional and pooling layers, and then maps the features to outputs (logits) through fully-connected layers for classification. Specifically, as shown in fig. 3, the weight matrix of the fully-connected layer operates by perceiving the similarity between the image features and each class, thereby outputting the probability that the image belongs to each class. We refer to each column of the weight matrix as a class center, representing the attributes of a particular class. Therefore, we teach the knowledge in the class center to the student network, enabling it to understand how the teacher network classifies the images. Technically, we let students learn class centers of teacher networks over the network by the following loss function:
$$\mathcal{L}_{ca} = \left\| W^{t} - W^{s} \right\|_2^2,$$

where $W^{t}$ and $W^{s}$ denote the class centers (i.e., the weight matrices of the fully connected layers) of the teacher network and the student network, respectively.
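A small sketch of this class-center alignment loss is given below, under the assumption that the class centers are taken from the final fully connected layers' weight matrices (in PyTorch, each row of `linear.weight` corresponds to one class); the optional `projector` argument is a hypothetical helper for the case where the two weight matrices have different feature dimensions.

```python
def class_center_alignment_loss(teacher_fc_weight, student_fc_weight, projector=None):
    """Align the class centers, i.e. the weight matrices of the final fully connected
    layers. 'projector' is a hypothetical mapping used only if the student's feature
    dimension differs from the teacher's."""
    w_s = student_fc_weight if projector is None else projector(student_fc_weight)
    return ((teacher_fc_weight - w_s) ** 2).sum()   # squared Euclidean distance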
By training the student network and the teacher network for output alignment, feature alignment and class center alignment, the fully connected layers of the student network and the teacher network are kept consistent, class-level information is mined, and knowledge transfer is facilitated.
Specifically, as shown in FIG. 2(b), in the process of contrasting the class centers of the student network and the teacher network, the class-center contrastive learning knowledge distillation method contrasts the features of the student network with the class centers of the teacher network and with its own class centers, learning the structured knowledge of the relationship between features and class centers, explicitly optimizing the class representations of the student network and the teacher network, and transferring the correlation between feature representations and classes.
The class centers represent the attributes of each class, so each class center should be representative of its own class and distinguishable from the other classes. We pull the student features close to the class center of the same class in the teacher network, $w_{y_i}^{t}$, and to the corresponding class center in the student network, $w_{y_i}^{s}$, and push them away from the class centers of the other classes in both networks. The cosine distance is used to estimate the similarity between a sample feature and a class center. In summary, we define the following contrastive losses:

$$\mathcal{L}_{cc}^{t} = -\log \frac{\exp\!\big(\cos(f_i^{\,s}, w_{y_i}^{t})/\tau\big)}{\sum_{c=1}^{K} \exp\!\big(\cos(f_i^{\,s}, w_{c}^{t})/\tau\big)}, \qquad
\mathcal{L}_{cc}^{s} = -\log \frac{\exp\!\big(\cos(f_i^{\,s}, w_{y_i}^{s})/\tau\big)}{\sum_{c=1}^{K} \exp\!\big(\cos(f_i^{\,s}, w_{c}^{s})/\tau\big)},$$

where $f_i^{\,s}$ is the feature of the $i$-th augmented sample extracted by the student network, $w_{c}^{t}$ and $w_{c}^{s}$ denote the class center of class $c$ in the teacher network and the student network, respectively, $K$ is the total number of classes, and $\tau$ is the temperature coefficient. The total class center contrastive loss is $\mathcal{L}_{cc} = \mathcal{L}_{cc}^{t} + \mathcal{L}_{cc}^{s}$.
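One way to realize the class-center contrastive learning described above is an InfoNCE-style loss over cosine similarities, sketched below; the temperature `tau=0.1` is an assumed value, and the sketch assumes the student features have already been projected to the same dimension as the class centers.

```python
import torch.nn.functional as F

def class_center_contrastive_loss(f_student, labels, teacher_centers, student_centers, tau=0.1):
    """Pull each student feature toward the class center of its own class in both
    networks and push it away from the other class centers, using cosine similarity.
    f_student: (B, D); teacher_centers, student_centers: (K, D); labels: (B,)."""
    f = F.normalize(f_student, dim=1)
    loss = 0.0
    for centers in (teacher_centers, student_centers):
        c = F.normalize(centers, dim=1)
        sim = f @ c.t() / tau                       # cosine similarities, shape (B, K)
        loss = loss + F.cross_entropy(sim, labels)  # InfoNCE-style contrastive term
    return loss
```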
This embodiment, based on the class-center contrastive learning knowledge distillation method, transfers the structured knowledge of the relationship between instance features and class centers, i.e., the relationship between the class centers of the fully connected layer and the instance features, explicitly optimizes the class representations, and explores the correlation between feature representations and classes, so that discriminability is enhanced.
Step c: calculating difficulty scores of the image samples by adopting the learning strategy of the preview mechanism, and dynamically assigning weights to different image samples based on the difficulty scores, as shown in FIG. 2(c).
When the difficulty score of an image sample is not greater than a dynamic threshold, the weight of the image sample is assigned to 1; otherwise, the weight of the image sample is assigned as the exponential of the negative square of its difficulty score.
The dynamic threshold is a power function whose exponent is the training epoch and whose base is the sum of 1 and a hyper-parameter controlling the growth rate.
The learning strategy based on the preview mechanism trains the network according to the difficulty level of the samples, which improves the accuracy and efficiency of the knowledge distillation method and ultimately improves the accuracy of image classification.
In practical computer vision tasks, different images differ greatly: some images contain a single object with a clean background and are easy to recognize, while others contain multiple objects with a cluttered background and are difficult to recognize. Obviously, a student network with a simple structure and few parameters has difficulty learning all the knowledge from the teacher network, especially the knowledge of difficult samples. In a real classroom, a teacher usually teaches the knowledge of the current lesson and lets students preview the difficult knowledge of the next lesson after class so that they can understand it better. Inspired by this, a new preview-based learning strategy is proposed, as shown in FIG. 4(a)-4(g), so that the student network previews the difficulty of the samples in advance during training and dynamically assigns weights to different samples, thereby effectively receiving the guidance of the teacher network.
If the student network can correctly classify a sample, the sample is inferred to be simple; otherwise it is difficult. Technically, the cross-entropy loss of a sample indicates how similar its prediction is to its label and can be regarded as the sample's difficulty. Therefore, the difficulty score $d_i$ of sample $x_i$ is defined as:

$$d_i = \frac{\ell_{ce}(x_i)}{\tfrac{1}{|\mathcal{B}|}\sum_{x_j \in \mathcal{B}} \ell_{ce}(x_j)},$$

where $\ell_{ce}(x_i)$ is the cross-entropy value of sample $x_i$, $\mathcal{B}$ denotes all the samples within the batch, and $|\mathcal{B}|$ is the number of samples in the batch. That is, the cross-entropy value of the sample is divided by the average cross entropy of all samples in the batch to obtain the sample's difficulty score.
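A possible implementation of this batch-wise difficulty score, assuming per-sample cross-entropy values computed from the student's logits, is sketched below.

```python
import torch.nn.functional as F

def difficulty_scores(student_logits, labels):
    """Difficulty score of each sample: its cross entropy divided by the mean
    cross entropy of the mini-batch."""
    per_sample_ce = F.cross_entropy(student_logits, labels, reduction="none")  # (B,)
    return per_sample_ce / per_sample_ce.mean().clamp_min(1e-12)
```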
The learning strategy based on the preview mechanism lets the student network not only focus on simple samples but also preview difficult samples in advance. Specifically, within a training batch the student network concentrates on the simple samples taught by the teacher network while still devoting some attention to learning the difficult samples. To achieve this, we introduce a weight $w_i$ for each sample during training as its attention:

$$w_i = \begin{cases} 1, & d_i \le \lambda, \\ \exp\!\left(-d_i^{2}\right), & d_i > \lambda, \end{cases}$$

where $d_i$ is the difficulty score of the sample and $\lambda$ is a dynamically changing threshold, always greater than 1, used to divide the samples. Since $\exp(-d_i^{2}) < 1$, a difficult sample is always weighted less than a simple sample. In addition, as training progresses, the performance of the student network is gradually strengthened under the guidance of the teacher network, and it can learn the knowledge of more difficult samples. Therefore, we dynamically raise the threshold $\lambda$ in each training epoch so that more difficult samples are divided into the simple group and the student network learns more difficult knowledge. The threshold $\lambda$ is defined as:

$$\lambda = (1+\alpha)^{e},$$

where $\alpha$ is a hyper-parameter controlling the growth rate and $e$ is the training epoch. As the training epoch $e$ increases, the threshold $\lambda$ grows, and more and more difficult samples are classified as simple samples.
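The dynamic weighting of the preview mechanism could then be sketched as follows; `alpha=0.05` is an assumed value for the growth-rate hyper-parameter, not one specified by the patent.

```python
import torch

def preview_weights(scores, epoch, alpha=0.05):
    """Preview-mechanism weighting: samples at or below the dynamic threshold keep
    weight 1, harder samples get weight exp(-d^2); the threshold (1 + alpha)^epoch
    grows every epoch so that more samples are treated as simple."""
    threshold = (1.0 + alpha) ** epoch
    return torch.where(scores <= threshold,
                       torch.ones_like(scores),
                       torch.exp(-scores ** 2))
```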
Step d: obtaining a total loss function based on the loss functions of supervised training, output alignment, feature alignment, class center alignment and class center contrastive learning, together with the weights of different image samples.
The supervised training, output alignment, feature alignment, class center alignment and class center contrastive learning methods are integrated and combined with the weights given by the learning strategy based on the preview mechanism to form the final loss function that guides the training of the student network. The final objective function can be expressed as:
$$\mathcal{L} = w_i\left(\mathcal{L}_{ce} + \beta_1 \mathcal{L}_{kd} + \beta_2 \mathcal{L}_{fa} + \beta_3 \mathcal{L}_{ca} + \beta_4 \mathcal{L}_{cc}\right),$$

where $\mathcal{L}_{ce}$ is the cross-entropy loss function, and $\mathcal{L}_{kd}$, $\mathcal{L}_{fa}$, $\mathcal{L}_{ca}$ and $\mathcal{L}_{cc}$ are the previously defined KD loss function, feature alignment loss function, class center alignment loss function and class center contrastive loss function, respectively; $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$ are weight parameters, and $w_i$ is the weight assigned to sample $x_i$ by the learning strategy based on the preview mechanism. Based on this final loss function, the present embodiment can guide the training of the student network with the pre-trained teacher network, so that the performance of the student network approaches or even exceeds that of the teacher network, thereby achieving model compression.
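Combining the terms, a sketch of the final objective might look like the following; it assumes that each loss term is available as a per-sample tensor (so the preview weights can be applied sample by sample) and uses placeholder values for the beta weight parameters.

```python
def total_loss(l_ce, l_kd, l_fa, l_ca, l_cc, sample_weights, betas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted combination of all loss terms, scaled per sample by the
    preview-mechanism weights. l_ce, l_kd, l_fa, l_cc are per-sample tensors of
    shape (B,); l_ca may be a scalar shared by the batch (it broadcasts)."""
    b1, b2, b3, b4 = betas
    per_sample = l_ce + b1 * l_kd + b2 * l_fa + b3 * l_ca + b4 * l_cc
    return (sample_weights * per_sample).mean()
```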
Step e: guiding the training of the student network according to the total loss function to obtain a student network compressed and trained on the basis of the teacher network model, the student network being used as an image classification model for predicting the class distribution of an input image.
After the training of the student network is completed, image classification can be performed. Given an image $x$, it is input into the student network, which outputs its prediction distribution; the class with the highest probability in the prediction distribution is taken as the final classification result:

$$\hat{y} = \arg\max_{c}\ \sigma(z^{s})_{c}.$$

In this way, a higher classification accuracy is achieved under model compression.
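A short inference sketch for the compressed student network follows; `student` and the single-image input shape are assumptions made for this example.

```python
import torch

@torch.no_grad()
def classify(student, image):
    """Inference with the compressed student network: return the class with the
    highest predicted probability (argmax of softmax equals argmax of logits)."""
    logits = student(image.unsqueeze(0))   # add a batch dimension
    return logits.argmax(dim=1).item()
```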
in table 1, 7 network structures (respectively: teacher network, student network, KD, fitNet, AT, CRD, and the method of the present invention) selected from the CIFAR-100 image dataset compare the classification accuracy of the present invention with other knowledge distillation algorithms, and it can be observed that the method achieves the optimal results on all network structures.
TABLE 1 Top1 accuracy (%) comparison of different network structures on CIFAR-100 dataset for the present invention and prior knowledge distillation algorithms
In this embodiment, the process of passing features through the fully connected layer is studied as the operation by which the network obtains its results, the weight matrix of the fully connected layer is regarded as the class centers, the class centers of the two networks are kept consistent, and class-level information is mined. The embodiment also provides a class-center contrastive learning knowledge distillation method to transfer the structured knowledge of the relationship between instance features and class centers, as well as a learning strategy based on the preview mechanism that trains the network according to sample difficulty, which greatly improves the accuracy of image classification and outperforms existing methods.
Example two
The embodiment provides a model compression system based on preview mechanism knowledge distillation, which comprises:
(1) The supervised training module is used for acquiring the image sample, labeling the label of the image sample and carrying out supervised training on the student network;
(2) The knowledge distillation module is used for performing output alignment, feature alignment, class center alignment and class center contrastive learning between a student network and a pre-trained teacher network based on a class-contrastive learning knowledge distillation method;
(3) The preview mechanism learning module is used for calculating difficulty scores of the image samples by adopting a learning strategy of the preview mechanism and dynamically distributing the weights of different image samples based on the difficulty scores;
(4) The total loss function determining module is used for obtaining a total loss function based on the loss functions of supervised training, output alignment, feature alignment, class center alignment and class center contrastive learning, together with the weights of different image samples;
(5) And the model compression module is used for guiding the training of the student network according to the total loss function to obtain the student network which is compressed and trained based on the teacher network model, and the student network is used as an image classification model and used for carrying out class distribution prediction on the input image.
It should be noted that, each module in the present embodiment corresponds to each step in the first embodiment one to one, and the specific implementation process is the same, which is not described again here.
EXAMPLE III
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the model compression method based on preview mechanism knowledge distillation as described above.
Example four
The present embodiment provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the model compression method based on preview mechanism knowledge distillation as described above when executing the program.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A model compression method based on preview mechanism knowledge distillation is characterized by comprising the following steps:
acquiring an image sample, labeling the image sample, and performing supervised training on a student network;
performing output alignment, feature alignment, class center alignment and class center contrastive learning between the student network and a pre-trained teacher network based on a class-contrastive learning knowledge distillation method;
calculating difficulty scores of the image samples by adopting a learning strategy of a preview mechanism, and dynamically assigning weights to different image samples based on the difficulty scores;
obtaining a total loss function based on the loss functions of supervised training, output alignment, feature alignment, class center alignment and class center contrastive learning, together with the weights of different image samples;
guiding the training of the student network according to the total loss function to obtain the student network compressed and trained on the basis of the teacher network model, the student network being used as an image classification model for predicting the class distribution of input images;
in the process of performing supervised training on the student network, the cross entropy between the student network's prediction distribution and the labels is minimized based on the image samples and the labels;
in the process of aligning the output of the student network and the output of the pre-trained teacher network, the KL divergence between the output of the teacher network and the output of the student network is minimized so that the two outputs become similar;
in the process of aligning the features of the student network and the pre-trained teacher network, the features of the student network are mapped to the feature dimension of the teacher network through a multi-layer perceptron, and the Euclidean distance between the features of the student network and the teacher network is minimized so that the two sets of features become similar;
in the process of aligning the class centers of the student network and the pre-trained teacher network, the Euclidean distance between the weight matrices of the fully connected layers of the teacher network and the student network is minimized, aligning the class centers of the two networks.
2. The model compression method based on preview mechanism knowledge distillation of claim 1, wherein, in the process of calculating the difficulty scores of the image samples with the learning strategy of the preview mechanism, when the difficulty score of an image sample is not greater than a dynamic threshold, the weight of the image sample is assigned to 1; otherwise, the weight of the image sample is assigned as the exponential of the negative square of its difficulty score.
3. The model compression method based on preview mechanism knowledge distillation of claim 2, wherein the dynamic threshold is a power function whose exponent is the number of training epochs and whose base is the sum of 1 and a hyper-parameter controlling the growth rate.
4. A model compression system based on preview mechanism knowledge distillation, comprising:
the supervised training module is used for acquiring the image sample, labeling the label of the image sample and carrying out supervised training on the student network;
the knowledge distillation module is used for performing output alignment, feature alignment, class center alignment and class center contrastive learning between a student network and a pre-trained teacher network based on a class-contrastive learning knowledge distillation method;
the preview mechanism learning module is used for calculating difficulty scores of the image samples by adopting a learning strategy of the preview mechanism and dynamically distributing the weights of different image samples based on the difficulty scores;
the total loss function determining module is used for obtaining a total loss function based on the loss functions of supervised training, output alignment, feature alignment, class center alignment and class center contrastive learning, together with the weights of different image samples;
the model compression module is used for guiding the training of the student network according to the total loss function to obtain the student network which is compressed and trained based on the teacher network model, and the student network is used as an image classification model and used for carrying out class distribution prediction on the input image;
wherein:
in the process of performing supervised training on the student network, the cross entropy between the student network's prediction distribution and the labels is minimized based on the image samples and the labels;
in the process of aligning the output of the student network and the output of the pre-trained teacher network, the KL divergence between the output of the teacher network and the output of the student network is minimized so that the two outputs become similar;
in the process of aligning the features of the student network and the pre-trained teacher network, the features of the student network are mapped to the feature dimension of the teacher network through a multi-layer perceptron, and the Euclidean distance between the features of the student network and the teacher network is minimized so that the two sets of features become similar;
in the process of aligning the class centers of the student network and the pre-trained teacher network, the Euclidean distance between the weight matrices of the fully connected layers of the teacher network and the student network is minimized, aligning the class centers of the two networks.
5. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the model compression method based on preview mechanism knowledge distillation as set forth in any one of claims 1 to 3.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the model compression method based on preview mechanism knowledge distillation as set forth in any one of claims 1 to 3.
CN202211206057.8A 2022-09-30 2022-09-30 Model compression method and system based on preview mechanism knowledge distillation Active CN115294407B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211206057.8A CN115294407B (en) 2022-09-30 2022-09-30 Model compression method and system based on preview mechanism knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211206057.8A CN115294407B (en) 2022-09-30 2022-09-30 Model compression method and system based on preview mechanism knowledge distillation

Publications (2)

Publication Number Publication Date
CN115294407A CN115294407A (en) 2022-11-04
CN115294407B true CN115294407B (en) 2023-01-03

Family

ID=83834355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211206057.8A Active CN115294407B (en) 2022-09-30 2022-09-30 Model compression method and system based on preview mechanism knowledge distillation

Country Status (1)

Country Link
CN (1) CN115294407B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115908823B (en) * 2023-03-09 2023-05-12 南京航空航天大学 Semantic segmentation method based on difficulty distillation
CN116091849B (en) * 2023-04-11 2023-07-25 山东建筑大学 Tire pattern classification method, system, medium and equipment based on grouping decoder
CN116502621B (en) * 2023-06-26 2023-10-17 北京航空航天大学 Network compression method and device based on self-adaptive comparison knowledge distillation
CN116758618B (en) * 2023-08-16 2024-01-09 苏州浪潮智能科技有限公司 Image recognition method, training device, electronic equipment and storage medium
CN117009830B (en) * 2023-10-07 2024-02-13 之江实验室 Knowledge distillation method and system based on embedded feature regularization
CN117372785B (en) * 2023-12-04 2024-03-26 吉林大学 Image classification method based on feature cluster center compression

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241273A (en) * 2021-12-01 2022-03-25 电子科技大学 Multi-modal image processing method and system based on Transformer network and hypersphere space learning

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062951B (en) * 2019-12-11 2022-03-25 华中科技大学 Knowledge distillation method based on semantic segmentation intra-class feature difference
KR102191351B1 (en) * 2020-04-28 2020-12-15 아주대학교산학협력단 Method for semantic segmentation based on knowledge distillation
CN111950638B (en) * 2020-08-14 2024-02-06 厦门美图之家科技有限公司 Image classification method and device based on model distillation and electronic equipment
CN112287920B (en) * 2020-09-17 2022-06-14 昆明理工大学 Burma language OCR method based on knowledge distillation
CN112464981B (en) * 2020-10-27 2024-02-06 中科视语(句容)科技有限公司 Self-adaptive knowledge distillation method based on spatial attention mechanism
CN114444558A (en) * 2020-11-05 2022-05-06 佳能株式会社 Training method and training device for neural network for object recognition
US20220138633A1 (en) * 2020-11-05 2022-05-05 Samsung Electronics Co., Ltd. Method and apparatus for incremental learning
CN112561059B (en) * 2020-12-15 2023-08-01 北京百度网讯科技有限公司 Method and apparatus for model distillation
CN114170478A (en) * 2021-12-09 2022-03-11 中山大学 Defect detection and positioning method and system based on cross-image local feature alignment
CN114419369A (en) * 2022-01-04 2022-04-29 深圳市大数据研究院 Method, system, electronic device and storage medium for classifying polyps in image
CN115019123B (en) * 2022-05-20 2023-04-18 中南大学 Self-distillation contrast learning method for remote sensing image scene classification

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241273A (en) * 2021-12-01 2022-03-25 电子科技大学 Multi-modal image processing method and system based on Transformer network and hypersphere space learning

Also Published As

Publication number Publication date
CN115294407A (en) 2022-11-04

Similar Documents

Publication Publication Date Title
CN115294407B (en) Model compression method and system based on preview mechanism knowledge distillation
CN111554268B (en) Language identification method based on language model, text classification method and device
CN110334705B (en) Language identification method of scene text image combining global and local information
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
Islalm et al. Recognition bangla sign language using convolutional neural network
CN107644235A (en) Image automatic annotation method based on semi-supervised learning
CN107251059A (en) Sparse reasoning module for deep learning
CN110321967B (en) Image classification improvement method based on convolutional neural network
TW201841130A (en) Neural network compression via weak supervision
CN113887610A (en) Pollen image classification method based on cross attention distillation transducer
CN111754596A (en) Editing model generation method, editing model generation device, editing method, editing device, editing equipment and editing medium
CN105138973A (en) Face authentication method and device
CN112446423B (en) Fast hybrid high-order attention domain confrontation network method based on transfer learning
CN106778852A (en) A kind of picture material recognition methods for correcting erroneous judgement
CN109344884A (en) The method and device of media information classification method, training picture classification model
CN108154156B (en) Image set classification method and device based on neural topic model
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN109308319A (en) File classification method, document sorting apparatus and computer readable storage medium
CN109344898A (en) Convolutional neural networks image classification method based on sparse coding pre-training
CN115563327A (en) Zero sample cross-modal retrieval method based on Transformer network selective distillation
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN114299362A (en) Small sample image classification method based on k-means clustering
CN111310820A (en) Foundation meteorological cloud chart classification method based on cross validation depth CNN feature integration
CN105389588A (en) Multi-semantic-codebook-based image feature representation method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant