CN112508169A - Knowledge distillation method and system

Knowledge distillation method and system

Info

Publication number
CN112508169A
Authority
CN
China
Prior art keywords: machine learning, learning model, teacher machine, loss function, teacher
Legal status (the legal status is an assumption and is not a legal conclusion)
Pending
Application number
CN202011273058.5A
Other languages
Chinese (zh)
Inventor
聂迎
韩凯
王云鹤
许春景
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN202011273058.5A
Publication of CN112508169A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a method for training a neural network through knowledge distillation in the field of artificial intelligence. The method comprises the following steps: training at least one teacher machine learning model according to training data; inputting the training data into the trained teacher machine learning model to obtain an output result of the trained teacher machine learning model; and adjusting parameters of a student machine learning model according to a loss function, so that the difference between the output result of the student machine learning model for the training data and the output result of the trained teacher machine learning model is smaller than a preset threshold. The loss function includes a first part determined according to the output result of the trained teacher machine learning model and a second part determined according to the intermediate layer output features generated by the intermediate layers included in the trained teacher machine learning model.

Description

Knowledge distillation method and system
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a knowledge distillation method and a knowledge distillation system.
Background
Deep learning, one of the mainstream branches of machine learning, is widely applied in image processing, face recognition and other fields. Convolutional Neural Networks (CNNs) are among the representative deep learning algorithms and have been widely used with great success in large-scale computer vision applications such as image classification, object detection, image segmentation and video analysis. Convolutional neural network models, particularly Deep Neural Network (DNN) models, require a large amount of computing power and memory. In a common CNN, each convolutional layer can contain tens of thousands or even hundreds of thousands of parameters, and the parameters of all convolutional layers of the whole network can add up to tens of millions. If represented as 32-bit floating point numbers, these parameters require hundreds of megabytes of memory or cache. On the other hand, the convolution operation itself involves a huge amount of computation: a convolution with hundreds of thousands of kernel parameters can require thousands of floating point operations (FLOPs), and a common CNN can require up to billions of FLOPs across the whole network.
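For illustration only, the following back-of-the-envelope Python arithmetic reproduces the orders of magnitude mentioned above; the parameter count and layer sizes are hypothetical examples, not figures from a specific network.

```python
# Back-of-the-envelope arithmetic for the figures above (all sizes are
# hypothetical examples, not taken from a specific network).

# Memory: 30 million parameters stored as 32-bit floats (4 bytes each).
num_params = 30_000_000
print(f"weights: {num_params * 4 / 1e6:.0f} MB")      # ~120 MB

# Compute: one 3x3 convolution, 256 -> 256 channels, on a 14x14 feature map.
c_in, c_out, k, h, w = 256, 256, 3, 14, 14
flops = 2 * c_out * h * w * c_in * k * k              # multiply-adds counted as 2 ops
print(f"one conv layer: {flops / 1e6:.0f} MFLOPs")    # ~231 MFLOPs
# Dozens of such layers add up to billions of FLOPs for a single forward pass.
```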
Edge devices, such as mobile phones and wearable devices, need to handle deep learning related tasks locally, but are limited by restricted resources and power consumption as well as latency and cost considerations, and can hardly meet the computing power and memory requirements of common CNNs. Products based on deep learning are therefore difficult to deploy on these edge devices. To bring deep learning based products to edge devices, one approach is to design a dedicated hardware accelerator for a given computational task; the other is to simplify the neural network model to reduce its computing power and memory requirements, i.e., model compression (Model Compression). Model compression includes methods such as pruning (Pruning), quantization (Quantization), low-rank factorization (Low-Rank Factorization), and knowledge distillation (Knowledge Distillation). Among them, research on simplifying a full-precision neural network into a binary neural network has received much attention. A binary neural network expresses the weights/activation values of a full-precision neural network in binary form (for example, 1 and -1) and replaces the original floating point multiply-add operations with bit operations, which greatly reduces the required computing resources, improves the operation speed, and facilitates deployment on mobile terminals. However, the prior art methods for simplifying a full-precision neural network into a binary neural network generally quantize with a sign function and approximate the gradient of the 32-bit floating point network parameters with a Straight-Through Estimator (STE). This results in inaccurate gradients and affects the update accuracy of the network parameters of both the full-precision neural network and the binarized neural network. Therefore, when the weights and activation values of a full-precision neural network are quantized to two values, training easily falls into a local minimum, leading to insufficient training and, in turn, a large loss of precision. Model compression based on knowledge distillation, on the other hand, migrates the inference and prediction capability of a trained, more complex machine learning model to a simpler machine learning model. Knowledge distillation can therefore be applied to a full-precision neural network to obtain a low bit-width neural network whose weights/activation values have a lower bit width. The low bit-width neural network replaces the floating point computation of the full-precision neural network with low bit-width values, which improves operation efficiency but can cause precision loss; the larger the bit-width gap between the teacher network and the student network used for knowledge distillation, the larger the precision loss, so it is not easy to simplify a full-precision neural network into a binary neural network.
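For illustration only, the sign-function quantization with straight-through estimator (STE) gradient described above can be sketched in PyTorch as follows; the clipping range, initialization and module structure are illustrative assumptions rather than details of this application.

```python
import torch
from torch import nn


class BinarizeSTE(torch.autograd.Function):
    """Sign-function quantization with a straight-through estimator (STE)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # Forward pass uses the binarized value (+1 / -1; exact zeros map to 0 here).
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # STE: pass the gradient straight through, zeroed outside [-1, 1]
        # (an assumed, commonly used clipping choice).
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)


class BinaryLinear(nn.Module):
    """Linear layer whose weights are binarized in the forward pass."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)

    def forward(self, x):
        return x @ BinarizeSTE.apply(self.weight).t()
```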
Therefore, a model compression method is needed that effectively reduces the required computing resources and improves the operation speed while avoiding precision loss as much as possible, and that in particular can simplify a full-precision neural network model into a binary neural network model.
Disclosure of Invention
It is an object of the present application to provide a method, system and computer readable storage medium for training a neural network by knowledge distillation. The method comprises the following steps: training at least one teacher machine learning model according to training data to obtain a trained teacher machine learning model; inputting the training data into the trained teacher machine learning model to obtain an output result of the trained teacher machine learning model, wherein an intermediate layer included in the trained teacher machine learning model generates a corresponding intermediate layer output feature during the generation of the output result of the trained teacher machine learning model; and adjusting parameters of a student machine learning model according to a loss function, so that the difference between the output result of the student machine learning model for the training data and the output result of the trained teacher machine learning model is smaller than a preset threshold, wherein the loss function comprises a first part and a second part, the first part of the loss function is determined according to the output result of the trained teacher machine learning model, and the second part of the loss function is determined according to the intermediate layer output features generated by the intermediate layer included in the trained teacher machine learning model. In this way, a plurality of trained teacher machine learning models jointly guide the training of the student machine learning model, which improves the precision of the student machine learning model; at the same time, the loss function takes into account how much different teacher networks and different intermediate layers contribute to the distillation of the student network, which improves the training effect.
In a first aspect, embodiments of the present application provide a method for training a neural network by knowledge distillation. The method comprises the following steps: training at least one teacher machine learning model according to training data to obtain a trained teacher machine learning model; inputting the training data into the trained teacher machine learning model to obtain an output result of the trained teacher machine learning model, wherein an intermediate layer included in the trained teacher machine learning model generates a corresponding intermediate layer output feature during the generation of the output result of the trained teacher machine learning model; and adjusting parameters of a student machine learning model according to a loss function, so that the difference between the output result of the student machine learning model for the training data and the output result of the trained teacher machine learning model is smaller than a preset threshold, wherein the loss function comprises a first part and a second part, the first part of the loss function is determined according to the output result of the trained teacher machine learning model, and the second part of the loss function is determined according to the intermediate layer output features generated by the intermediate layer included in the trained teacher machine learning model.
According to the technical scheme described in the first aspect, the trained teacher machine learning models jointly guide the training of the student machine learning model, which improves the precision of the student machine learning model; at the same time, the loss function takes into account how much different teacher networks and different intermediate layers contribute to the distillation of the student network, which improves the training effect.
In a possible implementation form according to the first aspect, the bit width of the weight and the activation value of the at least one teacher machine learning model is greater than the bit width of the weight and the activation value of the student machine learning model.
In this way, a teacher machine learning model with a high bit width guides the training of a student machine learning model with a low bit width, which helps reduce the required computing resources and improve the operation speed while avoiding precision loss as much as possible.
In a possible implementation form according to the first aspect, the first part of the loss function comprises a Kullback-Leibler (KL) divergence determined based on the output result of the trained teacher machine learning model and the output result of the student machine learning model for the training data.
In this way, the influence of the output result is taken into account by the first part of the loss function.
According to the first aspect, in a possible implementation manner, the number of the intermediate layers included in the student machine learning model is the same as the number of the intermediate layers included in the at least one teacher machine learning model, the intermediate layers included in the student machine learning model generate corresponding intermediate layer output features in a generation process of an output result of the student machine learning model, and the second part of the loss function includes a distance loss determined based on the intermediate layer output features generated by the intermediate layers included in the trained teacher machine learning model and the intermediate layer output features generated by the intermediate layers included in the student machine learning model after the point convolution layer conversion.
In this way, the second part of the loss function takes into account how much different teacher networks and different intermediate layers contribute to the distillation of the student network.
According to the first aspect, in one possible implementation, the loss function further comprises a third part, which comprises a cross-entropy loss determined based on the output result of the student machine learning model for the training data and the true labels of the training data.
In this way, the training effect is improved by the third part of the loss function.
According to a first aspect, in one possible implementation, the at least one teacher machine learning model includes a plurality of teacher machine learning models, wherein the number of intermediate layers included in each of the plurality of teacher machine learning models is the same as the number of intermediate layers included in the student machine learning model, and wherein training the at least one teacher machine learning model based on the training data to obtain a trained teacher machine learning model includes: and respectively training the plurality of teacher machine learning models according to the training data so as to obtain a plurality of corresponding trained teacher machine learning models.
In this way, the plurality of trained teacher machine learning models jointly guide the training of the student machine learning model.
According to the first aspect, in a possible implementation manner, bit widths of the weights and the activation values of the plurality of teacher machine learning models are greater than bit widths of the weights and the activation values of the student machine learning models.
In this way, teacher machine learning models with high bit widths guide the training of a student machine learning model with a low bit width, which helps reduce the required computing resources and improve the operation speed while avoiding precision loss as much as possible.
According to a first aspect, in one possible implementation, inputting the training data into the trained teacher machine learning models to obtain the output results of the trained teacher machine learning models comprises: inputting the training data into the plurality of trained teacher machine learning models, respectively, to obtain the corresponding output result of each of the plurality of trained teacher machine learning models, wherein, for each of the plurality of trained teacher machine learning models, each of the plurality of intermediate layers included in that trained teacher machine learning model generates a corresponding intermediate layer output feature in the process in which that model generates its output result.
Thus, the influence of the output result is considered.
According to the first aspect, in a possible implementation manner, the plurality of trained teacher machine learning models correspond one-to-one to a plurality of first weight coefficients in a first weight coefficient set, the output results of the plurality of trained teacher machine learning models are weighted and summed according to their corresponding first weight coefficients to obtain a classification layer output result of a multi-bit teacher machine learning model, and the first part of the loss function includes a Kullback-Leibler (KL) divergence determined based on the classification layer output result of the multi-bit teacher machine learning model and the output result of the student machine learning model for the training data.
In this way, the influence of the output result is taken into account by the first part of the loss function.
According to the first aspect, in one possible implementation manner, the plurality of intermediate layers included in each of the plurality of trained teacher machine learning models correspond one-to-one to a plurality of second weight coefficients in a second weight coefficient set, wherein, for one or more of the plurality of intermediate layers included in the student machine learning model: an intermediate layer output feature of the multi-bit teacher machine learning model corresponding to a specific intermediate layer of the student machine learning model is determined, the intermediate layer output feature of the multi-bit teacher machine learning model being obtained by weighting and summing, according to their corresponding second weight coefficients, the intermediate layer output features generated by the intermediate layers of the plurality of trained teacher machine learning models at the same level as the specific intermediate layer; and the second part of the loss function includes a distance loss determined based on the intermediate layer output feature of the multi-bit teacher machine learning model and the intermediate layer output feature generated by the specific intermediate layer of the student machine learning model after point convolution layer conversion.
In this way, the second part of the loss function takes into account how much different teacher networks and different intermediate layers contribute to the distillation of the student network.
According to the first aspect, in a possible implementation manner, the second part of the loss function further comprises the sum, after smoothing processing, of the distance losses corresponding to the plurality of intermediate layers included in the student machine learning model.
Thus, the training is stabilized by the smoothing process.
According to the first aspect, in one possible implementation, the loss function further comprises a third part, which comprises a cross-entropy loss determined based on the output result of the student machine learning model for the training data and the true labels of the training data.
In this way, the training effect is improved by the third part of the loss function.
According to the first aspect, in a possible implementation manner, the method further includes: differentiating the loss function with respect to the first weight coefficient to obtain a first gradient, wherein the first gradient is a gradient of the loss function with respect to the classification layer output result of the multi-bit teacher machine learning model and is determined according to the KL divergence included in the first part of the loss function; differentiating the loss function with respect to the second weight coefficient to obtain a second gradient, wherein the second gradient is a gradient of the loss function with respect to an intermediate layer output feature of the multi-bit teacher machine learning model and is determined according to the distance loss included in the second part of the loss function; and performing back propagation through the first gradient and the second gradient, respectively, thereby dynamically adjusting the first weight coefficient and the second weight coefficient, respectively.
In this way, each weight coefficient is dynamically adjusted according to its own corresponding part of the loss function, which simplifies computation and improves system efficiency.
In a possible implementation form according to the first aspect, the weight and the activation value of the student machine learning model are both binarized.
In this way, a student machine learning model simplified to a binarized form is obtained.
In a second aspect, embodiments of the present application provide a knowledge distillation system. The knowledge distillation system comprises: a plurality of teacher machine learning models and a student machine learning model. The number of intermediate layers included in each of the teacher machine learning models is the same as the number of intermediate layers included in the student machine learning model; the bit widths adopted by the weights and activation values of the teacher machine learning models are all larger than the bit width adopted by the weights and activation values of the student machine learning model. Parameters of the student machine learning model are adjusted according to a loss function, so that the difference between the output result of the student machine learning model for the training data and the output results of the plurality of teacher machine learning models for the training data is smaller than a preset threshold. The first part of the loss function comprises a Kullback-Leibler (KL) divergence determined based on the classification layer output result of a multi-bit teacher machine learning model and the output result of the student machine learning model for the training data, where the classification layer output result of the multi-bit teacher machine learning model is obtained by weighting and summing the output results of the plurality of teacher machine learning models for the training data according to their corresponding first weight coefficients. The second part of the loss function comprises a distance loss determined based on an intermediate layer output feature of the multi-bit teacher machine learning model and the intermediate layer output feature generated by a specific intermediate layer of the student machine learning model after point convolution layer conversion, where the intermediate layer output feature of the multi-bit teacher machine learning model is obtained by weighting and summing the intermediate layer output features generated by the intermediate layers of the plurality of teacher machine learning models at the same level as that specific intermediate layer according to their corresponding second weight coefficients. The third part of the loss function comprises a cross-entropy loss determined based on the output result of the student machine learning model for the training data and the true labels of the training data.
The technical scheme described in the second aspect uses a plurality of trained teacher machine learning models to jointly guide the training of the student machine learning model, which improves the precision of the student machine learning model; at the same time, the loss function takes into account how much different teacher networks and different intermediate layers contribute to the distillation of the student network, which helps improve the training effect.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium that holds computer instructions which, when executed by a processor, cause the processor to perform the following operations: training at least one teacher machine learning model according to training data to obtain a trained teacher machine learning model; inputting the training data into the trained teacher machine learning model to obtain an output result of the trained teacher machine learning model, wherein an intermediate layer included in the trained teacher machine learning model generates a corresponding intermediate layer output feature during the generation of the output result of the trained teacher machine learning model; and adjusting parameters of a student machine learning model according to a loss function, so that the difference between the output result of the student machine learning model for the training data and the output result of the trained teacher machine learning model is smaller than a preset threshold, wherein the loss function comprises a first part and a second part, the first part of the loss function is determined according to the output result of the trained teacher machine learning model, and the second part of the loss function is determined according to the intermediate layer output features generated by the intermediate layer included in the trained teacher machine learning model.
The technical scheme described in the third aspect uses a plurality of trained teacher machine learning models to jointly guide the training of the student machine learning model, which improves the precision of the student machine learning model; at the same time, the loss function takes into account how much different teacher networks and different intermediate layers contribute to the distillation of the student network, which helps improve the training effect.
Drawings
In order to explain the technical solutions in the embodiments or the background art of the present application, the drawings used therein are described below.
FIG. 1 illustrates a knowledge distillation system including a single teacher machine learning model provided by embodiments of the present application.
FIG. 2 illustrates a knowledge distillation system including a plurality of teacher machine learning models provided by embodiments of the present application.
FIG. 3 shows a schematic flow diagram of a knowledge distillation method according to one embodiment of the present application.
FIG. 4 shows a schematic flow diagram of a knowledge distillation method according to another embodiment of the present application.
Detailed Description
It is an object of the present application to provide a method, system and computer readable storage medium for training a neural network by knowledge distillation. The method comprises the following steps: training at least one teacher machine learning model according to training data to obtain a trained teacher machine learning model; inputting the training data into the trained teacher machine learning model to obtain an output result of the trained teacher machine learning model, wherein an intermediate layer included in the trained teacher machine learning model generates a corresponding intermediate layer output feature during the generation of the output result of the trained teacher machine learning model; and adjusting parameters of a student machine learning model according to a loss function, so that the difference between the output result of the student machine learning model for the training data and the output result of the trained teacher machine learning model is smaller than a preset threshold, wherein the loss function comprises a first part and a second part, the first part of the loss function is determined according to the output result of the trained teacher machine learning model, and the second part of the loss function is determined according to the intermediate layer output features generated by the intermediate layer included in the trained teacher machine learning model. In this way, a plurality of trained teacher machine learning models jointly guide the training of the student machine learning model, which improves the precision of the student machine learning model; at the same time, the loss function takes into account how much different teacher networks and different intermediate layers contribute to the distillation of the student network, which improves the training effect.
The embodiments of the present application may be applied to various application scenarios including, but not limited to, various scenarios in the field of computer vision applications, such as face recognition, image classification, object detection, semantic segmentation, etc., or to neural network model-based processing systems deployed on edge devices (e.g., mobile phones, wearable devices, computing nodes, etc.), or to application scenarios for speech signal processing, natural language processing, recommendation systems, or to application scenarios requiring compression of neural network models due to limited resources and latency requirements.
For illustrative purposes only, the embodiments of the present application may be applied to an application scenario of object detection on a mobile phone. The technical problem to be solved in this application scenario is as follows: when a user takes a picture with a mobile phone, objects such as human faces and animals need to be captured automatically to help the mobile phone focus and beautify the image automatically. A convolutional neural network model for object detection that is small in size and fast in operation is therefore needed, bringing a better user experience and improving the quality of mobile phone products.
For illustrative purposes only, the embodiments of the present application may also be used in an application scenario of autonomous driving scene segmentation. The technical problem to be solved in this application scenario is as follows: after the camera of an autonomous vehicle captures a road image, the image needs to be segmented to separate different objects such as the road surface, roadbed, vehicles and pedestrians, so as to keep the vehicle driving in the correct area. A convolutional neural network model that can quickly and correctly interpret and semantically segment a picture in real time is therefore needed.
For illustrative purposes only, the embodiments of the present application may also be used in an application scenario of entrance gate face verification. The technical problem to be solved in this application scenario is as follows: when passengers perform face authentication at gates at the entrances of high-speed rail stations, airports and the like, a camera captures a face image, a convolutional neural network extracts its features, and the features are then compared for similarity with the image features of the identity document stored in the system; if the similarity is high, the verification succeeds. Extracting features through the convolutional neural network is the most time-consuming step, so an efficient convolutional neural network model capable of performing feature extraction and face verification quickly is required.
For illustrative purposes only, the embodiments of the present application may also be used in application scenarios of simultaneous interpretation by a translation device. The technical problem to be solved in this application scenario is as follows: for speech recognition and machine translation, recognition and translation must be performed in real time, so an efficient convolutional neural network model is required.
The embodiments of the present application may be modified and improved according to specific application environments, and are not limited herein.
In order to enable those skilled in the art to better understand the present application, embodiments of the present application will be described below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 illustrates a knowledge distillation system including a single teacher machine learning model provided by an embodiment of the present application. As shown in fig. 1, the knowledge distillation system 100 includes a teacher machine learning model 110 and a student machine learning model 120. The student machine learning model 120 may be pre-trained, partially trained, or completely untrained. The knowledge distillation system 100 migrates the knowledge of the teacher machine learning model 110 to the student machine learning model 120, that is, the teacher machine learning model 110 is used to improve the inference and prediction capability of the student machine learning model 120. Among them, the teacher machine learning model 110 is a model that has already been trained, and the output of its classification layer 118 is denoted $y_t$. The teacher machine learning model 110 includes three intermediate layers 112, 114 and 116, which, in the process of generating the output result $y_t$, sequentially generate intermediate layer output features $F_t^1$, $F_t^2$ and $F_t^3$. The output of the classification layer 128 of the student machine learning model 120 is denoted $y_s$. The student machine learning model 120 includes three intermediate layers 122, 124 and 126, which, in the process of generating the output result $y_s$, sequentially generate intermediate layer output features $F_s^1$, $F_s^2$ and $F_s^3$.
With continued reference to fig. 1, the knowledge distillation system 100 first trains the teacher machine learning model 110 based on the training data to obtain a trained teacher machine learning model 110; the training data is then input into the trained teacher machine learning model 110 to obtain its output result $y_t$; and the parameters of the student machine learning model 120 are adjusted according to the loss function, so that the output result $y_s$ of the student machine learning model 120 for the training data matches the output result $y_t$ of the trained teacher machine learning model 110. The loss function includes a first part and a second part. The first part of the loss function is determined according to the output result $y_t$ of the trained teacher machine learning model, and the second part of the loss function is determined according to the intermediate layer output features $F_t^1$, $F_t^2$ and $F_t^3$ generated by the intermediate layers included in the trained teacher machine learning model. It should be understood that whether the output result $y_s$ of the student machine learning model 120 matches the output result $y_t$ of the teacher machine learning model 110 may be judged by requiring the difference between the two to be smaller than a specified threshold, by solving for a global minimum of the difference between the two, or by other technical means, which is not specifically limited here. In addition, when the training data used to train the teacher machine learning model 110 carries true labels, the prediction capability of the student machine learning model 120 or the training effect of the knowledge distillation system 100 may also be judged with the labeled training data, and a match may be declared when the prediction capability or the training effect is sufficiently good.
With continued reference to fig. 1, the bit widths used for the weights and activation values of the teacher machine learning model 110 are greater than the bit widths used for the weights and activation values of the student machine learning model 120. For example, the teacher machine learning model 110 may be a 32-bit-wide model, while the student machine learning model 120 may be a binarized model, i.e., 1 bit wide. In this way, compression from a high bit-width model to a low bit-width model is achieved while knowledge transfer and inheritance of prediction capability are realized through the knowledge distillation technique.
With continued reference to fig. 1, the knowledge distillation system 100 adjusts the parameters of the student machine learning model 120 according to a loss function. The first part of the loss function includes a Kullback-Leibler (KL) divergence determined based on the output result $y_t$ of the trained teacher machine learning model and the output result $y_s$ of the student machine learning model for the training data. Thus, the difference between the output results obtained for the same input data, i.e., the training data, can be measured by the KL divergence. The knowledge distillation system 100 not only achieves knowledge migration through the output results, but also improves the effect of knowledge distillation through the intermediate layer output features. In particular, the number of intermediate layers included in the student machine learning model 120 is the same as the number of intermediate layers included in the teacher machine learning model 110. The second part of the loss function includes a distance loss determined based on the intermediate layer output features $F_t^1$, $F_t^2$ and $F_t^3$ generated by the intermediate layers included in the trained teacher machine learning model 110 and the intermediate layer output features $F_s^1$, $F_s^2$ and $F_s^3$ generated by the intermediate layers included in the student machine learning model 120 after point convolution layer conversion. It should be understood that the student machine learning model 120 includes the same number of intermediate layers as the teacher machine learning model 110, i.e., both have the same depth or the same number of hidden layers. Depth is defined here with respect to the path from the input to the respective classification layer. That is, the number of intermediate layers included in the student machine learning model 120, or its depth, refers to the layers before the classification layer 128 of the student machine learning model 120; the number of intermediate layers included in the teacher machine learning model 110, or its depth, refers to the layers before the classification layer 118 of the teacher machine learning model 110. Because the intermediate layers of the student machine learning model 120 and the teacher machine learning model 110 do not have a one-to-one correspondence, even features output by intermediate layers at the same level do not necessarily correspond. For example, the first intermediate layer 122 of the student machine learning model 120 and the first intermediate layer 112 of the teacher machine learning model 110 are both at the first level, but the intermediate layer features $F_s^1$ and $F_t^1$ that they respectively output do not necessarily correspond and do not necessarily belong to the same channels. Therefore, the intermediate layer output features $F_s^1$, $F_s^2$ and $F_s^3$ generated by the intermediate layers included in the student machine learning model 120 need to undergo point convolution layer conversion, after which they can be used to improve the performance of the student machine learning model 120. In one possible implementation, the loss function further includes a third part. The third part of the loss function includes a cross-entropy loss determined based on the output result $y_s$ of the student machine learning model 120 for the training data and the true labels of the training data.
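For illustration only, the point convolution layer conversion described above corresponds to a 1x1 convolution that maps the student's feature channels onto the teacher's; a minimal PyTorch-style sketch follows, in which the channel counts, feature map sizes and the use of a mean-squared-error distance are illustrative assumptions (the distance actually used is defined below by formulas (4) and (5)).

```python
import torch
from torch import nn
import torch.nn.functional as F

# Assumed channel sizes for illustration: the student's intermediate layer has
# 64 channels and the teacher's corresponding layer has 256 channels.
student_channels, teacher_channels = 64, 256

# Point (1x1) convolution that converts the student's intermediate feature map
# so it can be compared with the teacher's feature map.
point_conv = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

student_feat = torch.randn(8, student_channels, 14, 14)   # N x C_s x H x W
teacher_feat = torch.randn(8, teacher_channels, 14, 14)   # N x C_t x H x W

aligned_student_feat = point_conv(student_feat)           # N x C_t x H x W
# One possible distance between the converted student feature and the teacher
# feature; the distance used by the application is given by formulas (4)-(5).
distance = F.mse_loss(aligned_student_feat, teacher_feat)
```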
Referring to fig. 2, fig. 2 illustrates a knowledge distillation system including a plurality of teacher machine learning models provided by an embodiment of the present application. As shown in fig. 2, the knowledge distillation system 200 includes teacher machine learning models 220, 240 and 260, and a student machine learning model 280. The student machine learning model 280 may be pre-trained, partially trained, or completely untrained. The student machine learning model 280 has three intermediate layers 282, 284 and 286, which sequentially output intermediate layer features $F_s^1$, $F_s^2$ and $F_s^3$, and its classification layer 288 outputs $y_s$. The teacher machine learning model 220 has three intermediate layers 222, 224 and 226, which sequentially output intermediate layer features $F_{t,1}^1$, $F_{t,1}^2$ and $F_{t,1}^3$, and its classification layer 228 outputs $y_t^1$. The teacher machine learning model 240 has three intermediate layers 242, 244 and 246, which sequentially output intermediate layer features $F_{t,2}^1$, $F_{t,2}^2$ and $F_{t,2}^3$, and its classification layer 248 outputs $y_t^2$. The teacher machine learning model 260 has three intermediate layers 262, 264 and 266, which sequentially output intermediate layer features $F_{t,3}^1$, $F_{t,3}^2$ and $F_{t,3}^3$, and its classification layer 268 outputs $y_t^3$. It should be understood that the teacher machine learning models 220, 240 and 260 and the student machine learning model 280 each include the same number of intermediate layers, i.e., the models have the same depth or the same number of hidden layers. Depth is defined here with respect to the path from the input to the respective classification layer. For example, the depth of the student machine learning model 280 extends to the classification layer 288, while the depth of the teacher machine learning model 220 extends to the classification layer 228. The teacher machine learning models 220, 240 and 260 are all models trained from the same training data. In one possible implementation, the teacher machine learning models 220, 240 and 260 are models having the same structure, or the same depth, or the same number of hidden layers, or the same number of channels. The same training data is used to train the teacher machine learning models 220, 240 and 260 to obtain the plurality of trained teacher machine learning models 220, 240 and 260.
With continued reference to fig. 2, the bit widths of the weights and activation values of the plurality of teacher machine learning models 220, 240 and 260 are each greater than the bit widths of the weights and activation values of the student machine learning model 280. For example, the bit widths of the plurality of teacher machine learning models 220, 240 and 260 may be 32 bits, 16 bits and 8 bits respectively, while the student machine learning model 280 may be a binarized model, i.e., 1 bit wide. In this way, compression from high bit-width models to a low bit-width model is achieved while knowledge transfer and inheritance of prediction capability are realized through a knowledge distillation technique that combines a plurality of teacher models. As another example, the bit widths of the plurality of teacher machine learning models 220, 240 and 260 may be 32 bits, 32 bits and 16 bits respectively. That is, some or all of the plurality of teacher machine learning models may share the same bit width.
With continued reference to fig. 2, the knowledge distillation system 200 inputs the same training data into the plurality of trained teacher machine learning models 220, 240 and 260, respectively, to obtain the corresponding output results $y_t^1$, $y_t^2$ and $y_t^3$ of the plurality of trained teacher machine learning models. And, for each of the plurality of trained teacher machine learning models 220, 240 and 260, each of the plurality of intermediate layers included in that trained teacher machine learning model generates a corresponding intermediate layer output feature in the process in which that model generates its output result. The knowledge distillation system 200 adjusts the parameters of the student machine learning model 280 according to the loss function. The first part of the loss function uses the classification layer output result of the multi-bit teacher machine learning model, which is obtained by formula (1):

$$y_t = \sum_{m=1}^{M} \delta_{0,m}\, y_t^m \qquad (1)$$

In formula (1), $y_t$ represents the classification layer output result of the multi-bit teacher machine learning model; $y_t^m$ represents the classification layer output result of the teacher machine learning model with sequence number $m$, where $M$ is the total number of teacher machine learning models; and $\delta_{0,m}$ represents the first weight coefficient in the first weight coefficient set corresponding to the teacher machine learning model with sequence number $m$. Formula (1) means that the output results of the plurality of trained teacher machine learning models are weighted and summed according to their corresponding first weight coefficients to obtain the classification layer output result of the multi-bit teacher machine learning model. Referring to fig. 2 and formula (1), the output results $y_t^1$, $y_t^2$ and $y_t^3$ of the trained teacher machine learning models 220, 240 and 260 are weighted by their corresponding first weight coefficients and summed to obtain the classification layer output result of the multi-bit teacher machine learning model that represents the trained teacher machine learning models 220, 240 and 260. It should be understood that fig. 2 only schematically shows the case of three teacher machine learning models; the total number $M$ of teacher machine learning models in formula (1) may be any positive integer greater than 1, and the case of three teacher machine learning models shown in fig. 2 can be extended to the case of $M$ teacher machine learning models.
With continued reference to fig. 2, the first part of the loss function is expressed as formula (2):

$$L_{KL} = \mathrm{KL}\!\left(\sigma\!\left(\frac{y_t}{T}\right) \,\Big\|\, \sigma\!\left(\frac{y_s}{T}\right)\right) \qquad (2)$$

In formula (2), $y_t$ represents the classification layer output result of the multi-bit teacher machine learning model obtained by formula (1), $y_s$ represents the output result of the student machine learning model for the training data, $T$ is a hyper-parameter (temperature parameter), $\sigma$ represents the softmax activation function, and the left-hand side of formula (2) is the Kullback-Leibler (KL) divergence of the classification layer outputs. As shown in formula (2), the first part of the loss function includes a KL divergence determined based on the classification layer output result of the multi-bit teacher machine learning model and the output result of the student machine learning model for the training data. It should be appreciated that the probability distribution output by the softmax activation function becomes smoother as the temperature parameter $T$ increases.
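For illustration only, formulas (1) and (2) can be sketched in PyTorch as follows, assuming three teacher models whose first weight coefficients are held in a single learnable vector; the tensor shapes, the temperature value and the variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Classification-layer outputs (logits) of M = 3 trained teachers and the student,
# for a batch of 8 samples and 10 classes (shapes are illustrative).
teacher_logits = [torch.randn(8, 10) for _ in range(3)]
student_logits = torch.randn(8, 10, requires_grad=True)

# First weight coefficients delta_{0,m}, one per teacher, kept learnable so they
# can be dynamically adjusted during training.
delta0 = torch.nn.Parameter(torch.ones(3) / 3)

# Formula (1): weighted sum of the teachers' classification-layer outputs.
y_t = sum(delta0[m] * teacher_logits[m] for m in range(3))

# Formula (2): KL divergence between the temperature-softened distributions.
T = 4.0  # illustrative temperature value
L_KL = F.kl_div(
    F.log_softmax(student_logits / T, dim=1),
    F.softmax(y_t / T, dim=1),
    reduction="batchmean",
)
```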
With continued reference to fig. 2 and formula (2), the knowledge distillation system 200 measures the difference between the output results obtained for the same input data, i.e., the training data, by the KL divergence, thereby achieving knowledge migration from the plurality of teacher machine learning models to the student machine learning model through the output results. The knowledge distillation system 200 also improves the effect of knowledge distillation through the intermediate layer output features. Specifically, in addition to formulas (1) and (2), the knowledge distillation system 200 also uses the intermediate layer output features of the multi-bit teacher machine learning model, which are obtained by formula (3):

$$F_t^i = \sum_{m=1}^{M} \delta_{i,m}\, F_{t,m}^i \qquad (3)$$

In formula (3), $F_t^i$ represents the intermediate layer output feature of the intermediate layer with sequence number $i$ of the multi-bit teacher machine learning model; $F_{t,m}^i$ represents the intermediate layer output feature output by the intermediate layer with sequence number $i$ in the teacher machine learning model with sequence number $m$; $\delta_{i,m}$ represents the second weight coefficient corresponding to the intermediate layer with sequence number $i$ in the teacher machine learning model with sequence number $m$; and $M$ is the total number of teacher machine learning models. Taking the knowledge distillation system 200 shown in fig. 2 as an example, $M$ is 3. Formula (3) means: the plurality of intermediate layers included in each of the plurality of trained teacher machine learning models correspond one-to-one to a plurality of second weight coefficients in the second weight coefficient set; and, for one or more of the plurality of intermediate layers included in the student machine learning model, the intermediate layer output feature of the multi-bit teacher machine learning model corresponding to a specific intermediate layer of the student machine learning model is determined by weighting and summing, according to their corresponding second weight coefficients, the intermediate layer output features generated by the intermediate layers of the plurality of trained teacher machine learning models at the same level as that specific intermediate layer. Here, $\delta_{i,m}$ denotes the weight coefficients of the intermediate layers of the different teacher machine learning models, and these coefficients satisfy a constraint on their sum. That is, for the same level $i$ there are multiple weight coefficients $\delta_{i,m}$ ($m = 1$ to $M$), and the coefficients corresponding to the same level $i$ jointly satisfy the sum constraint. In a possible implementation, the second weight coefficients $\delta_{i,m}$ are normalized by a softmax.
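For illustration only, formula (3) for one intermediate level i can be sketched in PyTorch as follows, with the second weight coefficients normalized by a softmax as in the possible implementation described above; the tensor shapes and variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

M = 3                       # number of teacher models
# Intermediate-layer output features F_{t,m}^i of the three teachers at level i
# (batch of 8, 256 channels, 14x14 spatial size; shapes are illustrative).
teacher_feats_i = [torch.randn(8, 256, 14, 14) for _ in range(M)]

# Unnormalized second weight coefficients for level i, kept learnable.
delta_i_raw = torch.nn.Parameter(torch.zeros(M))
delta_i = F.softmax(delta_i_raw, dim=0)     # normalized delta_{i,m}

# Formula (3): weighted sum over the teachers at the same level i.
F_t_i = sum(delta_i[m] * teacher_feats_i[m] for m in range(M))
```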
With continued reference to fig. 2 and formula (3), the second part of the loss function includes a distance loss determined based on the intermediate layer output features of the multi-bit teacher machine learning model and the intermediate layer output features generated by the specific intermediate layers of the student machine learning model after point convolution layer conversion. The second part of the loss function is obtained by formula (4):

$$L_{Dis} = \sum_{i=1}^{N} \mathrm{Smooth}\!\left(F_t^i - r_i\!\left(F_s^i\right)\right) \qquad (4)$$

In formula (4), $F_t^i$ represents the intermediate layer output feature of the multi-bit teacher machine learning model obtained by formula (3); $F_s^i$ represents the intermediate layer output feature of the intermediate layer with sequence number $i$ of the student machine learning model; $r_i$ is the conversion layer that performs point convolution layer conversion on the output of the intermediate layer with sequence number $i$ of the student machine learning model; and $N$ represents the number of selected intermediate layers. Formula (4) means that, for the selected $N$ intermediate layers, the intermediate layer output feature of the multi-bit teacher machine learning model and the corresponding converted intermediate layer output feature of the student machine learning model are paired at each of the $N$ levels, a distance loss is computed for each pair, and the distance losses of the $N$ intermediate layers are finally summed to obtain the second part of the loss function. It should be understood that the intermediate layers of the student machine learning model and of the teacher machine learning models do not have a one-to-one correspondence, and even features output by intermediate layers at the same level do not necessarily correspond. Therefore, the outputs of the intermediate layers of the student machine learning model need to undergo point convolution layer conversion. $N$ may be any positive integer equal to or greater than 1, with its maximum value being the total number of intermediate layers included in the teacher machine learning models and the student machine learning model. That is, the features output by all intermediate layers may be taken into account, only a portion of the intermediate layer outputs may be selected, or only a particular intermediate layer output may be selected. Taking the plurality of teacher machine learning models 220, 240 and 260 shown in fig. 2 as an example, the outputs of all intermediate layers may be considered ($N = 3$), or only the output features of the first-level intermediate layers, i.e., the outputs of the intermediate layers 222, 242 and 262, may be considered ($N = 1$). Correspondingly, according to the selected intermediate layer output features of the teacher machine learning models, the intermediate layer output features of the student machine learning model are converted by the point convolution layer and used to compute the corresponding distance loss. Smooth in formula (4) denotes a smoothing operation, which can be expressed by formula (5).
$$\mathrm{Smooth}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \qquad (5)$$

Combining formulas (4) and (5), the smoothed distance loss in the second part of the loss function helps make the training process more stable. The second part of the loss function thus consists of the sum, after smoothing, of the distance losses corresponding to the plurality of intermediate layers included in the student machine learning model.
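For illustration only, the distance loss of formulas (4) and (5) can be sketched in PyTorch as follows, assuming that Smooth is instantiated as the standard smooth-L1 function; the channel counts, the number of selected layers and the variable names are illustrative assumptions.

```python
import torch
from torch import nn
import torch.nn.functional as F

N = 3                                   # number of selected intermediate layers
student_channels = (64, 128, 256)       # assumed student channel counts
teacher_channels = (128, 256, 512)      # assumed teacher channel counts

# r_i: one point (1x1) convolution per selected student intermediate layer.
point_convs = nn.ModuleList([
    nn.Conv2d(c_s, c_t, kernel_size=1)
    for c_s, c_t in zip(student_channels, teacher_channels)
])

def distance_loss(multi_bit_teacher_feats, student_feats):
    """Formula (4): sum of smoothed distances over the N selected layers.

    multi_bit_teacher_feats[i] is F_t^i from formula (3);
    student_feats[i] is F_s^i before the point-convolution conversion r_i.
    """
    loss = 0.0
    for i in range(N):
        converted = point_convs[i](student_feats[i])          # r_i(F_s^i)
        # smooth_l1_loss plays the role of Smooth in formula (5) (assumed form).
        loss = loss + F.smooth_l1_loss(converted, multi_bit_teacher_feats[i])
    return loss
```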
With continued reference to fig. 2, the loss function further includes a third part, which includes a cross-entropy loss determined based on the output result of the student machine learning model for the training data and the true labels of the training data. Combining formulas (1) to (5), the loss function is expressed as formula (6):

$$L_{all} = L_{CE} + L_{KL} + L_{Dis} \qquad (6)$$

In formula (6), $L_{CE}$ is the third part of the loss function, namely the cross-entropy loss determined based on the output result of the student machine learning model for the training data and the true labels of the training data; $L_{KL}$ is the KL divergence of formula (2) in the first part of the loss function, determined based on the classification layer output result of the multi-bit teacher machine learning model and the output result of the student machine learning model for the training data; and $L_{Dis}$ is the distance loss of formula (4), determined based on the intermediate layer output features of the multi-bit teacher machine learning model and the intermediate layer output features generated by the specific intermediate layers of the student machine learning model after point convolution layer conversion.
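For illustration only, the overall loss of formula (6) can be sketched as follows, reusing the distance_loss sketch above; the unweighted sum is an assumption, since any balancing coefficients between the three parts are not recoverable from the text.

```python
import torch
import torch.nn.functional as F

# distance_loss and point_convs are defined in the sketch after formula (5) above.

def total_loss(student_logits, labels, y_t, multi_bit_teacher_feats,
               student_feats, T=4.0):
    """L_all = L_CE + L_KL + L_Dis, following formula (6) (assumed unweighted)."""
    # Third part: cross-entropy between the student's outputs and the true labels.
    L_CE = F.cross_entropy(student_logits, labels)

    # First part: formula (2), KL divergence on temperature-softened outputs.
    L_KL = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(y_t / T, dim=1),
        reduction="batchmean",
    )

    # Second part: formula (4), via the distance_loss sketch shown earlier.
    L_Dis = distance_loss(multi_bit_teacher_feats, student_feats)

    return L_CE + L_KL + L_Dis
```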
With continued reference to fig. 2 and equation (6), the knowledge distillation system 200 can implement dynamic adjustment. Specifically, the loss function is differentiated with respect to the first weight coefficient to obtain a first gradient, where the first gradient is the gradient of the loss function with respect to the classification layer output result of the multi-bit teacher machine learning model and is determined from the KL divergence included in the first part of the loss function. Likewise, the loss function is differentiated with respect to the second weight coefficient to obtain a second gradient, where the second gradient is the gradient of the loss function with respect to the intermediate layer output features of the multi-bit teacher machine learning model and is determined from the distance loss included in the second part of the loss function. Back propagation is then performed through the first gradient and the second gradient, so that the first weight coefficient and the second weight coefficient are dynamically adjusted, respectively. The dynamic adjustment process is represented by equation (7).
first gradient: ∂L_all / ∂δ_0,m
second gradient: ∂L_all / ∂δ_i,m    (7)
In equation (7), δ_0,m represents the first weight coefficient, in the first weight coefficient set mentioned in equation (1), that corresponds to the teacher machine learning model with serial number m; δ_i,m represents the second weight coefficient, mentioned in equation (3), that corresponds to the intermediate layer with serial number i in the teacher machine learning model with serial number m; and L_all represents the overall loss function mentioned in equation (6). Equation (7) shows that the overall loss function L_all is differentiated with respect to the first weight coefficient δ_0,m and the second weight coefficient δ_i,m, respectively, to obtain the first gradient and the second gradient. The first gradient depends on the KL divergence of the first part of the loss function mentioned in equation (2), and the second gradient depends on the distance loss of the second part of the loss function mentioned in equation (4); solving the first gradient does not involve the second part of the loss function, and solving the second gradient does not involve the first part. Thus, combining equations (1) through (7), the overall loss function includes three parts that each play their own role and together improve the effectiveness of the knowledge distillation, while the back propagation and the gradients are obtained by differentiating each part separately, without involving the other parts of the loss function. This facilitates dynamic adjustment based on the corresponding part of the loss function, which in turn simplifies computation and improves system efficiency. In other words, the derivatives and the back propagation obtained through equation (7) dynamically adjust the corresponding weight coefficients, which further helps adjust the parameters of the student machine learning model through the loss function, so that the output result of the student machine learning model for the training data matches the output result of the trained teacher machine learning model.
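One way to realize this dynamic adjustment is to register the first and second weight coefficients as learnable parameters, so that back-propagating the loss yields the first and second gradients automatically. The sketch below uses a stand-in loss just to show the mechanics; the tensor shapes, the optimizer, and the learning rate are assumptions.

```python
import torch

num_teachers, num_layers = 3, 2
# delta_0[m]: first weight coefficient for teacher m's classification output;
# delta_i[i, m]: second weight coefficient for intermediate layer i of teacher m.
delta_0 = torch.nn.Parameter(torch.full((num_teachers,), 1.0 / num_teachers))
delta_i = torch.nn.Parameter(torch.full((num_layers, num_teachers), 1.0 / num_teachers))
weight_opt = torch.optim.SGD([delta_0, delta_i], lr=0.01)

# Weighted aggregation of per-teacher quantities (dummy tensors stand in for real outputs).
teacher_logits = torch.randn(num_teachers, 4, 10)             # [teacher, batch, classes]
teacher_feats = torch.randn(num_layers, num_teachers, 4, 8)   # [layer, teacher, batch, feature]
agg_logits = torch.einsum("m,mbc->bc", delta_0, teacher_logits)
agg_feats = torch.einsum("im,imbf->ibf", delta_i, teacher_feats)

# Stand-in for L_all of equation (6), built from the aggregated quantities.
loss = agg_logits.pow(2).mean() + agg_feats.pow(2).mean()
weight_opt.zero_grad()
loss.backward()    # delta_0.grad holds the first gradient, delta_i.grad the second
weight_opt.step()  # dynamic adjustment of both sets of weight coefficients
```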
In this way, a plurality of pre-trained teacher machine learning models with high bit widths are combined with the knowledge distillation technique to jointly guide the training of a student machine learning model with a low bit width, such as a binarized model, thereby improving the accuracy of the student machine learning model. In addition, since different teacher networks and different layers contribute to the distillation of the student network to different degrees during training, each teacher network is given a learnable weight coefficient, realizing dynamic adjustment of the weights of different teacher networks.
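For context, a binarized student is commonly realized by passing weights or activations through a sign function whose backward pass uses a straight-through estimator; the sketch below shows one such realization under that assumption and is not taken from the patent text itself.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a clipped straight-through estimator (assumed scheme)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)            # values in {-1, 0, +1}; effectively 1-bit

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()  # pass gradients only where |x| <= 1

x = torch.randn(5, requires_grad=True)
y = BinarizeSTE.apply(x)                # binarized activations, still trainable end to end
```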
Combining equations (1) through (7), when the student machine learning model is a binary neural network model used for an image classification task, the embodiments of the present application achieve a significant technical effect. Specifically, experiments were performed on the CIFAR10 and CIFAR100 image classification tasks. Compared with prior-art binary neural networks, the neural network model obtained by the knowledge distillation method of the embodiments of the present application achieves higher accuracy at the same computational cost, as shown in Table 1.
Table 1: Comparison of classification results on CIFAR10 and CIFAR100
In addition, experiments were performed on the large-scale image classification dataset ImageNet. Compared with other binary neural networks, the neural network provided by the present invention achieves higher accuracy at the same computational cost, as shown in Tables 2 and 3.
Table 2 compares the ImageNet classification results, and Table 3 compares the classification results of single-bit and multi-bit distillation.
In addition, the effectiveness of the present invention was verified by ablation experiments; see Table 4 below. Here, FULL denotes using all methods of the present invention, w/o AKA denotes removing the dynamic knowledge adjustment, w/o CTL denotes removing the 1x1 convolution transformation layer, w/o Intermediate Layers denotes removing intermediate layer feature distillation, and w/o Classification Layer denotes removing classification layer output distillation.
Table 4: Ablation experiment results
Referring to fig. 3, fig. 3 shows a schematic flow diagram of a knowledge distillation method according to an embodiment of the present application. As shown in fig. 3, the method includes the following steps.
Step S300: at least one teacher machine learning model is trained based on the training data to obtain a trained teacher machine learning model.
Here, the at least one teacher machine learning model is obtained by training on the same training data. In one possible implementation, the at least one teacher machine learning model has the same structure, or the same model depth, or the same number of hidden layers, or the same number of channels.
Step S310: inputting the training data into the trained teacher machine learning model to obtain an output of the trained teacher machine learning model.
And the middle layer included in the trained teacher machine learning model generates corresponding middle layer output characteristics in the generation process of the output result of the trained teacher machine learning model.
Step S320: adjusting parameters of a student machine learning model according to a loss function so that an output result of the student machine learning model for the training data matches an output result of the trained teacher machine learning model.
Wherein the loss function comprises a first part and a second part, the first part of the loss function is determined according to the output result of the trained teacher machine learning model, and the second part of the loss function is determined according to the intermediate layer output characteristics generated by the intermediate layer included in the trained teacher machine learning model.
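A minimal sketch of one student update covering steps S300 to S320 is given below: the teacher is assumed to be already trained, its outputs supervise the student, and loss_fn stands for the loss function with the first and second parts described above. All names and signatures are assumptions for illustration.

```python
import torch

def distillation_step(teacher, student, optimizer, images, labels, loss_fn):
    """One parameter update of the student under the guidance of a trained teacher."""
    teacher.eval()
    with torch.no_grad():                  # the trained teacher is not updated (step S310)
        teacher_logits = teacher(images)
    student_logits = student(images)
    loss = loss_fn(student_logits, teacher_logits, labels)  # step S320: loss with both parts
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```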
Referring to fig. 4, fig. 4 shows a schematic flow diagram of a knowledge distillation method according to another embodiment of the present application. As shown in fig. 4, the method includes the following steps.
Step S400: respectively training a plurality of teacher machine learning models according to the training data, so as to obtain a plurality of corresponding trained teacher machine learning models.
Here, each of the plurality of teacher machine learning models includes the same number of intermediate layers as the student machine learning model, and the bit widths used for the weights and activation values of each teacher machine learning model are greater than the bit widths used for the weights and activation values of the student machine learning model.
Step S410: inputting the training data into the plurality of trained teacher machine learning models respectively, so as to obtain the output result corresponding to each of the plurality of trained teacher machine learning models.
Wherein, for each of the plurality of trained teacher machine learning models: the plurality of intermediate layers included in the particular trained teacher machine learning model each generate a corresponding intermediate layer output feature in a process in which the particular trained teacher machine learning model generates an output result of the particular trained teacher machine learning model.
Step S420: adjusting parameters of a student machine learning model according to a loss function so that an output result of the student machine learning model for the training data matches an output result of the trained teacher machine learning model.
Here, the plurality of trained teacher machine learning models are in one-to-one correspondence with a plurality of first weight coefficients in a first weight coefficient set; the output results of the plurality of trained teacher machine learning models are weighted and summed according to the respective corresponding first weight coefficients to obtain the classification layer output result of the multi-bit teacher machine learning model; and the first part of the loss function comprises a Kullback-Leibler (KL) divergence determined based on the classification layer output result of the multi-bit teacher machine learning model and the output result of the student machine learning model for the training data.
Wherein the plurality of intermediate layers included in each of the plurality of trained teacher machine learning models are in one-to-one correspondence with the plurality of second weight coefficients in the second weight coefficient set. Wherein for one or more of a plurality of intermediate layers comprised by the student machine learning model: determining intermediate layer output characteristics of a multi-bit teacher machine learning model corresponding to a specific intermediate layer of the student machine learning model, wherein the intermediate layer output characteristics of the multi-bit teacher machine learning model are obtained by weighting and summing intermediate layer output characteristics generated by the intermediate layers of the plurality of trained teacher machine learning models which are positioned at the same level as the specific intermediate layer according to respective corresponding second weight coefficients; the second part of the loss function includes distance losses determined based on the mid-level output features of the multi-bit teacher machine learning model and mid-level output features generated by a particular mid-level of the student machine learning model after the point convolution layer conversion.
Wherein the second part of the loss function further comprises summing distance losses corresponding to each of a plurality of intermediate layers included in the student machine learning model after smoothing.
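To illustrate the first part of the loss in the multi-teacher setting, the sketch below weights the classification layer outputs of the trained teacher models by their first weight coefficients to form the classification layer output of the multi-bit teacher model, and then takes the KL divergence against the student output for the same training data; the temperature T is an assumption.

```python
import torch
import torch.nn.functional as F

def multibit_teacher_kl(teacher_logits_list, delta_0, student_logits, T=4.0):
    """First part of the loss (sketch): KL divergence between the weighted multi-teacher
    classification output and the student's output for the same training data."""
    stacked = torch.stack(teacher_logits_list, dim=0)   # [M, batch, classes]
    agg = torch.einsum("m,mbc->bc", delta_0, stacked)   # classification output of the multi-bit teacher
    return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(agg / T, dim=1),
                    reduction="batchmean") * (T * T)
```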
Step S430: deriving the first weight coefficient by the loss function to obtain a first gradient; deriving the second weight coefficient by the loss function to obtain a second gradient; and performing back propagation through the first gradient and the second gradient respectively, thereby dynamically adjusting the first weight coefficient and the second weight coefficient respectively.
Wherein the first gradient is a gradient of the loss function with respect to a classification layer output result of the multi-bit teacher machine learning model, the first gradient determined from the KL divergence included in the first portion of the loss function; the second gradient is a gradient of the loss function with respect to an intermediate layer output feature of the multi-bit teacher machine learning model, the second gradient determined from the distance loss included in a second portion of the loss function.
Referring to figs. 1-4, in some exemplary embodiments, the teacher machine learning model and the student machine learning model may be fully connected neural networks, in which case the corresponding intermediate layer outputs are the outputs of fully connected layers; or the teacher machine learning model and the student machine learning model may be convolutional neural networks, in which case the corresponding intermediate layer outputs are the outputs of convolutional layers; or the teacher machine learning model and the student machine learning model may be recurrent neural networks (RNNs); or the teacher machine learning model and the student machine learning model may be other models having the same structure, as long as the knowledge distillation method and system described in the embodiments of the present application are applicable. These choices may be adjusted according to the specific application scenario and are not specifically limited herein.
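When the models are convolutional neural networks, the intermediate layer output features can be collected with forward hooks, as in the sketch below; the toy architecture and the hook bookkeeping are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Toy CNN whose convolutional outputs serve as intermediate layer output features.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)

features = {}

def make_hook(name):
    def hook(module, inputs, output):
        features[name] = output            # capture the intermediate layer output feature
    return hook

for idx, layer in enumerate(model):
    if isinstance(layer, nn.Conv2d):
        layer.register_forward_hook(make_hook(f"conv{idx}"))

logits = model(torch.randn(2, 3, 32, 32))
print({name: feat.shape for name, feat in features.items()})
```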
The embodiments provided herein may be implemented in any one or combination of hardware, software, firmware, or solid state logic circuitry, and may be implemented in connection with signal processing, control, and/or application specific circuitry. Particular embodiments of the present application provide an apparatus or device that may include one or more processors (e.g., microprocessors, controllers, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), etc.) that process various computer-executable instructions to control the operation of the apparatus or device. Particular embodiments of the present application provide an apparatus or device that can include a system bus or data transfer system that couples the various components together. A system bus can include any of a variety of different bus structures or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. The devices or apparatuses provided in the embodiments of the present application may be provided separately, or may be part of a system, or may be part of other devices or apparatuses.
Particular embodiments provided herein may include or be combined with computer-readable storage media, such as one or more storage devices capable of providing non-transitory data storage. The computer-readable storage medium/storage device may be configured to store data, programs, and/or instructions that, when executed by a processor of an apparatus or device provided by embodiments of the present application, cause the apparatus or device to perform operations associated therewith. The computer-readable storage medium/storage device may include one or more of the following features: volatile, non-volatile, dynamic, static, read/write, read-only, random access, sequential access, location addressability, file addressability, and content addressability. In one or more exemplary embodiments, the computer-readable storage medium/storage device may be integrated into a device or apparatus provided in the embodiments of the present application or belong to a common system. The computer-readable storage medium/storage device may include optical, semiconductor, and/or magnetic memory devices, etc., and may also include Random Access Memory (RAM), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable disk, a recordable and/or rewritable Compact Disc (CD), a Digital Versatile Disc (DVD), a mass storage media device, or any other form of suitable storage media.
The above is an implementation manner of the embodiments of the present application, and it should be noted that the steps in the method described in the embodiments of the present application may be sequentially adjusted, combined, and deleted according to actual needs. In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. It is to be understood that the embodiments of the present application and the structures shown in the drawings are not to be construed as particularly limiting the devices or systems concerned. In other embodiments of the present application, an apparatus or system may include more or fewer components than the specific embodiments and figures, or may combine certain components, or may separate certain components, or may have a different arrangement of components. Those skilled in the art will understand that various modifications and changes may be made in the arrangement, operation, and details of the methods and apparatus described in the specific embodiments without departing from the spirit and scope of the embodiments herein; without departing from the principles of embodiments of the present application, several improvements and modifications may be made, and such improvements and modifications are also considered to be within the scope of the present application.

Claims (29)

1. A method of training a neural network by knowledge distillation, the method comprising:
training at least one teacher machine learning model according to the training data to obtain a trained teacher machine learning model;
inputting the training data into the trained teacher machine learning model so as to obtain an output result of the trained teacher machine learning model, wherein an intermediate layer included in the trained teacher machine learning model generates a corresponding intermediate layer output feature in a generation process of the output result of the trained teacher machine learning model; and
and adjusting parameters of a student machine learning model according to a loss function, so that the difference between the output result of the student machine learning model aiming at the training data and the output result of the trained teacher machine learning model is smaller than a preset threshold value, wherein the loss function comprises a first part and a second part, the first part of the loss function is determined according to the output result of the trained teacher machine learning model, and the second part of the loss function is determined according to the intermediate layer output characteristics generated by the intermediate layer included in the trained teacher machine learning model.
2. The method of claim 1, wherein the at least one teacher machine learning model uses a bit width for weights and activation values that is greater than a bit width for weights and activation values of the student machine learning models.
3. The method of claim 1, wherein the first portion of the loss function comprises a Kullback-Leibler (KL) divergence determined based on the output of the trained teacher machine learning model and the output of the student machine learning model for the training data.
4. The method of claim 3, wherein the number of middle layers included in the student machine learning model is the same as the number of middle layers included in the at least one teacher machine learning model, the middle layers included in the student machine learning model generate corresponding middle layer output features during generation of the output results of the student machine learning model, and the second part of the loss function includes a distance loss determined based on the middle layer output features generated by the middle layers included in the trained teacher machine learning model and the middle layer output features generated by the middle layers included in the student machine learning model after the point convolution layer conversion.
5. The method of claim 4, wherein the loss function further comprises a third portion comprising cross-entropy losses determined for output results of the training data and true labels of the training data based on the student machine learning model.
6. The method of claim 1, wherein the at least one teacher machine learning model comprises a plurality of teacher machine learning models, wherein the number of intermediate layers included in each of the plurality of teacher machine learning models and the number of intermediate layers included in the student machine learning models are the same, wherein training the at least one teacher machine learning model to obtain a trained teacher machine learning model based on the training data comprises:
and respectively training the plurality of teacher machine learning models according to the training data so as to obtain a plurality of corresponding trained teacher machine learning models.
7. The method of claim 6, wherein the bit widths used for the weights and activation values of the respective teacher machine learning models are greater than the bit widths used for the weights and activation values of the student machine learning models.
8. The method of claim 7, wherein inputting the training data into the trained teacher machine learning model to obtain output results of the trained teacher machine learning model comprises:
inputting the training data into the plurality of trained teacher machine learning models respectively to obtain output results of the plurality of trained teacher machine learning models respectively corresponding to each,
wherein, for each of the plurality of trained teacher machine learning models:
the plurality of intermediate layers included in the particular trained teacher machine learning model each generate a corresponding intermediate layer output feature in a process in which the particular trained teacher machine learning model generates an output result of the particular trained teacher machine learning model.
9. The method of claim 8, wherein the plurality of trained teacher machine learning models are in one-to-one correspondence with a plurality of first weight coefficients in a first set of weight coefficients, wherein the output results of the plurality of trained teacher machine learning models are weighted and summed according to the respective corresponding first weight coefficients to obtain a classification layer output result of a multi-bit teacher machine learning model, and wherein the first portion of the loss function comprises a Kullback-Leibler (KL) divergence determined based on the classification layer output result of the multi-bit teacher machine learning model and the output result of the student machine learning model for the training data.
10. The method of claim 9, wherein the plurality of intermediate layers included in each of the plurality of trained teacher machine learning models has a one-to-one correspondence with a plurality of second weight coefficients in a second set of weight coefficients,
wherein for one or more of a plurality of intermediate layers comprised by the student machine learning model:
determining intermediate layer output characteristics of a multi-bit teacher machine learning model corresponding to a specific intermediate layer of the student machine learning model, wherein the intermediate layer output characteristics of the multi-bit teacher machine learning model are obtained by weighting and summing intermediate layer output characteristics generated by the intermediate layers of the plurality of trained teacher machine learning models which are positioned at the same level as the specific intermediate layer according to respective corresponding second weight coefficients;
the second part of the loss function includes distance losses determined based on the mid-level output features of the multi-bit teacher machine learning model and mid-level output features generated by a particular mid-level of the student machine learning model after the point convolution layer conversion.
11. The method of claim 10, wherein the second portion of the loss function further comprises summing distance losses corresponding to each of a plurality of intermediate layers included in the student machine learning model after smoothing.
12. The method of claim 11, wherein the loss function further comprises a third portion comprising cross-entropy losses determined for output results of the training data and true labels of the training data based on the student machine learning model.
13. The method of claim 12, further comprising:
deriving the first weight coefficient by the loss function to obtain a first gradient, wherein the first gradient is a gradient of the loss function with respect to a classification layer output result of the multi-bit teacher machine learning model, the first gradient being determined according to the KL divergence included in a first portion of the loss function;
deriving the second weight coefficient by the loss function to obtain a second gradient, wherein the second gradient is a gradient of the loss function with respect to an intermediate layer output feature of the multi-bit teacher machine learning model, the second gradient being determined from the distance loss included in a second portion of the loss function;
and performing back propagation through the first gradient and the second gradient respectively, thereby dynamically adjusting the first weight coefficient and the second weight coefficient respectively.
14. The method according to any one of claims 1-13, wherein the weight and activation value of the student machine learning model are both binarized.
15. A knowledge distillation system, characterized in that the knowledge distillation system comprises:
a plurality of teacher machine learning models, an
A student machine learning model;
the number of the middle layers included in the teacher machine learning models is the same as that of the middle layers included in the student machine learning models;
bit widths adopted by the weights and the activation values of the teacher machine learning models are all larger than bit widths adopted by the weights and the activation values of the student machine learning models;
wherein parameters of the student machine learning model are adjusted according to a loss function, so that the difference value between the output result of the student machine learning model for the training data and the output result of the plurality of teacher machine learning models for the training data is smaller than a preset threshold value;
the first part of the loss function comprises Kullback-Leibler (KL) divergence determined based on the output results of a classification layer of a multi-bit teacher machine learning model and the output results of the student machine learning model aiming at the training data, and the output results of the classification layer of the multi-bit teacher machine learning model are obtained by weighting and summing the output results of the plurality of teacher machine learning models aiming at the training data according to respective corresponding first weight coefficients;
the second part of the loss function comprises distance loss determined based on intermediate layer output characteristics of a multi-bit teacher machine learning model and intermediate layer output characteristics generated by a specific intermediate layer of the student machine learning model after point convolution layer conversion, and the intermediate layer output characteristics of the multi-bit teacher machine learning model are obtained by weighting and summing the intermediate layer output characteristics generated by the intermediate layers of the plurality of teacher machine learning models which are positioned at the same level as the specific intermediate layer according to respective corresponding second weight coefficients;
wherein a third portion of the loss function includes cross-entropy losses determined for output results of the training data and true labels of the training data based on the student machine learning model.
16. The system of claim 15,
a first gradient obtained by deriving the first weight coefficient by the loss function, wherein the first gradient is a gradient of the loss function with respect to a classification layer output result of the multi-bit teacher machine learning model, the first gradient being determined according to the KL divergence included in a first part of the loss function,
a second gradient obtained by deriving the second weight coefficient by the loss function, wherein the second gradient is a gradient of the loss function with respect to an intermediate layer output feature of the multi-bit teacher machine learning model, the second gradient being determined from the distance loss included in a second part of the loss function,
the first weight coefficient and the second weight coefficient are dynamically adjusted by counter-propagating through the first gradient and the second gradient, respectively.
17. A computer-readable storage medium holding computer instructions that, when executed by a processor, cause the processor to:
training at least one teacher machine learning model according to the training data to obtain a trained teacher machine learning model;
inputting the training data into the trained teacher machine learning model so as to obtain an output result of the trained teacher machine learning model, wherein an intermediate layer included in the trained teacher machine learning model generates a corresponding intermediate layer output feature in a generation process of the output result of the trained teacher machine learning model; and
and adjusting parameters of a student machine learning model according to a loss function, so that the difference between the output result of the student machine learning model aiming at the training data and the output result of the trained teacher machine learning model is smaller than a preset threshold value, wherein the loss function comprises a first part and a second part, the first part of the loss function is determined according to the output result of the trained teacher machine learning model, and the second part of the loss function is determined according to the intermediate layer output characteristics generated by the intermediate layer included in the trained teacher machine learning model.
18. The computer-readable storage medium of claim 17, wherein the at least one teacher machine learning model uses a bit width for weights and activation values that is greater than a bit width for weights and activation values of the student machine learning models.
19. The computer-readable storage medium of claim 17, wherein the first portion of the loss function comprises a Kullback-Leibler (KL) divergence determined based on the output of the trained teacher machine learning model and the output of the student machine learning model for the training data.
20. The computer-readable storage medium of claim 19, wherein the number of intermediate layers included in the student machine learning model is the same as the number of intermediate layers included in the at least one teacher machine learning model, wherein the intermediate layers included in the student machine learning model generate corresponding intermediate layer output features during generation of the output results of the student machine learning model, and wherein the second part of the loss function comprises distance losses determined based on the intermediate layer output features generated by the intermediate layers included in the trained teacher machine learning model and the intermediate layer output features generated by the intermediate layers included in the student machine learning model after the point convolution layer conversion.
21. The computer-readable storage medium of claim 20, wherein the loss function further comprises a third portion, the third portion of the loss function comprising cross-entropy losses determined for output results of the training data and true labels of the training data based on the student machine learning model.
22. The computer-readable storage medium of claim 17, wherein the at least one teacher machine learning model comprises a plurality of teacher machine learning models, wherein a number of intermediate layers included in each of the plurality of teacher machine learning models and a number of intermediate layers included in the student machine learning models are the same, wherein training at least one teacher machine learning model to obtain a trained teacher machine learning model based on the training data comprises:
and respectively training the plurality of teacher machine learning models according to the training data so as to obtain a plurality of corresponding trained teacher machine learning models.
23. The computer-readable storage medium of claim 22, wherein the respective weights and activation values of the plurality of teacher machine learning models each employ a bit width that is greater than a bit width employed by the weights and activation values of the student machine learning models.
24. The computer-readable storage medium of claim 23, wherein inputting the training data into the trained teacher machine learning model to obtain output results of the trained teacher machine learning model comprises:
inputting the training data into the plurality of trained teacher machine learning models respectively to obtain output results of the plurality of trained teacher machine learning models respectively corresponding to each,
wherein, for each of the plurality of trained teacher machine learning models:
the plurality of intermediate layers included in the particular trained teacher machine learning model each generate a corresponding intermediate layer output feature in a process in which the particular trained teacher machine learning model generates an output result of the particular trained teacher machine learning model.
25. The computer-readable storage medium of claim 24, wherein the plurality of trained teacher machine learning models are in one-to-one correspondence with a plurality of first weight coefficients in a first set of weight coefficients, wherein the output results of the plurality of trained teacher machine learning models are weighted and summed according to the respective corresponding first weight coefficients to obtain a classification layer output result of a multi-bit teacher machine learning model, and wherein the first portion of the loss function comprises a Kullback-Leibler (KL) divergence determined based on the classification layer output result of the multi-bit teacher machine learning model and the output result of the student machine learning model for the training data.
26. The computer-readable storage medium of claim 25, wherein the plurality of intermediate layers included in each of the plurality of trained teacher machine learning models are in one-to-one correspondence with a plurality of second weight coefficients in a second set of weight coefficients,
wherein for one or more of a plurality of intermediate layers comprised by the student machine learning model:
determining intermediate layer output characteristics of a multi-bit teacher machine learning model corresponding to a specific intermediate layer of the student machine learning model, wherein the intermediate layer output characteristics of the multi-bit teacher machine learning model are obtained by weighting and summing intermediate layer output characteristics generated by the intermediate layers of the plurality of trained teacher machine learning models which are positioned at the same level as the specific intermediate layer according to respective corresponding second weight coefficients;
the second part of the loss function includes distance losses determined based on the mid-level output features of the multi-bit teacher machine learning model and mid-level output features generated by a particular mid-level of the student machine learning model after the point convolution layer conversion.
27. The computer-readable storage medium of claim 26, wherein the second portion of the loss function further comprises summing distance losses corresponding to each of a plurality of intermediate layers included in the student machine learning model after smoothing.
28. The computer-readable storage medium of claim 27, wherein the loss function further comprises a third portion, the third portion of the loss function comprising cross-entropy losses determined for output results of the training data and true labels of the training data based on the student machine learning model.
29. The computer-readable storage medium of claim 28, wherein the processor further performs the following:
deriving the first weight coefficient by the loss function to obtain a first gradient, wherein the first gradient is a gradient of the loss function with respect to a classification layer output result of the multi-bit teacher machine learning model, the first gradient being determined according to the KL divergence included in a first portion of the loss function;
deriving the second weight coefficient by the loss function to obtain a second gradient, wherein the second gradient is a gradient of the loss function with respect to an intermediate layer output feature of the multi-bit teacher machine learning model, the second gradient being determined from the distance loss included in a second portion of the loss function;
and performing back propagation through the first gradient and the second gradient respectively, thereby dynamically adjusting the first weight coefficient and the second weight coefficient respectively.
CN202011273058.5A 2020-11-13 2020-11-13 Knowledge distillation method and system Pending CN112508169A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011273058.5A CN112508169A (en) 2020-11-13 2020-11-13 Knowledge distillation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011273058.5A CN112508169A (en) 2020-11-13 2020-11-13 Knowledge distillation method and system

Publications (1)

Publication Number Publication Date
CN112508169A true CN112508169A (en) 2021-03-16

Family

ID=74957746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011273058.5A Pending CN112508169A (en) 2020-11-13 2020-11-13 Knowledge distillation method and system

Country Status (1)

Country Link
CN (1) CN112508169A (en)



Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107247989A (en) * 2017-06-15 2017-10-13 北京图森未来科技有限公司 A kind of neural network training method and device
CN108764462A (en) * 2018-05-29 2018-11-06 成都视观天下科技有限公司 A kind of convolutional neural networks optimization method of knowledge based distillation
CN111105008A (en) * 2018-10-29 2020-05-05 富士通株式会社 Model training method, data recognition method and data recognition device
US20200334538A1 (en) * 2019-04-16 2020-10-22 Microsoft Technology Licensing, Llc Conditional teacher-student learning for model training
CN111062951A (en) * 2019-12-11 2020-04-24 华中科技大学 Knowledge distillation method based on semantic segmentation intra-class feature difference
CN111160409A (en) * 2019-12-11 2020-05-15 浙江大学 Heterogeneous neural network knowledge reorganization method based on common feature learning
CN111199242A (en) * 2019-12-18 2020-05-26 浙江工业大学 Image increment learning method based on dynamic correction vector
CN111091109A (en) * 2019-12-24 2020-05-01 厦门瑞为信息技术有限公司 Method, system and equipment for predicting age and gender based on face image
CN111461212A (en) * 2020-03-31 2020-07-28 中国科学院计算技术研究所 Compression method for point cloud target detection model
CN111598213A (en) * 2020-04-01 2020-08-28 北京迈格威科技有限公司 Network training method, data identification method, device, equipment and medium
CN111709476A (en) * 2020-06-17 2020-09-25 浪潮集团有限公司 Knowledge distillation-based small classification model training method and device
CN111882031A (en) * 2020-06-30 2020-11-03 华为技术有限公司 Neural network distillation method and device

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022217853A1 (en) * 2021-04-16 2022-10-20 Huawei Technologies Co., Ltd. Methods, devices and media for improving knowledge distillation using intermediate representations
CN113139488A (en) * 2021-04-29 2021-07-20 北京百度网讯科技有限公司 Method and device for training segmented neural network
CN113139488B (en) * 2021-04-29 2024-01-12 北京百度网讯科技有限公司 Method and device for training segmented neural network
CN113255763A (en) * 2021-05-21 2021-08-13 平安科技(深圳)有限公司 Knowledge distillation-based model training method and device, terminal and storage medium
CN113255763B (en) * 2021-05-21 2023-06-09 平安科技(深圳)有限公司 Model training method, device, terminal and storage medium based on knowledge distillation
CN113360616A (en) * 2021-06-04 2021-09-07 科大讯飞股份有限公司 Automatic question-answering processing method, device, equipment and storage medium
WO2022257614A1 (en) * 2021-06-10 2022-12-15 北京百度网讯科技有限公司 Training method and apparatus for object detection model, and image detection method and apparatus
CN113326940A (en) * 2021-06-25 2021-08-31 江苏大学 Knowledge distillation method, device, equipment and medium based on multiple knowledge migration
CN113281048A (en) * 2021-06-25 2021-08-20 华中科技大学 Rolling bearing fault diagnosis method and system based on relational knowledge distillation
CN113505614A (en) * 2021-07-29 2021-10-15 沈阳雅译网络技术有限公司 Small model training method for small CPU equipment
CN113610232B (en) * 2021-09-28 2022-02-22 苏州浪潮智能科技有限公司 Network model quantization method and device, computer equipment and storage medium
CN113610232A (en) * 2021-09-28 2021-11-05 苏州浪潮智能科技有限公司 Network model quantization method and device, computer equipment and storage medium
WO2023050707A1 (en) * 2021-09-28 2023-04-06 苏州浪潮智能科技有限公司 Network model quantization method and apparatus, and computer device and storage medium
CN113610069A (en) * 2021-10-11 2021-11-05 北京文安智能技术股份有限公司 Knowledge distillation-based target detection model training method
CN113947196A (en) * 2021-10-25 2022-01-18 中兴通讯股份有限公司 Network model training method and device and computer readable storage medium
WO2023071743A1 (en) * 2021-10-25 2023-05-04 中兴通讯股份有限公司 Network model training method and apparatus, and computer-readable storage medium
CN113920540A (en) * 2021-11-04 2022-01-11 厦门市美亚柏科信息股份有限公司 Knowledge distillation-based pedestrian re-identification method, device, equipment and storage medium
CN114358206A (en) * 2022-01-12 2022-04-15 合肥工业大学 Binary neural network model training method and system, and image processing method and system
WO2023155183A1 (en) * 2022-02-21 2023-08-24 Intel Corporation Systems, apparatus, articles of manufacture, and methods for teacher-free self-feature distillation training of machine learning models
CN114610500A (en) * 2022-03-22 2022-06-10 重庆邮电大学 Edge caching method based on model distillation
CN114610500B (en) * 2022-03-22 2024-04-30 重庆邮电大学 Edge caching method based on model distillation
WO2024012255A1 (en) * 2022-07-11 2024-01-18 北京字跳网络技术有限公司 Semantic segmentation model training method and apparatus, electronic device, and storage medium
CN115601536A (en) * 2022-12-02 2023-01-13 荣耀终端有限公司(Cn) Image processing method and electronic equipment
CN116091895A (en) * 2023-04-04 2023-05-09 之江实验室 Model training method and device oriented to multitask knowledge fusion

Similar Documents

Publication Publication Date Title
CN112508169A (en) Knowledge distillation method and system
Fang et al. Post-training piecewise linear quantization for deep neural networks
US11657274B2 (en) Weakly-supervised semantic segmentation with self-guidance
US20200097818A1 (en) Method and system for training binary quantized weight and activation function for deep neural networks
CN110782008B (en) Training method, prediction method and device of deep learning model
Wang et al. Network pruning using sparse learning and genetic algorithm
EP3540654A1 (en) Learning classification device and learning classification method
KR102410820B1 (en) Method and apparatus for recognizing based on neural network and for training the neural network
EP3295381B1 (en) Augmenting neural networks with sparsely-accessed external memory
CN111723220A (en) Image retrieval method and device based on attention mechanism and Hash and storage medium
CN111400601B (en) Video recommendation method and related equipment
WO2016182671A1 (en) Fixed point neural network based on floating point neural network quantization
KR20200128938A (en) Model training method and apparatus, and data recognizing method
WO2021042857A1 (en) Processing method and processing apparatus for image segmentation model
US20230153631A1 (en) Method and apparatus for transfer learning using sample-based regularization
KR20220045424A (en) Method and apparatus of compressing artificial neural network
CN115511069A (en) Neural network training method, data processing method, device and storage medium
Gopalakrishnan et al. Sentiment analysis using simplified long short-term memory recurrent neural networks
CN113761868A (en) Text processing method and device, electronic equipment and readable storage medium
KR20190134965A (en) A method and system for training of neural networks
Pietron et al. Retrain or not retrain?-efficient pruning methods of deep cnn networks
CN112150497A (en) Local activation method and system based on binary neural network
He et al. Learned transferable architectures can surpass hand-designed architectures for large scale speech recognition
Peter et al. Resource-efficient dnns for keyword spotting using neural architecture search and quantization
CN110084356B (en) Deep neural network data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination