CN112508169A - Knowledge distillation method and system

Knowledge distillation method and system

Info

Publication number
CN112508169A
Authority
CN
China
Prior art keywords: machine learning, learning model, teacher machine, loss function, teacher
Legal status (the legal status is an assumption and is not a legal conclusion)
Pending
Application number
CN202011273058.5A
Other languages
Chinese (zh)
Inventor
聂迎
韩凯
王云鹤
许春景
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN202011273058.5A
Publication of CN112508169A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a method for training a neural network through knowledge distillation in the field of artificial intelligence. The method comprises the following steps: training at least one teacher machine learning model according to training data; inputting the training data into the trained teacher machine learning model to obtain an output result of the trained teacher machine learning model; and adjusting parameters of a student machine learning model according to a loss function, so that the difference between the output result of the student machine learning model for the training data and the output result of the trained teacher machine learning model is smaller than a preset threshold. The loss function includes a first part determined according to the output result of the trained teacher machine learning model and a second part determined according to the intermediate layer output features generated by the intermediate layers included in the trained teacher machine learning model.

Description

Knowledge distillation method and system
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a knowledge distillation method and a knowledge distillation system.
Background
Deep learning, one of the mainstream branches of machine learning, is widely applied in image processing, face recognition and other fields. Convolutional Neural Networks (CNNs) are among the representative deep learning algorithms and have been widely used with great success in large-scale computer vision applications such as image classification, object detection, image segmentation and video analysis. Convolutional neural network models, particularly Deep Neural Network (DNN) models, require a large amount of computing power and memory. In a common CNN, each convolutional layer can contain tens of thousands or even hundreds of thousands of parameters, and the parameters of all convolutional layers of the whole network can add up to tens of millions. If represented as 32-bit floating point numbers, these parameters require hundreds of megabytes of memory or cache. On the other hand, the convolution operation itself involves a huge amount of computation: a convolution with hundreds of thousands of kernel parameters can require thousands of floating point operations (FLOPs), and a common CNN can require up to billions of FLOPs across the whole network.
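For illustration only, the following back-of-the-envelope Python arithmetic reproduces the orders of magnitude mentioned above; the parameter count and layer sizes are hypothetical examples, not figures from a specific network.

```python
# Back-of-the-envelope arithmetic for the figures above (all sizes are
# hypothetical examples, not taken from a specific network).

# Memory: 30 million parameters stored as 32-bit floats (4 bytes each).
num_params = 30_000_000
print(f"weights: {num_params * 4 / 1e6:.0f} MB")      # ~120 MB

# Compute: one 3x3 convolution, 256 -> 256 channels, on a 14x14 feature map.
c_in, c_out, k, h, w = 256, 256, 3, 14, 14
flops = 2 * c_out * h * w * c_in * k * k              # multiply-adds counted as 2 ops
print(f"one conv layer: {flops / 1e6:.0f} MFLOPs")    # ~231 MFLOPs
# Dozens of such layers add up to billions of FLOPs for a single forward pass.
```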
Edge devices, such as mobile phones and wearable devices, need to handle deep learning related tasks locally, but are limited by restricted resources and power consumption as well as latency and cost considerations, and can hardly meet the computing power and memory requirements of common CNNs. Products based on deep learning are therefore difficult to deploy on these edge devices. To bring deep learning based products to edge devices, one approach is to design a dedicated hardware accelerator for a given computational task; the other is to simplify the neural network model to reduce its computing power and memory requirements, i.e., model compression (Model Compression). Model compression includes methods such as pruning (Pruning), quantization (Quantization), low-rank factorization (Low-Rank Factorization), and knowledge distillation (Knowledge Distillation). Among them, research on simplifying a full-precision neural network into a binary neural network has received much attention. A binary neural network expresses the weights/activation values of a full-precision neural network in binary form (for example, 1 and -1) and replaces the original floating point multiply-add operations with bit operations, which greatly reduces the required computing resources, improves the operation speed, and facilitates deployment on mobile terminals. However, the prior art methods for simplifying a full-precision neural network into a binary neural network generally quantize with a sign function and approximate the gradient of the 32-bit floating point network parameters with a Straight-Through Estimator (STE). This results in inaccurate gradients and affects the update accuracy of the network parameters of both the full-precision neural network and the binarized neural network. Therefore, when the weights and activation values of a full-precision neural network are quantized to two values, training easily falls into a local minimum, leading to insufficient training and, in turn, a large loss of precision. Model compression based on knowledge distillation, on the other hand, migrates the inference and prediction capability of a trained, more complex machine learning model to a simpler machine learning model. Knowledge distillation can therefore be applied to a full-precision neural network to obtain a low bit-width neural network whose weights/activation values have a lower bit width. The low bit-width neural network replaces the floating point computation of the full-precision neural network with low bit-width values, which improves operation efficiency but can cause precision loss; the larger the bit-width gap between the teacher network and the student network used for knowledge distillation, the larger the precision loss, so it is not easy to simplify a full-precision neural network into a binary neural network.
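For illustration only, the sign-function quantization with straight-through estimator (STE) gradient described above can be sketched in PyTorch as follows; the clipping range, initialization and module structure are illustrative assumptions rather than details of this application.

```python
import torch
from torch import nn


class BinarizeSTE(torch.autograd.Function):
    """Sign-function quantization with a straight-through estimator (STE)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # Forward pass uses the binarized value (+1 / -1; exact zeros map to 0 here).
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # STE: pass the gradient straight through, zeroed outside [-1, 1]
        # (an assumed, commonly used clipping choice).
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)


class BinaryLinear(nn.Module):
    """Linear layer whose weights are binarized in the forward pass."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)

    def forward(self, x):
        return x @ BinarizeSTE.apply(self.weight).t()
```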
Therefore, a model compression method is needed that effectively reduces the required computing resources and improves the operation speed while avoiding precision loss as much as possible, and that in particular can simplify a full-precision neural network model into a binary neural network model.
Disclosure of Invention
It is an object of the present application to provide a method, system and computer readable storage medium for training a neural network by knowledge distillation. The method comprises the following steps: training at least one teacher machine learning model according to training data to obtain a trained teacher machine learning model; inputting the training data into the trained teacher machine learning model to obtain an output result of the trained teacher machine learning model, wherein an intermediate layer included in the trained teacher machine learning model generates a corresponding intermediate layer output feature during the generation of the output result of the trained teacher machine learning model; and adjusting parameters of a student machine learning model according to a loss function, so that the difference between the output result of the student machine learning model for the training data and the output result of the trained teacher machine learning model is smaller than a preset threshold, wherein the loss function comprises a first part and a second part, the first part of the loss function is determined according to the output result of the trained teacher machine learning model, and the second part of the loss function is determined according to the intermediate layer output features generated by the intermediate layer included in the trained teacher machine learning model. In this way, a plurality of trained teacher machine learning models jointly guide the training of the student machine learning model, which improves the precision of the student machine learning model; at the same time, the loss function takes into account how much different teacher networks and different intermediate layers contribute to the distillation of the student network, which improves the training effect.
In a first aspect, embodiments of the present application provide a method for training a neural network by knowledge distillation. The method comprises the following steps: training at least one teacher machine learning model according to training data to obtain a trained teacher machine learning model; inputting the training data into the trained teacher machine learning model to obtain an output result of the trained teacher machine learning model, wherein an intermediate layer included in the trained teacher machine learning model generates a corresponding intermediate layer output feature during the generation of the output result of the trained teacher machine learning model; and adjusting parameters of a student machine learning model according to a loss function, so that the difference between the output result of the student machine learning model for the training data and the output result of the trained teacher machine learning model is smaller than a preset threshold, wherein the loss function comprises a first part and a second part, the first part of the loss function is determined according to the output result of the trained teacher machine learning model, and the second part of the loss function is determined according to the intermediate layer output features generated by the intermediate layer included in the trained teacher machine learning model.
According to the technical scheme described in the first aspect, the trained teacher machine learning models jointly guide the training of the student machine learning model, which improves the precision of the student machine learning model; at the same time, the loss function takes into account how much different teacher networks and different intermediate layers contribute to the distillation of the student network, which improves the training effect.
In a possible implementation form according to the first aspect, the bit width of the weight and the activation value of the at least one teacher machine learning model is greater than the bit width of the weight and the activation value of the student machine learning model.
In this way, a teacher machine learning model with a high bit width guides the training of a student machine learning model with a low bit width, which helps reduce the required computing resources and improve the operation speed while avoiding precision loss as much as possible.
In a possible implementation form according to the first aspect, the first part of the loss function comprises a Kullback-Leibler (KL) divergence determined based on the output result of the trained teacher machine learning model and the output result of the student machine learning model for the training data.
In this way, the influence of the output result is taken into account by the first part of the loss function.
According to the first aspect, in a possible implementation manner, the number of the intermediate layers included in the student machine learning model is the same as the number of the intermediate layers included in the at least one teacher machine learning model, the intermediate layers included in the student machine learning model generate corresponding intermediate layer output features in a generation process of an output result of the student machine learning model, and the second part of the loss function includes a distance loss determined based on the intermediate layer output features generated by the intermediate layers included in the trained teacher machine learning model and the intermediate layer output features generated by the intermediate layers included in the student machine learning model after the point convolution layer conversion.
In this way, the second part of the loss function takes into account how much different teacher networks and different intermediate layers contribute to the distillation of the student network.
According to the first aspect, in one possible implementation, the loss function further comprises a third part, which comprises a cross-entropy loss determined based on the output result of the student machine learning model for the training data and the true labels of the training data.
In this way, the training effect is improved by the third part of the loss function.
According to a first aspect, in one possible implementation, the at least one teacher machine learning model includes a plurality of teacher machine learning models, wherein the number of intermediate layers included in each of the plurality of teacher machine learning models is the same as the number of intermediate layers included in the student machine learning model, and wherein training the at least one teacher machine learning model based on the training data to obtain a trained teacher machine learning model includes: and respectively training the plurality of teacher machine learning models according to the training data so as to obtain a plurality of corresponding trained teacher machine learning models.
In this way, the plurality of trained teacher machine learning models jointly guide the training of the student machine learning model.
According to the first aspect, in a possible implementation manner, bit widths of the weights and the activation values of the plurality of teacher machine learning models are greater than bit widths of the weights and the activation values of the student machine learning models.
In this way, teacher machine learning models with high bit widths guide the training of a student machine learning model with a low bit width, which helps reduce the required computing resources and improve the operation speed while avoiding precision loss as much as possible.
According to a first aspect, in one possible implementation, inputting the training data into the trained teacher machine learning models to obtain the output results of the trained teacher machine learning models comprises: inputting the training data into the plurality of trained teacher machine learning models, respectively, to obtain the corresponding output result of each of the plurality of trained teacher machine learning models, wherein, for each of the plurality of trained teacher machine learning models, each of the plurality of intermediate layers included in that trained teacher machine learning model generates a corresponding intermediate layer output feature in the process in which that model generates its output result.
Thus, the influence of the output result is considered.
According to the first aspect, in a possible implementation manner, the plurality of trained teacher machine learning models correspond one-to-one to a plurality of first weight coefficients in a first weight coefficient set, the output results of the plurality of trained teacher machine learning models are weighted and summed according to their corresponding first weight coefficients to obtain a classification layer output result of a multi-bit teacher machine learning model, and the first part of the loss function includes a Kullback-Leibler (KL) divergence determined based on the classification layer output result of the multi-bit teacher machine learning model and the output result of the student machine learning model for the training data.
In this way, the influence of the output result is taken into account by the first part of the loss function.
According to the first aspect, in one possible implementation manner, the plurality of intermediate layers included in each of the plurality of trained teacher machine learning models correspond one-to-one to a plurality of second weight coefficients in a second weight coefficient set, wherein, for one or more of the plurality of intermediate layers included in the student machine learning model: an intermediate layer output feature of the multi-bit teacher machine learning model corresponding to a specific intermediate layer of the student machine learning model is determined, the intermediate layer output feature of the multi-bit teacher machine learning model being obtained by weighting and summing, according to their corresponding second weight coefficients, the intermediate layer output features generated by the intermediate layers of the plurality of trained teacher machine learning models at the same level as the specific intermediate layer; and the second part of the loss function includes a distance loss determined based on the intermediate layer output feature of the multi-bit teacher machine learning model and the intermediate layer output feature generated by the specific intermediate layer of the student machine learning model after point convolution layer conversion.
In this way, the second part of the loss function takes into account how much different teacher networks and different intermediate layers contribute to the distillation of the student network.
According to the first aspect, in a possible implementation manner, the second part of the loss function further comprises the sum, after smoothing processing, of the distance losses corresponding to the plurality of intermediate layers included in the student machine learning model.
Thus, the training is stabilized by the smoothing process.
According to the first aspect, in one possible implementation, the loss function further comprises a third part, which comprises a cross-entropy loss determined based on the output result of the student machine learning model for the training data and the true labels of the training data.
In this way, the training effect is improved by the third part of the loss function.
According to the first aspect, in a possible implementation manner, the method further includes: differentiating the loss function with respect to the first weight coefficient to obtain a first gradient, wherein the first gradient is a gradient of the loss function with respect to the classification layer output result of the multi-bit teacher machine learning model and is determined according to the KL divergence included in the first part of the loss function; differentiating the loss function with respect to the second weight coefficient to obtain a second gradient, wherein the second gradient is a gradient of the loss function with respect to an intermediate layer output feature of the multi-bit teacher machine learning model and is determined according to the distance loss included in the second part of the loss function; and performing back propagation through the first gradient and the second gradient, respectively, thereby dynamically adjusting the first weight coefficient and the second weight coefficient, respectively.
In this way, each weight coefficient is dynamically adjusted according to its own corresponding part of the loss function, which simplifies computation and improves system efficiency.
In a possible implementation form according to the first aspect, the weight and the activation value of the student machine learning model are both binarized.
In this way, a student machine learning model simplified to a binarized form is obtained.
In a second aspect, embodiments of the present application provide a knowledge distillation system. The knowledge distillation system comprises: a plurality of teacher machine learning models and a student machine learning model. The number of intermediate layers included in each of the teacher machine learning models is the same as the number of intermediate layers included in the student machine learning model; the bit widths adopted by the weights and activation values of the teacher machine learning models are all larger than the bit width adopted by the weights and activation values of the student machine learning model. Parameters of the student machine learning model are adjusted according to a loss function, so that the difference between the output result of the student machine learning model for the training data and the output results of the plurality of teacher machine learning models for the training data is smaller than a preset threshold. The first part of the loss function comprises a Kullback-Leibler (KL) divergence determined based on the classification layer output result of a multi-bit teacher machine learning model and the output result of the student machine learning model for the training data, where the classification layer output result of the multi-bit teacher machine learning model is obtained by weighting and summing the output results of the plurality of teacher machine learning models for the training data according to their corresponding first weight coefficients. The second part of the loss function comprises a distance loss determined based on an intermediate layer output feature of the multi-bit teacher machine learning model and the intermediate layer output feature generated by a specific intermediate layer of the student machine learning model after point convolution layer conversion, where the intermediate layer output feature of the multi-bit teacher machine learning model is obtained by weighting and summing the intermediate layer output features generated by the intermediate layers of the plurality of teacher machine learning models at the same level as that specific intermediate layer according to their corresponding second weight coefficients. The third part of the loss function comprises a cross-entropy loss determined based on the output result of the student machine learning model for the training data and the true labels of the training data.
The technical scheme described in the second aspect uses a plurality of trained teacher machine learning models to jointly guide the training of the student machine learning model, which improves the precision of the student machine learning model; at the same time, the loss function takes into account how much different teacher networks and different intermediate layers contribute to the distillation of the student network, which helps improve the training effect.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium that holds computer instructions which, when executed by a processor, cause the processor to perform the following operations: training at least one teacher machine learning model according to training data to obtain a trained teacher machine learning model; inputting the training data into the trained teacher machine learning model to obtain an output result of the trained teacher machine learning model, wherein an intermediate layer included in the trained teacher machine learning model generates a corresponding intermediate layer output feature during the generation of the output result of the trained teacher machine learning model; and adjusting parameters of a student machine learning model according to a loss function, so that the difference between the output result of the student machine learning model for the training data and the output result of the trained teacher machine learning model is smaller than a preset threshold, wherein the loss function comprises a first part and a second part, the first part of the loss function is determined according to the output result of the trained teacher machine learning model, and the second part of the loss function is determined according to the intermediate layer output features generated by the intermediate layer included in the trained teacher machine learning model.
The technical scheme described in the third aspect uses a plurality of trained teacher machine learning models to jointly guide the training of the student machine learning model, which improves the precision of the student machine learning model; at the same time, the loss function takes into account how much different teacher networks and different intermediate layers contribute to the distillation of the student network, which helps improve the training effect.
Drawings
In order to explain the technical solutions in the embodiments or the background art of the present application, the drawings used therein are described below.
FIG. 1 illustrates a knowledge distillation system including a single teacher machine learning model provided by embodiments of the present application.
FIG. 2 illustrates a knowledge distillation system including a plurality of teacher machine learning models provided by embodiments of the present application.
FIG. 3 shows a schematic flow diagram of a knowledge distillation method according to one embodiment of the present application.
FIG. 4 shows a schematic flow diagram of a knowledge distillation method according to another embodiment of the present application.
Detailed Description
It is an object of the present application to provide a method, system and computer readable storage medium for training a neural network by knowledge distillation. The method comprises the following steps: training at least one teacher machine learning model according to training data to obtain a trained teacher machine learning model; inputting the training data into the trained teacher machine learning model to obtain an output result of the trained teacher machine learning model, wherein an intermediate layer included in the trained teacher machine learning model generates a corresponding intermediate layer output feature during the generation of the output result of the trained teacher machine learning model; and adjusting parameters of a student machine learning model according to a loss function, so that the difference between the output result of the student machine learning model for the training data and the output result of the trained teacher machine learning model is smaller than a preset threshold, wherein the loss function comprises a first part and a second part, the first part of the loss function is determined according to the output result of the trained teacher machine learning model, and the second part of the loss function is determined according to the intermediate layer output features generated by the intermediate layer included in the trained teacher machine learning model. In this way, a plurality of trained teacher machine learning models jointly guide the training of the student machine learning model, which improves the precision of the student machine learning model; at the same time, the loss function takes into account how much different teacher networks and different intermediate layers contribute to the distillation of the student network, which improves the training effect.
The embodiments of the present application may be applied to various application scenarios including, but not limited to, various scenarios in the field of computer vision applications, such as face recognition, image classification, object detection, semantic segmentation, etc., or to neural network model-based processing systems deployed on edge devices (e.g., mobile phones, wearable devices, computing nodes, etc.), or to application scenarios for speech signal processing, natural language processing, recommendation systems, or to application scenarios requiring compression of neural network models due to limited resources and latency requirements.
For illustrative purposes only, the embodiments of the present application may be applied to an application scenario of object detection on a mobile phone. The technical problem to be solved in this application scenario is as follows: when a user takes a picture with a mobile phone, objects such as human faces and animals need to be captured automatically to help the mobile phone focus and beautify the image automatically. A convolutional neural network model for object detection that is small in size and fast in operation is therefore needed, bringing a better user experience and improving the quality of mobile phone products.
For illustrative purposes only, the embodiments of the present application may also be used in an application scenario of autonomous driving scene segmentation. The technical problem to be solved in this application scenario is as follows: after the camera of an autonomous vehicle captures a road image, the image needs to be segmented to separate different objects such as the road surface, roadbed, vehicles and pedestrians, so as to keep the vehicle driving in the correct area. A convolutional neural network model that can quickly and correctly interpret and semantically segment a picture in real time is therefore needed.
For illustrative purposes only, the embodiments of the present application may also be used in an application scenario of entrance gate face verification. The technical problem to be solved in this application scenario is as follows: when passengers perform face authentication at gates at the entrances of high-speed rail stations, airports and the like, a camera captures a face image, a convolutional neural network extracts its features, and the features are then compared for similarity with the image features of the identity document stored in the system; if the similarity is high, the verification succeeds. Extracting features through the convolutional neural network is the most time-consuming step, so an efficient convolutional neural network model capable of performing feature extraction and face verification quickly is required.
For illustrative purposes only, the embodiments of the present application may also be used in application scenarios of simultaneous interpretation by a translation device. The technical problem to be solved in this application scenario is as follows: for speech recognition and machine translation, recognition and translation must be performed in real time, so an efficient convolutional neural network model is required.
The embodiments of the present application may be modified and improved according to specific application environments, and are not limited herein.
In order to enable those skilled in the art to better understand the present application, embodiments of the present application will be described below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 illustrates a knowledge distillation system including a single teacher machine learning model provided by an embodiment of the present application. As shown in fig. 1, the knowledge distillation system 100 includes a teacher machine learning model 110 and a student machine learning model 120. The student machine learning model 120 may be pre-trained, partially trained, or completely untrained. The knowledge distillation system 100 migrates the knowledge of the teacher machine learning model 110 to the student machine learning model 120, that is, the teacher machine learning model 110 is used to improve the inference and prediction capability of the student machine learning model 120. Among them, the teacher machine learning model 110 is a model that has already been trained, and the output of its classification layer 118 is denoted $y_t$. The teacher machine learning model 110 includes three intermediate layers 112, 114 and 116, which, in the process of generating the output result $y_t$, sequentially generate intermediate layer output features $F_t^1$, $F_t^2$ and $F_t^3$. The output of the classification layer 128 of the student machine learning model 120 is denoted $y_s$. The student machine learning model 120 includes three intermediate layers 122, 124 and 126, which, in the process of generating the output result $y_s$, sequentially generate intermediate layer output features $F_s^1$, $F_s^2$ and $F_s^3$.
With continued reference to fig. 1, the knowledge distillation system 100 first trains the teacher machine learning model 110 based on the training data to obtain a trained teacher machine learning model 110; the training data is then input into the trained teacher machine learning model 110 to obtain its output result $y_t$; and the parameters of the student machine learning model 120 are adjusted according to the loss function, so that the output result $y_s$ of the student machine learning model 120 for the training data matches the output result $y_t$ of the trained teacher machine learning model 110. The loss function includes a first part and a second part. The first part of the loss function is determined according to the output result $y_t$ of the trained teacher machine learning model, and the second part of the loss function is determined according to the intermediate layer output features $F_t^1$, $F_t^2$ and $F_t^3$ generated by the intermediate layers included in the trained teacher machine learning model. It should be understood that whether the output result $y_s$ of the student machine learning model 120 matches the output result $y_t$ of the teacher machine learning model 110 may be judged by requiring the difference between the two to be smaller than a specified threshold, by solving for a global minimum of the difference between the two, or by other technical means, which is not specifically limited here. In addition, when the training data used to train the teacher machine learning model 110 carries true labels, the prediction capability of the student machine learning model 120 or the training effect of the knowledge distillation system 100 may also be judged with the labeled training data, and a match may be declared when the prediction capability or the training effect is sufficiently good.
With continued reference to fig. 1, the bit widths used for the weights and activation values of the teacher machine learning model 110 are greater than the bit widths used for the weights and activation values of the student machine learning model 120. For example, the teacher machine learning model 110 may be a 32-bit-wide model, while the student machine learning model 120 may be a binarized model, i.e., 1 bit wide. In this way, compression from a high bit-width model to a low bit-width model is achieved while knowledge transfer and inheritance of prediction capability are realized through the knowledge distillation technique.
With continued reference to fig. 1, the knowledge distillation system 100 adjusts the parameters of the student machine learning model 120 according to a loss function. The first part of the loss function includes a Kullback-Leibler (KL) divergence determined based on the output result $y_t$ of the trained teacher machine learning model and the output result $y_s$ of the student machine learning model for the training data. Thus, the difference between the output results obtained for the same input data, i.e., the training data, can be measured by the KL divergence. The knowledge distillation system 100 not only achieves knowledge migration through the output results, but also improves the effect of knowledge distillation through the intermediate layer output features. In particular, the number of intermediate layers included in the student machine learning model 120 is the same as the number of intermediate layers included in the teacher machine learning model 110. The second part of the loss function includes a distance loss determined based on the intermediate layer output features $F_t^1$, $F_t^2$ and $F_t^3$ generated by the intermediate layers included in the trained teacher machine learning model 110 and the intermediate layer output features $F_s^1$, $F_s^2$ and $F_s^3$ generated by the intermediate layers included in the student machine learning model 120 after point convolution layer conversion. It should be understood that the student machine learning model 120 includes the same number of intermediate layers as the teacher machine learning model 110, i.e., both have the same depth or the same number of hidden layers. Depth is defined here with respect to the path from the input to the respective classification layer. That is, the number of intermediate layers included in the student machine learning model 120, or its depth, refers to the layers before the classification layer 128 of the student machine learning model 120; the number of intermediate layers included in the teacher machine learning model 110, or its depth, refers to the layers before the classification layer 118 of the teacher machine learning model 110. Because the intermediate layers of the student machine learning model 120 and the teacher machine learning model 110 do not have a one-to-one correspondence, even features output by intermediate layers at the same level do not necessarily correspond. For example, the first intermediate layer 122 of the student machine learning model 120 and the first intermediate layer 112 of the teacher machine learning model 110 are both at the first level, but the intermediate layer features $F_s^1$ and $F_t^1$ that they respectively output do not necessarily correspond and do not necessarily belong to the same channels. Therefore, the intermediate layer output features $F_s^1$, $F_s^2$ and $F_s^3$ generated by the intermediate layers included in the student machine learning model 120 need to undergo point convolution layer conversion, after which they can be used to improve the performance of the student machine learning model 120. In one possible implementation, the loss function further includes a third part. The third part of the loss function includes a cross-entropy loss determined based on the output result $y_s$ of the student machine learning model 120 for the training data and the true labels of the training data.
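For illustration only, the point convolution layer conversion described above corresponds to a 1x1 convolution that maps the student's feature channels onto the teacher's; a minimal PyTorch-style sketch follows, in which the channel counts, feature map sizes and the use of a mean-squared-error distance are illustrative assumptions (the distance actually used is defined below by formulas (4) and (5)).

```python
import torch
from torch import nn
import torch.nn.functional as F

# Assumed channel sizes for illustration: the student's intermediate layer has
# 64 channels and the teacher's corresponding layer has 256 channels.
student_channels, teacher_channels = 64, 256

# Point (1x1) convolution that converts the student's intermediate feature map
# so it can be compared with the teacher's feature map.
point_conv = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

student_feat = torch.randn(8, student_channels, 14, 14)   # N x C_s x H x W
teacher_feat = torch.randn(8, teacher_channels, 14, 14)   # N x C_t x H x W

aligned_student_feat = point_conv(student_feat)           # N x C_t x H x W
# One possible distance between the converted student feature and the teacher
# feature; the distance used by the application is given by formulas (4)-(5).
distance = F.mse_loss(aligned_student_feat, teacher_feat)
```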
Referring to fig. 2, fig. 2 illustrates a knowledge distillation system including a plurality of teacher machine learning models provided by an embodiment of the present application. As shown in fig. 2, the knowledge distillation system 200 includes teacher machine learning models 220, 240 and 260, and a student machine learning model 280. The student machine learning model 280 may be pre-trained, partially trained, or completely untrained. The student machine learning model 280 has three intermediate layers 282, 284 and 286, which sequentially output intermediate layer features $F_s^1$, $F_s^2$ and $F_s^3$, and its classification layer 288 outputs $y_s$. The teacher machine learning model 220 has three intermediate layers 222, 224 and 226, which sequentially output intermediate layer features $F_{t,1}^1$, $F_{t,1}^2$ and $F_{t,1}^3$, and its classification layer 228 outputs $y_t^1$. The teacher machine learning model 240 has three intermediate layers 242, 244 and 246, which sequentially output intermediate layer features $F_{t,2}^1$, $F_{t,2}^2$ and $F_{t,2}^3$, and its classification layer 248 outputs $y_t^2$. The teacher machine learning model 260 has three intermediate layers 262, 264 and 266, which sequentially output intermediate layer features $F_{t,3}^1$, $F_{t,3}^2$ and $F_{t,3}^3$, and its classification layer 268 outputs $y_t^3$. It should be understood that the teacher machine learning models 220, 240 and 260 and the student machine learning model 280 each include the same number of intermediate layers, i.e., the models have the same depth or the same number of hidden layers. Depth is defined here with respect to the path from the input to the respective classification layer. For example, the depth of the student machine learning model 280 extends to the classification layer 288, while the depth of the teacher machine learning model 220 extends to the classification layer 228. The teacher machine learning models 220, 240 and 260 are all models trained from the same training data. In one possible implementation, the teacher machine learning models 220, 240 and 260 are models having the same structure, or the same depth, or the same number of hidden layers, or the same number of channels. The same training data is used to train the teacher machine learning models 220, 240 and 260 to obtain the plurality of trained teacher machine learning models 220, 240 and 260.
With continued reference to fig. 2, the bit widths of the weights and activation values of the plurality of teacher machine learning models 220, 240 and 260 are each greater than the bit widths of the weights and activation values of the student machine learning model 280. For example, the bit widths of the plurality of teacher machine learning models 220, 240 and 260 may be 32 bits, 16 bits and 8 bits respectively, while the student machine learning model 280 may be a binarized model, i.e., 1 bit wide. In this way, compression from high bit-width models to a low bit-width model is achieved while knowledge transfer and inheritance of prediction capability are realized through a knowledge distillation technique that combines a plurality of teacher models. As another example, the bit widths of the plurality of teacher machine learning models 220, 240 and 260 may be 32 bits, 32 bits and 16 bits respectively. That is, some or all of the plurality of teacher machine learning models may share the same bit width.
With continued reference to fig. 2, the knowledge distillation system 200 inputs the same training data into the plurality of trained teacher machine learning models 220, 240 and 260, respectively, to obtain the corresponding output results $y_t^1$, $y_t^2$ and $y_t^3$ of the plurality of trained teacher machine learning models. And, for each of the plurality of trained teacher machine learning models 220, 240 and 260, each of the plurality of intermediate layers included in that trained teacher machine learning model generates a corresponding intermediate layer output feature in the process in which that model generates its output result. The knowledge distillation system 200 adjusts the parameters of the student machine learning model 280 according to the loss function. The first part of the loss function uses the classification layer output result of the multi-bit teacher machine learning model, which is obtained by formula (1):

$$y_t = \sum_{m=1}^{M} \delta_{0,m}\, y_t^m \qquad (1)$$

In formula (1), $y_t$ represents the classification layer output result of the multi-bit teacher machine learning model; $y_t^m$ represents the classification layer output result of the teacher machine learning model with sequence number $m$, where $M$ is the total number of teacher machine learning models; and $\delta_{0,m}$ represents the first weight coefficient in the first weight coefficient set corresponding to the teacher machine learning model with sequence number $m$. Formula (1) means that the output results of the plurality of trained teacher machine learning models are weighted and summed according to their corresponding first weight coefficients to obtain the classification layer output result of the multi-bit teacher machine learning model. Referring to fig. 2 and formula (1), the output results $y_t^1$, $y_t^2$ and $y_t^3$ of the trained teacher machine learning models 220, 240 and 260 are weighted by their corresponding first weight coefficients and summed to obtain the classification layer output result of the multi-bit teacher machine learning model that represents the trained teacher machine learning models 220, 240 and 260. It should be understood that fig. 2 only schematically shows the case of three teacher machine learning models; the total number $M$ of teacher machine learning models in formula (1) may be any positive integer greater than 1, and the case of three teacher machine learning models shown in fig. 2 can be extended to the case of $M$ teacher machine learning models.
With continued reference to fig. 2, the first part of the loss function is expressed as formula (2):

$$L_{KL} = \mathrm{KL}\!\left(\sigma\!\left(\frac{y_t}{T}\right) \,\Big\|\, \sigma\!\left(\frac{y_s}{T}\right)\right) \qquad (2)$$

In formula (2), $y_t$ represents the classification layer output result of the multi-bit teacher machine learning model obtained by formula (1), $y_s$ represents the output result of the student machine learning model for the training data, $T$ is a hyper-parameter (temperature parameter), $\sigma$ represents the softmax activation function, and the left-hand side of formula (2) is the Kullback-Leibler (KL) divergence of the classification layer outputs. As shown in formula (2), the first part of the loss function includes a KL divergence determined based on the classification layer output result of the multi-bit teacher machine learning model and the output result of the student machine learning model for the training data. It should be appreciated that the probability distribution output by the softmax activation function becomes smoother as the temperature parameter $T$ increases.
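For illustration only, formulas (1) and (2) can be sketched in PyTorch as follows, assuming three teacher models whose first weight coefficients are held in a single learnable vector; the tensor shapes, the temperature value and the variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Classification-layer outputs (logits) of M = 3 trained teachers and the student,
# for a batch of 8 samples and 10 classes (shapes are illustrative).
teacher_logits = [torch.randn(8, 10) for _ in range(3)]
student_logits = torch.randn(8, 10, requires_grad=True)

# First weight coefficients delta_{0,m}, one per teacher, kept learnable so they
# can be dynamically adjusted during training.
delta0 = torch.nn.Parameter(torch.ones(3) / 3)

# Formula (1): weighted sum of the teachers' classification-layer outputs.
y_t = sum(delta0[m] * teacher_logits[m] for m in range(3))

# Formula (2): KL divergence between the temperature-softened distributions.
T = 4.0  # illustrative temperature value
L_KL = F.kl_div(
    F.log_softmax(student_logits / T, dim=1),
    F.softmax(y_t / T, dim=1),
    reduction="batchmean",
)
```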
With continued reference to fig. 2 and formula (2), the knowledge distillation system 200 measures the difference between the output results obtained for the same input data, i.e., the training data, by the KL divergence, thereby achieving knowledge migration from the plurality of teacher machine learning models to the student machine learning model through the output results. The knowledge distillation system 200 also improves the effect of knowledge distillation through the intermediate layer output features. Specifically, in addition to formulas (1) and (2), the knowledge distillation system 200 also uses the intermediate layer output features of the multi-bit teacher machine learning model, which are obtained by formula (3):

$$F_t^i = \sum_{m=1}^{M} \delta_{i,m}\, F_{t,m}^i \qquad (3)$$

In formula (3), $F_t^i$ represents the intermediate layer output feature of the intermediate layer with sequence number $i$ of the multi-bit teacher machine learning model; $F_{t,m}^i$ represents the intermediate layer output feature output by the intermediate layer with sequence number $i$ in the teacher machine learning model with sequence number $m$; $\delta_{i,m}$ represents the second weight coefficient corresponding to the intermediate layer with sequence number $i$ in the teacher machine learning model with sequence number $m$; and $M$ is the total number of teacher machine learning models. Taking the knowledge distillation system 200 shown in fig. 2 as an example, $M$ is 3. Formula (3) means: the plurality of intermediate layers included in each of the plurality of trained teacher machine learning models correspond one-to-one to a plurality of second weight coefficients in the second weight coefficient set; and, for one or more of the plurality of intermediate layers included in the student machine learning model, the intermediate layer output feature of the multi-bit teacher machine learning model corresponding to a specific intermediate layer of the student machine learning model is determined by weighting and summing, according to their corresponding second weight coefficients, the intermediate layer output features generated by the intermediate layers of the plurality of trained teacher machine learning models at the same level as that specific intermediate layer. Here, $\delta_{i,m}$ denotes the weight coefficients of the intermediate layers of the different teacher machine learning models, and these coefficients satisfy a constraint on their sum. That is, for the same level $i$ there are multiple weight coefficients $\delta_{i,m}$ ($m = 1$ to $M$), and the coefficients corresponding to the same level $i$ jointly satisfy the sum constraint. In a possible implementation, the second weight coefficients $\delta_{i,m}$ are normalized by a softmax.
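For illustration only, formula (3) for one intermediate level i can be sketched in PyTorch as follows, with the second weight coefficients normalized by a softmax as in the possible implementation described above; the tensor shapes and variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

M = 3                       # number of teacher models
# Intermediate-layer output features F_{t,m}^i of the three teachers at level i
# (batch of 8, 256 channels, 14x14 spatial size; shapes are illustrative).
teacher_feats_i = [torch.randn(8, 256, 14, 14) for _ in range(M)]

# Unnormalized second weight coefficients for level i, kept learnable.
delta_i_raw = torch.nn.Parameter(torch.zeros(M))
delta_i = F.softmax(delta_i_raw, dim=0)     # normalized delta_{i,m}

# Formula (3): weighted sum over the teachers at the same level i.
F_t_i = sum(delta_i[m] * teacher_feats_i[m] for m in range(M))
```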
With continued reference to fig. 2 and formula (3), the second part of the loss function includes a distance loss determined based on the intermediate layer output features of the multi-bit teacher machine learning model and the intermediate layer output features generated by the specific intermediate layers of the student machine learning model after point convolution layer conversion. The second part of the loss function is obtained by formula (4):

$$L_{Dis} = \sum_{i=1}^{N} \mathrm{Smooth}\!\left(F_t^i - r_i\!\left(F_s^i\right)\right) \qquad (4)$$

In formula (4), $F_t^i$ represents the intermediate layer output feature of the multi-bit teacher machine learning model obtained by formula (3); $F_s^i$ represents the intermediate layer output feature of the intermediate layer with sequence number $i$ of the student machine learning model; $r_i$ is the conversion layer that performs point convolution layer conversion on the output of the intermediate layer with sequence number $i$ of the student machine learning model; and $N$ represents the number of selected intermediate layers. Formula (4) means that, for the selected $N$ intermediate layers, the intermediate layer output feature of the multi-bit teacher machine learning model and the corresponding converted intermediate layer output feature of the student machine learning model are paired at each of the $N$ levels, a distance loss is computed for each pair, and the distance losses of the $N$ intermediate layers are finally summed to obtain the second part of the loss function. It should be understood that the intermediate layers of the student machine learning model and of the teacher machine learning models do not have a one-to-one correspondence, and even features output by intermediate layers at the same level do not necessarily correspond. Therefore, the outputs of the intermediate layers of the student machine learning model need to undergo point convolution layer conversion. $N$ may be any positive integer equal to or greater than 1, with its maximum value being the total number of intermediate layers included in the teacher machine learning models and the student machine learning model. That is, the features output by all intermediate layers may be taken into account, only a portion of the intermediate layer outputs may be selected, or only a particular intermediate layer output may be selected. Taking the plurality of teacher machine learning models 220, 240 and 260 shown in fig. 2 as an example, the outputs of all intermediate layers may be considered ($N = 3$), or only the output features of the first-level intermediate layers, i.e., the outputs of the intermediate layers 222, 242 and 262, may be considered ($N = 1$). Correspondingly, according to the selected intermediate layer output features of the teacher machine learning models, the intermediate layer output features of the student machine learning model are converted by the point convolution layer and used to compute the corresponding distance loss. Smooth in formula (4) denotes a smoothing operation, which can be expressed by formula (5).
$$\mathrm{Smooth}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \qquad (5)$$

Combining formulas (4) and (5), the smoothed distance loss in the second part of the loss function helps make the training process more stable. The second part of the loss function thus consists of the sum, after smoothing, of the distance losses corresponding to the plurality of intermediate layers included in the student machine learning model.
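For illustration only, the distance loss of formulas (4) and (5) can be sketched in PyTorch as follows, assuming that Smooth is instantiated as the standard smooth-L1 function; the channel counts, the number of selected layers and the variable names are illustrative assumptions.

```python
import torch
from torch import nn
import torch.nn.functional as F

N = 3                                   # number of selected intermediate layers
student_channels = (64, 128, 256)       # assumed student channel counts
teacher_channels = (128, 256, 512)      # assumed teacher channel counts

# r_i: one point (1x1) convolution per selected student intermediate layer.
point_convs = nn.ModuleList([
    nn.Conv2d(c_s, c_t, kernel_size=1)
    for c_s, c_t in zip(student_channels, teacher_channels)
])

def distance_loss(multi_bit_teacher_feats, student_feats):
    """Formula (4): sum of smoothed distances over the N selected layers.

    multi_bit_teacher_feats[i] is F_t^i from formula (3);
    student_feats[i] is F_s^i before the point-convolution conversion r_i.
    """
    loss = 0.0
    for i in range(N):
        converted = point_convs[i](student_feats[i])          # r_i(F_s^i)
        # smooth_l1_loss plays the role of Smooth in formula (5) (assumed form).
        loss = loss + F.smooth_l1_loss(converted, multi_bit_teacher_feats[i])
    return loss
```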
With continued reference to fig. 2, the loss function further includes a third part, which includes a cross-entropy loss determined based on the output result of the student machine learning model for the training data and the true labels of the training data. Combining formulas (1) to (5), the loss function is expressed as formula (6):

$$L_{all} = L_{CE} + L_{KL} + L_{Dis} \qquad (6)$$

In formula (6), $L_{CE}$ is the third part of the loss function, namely the cross-entropy loss determined based on the output result of the student machine learning model for the training data and the true labels of the training data; $L_{KL}$ is the KL divergence of formula (2) in the first part of the loss function, determined based on the classification layer output result of the multi-bit teacher machine learning model and the output result of the student machine learning model for the training data; and $L_{Dis}$ is the distance loss of formula (4), determined based on the intermediate layer output features of the multi-bit teacher machine learning model and the intermediate layer output features generated by the specific intermediate layers of the student machine learning model after point convolution layer conversion.
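For illustration only, the overall loss of formula (6) can be sketched as follows, reusing the distance_loss sketch above; the unweighted sum is an assumption, since any balancing coefficients between the three parts are not recoverable from the text.

```python
import torch
import torch.nn.functional as F

# distance_loss and point_convs are defined in the sketch after formula (5) above.

def total_loss(student_logits, labels, y_t, multi_bit_teacher_feats,
               student_feats, T=4.0):
    """L_all = L_CE + L_KL + L_Dis, following formula (6) (assumed unweighted)."""
    # Third part: cross-entropy between the student's outputs and the true labels.
    L_CE = F.cross_entropy(student_logits, labels)

    # First part: formula (2), KL divergence on temperature-softened outputs.
    L_KL = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(y_t / T, dim=1),
        reduction="batchmean",
    )

    # Second part: formula (4), via the distance_loss sketch shown earlier.
    L_Dis = distance_loss(multi_bit_teacher_feats, student_feats)

    return L_CE + L_KL + L_Dis
```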
With continued reference to fig. 2 and equation (6), the knowledge distillation system 200 can implement dynamic adjustment. Specifically, the loss function is differentiated with respect to the first weight coefficient to obtain a first gradient, where the first gradient is the gradient of the loss function with respect to the classification layer output result of the multi-bit teacher machine learning model and is determined from the KL divergence included in the first part of the loss function. Likewise, the loss function is differentiated with respect to the second weight coefficient to obtain a second gradient, where the second gradient is the gradient of the loss function with respect to the intermediate layer output features of the multi-bit teacher machine learning model and is determined from the distance loss included in the second part of the loss function. Back propagation is then performed through the first gradient and the second gradient, so that the first weight coefficient and the second weight coefficient are dynamically adjusted, respectively. The dynamic adjustment process is represented by equation (7).
first gradient: ∂L_all / ∂δ_0,m
second gradient: ∂L_all / ∂δ_i,m    (7)
In equation (7), δ_0,m represents the first weight coefficient, in the first weight coefficient set mentioned in equation (1), that corresponds to the teacher machine learning model with serial number m; δ_i,m represents the second weight coefficient, mentioned in equation (3), that corresponds to the intermediate layer with serial number i in the teacher machine learning model with serial number m; and L_all represents the overall loss function mentioned in equation (6). Equation (7) shows that the overall loss function L_all is differentiated with respect to the first weight coefficient δ_0,m and the second weight coefficient δ_i,m, respectively, to obtain the first gradient and the second gradient. The first gradient depends on the KL divergence of the first part of the loss function mentioned in equation (2), and the second gradient depends on the distance loss of the second part of the loss function mentioned in equation (4); solving the first gradient does not involve the second part of the loss function, and solving the second gradient does not involve the first part. Thus, combining equations (1) through (7), the overall loss function includes three parts that each play their own role and together improve the effectiveness of the knowledge distillation, while the back propagation and the gradients are obtained by differentiating each part separately, without involving the other parts of the loss function. This facilitates dynamic adjustment based on the corresponding part of the loss function, which in turn simplifies computation and improves system efficiency. In other words, the derivatives and the back propagation obtained through equation (7) dynamically adjust the corresponding weight coefficients, which further helps adjust the parameters of the student machine learning model through the loss function, so that the output result of the student machine learning model for the training data matches the output result of the trained teacher machine learning model.
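One way to realize this dynamic adjustment is to register the first and second weight coefficients as learnable parameters, so that back-propagating the loss yields the first and second gradients automatically. The sketch below uses a stand-in loss just to show the mechanics; the tensor shapes, the optimizer, and the learning rate are assumptions.

```python
import torch

num_teachers, num_layers = 3, 2
# delta_0[m]: first weight coefficient for teacher m's classification output;
# delta_i[i, m]: second weight coefficient for intermediate layer i of teacher m.
delta_0 = torch.nn.Parameter(torch.full((num_teachers,), 1.0 / num_teachers))
delta_i = torch.nn.Parameter(torch.full((num_layers, num_teachers), 1.0 / num_teachers))
weight_opt = torch.optim.SGD([delta_0, delta_i], lr=0.01)

# Weighted aggregation of per-teacher quantities (dummy tensors stand in for real outputs).
teacher_logits = torch.randn(num_teachers, 4, 10)             # [teacher, batch, classes]
teacher_feats = torch.randn(num_layers, num_teachers, 4, 8)   # [layer, teacher, batch, feature]
agg_logits = torch.einsum("m,mbc->bc", delta_0, teacher_logits)
agg_feats = torch.einsum("im,imbf->ibf", delta_i, teacher_feats)

# Stand-in for L_all of equation (6), built from the aggregated quantities.
loss = agg_logits.pow(2).mean() + agg_feats.pow(2).mean()
weight_opt.zero_grad()
loss.backward()    # delta_0.grad holds the first gradient, delta_i.grad the second
weight_opt.step()  # dynamic adjustment of both sets of weight coefficients
```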
In this way, a plurality of pre-trained teacher machine learning models with high bit widths are combined with the knowledge distillation technique to jointly guide the training of a student machine learning model with a low bit width, such as a binarized model, thereby improving the accuracy of the student machine learning model. In addition, since different teacher networks and different layers contribute to the distillation of the student network to different degrees during training, each teacher network is given a learnable weight coefficient, realizing dynamic adjustment of the weights of different teacher networks.
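For context, a binarized student is commonly realized by passing weights or activations through a sign function whose backward pass uses a straight-through estimator; the sketch below shows one such realization under that assumption and is not taken from the patent text itself.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a clipped straight-through estimator (assumed scheme)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)            # values in {-1, 0, +1}; effectively 1-bit

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()  # pass gradients only where |x| <= 1

x = torch.randn(5, requires_grad=True)
y = BinarizeSTE.apply(x)                # binarized activations, still trainable end to end
```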
Combining equations (1) through (7), when the student machine learning model is a binary neural network model used for an image classification task, the embodiments of the present application achieve a significant technical effect. Specifically, experiments were performed on the CIFAR10 and CIFAR100 image classification tasks. Compared with prior-art binary neural networks, the neural network model obtained by the knowledge distillation method of the embodiments of the present application achieves higher accuracy at the same computational cost, as shown in Table 1.
Table 1: Comparison of classification results on CIFAR10 and CIFAR100
In addition, experiments were performed on the large-scale image classification dataset ImageNet. Compared with other binary neural networks, the neural network provided by the present invention achieves higher accuracy at the same computational cost, as shown in Tables 2 and 3.
Table 2 compares the ImageNet classification results, and Table 3 compares the classification results of single-bit and multi-bit distillation.
In addition, the effectiveness of the present invention was verified by ablation experiments; see Table 4 below. Here, FULL denotes using all methods of the present invention, w/o AKA denotes removing the dynamic knowledge adjustment, w/o CTL denotes removing the 1x1 convolution transformation layer, w/o Intermediate Layers denotes removing intermediate layer feature distillation, and w/o Classification Layer denotes removing classification layer output distillation.
Table 4: Ablation experiment results
Referring to fig. 3, fig. 3 shows a schematic flow diagram of a knowledge distillation method according to an embodiment of the present application. As shown in fig. 3, the method includes the following steps.
Step S300: at least one teacher machine learning model is trained based on the training data to obtain a trained teacher machine learning model.
Here, the at least one teacher machine learning model is obtained by training on the same training data. In one possible implementation, the at least one teacher machine learning model has the same structure, or the same model depth, or the same number of hidden layers, or the same number of channels.
Step S310: inputting the training data into the trained teacher machine learning model to obtain an output of the trained teacher machine learning model.
And the middle layer included in the trained teacher machine learning model generates corresponding middle layer output characteristics in the generation process of the output result of the trained teacher machine learning model.
Step S320: adjusting parameters of a student machine learning model according to a loss function so that an output result of the student machine learning model for the training data matches an output result of the trained teacher machine learning model.
Wherein the loss function comprises a first part and a second part, the first part of the loss function is determined according to the output result of the trained teacher machine learning model, and the second part of the loss function is determined according to the intermediate layer output characteristics generated by the intermediate layer included in the trained teacher machine learning model.
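A minimal sketch of one student update covering steps S300 to S320 is given below: the teacher is assumed to be already trained, its outputs supervise the student, and loss_fn stands for the loss function with the first and second parts described above. All names and signatures are assumptions for illustration.

```python
import torch

def distillation_step(teacher, student, optimizer, images, labels, loss_fn):
    """One parameter update of the student under the guidance of a trained teacher."""
    teacher.eval()
    with torch.no_grad():                  # the trained teacher is not updated (step S310)
        teacher_logits = teacher(images)
    student_logits = student(images)
    loss = loss_fn(student_logits, teacher_logits, labels)  # step S320: loss with both parts
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```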
Referring to fig. 4, fig. 4 shows a schematic flow diagram of a knowledge distillation method according to another embodiment of the present application. As shown in fig. 4, the method includes the following steps.
Step S400: respectively training a plurality of teacher machine learning models according to the training data, so as to obtain a plurality of corresponding trained teacher machine learning models.
Here, each of the plurality of teacher machine learning models includes the same number of intermediate layers as the student machine learning model, and the bit widths used for the weights and activation values of each teacher machine learning model are greater than the bit widths used for the weights and activation values of the student machine learning model.
Step S410: inputting the training data into the plurality of trained teacher machine learning models respectively, so as to obtain the output result corresponding to each of the plurality of trained teacher machine learning models.
Wherein, for each of the plurality of trained teacher machine learning models: the plurality of intermediate layers included in the particular trained teacher machine learning model each generate a corresponding intermediate layer output feature in a process in which the particular trained teacher machine learning model generates an output result of the particular trained teacher machine learning model.
Step S420: adjusting parameters of a student machine learning model according to a loss function so that an output result of the student machine learning model for the training data matches an output result of the trained teacher machine learning model.
Here, the plurality of trained teacher machine learning models are in one-to-one correspondence with a plurality of first weight coefficients in a first weight coefficient set; the output results of the plurality of trained teacher machine learning models are weighted and summed according to the respective corresponding first weight coefficients to obtain the classification layer output result of the multi-bit teacher machine learning model; and the first part of the loss function comprises a Kullback-Leibler (KL) divergence determined based on the classification layer output result of the multi-bit teacher machine learning model and the output result of the student machine learning model for the training data.
Wherein the plurality of intermediate layers included in each of the plurality of trained teacher machine learning models are in one-to-one correspondence with the plurality of second weight coefficients in the second weight coefficient set. Wherein for one or more of a plurality of intermediate layers comprised by the student machine learning model: determining intermediate layer output characteristics of a multi-bit teacher machine learning model corresponding to a specific intermediate layer of the student machine learning model, wherein the intermediate layer output characteristics of the multi-bit teacher machine learning model are obtained by weighting and summing intermediate layer output characteristics generated by the intermediate layers of the plurality of trained teacher machine learning models which are positioned at the same level as the specific intermediate layer according to respective corresponding second weight coefficients; the second part of the loss function includes distance losses determined based on the mid-level output features of the multi-bit teacher machine learning model and mid-level output features generated by a particular mid-level of the student machine learning model after the point convolution layer conversion.
Wherein the second part of the loss function further comprises summing distance losses corresponding to each of a plurality of intermediate layers included in the student machine learning model after smoothing.
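To illustrate the first part of the loss in the multi-teacher setting, the sketch below weights the classification layer outputs of the trained teacher models by their first weight coefficients to form the classification layer output of the multi-bit teacher model, and then takes the KL divergence against the student output for the same training data; the temperature T is an assumption.

```python
import torch
import torch.nn.functional as F

def multibit_teacher_kl(teacher_logits_list, delta_0, student_logits, T=4.0):
    """First part of the loss (sketch): KL divergence between the weighted multi-teacher
    classification output and the student's output for the same training data."""
    stacked = torch.stack(teacher_logits_list, dim=0)   # [M, batch, classes]
    agg = torch.einsum("m,mbc->bc", delta_0, stacked)   # classification output of the multi-bit teacher
    return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(agg / T, dim=1),
                    reduction="batchmean") * (T * T)
```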
Step S430: deriving the first weight coefficient by the loss function to obtain a first gradient; deriving the second weight coefficient by the loss function to obtain a second gradient; and performing back propagation through the first gradient and the second gradient respectively, thereby dynamically adjusting the first weight coefficient and the second weight coefficient respectively.
Wherein the first gradient is a gradient of the loss function with respect to a classification layer output result of the multi-bit teacher machine learning model, the first gradient determined from the KL divergence included in the first portion of the loss function; the second gradient is a gradient of the loss function with respect to an intermediate layer output feature of the multi-bit teacher machine learning model, the second gradient determined from the distance loss included in a second portion of the loss function.
Referring to figs. 1-4, in some exemplary embodiments, the teacher machine learning model and the student machine learning model may be fully connected neural networks, in which case the corresponding intermediate layer outputs are the outputs of fully connected layers; or the teacher machine learning model and the student machine learning model may be convolutional neural networks, in which case the corresponding intermediate layer outputs are the outputs of convolutional layers; or the teacher machine learning model and the student machine learning model may be recurrent neural networks (RNNs); or the teacher machine learning model and the student machine learning model may be other models having the same structure, as long as the knowledge distillation method and system described in the embodiments of the present application are applicable. These choices may be adjusted according to the specific application scenario and are not specifically limited herein.
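When the models are convolutional neural networks, the intermediate layer output features can be collected with forward hooks, as in the sketch below; the toy architecture and the hook bookkeeping are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Toy CNN whose convolutional outputs serve as intermediate layer output features.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)

features = {}

def make_hook(name):
    def hook(module, inputs, output):
        features[name] = output            # capture the intermediate layer output feature
    return hook

for idx, layer in enumerate(model):
    if isinstance(layer, nn.Conv2d):
        layer.register_forward_hook(make_hook(f"conv{idx}"))

logits = model(torch.randn(2, 3, 32, 32))
print({name: feat.shape for name, feat in features.items()})
```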
The embodiments provided herein may be implemented in any one or combination of hardware, software, firmware, or solid state logic circuitry, and may be implemented in connection with signal processing, control, and/or application specific circuitry. Particular embodiments of the present application provide an apparatus or device that may include one or more processors (e.g., microprocessors, controllers, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), etc.) that process various computer-executable instructions to control the operation of the apparatus or device. Particular embodiments of the present application provide an apparatus or device that can include a system bus or data transfer system that couples the various components together. A system bus can include any of a variety of different bus structures or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. The devices or apparatuses provided in the embodiments of the present application may be provided separately, or may be part of a system, or may be part of other devices or apparatuses.
Particular embodiments provided herein may include or be combined with computer-readable storage media, such as one or more storage devices capable of providing non-transitory data storage. The computer-readable storage medium/storage device may be configured to store data, programs, and/or instructions that, when executed by a processor of an apparatus or device provided by embodiments of the present application, cause the apparatus or device to perform operations associated therewith. The computer-readable storage medium/storage device may include one or more of the following features: volatile, non-volatile, dynamic, static, read/write, read-only, random access, sequential access, location addressability, file addressability, and content addressability. In one or more exemplary embodiments, the computer-readable storage medium/storage device may be integrated into a device or apparatus provided in the embodiments of the present application or belong to a common system. The computer-readable storage medium/storage device may include optical, semiconductor, and/or magnetic memory devices, etc., and may also include Random Access Memory (RAM), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable disk, a recordable and/or rewritable Compact Disc (CD), a Digital Versatile Disc (DVD), a mass storage media device, or any other form of suitable storage media.
The above is an implementation manner of the embodiments of the present application, and it should be noted that the steps in the method described in the embodiments of the present application may be sequentially adjusted, combined, and deleted according to actual needs. In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. It is to be understood that the embodiments of the present application and the structures shown in the drawings are not to be construed as particularly limiting the devices or systems concerned. In other embodiments of the present application, an apparatus or system may include more or fewer components than the specific embodiments and figures, or may combine certain components, or may separate certain components, or may have a different arrangement of components. Those skilled in the art will understand that various modifications and changes may be made in the arrangement, operation, and details of the methods and apparatus described in the specific embodiments without departing from the spirit and scope of the embodiments herein; without departing from the principles of embodiments of the present application, several improvements and modifications may be made, and such improvements and modifications are also considered to be within the scope of the present application.

Claims (29)

1. A method of training a neural network by knowledge distillation, the method comprising:
training at least one teacher machine learning model according to the training data to obtain a trained teacher machine learning model;
inputting the training data into the trained teacher machine learning model so as to obtain an output result of the trained teacher machine learning model, wherein an intermediate layer included in the trained teacher machine learning model generates a corresponding intermediate layer output feature in a generation process of the output result of the trained teacher machine learning model; and
and adjusting parameters of a student machine learning model according to a loss function, so that the difference between the output result of the student machine learning model aiming at the training data and the output result of the trained teacher machine learning model is smaller than a preset threshold value, wherein the loss function comprises a first part and a second part, the first part of the loss function is determined according to the output result of the trained teacher machine learning model, and the second part of the loss function is determined according to the intermediate layer output characteristics generated by the intermediate layer included in the trained teacher machine learning model.
2. The method of claim 1, wherein the at least one teacher machine learning model uses a bit width for weights and activation values that is greater than a bit width for weights and activation values of the student machine learning models.
3. The method of claim 1, wherein the first portion of the loss function comprises a Kullback-Leibler (KL) divergence determined based on the output of the trained teacher machine learning model and the output of the student machine learning model for the training data.
4. The method of claim 3, wherein the number of middle layers included in the student machine learning model is the same as the number of middle layers included in the at least one teacher machine learning model, the middle layers included in the student machine learning model generate corresponding middle layer output features during generation of the output results of the student machine learning model, and the second part of the loss function includes a distance loss determined based on the middle layer output features generated by the middle layers included in the trained teacher machine learning model and the middle layer output features generated by the middle layers included in the student machine learning model after the point convolution layer conversion.
5. The method of claim 4, wherein the loss function further comprises a third portion comprising cross-entropy losses determined for output results of the training data and true labels of the training data based on the student machine learning model.
6. The method of claim 1, wherein the at least one teacher machine learning model comprises a plurality of teacher machine learning models, wherein the number of intermediate layers included in each of the plurality of teacher machine learning models and the number of intermediate layers included in the student machine learning models are the same, wherein training the at least one teacher machine learning model to obtain a trained teacher machine learning model based on the training data comprises:
and respectively training the plurality of teacher machine learning models according to the training data so as to obtain a plurality of corresponding trained teacher machine learning models.
7. The method of claim 6, wherein the bit widths used for the weights and activation values of the respective teacher machine learning models are greater than the bit widths used for the weights and activation values of the student machine learning models.
8. The method of claim 7, wherein inputting the training data into the trained teacher machine learning model to obtain output results of the trained teacher machine learning model comprises:
inputting the training data into the plurality of trained teacher machine learning models respectively to obtain output results of the plurality of trained teacher machine learning models respectively corresponding to each,
wherein, for each of the plurality of trained teacher machine learning models:
the plurality of intermediate layers included in the particular trained teacher machine learning model each generate a corresponding intermediate layer output feature in a process in which the particular trained teacher machine learning model generates an output result of the particular trained teacher machine learning model.
9. The method of claim 8, wherein the plurality of trained teacher machine learning models are in one-to-one correspondence with a plurality of first weight coefficients in a first set of weight coefficients, wherein the output results of the plurality of trained teacher machine learning models are weighted and summed according to the respective corresponding first weight coefficients to obtain a classification layer output result of a multi-bit teacher machine learning model, and wherein the first portion of the loss function comprises a Kullback-Leibler (KL) divergence determined based on the classification layer output result of the multi-bit teacher machine learning model and the output result of the student machine learning model for the training data.
10. The method of claim 9, wherein the plurality of intermediate layers included in each of the plurality of trained teacher machine learning models has a one-to-one correspondence with a plurality of second weight coefficients in a second set of weight coefficients,
wherein for one or more of a plurality of intermediate layers comprised by the student machine learning model:
determining intermediate layer output characteristics of a multi-bit teacher machine learning model corresponding to a specific intermediate layer of the student machine learning model, wherein the intermediate layer output characteristics of the multi-bit teacher machine learning model are obtained by weighting and summing intermediate layer output characteristics generated by the intermediate layers of the plurality of trained teacher machine learning models which are positioned at the same level as the specific intermediate layer according to respective corresponding second weight coefficients;
the second part of the loss function includes distance losses determined based on the mid-level output features of the multi-bit teacher machine learning model and mid-level output features generated by a particular mid-level of the student machine learning model after the point convolution layer conversion.
11. The method of claim 10, wherein the second portion of the loss function further comprises summing distance losses corresponding to each of a plurality of intermediate layers included in the student machine learning model after smoothing.
12. The method of claim 11, wherein the loss function further comprises a third portion comprising cross-entropy losses determined for output results of the training data and true labels of the training data based on the student machine learning model.
13. The method of claim 12, further comprising:
deriving the first weight coefficient by the loss function to obtain a first gradient, wherein the first gradient is a gradient of the loss function with respect to a classification layer output result of the multi-bit teacher machine learning model, the first gradient being determined according to the KL divergence included in a first portion of the loss function;
deriving the second weight coefficient by the loss function to obtain a second gradient, wherein the second gradient is a gradient of the loss function with respect to an intermediate layer output feature of the multi-bit teacher machine learning model, the second gradient being determined from the distance loss included in a second portion of the loss function;
and performing back propagation through the first gradient and the second gradient respectively, thereby dynamically adjusting the first weight coefficient and the second weight coefficient respectively.
14. The method according to any one of claims 1-13, wherein the weight and activation value of the student machine learning model are both binarized.
15. A knowledge distillation system, characterized in that the knowledge distillation system comprises:
a plurality of teacher machine learning models, an
A student machine learning model;
the number of the middle layers included in the teacher machine learning models is the same as that of the middle layers included in the student machine learning models;
bit widths adopted by the weights and the activation values of the teacher machine learning models are all larger than bit widths adopted by the weights and the activation values of the student machine learning models;
wherein parameters of the student machine learning model are adjusted according to a loss function, so that the difference value between the output result of the student machine learning model for the training data and the output result of the plurality of teacher machine learning models for the training data is smaller than a preset threshold value;
the first part of the loss function comprises Kullback-Leibler (KL) divergence determined based on the output results of a classification layer of a multi-bit teacher machine learning model and the output results of the student machine learning model aiming at the training data, and the output results of the classification layer of the multi-bit teacher machine learning model are obtained by weighting and summing the output results of the plurality of teacher machine learning models aiming at the training data according to respective corresponding first weight coefficients;
the second part of the loss function comprises distance loss determined based on intermediate layer output characteristics of a multi-bit teacher machine learning model and intermediate layer output characteristics generated by a specific intermediate layer of the student machine learning model after point convolution layer conversion, and the intermediate layer output characteristics of the multi-bit teacher machine learning model are obtained by weighting and summing the intermediate layer output characteristics generated by the intermediate layers of the plurality of teacher machine learning models which are positioned at the same level as the specific intermediate layer according to respective corresponding second weight coefficients;
wherein a third portion of the loss function includes cross-entropy losses determined for output results of the training data and true labels of the training data based on the student machine learning model.
16. The system of claim 15,
a first gradient obtained by deriving the first weight coefficient by the loss function, wherein the first gradient is a gradient of the loss function with respect to a classification layer output result of the multi-bit teacher machine learning model, the first gradient being determined according to the KL divergence included in a first part of the loss function,
a second gradient obtained by deriving the second weight coefficient by the loss function, wherein the second gradient is a gradient of the loss function with respect to an intermediate layer output feature of the multi-bit teacher machine learning model, the second gradient being determined from the distance loss included in a second part of the loss function,
the first weight coefficient and the second weight coefficient are dynamically adjusted by counter-propagating through the first gradient and the second gradient, respectively.
17. A computer-readable storage medium holding computer instructions that, when executed by a processor, cause the processor to:
training at least one teacher machine learning model according to the training data to obtain a trained teacher machine learning model;
inputting the training data into the trained teacher machine learning model so as to obtain an output result of the trained teacher machine learning model, wherein an intermediate layer included in the trained teacher machine learning model generates a corresponding intermediate layer output feature in a generation process of the output result of the trained teacher machine learning model; and
and adjusting parameters of a student machine learning model according to a loss function, so that the difference between the output result of the student machine learning model aiming at the training data and the output result of the trained teacher machine learning model is smaller than a preset threshold value, wherein the loss function comprises a first part and a second part, the first part of the loss function is determined according to the output result of the trained teacher machine learning model, and the second part of the loss function is determined according to the intermediate layer output characteristics generated by the intermediate layer included in the trained teacher machine learning model.
18. The computer-readable storage medium of claim 17, wherein the at least one teacher machine learning model uses a bit width for weights and activation values that is greater than a bit width for weights and activation values of the student machine learning models.
19. The computer-readable storage medium of claim 17, wherein the first portion of the loss function comprises a Kullback-Leibler (KL) divergence determined based on the output of the trained teacher machine learning model and the output of the student machine learning model for the training data.
20. The computer-readable storage medium of claim 19, wherein the number of intermediate layers included in the student machine learning model is the same as the number of intermediate layers included in the at least one teacher machine learning model, wherein the intermediate layers included in the student machine learning model generate corresponding intermediate layer output features during generation of the output results of the student machine learning model, and wherein the second part of the loss function comprises distance losses determined based on the intermediate layer output features generated by the intermediate layers included in the trained teacher machine learning model and the intermediate layer output features generated by the intermediate layers included in the student machine learning model after the point convolution layer conversion.
21. The computer-readable storage medium of claim 20, wherein the loss function further comprises a third portion, the third portion of the loss function comprising cross-entropy losses determined for output results of the training data and true labels of the training data based on the student machine learning model.
22. The computer-readable storage medium of claim 17, wherein the at least one teacher machine learning model comprises a plurality of teacher machine learning models, wherein a number of intermediate layers included in each of the plurality of teacher machine learning models and a number of intermediate layers included in the student machine learning models are the same, wherein training at least one teacher machine learning model to obtain a trained teacher machine learning model based on the training data comprises:
and respectively training the plurality of teacher machine learning models according to the training data so as to obtain a plurality of corresponding trained teacher machine learning models.
23. The computer-readable storage medium of claim 22, wherein the respective weights and activation values of the plurality of teacher machine learning models each employ a bit width that is greater than a bit width employed by the weights and activation values of the student machine learning models.
24. The computer-readable storage medium of claim 23, wherein inputting the training data into the trained teacher machine learning model to obtain output results of the trained teacher machine learning model comprises:
inputting the training data into the plurality of trained teacher machine learning models respectively to obtain output results of the plurality of trained teacher machine learning models respectively corresponding to each,
wherein, for each of the plurality of trained teacher machine learning models:
the plurality of intermediate layers included in the particular trained teacher machine learning model each generate a corresponding intermediate layer output feature in a process in which the particular trained teacher machine learning model generates an output result of the particular trained teacher machine learning model.
25. The computer-readable storage medium of claim 24, wherein the plurality of trained teacher machine learning models are in one-to-one correspondence with a plurality of first weight coefficients in a first set of weight coefficients, wherein the output results of the plurality of trained teacher machine learning models are weighted and summed according to the respective corresponding first weight coefficients to obtain a classification layer output result of a multi-bit teacher machine learning model, and wherein the first portion of the loss function comprises a Kullback-Leibler (KL) divergence determined based on the classification layer output result of the multi-bit teacher machine learning model and the output result of the student machine learning model for the training data.
26. The computer-readable storage medium of claim 25, wherein the plurality of intermediate layers included in each of the plurality of trained teacher machine learning models are in one-to-one correspondence with a plurality of second weight coefficients in a second set of weight coefficients,
wherein for one or more of a plurality of intermediate layers comprised by the student machine learning model:
determining intermediate layer output characteristics of a multi-bit teacher machine learning model corresponding to a specific intermediate layer of the student machine learning model, wherein the intermediate layer output characteristics of the multi-bit teacher machine learning model are obtained by weighting and summing intermediate layer output characteristics generated by the intermediate layers of the plurality of trained teacher machine learning models which are positioned at the same level as the specific intermediate layer according to respective corresponding second weight coefficients;
the second part of the loss function includes distance losses determined based on the mid-level output features of the multi-bit teacher machine learning model and mid-level output features generated by a particular mid-level of the student machine learning model after the point convolution layer conversion.
27. The computer-readable storage medium of claim 26, wherein the second portion of the loss function further comprises summing distance losses corresponding to each of a plurality of intermediate layers included in the student machine learning model after smoothing.
28. The computer-readable storage medium of claim 27, wherein the loss function further comprises a third portion, the third portion of the loss function comprising cross-entropy losses determined for output results of the training data and true labels of the training data based on the student machine learning model.
29. The computer-readable storage medium of claim 28, wherein the processor further performs the following:
deriving the first weight coefficient by the loss function to obtain a first gradient, wherein the first gradient is a gradient of the loss function with respect to a classification layer output result of the multi-bit teacher machine learning model, the first gradient being determined according to the KL divergence included in a first portion of the loss function;
deriving the second weight coefficient by the loss function to obtain a second gradient, wherein the second gradient is a gradient of the loss function with respect to an intermediate layer output feature of the multi-bit teacher machine learning model, the second gradient being determined from the distance loss included in a second portion of the loss function;
and performing back propagation through the first gradient and the second gradient respectively, thereby dynamically adjusting the first weight coefficient and the second weight coefficient respectively.
CN202011273058.5A 2020-11-13 2020-11-13 Knowledge distillation method and system Pending CN112508169A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011273058.5A CN112508169A (en) 2020-11-13 2020-11-13 Knowledge distillation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011273058.5A CN112508169A (en) 2020-11-13 2020-11-13 Knowledge distillation method and system

Publications (1)

Publication Number Publication Date
CN112508169A true CN112508169A (en) 2021-03-16

Family

ID=74957746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011273058.5A Pending CN112508169A (en) 2020-11-13 2020-11-13 Knowledge distillation method and system

Country Status (1)

Country Link
CN (1) CN112508169A (en)



Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107247989A (en) * 2017-06-15 2017-10-13 北京图森未来科技有限公司 A kind of neural network training method and device
CN108764462A (en) * 2018-05-29 2018-11-06 成都视观天下科技有限公司 A kind of convolutional neural networks optimization method of knowledge based distillation
CN111105008A (en) * 2018-10-29 2020-05-05 富士通株式会社 Model training method, data recognition method and data recognition device
US20200334538A1 (en) * 2019-04-16 2020-10-22 Microsoft Technology Licensing, Llc Conditional teacher-student learning for model training
CN111062951A (en) * 2019-12-11 2020-04-24 华中科技大学 Knowledge distillation method based on semantic segmentation intra-class feature difference
CN111160409A (en) * 2019-12-11 2020-05-15 浙江大学 Heterogeneous neural network knowledge reorganization method based on common feature learning
CN111199242A (en) * 2019-12-18 2020-05-26 浙江工业大学 Image increment learning method based on dynamic correction vector
CN111091109A (en) * 2019-12-24 2020-05-01 厦门瑞为信息技术有限公司 Method, system and equipment for predicting age and gender based on face image
CN111461212A (en) * 2020-03-31 2020-07-28 中国科学院计算技术研究所 Compression method for point cloud target detection model
CN111598213A (en) * 2020-04-01 2020-08-28 北京迈格威科技有限公司 Network training method, data identification method, device, equipment and medium
CN111709476A (en) * 2020-06-17 2020-09-25 浪潮集团有限公司 Knowledge distillation-based small classification model training method and device
CN111882031A (en) * 2020-06-30 2020-11-03 华为技术有限公司 Neural network distillation method and device

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022217853A1 (en) * 2021-04-16 2022-10-20 Huawei Technologies Co., Ltd. Methods, devices and media for improving knowledge distillation using intermediate representations
CN113139488A (en) * 2021-04-29 2021-07-20 北京百度网讯科技有限公司 Method and device for training segmented neural network
CN113139488B (en) * 2021-04-29 2024-01-12 北京百度网讯科技有限公司 Method and device for training segmented neural network
CN113255763A (en) * 2021-05-21 2021-08-13 平安科技(深圳)有限公司 Knowledge distillation-based model training method and device, terminal and storage medium
CN113255763B (en) * 2021-05-21 2023-06-09 平安科技(深圳)有限公司 Model training method, device, terminal and storage medium based on knowledge distillation
CN113360616A (en) * 2021-06-04 2021-09-07 科大讯飞股份有限公司 Automatic question-answering processing method, device, equipment and storage medium
WO2022257614A1 (en) * 2021-06-10 2022-12-15 北京百度网讯科技有限公司 Training method and apparatus for object detection model, and image detection method and apparatus
CN113326940A (en) * 2021-06-25 2021-08-31 江苏大学 Knowledge distillation method, device, equipment and medium based on multiple knowledge migration
CN113281048A (en) * 2021-06-25 2021-08-20 华中科技大学 Rolling bearing fault diagnosis method and system based on relational knowledge distillation
CN113505614A (en) * 2021-07-29 2021-10-15 沈阳雅译网络技术有限公司 Small model training method for small CPU equipment
CN113610232B (en) * 2021-09-28 2022-02-22 苏州浪潮智能科技有限公司 Network model quantization method and device, computer equipment and storage medium
CN113610232A (en) * 2021-09-28 2021-11-05 苏州浪潮智能科技有限公司 Network model quantization method and device, computer equipment and storage medium
WO2023050707A1 (en) * 2021-09-28 2023-04-06 苏州浪潮智能科技有限公司 Network model quantization method and apparatus, and computer device and storage medium
CN113610069A (en) * 2021-10-11 2021-11-05 北京文安智能技术股份有限公司 Knowledge distillation-based target detection model training method
CN113947196A (en) * 2021-10-25 2022-01-18 中兴通讯股份有限公司 Network model training method and device and computer readable storage medium
WO2023071743A1 (en) * 2021-10-25 2023-05-04 中兴通讯股份有限公司 Network model training method and apparatus, and computer-readable storage medium
CN113920540A (en) * 2021-11-04 2022-01-11 厦门市美亚柏科信息股份有限公司 Knowledge distillation-based pedestrian re-identification method, device, equipment and storage medium
CN114358206A (en) * 2022-01-12 2022-04-15 合肥工业大学 Binary neural network model training method and system, and image processing method and system
WO2023155183A1 (en) * 2022-02-21 2023-08-24 Intel Corporation Systems, apparatus, articles of manufacture, and methods for teacher-free self-feature distillation training of machine learning models
CN114610500A (en) * 2022-03-22 2022-06-10 重庆邮电大学 Edge caching method based on model distillation
CN114610500B (en) * 2022-03-22 2024-04-30 重庆邮电大学 Edge caching method based on model distillation
WO2024012255A1 (en) * 2022-07-11 2024-01-18 北京字跳网络技术有限公司 Semantic segmentation model training method and apparatus, electronic device, and storage medium
CN115601536A (en) * 2022-12-02 2023-01-13 荣耀终端有限公司(Cn) Image processing method and electronic equipment
CN116091895A (en) * 2023-04-04 2023-05-09 之江实验室 Model training method and device oriented to multitask knowledge fusion

Similar Documents

Publication Publication Date Title
CN112508169A (en) Knowledge distillation method and system
Fang et al. Post-training piecewise linear quantization for deep neural networks
US11657274B2 (en) Weakly-supervised semantic segmentation with self-guidance
US20200097818A1 (en) Method and system for training binary quantized weight and activation function for deep neural networks
CN110782008B (en) Training method, prediction method and device of deep learning model
Wang et al. Network pruning using sparse learning and genetic algorithm
EP3540654A1 (en) Learning classification device and learning classification method
KR102410820B1 (en) Method and apparatus for recognizing based on neural network and for training the neural network
EP3295381B1 (en) Augmenting neural networks with sparsely-accessed external memory
CN111723220A (en) Image retrieval method and device based on attention mechanism and Hash and storage medium
CN111400601B (en) Video recommendation method and related equipment
WO2016182671A1 (en) Fixed point neural network based on floating point neural network quantization
KR20200128938A (en) Model training method and apparatus, and data recognizing method
WO2021042857A1 (en) Processing method and processing apparatus for image segmentation model
US20230153631A1 (en) Method and apparatus for transfer learning using sample-based regularization
KR20220045424A (en) Method and apparatus of compressing artificial neural network
CN115511069A (en) Neural network training method, data processing method, device and storage medium
Gopalakrishnan et al. Sentiment analysis using simplified long short-term memory recurrent neural networks
CN113761868A (en) Text processing method and device, electronic equipment and readable storage medium
KR20190134965A (en) A method and system for training of neural networks
Pietron et al. Retrain or not retrain?-efficient pruning methods of deep cnn networks
CN112150497A (en) Local activation method and system based on binary neural network
He et al. Learned transferable architectures can surpass hand-designed architectures for large scale speech recognition
Peter et al. Resource-efficient dnns for keyword spotting using neural architecture search and quantization
CN110084356B (en) Deep neural network data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination