CN115618921B - Knowledge distillation method, apparatus, electronic device, and storage medium


Info

Publication number
CN115618921B
CN115618921B (application CN202211105070.4A)
Authority
CN
China
Prior art keywords: image sample, sample, sample set, image, samples
Legal status: Active
Application number
CN202211105070.4A
Other languages: Chinese (zh)
Other versions: CN115618921A (en)
Inventor
祝毅晨
杜杰
区志财
唐剑
Current Assignee
Midea Group Co Ltd
Midea Group Shanghai Co Ltd
Original Assignee
Midea Group Co Ltd
Midea Group Shanghai Co Ltd
Priority date
Filing date
Publication date
Application filed by Midea Group Co Ltd and Midea Group Shanghai Co Ltd
Priority to CN202211105070.4A
Publication of CN115618921A
Application granted
Publication of CN115618921B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The present application relates to the field of computer technology and provides a knowledge distillation method, apparatus, electronic device, and storage medium. The method comprises the following steps: determining a sample difficulty index of each image sample in an initial image sample set based on the initial image sample set and a teacher model; removing difficult samples from the initial image sample set based on the sample difficulty indexes of the image samples to obtain a target image sample set; and performing knowledge distillation on a student model based on the target image sample set. The method prevents the student model from being supervised, during knowledge distillation, by the labels of difficult samples that even the teacher model cannot distinguish, thereby improving the model training effect and the knowledge distillation efficiency of the neural network.

Description

Knowledge distillation method, apparatus, electronic device, and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a knowledge distillation method, a knowledge distillation apparatus, an electronic device, and a storage medium.
Background
Knowledge distillation is a class of neural network model compression tasks in which a pre-trained large model (the teacher model) supervises the training of an untrained small model (the student model).
However, in the current knowledge distillation process, if the student model is supervised by the labels of difficult samples that even the teacher model cannot distinguish, the model training effect is poor, so the knowledge distillation efficiency of neural networks remains low.
Disclosure of Invention
The present application aims to solve at least one of the technical problems in the related art. To this end, the present application proposes a knowledge distillation method that prevents the student model from being supervised, during knowledge distillation, by the labels of difficult samples that the teacher model itself cannot distinguish, thereby improving the model training effect and, in turn, the knowledge distillation efficiency of the neural network.
The application also provides a knowledge distillation apparatus, an electronic device, a storage medium and a computer program product.
A knowledge distillation method according to an embodiment of the first aspect of the present application comprises:
determining a sample difficulty index of each image sample in an initial image sample set based on the initial image sample set and a teacher model;
removing difficult samples from the initial image sample set based on the sample difficulty indexes of the image samples to obtain a target image sample set;
and performing knowledge distillation on a student model based on the target image sample set.
In the knowledge distillation method of the present application, the teacher model determines a sample difficulty index for each image sample in the image sample set, and difficult samples are removed from the set according to those indexes. The student model is thereby prevented from being supervised, during knowledge distillation, by the labels of difficult samples that the teacher model cannot distinguish, which improves the model training effect and, in turn, the knowledge distillation efficiency of the neural network.
According to one embodiment of the present application, the sample difficulty index includes any one of the loss value corresponding to an image sample, a gradient value based on that loss value, and the variance value of the gradient value;
the loss value corresponding to an image sample is the value of the teacher model's loss function after the image sample is input to the teacher model.
According to one embodiment of the present application, determining the sample difficulty index of each image sample in the initial image sample set based on the initial image sample set and the teacher model includes:
training the teacher model on the initial image sample set, and determining the sample difficulty index of each image sample in the initial image sample set after the training is completed, wherein the sample difficulty indexes are used to remove difficult samples, and the target image sample set remaining after the removal is used to retrain the teacher model.
According to one embodiment of the present application, determining the sample difficulty index of each image sample in the initial image sample set based on the initial image sample set and the teacher model includes:
in each round of parameter-iteration training of the teacher model, performing the steps of determining the sample difficulty index of each image sample in the initial image sample set based on the initial image sample set and the teacher model, and removing difficult samples from the initial image sample set based on those indexes to obtain a target image sample set;
and, in each round of parameter-iteration training of the teacher model, taking the target image sample set obtained in the previous round as the initial image sample set of the next round.
According to an embodiment of the present application, the sample difficulty index further includes the variance value of the predicted values corresponding to an image sample, which is determined by the following steps:
inputting each image sample in the initial image sample set into the teacher model multiple times to obtain the multiple predicted values for each image sample output by the teacher model;
and determining the sample difficulty index of each image sample based on the multiple predicted values corresponding to that image sample.
According to one embodiment of the present application, determining the sample difficulty index of each image sample based on the multiple predicted values corresponding to each image sample includes:
when the sample difficulty index is the variance value of an image sample's predicted values, determining the variance value of the multiple predicted values corresponding to each image sample;
and taking the variance value of the multiple predicted values corresponding to each image sample as the sample difficulty index of that image sample.
According to an embodiment of the present application, removing difficult samples from the initial image sample set based on the sample difficulty indexes of the image samples to obtain the target image sample set further includes:
when the sample difficulty index is the variance value of an image sample's predicted values, determining the total variance value of all predicted values of all image samples;
and removing from the initial image sample set the image samples whose variance value is larger than the total variance value, the remaining image samples forming the target image sample set.
A knowledge distillation apparatus according to an embodiment of the second aspect of the present application comprises:
a determining module configured to determine a sample difficulty index of each image sample in an initial image sample set based on the initial image sample set and a teacher model;
a rejecting module configured to remove difficult samples from the initial image sample set based on the sample difficulty indexes of the image samples to obtain a target image sample set;
and a distillation module configured to perform knowledge distillation on a student model based on the target image sample set.
An electronic device according to an embodiment of the third aspect of the present application comprises a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a knowledge distillation method as described in any of the above when executing the program.
A non-transitory computer readable storage medium according to an embodiment of the fourth aspect of the present application, having stored thereon a computer program which, when executed by a processor, implements a knowledge distillation method as described in any of the above.
A computer program product according to an embodiment of the fifth aspect of the present application comprises a computer program which, when executed by a processor, implements a knowledge distillation method as described above.
The above technical solutions in the embodiments of the present application produce at least the following technical effect:
the student model is prevented from being supervised, during knowledge distillation, by the labels of difficult samples that the teacher model cannot distinguish, which improves the model training effect and the knowledge distillation efficiency of the neural network.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a first schematic flowchart of the knowledge distillation method provided in an embodiment of the present application;
FIG. 2 is a second schematic flowchart of the knowledge distillation method provided in an embodiment of the present application;
FIG. 3 is a third schematic flowchart of the knowledge distillation method provided in an embodiment of the present application;
FIG. 4 is a fourth schematic flowchart of the knowledge distillation method provided in an embodiment of the present application;
FIG. 5 is a fifth schematic flowchart of the knowledge distillation method provided in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of the knowledge distillation apparatus provided in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in further detail below with reference to the accompanying drawings and examples. The following examples are illustrative of the present application but are not intended to limit the scope of the present application.
In the description of the embodiments of the present application, it should be noted that the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the examples herein, a first feature "on" or "under" a second feature may be either the first and second features in direct contact, or the first and second features in indirect contact via an intermediary, unless expressly stated and defined otherwise. Moreover, a first feature being "above," "over" and "on" a second feature may be a first feature being directly above or obliquely above the second feature, or simply indicating that the first feature is level higher than the second feature. The first feature being "under", "below" and "beneath" the second feature may be the first feature being directly under or obliquely below the second feature, or simply indicating that the first feature is less level than the second feature.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the embodiments of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Fig. 1 is the first schematic flowchart of the knowledge distillation method provided in an embodiment of the present application. As shown in Fig. 1, the knowledge distillation method includes:
Step 110: determining a sample difficulty index of each image sample in an initial image sample set based on the initial image sample set and a teacher model.
It should be noted that the execution subject of the knowledge distillation method provided in the embodiments of the present application may be a server or a computer device, such as a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a personal digital assistant (PDA).
In step 110, a plurality of image samples may be acquired to form the initial image sample set, and the teacher model may be acquired.
The image samples can be used for tasks such as image classification, object recognition, fine-grained image classification, and pedestrian re-identification.
The teacher model is generally a large model formed by a single complex network or an ensemble of several networks; it possesses good performance and generalization capability and can therefore be used to supervise the training of the student model. The student model is a model with a small network scale and limited expressive power.
It should be noted that the teacher model may be an untrained model.
After the initial image sample set and the teacher model are obtained, the teacher model can be trained on each image sample in the initial image sample set, and the loss value corresponding to each image sample during training, or the predicted value output by the teacher model for each image sample, can be obtained. The sample difficulty index of each image sample can then be determined from its loss value or predicted value.
If the image sample is used for image classification, the predicted value may be a predicted value of the image class; if the image sample is used for object recognition, the predicted value may be a predicted value of the object class, and so on.
The sample difficulty index may include any one of the loss value corresponding to the image sample, a gradient value based on that loss value, the variance value of the gradient value, and the variance value of the predicted values. The loss value corresponding to an image sample is the value of the teacher model's loss function after the image sample is input to the teacher model.
For example, the sample difficulty index may be the loss value corresponding to the image sample, a gradient value based on that loss value, the variance value of the corresponding gradient value, or the variance value of the corresponding predicted values; the sample difficulty index is not specifically limited here.
The gradient is obtained by differentiating the teacher model's loss function, and substituting the sample's loss into this gradient expression yields the gradient value.
It should be noted that when the teacher model is a neural network with dropout layers, the sample difficulty index may be the variance value of the predicted values.
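For illustration, the simplest of these indexes, the per-sample loss value, can be computed as in the following minimal sketch, which assumes a PyTorch classification teacher; the names `teacher` and `loader` are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def per_sample_loss(teacher, loader, device="cpu"):
    """Loss value of each image sample under the teacher model.

    The per-sample cross-entropy serves as the sample difficulty
    index: the harder the sample, the larger its loss.
    """
    teacher.eval()
    losses = []
    for images, labels in loader:
        logits = teacher(images.to(device))
        # reduction="none" keeps one loss value per sample
        loss = F.cross_entropy(logits, labels.to(device), reduction="none")
        losses.append(loss.cpu())
    return torch.cat(losses)  # shape: (num_samples,)
```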
Step 120: removing difficult samples from the initial image sample set based on the sample difficulty index of each image sample to obtain a target image sample set.
After the sample difficulty index of each image sample in the initial image sample set is obtained, it can serve as an evaluation index of each sample's difficulty: the image samples with higher difficulty are removed from the initial image sample set, and the set formed by the remaining image samples is taken as the target image sample set.
The number of removed image samples can be set according to actual requirements.
For example, the image samples with higher loss values may be removed from the initial image sample set, the remaining image samples forming the target image sample set.
Alternatively, the image samples whose loss values have higher gradient values may be removed from the initial image sample set, the remaining image samples forming the target image sample set.
The image samples whose gradient values have larger variance values may be removed from the initial image sample set, the remaining image samples forming the target image sample set.
The image samples whose predicted values have larger variance values may be removed from the initial image sample set, the remaining image samples forming the target image sample set.
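As a sketch of this removal step under a simple "top x percent hardest" policy, assuming `difficulty` is a tensor of per-sample indexes such as the one produced by the previous sketch (the rejection ratio is an assumed placeholder):

```python
import torch

def reject_hardest(difficulty, reject_ratio=0.05):
    """Remove the hardest samples; return indices of the target set.

    Samples whose difficulty index (loss value, gradient value, or
    variance value) falls in the top `reject_ratio` fraction are
    rejected; the remaining indices form the target image sample set.
    """
    num_reject = int(len(difficulty) * reject_ratio)
    if num_reject == 0:
        return torch.arange(len(difficulty))
    order = torch.argsort(difficulty)  # ascending: hardest samples last
    return order[:-num_reject]
```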
Step 130: performing knowledge distillation on the student model based on the target image sample set.
After the target image sample set is obtained, the teacher model can be trained again on the image samples in the target image sample set, completing either overall difficult-sample removal or local dynamic difficult-sample removal and yielding a trained teacher model. This ensures that simple samples contribute more to the update gradients, so that in the subsequent knowledge distillation the knowledge of the teacher model can effectively supervise the student model. The knowledge distillation requires a preset distillation method and the outputs of the trained teacher model; the distillation method is not specifically limited here.
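Since the distillation method is left open, the following sketch shows only one common choice, the temperature-scaled soft-label loss in the style of Hinton et al.; the temperature `T` and weight `alpha` are assumed hyperparameters, not values fixed by this application.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """One common distillation loss: KL divergence between
    temperature-softened teacher and student distributions, blended
    with the ordinary hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```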
The knowledge distillation method provided in the embodiments of the present application determines a sample difficulty index for each image sample in the initial image sample set based on the initial image sample set and the teacher model, removes difficult samples from the initial image sample set based on those indexes to obtain a target image sample set, and performs knowledge distillation on the student model based on the target image sample set. The student model is thereby prevented from being supervised, during knowledge distillation, by the labels of difficult samples that the teacher model cannot distinguish, which improves the model training effect and, in turn, the knowledge distillation efficiency of the neural network.
Based on the above embodiment, Fig. 2 is the second schematic flowchart of the knowledge distillation method provided in an embodiment of the present application. As shown in Fig. 2, determining the sample difficulty index of each image sample in the initial image sample set based on the initial image sample set and the teacher model includes:
Step A: training the teacher model on the initial image sample set, and determining the sample difficulty index of each image sample in the initial image sample set after the training is completed, wherein the sample difficulty indexes are used to remove difficult samples, and the target image sample set remaining after the removal is used to retrain the teacher model.
After the initial image sample set and the teacher model are obtained, the image samples in the initial image sample set may be divided into batches, each batch being used for one round of training of the teacher model so as to optimize the parameters of the teacher model.
After the multi-round training of the teacher model is completed, the sample difficulty indexes of the image samples in each round of training can be obtained; alternatively, the sample difficulty indexes of the image samples used in the last round of training can be obtained.
After the sample difficulty index of each image sample in the initial image sample set is obtained, difficult samples can be removed according to those indexes, and the remaining image samples are taken as the target image sample set, on which the teacher model is retrained; this realizes the overall difficult-sample removal. A sketch of this pipeline follows below.
With this embodiment, overall difficult-sample removal can be performed on the image sample set, filtering out the samples of higher difficulty. The student model is thus prevented from being supervised, during knowledge distillation, by the labels of difficult samples that the teacher model cannot distinguish, which improves the model training effect and the knowledge distillation efficiency of the neural network.
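The following minimal sketch strings the earlier pieces together for this overall (one-shot) variant; `train` and `distill` are hypothetical helpers standing in for ordinary supervised training and for any distillation routine, such as the `kd_loss` sketch above.

```python
import torch
from torch.utils.data import DataLoader, Subset

def overall_rejection(teacher, student, dataset, reject_ratio=0.05):
    """Overall difficult-sample removal: train the teacher on the full
    initial set, score every sample once, remove the hardest samples,
    retrain the teacher on the target set, then distill the student."""
    loader = DataLoader(dataset, batch_size=64)    # fixed order, no shuffle
    train(teacher, loader)                         # hypothetical helper
    difficulty = per_sample_loss(teacher, loader)  # from the earlier sketch
    keep = reject_hardest(difficulty, reject_ratio)
    target_set = Subset(dataset, keep.tolist())
    train(teacher, DataLoader(target_set, batch_size=64))
    distill(student, teacher, target_set)          # hypothetical helper
```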
Based on the above embodiment, Fig. 3 is the third schematic flowchart of the knowledge distillation method provided in an embodiment of the present application. As shown in Fig. 3, determining the sample difficulty index of each image sample in the initial image sample set based on the initial image sample set and the teacher model includes:
Step B1: in each round of parameter-iteration training of the teacher model, performing the steps of determining the sample difficulty index of each image sample in the initial image sample set based on the initial image sample set and the teacher model, and removing difficult samples from the initial image sample set based on those indexes to obtain a target image sample set;
Step B2: in each round of parameter-iteration training of the teacher model, taking the target image sample set obtained in the previous round as the initial image sample set of the next round.
When training the teacher model, the image samples in the initial image sample set can be taken as one data set, and the teacher model can be trained on that data set for multiple rounds, i.e., multiple rounds of parameter iteration.
It should be noted that in each round of parameter-iteration training of the teacher model, the steps of determining the sample difficulty index of each image sample based on the initial image sample set and the teacher model, and removing difficult samples based on those indexes to obtain the target image sample set, are executed; the sample difficulty index and the removal procedure are as described above and/or below and are not repeated here.
In each round, once the round of model training and parameter iteration is completed, the sample difficulty indexes of the image samples used in that round are obtained, and difficult samples are removed from those image samples according to the indexes, yielding the target image sample set of that round.
Across rounds, the target image sample set obtained in the previous round serves as the initial image sample set of the next round: the initial image sample set of the second round is the set obtained after the first round's training and difficult-sample removal, while the initial image sample set of the first round is the initially acquired image sample set. Local dynamic difficult-sample removal is realized in this way, as sketched below.
With this embodiment, local dynamic difficult-sample removal can be performed on the image sample set, filtering out the samples of higher difficulty. The student model is thus prevented from being supervised, during knowledge distillation, by the labels of difficult samples that the teacher model cannot distinguish, which improves the model training effect and the knowledge distillation efficiency of the neural network.
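A sketch of this local dynamic variant, under the same assumptions as before; `train_one_epoch` is a hypothetical helper, and the batch size, round count, and rejection ratio are placeholders.

```python
import torch
from torch.utils.data import DataLoader, Subset

def local_dynamic_rejection(teacher, dataset, rounds=10, reject_ratio=0.01):
    """Per-round removal: the target image sample set produced by one
    round of parameter iteration becomes the initial set of the next."""
    current = dataset  # round 1 starts from the initial image sample set
    for _ in range(rounds):
        train_one_epoch(teacher, DataLoader(current, batch_size=64, shuffle=True))
        difficulty = per_sample_loss(teacher, DataLoader(current, batch_size=64))
        keep = reject_hardest(difficulty, reject_ratio)
        current = Subset(current, keep.tolist())  # next round's initial set
    return teacher, current
```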
Based on the above embodiments, the sample difficulty index is determined in any one of the following ways.
When the sample difficulty index is the loss value corresponding to the image sample, the loss value corresponding to each image sample is taken as that image sample's difficulty index.
Specifically, the type of the sample difficulty index is first determined. If it is the loss value corresponding to the image sample, the loss value can directly serve as the sample's evaluation index, so the loss value corresponding to each image sample is taken as the difficulty index of that image sample. For example, the loss value corresponding to the first image sample in the initial image sample set is taken as that sample's difficulty index, the loss value corresponding to the second image sample is taken as that sample's difficulty index, and so on; the remaining samples are not described in detail here.
When the sample difficulty index is a gradient value based on the loss value, the gradient value based on each image sample's loss value is determined and taken as that image sample's difficulty index.
If the type of the sample difficulty index is determined to be the gradient value of the loss value corresponding to the image sample, the gradient of the teacher model's loss function can be determined, and the gradient value of each image sample's loss value computed from it.
Each image sample's gradient value (i.e., the gradient value of its loss value) is then taken as that sample's difficulty index. For example, the gradient value corresponding to the first image sample in the initial image sample set is taken as that sample's difficulty index, the gradient value corresponding to the second image sample is taken as that sample's difficulty index, and so on; the remaining samples are not described in detail here.
When the sample difficulty index is the variance value of the gradient value, the gradient value based on each image sample's loss value is determined, the variance value of each image sample's gradient value is computed, and that variance value is taken as the sample's difficulty index.
If the type of the sample difficulty index is determined to be the variance value of the gradient values, the gradient of the teacher model's loss function can be determined, and the gradient value of each image sample's loss value computed from it together with that sample's loss value. The mean of the gradient values is then calculated, and the variance value of each image sample's gradient value is computed from that sample's gradient value and the mean. Each variance value is then taken as the difficulty index of the corresponding image sample. A sketch of one plausible computation follows below.
For example, the variance value of the gradient value corresponding to the first image sample in the initial image sample set is taken as that sample's difficulty index, the variance value of the gradient value corresponding to the second image sample is taken as that sample's difficulty index, and so on; the remaining samples are not described in detail here.
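The description does not spell out how the per-sample gradient value is computed. One plausible reading, sketched below purely as an assumption, takes the norm of the loss gradient with respect to each sample's logits, and the per-sample variance value as the squared deviation from the mean gradient value.

```python
import torch
import torch.nn.functional as F

def per_sample_gradient_values(teacher, images, labels):
    """Gradient value per sample: the norm of d(loss_i)/d(logits_i).
    Gradients w.r.t. the parameters would be another plausible reading."""
    logits = teacher(images)
    logits.retain_grad()  # logits is a non-leaf tensor
    F.cross_entropy(logits, labels, reduction="sum").backward()
    return logits.grad.norm(dim=1)  # one gradient value per sample

def gradient_variance_index(grad_values):
    """Per-sample variance value, read here as the squared deviation of
    each sample's gradient value from the mean over all samples."""
    return (grad_values - grad_values.mean()) ** 2
```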
Based on the above embodiments, Fig. 4 is the fourth schematic flowchart of the knowledge distillation method provided in an embodiment of the present application. As shown in Fig. 4, the sample difficulty index further includes the variance value of the predicted values corresponding to the image sample, which is determined by the following steps:
Step C1: inputting each image sample in the initial image sample set into the teacher model multiple times to obtain the multiple predicted values for each image sample output by the teacher model.
After the initial image sample set and the teacher model are obtained, each image sample in the initial image sample set can be input into the teacher model, the teacher model can be trained on the image samples, and, once training is completed, the predicted values output by the teacher model for each image sample can be collected. Each image sample in the initial image sample set thus corresponds to multiple predicted values: the first image sample corresponds to the multiple predicted values obtained from its multiple predictions, the second image sample corresponds to its multiple predicted values, and so on; the remaining samples are not described in detail here.
Step C2: determining the sample difficulty index of each image sample based on the multiple predicted values corresponding to each image sample.
After the predicted values corresponding to the image samples in the initial image sample set are obtained, the type of the sample difficulty index can be determined, and the sample difficulty index of each image sample determined from that type and the sample's predicted values.
The sample difficulty index may be the loss value corresponding to the image sample, a gradient value based on that loss value, the variance value of the gradient value, the variance value of the predicted values, or the like; one of these types may be chosen as the sample difficulty index in this embodiment according to actual requirements.
Step C2 includes:
when the sample difficulty index is the variance value of an image sample's predicted values, determining the variance value of the multiple predicted values corresponding to each image sample; and taking the variance value of the multiple predicted values corresponding to each image sample as that image sample's difficulty index.
If the teacher model is determined to be a neural network with dropout layers and the type of the sample difficulty index is the variance of the predicted values corresponding to the image sample, the variance value of the multiple predicted values is determined for each image sample: the mean of the predicted values is computed, and the variance is computed from that mean and the predicted values. Each image sample thus corresponds to one variance value of its predicted values. A sketch of this multiple-pass computation follows below.
After the variance values of the multiple predicted values corresponding to the image samples are obtained, each variance value is taken as the difficulty index of the corresponding image sample. For example, the variance value of the multiple predicted values corresponding to the first image sample in the initial image sample set is taken as that sample's difficulty index; the variance value of the multiple predicted values corresponding to the second image sample is taken as that sample's difficulty index; and so on; the remaining samples are not described in detail here.
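A minimal sketch of the multiple-pass prediction, assuming the teacher contains dropout layers; calling `train()` at inference time keeps dropout stochastic, which is what makes the passes differ (the number of passes is an assumed placeholder).

```python
import torch

@torch.no_grad()
def predictive_variance(teacher, images, num_passes=10):
    """Variance of the predicted values over several stochastic forward
    passes: each pass samples a different dropout sub-network, so a
    large spread marks a sample the teacher finds difficult."""
    teacher.train()  # keep dropout layers active at inference time
    preds = torch.stack(
        [teacher(images).softmax(dim=1) for _ in range(num_passes)]
    )                                   # (passes, batch, classes)
    return preds.var(dim=0).sum(dim=1)  # one variance value per sample
```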
In the knowledge distillation method provided in the embodiments of the present application, the teacher model determines the loss value corresponding to each image sample, a gradient value based on that loss value, the variance value of the gradient value, or the variance value of the predicted values as the sample difficulty index of each image sample in the initial image sample set. This facilitates the subsequent removal of difficult samples from the initial image sample set according to those indexes, prevents the student model from being supervised, during knowledge distillation, by the labels of difficult samples that the teacher model cannot distinguish, improves the model training effect, and in turn improves the knowledge distillation efficiency of the neural network.
Based on any of the above embodiments, Fig. 5 is the fifth schematic flowchart of the knowledge distillation method provided in an embodiment of the present application. As shown in Fig. 5, step 120 includes:
when the sample difficulty index is the variance value of the predicted values of an image sample, determining the total variance value of all predicted values of all image samples.
After the type of the sample difficulty index is determined to be the variance value of the predicted values (that is, the variance value of the predicted values corresponding to the image sample), the variance of all predicted values corresponding to all image samples in the initial image sample set can be computed as the total variance value: the mean of all predicted values is calculated, and the total variance value is obtained from that mean and each predicted value according to the variance formula.
The image samples whose variance value is larger than the total variance value are then removed from the initial image sample set, and the remaining image samples form the target image sample set.
That is, after the per-sample variance values and the total variance value are obtained, the variance value of each image sample's predicted values is compared with the total variance value; the image samples whose variance value exceeds the total variance value are removed, and the set formed by the remaining image samples is determined as the target image sample set, as sketched below.
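A sketch of this threshold rule, reusing the per-sample variances from the previous sketch; `all_preds` is assumed to stack every predicted value of every image sample into one tensor.

```python
import torch

def reject_by_total_variance(per_sample_var, all_preds):
    """Keep the samples whose predictive variance does not exceed the
    total variance computed over all predicted values of all samples."""
    total_var = all_preds.var()  # one scalar over every predicted value
    keep = per_sample_var <= total_var
    return keep.nonzero(as_tuple=True)[0]  # indices of the target set
```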
Step 120 further includes:
removing from the initial image sample set the image samples whose sample difficulty index falls within the index range corresponding to that type of index, and determining the remaining image samples as the target image sample set.
When the sample difficulty index is the loss value corresponding to the image sample, the corresponding index range is a preset loss value range; when the sample difficulty index is a gradient value based on the loss value, the corresponding index range is a preset gradient value range; when the sample difficulty index is the variance value of the gradient value, the corresponding index range is a preset variance value range.
Removing the image samples whose sample difficulty index falls within the corresponding index range and determining the remaining image samples as the target image sample set may comprise any one of the following.
When the sample difficulty index is the loss value corresponding to the image sample, the image samples whose loss value falls within the preset loss value range are removed from the initial image sample set, and the remaining image samples are determined as the target image sample set.
After the type of the sample difficulty index is determined to be the loss value corresponding to the image sample, the loss values corresponding to the image samples can be sorted, in either ascending or descending order, since the loss value corresponding to each image sample has already been obtained.
If ascending order is used, the preset loss value range may be the last x% of the sorted loss values, where x is a value set according to actual requirements, for example 3, 5, 10, or 20, and is not specifically limited here. The image samples corresponding to the last x% of loss values are removed, eliminating the high-difficulty samples, and the set formed by the remaining image samples is determined as the target image sample set.
If descending order is used, the preset loss value range may be the first x% of the sorted loss values, where x is set according to actual requirements, for example 3, 5, 10, or 20, and is not specifically limited here. The image samples corresponding to the first x% of loss values are removed, eliminating the high-difficulty samples, and the set formed by the remaining image samples is determined as the target image sample set.
When the sample difficulty index is a gradient value based on the loss value, the image samples whose gradient value falls within the preset gradient value range are removed from the initial image sample set, and the remaining image samples are determined as the target image sample set.
After the type of the sample difficulty index is determined to be the gradient value based on the loss value, the gradient values of the loss values corresponding to the image samples can be sorted, in either ascending or descending order, since the gradient value of each image sample's loss value has already been obtained.
If ascending order is used, the preset gradient value range may be the last x% of the sorted gradient values, where x is set according to actual requirements, for example 3, 5, 10, or 20, and is not specifically limited here. The image samples corresponding to the last x% of gradient values are removed, and the set formed by the remaining image samples is determined as the target image sample set.
If descending order is used, the preset gradient value range may be the first x% of the sorted gradient values, where x is set according to actual requirements, for example 3, 5, 10, or 20, and is not specifically limited here. The image samples corresponding to the first x% of gradient values are removed, and the set formed by the remaining image samples is determined as the target image sample set.
When the sample difficulty index is the variance value of the gradient value, the image samples whose variance value falls within the preset variance value range are removed from the initial image sample set, and the remaining image samples are determined as the target image sample set.
After the type of the sample difficulty index is determined to be the variance value of the gradient values (i.e., the variance value of the gradient value of the loss value corresponding to the image sample), the variance values corresponding to the image samples can be sorted, in either ascending or descending order, since the variance value of each image sample's gradient value has already been obtained.
If ascending order is used, the preset variance value range may be the last x% of the sorted variance values, where x is set according to actual requirements, for example 3, 5, 10, or 20, and is not specifically limited here. The image samples corresponding to the last x% of variance values are removed, and the set formed by the remaining image samples is determined as the target image sample set.
If descending order is used, the preset variance value range may be the first x% of the sorted variance values, where x is set according to actual requirements, for example 3, 5, 10, or 20, and is not specifically limited here. The image samples corresponding to the first x% of variance values are removed, and the set formed by the remaining image samples is determined as the target image sample set. All three variants reduce to the same quantile operation, as sketched below.
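A sketch of that shared quantile operation; the percentage `x` is an assumed placeholder.

```python
import torch

def reject_top_x_percent(index_values, x=5.0):
    """Reject the samples whose difficulty index (loss value, gradient
    value, or variance value) lies in the top x%, i.e. the 'preset
    range', and return the indices of the target image sample set."""
    threshold = torch.quantile(index_values, 1.0 - x / 100.0)
    keep = index_values <= threshold
    return keep.nonzero(as_tuple=True)[0]
```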
Based on the above embodiment, removing difficult samples from the initial image sample set based on the sample difficulty indexes to obtain the target image sample set further includes:
when the sample difficulty index is the variance value of the gradient value, determining a box-plot range based on the variance values of the gradient values corresponding to the image samples, removing from the initial image sample set the image samples whose variance value lies outside the box-plot range, and determining the remaining image samples as the target image sample set.
After the variance value of each image sample's gradient value is obtained and the type of the sample difficulty index is determined to be the variance value of the gradient value, the box-plot range can be determined from those variance values. A box plot (also called a box-and-whisker plot) is a statistical chart, named for its shape, that displays the dispersion of a group of data; it is used in many fields, commonly in quality management, mainly to show the distribution of raw data and to compare the distributions of multiple data groups. A box plot is drawn by first finding the upper edge, the lower edge, the median, and the two quartiles of the data; the box is drawn between the two quartiles, the upper and lower edges are connected to the box, and the median is placed inside the box. In this application, the box-plot range may be the range between the upper edge and the lower edge, which contains the median and the two quartiles.
Accordingly, among the image samples, those whose gradient-value variance lies outside the box-plot range can be identified as high-difficulty samples and removed, and the set formed by the remaining image samples is determined as the target image sample set, as sketched below.
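A sketch of the box-plot rule; the conventional 1.5 × IQR whiskers are an assumption, since the description only speaks of values outside the box-plot range.

```python
import torch

def reject_outside_boxplot(index_values):
    """Box-plot rejection: samples whose variance value falls outside
    [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as difficult and removed
    (the 1.5x whisker rule is the usual convention, assumed here)."""
    q1 = torch.quantile(index_values, 0.25)
    q3 = torch.quantile(index_values, 0.75)
    iqr = q3 - q1
    keep = (index_values >= q1 - 1.5 * iqr) & (index_values <= q3 + 1.5 * iqr)
    return keep.nonzero(as_tuple=True)[0]  # indices of the target set
```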
In the knowledge distillation method of the present application, the loss value corresponding to each image sample in the image sample set, a gradient value based on that loss value, the variance value of the gradient value, or the variance value of the predicted values serves as the sample difficulty index, and difficult samples are removed from the image sample set accordingly. The student model is thereby prevented from being supervised, during knowledge distillation, by the labels of difficult samples that the teacher model cannot distinguish, which improves the model training effect and the knowledge distillation efficiency of the neural network.
The knowledge distillation apparatus provided in the present application is described below, and the knowledge distillation apparatus described below and the knowledge distillation method described above may be referred to correspondingly to each other.
Fig. 6 is a schematic structural diagram of a knowledge distillation apparatus according to an embodiment of the present application, as shown in fig. 6, the knowledge distillation apparatus includes:
a determining module 110 configured to determine a sample difficulty index of each image sample in an initial image sample set based on the initial image sample set and a teacher model;
a rejecting module 120 configured to remove difficult samples from the initial image sample set based on the sample difficulty indexes of the image samples to obtain a target image sample set;
and a distillation module 130 configured to perform knowledge distillation on the student model based on the target image sample set.
The knowledge distillation apparatus provided in the embodiments of the present application determines a sample difficulty index for each image sample in the initial image sample set based on the initial image sample set and the teacher model, removes difficult samples from the initial image sample set based on those indexes to obtain a target image sample set, and performs knowledge distillation on the student model based on the target image sample set. The student model is thereby prevented from being supervised, during knowledge distillation, by the labels of difficult samples that the teacher model cannot distinguish, which improves the model training effect and, in turn, the knowledge distillation efficiency of the neural network.
Based on any of the above embodiments, the determining module 110 is specifically configured to:
training the teacher model on the initial image sample set, and determining the sample difficulty index of each image sample in the initial image sample set after the training is completed, wherein the sample difficulty indexes are used to remove difficult samples, and the target image sample set remaining after the removal is used to retrain the teacher model.
Based on any of the above embodiments, the determining module 110 is specifically further configured to:
in each round of parameter-iteration training of the teacher model, performing the steps of determining the sample difficulty index of each image sample in the initial image sample set based on the initial image sample set and the teacher model, and removing difficult samples from the initial image sample set based on those indexes to obtain a target image sample set;
and, in each round of parameter-iteration training of the teacher model, taking the target image sample set obtained in the previous round as the initial image sample set of the next round.
Based on any of the above embodiments, the determining module 110 includes a first determining unit, where the first determining unit is specifically configured to:
inputting each image sample in the initial image sample set into the teacher model multiple times to obtain the multiple predicted values for each image sample output by the teacher model;
and determining the sample difficulty index of each image sample based on the multiple predicted values corresponding to that image sample.
Based on any of the above embodiments, the first determining unit includes a second determining unit, the second determining unit being specifically configured to:
when the sample difficulty index is the variance value of an image sample's predicted values, determine the variance value of the multiple predicted values corresponding to each image sample;
and take the variance value of the multiple predicted values corresponding to each image sample as the sample difficulty index of that image sample.
Based on any one of the above embodiments, the second determining unit includes a third determining unit, and the third determining unit is specifically configured to:
when the sample difficulty index is the variance value of an image sample's predicted values, determine the total variance value of all predicted values of all image samples;
and remove from the initial image sample set the image samples whose variance value is larger than the total variance value, the remaining image samples forming the target image sample set.
Fig. 7 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 7, the electronic device may include a processor 710, a communication interface 720, a memory 730, and a communication bus 740, where the processor 710, the communication interface 720, and the memory 730 communicate with each other via the communication bus 740. The processor 710 may invoke logic instructions in the memory 730 to perform the knowledge distillation method, which comprises: determining a sample difficulty index of each image sample in an initial image sample set based on the initial image sample set and a teacher model;
removing difficult samples from the initial image sample set based on the sample difficulty indexes of the image samples to obtain a target image sample set;
and performing knowledge distillation on the student model based on the target image sample set.
Further, the above logic instructions in the memory 730 may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present application also provides a computer program product comprising a computer program, which may be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can perform the knowledge distillation method provided by the methods described above, the method comprising: determining a sample difficulty index of each image sample in an initial image sample set based on the initial image sample set and a teacher model;
removing difficult samples from the initial image sample set based on the sample difficulty indexes of the image samples to obtain a target image sample set;
and performing knowledge distillation on the student model based on the target image sample set.
In yet another aspect, the present application also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the knowledge distillation method provided above, the method comprising: determining a sample difficulty index of each image sample in an initial image sample set based on the initial image sample set and a teacher model;
performing difficult sample rejection on the initial image sample set based on the sample difficulty indexes of the image samples to obtain a target image sample set;
and performing knowledge distillation on the student model based on the target image sample set.
The apparatus embodiments described above are merely illustrative; units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present application, not to limit it. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions and are intended to be covered by the claims of the present application.

Claims (8)

1. A method of knowledge distillation, comprising:
determining a sample difficulty index of each image sample in an initial image sample set based on the initial image sample set and a teacher model, wherein the sample difficulty index comprises a variance value of predicted values corresponding to the image sample;
performing difficult sample rejection on the initial image sample set based on the sample difficulty indexes of the image samples to obtain a target image sample set;
performing knowledge distillation on the student model based on the target image sample set;
wherein the determining a sample difficulty index of each image sample in the initial image sample set based on the initial image sample set and the teacher model comprises:
in each round of parameter iterative training of the teacher model, executing the steps of determining the sample difficulty index of each image sample in the initial image sample set based on the initial image sample set and the teacher model, and performing difficult sample rejection on the initial image sample set based on the sample difficulty indexes of the image samples to obtain a target image sample set;
wherein, across the rounds of parameter iterative training of the teacher model, the target image sample set obtained in a previous round of parameter iterative training is taken as the initial image sample set of the next round of parameter iteration;
and wherein the performing difficult sample rejection on the initial image sample set based on the sample difficulty index of each image sample to obtain a target image sample set comprises:
when the sample difficulty index is the variance value of the predicted values of an image sample, determining a total variance value over the predicted values of all the image samples;
and removing from the initial image sample set the image samples whose variance value is larger than the total variance value, and determining the remaining image samples as the target image sample set.
2. The knowledge distillation method according to claim 1, wherein the sample difficulty index comprises any one of: a loss value corresponding to an image sample (see the sketch following the claims), a gradient value based on the loss value, and a variance value of the gradient value;
wherein the loss value corresponding to the image sample is the loss value of the teacher model after the image sample is input into the teacher model.
3. The knowledge distillation method according to claim 1, wherein determining a sample difficulty index of each image sample in the initial image sample set based on the initial image sample set and a teacher model comprises:
training the teacher model based on the initial image sample set, and determining the sample difficulty index of each image sample in the initial image sample set after the teacher model is trained, wherein the sample difficulty indexes are used for difficult sample rejection, and the target image sample set obtained after the difficult samples are removed is used for retraining the teacher model.
4. The knowledge distillation method according to claim 2, wherein the variance value of the predicted values corresponding to the image sample is determined based on the following steps:
inputting each image sample in the initial image sample set into the teacher model multiple times to obtain a plurality of predicted values for each image sample output by the teacher model;
and determining the sample difficulty index of each image sample based on the plurality of predicted values corresponding to that image sample.
5. The knowledge distillation method according to claim 4, wherein determining the sample difficulty index of each image sample based on the plurality of predicted values of each image sample comprises:
when the sample difficulty index is the variance value of the predicted values of an image sample, determining the variance value of the plurality of predicted values corresponding to each image sample;
and taking the variance value of the plurality of predicted values corresponding to each image sample as the sample difficulty index of that image sample.
6. A knowledge distillation apparatus, comprising:
a determining module, configured to determine a sample difficulty index of each image sample in an initial image sample set based on the initial image sample set and a teacher model, wherein the sample difficulty index comprises a variance value of predicted values corresponding to the image sample;
a rejection module, configured to perform difficult sample rejection on the initial image sample set based on the sample difficulty indexes of the image samples to obtain a target image sample set;
a distillation module, configured to perform knowledge distillation on the student model based on the target image sample set;
wherein the determining module is further configured to, in each round of parameter iterative training of the teacher model, execute the steps of determining the sample difficulty index of each image sample in the initial image sample set based on the initial image sample set and the teacher model, and performing difficult sample rejection on the initial image sample set based on the sample difficulty indexes of the image samples to obtain a target image sample set; across the rounds of parameter iterative training of the teacher model, the target image sample set obtained in a previous round of parameter iterative training is taken as the initial image sample set of the next round of parameter iteration;
and wherein the determining module is further configured to, when the sample difficulty index is the variance value of the predicted values of an image sample, determine a total variance value over the predicted values of all the image samples, remove from the initial image sample set the image samples whose variance value is larger than the total variance value, and determine the remaining image samples as the target image sample set.
7. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the knowledge distillation method of any one of claims 1 to 5.
8. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the knowledge distillation method according to any one of claims 1 to 5.
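
For claim 2's alternative difficulty indexes, a sketch of the simplest one, the per-sample loss value of the teacher, is given below; cross-entropy is an assumed choice of loss, and the function name is hypothetical:

import torch
import torch.nn.functional as F

def sample_difficulty_loss(teacher, images, labels):
    # Difficulty index as the teacher's per-sample loss value: a higher
    # loss marks a sample the teacher itself struggles to fit.
    teacher.eval()
    with torch.no_grad():
        logits = teacher(images)
        # reduction="none" keeps one loss value per image sample.
        return F.cross_entropy(logits, labels, reduction="none")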
CN202211105070.4A 2022-09-09 2022-09-09 Knowledge distillation method, apparatus, electronic device, and storage medium Active CN115618921B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211105070.4A CN115618921B (en) 2022-09-09 2022-09-09 Knowledge distillation method, apparatus, electronic device, and storage medium

Publications (2)

Publication Number Publication Date
CN115618921A (en) 2023-01-17
CN115618921B (en) 2024-02-06

Family

ID=84858249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211105070.4A Active CN115618921B (en) 2022-09-09 2022-09-09 Knowledge distillation method, apparatus, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN115618921B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110956255A (en) * 2019-11-26 2020-04-03 中国医学科学院肿瘤医院 Difficult sample mining method and device, electronic equipment and computer readable storage medium
CN112200273A (en) * 2020-12-07 2021-01-08 长沙海信智能系统研究院有限公司 Data annotation method, device, equipment and computer storage medium
CN113807499A (en) * 2021-09-15 2021-12-17 清华大学 Lightweight neural network model training method, system, device and storage medium
CN114170214A (en) * 2021-12-14 2022-03-11 北京工业大学 High myopia detection method based on fundus images and related device
CN114492736A (en) * 2021-12-30 2022-05-13 之江实验室 Gradient-based sample learning difficulty measurement method and device

Also Published As

Publication number Publication date
CN115618921A (en) 2023-01-17

Similar Documents

Publication Publication Date Title
CN110046550B (en) Pedestrian attribute identification system and method based on multilayer feature learning
CN108073902A (en) Video summary method, apparatus and terminal device based on deep learning
CN113435509B (en) Small sample scene classification and identification method and system based on meta-learning
US20200210776A1 (en) Question answering method, terminal, and non-transitory computer readable storage medium
CN111583031A (en) Application scoring card model building method based on ensemble learning
CN111191731A (en) Data processing method and device, storage medium and electronic equipment
CN110796485A (en) Method and device for improving prediction precision of prediction model
CN112036476A (en) Data feature selection method and device based on two-classification service and computer equipment
CN111178196B (en) Cell classification method, device and equipment
CN113987236A (en) Unsupervised training method and unsupervised training device for visual retrieval model based on graph convolution network
CN115618921B (en) Knowledge distillation method, apparatus, electronic device, and storage medium
CN113704389A (en) Data evaluation method and device, computer equipment and storage medium
CN116051924B (en) Divide-and-conquer defense method for image countermeasure sample
CN115545111B (en) Network intrusion detection method and system based on clustering self-adaptive mixed sampling
CN116701647A (en) Knowledge graph completion method and device based on fusion of embedded vector and transfer learning
CN113824580A (en) Network index early warning method and system
CN115909011A (en) Astronomical image automatic classification method based on improved SE-inclusion-v 3 network model
CN115587616A (en) Network model training method and device, storage medium and computer equipment
CN117523218A (en) Label generation, training of image classification model and image classification method and device
CN114862404A (en) Credit card fraud detection method and device based on cluster samples and limit gradients
CN111949530B (en) Test result prediction method and device, computer equipment and storage medium
CN109308565B (en) Crowd performance grade identification method and device, storage medium and computer equipment
CN114339859B (en) Method and device for identifying WiFi potential users of full-house wireless network and electronic equipment
CN116663516B (en) Table machine learning model training method and device, electronic equipment and storage medium
CN114338442B (en) Network traffic identification method and system based on feature data and deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant