CN113887699A - Knowledge distillation method, electronic device and storage medium - Google Patents

Knowledge distillation method, electronic device and storage medium

Info

Publication number
CN113887699A
Authority
CN
China
Prior art keywords
processing stage
model
stage
student model
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111028205.7A
Other languages
Chinese (zh)
Inventor
祝毅晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Midea Group Co Ltd
Midea Group Shanghai Co Ltd
Original Assignee
Midea Group Co Ltd
Midea Group Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Midea Group Co Ltd, Midea Group Shanghai Co Ltd filed Critical Midea Group Co Ltd
Priority to CN202111028205.7A priority Critical patent/CN113887699A/en
Publication of CN113887699A publication Critical patent/CN113887699A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a knowledge distillation method, an electronic device and a storage medium. The knowledge distillation method includes: performing prediction processing on a first training sample with a student model and a teacher model respectively, where the prediction processing includes a plurality of processing stages and different processing stages are executed by different network layers in the student model and the teacher model; acquiring a contribution index of the output data of the teacher model at each processing stage to the training of the student model; selecting at least one processing stage as a target stage based on the contribution index corresponding to each processing stage; and adjusting parameters of the student model using the output data of the teacher model and the student model at each target stage. The above scheme can improve the knowledge distillation effect.

Description

Knowledge distillation method, electronic device and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a knowledge distillation method, an electronic device and a storage medium.
Background
Knowledge distillation is a model compression training method for neural networks. It is generally used to design lightweight or high-performance neural networks for embedded devices and is widely applied to various vision tasks. Knowledge distillation usually requires first training a large neural network (called the teacher model) that has more parameters or runs more slowly but achieves higher performance metrics, and then using the data features or data labels learned by the teacher model to guide a small neural network (called the student model) that has fewer parameters or runs faster but performs worse, so that the student model can achieve better performance.
At present, in each training round of existing knowledge distillation, the teacher model guides the training of the student model with a fixed, preset set of data features or labels. Because the network scale of the teacher model is usually much larger than that of the student model, the features learned by the teacher model cannot be learned effectively by the student model through distillation, which degrades the final performance of the student model; that is, the model capacity gap between the teacher model and the student model prevents the teacher model from distilling a student model with better performance. In view of this, how to improve the knowledge distillation effect is an urgent problem to be solved.
Disclosure of Invention
The application provides a knowledge distillation method, an electronic device and a storage medium.
In order to solve the above technical problem, a technical solution adopted by the application is a knowledge distillation method, comprising: performing prediction processing on a first training sample with a student model and a teacher model respectively, where the prediction processing includes a plurality of processing stages and different processing stages are executed by different network layers in the student model and the teacher model; acquiring a contribution index of the output data of the teacher model at each processing stage to the training of the student model; selecting at least one processing stage as a target stage based on the contribution index corresponding to each processing stage; and adjusting parameters of the student model using the output data of the teacher model and the student model at each target stage.
According to an embodiment of the present application, the output data of each processing stage includes at least one of: the intermediate layer features of at least one intermediate processing stage and the prediction result of the final processing stage, where the resolution of the intermediate layer features differs between different intermediate processing stages.
According to an embodiment of the application, the intermediate layer feature of the intermediate processing stage is a first intermediate layer feature output by the intermediate processing stage, or at least one preset type of second intermediate layer feature extracted from the first intermediate layer feature.
According to an embodiment of the application, in the case where at least one preset type of second intermediate layer feature is extracted from the first intermediate layer feature, each preset type corresponds to a contribution index at each intermediate processing stage, and the contribution index corresponding to each preset type is obtained based on the second intermediate layer feature of that preset type.
According to one embodiment of the application, the method for acquiring the contribution index of the output data of the teacher model in each processing stage to the training of the student model comprises the following steps: for each processing stage, acquiring a first optimization target of the student model in the processing stage by utilizing the difference between output data of the teacher model and output data of the student model in the processing stage respectively; determining a first parameter gradient corresponding to each processing stage based on the first optimization target of each processing stage; and respectively utilizing the first parameter gradients corresponding to the processing stages to obtain the contribution indexes corresponding to the processing stages.
According to an embodiment of the present application, before the first parameter gradients corresponding to the processing stages are respectively used to obtain the contribution indexes corresponding to the processing stages, the method further includes: obtaining a second optimization target of the student model by using the difference between the prediction result of the student model and the label result of the first training sample; and determining a second parameter gradient of the student model based on the second optimization target. Respectively using the first parameter gradients corresponding to the processing stages to obtain the contribution indexes corresponding to the processing stages includes: for each processing stage, acquiring the similarity between the second parameter gradient and the first parameter gradient corresponding to the processing stage as the contribution index corresponding to the processing stage.
According to an embodiment of the present application, the similarity between the second parameter gradient and the first parameter gradient is determined based on an included angle between the second parameter gradient and the first parameter gradient, wherein the included angle and the similarity are in a negative correlation relationship.
According to an embodiment of the present application, selecting at least one processing stage as a target stage based on the contribution index corresponding to each processing stage includes: selecting a processing stage whose contribution index meets a preset condition as a target stage.
According to an embodiment of the present application, adjusting the parameters of the student model using the output data of the teacher model and the student model at each target stage includes: adjusting the parameters of the student model by using a first optimization target of the student model corresponding to each target stage and a second optimization target of the student model; the first optimization target corresponding to each target stage is obtained using the output data of the teacher model and the student model at that target stage, and the second optimization target is obtained using the prediction result of the student model and the label result of the first training sample.
According to an embodiment of the present application, adjusting parameters of a student model by using a first optimization objective and a second optimization objective of the student model corresponding to each objective stage includes: weighting the first optimization target and the second optimization target to obtain a third optimization target; obtaining a third parameter gradient based on a third optimization target; and adjusting the parameters of the student model by using the third parameter gradient.
According to an embodiment of the present application, before the performing the prediction processing on the first training sample by using the student model and the teacher model, respectively, the method further includes: and training the teacher model by using the second training sample.
In order to solve the above technical problem, another technical solution adopted by the present application is: an electronic device comprising a processor, a memory, and communication circuitry, the memory and communication circuitry coupled to the processor; the memory stores program instructions for execution by the processor to implement the knowledge distillation method of the above scheme.
In order to solve the above technical problem, the present application adopts another technical solution: a computer readable storage medium storing program instructions executable by a processor for implementing the knowledge distillation method of the above scheme.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from them without inventive effort, wherein:
FIG. 1 is a schematic flowchart of an embodiment of the knowledge distillation method of the present application;
FIG. 2 is a block diagram of one embodiment of a student model and a teacher model;
FIG. 3 is a flowchart illustrating an embodiment of step S12 in FIG. 1;
FIG. 4 is a schematic flowchart of another embodiment of the knowledge distillation method of the present application;
FIG. 5 is a schematic process diagram of an embodiment of the knowledge distillation method of the present application;
FIG. 6 is a block diagram of an embodiment of an electronic device of the present application;
FIG. 7 is a block diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. It is to be understood that the specific embodiments described herein are merely illustrative of the application and do not limit it. It should be further noted that, for convenience of description, only the structures related to the present application are shown in the drawings, rather than all structures. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of an embodiment of the knowledge distillation method of the present application. Specifically, the method may include the following steps:
step S11: and respectively utilizing the student model and the teacher model to carry out prediction processing on the first training sample.
In the disclosed embodiment, the process of prediction processing includes multiple processing stages, with different processing stages being performed by different network layers in the student model and the teacher model. Specifically, the number of processing stages included in the process of the prediction processing is not limited herein, for example, the process of the prediction processing may include 1 processing stage, and the student model and the teacher model may include 1 network layer correspondingly, or the process of the prediction processing may also include 2 processing stages, and the student model and the teacher model may include 2 network layers correspondingly, or the process of the prediction processing may also include 3 processing stages, and the student model and the teacher model may include 3 network layers correspondingly, which is not limited herein.
In one implementation scenario, the resolution of the features output by different network layers is different, for example, the resolution of the feature output by the 1 st network layer may be w × h/2, the resolution of the feature output by the 2 nd network layer may be w × h/4, the resolution of the feature output by the third network layer may be w × h/8, and so on, which is not exemplified herein.
In one implementation scenario, the resolution of the feature output by the i-th network layer in the student model may be the same as the resolution of the feature output by the i-th network layer in the teacher model. Referring to fig. 2, fig. 2 is a schematic diagram of a framework of an embodiment of a student model and a teacher model. As shown in FIG. 2, the student model and the teacher model each have 4 network layers. For convenience of description, the resolution of the feature output by the i-th network layer in the student model may be recorded as R_i^s, and the resolution of the feature output by the i-th network layer in the teacher model may be recorded as R_i^t, with R_i^s equal to R_i^t, where i ∈ {1, 2, 3, 4}, i.e., i may be an integer between 1 and 4. Other cases can be deduced by analogy and are not described here. It should be noted that the neural network included in each network layer may be set according to actual application requirements, and is not limited here. For example, for the teacher model, a network layer may include complex networks such as residual blocks (Residual Block) to improve the performance of the teacher model, while for the student model, a network layer may include a small number of lightweight networks such as convolutional layers to simplify the model structure; this is not limited here.
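For illustration of the stage structure described above, the following is a minimal PyTorch sketch (not taken from the patent) of a backbone split into four network layers, in which the i-th layer of the student and the i-th layer of the teacher output features of the same spatial resolution; the channel widths, strides and layer contents are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StagedBackbone(nn.Module):
    """Backbone split into stages; each stage halves the spatial resolution,
    so stage i of the student and of the teacher emit features of equal size."""
    def __init__(self, stage_channels):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 3
        for out_ch in stage_channels:
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            ))
            in_ch = out_ch

    def forward(self, x):
        feats = []                        # intermediate-layer features, one per stage
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats                      # feats[-1] feeds the prediction head

# Illustrative capacities: the teacher is wider than the student (assumed values).
teacher = StagedBackbone([64, 128, 256, 512])
student = StagedBackbone([16, 32, 64, 128])

x = torch.randn(1, 3, 224, 224)           # a first training sample
t_feats, s_feats = teacher(x), student(x)
for t, s in zip(t_feats, s_feats):
    assert t.shape[-2:] == s.shape[-2:]    # R_i^s equals R_i^t (same spatial resolution)
```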
In one implementation scenario, the first training sample may be set according to actual application requirements. For example, in the case of a target detection scene, the first training sample may be a sample image of a target area marked with a target object, or in the case of a target segmentation scene, the first training sample may be a sample image marked with an edge contour of the target object. Other scenarios may be analogized, and are not exemplified here. It should be noted that the target object may also be set according to actual application requirements, for example, in a home monitoring application, the target object may include but is not limited to: humans, and pets such as cats, dogs, etc., or, in smart door lock applications, target objects may include, but are not limited to: biometric features such as human face, fingerprint, etc., or, in smart refrigerator applications, the target object may include, but is not limited to: vegetables, fish and shrimp. Other applications may be analogized, and are not exemplified here.
In one implementation scenario, to improve the knowledge distillation effect, the teacher model may be trained with a second training sample prior to the prediction process. Specifically, a second training sample of a target task may be obtained, which may include, but is not limited to: object detection tasks, object segmentation tasks, and the like, without limitation. In addition, the second training sample may be labeled with a labeling result (e.g., a target area, an edge contour, etc.), and based on this, the second training sample may be predicted by the teacher model to obtain a prediction result, and parameters of the teacher model may be adjusted based on a difference between the prediction result and the labeling result. Specifically, parameters may be adjusted by using an optimization manner such as gradient descent, and the specific process may refer to technical details of the optimization manner such as gradient descent, which are not described herein again. According to the mode, before the first training samples are subjected to prediction processing by the student models and the teacher models respectively, the second training samples are firstly utilized to train the teacher models, so that the performance of the teacher models is promoted, knowledge distillation is carried out on the basis, and the knowledge distillation effect is promoted.
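As a minimal sketch of this pre-training step, assuming a simple classification task, a hypothetical small teacher network and illustrative hyper-parameters, the teacher could be trained on the second training samples by gradient descent roughly as follows:

```python
import torch
import torch.nn as nn

# Hypothetical teacher network and optimizer settings; only the overall flow
# (predict, compare with the label result, update by gradient descent) mirrors the text.
teacher = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10))
optimizer = torch.optim.SGD(teacher.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

def pretrain_teacher(second_training_loader, epochs=10):
    teacher.train()
    for _ in range(epochs):
        for images, labels in second_training_loader:   # second training samples and labels
            predictions = teacher(images)                # prediction result
            loss = criterion(predictions, labels)        # difference to the labeling result
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                             # adjust the teacher parameters
```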
Step S12: and acquiring the contribution indexes of the output data of the teacher model in each processing stage to the training of the student model.
In one implementation scenario, the contribution indicator may represent the respective benefit of the various processing stages of the teacher model to training the student model. For example, the larger the index of contribution of a teacher model to training a student model in a certain processing stage is, the higher the benefit degree of the teacher model to training the student model in the processing stage is, and conversely, the smaller the index of contribution of a teacher model to training a student model in a certain processing stage is, the lower the benefit degree of the teacher model to training the student model in the processing stage is.
In one implementation scenario, the output data of each processing stage may include at least one of: the intermediate layer characteristics of at least one intermediate processing stage and the predicted outcome of the final processing stage, and as previously mentioned, the resolution of the intermediate layer characteristics of different intermediate processing stages is different. In the above manner, since the output data of each processing stage includes at least one of the intermediate layer characteristics of at least one intermediate processing stage and the prediction result of the final processing stage, the contribution index can be evaluated from different processing stages such as the intermediate processing stage and the final processing stage, which is beneficial to evaluating the contribution index corresponding to each processing stage as comprehensively as possible, and is beneficial to improving the effect of knowledge distillation.
In a specific implementation scenario, the output data of each processing stage may include at least one intermediate layer feature of an intermediate processing stage, for example, the output data of each processing stage may include 1 intermediate layer feature of an intermediate processing stage, 2 intermediate layer features of an intermediate processing stage, and 3 intermediate layer features of an intermediate processing stage, which is not limited herein. In addition, when the process of the prediction processing includes a plurality of intermediate processing stages, it is not limited which intermediate layer feature of which intermediate processing stage or stages is specifically included in the output data of each processing stage. Referring to fig. 2 in conjunction, the process of the prediction process includes 3 intermediate processing stages, which correspond to the 1 st to 3 rd network layers, respectively, then the output data of each processing stage may include the intermediate layer characteristics output by the 1 st network layer, the intermediate layer characteristics output by the 2 nd network layer, and the intermediate layer characteristics output by the 3 rd network layer, or the output data of each processing stage may include the intermediate layer characteristics output by the 1 st network layer and the intermediate layer characteristics output by the 2 nd network layer, or the output data of each processing stage may include only the intermediate layer characteristics output by the 1 st network layer, or the output data of each processing stage may include only the intermediate layer characteristics output by the 2 nd network layer, or the output data of each processing stage may include only the intermediate layer characteristics output by the 3 rd network layer, and so on, this is not exemplified.
In a specific implementation scenario, the output data of each processing stage may include a prediction result of the final processing stage, for example, in an object detection scenario, the prediction result may include a prediction region of the target object in the first training sample, or in an object segmentation scenario, the prediction result may include a prediction contour of the target object in the first training sample, and so on in other scenarios, which is not illustrated here. Referring to fig. 2, the process of prediction processing may include 1 final processing stage, which corresponds to the 4 th network layer, and so on for other network structures, which are not illustrated herein.
In a specific implementation scenario, the intermediate layer characteristic of the intermediate processing stage may be a first intermediate layer characteristic output by the intermediate processing stage, please refer to fig. 2, as described above, the process of the prediction processing includes 3 intermediate processing stages respectively corresponding to the 1 st to 3 rd network layers, and then the intermediate layer characteristic of the 1 st intermediate processing stage may be the first intermediate layer characteristic output by the 1 st network layer, the intermediate layer characteristic of the 2 nd intermediate processing stage may be the first intermediate layer characteristic output by the 2 nd network layer, and the intermediate layer characteristic of the 3 rd intermediate processing stage may be the first intermediate layer characteristic output by the 3 rd network layer, and in other network structures, the analogy may be performed, which is not illustrated one by one.
In a specific implementation scenario, the intermediate layer features of the intermediate processing stage may also be at least one preset type of second intermediate layer features extracted from the first intermediate layer features. Specifically, the at least one preset type may include, but is not limited to: texture features, location features, etc., without limitation. Taking the example that the at least one preset type includes a texture feature and a position feature, texture feature extraction and position feature extraction may be performed on the first intermediate layer feature, respectively, to obtain a second intermediate layer feature regarding the texture feature and a second intermediate layer feature regarding the position feature. It should be noted that the texture feature may reflect a visual feature of a homogeneous phenomenon in the image, and represent a surface tissue structure arrangement attribute of which the surface changes slowly or periodically, and the position feature may reflect a position attribute of the target object in the image.
In one implementation scenario, as described above, each processing stage may include at least one of an intermediate processing stage and a final processing stage, and the intermediate layer features of the intermediate processing stage may include a first intermediate layer feature output by the intermediate processing stage or at least one preset type of second intermediate layer feature extracted from the first intermediate layer feature. On the basis, corresponding contribution indexes can be calculated for different processing stages and different intermediate layer characteristics.
In a specific implementation scenario, for the final processing stage, the contribution index of the final processing stage of the teacher model to the training of the student model may be obtained through analysis based on the difference between the prediction results of the teacher model and the student models in the final processing stage respectively. For a specific calculation process, reference may be made to the following disclosure embodiments, which are not repeated herein.
In a specific implementation scenario, for the intermediate processing stage, the contribution index of the intermediate processing stage of the teacher model to the training of the student model may be obtained through analysis based on the difference between the first intermediate layer features of the teacher model and the first intermediate layer features of the student models in the intermediate processing stage. For a specific calculation process, reference may be made to the following disclosure embodiments, which are not repeated herein.
In a specific implementation scenario, for the intermediate processing stage, the contribution indexes of the at least one preset type of the intermediate processing stage of the teacher model to the training of the student model are obtained through analysis based on the difference between the characteristics of the teacher model and the characteristics of the student model in the second intermediate layer of the at least one preset type of the intermediate processing stage. That is to say, in the case that the second interlayer features of at least one preset type are obtained by extracting the first interlayer features, in the intermediate processing stage, each preset type corresponds to a contribution index, and the contribution index corresponding to each preset type is obtained based on the second interlayer features of the preset type. For example, for the ith intermediate processing stage, the contribution index of the position feature of the teacher model to the training student model in the ith intermediate processing stage may be obtained according to the difference between the second intermediate layer features of the teacher model and the student models respectively in the ith intermediate processing stage with respect to the position feature, and at the same time, the contribution index of the texture feature of the teacher model in the ith intermediate processing stage to the training student model may be obtained according to the difference between the second intermediate layer features of the teacher model and the student models respectively in the ith intermediate processing stage with respect to the texture feature.
Step S13: and selecting at least one processing stage as a target stage based on the corresponding contribution indexes of the processing stages.
In one implementation scenario, the processing stage whose contribution index satisfies the preset condition may be selected as the target stage. According to the mode, the target stage is screened according to whether the contribution indexes meet the preset conditions or not, so that the parameters of the student model are adjusted according to the output data of the teacher model and the student model in the target stage in the follow-up process, the influence of the processing stage with the contribution indexes not meeting the preset conditions on the training of the student model can be favorably eliminated, and the effect of knowledge distillation is favorably improved.
In a specific implementation scenario, as described above, the contribution index may represent the beneficial degree of each processing stage of the teacher model to the training of the student model, and the contribution index and the beneficial degree are in a positive correlation relationship, in this case, the preset condition may be that the contribution index is greater than the preset index threshold, that is, the processing stage with the greater beneficial degree may be selected as the target stage.
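A minimal sketch of such a selection rule, assuming the contribution indexes are stored per stage and using an illustrative index threshold:

```python
def select_target_stages(contribution, threshold=0.0):
    """Keep the processing stages whose contribution index exceeds the preset
    index threshold (the threshold value here is an assumption)."""
    return [stage for stage, score in contribution.items() if score > threshold]

# Illustrative contribution indexes per processing stage.
contribution = {"stage1": 0.42, "stage2": -0.10, "stage3": 0.07, "final": 0.25}
target_stages = select_target_stages(contribution, threshold=0.05)
# -> ['stage1', 'stage3', 'final']
```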
In a specific implementation scenario, for an intermediate processing stage, if the contribution index of the intermediate processing stage of the teacher model to the training of the student model is obtained through analysis based on the difference between the first intermediate layer features of the teacher model and of the student model at that stage, then whether the contribution index of the intermediate processing stage meets the preset condition may be detected directly; if so, the intermediate processing stage is taken as a target stage, and otherwise it may be left unselected.
In a specific implementation scenario, for the intermediate processing stage, if the contribution indexes of the at least one preset type of the intermediate processing stage of the teacher model to the training of the student model are obtained through analysis based on the difference between the characteristics of the teacher model and the characteristics of the student model in the second intermediate layer of the at least one preset type of the intermediate processing stage, it may be detected whether the contribution indexes corresponding to the various preset types meet preset conditions for the intermediate processing stage, if there is at least one contribution index corresponding to the preset type meeting the preset conditions, the intermediate processing stage may be selected as the target stage, otherwise, if none of the contribution indexes corresponding to the preset types meets the preset conditions, the intermediate processing stage may not be selected.
In a specific implementation scenario, for the final processing stage, as described above, the contribution index of the final processing stage of the teacher model to the training of the student model may be obtained through analysis based on the difference between the prediction results of the teacher model and the prediction results of the student models in the final processing stage, and for the final processing stage, it may be detected whether the contribution index of the final processing stage meets a preset condition, if so, the final processing stage is taken as the target stage, otherwise, the final processing stage may not be selected.
In an implementation scenario, all the processing stages may be used as target stages, and the reference weight of each processing stage may be determined according to its contribution index. For example, where the contribution index represents the degree to which each processing stage of the teacher model benefits the training of the student model and is in positive correlation with that benefit degree, the reference weight of a processing stage may be in positive correlation with its contribution index: the larger the contribution index of the processing stage, the larger its reference weight, and conversely, the smaller the contribution index, the smaller the reference weight. Further, in the case where the contribution index of a processing stage is negative, its reference weight may be set to a preset value (e.g., 0, 0.001, 0.005, etc.). By setting the reference weight according to the contribution index in this way, processing stages of low benefit can be suppressed and processing stages of high benefit can be strengthened, which helps improve the knowledge distillation effect.
In a specific implementation scenario, for example, in order to reduce the calculation complexity of the reference weight, the contribution index may be directly used as the reference weight. In addition, if the contribution index of the processing stage is negative, the reference weight of the processing stage may be set to 0.
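A minimal sketch of this weighting rule, using illustrative contribution-index values and 0 as the preset fallback for negative indexes:

```python
def reference_weights(contribution, fallback=0.0):
    """Use each stage's contribution index directly as its reference weight;
    a stage with a negative contribution index gets the preset fallback value."""
    return {stage: (score if score >= 0 else fallback)
            for stage, score in contribution.items()}

weights = reference_weights({"stage1": 0.42, "stage2": -0.10, "stage3": 0.07, "final": 0.25})
# -> {'stage1': 0.42, 'stage2': 0.0, 'stage3': 0.07, 'final': 0.25}
```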
In a specific implementation scenario, for the intermediate processing stage, if the contribution index of the at least one preset type of the intermediate processing stage of the teacher model to the training of the student model is obtained through analysis based on the difference between the teacher model and the student model in the second intermediate layer features of the at least one preset type of the intermediate processing stage, the reference weight corresponding to each preset type of the intermediate processing stage can be obtained according to the contribution index corresponding to each preset type of the intermediate processing stage. For example, taking the preset type as the texture feature and the position feature as an example, for the ith intermediate processing stage, the contribution index corresponding to the texture feature and the contribution index corresponding to the position feature may be calculated, and on this basis, the contribution index corresponding to the texture feature may be directly used as the reference weight of the texture feature of the ith intermediate processing stage, and the contribution index corresponding to the position feature may be directly used as the reference weight of the position feature of the ith intermediate processing stage. Other cases may be analogized, and no one example is given here.
In a specific implementation scenario, for the intermediate processing stage, if the contribution index of the intermediate processing stage of the teacher model to the training of the student model is obtained through analysis based on the difference between the first intermediate layer characteristics of the teacher model and the first intermediate layer characteristics of the student model in the intermediate processing stage, the reference weight of the intermediate processing stage can be obtained according to the contribution index of the intermediate processing stage for the intermediate processing stage. For example, for the ith intermediate processing stage, the contribution index thereof may be directly used as the reference weight of the ith intermediate processing stage. Other cases may be analogized, and no one example is given here.
In a specific implementation scenario, for the final processing stage, as described above, the contribution index of the final processing stage of the teacher model to the training of the student model may be obtained through analysis based on the difference between the prediction results of the teacher model and the student models in the final processing stage, and then the reference weight of the final processing stage may be obtained according to the contribution index of the final processing stage. For example, the contribution index of the final processing stage may be directly used as the reference weight of the final processing stage. Other cases may be analogized, and no one example is given here.
Step S14: and adjusting parameters of the student model by using output data of the teacher model and the student model in each target stage.
Specifically, the network parameters of the student model can be adjusted by using a first optimization target of the student model at each target stage and a second optimization target of the student model, where the first optimization target of each target stage is obtained by using the output data of the teacher model and the student model at that target stage, and the second optimization target is obtained by using the prediction result of the student model and the label result of the first training sample. In this way, when the parameters of the student model are adjusted, attention is paid both to the first optimization target of the student model at each target stage and to the second optimization target of the student model, so that the knowledge transferred from the teacher model and learned by the student model is constrained by the first and second optimization targets jointly, which helps improve the knowledge distillation effect.
In one implementation scenario, as described above, a processing stage with a contribution index satisfying a preset condition may be selected as a target stage, and for the intermediate processing stage, the contribution index of the intermediate processing stage with respect to each preset type may be obtained according to at least one second intermediate layer characteristic of the preset type, and then the target stage is selected, or the contribution index of the intermediate processing stage may be obtained according to a first intermediate layer characteristic output by the intermediate processing stage, and then the target stage is selected, and for different processing manners of the contribution index, a first optimization target of each target stage may be obtained by adopting different calculation manners.
In a specific implementation scenario, if the goal phase is the final processing phase, the first optimization goal of the final processing phase may be obtained based on a difference between the predicted results of the teacher model and the student model in the final processing phase. Specifically, a function such as KL Divergence (i.e., Kullback-Leibler Divergence) may be used to measure the difference between the predicted results to obtain the first optimization goal of the final processing stage. It should be noted that, taking target detection as an example, the prediction result may include a prediction region of the target object, or, taking target segmentation as an example, the prediction result may include a prediction contour of the target object (for example, a probability value that each pixel belongs to the target object), and the like may be performed in other cases, which is not illustrated herein.
In a specific implementation scenario, if the target stage is an intermediate processing stage and the contribution index is calculated based on the first intermediate layer characteristics output by the intermediate processing stage, the first optimization target of the intermediate processing stage may be obtained based on a difference between the teacher model and the student model in the first intermediate layer characteristics of the intermediate processing stage. In particular, a function such as the L2 norm may be employed to measure the difference between the first intermediate layer characteristics to arrive at a first optimization goal for the intermediate processing stage.
In a specific implementation scenario, if the target stage is an intermediate processing stage and the contribution index is calculated based on the second intermediate layer feature of at least one preset type obtained by extracting the first intermediate layer feature, that is, each preset type corresponds to the contribution index, in this case, the preset type with the contribution index meeting the preset condition may be selected as the target type of the intermediate processing stage, and then the first optimization target of the intermediate processing stage is obtained based on a difference between the teacher model and the student model in the second intermediate layer feature of the target type of the intermediate processing stage. Specifically, a function such as the L2 norm may be employed to measure the difference between the second mid-level features of the target type to arrive at the first optimization goal for the intermediate processing stage. Taking at least one preset type including a position feature and a texture feature as an example, the target stage is an ith intermediate processing stage, that is, for the ith intermediate processing stage, the position feature and the texture feature respectively correspond to contribution indexes, and if the contribution indexes corresponding to the position feature meet preset conditions, the position feature can be taken as the target type.
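The sketch below illustrates, under assumed tensor shapes and an assumed temperature, the two kinds of first optimization target described above: a KL-divergence term between the prediction results at the final processing stage, and an L2 (mean-squared) term between intermediate-layer features:

```python
import torch
import torch.nn.functional as F

def final_stage_objective(student_logits, teacher_logits, temperature=1.0):
    """First optimization target of the final processing stage: KL divergence between
    the soft predictions of the teacher and the student (temperature is an assumption)."""
    s = F.log_softmax(student_logits / temperature, dim=1)
    t = F.softmax(teacher_logits / temperature, dim=1)
    return F.kl_div(s, t, reduction="batchmean")

def intermediate_stage_objective(student_feat, teacher_feat):
    """First optimization target of an intermediate processing stage: squared L2
    distance between the (first or selected second) intermediate-layer features."""
    return F.mse_loss(student_feat, teacher_feat.detach())

# Illustrative tensors standing in for stage outputs of the two models.
loss_final = final_stage_objective(torch.randn(4, 10), torch.randn(4, 10))
loss_mid = intermediate_stage_objective(torch.randn(4, 64, 28, 28), torch.randn(4, 64, 28, 28))
```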
In a specific implementation scenario, a function such as cross entropy loss, dice loss, etc. may be used to measure the difference between the prediction result of the student model and the labeling result of the first training sample, so as to obtain the second optimization goal. It should be noted that, taking target detection as an example, the labeling result may include a target region of the target object labeled by the first training sample, and the prediction result may include a prediction region of the target object, or, taking target segmentation as an example, the labeling result may include an edge contour of the target object labeled by the first training sample (for example, a type to which each pixel belongs, for example, 0-1 label may be adopted to label as belonging to the target object, or not belonging to the target object), and the prediction result may include a prediction contour of the target object (for example, a probability value that each pixel belongs to the target object), and so on in other cases, which is not illustrated here.
In a specific implementation scenario, the first optimization target corresponding to each target stage and the second optimization target of the student model may be weighted to obtain a third optimization target, a third parameter gradient may be obtained based on the third optimization target, and the parameters of the student model may be adjusted by using the third parameter gradient. For the specific process, reference may be made to the technical details of model optimization methods such as gradient descent, which are not repeated here. It should be noted that the weights of the first optimization target and the second optimization target may be set according to application requirements. For example, the weights of the optimization targets may be set to the same value, or the weights may be set according to the processing stage corresponding to each optimization target, e.g., an earlier processing stage may be given a smaller weight and a later processing stage a larger weight, which is not limited here. In this way, the third optimization target is obtained by weighting the first and second optimization targets, the third parameter gradient is obtained based on the third optimization target, and the parameters of the student model are adjusted using the third parameter gradient, so that both the first and the second optimization targets are weighted and referred to when adjusting the parameters of the student model, which helps improve the knowledge distillation effect.
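A minimal sketch of this update, in which first_objectives, second_objective, the optimizer and the weights stand in for the quantities described above and the default equal weighting is an assumption:

```python
def update_student(first_objectives, second_objective, student_optimizer,
                   stage_weights=None, task_weight=1.0):
    """Weight the first optimization targets of the selected target stages together
    with the second optimization target into a third optimization target, then take
    one gradient-descent step along the resulting third parameter gradient."""
    if stage_weights is None:
        stage_weights = [1.0] * len(first_objectives)   # equal weights by default
    third_objective = task_weight * second_objective
    for w, loss in zip(stage_weights, first_objectives):
        third_objective = third_objective + w * loss
    student_optimizer.zero_grad()
    third_objective.backward()                          # third parameter gradient
    student_optimizer.step()                            # adjust the student parameters
```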
In one implementation scenario, as described above, all the processing stages may be used as target stages, the reference weight of each processing stage is determined according to the contribution index of each processing stage, for an intermediate processing stage, the contribution index of the intermediate processing stage with respect to each preset type may be obtained according to at least one preset type of second intermediate layer characteristic, or the contribution index of the intermediate processing stage may be obtained according to a first intermediate layer characteristic output by the intermediate processing stage, and for different processing modes of the contribution index, the first optimization target of each target stage may be obtained by adopting different calculation modes.
In a specific implementation scenario, if the goal phase is the final processing phase, the first optimization goal of the final processing phase may be obtained based on a difference between the predicted results of the teacher model and the student model in the final processing phase. Reference may be made to the foregoing description for details, which are not repeated herein.
In a specific implementation scenario, if the target stage is an intermediate processing stage and the contribution index is calculated based on the first intermediate layer characteristics output by the intermediate processing stage, the first optimization target of the intermediate processing stage may be obtained based on a difference between the teacher model and the student model in the first intermediate layer characteristics of the intermediate processing stage. Reference may be made to the foregoing description for details, which are not repeated herein.
In a specific implementation scenario, if the target stage is an intermediate processing stage and the contribution index is calculated based on the second intermediate layer features of at least one preset type obtained by extracting the first intermediate layer features, that is, each preset type corresponds to a contribution index, in this case, for each preset type, the first optimization target corresponding to the preset type of the intermediate processing layer may be obtained based on a difference between the teacher model and the student model in the second intermediate layer features of the preset type of the intermediate processing stage. Specifically, as previously described, the L2 norm may be employed to measure the difference to arrive at the first optimization goal. Taking the example that the at least one preset type comprises the position feature and the texture feature, the target stage is the ith intermediate processing stage, that is, for the ith intermediate processing stage, the first optimization target of the ith intermediate processing layer on the position feature can be obtained based on the difference between the teacher model and the student model in the second intermediate feature of the ith intermediate processing stage on the position feature, and the first optimization target of the ith intermediate processing layer on the texture feature can be obtained based on the difference between the teacher model and the student model in the second intermediate feature of the ith intermediate processing stage on the texture feature.
In a specific implementation scenario, a function such as cross entropy loss, dice loss, etc. may be used to measure the difference between the prediction result of the student model and the labeling result of the first training sample, so as to obtain the second optimization goal. Reference may be made to the foregoing description for details, which are not repeated herein.
In a specific implementation scenario, as described above, all the processing stages may be used as target stages, and the reference weight of each processing stage is determined according to its contribution index. On this basis, a weighting weight for the second optimization target may be obtained, and the first optimization targets and the second optimization target may be weighted (e.g., by weighted averaging) using the reference weights and the weighting weight to obtain a third optimization target; a third parameter gradient is then obtained based on the third optimization target, and the parameters of the student model are adjusted using the third parameter gradient. For details, reference may be made to the foregoing description, which is not repeated here. It should be noted that the weighting weight of the second optimization target may be set according to the value range of the reference weights. For example, when the reference weights take values between 0 and 1, the weighting weight may also be set between 0 and 1, or close to the reference weights, so that neither the first optimization targets nor the second optimization target is over- or under-emphasized because the weighting weight is much smaller or much larger than the reference weights, which would affect the knowledge distillation effect.
In the above scheme, the first training sample is subjected to prediction processing by the student model and the teacher model respectively, the prediction processing includes a plurality of processing stages, and different processing stages are performed by different network layers in the student model and the teacher model. The contribution index of the output data of the teacher model at each processing stage to the training of the student model is obtained, and at least one processing stage is selected as a target stage based on the contribution index corresponding to each processing stage. On this basis, the output data of the teacher model and the student model at each target stage are used to adjust the parameters of the student model. That is, in the knowledge distillation process, the contribution of each processing stage to the training of the student model is measured, and the target stages are emphasized as references according to their contribution indexes, so that the influence of the model capacity gap on knowledge distillation can be greatly alleviated, which helps improve the knowledge distillation effect.
Referring to fig. 3, fig. 3 is a schematic flowchart illustrating an embodiment of step S12 in fig. 1. Specifically, the method may include the steps of:
step S31: for each processing stage, the first optimization target of the student model in the processing stage is obtained by utilizing the difference between the output data of the teacher model and the output data of the student model in the processing stage respectively.
Specifically, as described above, the process of the prediction processing includes an intermediate processing stage and a final processing stage, on this basis, the output data of the intermediate processing stage may include an intermediate layer feature, the output data of the final processing stage may include a prediction result, and the intermediate layer feature of the intermediate processing stage may include a first intermediate layer feature output by the intermediate processing stage, or at least one preset type of second intermediate layer feature extracted from the first intermediate layer feature, and then the first optimization target may be calculated in different manners for different processing stages.
In one implementation scenario, for the final processing stage, a first optimization objective for the final processing stage may be derived based on a difference between the predicted results of the teacher model and the student models in the final processing stage. Reference may be made to the related description in the foregoing embodiments, which are not repeated herein.
In one implementation scenario, for an intermediate processing stage, if the output data includes the first intermediate layer characteristics output by the intermediate processing stage, the first optimization goal for the intermediate processing stage may be obtained based on the difference between the teacher model and the student model in the first intermediate layer characteristics of the intermediate processing stage. Reference may be made to the related description in the foregoing embodiments, which are not repeated herein.
In an implementation scenario, for an intermediate processing stage, if output data of the intermediate processing stage includes at least one preset type of second intermediate layer feature extracted from the first intermediate layer feature, for each preset type, a first optimization goal corresponding to the preset type of the intermediate processing stage may be obtained based on a difference between the teacher model and the student model in the second intermediate layer feature of the preset type of the intermediate processing stage. Reference may be made to the related description in the foregoing embodiments, which are not repeated herein.
Step S32: and determining the first parameter gradient corresponding to each processing stage respectively based on the first optimization target of each processing stage.
Specifically, for each processing stage, the partial derivative of each parameter may be obtained for the first optimization objective, and expressed in a vector form, so as to obtain the first parameter gradient corresponding to the processing stage. It should be noted that, for each intermediate processing stage, if the output data of the intermediate processing stage includes at least one second intermediate layer feature of a preset type obtained by extracting the first intermediate layer feature, the first parameter gradient of the first intermediate processing stage may be obtained from the first optimization target corresponding to each preset type, that is, in each intermediate processing stage, each preset type corresponds to the first parameter gradient. The specific calculation process of the parameter gradient may refer to the technical details of the gradient, and is not described herein again.
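A minimal sketch of computing such a per-stage first parameter gradient with automatic differentiation; flattening the partial derivatives into a single vector and zero-filling unused parameters are simplifying assumptions:

```python
import torch
from torch.nn.utils import parameters_to_vector

def stage_parameter_gradient(stage_objective, student_parameters):
    """Partial derivatives of one processing stage's first optimization target with
    respect to the student parameters (given as a list), expressed as one vector."""
    grads = torch.autograd.grad(stage_objective, student_parameters,
                                retain_graph=True, allow_unused=True)
    grads = [g if g is not None else torch.zeros_like(p)
             for g, p in zip(grads, student_parameters)]
    return parameters_to_vector(grads)
```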
Step S33: and respectively utilizing the first parameter gradients corresponding to the processing stages to obtain the contribution indexes corresponding to the processing stages.
Specifically, the second optimization target of the student model may be obtained by using the difference between the prediction result of the student model and the label result of the first training sample; the specific calculation may refer to the relevant description in the foregoing embodiments and is not repeated here. On this basis, a second parameter gradient of the student model can be determined based on the second optimization target; the calculation is similar to that of the first parameter gradient and is not repeated here. After the second parameter gradient is obtained, for each processing stage, the similarity between the second parameter gradient and the first parameter gradient corresponding to that processing stage may be obtained as the contribution index corresponding to that processing stage. In this way, taking the similarity between the second parameter gradient and a stage's first parameter gradient as that stage's contribution index makes it possible to measure accurately how much each processing stage of the teacher model benefits the training of the student model, which helps improve the accuracy of the contribution index.
In an implementation scenario, it should be noted that, for each intermediate processing stage, if the output data of the intermediate processing stage includes at least one preset type of second intermediate layer feature extracted from the first intermediate layer feature, the similarity between the second parameter gradient and the first parameter gradient corresponding to each preset type may be obtained as the contribution index corresponding to that preset type. Taking the case where the preset types include a position feature and a texture feature as an example, for the i-th intermediate processing stage, the similarity between the second parameter gradient and the first parameter gradient corresponding to the position feature may be used as the contribution index of the position feature in that stage, and the similarity between the second parameter gradient and the first parameter gradient corresponding to the texture feature may be used as the contribution index of the texture feature in that stage. Other cases may be deduced by analogy and are not exemplified one by one here.
In one implementation scenario, the similarity between the second parameter gradient and the first parameter gradient may be determined based on the included angle between them, the included angle being negatively correlated with the similarity. That is, the larger the angle, the smaller the similarity; conversely, the smaller the angle, the larger the similarity. In particular, cosine similarity may be employed to measure the similarity between the first and second parameter gradients. For convenience of description, in the i-th processing stage the first parameter gradient corresponding to the k-th preset type may be recorded as g_i^k(x), where x represents a first training sample input to the teacher model, and the second parameter gradient may be recorded as g(x). The contribution index corresponding to the k-th preset type in the i-th processing stage may then be recorded as the cosine similarity

    c_i^k(x) = cos⟨g(x), g_i^k(x)⟩ = ( g(x) · g_i^k(x) ) / ( ||g(x)|| · ||g_i^k(x)|| )
In an implementation scenario, after the first parameter gradient and the second parameter gradient are obtained, they may further be converted into unit vectors; on this basis, the similarity between the second parameter gradient and the first parameter gradient may also be determined based on the distance between the two, the distance being negatively correlated with the similarity. That is, the greater the distance, the smaller the similarity; conversely, the smaller the distance, the greater the similarity. In particular, the Euclidean distance may be used to measure the similarity between the first and second parameter gradients.
It should be noted that the included angle and the distance mentioned above are only two possible ways of measuring the similarity in practical application; the similarity may also be measured in other ways, which is not limited herein.
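For illustration, a minimal sketch of both measures of the contribution index described above is given below (PyTorch again; the function names are illustrative assumptions).

```python
import torch
import torch.nn.functional as F

def contribution_index_cosine(second_grad: torch.Tensor, first_grad: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between the second parameter gradient (from the task loss) and
    one stage's first parameter gradient: the larger the included angle, the smaller
    the similarity and hence the smaller the contribution index."""
    return F.cosine_similarity(second_grad, first_grad, dim=0)

def contribution_index_distance(second_grad: torch.Tensor, first_grad: torch.Tensor) -> torch.Tensor:
    """Alternative measure: convert both gradients to unit vectors and use the negated
    Euclidean distance, which likewise shrinks as the included angle grows."""
    u = F.normalize(second_grad, dim=0)
    v = F.normalize(first_grad, dim=0)
    return -torch.norm(u - v)
```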
According to the above scheme, for each processing stage, the first optimization target of the student model at that stage is obtained from the difference between the output data of the teacher model and of the student model at that stage; the first parameter gradient corresponding to each processing stage is then determined from its first optimization target, and the contribution index of each processing stage is obtained from its first parameter gradient. In this way, the contribution index of a processing stage can be accurately measured through the difference between the output data.
Referring to FIG. 4, FIG. 4 is a schematic flow diagram of another embodiment of the knowledge distillation method of the present application. Specifically, the method may include the following steps:
Step S41: performing prediction processing on the first training sample by using the student model and the teacher model respectively.
In the disclosed embodiment, the process of prediction processing includes multiple processing stages, with different processing stages being performed by different network layers in the student model and the teacher model. In addition, in order to improve the knowledge distillation effect, the teacher model may be trained with the second training sample before the prediction processing, until the teacher model converges. Reference may be made to the related description in the foregoing embodiments, which is not repeated herein.
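As an illustration of this preliminary training of the teacher, a minimal sketch is given below (PyTorch; the data loader, loss criterion, optimizer choice and hyper-parameters are illustrative assumptions, not specified by the embodiment).

```python
import torch

def pretrain_teacher(teacher, loader, criterion, epochs: int = 10, lr: float = 1e-3):
    """Train the teacher model on second training samples until it (approximately)
    converges, before it is used to supervise the student during distillation."""
    optimizer = torch.optim.Adam(teacher.parameters(), lr=lr)
    teacher.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(teacher(x), y)
            loss.backward()
            optimizer.step()
    teacher.eval()  # the teacher is kept fixed during the subsequent distillation
    return teacher
```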
Step S42: acquiring the contribution index of the output data of the teacher model at each processing stage to the training of the student model.
In the embodiments of the present disclosure, the processing stages may include at least one intermediate processing stage and a final processing stage, and the output data of the final processing stage includes the prediction result. In addition, each intermediate processing stage outputs a first intermediate layer feature, from which at least one preset type of second intermediate layer feature may be extracted; the output data of the intermediate processing stage includes either the at least one preset type of second intermediate layer feature or the first intermediate layer feature, which is not limited herein.
In an implementation scenario, in the intermediate processing stage, under the condition that output data of the intermediate processing stage includes at least one preset type of second intermediate layer feature, each preset type corresponds to a contribution index, and the contribution index corresponding to each preset type is acquired based on the preset type of second intermediate layer feature. The specific process can refer to the related description in the foregoing disclosed embodiments, and is not repeated herein.
In an implementation scenario, in the case that the output data of the intermediate processing stage includes the first intermediate layer characteristic, a contribution index may be calculated for each intermediate processing stage. The specific process can refer to the related description in the foregoing disclosed embodiments, and is not repeated herein.
Step S43: selecting the processing stages whose contribution indexes meet a preset condition as target stages.
In an implementation scenario, in the case that the output data of an intermediate processing stage includes at least one preset type of second intermediate layer feature, each preset type in that stage corresponds to a contribution index. If the contribution index corresponding to at least one preset type meets the preset condition, the processing stage may be regarded as a target stage, and each preset type whose contribution index meets the preset condition may be regarded as a target type; otherwise, if none of the contribution indexes corresponding to the preset types meets the preset condition, the processing stage is not selected. Reference may be made in particular to the related description in the foregoing disclosed embodiments.
In an implementation scenario, in the case that the output data of the intermediate processing stages includes the first intermediate layer feature, each intermediate processing stage corresponds to one contribution index obtained by calculation, so it may be directly detected whether the contribution index corresponding to the intermediate processing stage meets the preset condition; if so, the intermediate processing stage may be selected as a target stage, and otherwise the processing stage is not selected. Referring to fig. 5, fig. 5 is a schematic process diagram of an embodiment of the knowledge distillation method of the present application. As shown in fig. 5, the solid arrow in the dashed box represents the second parameter gradient (see the foregoing disclosure for its meaning), the dashed arrows in the dashed box represent first parameter gradients, and the number at a dashed arrow indicates the processing stage to which that first parameter gradient belongs: the dashed arrow numbered 01 represents the first parameter gradient of the 1st processing stage, and so on. Taking the cosine similarity between the first parameter gradient and the second parameter gradient as the contribution index, if the included angle between the first parameter gradient corresponding to the j-th processing stage and the second parameter gradient is greater than or equal to 90 degrees and less than or equal to 180 degrees, the contribution index corresponding to the j-th processing stage is less than or equal to 0; if the included angle is greater than or equal to 0 degrees and less than 90 degrees, the contribution index is greater than 0. Therefore, with the preset condition set to "the cosine similarity is greater than 0", in the t-th iteration of training the student model, the included angle between the first parameter gradient corresponding to the 1st processing stage and the second parameter gradient is greater than 90 degrees, while the included angles for the remaining 2nd, 3rd and 4th processing stages are all between 0 and 90 degrees, so the 2nd, 3rd and 4th processing stages may be taken as target stages. Similarly, in the (t+i)-th iteration, only the included angle between the first parameter gradient corresponding to the 4th processing stage and the second parameter gradient is between 0 and 90 degrees, while the included angles for the remaining 1st, 2nd and 3rd processing stages are all greater than 90 degrees, so only the 4th processing stage is taken as the target stage.
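For illustration, a minimal sketch of this selection rule follows; the dictionary of contribution indexes and the zero threshold mirror the preset condition above, and all names and example values are illustrative assumptions.

```python
def select_target_stages(contribution_indexes: dict, threshold: float = 0.0) -> list:
    """Keep only the processing stages whose contribution index satisfies the preset
    condition (here: cosine similarity strictly greater than 0, i.e. the included
    angle between the first and second parameter gradients is below 90 degrees)."""
    return [stage for stage, c in contribution_indexes.items() if c > threshold]

# Illustrative values echoing the t-th iteration of FIG. 5:
# select_target_stages({1: -0.2, 2: 0.4, 3: 0.1, 4: 0.7})  ->  [2, 3, 4]
```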
Step S44: adjusting the parameters of the student model by using the output data of the teacher model and the student model at each target stage.
In one implementation scenario, for a target stage that is the final processing stage, the first optimization target may be obtained based on the difference between the prediction results of the teacher model and the student model at the final processing stage; for a target stage that is an intermediate processing stage, if the output data of the intermediate processing stage includes at least one preset type of second intermediate layer feature, the first optimization target of the intermediate processing stage may be obtained based on the difference between the teacher model and the student model in the second intermediate layer feature of the target type at that stage. On this basis, the second optimization target is obtained by using the difference between the prediction result of the student model and the label result of the first training sample, the first optimization targets and the second optimization target are weighted to obtain a third optimization target, a third parameter gradient is obtained based on the third optimization target, and the parameters of the student model are adjusted by using the third parameter gradient. Reference may be made to the related description in the foregoing embodiments, which is not repeated herein.
In one implementation scenario, for a target stage that is the final processing stage, the first optimization target of the student model at the final processing stage may be calculated with reference to the foregoing description; for a target stage that is an intermediate processing stage, if the output data of the intermediate processing stage includes the first intermediate layer feature, the first optimization target of the intermediate processing stage is obtained based on the difference between the teacher model and the student model in the first intermediate layer feature of that stage. On this basis, the second optimization target is obtained by using the difference between the prediction result of the student model and the label result of the first training sample, and the parameters of the student model are adjusted based on the first optimization targets and the second optimization target. Reference may be made to the related description in the foregoing embodiments, which is not repeated herein. Referring to fig. 5, in the t-th iteration, since the 2nd, 3rd and 4th processing stages are taken as target stages, the switch of the 1st processing stage is open when the optimization target is calculated, indicating that the 1st processing stage is not considered in the t-th iteration, while the switches of the 2nd, 3rd and 4th processing stages are closed, indicating that these stages are considered in the t-th iteration. Similarly, in the (t+i)-th iteration, since only the 4th processing stage is taken as the target stage, the switches of the 1st, 2nd and 3rd processing stages are open, indicating that these stages are not considered in the (t+i)-th iteration, and the switch of the 4th processing stage is closed, indicating that the 4th processing stage is considered in the (t+i)-th iteration.
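For illustration, a minimal sketch of one parameter update combining the selected first optimization targets with the second optimization target is given below (PyTorch; the equal default weights and all names are illustrative assumptions, since the embodiment does not fix the weighting).

```python
import torch

def distillation_step(stage_losses: dict, task_loss: torch.Tensor, target_stages: list,
                      optimizer: torch.optim.Optimizer,
                      stage_weight: float = 1.0, task_weight: float = 1.0) -> torch.Tensor:
    """One adjustment of the student parameters: the first optimization targets of the
    target stages (switch closed) and the second optimization target (task loss) are
    weighted into a third optimization target, and its gradient (the third parameter
    gradient) drives the update. Non-target stages (switch open) contribute nothing."""
    third_target = task_weight * task_loss
    for stage, loss in stage_losses.items():
        if stage in target_stages:
            third_target = third_target + stage_weight * loss
    optimizer.zero_grad()
    third_target.backward()
    optimizer.step()
    return third_target.detach()
```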
According to the above scheme, the target stages are screened according to whether their contribution indexes meet the preset condition, and the parameters of the student model are subsequently adjusted using the output data of the teacher model and the student model at the target stages. This helps exclude the influence, on the training of the student model, of processing stages whose contribution indexes do not meet the preset condition, and is favorable for improving the effect of knowledge distillation.
Referring to fig. 6, fig. 6 is a schematic framework diagram of an embodiment of an electronic device 60 of the present application. As shown in fig. 6, the electronic device 60 may include a processor 61 and a memory 62 coupled to each other, the memory 62 storing program instructions, and the processor 61 being configured to execute the program instructions to implement the steps in any of the above-described embodiments of the knowledge distillation method. Specifically, the electronic device 60 may include, but is not limited to, a server, a notebook computer, a desktop computer, and the like.
The processor 61 may also be referred to as a CPU (Central Processing Unit), and the processor 61 may be an integrated circuit chip having signal processing capability. The processor 61 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor 61 may be any conventional processor or the like.
In the embodiment of the present disclosure, the processor 61 is configured to perform prediction processing on the first training sample by using the student model and the teacher model, respectively, where a process of the prediction processing includes a plurality of processing stages, and different processing stages are executed by different network layers in the student model and the teacher model; the processor 61 is used for acquiring the contribution indexes of the output data of the teacher model in each processing stage to the training of the student model; the processor 61 is configured to select at least one processing stage as a target stage based on the contribution index corresponding to each processing stage; the processor 61 is configured to adjust parameters of the student model using output data of the teacher model and the student model at each goal stage.
According to the above scheme, in the knowledge distillation process, the contribution index of each processing stage to the training of the student model is measured, and the target stages to be referenced are selected according to the contribution indexes, so that the influence of the capacity difference between the models on knowledge distillation can be greatly alleviated, which is favorable for improving the knowledge distillation effect.
In some disclosed embodiments, the output data of each processing stage includes at least one of: intermediate layer features of at least one intermediate processing stage and the prediction result of the final processing stage, wherein the resolutions of the intermediate layer features of different intermediate processing stages differ.
In some disclosed embodiments, the intermediate layer feature of the intermediate processing stage is a first intermediate layer feature output by the intermediate processing stage, or at least one preset type of second intermediate layer feature extracted from the first intermediate layer feature.
In some disclosed embodiments, in the case that the second interlayer features of at least one preset type are obtained by extracting the first interlayer features, in the intermediate processing stage, each preset type corresponds to a contribution index, and the contribution index corresponding to each preset type is obtained based on the second interlayer features of the preset type.
In some disclosed embodiments, the processor 61 is configured to obtain, for each processing stage, a first optimization goal of the student model at the processing stage using a difference between output data of the teacher model and output data of the student model at the processing stage, respectively; the processor 61 is configured to determine a first parameter gradient corresponding to each processing stage based on the first optimization target of each processing stage; the processor 61 is configured to obtain the contribution index corresponding to each processing stage by using the first parameter gradient corresponding to each processing stage.
In some disclosed embodiments, the processor 61 is configured to obtain a second optimization goal of the student model by using a difference between the prediction result of the student model and the labeling result of the first training sample; the processor 61 is configured to determine a second parameter gradient of the student model based on the second optimization objective; the processor 61 is configured to, for each processing stage, obtain a similarity between the second parameter gradient and the first parameter gradient corresponding to the processing stage as a contribution index corresponding to the processing stage.
In some disclosed embodiments, the similarity between the second parameter gradient and the first parameter gradient is determined based on an angle between the second parameter gradient and the first parameter gradient, wherein the angle and the similarity are in a negative correlation relationship.
In some disclosed embodiments, the processor 61 is configured to select, as the target phase, a processing phase in which the contribution indicator satisfies a preset condition.
In some disclosed embodiments, the predetermined condition is that the contribution indicator is greater than a predetermined indicator threshold.
In some disclosed embodiments, the processor 61 is configured to adjust the parameters of the student model by using the first optimization target of the student model corresponding to each target stage and the second optimization target of the student model; the first optimization target corresponding to each target stage is obtained by using the output data of the teacher model and the student model at that target stage, and the second optimization target is obtained by using the prediction result of the student model and the label result of the first training sample.
In some disclosed embodiments, the processor 61 is configured to weight the first optimization objective and the second optimization objective to obtain a third optimization objective; the processor 61 is configured to obtain a third parameter gradient based on a third optimization objective; the processor 61 is configured to adjust the parameters of the student model using the third parameter gradient.
In some disclosed embodiments, processor 61 is configured to train the teacher model using a second training sample.
Referring to fig. 7, fig. 7 is a block diagram illustrating an embodiment of a computer readable storage medium 70 according to the present application. The computer readable storage medium 70 stores program instructions 71 capable of being executed by a processor, the program instructions 71 for implementing the steps in any of the above-described knowledge distillation method embodiments.
According to the above scheme, in the knowledge distillation process, the contribution index of each processing stage to the training of the student model is measured, and the target stages to be referenced are selected according to the contribution indexes, so that the influence of the capacity difference between the models on knowledge distillation can be greatly alleviated, which is favorable for improving the knowledge distillation effect.
The computer readable storage medium may be a medium capable of storing the program instructions, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, or it may be a server storing the program instructions, which may send the stored program instructions to other devices for execution or may itself run the stored program instructions.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into modules or units is only a division by logical function, and other divisions are possible in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the purpose of illustrating embodiments of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application or are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (14)

1. A method of knowledge distillation, comprising:
respectively utilizing a student model and a teacher model to carry out prediction processing on a first training sample, wherein the prediction processing process comprises a plurality of processing stages, and different processing stages are executed by different network layers in the student model and the teacher model;
acquiring contribution indexes of output data of the teacher model in each processing stage to training of the student model;
selecting at least one processing stage as a target stage based on the contribution index corresponding to each processing stage;
and adjusting parameters of the student model by using the output data of the teacher model and the output data of the student model in each target stage.
2. The method of claim 1, wherein the output data of each of the processing stages comprises at least one of: intermediate layer features of at least one intermediate processing stage and predicted results of a final processing stage, wherein the resolution of the intermediate layer features of different intermediate processing stages differs.
3. The method according to claim 2, wherein the intermediate layer features of the intermediate processing stage are first intermediate layer features output by the intermediate processing stage, or at least one preset type of second intermediate layer features extracted from the first intermediate layer features.
4. The method according to claim 3, wherein in the case that at least one preset type of second intermediate layer feature is obtained by extracting the first intermediate layer feature, at the intermediate processing stage, each preset type corresponds to the contribution index, and the contribution index corresponding to each preset type is obtained based on the preset type of second intermediate layer feature.
5. The method of claim 1, wherein obtaining the contribution index of the output data of the teacher model at each of the processing stages to training the student model comprises:
for each processing stage, obtaining a first optimization target of the student model in the processing stage by using the difference between the output data of the teacher model and the output data of the student model in the processing stage respectively;
determining a first parameter gradient corresponding to each processing stage based on a first optimization target of each processing stage;
and respectively obtaining the contribution indexes corresponding to the processing stages by using the first parameter gradients corresponding to the processing stages.
6. The method according to claim 5, wherein before the obtaining the contribution indicator corresponding to each of the processing stages by using the first parameter gradient corresponding to each of the processing stages, the method further comprises:
obtaining a second optimization target of the student model by using the difference between the prediction result of the student model and the label result of the first training sample;
determining a second parameter gradient of the student model based on the second optimization objective;
the obtaining the contribution index corresponding to each processing stage by respectively using the first parameter gradient corresponding to each processing stage includes:
and for each processing stage, acquiring the similarity between the second parameter gradient and the first parameter gradient corresponding to the processing stage as a contribution index corresponding to the processing stage.
7. The method of claim 6, wherein a similarity between the second parametric gradient and the first parametric gradient is determined based on an included angle between the second parametric gradient and the first parametric gradient, wherein the included angle is inversely related to the similarity.
8. The method according to claim 1, wherein the selecting at least one of the processing stages as a target stage based on the contribution indicator corresponding to each of the processing stages comprises:
and selecting the processing stage with the contribution index meeting a preset condition as the target stage.
9. The method of claim 8, wherein the predetermined condition is that the contribution indicator is greater than a predetermined indicator threshold.
10. The method of claim 1, wherein said adjusting parameters of said student model using output data of said teacher model and said student model at each of said goal phases comprises:
adjusting parameters of the student model by utilizing a first optimization objective of the student model corresponding to each target stage and a second optimization objective of the student model;
wherein the first optimization objective corresponding to each target stage is obtained by using output data of the teacher model and the student model at the target stage, and the second optimization objective is obtained by using the prediction result of the student model and the label result of the first training sample.
11. The method of claim 10, wherein the adjusting parameters of the student model using the first optimization objective and the second optimization objective of the student model corresponding to each objective stage comprises:
weighting the first optimization target and the second optimization target to obtain a third optimization target;
obtaining a third parameter gradient based on the third optimization target;
adjusting the parameter of the student model using the third parameter gradient.
12. The method of claim 1, wherein prior to said predictive processing of the first training sample using the student model and the teacher model, respectively, the method further comprises:
and training the teacher model by using the second training sample.
13. An electronic device comprising a memory and a processor coupled to each other, the memory storing program instructions, the processor being configured to execute the program instructions to implement the knowledge distillation method of any one of claims 1 to 12.
14. A computer readable storage medium storing program instructions executable by a processor for implementing the knowledge distillation method of any one of claims 1 to 12.
CN202111028205.7A 2021-09-02 2021-09-02 Knowledge distillation method, electronic device and storage medium Pending CN113887699A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111028205.7A CN113887699A (en) 2021-09-02 2021-09-02 Knowledge distillation method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111028205.7A CN113887699A (en) 2021-09-02 2021-09-02 Knowledge distillation method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN113887699A true CN113887699A (en) 2022-01-04

Family

ID=79012139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111028205.7A Pending CN113887699A (en) 2021-09-02 2021-09-02 Knowledge distillation method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113887699A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115761448A (en) * 2022-12-02 2023-03-07 美的集团(上海)有限公司 Training method and device for neural network and readable storage medium
CN115761448B (en) * 2022-12-02 2024-03-01 美的集团(上海)有限公司 Training method, training device and readable storage medium for neural network
CN116863279A (en) * 2023-09-01 2023-10-10 南京理工大学 Model distillation method for mobile terminal model light weight based on interpretable guidance
CN116863279B (en) * 2023-09-01 2023-11-21 南京理工大学 Model distillation method for mobile terminal model light weight based on interpretable guidance


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination