CN115375986A - Model distillation method and device - Google Patents

Model distillation method and device

Info

Publication number
CN115375986A
CN115375986A (application CN202210806822.3A)
Authority
CN
China
Prior art keywords
model
trained
student
teacher
distillation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210806822.3A
Other languages
Chinese (zh)
Inventor
陆强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Network Technology Shanghai Co Ltd
Original Assignee
International Network Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Network Technology Shanghai Co Ltd filed Critical International Network Technology Shanghai Co Ltd
Priority to CN202210806822.3A
Publication of CN115375986A
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a model distillation method and device. The method comprises the following steps: acquiring a trained teacher model and a student model to be trained, wherein the teacher model has more model parameters than the student model to be trained; acquiring an image to be processed and inputting it into the teacher model and the student model to be trained, to obtain the features corresponding to the teacher model and the features corresponding to the student model to be trained; determining a first mask image based on the model parameters requiring distillation, and a second mask image based on the features requiring distillation and the required weights of those features; and determining the training loss of the student model to be trained according to the first mask image, the second mask image, the model parameters of the teacher model, the model parameters of the student model to be trained, the features corresponding to the teacher model and the features corresponding to the student model to be trained, then obtaining the trained student model according to the training loss. This improves the model's performance on tasks such as detection and segmentation.

Description

Model distillation method and device
Technical Field
The invention relates to the field of model compression, in particular to a model distillation method and a model distillation device.
Background
The common approach to model distillation is for a large model (the teacher model) to distill a small model (the student model) to obtain a distillation loss, which is added when training the small model. The distillation loss is defined over the output results of the large and small models. A common distillation loss is the Kullback-Leibler divergence (K-L divergence, also called relative entropy), a way to quantify the difference between two probability distributions.
However, the output of tasks such as classification is usually a fixed-length vector, whereas the output of tasks such as detection and segmentation is usually pixel-by-pixel. The K-L divergence, commonly used as a model distillation loss, therefore works well for classification models but not necessarily for detection and segmentation tasks.
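For reference, the classic K-L divergence distillation loss described in this background can be sketched as follows. This is a minimal NumPy sketch; the temperature parameter T is a common convention in the distillation literature, not something this text specifies.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_distillation_loss(teacher_logits, student_logits, T=2.0):
    """KL(p_teacher || p_student) on temperature-softened class distributions,
    the classic distillation loss for classification tasks."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean())
```

As the background notes, this loss fits fixed-length classification outputs; the pixel-wise outputs of detection and segmentation are what motivates the masked losses introduced later.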
Disclosure of Invention
The invention provides a model distillation method and a device.
In a first aspect, the present invention provides a model distillation method comprising: obtaining a trained teacher model and a student model to be trained, wherein the teacher model has more model parameters than the student model to be trained; acquiring an image to be processed and inputting it into the teacher model and the student model to be trained, to obtain the features corresponding to the teacher model and the features corresponding to the student model to be trained; determining a first mask image based on the model parameters requiring distillation, and determining a second mask image based on the features requiring distillation and the required weights of the features requiring distillation; and determining the training loss of the student model to be trained according to the first mask image, the second mask image, the model parameters of the teacher model, the model parameters of the student model to be trained, the features corresponding to the teacher model and the features corresponding to the student model to be trained, and obtaining the trained student model according to the training loss.
Further, the inputting the image to be processed into the teacher model and the student model to be trained to obtain a feature corresponding to the teacher model and a feature corresponding to the student model to be trained includes: inputting the image to be processed into the teacher model and the student model to be trained to obtain at least one feature corresponding to the teacher model and at least one feature corresponding to the student model to be trained; selecting one feature from the at least one feature corresponding to the teacher model as the target feature corresponding to the teacher model; and selecting one feature from the at least one feature corresponding to the student model to be trained as the target feature corresponding to the student model to be trained.
Further, the determining a training loss of the student model to be trained according to the first mask image, the second mask image, the model parameters of the teacher model, the model parameters of the student model to be trained, the features corresponding to the teacher model, and the features corresponding to the student model to be trained includes: determining a first loss value according to the first mask image, model parameters of a teacher model and model parameters of the student model to be trained; determining a second loss value according to the second mask image, the target feature corresponding to the teacher model and the target feature corresponding to the student model to be trained; and determining the training loss of the student model to be trained according to the first loss value and the second loss value.
Further, the determining the training loss of the student model to be trained according to the first loss value and the second loss value includes: determining a third loss value according to the at least one characteristic corresponding to the student model to be trained; and determining the training loss of the student model to be trained according to the first loss value, the second loss value and the third loss value.
Further, the determining the training loss of the student model to be trained according to the first loss value, the second loss value and the third loss value includes: taking a weighted sum of the third loss value and the value obtained by a weighted sum of the first loss value and the second loss value, to determine the training loss of the student model to be trained.
Further, the method further comprises: and detecting or segmenting the image through the trained student model.
In a second aspect, the present invention also provides a model distillation apparatus comprising: a first processing module for acquiring a trained teacher model and a student model to be trained, wherein the teacher model has more model parameters than the student model to be trained; a second processing module for acquiring an image to be processed and inputting it into the teacher model and the student model to be trained, to obtain the features corresponding to the teacher model and the features corresponding to the student model to be trained; a third processing module for determining a first mask image based on the model parameters requiring distillation, and a second mask image based on the features requiring distillation and the required weights of the features requiring distillation; and a fourth processing module for determining the training loss of the student model to be trained according to the first mask image, the second mask image, the model parameters of the teacher model, the model parameters of the student model to be trained, the features corresponding to the teacher model and the features corresponding to the student model to be trained, and obtaining the trained student model according to the training loss.
In a third aspect, the present invention further provides an electronic device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of any of the above-mentioned model distillation methods.
In a fourth aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the model distillation method as described in any one of the above.
According to the model distillation method and apparatus, because the output of tasks such as detection and segmentation is usually pixel-by-pixel, the imbalance is addressed through the first mask image and the second mask image: the training loss of the student model to be trained is determined according to the first mask image, the second mask image, the model parameters of the teacher model, the model parameters of the student model to be trained, the features corresponding to the teacher model and the features corresponding to the student model to be trained, and the trained student model is obtained according to the training loss. At the same time, the model parameters of the teacher model and of the student model to be trained, together with the features output by the models, serve as the basis of model distillation. This solves the foreground imbalance and class imbalance problems that arise when distilling models for tasks such as detection and segmentation, markedly improves the distillation effect, and gives the trained student model better detection and segmentation performance.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow diagram of some embodiments of a model distillation process provided in accordance with the present invention;
FIG. 2 is a schematic flow diagram of determining the training loss of the student model to be trained based on model distillation;
FIG. 3 is a schematic flow chart of determining the first loss value and the second loss value based on model distillation;
FIG. 4 is a schematic block diagram of some embodiments of a model distillation apparatus provided in accordance with the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided in accordance with the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings. The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
It should be noted that the terms "first", "second", and the like in the present invention are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an" and "the" in the present invention are intended to be illustrative rather than limiting; those skilled in the art should understand them as "one or more" unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present invention are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
Referring to fig. 1, fig. 1 is a schematic flow diagram of some embodiments of the model distillation method provided by the present invention. As shown in fig. 1, the method comprises the following steps:
step 101, a trained teacher model and a student model to be trained are obtained, wherein model parameters of the teacher model are more than model parameters of the student model to be trained.
Generally, a large model is a single complex network or a collection of several networks and has good performance and generalization capability, while a small model has limited expressive capability because of its small network size. The knowledge learned by the large model can therefore be used to guide the training of the small model, so that the small model attains performance comparable to the large model with a greatly reduced number of parameters, achieving model compression and acceleration. For example, the model parameters of a trained teacher model (or big model) are used to guide the model parameters of a student model to be trained, so that the student model acquires parameters whose performance is comparable to the teacher model's, giving it a comparable model effect. This is an application of knowledge distillation and transfer learning to model optimization, also called model distillation.
102, acquiring an image to be processed, inputting the image to be processed into a teacher model and a student model to be trained, and obtaining characteristics corresponding to the teacher model and characteristics corresponding to the student model to be trained.
In some embodiments, at least one feature layer may exist in the teacher model and in the student model to be trained, and the feature obtained for each model may be the output of any one feature layer, or the outputs of one or more feature layers selected as appropriate. If the outputs of multiple feature layers are used, they can be combined by concatenation, by fusion, or by extracting features again from the concatenated features, to obtain the feature corresponding to the teacher model and the feature corresponding to the student model to be trained.
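One of the combination options named above, concatenating the outputs of several feature layers, can be sketched as follows. The shapes are purely illustrative; fusion or re-extraction from the concatenated features would be handled analogously.

```python
import numpy as np

# Two feature-layer outputs in (channels, height, width) layout.
f1 = np.zeros((4, 8, 8))   # output of feature layer 1
f2 = np.ones((6, 8, 8))    # output of feature layer 2

# Concatenate along the channel axis to form a single combined feature.
combined = np.concatenate([f1, f2], axis=0)
```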
Step 103, determining a first mask image based on the model parameters requiring distillation, and determining a second mask image based on the feature requiring distillation and the required weight of the feature requiring distillation.
A mask image occludes (wholly or partially) the image to be processed with a selected image, graphic or object, controlling which region is processed. As an example, the mask image may be a matrix in which entries equal to 0 represent the occluded region and non-zero entries mark the region to be processed.
The first mask image is used for processing model parameters of the teacher model to obtain model parameters needing distillation, and the model parameters needing distillation are the model parameters of the teacher model needing to be learned by the student model to be trained.
The second mask image acts on the features corresponding to the teacher model and the features corresponding to the student model to be trained. As an example, the second mask image may be set as a matrix in which entries equal to 0 represent the occluded region, so that the teacher-model features and student-model features at those positions are not considered, while the non-zero entries serve as weights for the corresponding teacher-model and student-model features.
And step 104, determining the training loss of the student model to be trained according to the first mask image, the second mask image, the model parameters of the teacher model, the model parameters of the student model to be trained, the characteristics of the corresponding teacher model and the characteristics of the student model to be trained, and obtaining the trained student model according to the training loss.
The essence of detection and segmentation tasks is to distinguish the background from the target, and different targets yield different dimensions in the model parameters and in the features obtained by training. Therefore, determining the training loss of the student model to be trained from the first mask image, the second mask image, the model parameters of the teacher model, the model parameters of the student model to be trained, the features corresponding to the teacher model and the features corresponding to the student model to be trained means that the teacher-model parameters to be learned (or distilled) and the corresponding teacher-model features are extracted through the first and second mask images, so that the model parameters and features of the student model to be trained can learn the performance of the teacher model.
According to the model distillation method of some embodiments of the invention, because the output of tasks such as detection and segmentation is usually pixel-by-pixel, the imbalance is addressed through the first mask image and the second mask image: the training loss of the student model to be trained is determined according to the first mask image, the second mask image, the model parameters of the teacher model, the model parameters of the student model to be trained, the features corresponding to the teacher model and the features corresponding to the student model to be trained, and the trained student model is obtained according to the training loss. The model parameters of the teacher model and of the student model to be trained, together with the features output by the models, are used simultaneously as the basis of model distillation. This solves the foreground imbalance and class imbalance problems that arise when distilling models for tasks such as detection and segmentation, markedly improves the distillation effect, and gives the trained student model better detection and segmentation performance.
In some optional implementation manners, inputting the image to be processed into the teacher model and the student model to be trained, and obtaining the features corresponding to the teacher model and the features corresponding to the student model to be trained, including: inputting the image to be processed into a teacher model and a student model to be trained to obtain at least one characteristic corresponding to the teacher model and at least one characteristic corresponding to the student model to be trained; selecting one characteristic from at least one characteristic corresponding to the teacher model as a target characteristic corresponding to the teacher model; and selecting one characteristic from at least one characteristic corresponding to the student model to be trained as a target characteristic corresponding to the student model to be trained.
In some practical application scenarios, a certain feature may be selected as the object of model distillation (usually an important feature of the model output), or multiple features may be selected. Taking the anchor-free model (a target detection model) as an example, referring to fig. 3, the feature output by the model's hm branch can be selected as the object of model distillation. The hm branch predicts the center-point position of the target and the class confidence of the target; the final target frame is obtained by finding the center positions whose confidence in the hm feature exceeds a threshold (e.g., 0.3), and then reading the predictions of the reg and offset branch features at those positions. The hm branch feature is therefore the most important output branch feature.
In addition, which parameters of the teacher model to learn can be determined case by case: when performing model distillation, the student model may only need to learn some of the teacher model's feature parameters rather than all of them. Moreover, extracting only the features the student model needs to learn also reduces computation and improves training efficiency.
In some optional implementations, determining a training loss of the student model to be trained based on the first mask image, the second mask image, the model parameters of the teacher model, the model parameters of the student model to be trained, the features corresponding to the teacher model, and the features corresponding to the student model to be trained comprises: determining a first loss value according to the first mask image, the model parameters of the teacher model and the model parameters of the student model to be trained; determining a second loss value according to the second mask image, the target features corresponding to the teacher model and the target features corresponding to the student model to be trained; and determining the training loss of the student model to be trained according to the first loss value and the second loss value.
Take the anchor-free model (a target detection model) as an example; see fig. 3. Net_t and Net_s are, respectively, the teacher model (or big model) and the student model to be trained (or small model), both based on the anchor-free model. Here hm represents the probability of the target object's center, reg represents the size of the target object, and offset represents the offset of the center.
For example, denote the first loss value kd_loss_1; its loss calculation formula is:

kd_loss_1 = MSE(mask_1 * layer - mask_1 * layer_s)    formula (1)

where MSE is the mean square error, layer is a feature layer in the large model, layer_s is the corresponding feature layer in the small model, and mask_1 is a binary mask image (or matrix).
mask_1 is generated as follows: its initial size matches that of the hm feature map, and it is then scaled and dimension-expanded to match the dimensions of the feature layer to be distilled. The value in the central region of the target frame in mask_1 is 1, and the values at all other positions are 0. The central region is determined by taking the center point of the target frame as the scaling center and shrinking the target frame to 1/4 of its original area; the shrunken frame's footprint is the central region. mask_1 is thus used to determine which parameters need to be distilled.
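A sketch of this mask_1 generation for a single target frame. The function and argument names are assumptions; the scaling and dimension expansion to the distilled feature layer are omitted for brevity. Note that shrinking a frame to 1/4 of its area means halving both its width and its height.

```python
import numpy as np

def make_mask_1(h, w, box):
    """Binary mask of shape (h, w): 1 inside the central region of the
    target frame, 0 elsewhere. The central region is the frame shrunk
    about its center to 1/4 of its original area (half width, half height).
    `box` is (x1, y1, x2, y2) in feature-map coordinates."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    bw, bh = (x2 - x1) / 2.0, (y2 - y1) / 2.0   # halved size -> 1/4 area
    mask = np.zeros((h, w))
    mask[int(cy - bh / 2):int(np.ceil(cy + bh / 2)),
         int(cx - bw / 2):int(np.ceil(cx + bw / 2))] = 1.0
    return mask
```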
By way of example, referring to FIG. 3, denote the second loss value kd_loss_2; its loss calculation formula is:

kd_loss_2 = MSE(mask_2 * layer - mask_2 * layer_s)    formula (2)

where MSE is the mean square error, layer is the feature layer of the large model's hm branch (the target feature of the teacher model), layer_s is the feature layer of the small model's hm branch (the target feature of the student model to be trained), and mask_2 is a mask image (or matrix).
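Formulas (1) and (2) share the same masked mean-square-error form, which can be sketched as follows (a NumPy sketch; the function name is an assumption):

```python
import numpy as np

def masked_mse(mask, layer_t, layer_s):
    """Masked distillation loss of formulas (1)/(2):
    MSE(mask * layer - mask * layer_s), the mean squared difference
    between teacher and student features inside the unmasked region."""
    diff = mask * np.asarray(layer_t, dtype=float) \
         - mask * np.asarray(layer_s, dtype=float)
    return float(np.mean(diff ** 2))
```

With mask_1 this gives kd_loss_1 over a feature layer, and with mask_2 it gives kd_loss_2 over the hm branch outputs.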
For kd_loss_2, to address the class-imbalance problem, one can choose which classes to distill (i.e., the channels corresponding to those classes in the hm branch features).
mask_2 is generated as follows: mask_2 matches the dimensions of hm. For classes to be distilled (usually classes on which the small model performs poorly and which are more important), i.e., the channels in mask_2 corresponding to those classes, values are assigned as follows: within the central region of the target frame, the value takes the form of a probability map, higher (closer to 1) near the center point and lower farther away, with values between 0 and 1; the positions outside the central region have a value of 0. The central region is generated in the same way as for mask_1. For classes not being distilled, i.e., the channels in mask_2 corresponding to those classes, the value is 0.
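A sketch of the mask_2 construction. A Gaussian-shaped bump is one plausible realization of "higher near the center, lower farther away"; the sigma parameter is an assumption, and the hard zeroing outside the central region (generated as for mask_1) is omitted here for brevity.

```python
import numpy as np

def make_mask_2(h, w, num_classes, distill_classes, center, sigma=1.0):
    """Per-class mask of shape (num_classes, h, w) matching the hm output.
    Channels of classes selected for distillation get a probability map
    that peaks at the target center and decays with distance (values in
    (0, 1]); channels of non-distilled classes stay 0."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = center
    bump = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    mask = np.zeros((num_classes, h, w))
    for c in distill_classes:
        mask[c] = bump
    return mask
```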
This mask-image design addresses the pixel-level foreground/background imbalance of detection and segmentation models by distilling only certain areas within the foreground region, acting as an attention mechanism. Designing different mask images for screening the distillation regions in the feature layer and the network output layer (i.e., distinguishing the first mask image from the second mask image), and distilling only the classes that are poorly represented and more important, addresses the class-imbalance problem. Thus, in some embodiments, images may be detected or segmented by the trained student model.
The training loss of the student model to be trained is determined from the first loss value and the second loss value; it can be a distillation loss value, denoted kd_loss:

kd_loss = k * kd_loss_1 + kd_loss_2    formula (3)

where k = 1/(2 * n) and n is the number of distilled feature layers in kd_loss_1. kd_loss_1 is the first loss value, obtained by distilling the feature map. kd_loss_2 is the second loss value, obtained by distilling the hm branch (i.e., the hm branch's output feature map) in the head (the head consists of several convolution layers after the decoder structure and produces the final output branches of target detection, such as the hm and reg branches).
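Formula (3) can be sketched as follows, assuming kd_loss_1 is accumulated per distilled feature layer so that n is recoverable; the names are illustrative.

```python
def total_distillation_loss(kd_loss_1_per_layer, kd_loss_2):
    """Formula (3): kd_loss = k * kd_loss_1 + kd_loss_2, with
    k = 1/(2*n), n the number of distilled feature layers.
    kd_loss_1 is the sum of the per-layer masked MSE values."""
    n = len(kd_loss_1_per_layer)
    k = 1.0 / (2 * n)
    return k * sum(kd_loss_1_per_layer) + kd_loss_2
```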
The first loss value and the second loss value are obtained by model distillation, and the student model to be trained is trained by utilizing the loss value obtained by model distillation, so that the student model to be trained can more fully learn the performance of the teacher model.
In some optional implementations, determining a training loss of the student model to be trained according to the first loss value and the second loss value includes: determining a third loss value according to at least one characteristic corresponding to the student model to be trained; and determining the training loss of the student model to be trained according to the first loss value, the second loss value and the third loss value.
Referring to fig. 3, the third loss value is obtained by a weighted sum over the hm, reg and offset branches. The weights of the first, second and third loss values may be determined case by case. Determining the training loss of the student model to be trained from the first, second and third loss values lets the student model learn both the model-distillation loss and its own training loss, so that it takes its own training situation into account, which aids convergence.
In some optional implementations, determining the training loss of the student model to be trained according to the first loss value, the second loss value, and the third loss value includes: taking a weighted sum of the third loss value and the value obtained by a weighted sum of the first loss value and the second loss value, to determine the training loss of the student model to be trained.
Referring to fig. 2, kd represents distillation of a feature layer of the student model by the teacher model, kd_loss represents the total loss value of model distillation, and total loss represents the student model's own training loss plus the total model-distillation loss. Continuing the example above, kd_loss is determined first; then the weights of kd_loss and the third loss value (denoted loss in the figure) are determined according to the specific situation, and the training loss of the student model to be trained is their weighted sum. By setting the proportion of each loss value through weighted summation according to training needs, the model can preferentially learn a particular loss value, achieving model optimization.
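The final combination of the distillation loss and the student model's own task loss can be sketched as below; the weight values are placeholders to be chosen per training need, as the text says, not values from this document.

```python
def total_training_loss(kd_loss, student_loss, w_kd=1.0, w_task=1.0):
    """total loss = weighted sum of the model-distillation loss (kd_loss)
    and the student model's own training loss (the third loss value)."""
    return w_kd * kd_loss + w_task * student_loss
```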
In some alternative implementations, the dimensions of the first mask image are consistent with those of the features corresponding to the teacher model, and the dimensions of the second mask image are consistent with those of the features corresponding to the student model to be trained. This ensures that the first mask image and the second mask image can fully cover the respective features.
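A minimal sketch of this dimension constraint, assuming 2-D feature maps represented as nested lists (the helper names are hypothetical and masking is taken to be an element-wise product, which the disclosure does not mandate):

```python
# Hedged sketch: a mask image can cover a feature map only when their
# dimensions agree; masking is then an element-wise product.

def dims_match(mask, feature):
    # both arguments are [H][W] nested lists
    return len(mask) == len(feature) and all(
        len(mrow) == len(frow) for mrow, frow in zip(mask, feature))

def apply_mask(mask, feature):
    assert dims_match(mask, feature), "mask must cover the feature"
    return [[m * f for m, f in zip(mrow, frow)]
            for mrow, frow in zip(mask, feature)]
```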
Referring to fig. 4, fig. 4 is a schematic structural diagram of some embodiments of a model distillation apparatus according to the present invention. As an implementation of the methods shown in the above figures, the present invention further provides some embodiments of a model distillation apparatus; these embodiments correspond to the method embodiments shown in fig. 1, and the apparatus can be applied to various electronic devices.
As shown in fig. 4, the model distillation apparatus of some embodiments includes a first processing module 401, a second processing module 402, a third processing module 403 and a fourth processing module 404: the first processing module 401 is configured to obtain a trained teacher model and a student model to be trained, where the teacher model has more model parameters than the student model to be trained; the second processing module 402 is configured to obtain an image to be processed and input the image to be processed into the teacher model and the student model to be trained, obtaining features corresponding to the teacher model and features corresponding to the student model to be trained; the third processing module 403 is configured to determine a first mask image based on the model parameters requiring distillation, and a second mask image based on the features requiring distillation and the desired weights for those features; the fourth processing module 404 is configured to determine a training loss of the student model to be trained according to the first mask image, the second mask image, the model parameters of the teacher model, the model parameters of the student model to be trained, the features corresponding to the teacher model and the features corresponding to the student model to be trained, and to obtain the trained student model according to the training loss.
In an optional implementation manner of some embodiments, the second processing module is further configured to: inputting the image to be processed into the teacher model and the student model to be trained to obtain at least one characteristic corresponding to the teacher model and at least one characteristic corresponding to the student model to be trained; selecting one characteristic from at least one characteristic corresponding to the teacher model as a target characteristic corresponding to the teacher model; and selecting one feature from at least one feature corresponding to the student model to be trained as a target feature corresponding to the student model to be trained.
In an optional implementation manner of some embodiments, the fourth processing module is further configured to: determining a first loss value according to the first mask image, the model parameters of the teacher model and the model parameters of the student model to be trained; determining a second loss value according to the second mask image, the target characteristics corresponding to the teacher model and the target characteristics corresponding to the student model to be trained; and determining the training loss of the student model to be trained according to the first loss value and the second loss value.
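To make the masked loss computation concrete: a sketch under the assumption that the masked discrepancy is a mean-squared difference over the positions selected by the mask. The loss form and the helper name are illustrative assumptions; the disclosure does not fix a particular distance measure:

```python
# Hedged sketch: one possible masked discrepancy between teacher and
# student quantities (parameters for the first loss value, target
# features for the second loss value).

def masked_mse(mask, teacher_vals, student_vals):
    # mask, teacher_vals, student_vals: flattened sequences of equal length;
    # only positions where the mask is non-zero contribute to the loss
    num = sum(m * (t - s) ** 2
              for m, t, s in zip(mask, teacher_vals, student_vals))
    den = sum(mask)
    return num / den if den else 0.0

# first loss value: first mask image over model parameters requiring distillation
first_loss = masked_mse([1, 1, 0], [0.5, 1.0, 9.0], [0.5, 0.0, 0.0])
# second loss value: second mask image over the target features; the second
# mask may carry fractional weights to emphasise some features over others
second_loss = masked_mse([0.5, 1.0], [2.0, 2.0], [0.0, 1.0])
```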
In an optional implementation manner of some embodiments, the fourth processing module is further configured to: determining a third loss value according to at least one characteristic corresponding to the student model to be trained; and determining the training loss of the student model to be trained according to the first loss value, the second loss value and the third loss value.
In an optional implementation manner of some embodiments, the fourth processing module is further configured to: and weighting and summing the first loss value and the second loss value to determine the training loss of the student model to be trained.
In an alternative implementation of some embodiments, the dimensions of the first mask image are consistent with those of the features corresponding to the teacher model, and the dimensions of the second mask image are consistent with those of the features corresponding to the student model to be trained.
In an optional implementation manner of some embodiments, the apparatus further includes a fifth processing module, configured to: and detecting or segmenting the image through the trained student model.
It will be appreciated that the modules described in the apparatus correspond to the steps of the method described with reference to fig. 1. Therefore, the operations, features and advantages described above for the method also apply to the apparatus and the modules and units included therein, and are not described herein again.
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor) 510, a communication Interface (Communications Interface) 520, a memory (memory) 530, and a communication bus 540, wherein the processor 510, the communication Interface 520, and the memory 530 communicate with each other via the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform a model distillation method comprising: acquiring a trained teacher model and a student model to be trained, wherein model parameters of the teacher model are more than those of the student model to be trained; acquiring an image to be processed, and inputting the image to be processed into a teacher model and a student model to be trained to obtain characteristics corresponding to the teacher model and characteristics corresponding to the student model to be trained; determining a first mask image based on model parameters requiring distillation, and determining a second mask image based on features requiring distillation and desired weights for the features requiring distillation; determining the training loss of the student model to be trained according to the first mask image, the second mask image, the model parameters of the teacher model, the model parameters of the student model to be trained, the characteristics corresponding to the teacher model and the characteristics corresponding to the student model to be trained, and obtaining the trained student model according to the training loss.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the above methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the model distillation method provided by the above embodiments, the method comprising: acquiring a trained teacher model and a student model to be trained, wherein the teacher model has more model parameters than the student model to be trained; acquiring an image to be processed, and inputting the image to be processed into the teacher model and the student model to be trained to obtain features corresponding to the teacher model and features corresponding to the student model to be trained; determining a first mask image based on model parameters requiring distillation, and determining a second mask image based on features requiring distillation and desired weights for the features requiring distillation; determining the training loss of the student model to be trained according to the first mask image, the second mask image, the model parameters of the teacher model, the model parameters of the student model to be trained, the features corresponding to the teacher model and the features corresponding to the student model to be trained, and obtaining the trained student model according to the training loss.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the model distillation method provided above, the method comprising: acquiring a trained teacher model and a student model to be trained, wherein the teacher model has more model parameters than the student model to be trained; acquiring an image to be processed, and inputting the image to be processed into the teacher model and the student model to be trained to obtain features corresponding to the teacher model and features corresponding to the student model to be trained; determining a first mask image based on model parameters requiring distillation, and determining a second mask image based on features requiring distillation and desired weights for the features requiring distillation; determining the training loss of the student model to be trained according to the first mask image, the second mask image, the model parameters of the teacher model, the model parameters of the student model to be trained, the features corresponding to the teacher model and the features corresponding to the student model to be trained, and obtaining the trained student model according to the training loss.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments or parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A model distillation method, comprising:
acquiring a trained teacher model and a student model to be trained, wherein model parameters of the teacher model are more than those of the student model to be trained;
acquiring an image to be processed, and inputting the image to be processed into the teacher model and the student model to be trained to obtain the characteristics corresponding to the teacher model and the characteristics corresponding to the student model to be trained;
determining a first mask image based on model parameters requiring distillation, and determining a second mask image based on features requiring distillation and desired weights for the features requiring distillation;
and determining the training loss of the student model to be trained according to the first mask image, the second mask image, the model parameters of the teacher model, the model parameters of the student model to be trained, the characteristics corresponding to the teacher model and the characteristics corresponding to the student model to be trained, and obtaining the trained student model according to the training loss.
2. The model distillation method according to claim 1, wherein the inputting the image to be processed into the teacher model and the student model to be trained to obtain features corresponding to the teacher model and features corresponding to the student model to be trained comprises:
inputting the image to be processed into the teacher model and the student model to be trained to obtain at least one characteristic corresponding to the teacher model and at least one characteristic corresponding to the student model to be trained;
selecting one characteristic from at least one characteristic corresponding to the teacher model as the target characteristic corresponding to the teacher model; and selecting one feature from at least one feature corresponding to the student model to be trained as a target feature corresponding to the student model to be trained.
3. The model distillation method according to claim 2, wherein the determining of the training loss of the student model to be trained from the first mask image, the second mask image, the model parameters of the teacher model, the model parameters of the student model to be trained, the features corresponding to the teacher model, and the features corresponding to the student model to be trained comprises:
determining a first loss value according to the first mask image, model parameters of a teacher model and model parameters of the student model to be trained;
determining a second loss value according to the second mask image, the target features corresponding to the teacher model and the target features corresponding to the student model to be trained;
and determining the training loss of the student model to be trained according to the first loss value and the second loss value.
4. The model distillation method according to claim 3, wherein said determining a training loss of the student model to be trained from the first loss value and the second loss value comprises:
determining a third loss value according to all the characteristics corresponding to the student model to be trained;
and determining the training loss of the student model to be trained according to the first loss value, the second loss value and the third loss value.
5. The model distillation method as claimed in claim 4, wherein said determining a training loss of the student model to be trained from the first loss value, the second loss value and the third loss value comprises:
and weighting and summing the first loss value and the second loss value to determine the training loss of the student model to be trained.
6. The model distillation method as claimed in claim 1, further comprising: and detecting or segmenting the image through the trained student model.
7. A model distillation apparatus, comprising:
the first processing module is used for acquiring a trained teacher model and a student model to be trained, wherein model parameters of the teacher model are more than those of the student model to be trained;
the second processing module is used for acquiring images to be processed, inputting the images to be processed into the teacher model and the student models to be trained, and obtaining characteristics corresponding to the teacher model and characteristics corresponding to the student models to be trained;
a third processing module for determining a first mask image based on the model parameters requiring distillation, and a second mask image based on the features requiring distillation and the desired weights for the features requiring distillation;
and the fourth processing module is used for determining the training loss of the student model to be trained according to the first mask image, the second mask image, the model parameters of the teacher model, the model parameters of the student model to be trained, the characteristics corresponding to the teacher model and the characteristics corresponding to the student model to be trained, and obtaining the trained student model according to the training loss.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of the model distillation method as claimed in any one of claims 1 to 6.
9. A non-transitory computer readable storage medium, having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the model distillation method of any of claims 1 to 6.
CN202210806822.3A 2022-07-08 2022-07-08 Model distillation method and device Pending CN115375986A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210806822.3A CN115375986A (en) 2022-07-08 2022-07-08 Model distillation method and device


Publications (1)

Publication Number Publication Date
CN115375986A true CN115375986A (en) 2022-11-22

Family

ID=84061081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210806822.3A Pending CN115375986A (en) 2022-07-08 2022-07-08 Model distillation method and device

Country Status (1)

Country Link
CN (1) CN115375986A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116204770A (en) * 2022-12-12 2023-06-02 中国公路工程咨询集团有限公司 Training method and device for detecting abnormality of bridge health monitoring data
CN116204770B (en) * 2022-12-12 2023-10-13 中国公路工程咨询集团有限公司 Training method and device for detecting abnormality of bridge health monitoring data
CN116091773A (en) * 2023-02-02 2023-05-09 北京百度网讯科技有限公司 Training method of image segmentation model, image segmentation method and device
CN116091773B (en) * 2023-02-02 2024-04-05 北京百度网讯科技有限公司 Training method of image segmentation model, image segmentation method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination