CN113642605A

CN113642605A - Model distillation method, device, electronic equipment and storage medium

Info

Publication number: CN113642605A
Application number: CN202110778166.6A
Authority: CN
Inventors: 戴兵
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-07-09
Filing date: 2021-07-09
Publication date: 2021-11-12

Abstract

The present disclosure provides a model distillation method and apparatus, an electronic device, and a storage medium, relating to artificial intelligence fields such as deep learning and computer vision. The method may comprise: respectively acquiring a first score output by the last layer of a teacher model and a second score output by the last layer of a student model; determining a Jensen-Shannon (JS) divergence according to the first score and the second score; and performing distillation training on the student model according to the JS divergence. Applying the disclosed scheme can improve the accuracy of the resulting student model.

Description

Model distillation method, device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, in particular to the fields of deep learning and computer vision, and more specifically to a model distillation method and apparatus, an electronic device, and a storage medium.

Background

When an image classification model is trained, distillation can be used, that is, a large model guides the training of a small model. The large model is generally called the teacher model and the small model is generally called the student model. The resulting student model is faster at inference and, having learned the capability of the teacher model, also achieves higher accuracy.

In existing distillation approaches, the Kullback-Leibler (KL) divergence is usually calculated from the score output by the last layer of the teacher model and the score output by the last layer of the student model. In practice, however, this approach often performs poorly: the student model has difficulty learning the capability of the teacher model, so its accuracy is low.

Disclosure of Invention

The disclosure provides a model distillation method, a model distillation device, an electronic device and a storage medium.

According to one aspect of the present disclosure, there is provided a model distillation method comprising:

respectively acquiring a first score output by the last layer of the teacher model and a second score output by the last layer of the student model;

determining a Jensen-Shannon (JS) divergence according to the first score and the second score;

and performing distillation training on the student model according to the JS divergence.

According to an aspect of the present disclosure, there is provided a model distillation apparatus comprising an acquisition module, a determining module, and a training module;

the acquisition module is used for respectively acquiring a first score output by the last layer of the teacher model and a second score output by the last layer of the student model;

the determining module is used for determining a Jensen-Shannon (JS) divergence according to the first score and the second score;

the training module is used for carrying out distillation training on the student model according to the JS divergence.

According to an aspect of the present disclosure, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described above.

According to an aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method as described above.

According to an aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method as described above.

One embodiment of the above disclosure has the following advantages or benefits: when a teacher model is used to perform distillation training on a student model, JS divergence can be used in place of KL divergence. Compared with KL divergence, JS divergence gives a better distillation effect, which improves the accuracy of the resulting student model and makes it possible to obtain a student model that balances accuracy and speed.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a flow diagram of a first embodiment of a model distillation method according to the present disclosure;

FIG. 2 is a schematic diagram of a process for obtaining a new training picture according to the present disclosure;

FIG. 3 is a flow diagram of a second embodiment of a model distillation method according to the present disclosure;

FIG. 4 is a schematic view of an acquisition process of JS divergence in accordance with the present disclosure;

FIG. 5 is a schematic diagram of the composition of an embodiment 500 of a model distillation apparatus according to the present disclosure;

FIG. 6 illustrates a schematic block diagram of an electronic device 600 that may be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects, meaning that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.

FIG. 1 is a flow diagram of a first embodiment of a model distillation method according to the present disclosure. As shown in fig. 1, the following specific implementation steps are included.

In step 101, a first score output by the last layer of the teacher model and a second score output by the last layer of the student model are obtained respectively.

In step 102, the Jensen-Shannon (JS) divergence is determined according to the obtained first score and the obtained second score.

In step 103, the student model is distillation trained according to the determined JS divergence.

It can be seen that, in the scheme of the above method embodiment, when the teacher model is used to perform distillation training on the student model, JS divergence can be used in place of KL divergence. Compared with KL divergence, JS divergence gives a better distillation effect, which improves the accuracy of the resulting student model and makes it possible to obtain a student model that balances accuracy and speed.

In an image classification scenario, the student model may be a classic image classification model, namely a deep residual network (ResNet) 60 model, and the teacher model may be a more complex model, such as a ResNet101 model, which has more convolutional layers and therefore a stronger learning ability. For the sake of distinction, the score output by the last layer of the teacher model and the score output by the last layer of the student model are referred to as the first score and the second score, respectively. The score refers to the confidence score of the input image for each category.

Before the teacher model is used for distillation training of the student model, the teacher model can be obtained through pre-training, that is, trained using training samples (such as training pictures).

Preferably, the teacher model can be trained in a data enhancement mode, so that the accuracy of the obtained teacher model is improved.

The data enhancement mode is not limited and may be, for example, the CutMix mode. CutMix works as follows: for two training pictures used in training, a part is cut out of one picture and used to cover the corresponding area of the other picture, yielding a new training picture. FIG. 2 is a schematic diagram of the process of obtaining a new training picture according to the present disclosure.
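The CutMix step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: images are modeled as nested lists of pixel values, the function name `cutmix` and its `top`, `left`, `height`, and `width` parameters are hypothetical, and the handling of labels for the new picture is left out because the text does not describe it.

```python
def cutmix(img_a, img_b, top, left, height, width):
    """Return a copy of img_a whose rectangular region starting at
    (top, left) with the given height and width is covered by the
    same region cut out of img_b.

    Both images are lists of rows and must have identical shapes.
    """
    if len(img_a) != len(img_b) or len(img_a[0]) != len(img_b[0]):
        raise ValueError("images must have the same shape")
    mixed = [row[:] for row in img_a]  # copy so img_a is left untouched
    for r in range(top, min(top + height, len(img_a))):
        for c in range(left, min(left + width, len(img_a[0]))):
            mixed[r][c] = img_b[r][c]  # paste the cut-out pixel
    return mixed


# Two toy 4x4 "pictures": one all zeros, one all ones.
img_a = [[0] * 4 for _ in range(4)]
img_b = [[1] * 4 for _ in range(4)]

# Cover the central 2x2 area of img_a with the same area of img_b.
new_img = cutmix(img_a, img_b, top=1, left=1, height=2, width=2)
```

In practice the position and size of the cut-out region would typically be drawn at random for each pair of pictures; fixed values are used here only to keep the example deterministic.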

Through this processing, the sample space can be enlarged, which improves the accuracy of the teacher model.

When the teacher model is used for distillation training of the student model, the gradient of the teacher model needs to be fixed, that is, the parameters of the teacher model are not updated, so that the teacher model remains unchanged during training. The first score output by the last layer of the teacher model and the second score output by the last layer of the student model can then be obtained respectively. Typically, the last layers of both the teacher model and the student model are fully connected layers.

According to the obtained first score output by the last layer of the teacher model and the obtained second score output by the last layer of the student model, the JS divergence can then be determined.

As described above, in the conventional approach, the KL divergence is generally calculated using the score output by the last layer of the teacher model and the score output by the last layer of the student model. KL divergence measures the difference between two score distributions: the smaller its value, the smaller the difference between the score distributions and the closer the prediction distributions of the student model and the teacher model.
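For reference, the standard definition of KL divergence for two discrete distributions P and Q over the categories i is given below. It assumes that the scores have been normalized into probability distributions (for example by a softmax), which the text does not state explicitly.

$$KL(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$$

The value is zero when P and Q are identical and grows as they diverge. In general KL(P‖Q) and KL(Q‖P) differ, which is the asymmetry the JS divergence is meant to remove.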

In the scheme of the present disclosure, the mean value of the first score and the second score can be obtained, a first KL divergence can be determined according to the first score and the mean value, a second KL divergence can be determined according to the second score and the mean value, and the required JS divergence can then be determined according to the first KL divergence and the second KL divergence.

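Accordingly, the JS divergence may be defined as follows:

$$JS(P_1 \,\|\, P_2) = \frac{1}{2} KL\left(P_1 \,\Big\|\, \frac{P_1 + P_2}{2}\right) + \frac{1}{2} KL\left(P_2 \,\Big\|\, \frac{P_1 + P_2}{2}\right) \qquad (1)$$

wherein $P_1$ represents the first score output by the last layer of the teacher model, $P_2$ represents the second score output by the last layer of the student model, $KL\left(P_1 \,\|\, \frac{P_1 + P_2}{2}\right)$ represents the first KL divergence, $KL\left(P_2 \,\|\, \frac{P_1 + P_2}{2}\right)$ represents the second KL divergence, and $JS(P_1 \,\|\, P_2)$ represents the JS divergence.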

JS divergence is a variant of KL divergence that overcomes problems such as the asymmetry of KL divergence. It allows the student model to learn the distribution of the teacher model well, so that the student model achieves essentially the same accuracy as the teacher model; that is, compared with KL divergence, JS divergence gives a better distillation effect.
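The computation described above (mean of the two scores, two KL divergences against that mean, then their combination) can be sketched as follows. This is an illustrative sketch rather than the implementation of the disclosure: the function names are hypothetical, the example scores are made up, and converting raw last-layer scores to probabilities with a softmax is an assumption, since the text does not say how the scores are normalized.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution (assumed step)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 add 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p1, p2):
    """JS divergence: half the first KL plus half the second KL,
    each taken against the mean of the two distributions."""
    mean = [(a + b) / 2 for a, b in zip(p1, p2)]
    first_kl = kl_divergence(p1, mean)
    second_kl = kl_divergence(p2, mean)
    return 0.5 * first_kl + 0.5 * second_kl


# Made-up last-layer scores for a 3-class problem.
teacher_scores = [4.0, 1.0, 0.5]
student_scores = [2.0, 1.5, 1.0]
p1 = softmax(teacher_scores)
p2 = softmax(student_scores)

js_12 = js_divergence(p1, p2)
js_21 = js_divergence(p2, p1)  # same value: JS is symmetric
kl_12 = kl_divergence(p1, p2)
kl_21 = kl_divergence(p2, p1)  # differs from kl_12: KL is asymmetric
```

Swapping the arguments leaves the JS value unchanged but changes the KL value, which is the asymmetry mentioned above. With the natural logarithm, the JS divergence is also bounded between 0 and ln 2.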

In practical applications, the number of teacher models may be one or more.

If there is one teacher model, the JS divergence can be calculated directly according to formula (1) from the obtained first score output by the last layer of the teacher model and the obtained second score output by the last layer of the student model.

If there is more than one teacher model, the first score output by the last layer of each teacher model can be obtained, the mean value of these first scores can be calculated, and the JS divergence can be calculated according to formula (1) from this mean value and the second score output by the last layer of the student model.

For example, if the number of teacher models is 3, referred to for convenience as teacher model 1, teacher model 2, and teacher model 3, then the score P11 output by the last layer of teacher model 1, the score P12 output by the last layer of teacher model 2, and the score P13 output by the last layer of teacher model 3 can be obtained respectively, the mean value of P11, P12, and P13 can be calculated and taken as P₁ in formula (1), and the JS divergence can then be calculated according to formula (1).

Through the above processing, the resulting mean value integrates the outputs of multiple teacher models and therefore offers higher accuracy than a single teacher model.
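The averaging of several teacher outputs can be illustrated with a short sketch. The helper name `average_scores` and the numeric values are hypothetical, and the scores are assumed to be already normalized to probabilities so that their element-wise mean is again a valid distribution.

```python
def average_scores(score_lists):
    """Element-wise mean of the score vectors from several teachers."""
    n = len(score_lists)
    return [sum(values) / n for values in zip(*score_lists)]


# Made-up per-class probabilities from three teacher models.
p11 = [0.7, 0.2, 0.1]  # teacher model 1
p12 = [0.6, 0.3, 0.1]  # teacher model 2
p13 = [0.8, 0.1, 0.1]  # teacher model 3

# The mean plays the role of the first score in the JS divergence.
p1 = average_scores([p11, p12, p13])
```

The averaged vector is then used in place of a single teacher's score when computing the JS divergence against the student's score.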

When there is more than one teacher model, the teacher models may have the same structure or different structures. For example, when the number of teacher models is 3, the 3 teacher models can be a ResNet101 model, a ResNeSt269 model, and a Res2Net101 model, respectively, where ResNeSt and Res2Net are variants of the ResNet series that are relatively complex and have relatively high accuracy.

When there is more than one teacher model, the specific number can be determined according to actual needs. Usually 2 to 3 teacher models are used, since adding more teacher models yields only limited further improvement.

After the JS divergence is obtained, it can be used for distillation training of the student model, that is, the JS divergence serves as the loss for distillation training.

Preferably, the student model can be subjected to distillation training in a data enhancement mode so as to improve the accuracy of the obtained student model. The data enhancement mode is not limited, and may be, for example, a CutMix mode.

Preferably, the student model can also be distillation trained using a first learning rate that is less than the original learning rate of the student model. Taking the ResNet60 student model as an example, the original learning rate is usually 0.01; in the scheme of the present disclosure, the learning rate of the student model can be reduced by a factor of 10, namely to 0.001, during distillation training.

Through this processing, oscillation of the loss can be reduced, so that training converges more easily to a suitable point, improving training efficiency and training effect.

The above reduction of the learning rate of the student model by a factor of 10 is only an example and does not limit the technical solution of the present disclosure. In practical applications, the specific value of the first learning rate may be determined according to actual needs.

Based on the above description, FIG. 3 is a flow chart of a second embodiment of the model distillation method of the present disclosure. Assuming that the number of teacher models in this embodiment is 3, which are teacher model 1, teacher model 2, and teacher model 3, the following implementation steps may be included as shown in fig. 3.

In step 301, teacher model 1, teacher model 2, and teacher model 3 are obtained through training.

Teacher model 1, teacher model 2, and teacher model 3 can be obtained by training in a data enhancement mode, which can be the CutMix mode.

In step 302, when the teacher models are used to perform distillation training on the student model, the score P11 output by the last layer of teacher model 1, the score P12 output by the last layer of teacher model 2, the score P13 output by the last layer of teacher model 3, and the score P2 output by the last layer of the student model are obtained.

When the teacher models are used for distillation training of the student model, the gradient of each teacher model needs to be fixed, that is, its parameters are not updated, so that the teacher model remains unchanged during training. Typically, the last layers of both the teacher models and the student model are fully connected layers.

In step 303, the mean P1 of P11, P12, and P13 is calculated.

That is, P1 = (P11 + P12 + P13)/3.

In step 304, JS divergence is calculated from P1 and P2.

The JS divergence can be calculated from P1 and P2 as shown in equation (1).

In step 305, the student model is distillation trained according to the JS divergence.

After the JS divergence is obtained, the student model can be subjected to distillation training by utilizing the JS divergence.

In addition, the student model can be subjected to distillation training in a data enhancement mode, and the data enhancement mode can be a CutMix mode.

Furthermore, the student model may be distillation trained using a first learning rate that is less than the original learning rate of the student model. Taking the ResNet60 student model as an example, the original learning rate is usually 0.01; in the scheme described in this embodiment, the learning rate of the student model can be reduced by a factor of 10, namely to 0.001, during distillation training.
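Steps 304 and 305 can be illustrated with a deliberately tiny training loop that uses the JS divergence as the loss. This is a toy sketch under strong assumptions, not the training procedure of the disclosure: the "student" is reduced to a vector of trainable logits for one input, the teacher distribution is a made-up fixed vector, gradients are estimated by finite differences instead of backpropagation, and the learning rate of 0.5 is chosen for this toy problem only (it is unrelated to the 0.001 mentioned above). All function names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p1, p2):
    mean = [(a + b) / 2 for a, b in zip(p1, p2)]
    return 0.5 * kl_divergence(p1, mean) + 0.5 * kl_divergence(p2, mean)


def js_loss(student_logits, teacher_probs):
    """Distillation loss: JS divergence between teacher and student."""
    return js_divergence(teacher_probs, softmax(student_logits))


def numerical_grad(f, params, eps=1e-6):
    """Central-difference estimate of the gradient of f at params."""
    grad = []
    for i in range(len(params)):
        up, down = params[:], params[:]
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


teacher_probs = [0.7, 0.2, 0.1]   # fixed: the teacher is never updated
student_logits = [0.0, 0.0, 0.0]  # stand-in for the student's parameters
learning_rate = 0.5               # toy value for this example only

losses = []
for _ in range(200):
    losses.append(js_loss(student_logits, teacher_probs))
    grad = numerical_grad(lambda s: js_loss(s, teacher_probs), student_logits)
    # Gradient descent step on the student only.
    student_logits = [s - learning_rate * g for s, g in zip(student_logits, grad)]

student_probs = softmax(student_logits)
```

Only the student's parameters change inside the loop, mirroring the requirement that the teacher stays fixed. As the loss falls, `student_probs` moves toward `teacher_probs`.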

Corresponding to the embodiment shown in fig. 3, fig. 4 is a schematic diagram of the process of obtaining the JS divergence according to the present disclosure. For the specific implementation, refer to the related description above, which is not repeated here.

It is noted that while for simplicity of explanation, the foregoing method embodiments are described as a series of acts, those skilled in the art will appreciate that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required for the disclosure. In addition, for parts which are not described in detail in a certain embodiment, reference may be made to relevant descriptions in other embodiments.

In short, compared with existing approaches, the scheme of the method embodiments of the present disclosure significantly improves the accuracy of the resulting student model without much extra overhead, while ensuring that the student model retains a high inference speed, so that a student model balancing accuracy and speed can be obtained.

The above is a description of embodiments of the method, and the embodiments of the apparatus are further described below.

Fig. 5 is a schematic diagram of the composition of an embodiment 500 of a model distillation apparatus according to the present disclosure. As shown in fig. 5, the apparatus includes: an obtaining module 501, a determining module 502, and a training module 503.

The obtaining module 501 is configured to obtain a first score output by the last layer of the teacher model and a second score output by the last layer of the student model respectively.

The determining module 502 is configured to determine the JS divergence according to the obtained first score and the obtained second score.

The training module 503 is configured to perform distillation training on the student model according to the determined JS divergence.

It can be seen that, in the scheme of the above apparatus embodiment, when the teacher model is used to perform distillation training on the student model, JS divergence can be used in place of KL divergence. Compared with KL divergence, JS divergence gives a better distillation effect, which improves the accuracy of the resulting student model and makes it possible to obtain a student model that balances accuracy and speed.

In an image classification scenario, the student model can be a classic image classification model, namely a ResNet60 model, and the teacher model can be a more complex model, such as a ResNet101 model, which has more convolutional layers and therefore a stronger learning ability.

Before the teacher model is used for performing distillation training on the student model, the obtaining module 501 may first train to obtain the teacher model, that is, may train to obtain the teacher model by using a training sample (e.g., a training picture).

Preferably, the obtaining module 501 may train to obtain the teacher model in a data enhancement manner, so as to improve the accuracy of the obtained teacher model. The data enhancement mode is not limited, and may be, for example, a CutMix mode.

When the teacher model is used for distillation training of the student model, the obtaining module 501 needs to fix the gradient of the teacher model, that is, leave its parameters un-updated, so that the teacher model remains unchanged during training, and can then respectively obtain the first score output by the last layer of the teacher model and the second score output by the last layer of the student model. Typically, the last layers of both the teacher model and the student model are fully connected layers.

According to the obtained first score output by the last layer of the teacher model and the obtained second score output by the last layer of the student model, the determining module 502 can further determine the JS divergence. For example, a mean value of the first score and the second score may be obtained, a first KL divergence may be determined according to the first score and the mean value, a second KL divergence may be determined according to the second score and the mean value, and the JS divergence may be determined according to the first KL divergence and the second KL divergence, and a specific calculation manner may be as shown in formula (1).

In practical applications, the number of teacher models may be one or more.

If there is one teacher model, the determining module 502 can calculate the JS divergence directly according to formula (1) from the obtained first score output by the last layer of the teacher model and the obtained second score output by the last layer of the student model.

If there is more than one teacher model, the obtaining module 501 may obtain the first score output by the last layer of each teacher model and calculate the mean value of these first scores, and the determining module 502 may then calculate the JS divergence according to formula (1) from this mean value and the second score output by the last layer of the student model.

When there is more than one teacher model, the teacher models may have the same structure or different structures.

After the JS divergence is obtained, the training module 503 may perform distillation training on the student model by using the JS divergence, that is, perform distillation training using the JS divergence as the loss.

Preferably, the training module 503 can also perform distillation training on the student model in a data enhancement manner to improve the accuracy of the obtained student model. The data enhancement mode is not limited, and may be, for example, a CutMix mode.

Preferably, the training module 503 can also perform distillation training on the student model with a first learning rate, which is less than the original learning rate of the student model.

For a specific work flow of the apparatus embodiment shown in fig. 5, reference is made to the related description in the foregoing method embodiment, and details are not repeated.

In short, compared with existing approaches, the scheme of the apparatus embodiments of the present disclosure significantly improves the accuracy of the resulting student model without much extra overhead, while ensuring that the student model retains a high inference speed, so that a student model balancing accuracy and speed can be obtained.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 6 illustrates a schematic block diagram of an electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the various methods and processes described above, such as the methods described in this disclosure. For example, in some embodiments, the methods described in this disclosure may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by the computing unit 601, one or more steps of the methods described in the present disclosure may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured by any other suitable means (e.g., by means of firmware) to perform the methods described in the present disclosure.

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A model distillation method comprising:

2. The method of claim 1, further comprising: obtaining the teacher model by training in a data enhancement manner.

3. The method of claim 1, wherein determining the JS divergence from the first score and the second score comprises:

obtaining a mean value of the first score and the second score;

determining a first Kullback-Leibler (KL) divergence according to the first score and the mean value;

determining a second KL divergence according to the second score and the mean value;

and determining the JS divergence according to the first KL divergence and the second KL divergence.
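
The steps recited in claim 3 can be illustrated with a minimal sketch, which is not part of the claims. It rests on assumptions the claim does not state: the last-layer scores are normalized with softmax before the divergences are computed, and the two KL divergences are combined with equal weights of 0.5, as in the standard definition of the Jensen-Shannon divergence. The function names `softmax`, `kl_divergence`, and `js_divergence` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw last-layer scores into a probability distribution (assumed step)."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(teacher_scores, student_scores):
    """JS divergence between the teacher (first) and student (second) scores."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    # Mean value of the first score and the second score.
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    first_kl = kl_divergence(p, m)   # first score against the mean value
    second_kl = kl_divergence(q, m)  # second score against the mean value
    # Equal weighting is the standard JS definition (an assumption here).
    return 0.5 * first_kl + 0.5 * second_kl
```

Under these assumptions the result is symmetric in the two scores, is zero when the teacher and student scores coincide, and never exceeds ln 2 when natural logarithms are used.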

4. The method of claim 1, wherein determining the JS divergence from the first score and the second score comprises:

when the number of teacher models is greater than 1, calculating a mean value of the first scores output by the last layers of the respective teacher models, and determining the JS divergence according to the mean value and the second score.
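
The averaging over several teacher models in claim 4 can be sketched as follows, purely as an illustration outside the claims. The sketch assumes an element-wise (per-class) mean over score vectors of equal length; the function name `mean_teacher_scores` is hypothetical.

```python
def mean_teacher_scores(teacher_score_lists):
    """Element-wise mean of the last-layer scores of several teacher models.

    teacher_score_lists: one list of per-class scores per teacher model.
    """
    if not teacher_score_lists:
        raise ValueError("at least one teacher score list is required")
    num_classes = len(teacher_score_lists[0])
    if any(len(scores) != num_classes for scores in teacher_score_lists):
        raise ValueError("all teachers must score the same number of classes")
    num_teachers = len(teacher_score_lists)
    # zip(*...) groups the scores of all teachers for each class.
    return [sum(per_class) / num_teachers for per_class in zip(*teacher_score_lists)]
```

The returned mean value would then take the place of the first score when the JS divergence against the second score is determined.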

5. The method of claim 1, wherein the distillation training of the student model comprises: performing distillation training on the student model in a data enhancement manner.

6. The method of claim 2 or 5, wherein the manner of data enhancement comprises: a CutMix manner.
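
For orientation only, and outside the claims, CutMix as commonly described pastes a rectangular patch from one training image into another and mixes the two labels in proportion to the patch area. The sketch below follows that common formulation and is not taken from this disclosure: images are assumed to be H x W nested lists, labels are one-hot lists, and the names `cutmix` and `alpha` are hypothetical.

```python
import math
import random


def cutmix(image_a, label_a, image_b, label_b, alpha=1.0, rng=random):
    """Paste a random box from image_b into a copy of image_a and mix the labels."""
    height, width = len(image_a), len(image_a[0])
    lam = rng.betavariate(alpha, alpha)  # target share of image_a to keep
    cut_h = int(height * math.sqrt(1.0 - lam))
    cut_w = int(width * math.sqrt(1.0 - lam))
    cy, cx = rng.randrange(height), rng.randrange(width)  # box center
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, height)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, width)

    mixed = [row[:] for row in image_a]  # copy so the inputs stay unchanged
    for y in range(top, bottom):
        mixed[y][left:right] = image_b[y][left:right]

    # Recompute the mixing ratio from the box that was actually pasted,
    # since clipping at the image border can shrink it.
    lam = 1.0 - (bottom - top) * (right - left) / (height * width)
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed, mixed_label
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible.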

7. The method of any one of claims 1-5, wherein the distillation training of the student model comprises:

performing distillation training on the student model using a first learning rate, the first learning rate being less than an original learning rate of the student model.

8. A model distillation apparatus, comprising: an acquisition module, a determination module, and a training module;

9. The apparatus of claim 8, wherein,

the acquisition module is further configured to obtain the teacher model by training in a data enhancement manner.

10. The apparatus of claim 8, wherein,

the determination module is further configured to obtain a mean value of the first score and the second score, determine a first Kullback-Leibler (KL) divergence according to the first score and the mean value, determine a second KL divergence according to the second score and the mean value, and determine the JS divergence according to the first KL divergence and the second KL divergence.

11. The apparatus of claim 8, wherein,

the acquisition module is further configured to calculate, when the number of teacher models is greater than 1, a mean value of the first scores output by the last layers of the respective teacher models;

the determination module is further configured to determine the JS divergence according to the mean value and the second score.

12. The apparatus of claim 8, wherein the training module is further configured to perform distillation training on the student model in a data enhancement manner.

13. The apparatus of claim 9 or 12, wherein the manner of data enhancement comprises: a CutMix manner.

14. The apparatus of any one of claims 8 to 12,

the training module is further configured to perform distillation training on the student model using a first learning rate that is less than an original learning rate of the student model.

15. An electronic device, comprising:

at least one processor; and

a memory, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.

16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.

17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.