CN115687914B - Model distillation method, apparatus, electronic device, and computer-readable medium - Google Patents

Model distillation method, apparatus, electronic device, and computer-readable medium

Info

Publication number
CN115687914B
Authority
CN
China
Prior art keywords
transformation
probability
model
value
probability value
Prior art date
Legal status
Active
Application number
CN202211091084.5A
Other languages
Chinese (zh)
Other versions
CN115687914A (en)
Inventor
施丽佳
游丽娜
秦金晓
吴淑川
Current Assignee
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202211091084.5A priority Critical patent/CN115687914B/en
Publication of CN115687914A publication Critical patent/CN115687914A/en
Application granted granted Critical
Publication of CN115687914B publication Critical patent/CN115687914B/en

Landscapes

  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The disclosure relates to a model distillation method, a model distillation apparatus, an electronic device, and a computer-readable medium, and belongs to the technical field of deep learning. The method comprises the following steps: inputting training data from a model training set into a teacher model and a student model respectively to obtain a first output probability corresponding to the teacher model and a second output probability corresponding to the student model; dynamically transforming the first output probability to obtain a first transformation probability, and dynamically transforming the second output probability to obtain a second transformation probability; determining a model loss value according to the first transformation probability, the second transformation probability, and a loss function; and iteratively training the student model according to the model loss value to obtain a target model after knowledge distillation. Through this dynamic adjustment mechanism, the output probabilities of the teacher model and the student model are dynamically transformed and become smoother, which makes it easier for the student model to grasp finer-grained information and improves the model distillation effect.

Description

Model distillation method, apparatus, electronic device, and computer-readable medium
Technical Field
The present disclosure relates to the field of deep learning technology, and in particular, to a model distillation method, a model distillation apparatus, an electronic device, and a computer readable medium.
Background
Deep learning models are widely used and perform well, but because they are large and have many parameters, they can introduce latency in some scenarios and place high demands on resources. Model distillation is one of the model compression methods: a student model with a simple structure is used to learn the generalization capability of a teacher model, which reduces model complexity and improves efficiency.
In some application scenarios, for example when AI (Artificial Intelligence) technology is used to detect Web attacks, the consequences of a Web attack event are very serious, which often requires the model to have high confidence, i.e., the output probability of the teacher model is very close to 0 or 1. However, this can make it difficult for the student model to grasp fine-grained information during model distillation, and negative-label information may be ignored, so that the full effect of model distillation is hard to realize.
In addition, in some security scenarios, in order to achieve very low false-alarm or missed-alarm rates, the original teacher model may use a high judgment threshold; in that case, even a small error introduced during knowledge distillation may directly affect the judgment result of the student model.
In view of this, there is a need in the art for a model distillation method that can smooth the output probability of a model and enhance the model distillation effect.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure aims to provide a model distillation method, a model distillation device, an electronic device and a computer readable medium, which can smooth the output probability of a model at least to a certain extent and improve the model distillation effect.
According to a first aspect of the present disclosure, there is provided a model distillation method comprising:
respectively inputting training data in a model training set into a teacher model and a student model to obtain a first output probability corresponding to the teacher model and a second output probability corresponding to the student model;
dynamically transforming the first output probability to obtain a first transformation probability, and dynamically transforming the second output probability to obtain a second transformation probability;
determining a model loss value according to the first transformation probability and the second transformation probability and a loss function;
and carrying out iterative training on the student model according to the model loss value to obtain a target model after knowledge distillation.
In an exemplary embodiment of the disclosure, the dynamically transforming the first output probability to obtain a first transformation probability includes:
and dynamically transforming the first output probability according to the probability judgment threshold value and the transformation scale factor of the teacher model to obtain a first transformation probability.
In an exemplary embodiment of the disclosure, the dynamically transforming the first output probability according to the probability judgment threshold and the transformation scale factor of the teacher model to obtain a first transformation probability includes:
taking the maximum value in the original probability values corresponding to the sample categories in the first output probability as a probability parameter;
obtaining a first transformation probability value according to a probability judgment threshold value and a transformation scale factor of the teacher model and the probability parameter, and obtaining a second transformation probability value according to the first transformation probability value;
and obtaining a first transformation probability after dynamic transformation according to the first transformation probability value and the second transformation probability value.
In an exemplary embodiment of the disclosure, the sample class includes positive samples and negative samples, the deriving the dynamically transformed first transformation probability from the first transformation probability value and the second transformation probability value includes:
if the original probability value of the negative sample is larger than the original probability value of the positive sample, the first transformation probability value is used as a transformation probability value after the negative sample is subjected to dynamic transformation, and the second transformation probability value is used as a transformation probability value after the positive sample is subjected to dynamic transformation;
and if the original probability value of the negative sample is smaller than or equal to the original probability value of the positive sample, taking the second transformation probability value as a transformation probability value of the negative sample after dynamic transformation, and taking the first transformation probability value as a transformation probability value of the positive sample after dynamic transformation.
In an exemplary embodiment of the disclosure, the dynamically transforming the second output probability to obtain a second transformation probability includes:
and dynamically transforming the second output probability according to the probability judgment threshold value and the transformation scale factor of the teacher model to obtain a second transformation probability.
In an exemplary embodiment of the present disclosure, the method further comprises:
and determining a probability judgment threshold corresponding to the target model according to the target model subjected to knowledge distillation.
In an exemplary embodiment of the present disclosure, the method further comprises:
and testing the target model based on the test data in the test data set and the probability judgment threshold corresponding to the target model.
According to a second aspect of the present disclosure, there is provided a model distillation apparatus comprising:
the output probability determining module is used for inputting training data in the model training set into a teacher model and a student model respectively to obtain a first output probability corresponding to the teacher model and a second output probability corresponding to the student model;
the transformation probability determining module is used for carrying out dynamic transformation on the first output probability to obtain a first transformation probability, and carrying out dynamic transformation on the second output probability to obtain a second transformation probability;
a model loss determination module for determining a model loss value according to the first and second transformation probabilities and a loss function;
and the target model training module is used for carrying out iterative training on the student model according to the model loss value to obtain a target model after knowledge distillation.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform any of the model distillation methods described above via execution of the executable instructions.
According to a fourth aspect of the present disclosure, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the model distillation method of any of the above.
Exemplary embodiments of the present disclosure may have the following advantageous effects:
In the model distillation method of this example embodiment of the disclosure, the output probabilities of the teacher model and the student model are dynamically transformed, a model loss value is then determined from the dynamically transformed probabilities, and the student model is iteratively trained according to the model loss value to obtain a target model after knowledge distillation. Through the dynamic adjustment mechanism, the output probabilities of the teacher model and the student model are dynamically transformed so that they become smoother and are scaled around the threshold. On the one hand, the student model can grasp finer-grained information, the model distillation effect is improved, and knowledge distillation can play its role more fully; on the other hand, a reasonable threshold can be set more conveniently, and, combined with the adjustment of the threshold, the distillation result of the model becomes more stable.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 shows a schematic flow diagram of a model distillation process according to an example embodiment of the present disclosure;
FIG. 2 illustrates a flow diagram of dynamically transforming a first output probability to obtain a first transformed probability according to an example embodiment of the present disclosure;
FIG. 3 illustrates a schematic flow diagram of a model distillation process according to one embodiment of the present disclosure;
FIG. 4 illustrates a block diagram of a model distillation apparatus according to an example embodiment of the present disclosure;
fig. 5 shows a schematic diagram of a computer system suitable for use in implementing embodiments of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
The present exemplary embodiment first provides a model distillation method. Referring to fig. 1, the above model distillation method may include the steps of:
s110, training data in the model training set are respectively input into a teacher model and a student model, and a first output probability corresponding to the teacher model and a second output probability corresponding to the student model are obtained.
S120, carrying out dynamic transformation on the first output probability to obtain a first transformation probability, and carrying out dynamic transformation on the second output probability to obtain a second transformation probability.
And S130, determining a model loss value according to the first transformation probability and the second transformation probability and the loss function.
And S140, performing iterative training on the student model according to the model loss value to obtain a target model after knowledge distillation.
In the model distillation method of this example embodiment of the disclosure, the output probabilities of the teacher model and the student model are dynamically transformed, a model loss value is then determined from the dynamically transformed probabilities, and the student model is iteratively trained according to the model loss value to obtain a target model after knowledge distillation. Through the dynamic adjustment mechanism, the output probabilities of the teacher model and the student model are dynamically transformed so that they become smoother and are scaled around the threshold. On the one hand, the student model can grasp finer-grained information, the model distillation effect is improved, and knowledge distillation can play its role more fully; on the other hand, a reasonable threshold can be set more conveniently, and, combined with the adjustment of the threshold, the distillation result of the model becomes more stable.
The model distillation method in the example embodiments of the present disclosure can be applied to knowledge distillation scenarios for Web attack detection models, where malicious requests are detected in the field of network security.
The above steps of the present exemplary embodiment will be described in more detail with reference to fig. 2.
In step S110, training data in the model training set is input into a teacher model and a student model, respectively, to obtain a first output probability corresponding to the teacher model and a second output probability corresponding to the student model.
Knowledge distillation is a model compression method and a training approach based on the "teacher-student network" idea. In general, a large model is a single complex network or an ensemble of several networks with good performance and generalization capability, while a small model has limited expressive power because of its smaller network scale. The knowledge learned by the large model can therefore be used to guide the training of the small model, so that the small model achieves performance comparable to that of the large model with a greatly reduced number of parameters, thereby compressing and accelerating the model.
In this exemplary embodiment, the teacher model refers to a model with higher complexity and larger parameter amount, and the student model refers to a model with simple structure and smaller parameter scale. The generalization capability of the teacher model is learned by using the student model with a simple structure, so that the model compression effect can be realized, the model complexity is reduced, and the model parameters are reduced.
Taking knowledge distillation of a Web attack detection model as an example, the Web attack detection model may be used to determine whether an incoming request is malicious, and the output of the model is the probability that the request is judged to be a positive or negative sample.
The training data in the model training set may be a number of different requests, for example requests over the HTTP protocol. Denote the output probability of a model by score = {p0, p1}, where p0 is the probability of the negative class (class 0) and p1 is the probability of the positive class (class 1). Inputting the training data in the model training set into the teacher model and the student model respectively yields a first output probability score_t corresponding to the teacher model and a second output probability score_s corresponding to the student model.
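As a concrete illustration of this step, the following is a minimal sketch assuming both models are PyTorch binary classifiers that return two-class logits per request; the names `teacher_model`, `student_model`, and `batch` are placeholders for illustration and are not taken from the patent.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of step S110: obtain the first and second output probabilities
# (score_t, score_s) for one batch of requests. The model and batch objects are
# illustrative placeholders.
def get_output_probabilities(teacher_model, student_model, batch):
    teacher_model.eval()
    with torch.no_grad():  # the teacher is frozen during distillation
        score_t = F.softmax(teacher_model(batch), dim=-1)  # first output probability
    score_s = F.softmax(student_model(batch), dim=-1)      # second output probability
    return score_t, score_s
```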
In step S120, the first output probability is dynamically transformed to obtain a first transformation probability, and the second output probability is dynamically transformed to obtain a second transformation probability.
In this example embodiment, the first output probability may be dynamically transformed according to the probability judgment threshold and the transformation scale factor of the teacher model to obtain the first transformation probability.
Assume the probability judgment threshold of the model is b ∈ (0, 1). For an output probability score = {p0, p1}, if p1 > b, the input sample is judged to be a positive sample. When the score distribution is very sharp, the value of b is also very close to 1, which is detrimental to the model distillation effect.
The transformation scale factor w ∈ (0, 1) adjusts the degree of scaling in the dynamic transformation: the smaller w is, the flatter the probability distribution after the dynamic transformation. For example, when b = 0.999 and the original score is [5.5278e-04, 9.9945e-01], then with w = 0.5 the dynamically transformed score′ = [0.2755, 0.7245], and with w = 0.3 the dynamically transformed score′ = [0.3853, 0.6147].
In this example embodiment, as shown in fig. 2, the method for dynamically transforming the first output probability according to the probability judgment threshold and the transformation scale factor of the teacher model to obtain the first transformation probability may specifically include the following steps:
and S210, taking the maximum value in the original probability values corresponding to the sample categories in the first output probability as a probability parameter.
In the present exemplary embodiment, the sample classes may include positive samples and negative samples. From the original probability value p0 of the negative sample and the original probability value p1 of the positive sample, the probability parameter s_max is obtained as the maximum of the two, namely:

s_max = max([p0, p1])
And S220, obtaining a first transformation probability value according to the probability judgment threshold and transformation scale factor of the teacher model and the probability parameter, and obtaining a second transformation probability value according to the first transformation probability value.
In this exemplary embodiment, the output probability of the model may be dynamically transformed according to the probability judgment threshold b and the transformation scale factor w of the teacher model and the probability parameter s_max, to obtain a first transformation probability value s′_max and a second transformation probability value s′_min, calculated as follows:

s′_max = b*w + (s_max - b)*(1 - w)*(1/(1 - b))

s′_min = 1 - s′_max
and S230, obtaining a first transformation probability after dynamic transformation according to the first transformation probability value and the second transformation probability value.
In this example embodiment, if the original probability value of the negative sample is greater than the original probability value of the positive sample, the first transformed probability value is used as a transformed probability value of the negative sample after dynamic transformation, and the second transformed probability value is used as a transformed probability value of the positive sample after dynamic transformation; if the original probability value of the negative sample is smaller than or equal to the original probability value of the positive sample, the second transformation probability value is used as a transformation probability value after the negative sample is dynamically transformed, and the first transformation probability value is used as a transformation probability value after the positive sample is dynamically transformed, namely:
if p0 > p1: score′ = [s′_max, s′_min]

else: score′ = [s′_min, s′_max]
For example, suppose the input sample fed into the teacher model as training data is the request /col/col5/?col/col51/?col/col53/id='&&sleep('0 3')&&'1. The first output probability score_t of the teacher model is {p0, p1} = {5.5278e-04, 9.9945e-01}, with p1 > 0.999, so the probability distribution is close to {0, 1}. After the dynamic transformation, the output probability becomes {p0′, p1′} = {0.2755, 0.7245}.
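To make the transformation concrete, the following minimal Python sketch implements the formulas above for a single probability vector; the function name is an illustrative choice, and the final two lines reproduce the numerical example just given (b = 0.999 with w = 0.5 and w = 0.3).

```python
def dynamic_transform(score, b, w):
    """Dynamically transform a two-class probability vector score = [p0, p1].

    b: probability judgment threshold of the teacher model, b in (0, 1)
    w: transformation scale factor, w in (0, 1); smaller w gives a flatter distribution
    """
    p0, p1 = score
    s_max = max(p0, p1)                                      # probability parameter
    s_max_t = b * w + (s_max - b) * (1 - w) * (1 / (1 - b))  # first transformation probability value
    s_min_t = 1 - s_max_t                                    # second transformation probability value
    # the larger transformed value stays on the side of the larger original value
    return [s_max_t, s_min_t] if p0 > p1 else [s_min_t, s_max_t]

print(dynamic_transform([5.5278e-04, 9.9945e-01], b=0.999, w=0.5))  # ~[0.2755, 0.7245]
print(dynamic_transform([5.5278e-04, 9.9945e-01], b=0.999, w=0.3))  # ~[0.3853, 0.6147]
```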
The method for dynamically transforming the second output probability of the student model is similar to that of the teacher model, that is, the second output probability is dynamically transformed according to the probability judgment threshold value and the transformation scale factor of the teacher model to obtain the second transformation probability, and the specific method is not described herein.
In this example embodiment, by dynamically transforming the output probabilities of the teacher model and the student model using the probability judgment threshold and the transformation scale factor, the transformed output probability distribution becomes gentler, which can improve the generalization capability of the student model and facilitate its learning; moreover, the new threshold can be approximately pre-judged, since substituting s_max = b into the above formula gives s′_max = b*w.
In step S130, a model loss value is determined from the first and second transformation probabilities and the loss function.
In this example embodiment, the model loss value loss may be calculated, based on a preset loss function, from the first transformation probability score_t′ corresponding to the dynamically transformed teacher model and the second transformation probability score_s′ corresponding to the student model; the loss can then be back-propagated to iteratively train the student model.
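The patent only speaks of "a preset loss function"; a common choice in knowledge distillation is the KL divergence between the two transformed distributions, and the sketch below assumes that choice purely for illustration.

```python
import torch
import torch.nn.functional as F

# Sketch of step S130 under the assumption of a KL-divergence distillation loss;
# the patent itself does not name the loss function.
def distillation_loss(score_t_prime, score_s_prime, eps=1e-12):
    # score_t_prime, score_s_prime: (batch, 2) tensors of dynamically transformed
    # probabilities; the teacher distribution is treated as the target.
    log_s = torch.log(score_s_prime.clamp_min(eps))
    return F.kl_div(log_s, score_t_prime, reduction="batchmean")
```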
In step S140, iterative training is performed on the student model according to the model loss value, so as to obtain a target model after knowledge distillation.
In this example embodiment, the original student model may be iteratively trained according to the model loss value loss, to obtain a student model after knowledge distillation, i.e., a target model.
In this example embodiment, after obtaining the target model after knowledge distillation, a probability judgment threshold corresponding to the target model may be determined according to the target model after knowledge distillation, and then the target model may be tested based on the test data in the test data set and the probability judgment threshold corresponding to the target model.
The student model after knowledge distillation can be used to perform attack detection on the requests in the test data set, with the judgment made using the new probability judgment threshold and the dynamically transformed output probability of the student model.
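As an illustration of this test-time judgment (reusing the `dynamic_transform` helper sketched earlier; the argument names, including the updated threshold `b_new`, are assumptions for illustration), the student's dynamically transformed positive-class probability is compared with the updated threshold:

```python
import torch.nn.functional as F

# Hypothetical test-time judgment: transform the distilled student's output and
# compare the positive-class value with the new probability judgment threshold b_new.
def detect(student_model, request_batch, b, w, b_new):
    score_s = F.softmax(student_model(request_batch), dim=-1)
    verdicts = []
    for p0, p1 in score_s.tolist():
        _, p1_prime = dynamic_transform([p0, p1], b, w)  # helper from the earlier sketch
        verdicts.append(p1_prime > b_new)                # True -> judged as an attack (positive sample)
    return verdicts
```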
A complete flow chart of the model distillation method in one embodiment of the present disclosure is shown in fig. 3, which is an illustration of the above steps in this example embodiment, and the specific steps of the flow chart are as follows:
s302, inputting request data in a model training set into a teacher model.
The model training set comprises a plurality of requests for training, and the requests in the training set are input into the teacher model to obtain the corresponding output probability score_t.
And S304, obtaining the output probability of the teacher model.
Step S306, dynamic transformation.
And dynamically transforming the output probability of the teacher model according to the probability judging threshold b and the transformation scale factor w of the teacher model.
And S308, obtaining the output probability of the teacher model after dynamic transformation.
After the output probability score_t of the teacher model is dynamically transformed, a new output probability score_t′ is obtained.
S310, inputting the request data in the model training set into the student model.
The requests in the model training set are input into the student model to obtain the corresponding output probability score_s.
And S312, obtaining the output probability of the student model.
Step S314, dynamic transformation.
And dynamically transforming the output probability of the student model according to the probability judgment threshold b and the transformation scale factor w of the teacher model.
And S316, obtaining the output probability of the student model after dynamic transformation.
After the output probability score_s of the student model is dynamically transformed, a new output probability score_s′ is obtained.
And S318, calculating a loss value based on the loss function.
Based on the loss function, the loss value loss is calculated from score_t′ and score_s′ and is back-propagated.
And S320, performing iterative training on the student model.
And performing iterative training on the student model according to the loss value loss.
And S322, obtaining a student model after knowledge distillation.
Step S324, updating the threshold value.
And S326, detecting the request data in the test data set.
Attack detection is performed on the requests in the test data set using the student model after knowledge distillation, with the judgment made using score_s′ and the new threshold.
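Putting the steps of FIG. 3 together, the sketch below re-implements the dynamic transformation in batched tensor form (so that gradients can flow to the student) and runs the training loop; the optimizer, learning rate, epoch count, and KL-divergence loss are illustrative assumptions, not details fixed by the patent.

```python
import torch
import torch.nn.functional as F

def dynamic_transform_batch(score, b, w):
    # score: (batch, 2) probabilities; differentiable batched form of the transform
    s_max, _ = score.max(dim=-1, keepdim=True)
    s_max_t = b * w + (s_max - b) * (1 - w) / (1 - b)
    s_min_t = 1 - s_max_t
    p0_larger = score[:, :1] > score[:, 1:]                  # p0 > p1 ?
    p0_t = torch.where(p0_larger, s_max_t, s_min_t)
    p1_t = torch.where(p0_larger, s_min_t, s_max_t)
    return torch.cat([p0_t, p1_t], dim=-1)

def distill(teacher_model, student_model, train_loader, b, w, epochs=10, lr=1e-3):
    optimizer = torch.optim.Adam(student_model.parameters(), lr=lr)
    teacher_model.eval()
    for _ in range(epochs):
        for batch in train_loader:                                   # S302 / S310
            with torch.no_grad():
                score_t = F.softmax(teacher_model(batch), dim=-1)    # S304
            score_s = F.softmax(student_model(batch), dim=-1)        # S312
            score_t_p = dynamic_transform_batch(score_t, b, w)       # S306-S308
            score_s_p = dynamic_transform_batch(score_s, b, w)       # S314-S316
            loss = F.kl_div(torch.log(score_s_p.clamp_min(1e-12)),
                            score_t_p, reduction="batchmean")        # S318
            optimizer.zero_grad()
            loss.backward()                                          # S320
            optimizer.step()
    return student_model                                             # S322
```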
It should be noted that although the steps of the methods in the present disclosure are depicted in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
Further, the present disclosure also provides a model distillation apparatus. Referring to fig. 4, the model distillation apparatus may include an output probability determination module 410, a transformation probability determination module 420, a model loss determination module 430, and a target model training module 440. Wherein:
the output probability determining module 410 may be configured to input training data in the model training set into a teacher model and a student model, respectively, to obtain a first output probability corresponding to the teacher model and a second output probability corresponding to the student model;
the transformation probability determining module 420 may be configured to dynamically transform the first output probability to obtain a first transformation probability, and dynamically transform the second output probability to obtain a second transformation probability;
model loss determination module 430 may be configured to determine a model loss value based on the first and second transformation probabilities and the loss function;
the objective model training module 440 may be configured to iteratively train the student model according to the model loss value to obtain a knowledge distilled objective model.
In some exemplary embodiments of the present disclosure, the transformation probability determining module 420 may include a first transformation probability determining unit that may be configured to dynamically transform the first output probability according to a probability determination threshold and a transformation scale factor of the teacher model to obtain a first transformation probability.
In some exemplary embodiments of the present disclosure, the first transformation probability determining unit may include a probability parameter determining unit, a transformation probability value determining unit, and a first transformation probability calculating unit. Wherein:
the probability parameter determining unit may be configured to use, as a probability parameter, a maximum value of original probability values corresponding to respective sample categories in the first output probability;
the transformation probability value determining unit may be configured to obtain a first transformation probability value according to a probability judgment threshold value and a transformation scale factor of the teacher model, and obtain a second transformation probability value according to the first transformation probability value;
the first transformation probability calculation unit may be configured to obtain a first transformation probability after the dynamic transformation based on the first transformation probability value and the second transformation probability value.
In some exemplary embodiments of the present disclosure, the first transition probability calculation unit may include a first positive and negative sample probability value determination unit and a second positive and negative sample probability value determination unit. Wherein:
the first positive and negative sample probability value determining unit may be configured to use the first transformed probability value as a transformed probability value obtained by dynamically transforming the negative sample and use the second transformed probability value as a transformed probability value obtained by dynamically transforming the positive sample if the original probability value of the negative sample is greater than the original probability value of the positive sample;
the second positive and negative sample probability value determining unit may be configured to use the second transformed probability value as a transformed probability value obtained by dynamically transforming the negative sample if the original probability value of the negative sample is smaller than or equal to the original probability value of the positive sample, and use the first transformed probability value as a transformed probability value obtained by dynamically transforming the positive sample.
In some exemplary embodiments of the present disclosure, the transformation probability determining module 420 may further include a second transformation probability determining unit that may be configured to dynamically transform the second output probability according to the probability determination threshold and the transformation scale factor of the teacher model to obtain a second transformation probability.
In some exemplary embodiments of the present disclosure, a model distillation apparatus provided by the present disclosure may further include a probability judgment threshold updating module that may be configured to determine a probability judgment threshold corresponding to a target model according to a knowledge distilled target model.
In some exemplary embodiments of the present disclosure, a model distillation apparatus provided by the present disclosure may further include a target model testing module that may be configured to test a target model based on test data in a test data set and a probability judgment threshold corresponding to the target model.
The details of each module/unit in the above model distillation apparatus are described in detail in the corresponding method embodiment section, and will not be repeated here.
Fig. 5 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.
It should be noted that, the computer system 500 of the electronic device shown in fig. 5 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present invention.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU) 501, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the system operation are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output portion 507 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card or a modem. The communication section 509 performs communication processing via a network such as the Internet. The drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive 510 as needed so that a computer program read therefrom is installed into the storage section 508 as needed.
In particular, according to embodiments of the present invention, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present invention include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 509, and/or installed from the removable media 511. When executed by a Central Processing Unit (CPU) 501, performs the various functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
As another aspect, the present application also provides a computer-readable medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the method as described in the above embodiments.
It should be noted that although in the above detailed description several modules of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules described above may be embodied in one module in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module described above may be further divided into a plurality of modules to be embodied.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any adaptations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (8)

1. A model distillation method, comprising:
respectively inputting training data in a model training set into a teacher model and a student model to obtain a first output probability corresponding to the teacher model and a second output probability corresponding to the student model;
dynamically transforming the first output probability to obtain a first transformation probability, and dynamically transforming the second output probability to obtain a second transformation probability;
determining a model loss value according to the first transformation probability and the second transformation probability and a loss function;
performing iterative training on the student model according to the model loss value to obtain a target model after knowledge distillation;
the dynamically transforming the first output probability to obtain a first transformation probability includes:
taking the maximum value in the original probability values corresponding to each sample category in the first output probability as a probability parameter, wherein the sample category comprises a positive sample and a negative sample;
dynamically transforming the output probability of the teacher model according to the probability judgment threshold value and the transformation scale factor of the teacher model and the probability parameter to obtain a first transformation probability value, and obtaining a second transformation probability value according to the first transformation probability value; the probability judging threshold is used for judging whether an input sample is a positive sample or a negative sample, the transformation scale factor is used for adjusting the scaling degree in dynamic transformation, the first transformation probability value is a transformation probability value obtained after the negative sample is subjected to dynamic transformation, the second transformation probability value is a transformation probability value obtained after the positive sample is subjected to dynamic transformation, or the first transformation probability value is a transformation probability value obtained after the positive sample is subjected to dynamic transformation, and the second transformation probability value is a transformation probability value obtained after the negative sample is subjected to dynamic transformation;
and obtaining a first transformation probability after dynamic transformation according to the first transformation probability value and the second transformation probability value.
2. The model distillation method according to claim 1, wherein the deriving the dynamically transformed first transformation probability from the first transformation probability value and the second transformation probability value comprises:
if the original probability value of the negative sample is larger than the original probability value of the positive sample, the first transformation probability value is used as a transformation probability value after the negative sample is subjected to dynamic transformation, and the second transformation probability value is used as a transformation probability value after the positive sample is subjected to dynamic transformation;
and if the original probability value of the negative sample is smaller than or equal to the original probability value of the positive sample, taking the second transformation probability value as a transformation probability value of the negative sample after dynamic transformation, and taking the first transformation probability value as a transformation probability value of the positive sample after dynamic transformation.
3. The model distillation method according to claim 1, wherein dynamically transforming the second output probability to obtain a second transformation probability comprises:
and dynamically transforming the second output probability according to the probability judgment threshold value and the transformation scale factor of the teacher model to obtain a second transformation probability.
4. The model distillation method according to claim 1, further comprising:
and determining a probability judgment threshold corresponding to the target model according to the target model subjected to knowledge distillation.
5. The model distillation method according to claim 4, further comprising:
and testing the target model based on the test data in the test data set and the probability judgment threshold corresponding to the target model.
6. A model distillation apparatus, comprising:
the output probability determining module is used for inputting training data in the model training set into a teacher model and a student model respectively to obtain a first output probability corresponding to the teacher model and a second output probability corresponding to the student model;
the transformation probability determining module is used for carrying out dynamic transformation on the first output probability to obtain a first transformation probability, and carrying out dynamic transformation on the second output probability to obtain a second transformation probability;
a model loss determination module for determining a model loss value according to the first and second transformation probabilities and a loss function;
the target model training module is used for carrying out iterative training on the student model according to the model loss value to obtain a target model after knowledge distillation;
the dynamically transforming the first output probability to obtain a first transformation probability includes:
taking the maximum value in the original probability values corresponding to each sample category in the first output probability as a probability parameter, wherein the sample category comprises a positive sample and a negative sample;
dynamically transforming the output probability of the teacher model according to the probability judgment threshold value and the transformation scale factor of the teacher model and the probability parameter to obtain a first transformation probability value, and obtaining a second transformation probability value according to the first transformation probability value; the probability judging threshold is used for judging whether an input sample is a positive sample or a negative sample, the transformation scale factor is used for adjusting the scaling degree in dynamic transformation, the first transformation probability value is a transformation probability value obtained after the negative sample is subjected to dynamic transformation, the second transformation probability value is a transformation probability value obtained after the positive sample is subjected to dynamic transformation, or the first transformation probability value is a transformation probability value obtained after the positive sample is subjected to dynamic transformation, and the second transformation probability value is a transformation probability value obtained after the negative sample is subjected to dynamic transformation;
and obtaining a first transformation probability after dynamic transformation according to the first transformation probability value and the second transformation probability value.
7. An electronic device, comprising:
a processor; and
a memory for storing one or more programs that, when executed by the processor, cause the processor to implement the model distillation method of any of claims 1-5.
8. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the model distillation method according to any one of claims 1 to 5.
CN202211091084.5A 2022-09-07 2022-09-07 Model distillation method, apparatus, electronic device, and computer-readable medium Active CN115687914B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211091084.5A CN115687914B (en) 2022-09-07 2022-09-07 Model distillation method, apparatus, electronic device, and computer-readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211091084.5A CN115687914B (en) 2022-09-07 2022-09-07 Model distillation method, apparatus, electronic device, and computer-readable medium

Publications (2)

Publication Number Publication Date
CN115687914A CN115687914A (en) 2023-02-03
CN115687914B (en) 2024-01-30

Family

ID=85062987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211091084.5A Active CN115687914B (en) 2022-09-07 2022-09-07 Model distillation method, apparatus, electronic device, and computer-readable medium

Country Status (1)

Country Link
CN (1) CN115687914B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674880A (en) * 2019-09-27 2020-01-10 北京迈格威科技有限公司 Network training method, device, medium and electronic equipment for knowledge distillation
CN113486185A (en) * 2021-09-07 2021-10-08 中建电子商务有限责任公司 Knowledge distillation method based on joint training, processor and storage medium
WO2021243473A1 (en) * 2020-06-05 2021-12-09 Huawei Technologies Co., Ltd. Improved knowledge distillation by utilizing backward pass knowledge in neural networks
WO2022021029A1 (en) * 2020-07-27 2022-02-03 深圳市大疆创新科技有限公司 Detection model training method and device, detection model using method and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105008A (en) * 2018-10-29 2020-05-05 富士通株式会社 Model training method, data recognition method and data recognition device
US11651211B2 (en) * 2019-12-17 2023-05-16 Adobe Inc. Training of neural network based natural language processing models using dense knowledge distillation


Also Published As

Publication number Publication date
CN115687914A (en) 2023-02-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant