WO2022178948A1 - Model distillation method and apparatus, device, and storage medium - Google Patents

Model distillation method and apparatus, device, and storage medium

Info

Publication number
WO2022178948A1
Authority
WO
WIPO (PCT)
Prior art keywords
distillation
model
student model
loss value
parameters
Prior art date
Application number
PCT/CN2021/084539
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
宋青原
吴天博
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022178948A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a model distillation method, device, equipment and storage medium.
  • at present, pre-training models have strong encoding and generalization ability, and using a pre-training model for downstream tasks can greatly reduce the amount of labeled data required, so such models play a large role in many fields. However, because a pre-training model usually has a large number of parameters, it cannot be used online.
  • the inventor realizes that the prior art reduces the number of parameters and improves inference speed by distilling a pre-trained model with a large number of parameters into a model with a small number of parameters.
  • however, there is a gap in accuracy between the small model obtained by current distillation methods and the original model, in many cases as large as about 10 points.
  • many distillation schemes currently require a large amount of labeled data, which greatly increases the cost of distillation.
  • the technical problem addressed is that the small model distilled by prior-art distillation methods has an accuracy gap with the original model, and that many distillation schemes require a large amount of labeled data, which greatly increases the cost of distillation.
  • the main purpose of this application is to provide a model distillation method, device, equipment and storage medium, aiming to solve the technical problems that the small model distilled by prior-art distillation methods has an accuracy gap with the original model and that many distillation schemes require a large amount of labeled data, which greatly increases the cost of distillation.
  • in order to achieve the above purpose, the present application proposes a model distillation method; the steps of the method are described below.
  • the application also proposes a model distillation device, which includes:
  • a data acquisition module for acquiring a pre-training model, a student model, a plurality of labeled training samples, and a plurality of unlabeled training samples, where the pre-training model is a model obtained through Bert network training;
  • the first-stage distillation module is used to perform overall distillation learning on the pre-training model by using the unlabeled training samples and the student model to obtain the student model after the first distillation;
  • the second-stage distillation module is used to perform hierarchical distillation learning on the pre-training model by using the unlabeled training samples and the student model after the first distillation to obtain the student model after the second distillation;
  • the third-stage distillation module is used to perform hierarchical distillation learning on the student model after the second distillation using the labeled training samples to obtain a trained student model.
  • the present application also proposes a computer device, including a memory and a processor, where the memory stores a computer program and the processor, when executing the computer program, implements the steps of the above model distillation method.
  • the present application also proposes a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the above model distillation method are implemented.
  • the model distillation method, device, equipment and storage medium of the present application use the unlabeled training samples and the student model to perform overall distillation learning on the pre-training model to obtain the student model after the first distillation, use the unlabeled training samples and the student model after the first distillation to perform hierarchical distillation learning on the pre-training model to obtain the student model after the second distillation, and use the labeled training samples to perform hierarchical distillation learning on the student model after the second distillation to obtain the trained student model. The three distillations improve the accuracy of the model obtained after distillation, and because unlabeled training samples are used in the first and second distillations, the need for labeled training samples is reduced and the cost of distillation is reduced.
  • FIG. 1 is a schematic flowchart of a model distillation method according to an embodiment of the application.
  • FIG. 2 is a schematic block diagram of the structure of a model distillation apparatus according to an embodiment of the application.
  • FIG. 3 is a schematic structural block diagram of a computer device according to an embodiment of the present application.
  • This application proposes a model distillation method, which is applied in the field of artificial intelligence technology.
  • the model distillation method uses overall distillation learning in the first distillation and hierarchical distillation learning in the second and third distillations, and the three distillations improve the accuracy of the model obtained after distillation;
  • moreover, the first and second distillations use unlabeled training samples, which reduces the need for labeled training samples and reduces the cost of distillation.
  • a model distillation method is provided in the embodiment of the present application, and the method includes:
  • S1: Obtain a pre-training model, a student model, a plurality of labeled training samples, and a plurality of unlabeled training samples, where the pre-training model is a model obtained through Bert network training;
  • S2: Use the unlabeled training samples and the student model to perform overall distillation learning on the pre-training model to obtain the student model after the first distillation;
  • S3: Use the unlabeled training samples and the student model after the first distillation to perform hierarchical distillation learning on the pre-training model to obtain the student model after the second distillation;
  • S4: Use the labeled training samples to perform hierarchical distillation learning on the student model after the second distillation to obtain a trained student model.
  • in this embodiment, the unlabeled training samples and the student model are used to perform overall distillation learning on the pre-training model to obtain the student model after the first distillation; the unlabeled training samples and the student model after the first distillation are used to perform hierarchical distillation learning on the pre-training model to obtain the student model after the second distillation; and the labeled training samples are used to perform hierarchical distillation learning on the student model after the second distillation to obtain the trained student model. The three distillations improve the accuracy of the model obtained after distillation, and because unlabeled training samples are used in the first and second distillations, the need for labeled training samples is reduced and the cost of distillation is reduced.
  • the pre-training model may be obtained from a database, input by the user, or sent by a third-party application system.
  • the student model may be obtained from a database, input by the user, or sent by a third-party application system.
  • the plurality of labeled training samples may be obtained from a database, input by the user, or sent by a third-party application system.
  • the plurality of unlabeled training samples may be obtained from a database, input by the user, or sent by a third-party application system.
  • the student model includes: Embedding layer, BiLSTM layer, Dense layer.
  • the Embedding layer inputs data to the BiLSTM layer, and the BiLSTM layer outputs data to the Dense layer.
  • the Embedding layer is the embedding layer.
  • the output of the BiLSTM layer is the prediction score for each label.
  • the Dense layer is a fully connected layer that outputs predicted probabilities.
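  • As an illustration of this layout, the sketch below implements the Embedding/BiLSTM/Dense student in PyTorch. It is a minimal reading of the text rather than the claimed implementation: the pooling over the sequence, the hidden size, and the softmax inside the Dense layer are assumptions chosen so that the BiLSTM output can serve directly as a per-label prediction score and the Dense output as a prediction probability.

```python
import torch
import torch.nn as nn

class StudentModel(nn.Module):
    """Minimal Embedding -> BiLSTM -> Dense student (illustrative sizes and pooling)."""

    def __init__(self, vocab_size: int, embed_dim: int, num_labels: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)    # Embedding layer
        self.bilstm = nn.LSTM(embed_dim, num_labels,             # BiLSTM layer
                              bidirectional=True, batch_first=True)
        self.dense = nn.Linear(num_labels, num_labels)           # Dense layer

    def forward(self, token_ids: torch.Tensor):
        x = self.embedding(token_ids)            # (batch, seq_len, embed_dim)
        h, _ = self.bilstm(x)                    # (batch, seq_len, 2 * num_labels)
        fwd, bwd = h.chunk(2, dim=-1)            # split the two LSTM directions
        scores = (fwd + bwd).mean(dim=1)         # prediction score for each label (BiLSTM output)
        probs = torch.softmax(self.dense(scores), dim=-1)  # prediction probabilities (Dense output)
        return scores, probs
```

  • for example, StudentModel(vocab_size=30000, embed_dim=128, num_labels=10)(torch.randint(0, 30000, (4, 16))) returns the per-label scores used in the first distillation and the probabilities used in the second and third distillations.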
  • the labeled training samples include: sample data and sample calibration values, where the sample calibration values are the calibration results of the sample data.
  • Unlabeled training samples include: sample data.
  • the number of labeled training samples in the plurality of labeled training samples is smaller than the number of unlabeled training samples in the plurality of unlabeled training samples.
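  • For concreteness, one way to hold these two kinds of samples in code is sketched below; the container and field names are illustrative, not taken from the application.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Sample:
    """A training sample: unlabeled samples carry only data, labeled ones also a calibration value."""
    data: List[int]               # sample data, e.g. token ids
    label: Optional[int] = None   # sample calibration value; None for unlabeled samples

# Per this embodiment, the labeled set is (much) smaller than the unlabeled set.
labeled_samples = [Sample(data=[101, 2054, 2003, 102], label=2)]
unlabeled_samples = [Sample(data=[101, 7592, 102]), Sample(data=[101, 2129, 2024, 102])]
```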
  • the labeled training samples are used to perform hierarchical distillation learning on the student model after the second distillation, that is, the parameters of the student model after the second distillation are updated layer by layer, and the student model after the second distillation, once trained in this way, is taken as the trained student model.
  • this avoids the catastrophic forgetting of the prior-art distillation methods, that is, forgetting the content of the second distillation during the third distillation.
  • the above-mentioned steps of using the unlabeled training samples and the student model to perform overall distillation learning on the pre-training model to obtain the student model after the first distillation include:
  • S21: Input the unlabeled training samples into the pre-training model for score prediction, and obtain the first prediction score output by the score prediction layer of the pre-training model;
  • S22: Input the unlabeled training samples into the student model for score prediction to obtain a second prediction score;
  • S23: Input the first prediction score and the second prediction score into a first loss function for calculation to obtain a first loss value, update all parameters of the student model according to the first loss value, and use the student model with updated parameters for the next calculation of the second prediction score;
  • S24: Repeat the above steps until the first loss value reaches a first convergence condition or the number of iterations reaches a second convergence condition, and determine the student model for which the first loss value reaches the first convergence condition or the number of iterations reaches the second convergence condition as the student model after the first distillation.
  • This embodiment computes a loss value from the prediction scores obtained on the unlabeled training samples and updates all parameters of the student model accordingly, so that the knowledge learned by the pre-training model is learned through overall distillation.
  • For S21, input the sample data of the unlabeled training samples into the pre-training model for prediction, and use the score output by the score prediction layer of the pre-training model as the first prediction score.
  • For S22, input the sample data of the unlabeled training samples into the student model for prediction, and use the score output by the BiLSTM layer of the student model as the second prediction score.
  • the first predicted score and the second predicted score are input into the first loss function to calculate the loss value, and the calculated loss value is used as the first loss value.
  • the method for updating all the parameters of the student model according to the first loss value can be selected from the prior art, which will not be repeated here.
  • the first convergence condition means that the magnitudes of the first loss values from two adjacent calculations satisfy the Lipschitz condition (Lipschitz continuity condition).
  • the number of iterations used for the second convergence condition refers to the number of times the student model has been used to calculate the second prediction score; that is, the iteration count increases by 1 after each calculation.
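  • The application leaves the exact form of these convergence checks open; the helper below shows one common reading (the absolute change between adjacent loss values falling below a tolerance, or an iteration cap being reached), purely as an assumption for illustration.

```python
def stage_converged(prev_loss: float, curr_loss: float, iteration: int,
                    tol: float = 1e-4, max_iterations: int = 10_000) -> bool:
    """One possible stopping test for a distillation stage (tolerance and cap are illustrative)."""
    loss_stable = abs(curr_loss - prev_loss) <= tol     # adjacent loss values condition
    cap_reached = iteration >= max_iterations           # iteration-count condition
    return loss_stable or cap_reached
```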
  • the above-mentioned step of inputting the first predicted score and the second predicted score into a first loss function for calculation to obtain a first loss value includes:
  • the first prediction score and the second prediction score are input into the KL divergence loss function for calculation, and the first loss value is obtained.
  • the KL divergence loss function, also known as the K-L divergence loss function, is calculated as KL(p||q) = Σ_x p(x)·log(p(x)/q(x)), where:
  • x is the sample data of the unlabeled training samples;
  • p(x) is the first prediction score;
  • q(x) is the second prediction score;
  • log() is the logarithmic function.
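  • A hedged PyTorch sketch of this first-stage update is given below. It assumes the teacher exposes its score-prediction output, reuses the StudentModel interface sketched earlier (forward returning scores and probabilities), and normalizes the scores with softmax before applying the KL formula; the normalization is an assumption, since the application only states the formula over p(x) and q(x).

```python
import torch
import torch.nn.functional as F

def first_loss(teacher_scores: torch.Tensor, student_scores: torch.Tensor) -> torch.Tensor:
    """First loss value: KL(p || q) = sum_x p(x) * log(p(x) / q(x)), p = teacher, q = student."""
    p = F.softmax(teacher_scores, dim=-1)            # p(x), from the first prediction score
    log_q = F.log_softmax(student_scores, dim=-1)    # log q(x), from the second prediction score
    return (p * (torch.log(p.clamp_min(1e-12)) - log_q)).sum(dim=-1).mean()

def first_stage_step(teacher, student, token_ids, optimizer):
    """One overall-distillation update: every parameter of the student is updated."""
    with torch.no_grad():
        teacher_scores = teacher(token_ids)           # first prediction score (score prediction layer)
    student_scores, _ = student(token_ids)            # second prediction score (BiLSTM output)
    loss = first_loss(teacher_scores, student_scores)
    optimizer.zero_grad()
    loss.backward()                                   # gradients reach all student parameters
    optimizer.step()
    return loss.item()
```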
  • the above step of using the unlabeled training samples and the student model after the first distillation to perform hierarchical distillation learning on the pre-training model to obtain the student model after the second distillation includes:
  • S31: Input the unlabeled training samples into the pre-training model for probability prediction, and obtain the first prediction probability output by the probability prediction layer of the pre-training model;
  • S32: Input the unlabeled training samples into the student model after the first distillation for probability prediction to obtain a second prediction probability;
  • S33: Input the first prediction probability and the second prediction probability into a second loss function for calculation to obtain a second loss value, update the parameters of the student model after the first distillation according to the second loss value and the first preset hierarchical parameter-update rule, and use the student model after the first distillation with updated parameters for the next calculation of the second prediction probability;
  • S34: Repeat the above steps until the second loss value reaches a third convergence condition or the number of iterations reaches a fourth convergence condition, and determine the student model after the first distillation for which the second loss value reaches the third convergence condition or the number of iterations reaches the fourth convergence condition as the student model after the second distillation.
  • This embodiment computes a loss value from the prediction probabilities obtained on the unlabeled training samples and updates the parameters of the student model after the first distillation layer by layer, thereby avoiding the catastrophic forgetting seen in prior-art distillation, that is, forgetting the content of the first distillation during the second distillation.
  • For S31, input the sample data of the unlabeled training samples into the pre-training model for probability prediction, and use the probability output by the probability prediction layer of the pre-training model as the first prediction probability.
  • the sample data of the unlabeled training sample is input into the student model after the first distillation for probability prediction, and the probability output by the Dense layer of the student model after the first distillation is used as the second prediction probability.
  • the first predicted probability and the second predicted probability are input into the second loss function to calculate the loss value, and the calculated loss value is used as the second loss value.
  • according to the second loss value, only the parameters of one layer of the student model after the first distillation (that is, one of the Embedding layer, the BiLSTM layer, and the Dense layer) are updated at a time.
  • the third convergence condition means that the magnitudes of the second loss values from two adjacent calculations satisfy the Lipschitz condition (Lipschitz continuity condition).
  • the number of iterations used for the fourth convergence condition refers to the number of times the student model after the first distillation has been used to calculate the second prediction probability; that is, the iteration count increases by 1 after each calculation.
  • the above step of inputting the first prediction probability and the second prediction probability into a second loss function for calculation to obtain a second loss value and updating the parameters of the student model after the first distillation according to the second loss value and the first preset hierarchical parameter-update rule includes:
  • S331: Input the first prediction probability and the second prediction probability into the MSE loss function for calculation to obtain the second loss value;
  • S332: When the Dense layer parameters have not reached the convergence condition of the first Dense layer, update the parameters of the Dense layer of the student model after the first distillation according to the second loss value; otherwise, when the BiLSTM layer parameters have not reached the convergence condition of the first BiLSTM layer, update the parameters of the BiLSTM layer of the student model after the first distillation according to the second loss value; otherwise, update the parameters of the Embedding layer of the student model after the first distillation according to the second loss value.
  • This embodiment computes a loss value from the prediction probabilities obtained on the unlabeled training samples and updates the parameters of the student model after the first distillation layer by layer, thereby avoiding the catastrophic forgetting seen in prior-art distillation, that is, forgetting the content of the first distillation during the second distillation.
  • the convergence conditions of the first Dense layer and the convergence conditions of the first BiLSTM layer can be set according to training requirements, which are not specifically limited here.
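  • The sketch below shows one way to realize this second stage in PyTorch: an MSE loss between the two prediction probabilities, with the gradient step applied to a single layer chosen by the Dense -> BiLSTM -> Embedding cascade. The per-layer optimizers and the dictionary of convergence flags are illustrative devices, not part of the application's wording.

```python
import torch
import torch.nn.functional as F

def second_stage_step(teacher, student, token_ids, layer_optimizers, layer_converged):
    """One hierarchical update driven by the second loss value (MSE on probabilities).

    layer_optimizers: e.g. {"dense": opt_d, "bilstm": opt_b, "embedding": opt_e}, each built
    over only that layer's parameters; layer_converged: per-layer convergence flags.
    """
    with torch.no_grad():
        teacher_probs = teacher(token_ids)        # first prediction probability (teacher output)
    _, student_probs = student(token_ids)         # second prediction probability (Dense output)
    second_loss = F.mse_loss(student_probs, teacher_probs)

    # First preset hierarchical update rule: Dense first, then BiLSTM, then Embedding.
    if not layer_converged["dense"]:
        layer = "dense"
    elif not layer_converged["bilstm"]:
        layer = "bilstm"
    else:
        layer = "embedding"

    student.zero_grad(set_to_none=True)
    second_loss.backward()
    layer_optimizers[layer].step()                # only the selected layer's parameters move
    return second_loss.item(), layer
```

  • building layer_optimizers as, for example, {"dense": torch.optim.Adam(student.dense.parameters()), ...} keeps each optimizer step confined to one layer, which is what the hierarchical update rule requires.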
  • the above-mentioned steps of using the labeled training samples to perform hierarchical distillation learning on the student model after the second distillation to obtain a trained student model include:
  • S41: Input the labeled training samples into the student model after the second distillation for probability prediction to obtain a third prediction probability;
  • S42: Input the third prediction probability and the sample calibration value of the labeled training sample into a third loss function for calculation to obtain a third loss value, update the parameters of the student model after the second distillation according to the third loss value and the second preset hierarchical parameter-update rule, and use the student model after the second distillation with updated parameters for the next calculation of the third prediction probability;
  • S43: Repeat the above steps until the third loss value reaches a fifth convergence condition or the number of iterations reaches a sixth convergence condition, and determine the student model after the second distillation for which the third loss value reaches the fifth convergence condition or the number of iterations reaches the sixth convergence condition as the trained student model.
  • This embodiment computes a loss value from the prediction probabilities obtained on the labeled training samples and updates the parameters of the student model after the second distillation layer by layer, thereby avoiding the catastrophic forgetting seen in prior-art distillation, that is, forgetting the content of the second distillation during the third distillation.
  • the third predicted probability and the sample calibration value of the labeled training sample are input into the third loss function to calculate the loss value, and the calculated loss value is used as the third loss value.
  • according to the third loss value, only the parameters of one layer of the student model after the second distillation (that is, one of the Embedding layer, the BiLSTM layer, and the Dense layer) are updated at a time.
  • the fifth convergence condition means that the magnitudes of the third loss values from two adjacent calculations satisfy the Lipschitz condition (Lipschitz continuity condition).
  • the number of iterations used for the sixth convergence condition refers to the number of times the student model after the second distillation has been used to calculate the third prediction probability; that is, the iteration count increases by 1 after each calculation.
  • the above step of inputting the third prediction probability and the sample calibration value of the labeled training sample into a third loss function for calculation to obtain a third loss value and updating the parameters of the student model after the second distillation according to the third loss value and the second preset hierarchical parameter-update rule includes:
  • S421: Input the third prediction probability and the sample calibration value of the labeled training sample into a cross-entropy loss function for calculation to obtain the third loss value;
  • S422: When the Dense layer parameters have not reached the convergence condition of the second Dense layer, update the parameters of the Dense layer of the student model after the second distillation according to the third loss value; otherwise, when the BiLSTM layer parameters have not reached the convergence condition of the second BiLSTM layer, update the parameters of the BiLSTM layer of the student model after the second distillation according to the third loss value; otherwise, update the parameters of the Embedding layer of the student model after the second distillation according to the third loss value.
  • the parameters of the student model after the second distillation are updated layer by layer according to the loss value computed from the prediction probabilities on the labeled training samples, so as to avoid the catastrophic forgetting of prior-art distillation, that is, forgetting the content of the second distillation during the third distillation.
  • the cross-entropy loss is calculated as L = -Σ_c y_c·log(p_c), where:
  • y_c is the sample calibration value of the labeled training sample;
  • p_c is the third prediction probability.
  • the convergence conditions of the second Dense layer and the convergence conditions of the second BiLSTM layer can be set according to training requirements, which are not specifically limited here.
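  • A short sketch of this third loss is given below; it assumes the student's Dense output is already a probability (as in the earlier student sketch), so the log is taken directly rather than calling a logits-based cross-entropy. The layer-wise update then proceeds exactly as in the second-stage sketch, only driven by this loss and the second Dense/BiLSTM convergence conditions.

```python
import torch
import torch.nn.functional as F

def third_loss(student_probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Third loss value: cross-entropy -sum_c y_c * log(p_c) against the sample calibration values."""
    log_p = torch.log(student_probs.clamp_min(1e-12))   # log p_c from the third prediction probability
    return F.nll_loss(log_p, labels)                     # picks out -log p_c of the calibrated class y_c
```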
  • the present application also proposes a model distillation device, the device comprising:
  • the data acquisition module 100 is used to acquire a pre-training model, a student model, a plurality of labeled training samples, and a plurality of unlabeled training samples, wherein the pre-training model is a model obtained based on Bert network training;
  • the first-stage distillation module 200 is used to perform overall distillation learning on the pre-training model by using the unlabeled training samples and the student model to obtain the student model after the first distillation;
  • the second-stage distillation module 300 is configured to perform hierarchical distillation learning on the pre-training model using the unlabeled training samples and the student model after the first distillation, to obtain the student model after the second distillation;
  • the third-stage distillation module 400 is configured to perform hierarchical distillation learning on the student model after the second distillation by using the labeled training samples to obtain a trained student model.
  • with this apparatus, the unlabeled training samples and the student model are used to perform overall distillation learning on the pre-training model to obtain the student model after the first distillation; the unlabeled training samples and the student model after the first distillation are used to perform hierarchical distillation learning on the pre-training model to obtain the student model after the second distillation; and the labeled training samples are used to perform hierarchical distillation learning on the student model after the second distillation to obtain the trained student model. The three distillations improve the accuracy of the model obtained after distillation, and because unlabeled training samples are used in the first and second distillations, the need for labeled training samples is reduced and the cost of distillation is reduced.
  • the first-stage distillation module 200 includes: a pre-training model score prediction sub-module, a student model score prediction sub-module, and a first-stage distillation training sub-module;
  • the pre-training model scoring prediction submodule is used to input the unlabeled training samples into the pre-training model for scoring prediction, and obtain the first prediction score output by the scoring prediction layer of the pre-training model;
  • the student model scoring prediction sub-module is used to input the unlabeled training samples into the student model for scoring prediction to obtain a second prediction score
  • the first-stage distillation training sub-module is used to input the first prediction score and the second prediction score into a first loss function for calculation to obtain a first loss value, update all parameters of the student model according to the first loss value, and use the student model with updated parameters for the next calculation of the second prediction score; the above steps are repeated until the first loss value reaches the first convergence condition or the number of iterations reaches the second convergence condition, and the student model whose first loss value reaches the first convergence condition or whose number of iterations reaches the second convergence condition is determined as the student model after the first distillation.
  • the first-stage distillation training sub-module includes: a first loss value calculation unit;
  • the first loss value calculation unit is configured to input the first prediction score and the second prediction score into the KL divergence loss function for calculation, and obtain the first loss value.
  • the second-stage distillation module 300 includes: a pre-training model probability prediction sub-module, a student model probability prediction sub-module after the first distillation, and a second-stage distillation training sub-module;
  • the pre-training model probability prediction sub-module is configured to input the unlabeled training samples into the pre-training model for probability prediction, and obtain the first prediction probability output by the probability prediction layer of the pre-training model;
  • the student model probability prediction submodule after the first distillation is used to input the unlabeled training samples into the student model after the first distillation for probability prediction, and obtain a second predicted probability;
  • the second-stage distillation training sub-module is used to input the first prediction probability and the second prediction probability into a second loss function for calculation to obtain a second loss value, update the parameters of the student model after the first distillation according to the second loss value and the first preset hierarchical parameter-update rule, and use the student model after the first distillation with updated parameters for the next calculation of the second prediction probability; the above steps are repeated until the second loss value reaches the third convergence condition or the number of iterations reaches the fourth convergence condition, and the student model after the first distillation whose second loss value reaches the third convergence condition or whose number of iterations reaches the fourth convergence condition is determined as the student model after the second distillation.
  • the second-stage distillation training sub-module includes: a second loss value calculation unit, and a first parameter update unit;
  • the second loss value calculation unit is configured to input the first prediction probability and the second prediction probability into an MSE loss function for calculation to obtain the second loss value;
  • the first parameter updating unit is configured to: when the Dense layer parameters have not reached the convergence condition of the first Dense layer, update the parameters of the Dense layer of the student model after the first distillation according to the second loss value; otherwise, when the BiLSTM layer parameters have not reached the convergence condition of the first BiLSTM layer, update the parameters of the BiLSTM layer of the student model after the first distillation according to the second loss value; otherwise, update the parameters of the Embedding layer of the student model after the first distillation according to the second loss value.
  • the third-stage distillation module 400 includes: a student model probability prediction sub-module after the second distillation, and a third-stage distillation training sub-module;
  • the student model probability prediction submodule after the second distillation is used to input the labeled training sample into the student model after the second distillation for probability prediction, and obtain a third prediction probability;
  • the third-stage distillation training sub-module is used to input the third prediction probability and the sample calibration value of the labeled training sample into a third loss function for calculation to obtain a third loss value, update the parameters of the student model after the second distillation according to the third loss value and the second preset hierarchical parameter-update rule, and use the student model after the second distillation with updated parameters for the next calculation of the third prediction probability; the above steps are repeated until the third loss value reaches the fifth convergence condition or the number of iterations reaches the sixth convergence condition, and the student model after the second distillation whose third loss value reaches the fifth convergence condition or whose number of iterations reaches the sixth convergence condition is determined as the trained student model.
  • the third-stage distillation training sub-module includes: a third loss value calculation unit and a second parameter update unit;
  • the third loss value calculation unit is configured to input the third prediction probability and the sample calibration value of the labeled training sample into a cross-entropy loss function for calculation to obtain the third loss value;
  • the second parameter updating unit is configured to: when the Dense layer parameters have not reached the convergence condition of the second Dense layer, update the parameters of the Dense layer of the student model after the second distillation according to the third loss value; otherwise, when the BiLSTM layer parameters have not reached the convergence condition of the second BiLSTM layer, update the parameters of the BiLSTM layer of the student model after the second distillation according to the third loss value; otherwise, update the parameters of the Embedding layer of the student model after the second distillation according to the third loss value.
  • an embodiment of the present application further provides a computer device.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 3 .
  • the computer device includes a processor, a memory, a network interface, and a database connected by a system bus, where the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium.
  • the database of the computer device is used to store data used by the model distillation method.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program, when executed by the processor, implements a model distillation method.
  • the model distillation method includes: obtaining a pre-training model, a student model, a plurality of labeled training samples, and a plurality of unlabeled training samples, where the pre-training model is a model obtained based on Bert network training; using the unlabeled training samples and the student model to perform overall distillation learning on the pre-training model to obtain the student model after the first distillation; using the unlabeled training samples and the student model after the first distillation to perform hierarchical distillation learning on the pre-training model to obtain the student model after the second distillation; and using the labeled training samples to perform hierarchical distillation learning on the student model after the second distillation to obtain a trained student model.
  • with this computer device, the unlabeled training samples and the student model are used to perform overall distillation learning on the pre-training model to obtain the student model after the first distillation; the unlabeled training samples and the student model after the first distillation are used to perform hierarchical distillation learning on the pre-training model to obtain the student model after the second distillation; and the labeled training samples are used to perform hierarchical distillation learning on the student model after the second distillation to obtain the trained student model. The three distillations improve the accuracy of the model obtained after distillation, and because unlabeled training samples are used in the first and second distillations, the need for labeled training samples is reduced and the cost of distillation is reduced.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed by a processor, a model distillation method is implemented, including the steps of: acquiring a pre-training model, a student model, a plurality of labeled training samples, and a plurality of unlabeled training samples, where the pre-training model is a model obtained based on Bert network training; using the unlabeled training samples and the student model to perform overall distillation learning on the pre-training model to obtain the student model after the first distillation; using the unlabeled training samples and the student model after the first distillation to perform hierarchical distillation learning on the pre-training model to obtain the student model after the second distillation; and using the labeled training samples to perform hierarchical distillation learning on the student model after the second distillation to obtain a trained student model.
  • the model distillation method implemented as above uses the unlabeled training samples and the student model to perform overall distillation learning on the pre-training model to obtain the student model after the first distillation, uses the unlabeled training samples and the student model after the first distillation to perform hierarchical distillation learning on the pre-training model to obtain the student model after the second distillation, and uses the labeled training samples to perform hierarchical distillation learning on the student model after the second distillation to obtain the trained student model. The three distillations improve the accuracy of the model obtained after distillation, and because unlabeled training samples are used in the first and second distillations, the need for labeled training samples is reduced and the cost of distillation is reduced.
  • the computer storage medium can be non-volatile or volatile.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

A model distillation method and apparatus, a device, and a storage medium, relating to the technical field of artificial intelligence. The method comprises: acquiring a pre-trained model, a student model, a plurality of labeled training samples, and a plurality of unlabeled training samples, the pre-trained model being a model obtained by training on the basis of a Bert network (S1); using the unlabeled training samples and the student model to perform overall distillation learning on the pre-trained model to obtain a student model after first distillation (S2); using the unlabeled training samples and the student model after the first distillation to perform hierarchical distillation learning on the pre-trained model to obtain a student model after second distillation (S3); and using the labeled training samples to perform hierarchical distillation learning on the student model after the second distillation to obtain a trained student model (S4). Thus, the accuracy of the model obtained after distillations is improved by means of three distillations, the requirements for labeling of training samples are reduced, and distillation costs are reduced.

Description

Model distillation method, device, equipment and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on February 26, 2021, with application number 2021102205129 and entitled "Model distillation method, device, equipment and storage medium", the entire contents of which are incorporated by reference in this application.
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular, to a model distillation method, device, equipment and storage medium.
Background
At present, pre-training models have strong encoding and generalization ability, and using a pre-training model for downstream tasks can greatly reduce the amount of labeled data required, so such models play a large role in many fields. However, because a pre-training model usually has a large number of parameters, it cannot be used online.
The inventor realizes that the prior art reduces the number of parameters and improves inference speed by distilling a pre-trained model with a large number of parameters into a model with a small number of parameters. However, there is a gap in accuracy between the small model obtained by current distillation methods and the original model, in many cases as large as about 10 points. At the same time, many current distillation schemes require a large amount of labeled data, which greatly increases the cost of distillation.
Technical Problem
The distillation methods of the prior art leave a gap in accuracy between the distilled small model and the original model, and many distillation schemes require a large amount of labeled data, which greatly increases the cost of distillation.
Technical Solution
The main purpose of this application is to provide a model distillation method, device, equipment and storage medium, aiming to solve the technical problems that the small model distilled by prior-art distillation methods has an accuracy gap with the original model and that many distillation schemes require a large amount of labeled data, which greatly increases the cost of distillation.
In order to achieve the above purpose, the present application proposes a model distillation method, the method including:
obtaining a pre-training model, a student model, a plurality of labeled training samples, and a plurality of unlabeled training samples, where the pre-training model is a model obtained based on Bert network training;
using the unlabeled training samples and the student model to perform overall distillation learning on the pre-training model to obtain a student model after the first distillation;
using the unlabeled training samples and the student model after the first distillation to perform hierarchical distillation learning on the pre-training model to obtain a student model after the second distillation; and
using the labeled training samples to perform hierarchical distillation learning on the student model after the second distillation to obtain a trained student model.
The present application also proposes a model distillation apparatus, the apparatus including:
a data acquisition module for acquiring a pre-training model, a student model, a plurality of labeled training samples, and a plurality of unlabeled training samples, where the pre-training model is a model obtained based on Bert network training;
a first-stage distillation module for performing overall distillation learning on the pre-training model using the unlabeled training samples and the student model to obtain a student model after the first distillation;
a second-stage distillation module for performing hierarchical distillation learning on the pre-training model using the unlabeled training samples and the student model after the first distillation to obtain a student model after the second distillation; and
a third-stage distillation module for performing hierarchical distillation learning on the student model after the second distillation using the labeled training samples to obtain a trained student model.
The present application also proposes a computer device, including a memory and a processor, where the memory stores a computer program and the processor, when executing the computer program, implements the following method steps:
obtaining a pre-training model, a student model, a plurality of labeled training samples, and a plurality of unlabeled training samples, where the pre-training model is a model obtained based on Bert network training;
using the unlabeled training samples and the student model to perform overall distillation learning on the pre-training model to obtain a student model after the first distillation;
using the unlabeled training samples and the student model after the first distillation to perform hierarchical distillation learning on the pre-training model to obtain a student model after the second distillation; and
using the labeled training samples to perform hierarchical distillation learning on the student model after the second distillation to obtain a trained student model.
The present application also proposes a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the following method steps are implemented:
obtaining a pre-training model, a student model, a plurality of labeled training samples, and a plurality of unlabeled training samples, where the pre-training model is a model obtained based on Bert network training;
using the unlabeled training samples and the student model to perform overall distillation learning on the pre-training model to obtain a student model after the first distillation;
using the unlabeled training samples and the student model after the first distillation to perform hierarchical distillation learning on the pre-training model to obtain a student model after the second distillation; and
using the labeled training samples to perform hierarchical distillation learning on the student model after the second distillation to obtain a trained student model.
Beneficial Effects
The model distillation method, device, equipment and storage medium of the present application use the unlabeled training samples and the student model to perform overall distillation learning on the pre-training model to obtain the student model after the first distillation, use the unlabeled training samples and the student model after the first distillation to perform hierarchical distillation learning on the pre-training model to obtain the student model after the second distillation, and use the labeled training samples to perform hierarchical distillation learning on the student model after the second distillation to obtain the trained student model. The three distillations improve the accuracy of the model obtained after distillation, and because unlabeled training samples are used in the first and second distillations, the need for labeled training samples is reduced and the cost of distillation is reduced.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a model distillation method according to an embodiment of the present application;
FIG. 2 is a schematic structural block diagram of a model distillation apparatus according to an embodiment of the present application;
FIG. 3 is a schematic structural block diagram of a computer device according to an embodiment of the present application.
The realization of the purpose, functional features and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Embodiments of the Present Invention
In order to make the purpose, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit it.
In order to solve the technical problems that the small model distilled by prior-art distillation methods has an accuracy gap with the original model and that many distillation schemes require a large amount of labeled data, which greatly increases the cost of distillation, the present application proposes a model distillation method applied in the field of artificial intelligence technology. The model distillation method uses overall distillation learning in the first distillation and hierarchical distillation learning in the second and third distillations, so that the three distillations improve the accuracy of the model obtained after distillation; moreover, the first and second distillations use unlabeled training samples, which reduces the need for labeled training samples and reduces the cost of distillation.
Referring to FIG. 1, an embodiment of the present application provides a model distillation method, the method including:
S1: obtaining a pre-training model, a student model, a plurality of labeled training samples, and a plurality of unlabeled training samples, where the pre-training model is a model obtained based on Bert network training;
S2: using the unlabeled training samples and the student model to perform overall distillation learning on the pre-training model to obtain a student model after the first distillation;
S3: using the unlabeled training samples and the student model after the first distillation to perform hierarchical distillation learning on the pre-training model to obtain a student model after the second distillation;
S4: using the labeled training samples to perform hierarchical distillation learning on the student model after the second distillation to obtain a trained student model.
In this embodiment, the unlabeled training samples and the student model are used to perform overall distillation learning on the pre-training model to obtain the student model after the first distillation; the unlabeled training samples and the student model after the first distillation are used to perform hierarchical distillation learning on the pre-training model to obtain the student model after the second distillation; and the labeled training samples are used to perform hierarchical distillation learning on the student model after the second distillation to obtain the trained student model. The three distillations improve the accuracy of the model obtained after distillation, and because unlabeled training samples are used in the first and second distillations, the need for labeled training samples is reduced and the cost of distillation is reduced.
For S1, the pre-training model may be obtained from a database, input by the user, or sent by a third-party application system.
The student model may be obtained from a database, input by the user, or sent by a third-party application system.
The plurality of labeled training samples may be obtained from a database, input by the user, or sent by a third-party application system.
The plurality of unlabeled training samples may be obtained from a database, input by the user, or sent by a third-party application system.
The student model includes an Embedding layer, a BiLSTM layer, and a Dense layer. The Embedding layer feeds data to the BiLSTM layer, and the BiLSTM layer feeds data to the Dense layer. The Embedding layer is an embedding layer. The output of the BiLSTM layer is the prediction score for each label. The Dense layer is a fully connected layer that outputs prediction probabilities.
A labeled training sample includes sample data and a sample calibration value, where the sample calibration value is the calibration result of the sample data.
An unlabeled training sample includes sample data.
Optionally, the number of labeled training samples in the plurality of labeled training samples is smaller than the number of unlabeled training samples in the plurality of unlabeled training samples.
For S2, the unlabeled training samples and the student model are used to perform overall distillation learning on the pre-training model, that is, all parameters of the student model are updated, and the trained student model is taken as the student model after the first distillation.
For S3, the unlabeled training samples and the student model after the first distillation are used to perform hierarchical distillation learning on the pre-training model, that is, the parameters of the student model after the first distillation are updated layer by layer, and the trained student model after the first distillation is taken as the student model after the second distillation. This avoids the catastrophic forgetting of prior-art distillation, that is, forgetting the content of the first distillation during the second distillation.
For S4, the labeled training samples are used to perform hierarchical distillation learning on the student model after the second distillation, that is, the parameters of the student model after the second distillation are updated layer by layer, and the trained student model after the second distillation is taken as the trained student model. This avoids the catastrophic forgetting of prior-art distillation, that is, forgetting the content of the second distillation during the third distillation.
在一个实施例中,上述采用所述未标注的训练样本和所述学生模型对所述预训练模型进行整体蒸馏学习,得到第一次蒸馏后的学生模型的步骤,包括:In one embodiment, the above-mentioned steps of using the unlabeled training samples and the student model to perform overall distillation learning on the pre-training model to obtain the student model after the first distillation include:
S21:将所述未标注的训练样本输入所述预训练模型进行评分预测,获取所述预训练模型的评分预测层输出的第一预测评分;S21: Input the unlabeled training samples into the pre-training model for scoring prediction, and obtain the first prediction score output by the scoring prediction layer of the pre-training model;
S22:将所述未标注的训练样本输入所述学生模型的进行评分预测,得到第二预测评分;S22: Input the unlabeled training samples into the student model for scoring prediction to obtain a second prediction score;
S23:将所述第一预测评分、所述第二预测评分输入第一损失函数进行计算,得到第一损失值,根据所述第一损失值更新所述学生模型的所有参数,将更新参数后的所述学生模型用于下一次计算所述第二预测评分;S23: Input the first predicted score and the second predicted score into a first loss function for calculation to obtain a first loss value, update all parameters of the student model according to the first loss value, and update the parameters after updating. The said student model is used for the next calculation of the second predicted score;
S24:重复执行上述方法步骤直至所述第一损失值达到第一收敛条件或迭代次数达到第二收敛条件,将所述第一损失值达到第一收敛条件或迭代次数达到第二收敛条件的所述学生模型,确定为所述第一次蒸馏后的学生模型。S24: Repeat the above method steps until the first loss value reaches the first convergence condition or the number of iterations reaches the second convergence condition, and set the first loss value to meet the first convergence condition or the number of iterations to reach the second convergence condition The student model is determined as the student model after the first distillation.
本实施例实现了根据未标注的训练样本预测得到预测评分计算损失值对所述学生模型的所有参数进行更新,实现了整体蒸馏学习所述预训练模型学习到的知识。This embodiment realizes that all parameters of the student model are updated according to the prediction score obtained from the unlabeled training sample prediction and the loss value is calculated, and the knowledge learned by the pre-training model is learned by overall distillation.
对于S21,将所述未标注的训练样本的样本数据输入所述预训练模型进行预测,将所述预训练模型的评分预测层输出的评分作为第一预测评分。For S21, input the sample data of the unlabeled training samples into the pre-training model for prediction, and use the score output by the score prediction layer of the pre-training model as the first prediction score.
对于S22,将所述未标注的训练样本的样本数据输入所述学生模型进行预测,将所述学生模型的BiLSTM层输出的评分作为第二预测评分。For S22, input the sample data of the unlabeled training samples into the student model for prediction, and use the score output by the BiLSTM layer of the student model as the second prediction score.
对于S23,将所述第一预测评分、所述第二预测评分输入第一损失函数进行损失值计算,将计算得到的损失值作为第一损失值。For S23, the first predicted score and the second predicted score are input into the first loss function to calculate the loss value, and the calculated loss value is used as the first loss value.
根据所述第一损失值更新所述学生模型的所有参数的方法可以从现有技术中选择,在此不做赘述。The method for updating all the parameters of the student model according to the first loss value can be selected from the prior art, which will not be repeated here.
For S24, the first convergence condition means that the difference between the first loss values obtained in two consecutive calculations satisfies the Lipschitz condition (Lipschitz continuity condition).
The number of iterations used for the second convergence condition is the number of times the student model has been used to calculate the second prediction score; that is, each calculation increases the iteration count by 1.
It can be understood that, when the first loss value has not reached the first convergence condition and the number of iterations has not reached the second convergence condition, a new unlabeled training sample is obtained from the plurality of unlabeled training samples, and steps S21 to S24 are executed on the acquired unlabeled training sample.
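A minimal PyTorch-style sketch of the S21-S24 loop is given below, for illustration only. It assumes the teacher and student can be called directly on a batch to produce score vectors, applies a softmax before the KL divergence, and uses an Adam optimizer; these interface details, the learning rate, and the convergence test are assumptions rather than the patented implementation.

```python
import torch
import torch.nn.functional as F

def overall_distill(teacher, student, unlabeled_loader, max_steps=10000, tol=1e-4):
    """First distillation stage (S21-S24): every student parameter is updated."""
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
    prev_loss, step = None, 0
    for x in unlabeled_loader:                      # unlabeled training samples
        with torch.no_grad():
            p = teacher(x)                          # S21: first prediction score (teacher)
        q = student(x)                              # S22: second prediction score (student)
        # S23: KL divergence between the teacher and student score distributions.
        loss = F.kl_div(F.log_softmax(q, dim=-1),
                        F.softmax(p, dim=-1), reduction="batchmean")
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                            # update ALL student parameters
        step += 1
        # S24: stop when the loss change is small (first condition) or the
        # iteration budget is exhausted (second condition).
        if (prev_loss is not None and abs(prev_loss - loss.item()) < tol) or step >= max_steps:
            break
        prev_loss = loss.item()
    return student                                  # student model after the first distillation
```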
在一个实施例中,上述将所述第一预测评分、所述第二预测评分输入第一损失函数进行计算,得到第一损失值的步骤,包括:In one embodiment, the above-mentioned step of inputting the first predicted score and the second predicted score into a first loss function for calculation to obtain a first loss value includes:
将所述第一预测评分、所述第二预测评分输入KL散度损失函数进行计算,得到所述第一损失值。The first prediction score and the second prediction score are input into the KL divergence loss function for calculation, and the first loss value is obtained.
KL散度损失函数,又称为K-L散度损失函数。KL divergence loss function, also known as K-L divergence loss function.
KL散度损失函数KL(p||q)的计算公式为:The calculation formula of the KL divergence loss function KL(p||q) is:
KL(p \| q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)}
其中,x是所述未标注的训练样本的样本数据,p(x)是第一预测评分,q(x)是第二预测评分,log()是对数函数。where x is the sample data of the unlabeled training samples, p(x) is the first predicted score, q(x) is the second predicted score, and log() is a logarithmic function.
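For reference, the formula above can be written directly in plain Python; here p and q are assumed to be discrete distributions given as lists of probabilities over the same support (an illustrative convention, not stated in the disclosure).

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)); p is the teacher's prediction,
    q the student's. Zero-probability teacher entries contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Example: identical distributions give a divergence of 0.
assert kl_divergence([0.5, 0.5], [0.5, 0.5]) == 0.0
```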
In one embodiment, the above step of performing hierarchical distillation learning on the pre-training model by using the unlabeled training samples and the student model after the first distillation to obtain the student model after the second distillation includes:
S31:将所述未标注的训练样本输入所述预训练模型进行概率预测,获取所述预训练模型的概率预测层输出的第一预测概率;S31: Input the unlabeled training samples into the pre-training model for probability prediction, and obtain the first prediction probability output by the probability prediction layer of the pre-training model;
S32:将所述未标注的训练样本输入所述第一次蒸馏后的学生模型进行概率预测,得到第二预测概率;S32: Input the unlabeled training sample into the student model after the first distillation for probability prediction, and obtain a second predicted probability;
S33: Input the first predicted probability and the second predicted probability into a second loss function for calculation to obtain a second loss value, update the parameters of the student model after the first distillation layer by layer according to the second loss value and the first preset parameter hierarchical update rule, and use the student model after the first distillation with the updated parameters for the next calculation of the second predicted probability;
S34: Repeat the above method steps until the second loss value reaches the third convergence condition or the number of iterations reaches the fourth convergence condition, and determine the student model after the first distillation whose second loss value reaches the third convergence condition or whose number of iterations reaches the fourth convergence condition as the student model after the second distillation.
In this embodiment, a predicted probability is obtained from the unlabeled training samples, a loss value is calculated from it, and the parameters of the student model after the first distillation are updated layer by layer, which avoids the catastrophic forgetting of prior-art distillation, that is, forgetting the content of the first distillation during the second distillation.
对于S31,将所述未标注的训练样本的样本数据输入所述预训练模型进行概率预测,将所述预训练模型的概率预测层输出的概率作为第一预测概率。For S31, input the sample data of the unlabeled training samples into the pre-training model for probability prediction, and use the probability output by the probability prediction layer of the pre-training model as the first prediction probability.
对于S32,将所述未标注的训练样本的样本数据输入所述第一次蒸馏后学生模型进行概率预测,将所述第一次蒸馏后学生模型的Dense层输出的概率作为第二预测概率。For S32, the sample data of the unlabeled training sample is input into the student model after the first distillation for probability prediction, and the probability output by the Dense layer of the student model after the first distillation is used as the second prediction probability.
对于S33,将所述第一预测概率、所述第二预测概率输入第二损失函数进行损失值计算,将计算得到的损失值作为第二损失值。For S33, the first predicted probability and the second predicted probability are input into the second loss function to calculate the loss value, and the calculated loss value is used as the second loss value.
According to the second loss value, only one layer of the student model after the first distillation (that is, the Embedding layer, the BiLSTM layer, or the Dense layer) is updated at a time.
For S34, the third convergence condition means that the difference between the second loss values obtained in two consecutive calculations satisfies the Lipschitz condition (Lipschitz continuity condition).
The number of iterations used for the fourth convergence condition is the number of times the student model after the first distillation has been used to calculate the second predicted probability; that is, each calculation increases the iteration count by 1.
It can be understood that, when the second loss value has not reached the third convergence condition and the number of iterations has not reached the fourth convergence condition, a new unlabeled training sample is obtained from the plurality of unlabeled training samples, and steps S31 to S34 are executed on the acquired unlabeled training sample.
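The S31-S34 loop differs from the first stage mainly in that the loss is an MSE over predicted probabilities and that only one layer is updated per step. The hedged sketch below illustrates this; update_one_layer is a placeholder for the first preset parameter hierarchical update rule (a possible implementation is sketched after the MSE formula below), and the model interfaces are assumptions.

```python
import torch
import torch.nn.functional as F

def layered_distill_unlabeled(teacher, student, unlabeled_loader, update_one_layer,
                              max_steps=10000, tol=1e-4):
    """Second distillation stage (S31-S34): layer-wise updates on unlabeled data."""
    prev_loss, step = None, 0
    for x in unlabeled_loader:
        with torch.no_grad():
            p = teacher(x)                    # S31: first predicted probability (teacher)
        q = student(x)                        # S32: second predicted probability (student)
        loss = F.mse_loss(q, p)               # S33: second loss value (MSE)
        # Only one layer (Embedding, BiLSTM or Dense) is touched per step,
        # which is what limits forgetting of the first-stage knowledge.
        update_one_layer(student, loss)
        step += 1
        if (prev_loss is not None and abs(prev_loss - loss.item()) < tol) or step >= max_steps:
            break                             # S34: third/fourth convergence conditions
        prev_loss = loss.item()
    return student                            # student model after the second distillation
```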
In one embodiment, the above step of inputting the first predicted probability and the second predicted probability into a second loss function for calculation to obtain a second loss value, and updating the parameters of the student model after the first distillation according to the second loss value and the first preset parameter hierarchical update rule, includes:
S331:将所述第一预测概率、所述第二预测概率输入MSE损失函数进行计 算,得到所述第二损失值;S331: Input the first predicted probability and the second predicted probability into the MSE loss function for calculation to obtain the second loss value;
S332: When the Dense layer parameters in the second loss value have not reached the first Dense layer convergence condition, update the parameters of the Dense layer of the student model after the first distillation according to the Dense layer parameters in the second loss value; otherwise, when the BiLSTM layer parameters in the second loss value have not reached the first BiLSTM layer convergence condition, update the parameters of the BiLSTM layer of the student model after the first distillation according to the BiLSTM layer parameters in the second loss value; otherwise, update the parameters of the Embedding layer of the student model after the first distillation according to the Embedding layer parameters in the second loss value.
In this embodiment, a predicted probability is obtained from the unlabeled training samples, a loss value is calculated from it, and the parameters of the student model after the first distillation are updated layer by layer, which avoids the catastrophic forgetting of prior-art distillation, that is, forgetting the content of the first distillation during the second distillation.
对于S331,MSE损失函数公式MSE(p,q)如下:For S331, the MSE loss function formula MSE(p,q) is as follows:
MSE(p, q) = \frac{1}{n}\sum_{t=1}^{n}\left(p_t - q_t\right)^2
where p_t is the first predicted probability, q_t is the second predicted probability, and n is the number of terms in the summation.
对于S332,第一Dense层收敛条件、第一BiLSTM层收敛条件可以根据训练需求设置,在此不做具体限定。For S332, the convergence conditions of the first Dense layer and the convergence conditions of the first BiLSTM layer can be set according to training requirements, which are not specifically limited here.
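One way to read the S331/S332 rule is: back-propagate the MSE loss, then apply the resulting gradients to only one of the Dense, BiLSTM or Embedding layers, moving on to the next layer once the current one is considered converged. The sketch below is an interpretation under that reading; the attribute names student.dense, student.bilstm and student.embedding, the gradient-norm convergence test, and the plain gradient step are assumptions, not the patented criteria.

```python
import torch

def update_one_layer(student, loss, lr=1e-4, grad_tol=1e-3):
    """Illustrative first preset parameter hierarchical update rule (S332):
    prefer the Dense layer, fall back to BiLSTM, then to Embedding."""
    loss.backward()                                   # gradients for all layers
    for layer in (student.dense, student.bilstm, student.embedding):
        grads = [p.grad for p in layer.parameters() if p.grad is not None]
        if not grads:
            continue
        grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
        if grad_norm > grad_tol:                      # layer not yet "converged"
            with torch.no_grad():
                for p in layer.parameters():          # gradient step on this layer only
                    p -= lr * p.grad
            break                                     # update exactly one layer per step
    student.zero_grad()                               # clear gradients for the next step
```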
在一个实施例中,上述采用所述带标注的训练样本对所述第二次蒸馏后的学生模型进行分层蒸馏学习,得到训练好的学生模型的步骤,包括:In one embodiment, the above-mentioned steps of using the labeled training samples to perform hierarchical distillation learning on the student model after the second distillation to obtain a trained student model include:
S41:将所述带标注的训练样本输入所述第二次蒸馏后的学生模型进行概率预测,得到第三预测概率;S41: Input the labeled training sample into the student model after the second distillation for probability prediction, and obtain a third predicted probability;
S42: Input the third predicted probability and the sample calibration value of the labeled training sample into a third loss function for calculation to obtain a third loss value, update the parameters of the student model after the second distillation layer by layer according to the third loss value and the second preset parameter hierarchical update rule, and use the student model after the second distillation with the updated parameters for the next calculation of the third predicted probability;
S43: Repeat the above method steps until the third loss value reaches the fifth convergence condition or the number of iterations reaches the sixth convergence condition, and determine the student model after the second distillation whose third loss value reaches the fifth convergence condition or whose number of iterations reaches the sixth convergence condition as the trained student model.
In this embodiment, a predicted probability is obtained from the labeled training samples, a loss value is calculated from it and the sample calibration values, and the parameters of the student model after the second distillation are updated layer by layer, which avoids the catastrophic forgetting of prior-art distillation, that is, forgetting the content of the second distillation during the third distillation.
对于S41,将所述带标注的训练样本的样本数据输入所述第二次蒸馏后的学生模型进行概率预测,将所述第二次蒸馏后的学生模型的Dense层输出的概率作为第三预测概率。For S41, input the sample data of the labeled training samples into the student model after the second distillation for probability prediction, and use the probability output by the Dense layer of the student model after the second distillation as the third prediction probability.
对于S42,将所述第三预测概率、所述带标注的训练样本的样本标定值输入第三损失函数进行损失值计算,将计算得到的损失值作为第三损失值。For S42, the third predicted probability and the sample calibration value of the labeled training sample are input into the third loss function to calculate the loss value, and the calculated loss value is used as the third loss value.
According to the third loss value, only one layer of the student model after the second distillation (that is, the Embedding layer, the BiLSTM layer, or the Dense layer) is updated at a time.
For S43, the fifth convergence condition means that the difference between the third loss values obtained in two consecutive calculations satisfies the Lipschitz condition (Lipschitz continuity condition).
The number of iterations used for the sixth convergence condition is the number of times the student model after the second distillation has been used to calculate the third predicted probability; that is, each calculation increases the iteration count by 1.
It can be understood that, when the third loss value has not reached the fifth convergence condition and the number of iterations has not reached the sixth convergence condition, a new labeled training sample is obtained from the plurality of labeled training samples, and steps S41 to S43 are executed on the acquired labeled training sample.
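The third stage mirrors the second, except that supervision now comes from the labels instead of the teacher and the loss is a cross-entropy. A brief sketch follows, under the same assumptions as the earlier ones (hypothetical model interface, labels given as class indices, update_one_layer standing in for the second preset parameter hierarchical update rule).

```python
import torch
import torch.nn.functional as F

def layered_distill_labeled(student, labeled_loader, update_one_layer,
                            max_steps=10000, tol=1e-4):
    """Third distillation stage (S41-S43): layer-wise fine-tuning on labeled data."""
    prev_loss, step = None, 0
    for x, y in labeled_loader:               # sample data and sample calibration value
        logits = student(x)                   # S41: third predicted probability (as logits)
        loss = F.cross_entropy(logits, y)     # S42: third loss value (cross-entropy)
        update_one_layer(student, loss)       # second preset parameter hierarchical update rule
        step += 1
        if (prev_loss is not None and abs(prev_loss - loss.item()) < tol) or step >= max_steps:
            break                             # S43: fifth/sixth convergence conditions
        prev_loss = loss.item()
    return student                            # trained student model
```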
In one embodiment, the above step of inputting the third predicted probability and the sample calibration value of the labeled training sample into a third loss function for calculation to obtain a third loss value, and updating the parameters of the student model after the second distillation according to the third loss value and the second preset parameter hierarchical update rule, includes:
S421:将所述第三预测概率、所述带标注的训练样本的样本标定值输入交叉熵损失函数进行计算,得到所述第三损失值;S421: Input the third prediction probability and the sample calibration value of the labeled training sample into a cross-entropy loss function for calculation, to obtain the third loss value;
S422: When the Dense layer parameters in the third loss value have not reached the second Dense layer convergence condition, update the parameters of the Dense layer of the student model after the second distillation according to the Dense layer parameters in the third loss value; otherwise, when the BiLSTM layer parameters in the third loss value have not reached the second BiLSTM layer convergence condition, update the parameters of the BiLSTM layer of the student model after the second distillation according to the BiLSTM layer parameters in the third loss value; otherwise, update the parameters of the Embedding layer of the student model after the second distillation according to the Embedding layer parameters in the third loss value.
In this embodiment, a predicted probability is obtained from the labeled training samples, a loss value is calculated from it and the sample calibration values, and the parameters of the student model after the second distillation are updated layer by layer, which avoids the catastrophic forgetting of prior-art distillation, that is, forgetting the content of the second distillation during the third distillation.
对于S421,交叉熵损失函数CE的计算公式如下:For S421, the calculation formula of the cross entropy loss function CE is as follows:
CE = -\sum_{c} y_c \log\left(p_c\right)
where y_c is the sample calibration value of the labeled training sample and p_c is the third predicted probability.
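Written out in plain Python, the formula reads as follows; y is assumed to be a one-hot label vector and p the student's predicted distribution (an illustrative convention, not stated in the disclosure).

```python
import math

def cross_entropy(y, p, eps=1e-12):
    """CE(y, p) = -sum_c y_c * log(p_c); eps guards against log(0)."""
    return -sum(yc * math.log(pc + eps) for yc, pc in zip(y, p))

# Example: a confident correct prediction yields a small loss.
print(round(cross_entropy([0, 1, 0], [0.1, 0.8, 0.1]), 4))  # ~0.2231
```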
对于S422,第二Dense层收敛条件、第二BiLSTM层收敛条件可以根据训练需求设置,在此不做具体限定。For S422, the convergence conditions of the second Dense layer and the convergence conditions of the second BiLSTM layer can be set according to training requirements, which are not specifically limited here.
参照图2,本申请还提出了一种模型蒸馏装置,所述装置包括:Referring to Figure 2, the present application also proposes a model distillation device, the device comprising:
数据获取模块100,用于获取预训练模型、学生模型、多个带标注的训练样本、多个未标注的训练样本,所述预训练模型是基于Bert网络训练得到的模型;The data acquisition module 100 is used to acquire a pre-training model, a student model, a plurality of labeled training samples, and a plurality of unlabeled training samples, wherein the pre-training model is a model obtained based on Bert network training;
第一阶段蒸馏模块200,用于采用所述未标注的训练样本和所述学生模型对所述预训练模型进行整体蒸馏学习,得到第一次蒸馏后的学生模型;The first-stage distillation module 200 is used to perform overall distillation learning on the pre-training model by using the unlabeled training samples and the student model to obtain the student model after the first distillation;
第二阶段蒸馏模块300,用于采用所述未标注的训练样本和所述第一次蒸馏后的学生模型对所述预训练模型进行分层蒸馏学习,得到第二次蒸馏后的学生模型;The second-stage distillation module 300 is configured to perform hierarchical distillation learning on the pre-training model using the unlabeled training samples and the student model after the first distillation, to obtain the student model after the second distillation;
第三阶段蒸馏模块400,用于采用所述带标注的训练样本对所述第二次蒸馏后的学生模型进行分层蒸馏学习,得到训练好的学生模型。The third-stage distillation module 400 is configured to perform hierarchical distillation learning on the student model after the second distillation by using the labeled training samples to obtain a trained student model.
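Purely as an illustration of how the four modules of the apparatus map onto the method steps (the class and attribute names below are invented for the sketch and are not part of the disclosure):

```python
class ModelDistillationApparatus:
    """Hypothetical decomposition mirroring modules 100-400 of the apparatus."""
    def __init__(self, data_acquisition, stage1_distill, stage2_distill, stage3_distill):
        self.data_acquisition = data_acquisition    # module 100: teacher, student, samples
        self.stage1_distill = stage1_distill        # module 200: overall distillation (unlabeled)
        self.stage2_distill = stage2_distill        # module 300: hierarchical distillation (unlabeled)
        self.stage3_distill = stage3_distill        # module 400: hierarchical distillation (labeled)

    def run(self):
        teacher, student, labeled, unlabeled = self.data_acquisition()
        student = self.stage1_distill(teacher, student, unlabeled)
        student = self.stage2_distill(teacher, student, unlabeled)
        return self.stage3_distill(student, labeled)  # trained student model
```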
In this embodiment, overall distillation learning is performed on the pre-training model by using the unlabeled training samples and the student model to obtain the student model after the first distillation; hierarchical distillation learning is performed on the pre-training model by using the unlabeled training samples and the student model after the first distillation to obtain the student model after the second distillation; and hierarchical distillation learning is performed on the student model after the second distillation by using the labeled training samples to obtain the trained student model. The three distillations improve the accuracy of the distilled model, and because unlabeled training samples are used in the first and second distillations, the need for labeled training samples is reduced and the cost of distillation is lowered.
在一个实施例中,所述第一阶段蒸馏模块200,包括:预训练模型评分预测子模块、学生模型评分预测子模块、第一阶段蒸馏训练子模块;In one embodiment, the first-stage distillation module 200 includes: a pre-training model score prediction sub-module, a student model score prediction sub-module, and a first-stage distillation training sub-module;
所述预训练模型评分预测子模块,用于将所述未标注的训练样本输入所述预 训练模型进行评分预测,获取所述预训练模型的评分预测层输出的第一预测评分;The pre-training model scoring prediction submodule is used to input the unlabeled training samples into the pre-training model for scoring prediction, and obtain the first prediction score output by the scoring prediction layer of the pre-training model;
所述学生模型评分预测子模块,用于将所述未标注的训练样本输入所述学生模型的进行评分预测,得到第二预测评分;The student model scoring prediction sub-module is used to input the unlabeled training samples into the student model for scoring prediction to obtain a second prediction score;
The first-stage distillation training sub-module is configured to input the first prediction score and the second prediction score into a first loss function for calculation to obtain a first loss value, update all parameters of the student model according to the first loss value, use the student model with the updated parameters for the next calculation of the second prediction score, repeat the above steps until the first loss value reaches the first convergence condition or the number of iterations reaches the second convergence condition, and determine the student model whose first loss value reaches the first convergence condition or whose number of iterations reaches the second convergence condition as the student model after the first distillation.
在一个实施例中,所述第一阶段蒸馏训练子模块包括:第一损失值计算单元;In one embodiment, the first-stage distillation training sub-module includes: a first loss value calculation unit;
所述第一损失值计算单元,用于将所述第一预测评分、所述第二预测评分输入KL散度损失函数进行计算,得到所述第一损失值。The first loss value calculation unit is configured to input the first prediction score and the second prediction score into the KL divergence loss function for calculation, and obtain the first loss value.
在一个实施例中,所述第二阶段蒸馏模块300包括:预训练模型概率预测子模块、第一次蒸馏后的学生模型概率预测子模块、第二阶段蒸馏训练子模块;In one embodiment, the second-stage distillation module 300 includes: a pre-training model probability prediction sub-module, a student model probability prediction sub-module after the first distillation, and a second-stage distillation training sub-module;
所述预训练模型概率预测子模块,用于将所述未标注的训练样本输入所述预训练模型进行概率预测,获取所述预训练模型的概率预测层输出的第一预测概率;The pre-training model probability prediction sub-module is configured to input the unlabeled training samples into the pre-training model for probability prediction, and obtain the first prediction probability output by the probability prediction layer of the pre-training model;
所述第一次蒸馏后的学生模型概率预测子模块,用于将所述未标注的训练样本输入所述第一次蒸馏后的学生模型进行概率预测,得到第二预测概率;The student model probability prediction submodule after the first distillation is used to input the unlabeled training samples into the student model after the first distillation for probability prediction, and obtain a second predicted probability;
The second-stage distillation training sub-module is configured to input the first predicted probability and the second predicted probability into a second loss function for calculation to obtain a second loss value, update the parameters of the student model after the first distillation layer by layer according to the second loss value and the first preset parameter hierarchical update rule, use the student model after the first distillation with the updated parameters for the next calculation of the second predicted probability, repeat the above steps until the second loss value reaches the third convergence condition or the number of iterations reaches the fourth convergence condition, and determine the student model after the first distillation whose second loss value reaches the third convergence condition or whose number of iterations reaches the fourth convergence condition as the student model after the second distillation.
在一个实施例中,所述第二阶段蒸馏训练子模块包括:第二损失值计算单元、第一参数更新单元;In one embodiment, the second-stage distillation training sub-module includes: a second loss value calculation unit, and a first parameter update unit;
所述第二损失值计算单元,用于将所述第一预测概率、所述第二预测概率输入MSE损失函数进行计算,得到所述第二损失值;the second loss value calculation unit, configured to input the first predicted probability and the second predicted probability into an MSE loss function for calculation, to obtain the second loss value;
The first parameter update unit is configured to: when the Dense layer parameters in the second loss value have not reached the first Dense layer convergence condition, update the parameters of the Dense layer of the student model after the first distillation according to the Dense layer parameters in the second loss value; otherwise, when the BiLSTM layer parameters in the second loss value have not reached the first BiLSTM layer convergence condition, update the parameters of the BiLSTM layer of the student model after the first distillation according to the BiLSTM layer parameters in the second loss value; otherwise, update the parameters of the Embedding layer of the student model after the first distillation according to the Embedding layer parameters in the second loss value.
在一个实施例中,所述第三阶段蒸馏模块400包括:第二次蒸馏后的学生模型概率预测子模块、第三阶段蒸馏训练子模块;In one embodiment, the third-stage distillation module 400 includes: a student model probability prediction sub-module after the second distillation, and a third-stage distillation training sub-module;
所述第二次蒸馏后的学生模型概率预测子模块,用于将所述带标注的训练样本输入所述第二次蒸馏后的学生模型进行概率预测,得到第三预测概率;The student model probability prediction submodule after the second distillation is used to input the labeled training sample into the student model after the second distillation for probability prediction, and obtain a third prediction probability;
The third-stage distillation training sub-module is configured to input the third predicted probability and the sample calibration value of the labeled training sample into a third loss function for calculation to obtain a third loss value, update the parameters of the student model after the second distillation layer by layer according to the third loss value and the second preset parameter hierarchical update rule, use the student model after the second distillation with the updated parameters for the next calculation of the third predicted probability, repeat the above steps until the third loss value reaches the fifth convergence condition or the number of iterations reaches the sixth convergence condition, and determine the student model after the second distillation whose third loss value reaches the fifth convergence condition or whose number of iterations reaches the sixth convergence condition as the trained student model.
在一个实施例中,所述第三阶段蒸馏训练子模块包括:第三损失值计算单元、第二参数更新单元;In one embodiment, the third-stage distillation training sub-module includes: a third loss value calculation unit and a second parameter update unit;
所述第三损失值计算单元,用于将所述第三预测概率、所述带标注的训练样本的样本标定值输入交叉熵损失函数进行计算,得到所述第三损失值;The third loss value calculation unit is configured to input the third prediction probability and the sample calibration value of the labeled training sample into a cross-entropy loss function for calculation to obtain the third loss value;
The second parameter update unit is configured to: when the Dense layer parameters in the third loss value have not reached the second Dense layer convergence condition, update the parameters of the Dense layer of the student model after the second distillation according to the Dense layer parameters in the third loss value; otherwise, when the BiLSTM layer parameters in the third loss value have not reached the second BiLSTM layer convergence condition, update the parameters of the BiLSTM layer of the student model after the second distillation according to the BiLSTM layer parameters in the third loss value; otherwise, update the parameters of the Embedding layer of the student model after the second distillation according to the Embedding layer parameters in the third loss value.
Referring to FIG. 3, an embodiment of the present application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store data for the model distillation method. The network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer program is executed by the processor, a model distillation method is implemented. The model distillation method includes: acquiring a pre-training model, a student model, a plurality of labeled training samples, and a plurality of unlabeled training samples, where the pre-training model is a model obtained based on Bert network training; performing overall distillation learning on the pre-training model by using the unlabeled training samples and the student model to obtain the student model after the first distillation; performing hierarchical distillation learning on the pre-training model by using the unlabeled training samples and the student model after the first distillation to obtain the student model after the second distillation; and performing hierarchical distillation learning on the student model after the second distillation by using the labeled training samples to obtain the trained student model.
In this embodiment, overall distillation learning is performed on the pre-training model by using the unlabeled training samples and the student model to obtain the student model after the first distillation; hierarchical distillation learning is performed on the pre-training model by using the unlabeled training samples and the student model after the first distillation to obtain the student model after the second distillation; and hierarchical distillation learning is performed on the student model after the second distillation by using the labeled training samples to obtain the trained student model. The three distillations improve the accuracy of the distilled model, and because unlabeled training samples are used in the first and second distillations, the need for labeled training samples is reduced and the cost of distillation is lowered.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, a model distillation method is implemented, including the steps of: acquiring a pre-training model, a student model, a plurality of labeled training samples, and a plurality of unlabeled training samples, where the pre-training model is a model obtained based on Bert network training; performing overall distillation learning on the pre-training model by using the unlabeled training samples and the student model to obtain the student model after the first distillation; performing hierarchical distillation learning on the pre-training model by using the unlabeled training samples and the student model after the first distillation to obtain the student model after the second distillation; and performing hierarchical distillation learning on the student model after the second distillation by using the labeled training samples to obtain the trained student model.
In the model distillation method executed above, overall distillation learning is performed on the pre-training model by using the unlabeled training samples and the student model to obtain the student model after the first distillation; hierarchical distillation learning is performed on the pre-training model by using the unlabeled training samples and the student model after the first distillation to obtain the student model after the second distillation; and hierarchical distillation learning is performed on the student model after the second distillation by using the labeled training samples to obtain the trained student model. The three distillations improve the accuracy of the distilled model, and because unlabeled training samples are used in the first and second distillations, the need for labeled training samples is reduced and the cost of distillation is lowered.
所述计算机存储介质可以是非易失性,也可以是易失性。The computer storage medium can be non-volatile or volatile.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program, and the computer program can be stored in a non-volatile computer-readable storage medium. When the computer program is executed, it may include the processes of the above method embodiments. Any reference to memory, storage, database, or other media provided in this application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (SSRSDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, herein, the terms "include", "comprise", or any other variation thereof are intended to cover non-exclusive inclusion, so that a process, apparatus, article, or method that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, apparatus, article, or method. Without further limitation, an element qualified by the phrase "including a ..." does not preclude the presence of additional identical elements in the process, apparatus, article, or method that includes the element.
The above are only preferred embodiments of the present application and are not intended to limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made by using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included in the patent protection scope of the present application.

Claims (20)

  1. 一种模型蒸馏方法,其中,所述方法包括:A model distillation method, wherein the method comprises:
    获取预训练模型、学生模型、多个带标注的训练样本、多个未标注的训练样本,所述预训练模型是基于Bert网络训练得到的模型;Obtain a pre-training model, a student model, a plurality of labeled training samples, and a plurality of unlabeled training samples, and the pre-training model is a model obtained based on Bert network training;
    采用所述未标注的训练样本和所述学生模型对所述预训练模型进行整体蒸馏学习,得到第一次蒸馏后的学生模型;Using the unlabeled training samples and the student model to perform overall distillation learning on the pre-training model to obtain the student model after the first distillation;
    采用所述未标注的训练样本和所述第一次蒸馏后的学生模型对所述预训练模型进行分层蒸馏学习,得到第二次蒸馏后的学生模型;Use the unlabeled training samples and the student model after the first distillation to perform hierarchical distillation learning on the pre-training model to obtain the student model after the second distillation;
    采用所述带标注的训练样本对所述第二次蒸馏后的学生模型进行分层蒸馏学习,得到训练好的学生模型。Use the labeled training samples to perform hierarchical distillation learning on the student model after the second distillation to obtain a trained student model.
  2. The model distillation method according to claim 1, wherein the step of performing overall distillation learning on the pre-training model by using the unlabeled training samples and the student model to obtain the student model after the first distillation includes:
    将所述未标注的训练样本输入所述预训练模型进行评分预测,获取所述预训练模型的评分预测层输出的第一预测评分;Input the unlabeled training samples into the pre-training model for scoring prediction, and obtain the first prediction score output by the scoring prediction layer of the pre-training model;
    将所述未标注的训练样本输入所述学生模型的进行评分预测,得到第二预测评分;Inputting the unlabeled training samples into the student model for scoring prediction to obtain a second prediction score;
    Inputting the first prediction score and the second prediction score into a first loss function for calculation to obtain a first loss value, updating all parameters of the student model according to the first loss value, and using the student model with the updated parameters for the next calculation of the second prediction score;
    Repeating the above method steps until the first loss value reaches the first convergence condition or the number of iterations reaches the second convergence condition, and determining the student model whose first loss value reaches the first convergence condition or whose number of iterations reaches the second convergence condition as the student model after the first distillation.
  3. 根据权利要求2所述的模型蒸馏方法,其中,所述将所述第一预测评分、所述第二预测评分输入第一损失函数进行计算,得到第一损失值的步骤,包括:The model distillation method according to claim 2, wherein the step of inputting the first prediction score and the second prediction score into a first loss function for calculation to obtain a first loss value comprises:
    将所述第一预测评分、所述第二预测评分输入KL散度损失函数进行计算,得到所述第一损失值。The first prediction score and the second prediction score are input into the KL divergence loss function for calculation, and the first loss value is obtained.
  4. The model distillation method according to claim 1, wherein the step of performing hierarchical distillation learning on the pre-training model by using the unlabeled training samples and the student model after the first distillation to obtain the student model after the second distillation includes:
    将所述未标注的训练样本输入所述预训练模型进行概率预测,获取所述预训练模型的概率预测层输出的第一预测概率;Input the unlabeled training samples into the pre-training model for probability prediction, and obtain the first prediction probability output by the probability prediction layer of the pre-training model;
    将所述未标注的训练样本输入所述第一次蒸馏后的学生模型进行概率预测,得到第二预测概率;Inputting the unlabeled training samples into the student model after the first distillation for probability prediction to obtain a second predicted probability;
    Inputting the first predicted probability and the second predicted probability into a second loss function for calculation to obtain a second loss value, updating the parameters of the student model after the first distillation layer by layer according to the second loss value and the first preset parameter hierarchical update rule, and using the student model after the first distillation with the updated parameters for the next calculation of the second predicted probability;
    Repeating the above method steps until the second loss value reaches the third convergence condition or the number of iterations reaches the fourth convergence condition, and determining the student model after the first distillation whose second loss value reaches the third convergence condition or whose number of iterations reaches the fourth convergence condition as the student model after the second distillation.
  5. The model distillation method according to claim 4, wherein the step of inputting the first predicted probability and the second predicted probability into a second loss function for calculation to obtain a second loss value, and updating the parameters of the student model after the first distillation according to the second loss value and the first preset parameter hierarchical update rule, includes:
    将所述第一预测概率、所述第二预测概率输入MSE损失函数进行计算,得到所述第二损失值;Inputting the first predicted probability and the second predicted probability into the MSE loss function for calculation to obtain the second loss value;
    When the Dense layer parameters in the second loss value have not reached the first Dense layer convergence condition, updating the parameters of the Dense layer of the student model after the first distillation according to the Dense layer parameters in the second loss value; otherwise, when the BiLSTM layer parameters in the second loss value have not reached the first BiLSTM layer convergence condition, updating the parameters of the BiLSTM layer of the student model after the first distillation according to the BiLSTM layer parameters in the second loss value; otherwise, updating the parameters of the Embedding layer of the student model after the first distillation according to the Embedding layer parameters in the second loss value.
  6. The model distillation method according to claim 1, wherein the step of performing hierarchical distillation learning on the student model after the second distillation by using the labeled training samples to obtain the trained student model includes:
    将所述带标注的训练样本输入所述第二次蒸馏后的学生模型进行概率预测,得到第三预测概率;Inputting the labeled training sample into the student model after the second distillation for probability prediction to obtain a third prediction probability;
    Inputting the third predicted probability and the sample calibration value of the labeled training sample into a third loss function for calculation to obtain a third loss value, updating the parameters of the student model after the second distillation layer by layer according to the third loss value and the second preset parameter hierarchical update rule, and using the student model after the second distillation with the updated parameters for the next calculation of the third predicted probability;
    Repeating the above method steps until the third loss value reaches the fifth convergence condition or the number of iterations reaches the sixth convergence condition, and determining the student model after the second distillation whose third loss value reaches the fifth convergence condition or whose number of iterations reaches the sixth convergence condition as the trained student model.
  7. The model distillation method according to claim 6, wherein the step of inputting the third predicted probability and the sample calibration value of the labeled training sample into a third loss function for calculation to obtain a third loss value, and updating the parameters of the student model after the second distillation according to the third loss value and the second preset parameter hierarchical update rule, includes:
    将所述第三预测概率、所述带标注的训练样本的样本标定值输入交叉熵损失函数进行计算,得到所述第三损失值;Inputting the third predicted probability and the sample calibration value of the labeled training sample into a cross-entropy loss function for calculation to obtain the third loss value;
    When the Dense layer parameters in the third loss value have not reached the second Dense layer convergence condition, updating the parameters of the Dense layer of the student model after the second distillation according to the Dense layer parameters in the third loss value; otherwise, when the BiLSTM layer parameters in the third loss value have not reached the second BiLSTM layer convergence condition, updating the parameters of the BiLSTM layer of the student model after the second distillation according to the BiLSTM layer parameters in the third loss value; otherwise, updating the parameters of the Embedding layer of the student model after the second distillation according to the Embedding layer parameters in the third loss value.
  8. 一种模型蒸馏装置,其中,所述装置包括:A model distillation apparatus, wherein the apparatus comprises:
    数据获取模块,用于获取预训练模型、学生模型、多个带标注的训练样本、多个未标注的训练样本,所述预训练模型是基于Bert网络训练得到的模型;a data acquisition module for acquiring a pre-training model, a student model, a plurality of labeled training samples, and a plurality of unlabeled training samples, where the pre-training model is a model obtained through Bert network training;
    第一阶段蒸馏模块,用于采用所述未标注的训练样本和所述学生模型对所述预训练模型进行整体蒸馏学习,得到第一次蒸馏后的学生模型;The first-stage distillation module is used to perform overall distillation learning on the pre-training model by using the unlabeled training samples and the student model to obtain the student model after the first distillation;
    第二阶段蒸馏模块,用于采用所述未标注的训练样本和所述第一次蒸馏后的学生模型对所述预训练模型进行分层蒸馏学习,得到第二次蒸馏后的学生模型;The second-stage distillation module is used to perform hierarchical distillation learning on the pre-training model by using the unlabeled training samples and the student model after the first distillation to obtain the student model after the second distillation;
    第三阶段蒸馏模块,用于采用所述带标注的训练样本对所述第二次蒸馏后的学生模型进行分层蒸馏学习,得到训练好的学生模型。The third-stage distillation module is used to perform hierarchical distillation learning on the student model after the second distillation by using the labeled training samples to obtain a trained student model.
  9. 一种计算机设备,包括存储器和处理器,所述存储器存储有计算机程序, 其中,所述处理器执行所述计算机程序时实现如下方法步骤:A computer device, comprising a memory and a processor, wherein the memory stores a computer program, wherein the processor implements the following method steps when executing the computer program:
    获取预训练模型、学生模型、多个带标注的训练样本、多个未标注的训练样本,所述预训练模型是基于Bert网络训练得到的模型;Obtain a pre-training model, a student model, a plurality of labeled training samples, and a plurality of unlabeled training samples, and the pre-training model is a model obtained based on Bert network training;
    采用所述未标注的训练样本和所述学生模型对所述预训练模型进行整体蒸馏学习,得到第一次蒸馏后的学生模型;Using the unlabeled training samples and the student model to perform overall distillation learning on the pre-training model to obtain the student model after the first distillation;
    采用所述未标注的训练样本和所述第一次蒸馏后的学生模型对所述预训练模型进行分层蒸馏学习,得到第二次蒸馏后的学生模型;Use the unlabeled training samples and the student model after the first distillation to perform hierarchical distillation learning on the pre-training model to obtain the student model after the second distillation;
    采用所述带标注的训练样本对所述第二次蒸馏后的学生模型进行分层蒸馏学习,得到训练好的学生模型。Use the labeled training samples to perform hierarchical distillation learning on the student model after the second distillation to obtain a trained student model.
  10. The computer device according to claim 9, wherein the step of performing overall distillation learning on the pre-training model by using the unlabeled training samples and the student model to obtain the student model after the first distillation includes:
    将所述未标注的训练样本输入所述预训练模型进行评分预测,获取所述预训练模型的评分预测层输出的第一预测评分;Input the unlabeled training samples into the pre-training model for scoring prediction, and obtain the first prediction score output by the scoring prediction layer of the pre-training model;
    将所述未标注的训练样本输入所述学生模型的进行评分预测,得到第二预测评分;Inputting the unlabeled training samples into the student model for scoring prediction to obtain a second prediction score;
    Inputting the first prediction score and the second prediction score into a first loss function for calculation to obtain a first loss value, updating all parameters of the student model according to the first loss value, and using the student model with the updated parameters for the next calculation of the second prediction score;
    Repeating the above method steps until the first loss value reaches the first convergence condition or the number of iterations reaches the second convergence condition, and determining the student model whose first loss value reaches the first convergence condition or whose number of iterations reaches the second convergence condition as the student model after the first distillation.
  11. 根据权利要求10所述的计算机设备,其中,所述将所述第一预测评分、所述第二预测评分输入第一损失函数进行计算,得到第一损失值的步骤,包括:The computer device according to claim 10, wherein the step of inputting the first predicted score and the second predicted score into a first loss function for calculation to obtain a first loss value comprises:
    将所述第一预测评分、所述第二预测评分输入KL散度损失函数进行计算,得到所述第一损失值。The first prediction score and the second prediction score are input into the KL divergence loss function for calculation, and the first loss value is obtained.
  12. The computer device according to claim 9, wherein the step of performing hierarchical distillation learning on the pre-training model by using the unlabeled training samples and the student model after the first distillation to obtain the student model after the second distillation includes:
    将所述未标注的训练样本输入所述预训练模型进行概率预测,获取所述预训练模型的概率预测层输出的第一预测概率;Input the unlabeled training samples into the pre-training model for probability prediction, and obtain the first prediction probability output by the probability prediction layer of the pre-training model;
    将所述未标注的训练样本输入所述第一次蒸馏后的学生模型进行概率预测,得到第二预测概率;Inputting the unlabeled training samples into the student model after the first distillation for probability prediction to obtain a second predicted probability;
    Inputting the first predicted probability and the second predicted probability into a second loss function for calculation to obtain a second loss value, updating the parameters of the student model after the first distillation layer by layer according to the second loss value and the first preset parameter hierarchical update rule, and using the student model after the first distillation with the updated parameters for the next calculation of the second predicted probability;
    Repeating the above method steps until the second loss value reaches the third convergence condition or the number of iterations reaches the fourth convergence condition, and determining the student model after the first distillation whose second loss value reaches the third convergence condition or whose number of iterations reaches the fourth convergence condition as the student model after the second distillation.
  13. The computer device according to claim 12, wherein the step of inputting the first predicted probability and the second predicted probability into a second loss function for calculation to obtain a second loss value, and updating the parameters of the student model after the first distillation according to the second loss value and the first preset parameter hierarchical update rule, includes:
    将所述第一预测概率、所述第二预测概率输入MSE损失函数进行计算,得到所述第二损失值;Inputting the first predicted probability and the second predicted probability into the MSE loss function for calculation to obtain the second loss value;
    When the Dense layer parameters in the second loss value have not reached the first Dense layer convergence condition, updating the parameters of the Dense layer of the student model after the first distillation according to the Dense layer parameters in the second loss value; otherwise, when the BiLSTM layer parameters in the second loss value have not reached the first BiLSTM layer convergence condition, updating the parameters of the BiLSTM layer of the student model after the first distillation according to the BiLSTM layer parameters in the second loss value; otherwise, updating the parameters of the Embedding layer of the student model after the first distillation according to the Embedding layer parameters in the second loss value.
  14. The computer device according to claim 9, wherein the step of performing hierarchical distillation learning on the student model after the second distillation by using the labeled training samples to obtain the trained student model comprises:
    inputting the labeled training samples into the student model after the second distillation for probability prediction to obtain a third predicted probability;
    inputting the third predicted probability and the sample calibration values of the labeled training samples into a third loss function for calculation to obtain a third loss value, updating the parameters of the student model after the second distillation according to the third loss value and a second preset hierarchical parameter update rule, and using the student model after the second distillation with the updated parameters for the next calculation of the third predicted probability;
    repeating the above method steps until the third loss value reaches a fifth convergence condition or the number of iterations reaches a sixth convergence condition, and determining the student model after the second distillation whose third loss value reaches the fifth convergence condition or whose number of iterations reaches the sixth convergence condition as the trained student model.
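As a rough illustration of this labeled-sample stage, the sketch below assumes cross-entropy as the third loss function (the text does not name it), uses a plain optimizer step in place of the second preset hierarchical parameter update rule, and treats `eps` and `max_iters` as stand-ins for the fifth and sixth convergence conditions:

```python
import torch.nn.functional as F

def labeled_stage(student, optimizer, loader, eps=1e-3, max_iters=10_000):
    """Final tuning of the twice-distilled student on labeled samples."""
    step, prev = 0, None
    while step < max_iters:
        for x, y in loader:                      # labeled training samples and their labels
            logits = student(x)                  # third predicted probability (as logits)
            loss = F.cross_entropy(logits, y)    # third loss value (assumed cross-entropy)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            converged = prev is not None and abs(prev - loss.item()) < eps
            if converged or step >= max_iters:   # fifth / sixth convergence conditions
                return student                   # trained student model
            prev = loss.item()
    return student
```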
  15. A computer-readable storage medium on which a computer program is stored, wherein, when the computer program is executed by a processor, the following method steps are implemented:
    obtaining a pre-training model, a student model, a plurality of labeled training samples, and a plurality of unlabeled training samples, wherein the pre-training model is a model obtained through Bert network training;
    performing overall distillation learning on the pre-training model by using the unlabeled training samples and the student model, to obtain a student model after a first distillation;
    performing hierarchical distillation learning on the pre-training model by using the unlabeled training samples and the student model after the first distillation, to obtain a student model after a second distillation;
    performing hierarchical distillation learning on the student model after the second distillation by using the labeled training samples, to obtain a trained student model.
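Taken together, these steps form a three-stage pipeline. A minimal orchestration sketch, assuming helper functions `overall_distill`, `hierarchical_distill` and `labeled_stage` that carry out the three stages (signatures abbreviated here):

```python
def distill_pipeline(teacher, student, unlabeled, labeled):
    s1 = overall_distill(teacher, student, unlabeled)    # first distillation: whole student updated
    s2 = hierarchical_distill(teacher, s1, unlabeled)    # second distillation: layer-wise updates
    trained = labeled_stage(s2, labeled)                 # final tuning on labeled samples
    return trained
```

Only the final stage consumes labeled samples; the first two stages run entirely on unlabeled data.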
  16. The computer-readable storage medium according to claim 15, wherein the step of performing overall distillation learning on the pre-training model by using the unlabeled training samples and the student model to obtain the student model after the first distillation comprises:
    inputting the unlabeled training samples into the pre-training model for score prediction, and obtaining a first predicted score output by a score prediction layer of the pre-training model;
    inputting the unlabeled training samples into the student model for score prediction to obtain a second predicted score;
    inputting the first predicted score and the second predicted score into a first loss function for calculation to obtain a first loss value, updating all parameters of the student model according to the first loss value, and using the student model with the updated parameters for the next calculation of the second predicted score;
    repeating the above method steps until the first loss value reaches a first convergence condition or the number of iterations reaches a second convergence condition, and determining the student model whose first loss value reaches the first convergence condition or whose number of iterations reaches the second convergence condition as the student model after the first distillation.
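A loop of this shape is one way to realize the overall distillation step; the first loss function is passed in as `loss_fn`, and `eps` / `max_iters` are assumed stand-ins for the first and second convergence conditions:

```python
import torch

def overall_distill(teacher, student, optimizer, loader, loss_fn,
                    eps=1e-3, max_iters=10_000):
    teacher.eval()
    step, prev = 0, None
    while step < max_iters:
        for x in loader:                          # unlabeled training samples
            with torch.no_grad():
                s_teacher = teacher(x)            # first predicted score (score prediction layer)
            s_student = student(x)                # second predicted score
            loss = loss_fn(s_student, s_teacher)  # first loss value
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                      # all student parameters are updated
            step += 1
            converged = prev is not None and abs(prev - loss.item()) < eps
            if converged or step >= max_iters:    # first / second convergence conditions
                return student                    # student model after the first distillation
            prev = loss.item()
    return student
```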
  17. The computer-readable storage medium according to claim 16, wherein the step of inputting the first predicted score and the second predicted score into the first loss function for calculation to obtain the first loss value comprises:
    inputting the first predicted score and the second predicted score into a KL divergence loss function for calculation to obtain the first loss value.
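One plausible reading of this step, treating the predicted scores as logits and adding a temperature parameter as a common distillation assumption not stated in the text:

```python
import torch.nn.functional as F

def kl_first_loss(student_scores, teacher_scores, temperature=1.0):
    # KL divergence expects log-probabilities for the input and probabilities for the target
    log_p_student = F.log_softmax(student_scores / temperature, dim=-1)
    p_teacher = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```

With `reduction="batchmean"` the returned value is the KL divergence averaged over the batch, so the loss scale does not depend on batch size.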
  18. The computer-readable storage medium according to claim 15, wherein the step of performing hierarchical distillation learning on the pre-training model by using the unlabeled training samples and the student model after the first distillation to obtain the student model after the second distillation comprises:
    inputting the unlabeled training samples into the pre-training model for probability prediction, and obtaining a first predicted probability output by a probability prediction layer of the pre-training model;
    inputting the unlabeled training samples into the student model after the first distillation for probability prediction to obtain a second predicted probability;
    inputting the first predicted probability and the second predicted probability into a second loss function for calculation to obtain a second loss value, updating the parameters of the student model after the first distillation according to the second loss value and a first preset hierarchical parameter update rule, and using the student model after the first distillation with the updated parameters for the next calculation of the second predicted probability;
    repeating the above method steps until the second loss value reaches a third convergence condition or the number of iterations reaches a fourth convergence condition, and determining the student model after the first distillation whose second loss value reaches the third convergence condition or whose number of iterations reaches the fourth convergence condition as the student model after the second distillation.
  19. The computer-readable storage medium according to claim 18, wherein the step of inputting the first predicted probability and the second predicted probability into the second loss function for calculation to obtain the second loss value, and updating the parameters of the student model after the first distillation according to the second loss value and the first preset hierarchical parameter update rule comprises:
    inputting the first predicted probability and the second predicted probability into an MSE loss function for calculation to obtain the second loss value;
    when the Dense layer parameters in the second loss value have not reached a first Dense layer convergence condition, updating the parameters of the Dense layer of the student model after the first distillation according to the Dense layer parameters in the second loss value; otherwise, when the BiLSTM layer parameters in the second loss value have not reached a first BiLSTM layer convergence condition, updating the parameters of the BiLSTM layer of the student model after the first distillation according to the BiLSTM layer parameters in the second loss value; otherwise, updating the parameters of the Embedding layer of the student model after the first distillation according to the Embedding layer parameters in the second loss value.
  20. The computer-readable storage medium according to claim 15, wherein the step of performing hierarchical distillation learning on the student model after the second distillation by using the labeled training samples to obtain the trained student model comprises:
    inputting the labeled training samples into the student model after the second distillation for probability prediction to obtain a third predicted probability;
    inputting the third predicted probability and the sample calibration values of the labeled training samples into a third loss function for calculation to obtain a third loss value, updating the parameters of the student model after the second distillation according to the third loss value and a second preset hierarchical parameter update rule, and using the student model after the second distillation with the updated parameters for the next calculation of the third predicted probability;
    repeating the above method steps until the third loss value reaches a fifth convergence condition or the number of iterations reaches a sixth convergence condition, and determining the student model after the second distillation whose third loss value reaches the fifth convergence condition or whose number of iterations reaches the sixth convergence condition as the trained student model.
PCT/CN2021/084539 2021-02-26 2021-03-31 Model distillation method and apparatus, device, and storage medium WO2022178948A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110220512.9A CN112836762A (en) 2021-02-26 2021-02-26 Model distillation method, device, equipment and storage medium
CN202110220512.9 2021-02-26

Publications (1)

Publication Number Publication Date
WO2022178948A1 true WO2022178948A1 (en) 2022-09-01

Family

ID=75933941

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084539 WO2022178948A1 (en) 2021-02-26 2021-03-31 Model distillation method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN112836762A (en)
WO (1) WO2022178948A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177616B (en) * 2021-06-29 2021-09-17 腾讯科技(深圳)有限公司 Image classification method, device, equipment and storage medium
CN113673698B (en) * 2021-08-24 2024-05-10 平安科技(深圳)有限公司 Distillation method, device, equipment and storage medium suitable for BERT model
CN115861847B (en) * 2023-02-24 2023-05-05 耕宇牧星(北京)空间科技有限公司 Intelligent auxiliary labeling method for visible light remote sensing image target

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059740A (en) * 2019-04-12 2019-07-26 杭州电子科技大学 A kind of deep learning semantic segmentation model compression method for embedded mobile end
CN110852426A (en) * 2019-11-19 2020-02-28 成都晓多科技有限公司 Pre-training model integration acceleration method and device based on knowledge distillation
CN111242297A (en) * 2019-12-19 2020-06-05 北京迈格威科技有限公司 Knowledge distillation-based model training method, image processing method and device
US20200302230A1 (en) * 2019-03-21 2020-09-24 International Business Machines Corporation Method of incremental learning for object detection
WO2021002968A1 (en) * 2019-07-02 2021-01-07 Microsoft Technology Licensing, Llc Model generation based on model compression

Also Published As

Publication number Publication date
CN112836762A (en) 2021-05-25

Similar Documents

Publication Publication Date Title
WO2022178948A1 (en) Model distillation method and apparatus, device, and storage medium
US20220382564A1 (en) Aggregate features for machine learning
Wen et al. Optimized backstepping for tracking control of strict-feedback systems
CN113673698B (en) Distillation method, device, equipment and storage medium suitable for BERT model
CN111061847A (en) Dialogue generation and corpus expansion method and device, computer equipment and storage medium
WO2020151310A1 (en) Text generation method and device, computer apparatus, and medium
CN111598213B (en) Network training method, data identification method, device, equipment and medium
WO2022142043A1 (en) Course recommendation method and apparatus, device, and storage medium
Kamalapurkar et al. Concurrent learning-based approximate optimal regulation
CN112418482A (en) Cloud computing energy consumption prediction method based on time series clustering
KR20210106398A (en) Conversation-based recommending method, conversation-based recommending apparatus, and device
CN113642707A (en) Model training method, device, equipment and storage medium based on federal learning
CN116363423A (en) Knowledge distillation method, device and storage medium for small sample learning
Xu et al. Optimal regulation of uncertain dynamic systems using adaptive dynamic programming
CN115186062A (en) Multi-modal prediction method, device, equipment and storage medium
CN114416984A (en) Text classification method, device and equipment based on artificial intelligence and storage medium
Shen et al. A unified analysis of AdaGrad with weighted aggregation and momentum acceleration
CN113268564B (en) Method, device, equipment and storage medium for generating similar problems
Elahi et al. Finite-time stabilisation of discrete networked cascade control systems under transmission delay and packet dropout via static output feedback control
Mera et al. Finite-time attractive ellipsoid method: implicit Lyapunov function approach
WO2022178950A1 (en) Method and apparatus for predicting statement entity, and computer device
CN113777965B (en) Spray quality control method, spray quality control device, computer equipment and storage medium
Biagiola et al. Robust model predictive control of Wiener systems
Bhatnagar et al. Feature search in the Grassmanian in online reinforcement learning
Wang et al. Intermediate variable normalization for gradient descent learning for hierarchical fuzzy system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21927382

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21927382

Country of ref document: EP

Kind code of ref document: A1