WO2022105173A1 - Model distillation method and apparatus, and storage medium and device - Google Patents

Model distillation method and apparatus, and storage medium and device

Info

Publication number
WO2022105173A1
WO2022105173A1 (PCT/CN2021/096649)
Authority
WO
WIPO (PCT)
Prior art keywords
teacher
student
identification
preset
model
Prior art date
Application number
PCT/CN2021/096649
Other languages
French (fr)
Chinese (zh)
Inventor
吴天博 (Wu Tianbo)
王健宗 (Wang Jianzong)
程宁 (Cheng Ning)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022105173A1 publication Critical patent/WO2022105173A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/02 Knowledge representation; Symbolic representation

Definitions

  • The present application relates to the technical field of artificial intelligence, and in particular to a model distillation method and apparatus, and a storage medium and computer device.
  • Model distillation refers to using a teacher model with high accuracy but a complex structure to guide the training of a student model with low accuracy but a simple structure, so as to improve the accuracy of the student model.
  • The inventors realized that although the student model can learn from the teacher model and thereby improve its accuracy, differences remain between the teacher model and the student model in existing distillation architectures, resulting in poor expressiveness and low accuracy of the student model.
  • The technical problem to be solved by the embodiments of the present application is to provide a model distillation method, apparatus, storage medium and device that can improve the accuracy and data processing capability of the student model.
  • the embodiments of the present application provide a method for model distillation, including:
  • the training sample data is identified by the preset student model and the preset teacher model respectively, and the teacher identification result and the student identification result of the training sample data are obtained, wherein the preset student model is obtained through training guided by the preset teacher model;
  • A model distillation apparatus, comprising:
  • a first acquisition module, used for acquiring training sample data for training a preset student model;
  • an identification module, configured to identify the training sample data by using the preset student model and a preset teacher model respectively, to obtain a teacher identification result and a student identification result of the training sample data, wherein the preset student model is obtained through training guided by the preset teacher model;
  • a second acquisition module, configured to acquire, from the teacher identification result, a weight parameter for adjusting the identification result of the preset student model; and
  • an adjustment module, used to calculate the logarithm between the teacher identification result and the student identification result, perform a weighted operation on the logarithm using the weight parameter, and use the calculated value as a loss value to adjust the preset student model.
  • One aspect of the present application provides a computer device, including: a processor and a memory;
  • the above-mentioned memory is used to store a computer program;
  • the above-mentioned processor is used to call the above-mentioned computer program to perform the following steps:
  • the training sample data is identified by the preset student model and the preset teacher model respectively, and the teacher identification result and the student identification result of the training sample data are obtained, wherein the preset student model is obtained through training guided by the preset teacher model;
  • an embodiment of the present application provides a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program, and the computer program includes program instructions. When executed by a processor, the program instructions perform the following steps:
  • the training sample data is identified by the preset student model and the preset teacher model respectively, and the teacher identification result and the student identification result of the training sample data are obtained, wherein the preset student model is obtained through training guided by the preset teacher model;
  • According to the weight parameters, the embodiments of the present application can reasonably allocate, across the different results in the identification results of the training sample data, the adjustment weights used to adjust the student model, which can make the adjusted student model more accurate.
  • the student model can have the data processing capability of the teacher model, and the accuracy of the student model can be improved.
  • FIG. 1 is a schematic flowchart of a model distillation method provided by an embodiment of the present application;
  • FIG. 2 is a schematic diagram of a method for calculating the average value of multiple teacher identification results in each teacher identification group provided by an embodiment of the present application;
  • FIG. 3 is a schematic diagram of a method for obtaining a loss value of a preset student model provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of model distillation provided by an embodiment of the present application;
  • FIG. 5 is a schematic flowchart of another model distillation method provided by an embodiment of the present application;
  • FIG. 6 is a schematic structural diagram of a model distillation apparatus provided by an embodiment of the present application;
  • FIG. 7 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the technical solutions of the present application relate to the technical field of artificial intelligence and/or big data.
  • The data involved in this application, such as sample data, identification results and/or loss values, may be stored in a database, or may be stored in a blockchain, for example through distributed storage on a blockchain, which is not limited in this application.
  • FIG. 1 is a schematic flowchart of a model distillation method provided by an embodiment of the present application.
  • The method can be performed by a computer device, which can be a terminal or a server. The terminal can include, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, and a smart watch.
  • The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, Content Delivery Network (CDN), big data, and artificial intelligence platforms.
  • the model distillation method may include steps S101-S105.
  • S101 Acquire training sample data for training a preset student model.
  • the preset student model refers to the student model in model distillation.
  • Model distillation (knowledge distillation) refers to using the teacher model to guide the training of the student model, so as to improve the accuracy of the student model.
  • the training sample data may refer to text data, image data, and the like.
  • a preset teacher model is used to identify the training sample data to obtain a teacher identification result of the training sample data
  • a preset student model is used to identify the training sample data to obtain a student identification result of the training sample data.
  • the preset student model is guided and trained by the preset teacher model.
  • The teacher model has high accuracy but high computational complexity, which makes it unsuitable for deployment on terminal equipment, while the computation of the student model is relatively simple and meets the requirements of terminal equipment, but its accuracy is insufficient. Model distillation can therefore be used to solve this problem: the preset student model is trained under the guidance of the preset teacher model, so as to improve the accuracy of the preset student model.
  • The data processing type of the teacher model and the student model is the same; the network in the teacher model is deeper or wider and its resolution is larger, that is, the data processing capability of the teacher model is higher than that of the student model.
  • The selection can be made according to the similarity between the teacher model and the student model.
  • The more similar the structures of the teacher model and the student model, the smaller the difference in accuracy between the two after distillation. Therefore, the teacher model and the student model can be selected from the same family of models, such as the resnet network series; a resnet network is a residual network whose width and depth can be easily adjusted to obtain networks with different expressive abilities, as in the sketch below.
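  • As an illustrative pairing of same-family models, the following sketch assumes the torchvision library is available; resnet50 as teacher, resnet18 as student, and the num_classes value are example choices, not requirements of this application.

```python
# Illustrative only: pair a deeper teacher with a shallower student from the
# same residual-network family (assumes the torchvision API is available).
import torch.nn as nn
from torchvision import models

def build_teacher_student(num_classes: int = 10):
    teacher = models.resnet50(weights=None)  # deeper/wider: higher capacity
    student = models.resnet18(weights=None)  # shallower: cheap to deploy
    # Replace the classification heads so both models share one output space.
    teacher.fc = nn.Linear(teacher.fc.in_features, num_classes)
    student.fc = nn.Linear(student.fc.in_features, num_classes)
    return teacher, student
```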
  • The preset student model is adjusted so that the knowledge of the preset teacher model is transferred to the preset student model, enabling the preset student model to acquire the data processing capability and accuracy of the preset teacher model.
  • The teacher recognition result can be used to generate a weight parameter for adjusting the recognition result of the preset student model. The weight parameter determines the weight of the recognition result of the preset student model when the loss value is generated; that is, the larger the weight parameter, the greater the weight of the recognition result of the preset student model in generating the loss value.
  • Specifically, a balance parameter used to balance the teacher recognition results can be obtained.
  • The obtained multiple teacher identification results are grouped according to the balance parameter to obtain multiple teacher identification groups arranged in sequence, wherein each teacher identification group contains the same number of teacher identification results. The average value of the multiple teacher recognition results in each teacher recognition group is calculated respectively, and the obtained multiple average values are used as the weight parameters after the balancing processing.
  • The teacher identification result comprises a plurality of teacher identification results, and each teacher identification result represents an identification probability, that is, the probability obtained by the preset teacher model identifying the training sample data.
  • The balance parameter used to balance the teacher identification results can be obtained.
  • The balance parameter can be a positive integer greater than or equal to 1 and is used to balance abnormal results among the teacher identification results.
  • Abnormal results are results that are far larger or far smaller than the normal results, that is, abnormal compared with the other results.
  • Specifically, the multiple teacher recognition results are sorted to obtain the sorted multiple teacher recognition results, and the sorted multiple teacher recognition results are grouped according to the balance parameter to obtain multiple teacher identification groups arranged in sequence.
  • The number of teacher identification results included in each of the multiple teacher identification groups is the same. The average value of the multiple teacher recognition results in each teacher recognition group is then calculated, and the recognition result of the preset student model can be adjusted by using the weight parameters after the balancing processing. The larger the average value of the multiple teacher recognition results in a teacher recognition group, the larger the corresponding balanced weight parameter; since a teacher recognition result represents a recognition probability, a larger average recognition probability in the teacher recognition group yields a larger balanced weight parameter.
  • In this way, the loss value of the preset student model calculated according to the weight parameters is more accurate, and the accuracy of the preset student model can be improved.
  • The obtained multiple teacher identification results are grouped according to the balance parameter to obtain multiple teacher identification groups arranged in sequence, and the average value of the multiple teacher identification results in each teacher identification group is calculated; that is, the balance parameter is used to balance each teacher identification result.
  • This balances the abnormal results among the teacher identification results, reduces the error generated when the weight parameters are produced, and improves the accuracy of the preset student model.
  • The number of teacher identification results among the multiple teacher identification results can be obtained, and the preset threshold range to which this number belongs is determined.
  • The teacher identification result comprises a plurality of teacher identification results, each representing a recognition probability, that is, the probability obtained by the preset teacher model recognizing the training sample data.
  • After the preset threshold range to which the number of teacher identification results belongs is determined, the target balance parameter corresponding to that preset threshold range may be determined from a balance parameter library; the balance parameter library includes at least one balance parameter and the corresponding relationship between each balance parameter in the at least one balance parameter and a preset threshold range.
  • A balance parameter library is preset; the balance parameter library includes at least one balance parameter and the corresponding relationship between each balance parameter in the at least one balance parameter and a preset threshold range. For example, the first threshold range corresponds to the first balance parameter, the second threshold range corresponds to the second balance parameter, and so on.
  • Optionally, the balance parameter may also be determined according to the number of student identification results among the multiple student identification results:
  • determine the preset threshold range to which the number of student identification results belongs;
  • determine, from the balance parameter library, the target balance parameter corresponding to the preset threshold range to which the number of student identification results belongs; and
  • use that target balance parameter as the balance parameter for balancing the teacher identification results.
  • The balance parameter library used here likewise includes at least one balance parameter and the corresponding relationship between each balance parameter in the at least one balance parameter and a preset threshold range.
  • In other words, the balance parameter C can be determined according to the number of teacher recognition results, the number of student recognition results, or other specific circumstances, so as to balance abnormal results in the teacher identification results or the student identification results; a minimal lookup sketch follows.
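  • The threshold ranges and C values below are invented placeholders, since the application only specifies that preset threshold ranges map to balance parameters.

```python
# Hypothetical balance-parameter library: each preset threshold range
# (lower bound inclusive, upper bound exclusive) for the number of
# identification results maps to a balance parameter C. Values are placeholders.
BALANCE_PARAMETER_LIBRARY = [
    ((1, 10), 1),
    ((10, 100), 3),    # e.g. the 11-result example of FIG. 2 falls here
    ((100, 1000), 5),
]

def lookup_balance_parameter(num_results: int) -> int:
    for (low, high), c in BALANCE_PARAMETER_LIBRARY:
        if low <= num_results < high:
            return c
    return 1  # fallback: leave the results unbalanced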
  • FIG. 2 is a schematic diagram of a method for calculating the average value of multiple teacher identification results in each teacher identification group provided by an embodiment of the present application. As shown in FIG. 2, taking a distribution of 11 teacher recognition results as an example, the 11 teacher recognition results are sorted according to the preset recognition order of the teacher model to obtain the sorted teacher recognition results x1, x2, ..., x11.
  • The 11 sorted teacher identification results can then be grouped according to a balance parameter of 3 to obtain multiple teacher identification groups arranged in sequence, namely [x1, x2], [x1, x2, x3], [x2, x3, x4], [x3, x4, x5], ..., [x9, x10, x11], [x10, x11].
  • the teacher recognition group corresponding to the teacher recognition result x1 can be [x1, x2] or [x1, x2, x3].
  • the teacher recognition group corresponding to the teacher recognition result x11 can be [x10, x11] or [x9, x10, x11].
  • For example, in a distribution of teacher recognition results for the training sample data, if the third teacher recognition result is 0.0001, it becomes 0.0234 after the balancing processing; the abnormal teacher recognition result is thereby balanced, that is, the abnormality is eliminated.
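  • A minimal sketch of this grouping-and-averaging step, assuming the group of each sorted result is a window of at most C neighboring results truncated at the boundaries (consistent with the [x1, x2] and [x10, x11] groups of FIG. 2); the group means serve as the balanced values.

```python
from typing import List

def balance_results(results: List[float], c: int = 3) -> List[float]:
    """Group each sorted result with up to c neighbors and average each group.

    Mirrors FIG. 2 for c = 3: the group of x1 is [x1, x2] (truncated at the
    boundary) and the group of x6 is [x5, x6, x7].
    """
    half = c // 2
    balanced = []
    for i in range(len(results)):
        group = results[max(0, i - half): i + half + 1]  # truncated window
        balanced.append(sum(group) / len(group))         # group mean
    return balanced

# An outlier of 0.0001 among ordinary values is pulled toward its neighbors:
print(balance_results([0.03, 0.04, 0.0001, 0.05, 0.02], c=3))
```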
  • S104 Calculate the logarithm between the teacher identification result and the student identification result, and perform a weighting operation on the logarithm using a weight parameter, and use the calculated value as a loss value to adjust the preset student model.
  • the preset student model is adjusted according to the loss value of the preset student model to obtain an adjusted student model, and the adjusted student model is used as the target student model.
  • The target student model is used to identify the data to be processed, and the recognition result obtained by the target student model identifying the data to be processed matches the recognition result obtained by the preset teacher model identifying the data to be processed, that is, the target student model has the data processing capability of the preset teacher model.
  • When calculating the logarithm between the teacher identification result and the student identification result and weighting the logarithm with the weight parameter, the following formula (1) may be used:

$$D_{KL}(P\|Q)=\sum_{x\in X}P(x)\log\frac{P(x)}{Q(x)} \tag{1}$$

  • D_KL(P‖Q) refers to the loss value of the preset student model;
  • P(x) refers to the teacher recognition result of the training sample data, that is, the result obtained by the teacher model recognizing the training sample data;
  • Q(x) refers to the student recognition result of the training sample data, that is, the result obtained by the student model predicting the training sample data; log(P(x)/Q(x)) refers to the logarithm between the teacher recognition result and the student recognition result.
  • The weight parameter refers to P(x); x refers to any one of the teacher recognition results or student recognition results, and X refers to the set of teacher recognition results and student recognition results.
  • Formula (1) is known as the KL-divergence (relative entropy).
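  • A direct transcription of formula (1), assuming P and Q are probability distributions over the same ordered result set:

```python
import math

def kl_divergence(p: list, q: list) -> float:
    """Formula (1): D_KL(P||Q) = sum_x P(x) * log(P(x) / Q(x)).

    P(x) plays the role of the weight parameter on log(P(x) / Q(x)).
    """
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # identical distributions: 0.0
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))  # diverging distributions: > 0
```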
  • FIG. 3 is a schematic diagram of a method for obtaining the loss value of the preset student model provided by an embodiment of the present application. As shown in FIG. 3, the method for obtaining the loss value of the preset student model includes steps S21-S23.
  • Each student identification group in the multiple student identification groups contains the same number of student identification results;
  • each teacher identification group corresponds one-to-one to a student identification group according to the identification order.
  • the student identification result is a plurality of student identification results, and the student identification result represents the identification probability, which refers to the identification probability obtained by identifying the training sample data by the preset student model.
  • The preset recognition sequence of the teacher model may be the same as the preset recognition sequence of the student model, that is, the two models recognize the training sample data in the same arrangement order.
  • The obtained multiple student identification results are grouped according to the above-mentioned balance parameter to obtain multiple student identification groups arranged in sequence, wherein each student identification group in the multiple student identification groups contains the same number of student identification results, and each teacher identification group corresponds one-to-one to a student identification group according to the identification sequence.
  • The average value of the multiple student identification results in each student identification group is then calculated; that is, the balance parameter is used to balance each student identification result, and the balanced student identification results are obtained. In this way, the abnormal results among the student identification results can be balanced and the occurrence of extreme values greatly reduced, thereby improving the accuracy of the target student model and enabling the student model to acquire the data processing capability of the teacher model.
  • The following formulas (2), (3) and (4) can be used to perform the weighting operation with the balanced weight parameters on the logarithm between the teacher identification result and the student identification result:

$$P'(x)=\frac{1}{z}\sum_{e\in G_T(x)}P(e) \tag{2}$$

$$Q'(x)=\frac{1}{z}\sum_{e\in G_S(x)}Q(e) \tag{3}$$

$$D_{DKL}(P\|Q)_C=\sum_{x\in X}P'(x)\log\frac{P'(x)}{Q'(x)} \tag{4}$$

  • z in formulas (2) and (3) represents the number of teacher recognition results in a teacher recognition group or the number of student recognition results in a student recognition group;
  • e represents a result within the teacher recognition group G_T(x) or the student recognition group G_S(x) to which x belongs.
  • D_DKL(P‖Q)_C in formula (4) refers to the loss value;
  • P'(x) refers to the average value of the teacher recognition group, that is, the average value obtained by balancing the teacher recognition results;
  • Q'(x) refers to the average value of the student recognition group, that is, the average value obtained by balancing the student recognition results.
  • As in formula (1), the weight parameter may refer to P(x), where x refers to any one of the teacher identification results and student identification results, and X refers to the set of teacher identification results and student identification results.
  • When P(x) and Q(x) are identical, D_DKL(P‖Q)_C is 0.
  • The present application introduces a balance parameter into the KL-divergence (relative entropy) function; the KL-divergence loss function with the balance parameter introduced is referred to as the DKL-divergence.
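  • A sketch of the DKL-divergence of formulas (2) to (4) under the same windowed-grouping assumption as the earlier sketch: both result sequences are balanced with the same parameter C, and the balanced teacher means P'(x) weight the logarithms.

```python
import math
from typing import List

def group_means(results: List[float], c: int = 3) -> List[float]:
    # Formulas (2)/(3): mean over a window of at most c neighboring results.
    half = c // 2
    groups = [results[max(0, i - half): i + half + 1]
              for i in range(len(results))]
    return [sum(g) / len(g) for g in groups]

def dkl_divergence(p: List[float], q: List[float], c: int = 3) -> float:
    """Formula (4): sum_x P'(x) * log(P'(x) / Q'(x)) over the group means."""
    p_bal, q_bal = group_means(p, c), group_means(q, c)
    return sum(pb * math.log(pb / qb)
               for pb, qb in zip(p_bal, q_bal) if pb > 0)
```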
  • In some embodiments, the preset teacher model includes multiple teacher distillation layers and the preset student model includes multiple student distillation layers.
  • When the logarithm between the teacher identification result and the student identification result is calculated and weighted using the weight parameter, the teacher distillation layer corresponding to each student distillation layer in the multiple student distillation layers can be determined.
  • The logarithm between the student identification result of each student distillation layer and the teacher identification result of the corresponding teacher distillation layer is calculated and weighted to obtain the loss value of each student distillation layer, and the loss value of each student distillation layer is used to adjust the corresponding student distillation layer in the preset student model.
  • Specifically, the preset teacher model includes multiple teacher distillation layers, and each teacher distillation layer outputs a corresponding teacher identification result.
  • Likewise, the preset student model includes multiple student distillation layers, and each student distillation layer outputs a corresponding student identification result, so that knowledge distillation can be performed between corresponding distillation layers of the preset teacher model and the preset student model.
  • The teacher distillation layer corresponding to each student distillation layer in the multiple student distillation layers is determined, and the weight parameter is used to weight the logarithm between the student identification result output by each student distillation layer and the teacher identification result output by the corresponding teacher distillation layer, to obtain the loss value corresponding to each student distillation layer.
  • The loss value of each student distillation layer is then used to adjust the corresponding student distillation layer in the preset student model, so that each student distillation layer in the preset student model can be adjusted more accurately and the accuracy of the student model improved.
  • The distillation model generally includes three parts of distillation, namely Transformer-layer (conversion layer) distillation, Embedding-layer (embedding layer) distillation, and Prediction-layer (prediction layer) distillation; knowledge distillation can be performed respectively between these three distillation layers in the teacher model and the corresponding three distillation layers in the student model.
  • That is, knowledge distillation is performed on the Transformer-layer in the student model according to the Transformer-layer in the teacher model, the loss value of the Transformer-layer is calculated, and the Transformer-layer is adjusted according to that loss value.
  • Knowledge distillation is performed on the Embedding-layer in the student model according to the Embedding-layer in the teacher model, the loss value of the Embedding-layer is calculated, and the loss value of the Embedding-layer is used to adjust the Embedding-layer; the Prediction-layer is handled in the same way.
  • Assuming that the student model (new model) has M Transformer (transformation) layers and the teacher model (original model) has N Transformer layers, the m-th layer of the student model obtains information from the n-th layer of the teacher model.
  • The following formula (5) can be used to represent the distillation loss of knowledge transfer from teacher to student:

$$L_{model}=\sum_{m}\lambda_m L_{layer}(S_m,T_n) \tag{5}$$

  • L_layer in formula (5) represents the loss function of the specified layer, where the specified layer may refer to the Transformer-layer (conversion layer), the Embedding-layer (embedding layer), or the Prediction-layer (prediction layer); S_m and T_n denote the m-th student layer and the n-th teacher layer it learns from.
  • λ_m is a hyperparameter representing the loss weight of each layer.
  • L_model represents the sum of the knowledge distillation losses over all layers.
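  • A sketch of formula (5): layer_losses holds the per-layer distillation losses L_layer already computed for the Transformer, embedding and prediction layers, and lambdas holds the λ_m hyperparameters; the uniform layer mapping g(m) is one common assumption and is not fixed by this application.

```python
from typing import List

def total_distillation_loss(layer_losses: List[float],
                            lambdas: List[float]) -> float:
    """Formula (5): L_model = sum over m of lambda_m * L_layer_m."""
    assert len(layer_losses) == len(lambdas)
    return sum(lam * loss for lam, loss in zip(lambdas, layer_losses))

def uniform_layer_map(m: int, M: int, N: int) -> int:
    """One common choice of n = g(m): student layer m reads teacher layer m*N/M."""
    return m * N // M
```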
  • A corresponding loss function can be set for the Transformer-layer (conversion layer) distillation, and adjustments can be made according to the loss value obtained by the corresponding loss function.
  • the Transformer-layer distillation in the model distillation includes self-attention-based distillation and hidden state-based distillation.
  • In the related art, the objective function of the self-attention matrix distillation of the Transformer-layer (conversion layer) is the following formula (6):

$$L_{attn}=\frac{1}{h}\sum_{i=1}^{h}\mathrm{MSE}(A_i^S,A_i^T) \tag{6}$$

  • h in formula (6) is the number of attention heads, i represents the i-th attention head, A_i^S and A_i^T denote the attention matrices of the student model and the teacher model respectively, and MSE refers to the mean squared error loss.
  • The objective function of the hidden-state-based distillation in the related art is the following formula (7):

$$L_{hidn}=\mathrm{MSE}(H^S W_h,H^T) \tag{7}$$

  • H^S ∈ R^{l×d'} and H^T ∈ R^{l×d} in formula (7) refer to the hidden state matrices of the student and the teacher respectively, and R^{l×d'} and R^{l×d} represent the sizes of the student's and teacher's hidden state matrix spaces respectively; l and d represent the length of the training sample data (i.e., the length of the input sentence) and the size of the hidden layer respectively.
  • W_h ∈ R^{d'×d} is a learnable linear transformation matrix that transforms the student's hidden state matrix into the same result space size as the teacher's.
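  • A numpy sketch of formulas (6) and (7); a_s and a_t are per-head attention matrices, h_s and h_t are hidden-state matrices, and w_h stands in for the learnable projection W_h (here an ordinary array for illustration).

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean((a - b) ** 2))

def attention_loss(a_s: list, a_t: list) -> float:
    """Formula (6): MSE between student and teacher attention, averaged over heads."""
    return sum(mse(s, t) for s, t in zip(a_s, a_t)) / len(a_s)

def hidden_loss(h_s: np.ndarray, h_t: np.ndarray, w_h: np.ndarray) -> float:
    """Formula (7): MSE(H_S @ W_h, H_T), with W_h projecting d' to d."""
    return mse(h_s @ w_h, h_t)
```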
  • The loss function of the Embedding-layer (embedding layer) in the related art is the following formula (8):

$$L_{embd}=\mathrm{MSE}(E^S W_e,E^T) \tag{8}$$

  • E^S and E^T in formula (8) refer to the embedding matrices of the student and teacher models respectively, and W_e is a linear transformation matrix similar to W_h.
  • The output layer of the Prediction-layer distillation adopts the soft cross-entropy loss, as in the following formula (9):

$$L_{pred}=-\mathrm{softmax}(z^T)\cdot\log\_\mathrm{softmax}(z^S/t) \tag{9}$$

  • z^S and z^T in formula (9) are the student recognition result (logits) of the preset student model and the teacher recognition result (logits) of the preset teacher model respectively, log_softmax() represents the log-likelihood, and t refers to the distillation temperature.
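  • A sketch of formula (9): the teacher's soft targets weight the student's log-softmax taken at distillation temperature t.

```python
import numpy as np

def log_softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()  # shift for numerical stability
    return z - np.log(np.exp(z).sum())

def prediction_loss(z_s: np.ndarray, z_t: np.ndarray, t: float = 1.0) -> float:
    """Formula (9): L_pred = -softmax(z_T) . log_softmax(z_S / t)."""
    teacher_probs = np.exp(log_softmax(z_t))  # teacher soft targets
    return float(-(teacher_probs * log_softmax(z_s / t)).sum())
```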
  • The MSE in formula (10), whose full name is Mean Squared Error (mean square error), is generally used to measure the deviation between a model's predicted value and the true value. Assuming the result distribution of the true value is y, the result distribution of the predicted value is ŷ, and the sample space size is n, the difference between the two distributions can be expressed as the following formula (11):

$$\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 \tag{11}$$

  • In the related art, MSE is used to calculate the loss value of the student model, but as can be seen from formula (11), the MSE in the related art pays attention to all positions in the result distribution indiscriminately, and the obtained loss value does not reflect well the difference between the student model and the teacher model, so the preset student model cannot be adjusted accurately.
  • In this solution, the above-mentioned DKL-divergence is used to calculate the loss value of the student model.
  • Taking the loss value corresponding to the Transformer (transformation) layer as an example, for the knowledge distillation of each pair of corresponding layers, the attention matrices are A^T and A^S, where T and S refer to the teacher and the student respectively. For the teacher identification results and student identification results corresponding to the training sample data, after the balance parameter C is determined, the attention matrices A^T and A^S can be sampled to obtain the sub-distributions p^T and q^S corresponding to the teacher identification group and the student identification group respectively; the distributions obtained after averaging, p̃^T and q̃^S, are given by the following formulas (12) and (13):

$$\tilde{p}^T(x)=\frac{1}{z}\sum_{e\in G_T(x)}p^T(e) \tag{12}$$

$$\tilde{q}^S(x)=\frac{1}{z}\sum_{e\in G_S(x)}q^S(e) \tag{13}$$

  • z in formulas (12) and (13) represents the number of teacher recognition results in the teacher recognition group and the number of student recognition results in the student recognition group;
  • e represents a result within the teacher recognition group or the student recognition group.
  • Ω in formula (14) represents the probability space where the attention matrix A ∈ R^{1×l} is located, which actually represents the result distribution of the training sample data, and l represents the length of the training sample data.
  • t in formula (15) refers to a sub-training sample of length t within the training sample data of length l, and i refers to a certain attention head among the h attention heads.
  • Compared with formula (1), the DKL-divergence in this scheme uses P'(x) (a weight parameter generated from the teacher recognition probabilities) to weight the logarithm of the ratio of P'(x) to Q'(x) (the logarithm of the average value of each teacher identification group to the average value of the corresponding student identification group).
  • A balance constant can be used in this scheme to eliminate abnormal results (i.e., abnormal probabilities) in the teacher identification results and student identification results corresponding to the training sample data. For example, if the value of one teacher identification result is 0.2 while the value of the corresponding student identification result is 0.0001, the calculated logarithm between P'(x) and Q'(x) becomes abnormally large, resulting in a large gradient gap; in the opposite case, it may cause the problem of vanishing gradients. Introducing the balance parameter to balance such possible abnormal probability values can therefore greatly reduce the occurrence of extreme values.
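  • A numeric illustration of this failure mode with invented values: a single mismatched pair produces an extreme logarithm, while averaging the outlier with its neighbors first damps it.

```python
import math

# Without balancing: one mismatched pair dominates the loss term.
print(math.log(0.2 / 0.0001))   # log(2000), about 7.6: an extreme value

# With balancing (C = 3), 0.0001 is first averaged with its neighbors,
# e.g. mean([0.4, 0.0001, 0.1]) is about 0.1667, so the ratio collapses:
print(math.log(0.2 / 0.1667))   # about 0.18: the extreme value is damped
```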
  • In this way, the accuracy of the student model can be improved and its data processing capability enhanced, so that the recognition result obtained by the student model identifying the data to be processed matches the recognition result obtained by the teacher model identifying the data to be processed, that is, the student model acquires the data processing capability of the teacher model.
  • In this embodiment of the present application, the training sample data for training the preset student model is obtained, and the preset student model and the preset teacher model are used to identify the training sample data respectively, obtaining the teacher identification result and the student identification result of the training sample data.
  • The weight parameter used to adjust the recognition result of the preset student model is obtained from the teacher recognition result, the logarithm between the teacher recognition result and the student recognition result is calculated, and the weight parameter is used to perform a weighting operation on the logarithm.
  • The calculated value is then used as the loss value to adjust the preset student model.
  • According to the weight parameters, the adjustment weights used to adjust the student model can be reasonably allocated across the different results in the identification results of the training sample data, so that more attention is paid to the recognition results with higher probability values; this makes the obtained loss value more accurate and the adjusted student model more accurate.
  • a balance parameter is introduced to eliminate the abnormal results in the teacher identification results and the student identification results of the training sample data, so as to avoid inaccurate loss values due to abnormal results in the teacher identification results and the student identification results.
  • the student model can have the data processing capability of the teacher model, and the accuracy of the student model can be improved.
  • FIG. 5 is a schematic flowchart of another model distillation method provided by an embodiment of the present application.
  • the method may be performed by computer equipment, as shown in FIG. 5
  • the other model distillation method may include steps S201-S207.
  • S201 Acquire training sample data for training a preset student model.
  • S203 Obtain, from the teacher's recognition result, a weight parameter for adjusting the recognition result of the preset student model.
  • S204 Calculate the logarithm between the teacher's identification result and the student's identification result, and perform a weighting operation on the logarithm by using a weight parameter.
  • the convergence condition means that the loss value is less than the loss threshold preset by the user, or the loss value is the minimum value of the corresponding loss function.
  • Specifically, the minimum value of the loss function used to calculate the loss value can be obtained, and if the loss value is not equal to that minimum value, it is determined that the loss value does not meet the convergence condition; alternatively, it is verified whether the loss value is less than the preset loss threshold, and if the loss value is greater than or equal to the preset loss threshold, it is determined that the loss value does not satisfy the convergence condition.
  • the preset loss threshold can be set according to the data processing type of the student model or according to other indicators.
  • If the loss value does not meet the convergence condition, it means that the teacher recognition result obtained by the teacher model recognizing the training sample data differs considerably from the student recognition result obtained by the student model predicting the training sample data, that is, the recognition result obtained by the student model processing the data to be processed does not match the recognition result obtained by the teacher model identifying the data to be processed.
  • In that case, the loss degree to which the loss value belongs is determined, and the parameters in the preset student model are adjusted according to the loss degree: the larger the loss degree, the larger the adjustment of the parameters in the preset student model; the smaller the loss degree, the smaller the adjustment.
  • Adjusting the preset student model based on the loss value in this way realizes a greater degree of adjustment when the error of the student model is greater, thereby improving the convergence speed of the student model and the training efficiency.
  • It also makes the adjustment operation of the student model more accurate, thereby improving the training accuracy of the student model; see the sketch below.
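  • A sketch of such degree-based adjustment: the loss value is bucketed into a loss degree, and the size of the parameter update scales with that degree. The thresholds and scale factors are invented for illustration.

```python
from typing import List

def loss_degree(loss: float) -> float:
    """Map a loss value to an adjustment scale: larger loss, larger step."""
    if loss > 1.0:
        return 1.0   # large degree: full-size adjustment
    if loss > 0.1:
        return 0.5   # medium degree
    return 0.1       # small degree: fine-grained adjustment

def adjust_parameters(params: List[float], grads: List[float],
                      loss: float, base_lr: float = 0.01) -> List[float]:
    """Scale the gradient step by the loss degree (illustrative update rule)."""
    scale = loss_degree(loss)
    return [p - base_lr * scale * g for p, g in zip(params, grads)]
```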
  • In this embodiment of the present application, the training sample data for training the preset student model is obtained, and the preset student model and the preset teacher model are used to identify the training sample data respectively, obtaining the teacher identification result and the student identification result of the training sample data.
  • The weight parameter used to adjust the recognition result of the preset student model is obtained from the teacher recognition result, the logarithm between the teacher recognition result and the student recognition result is calculated, and the weight parameter is used to perform a weighting operation on the logarithm.
  • The calculated value is then used as the loss value to adjust the preset student model.
  • According to the weight parameters, the adjustment weights used to adjust the student model can be reasonably allocated across the different results in the identification results of the training sample data, so that more attention is paid to the recognition results with higher probability values; this makes the obtained loss value more accurate and the adjusted student model more accurate.
  • a balance parameter is introduced to eliminate the abnormal results in the teacher identification results and the student identification results of the training sample data, so as to avoid inaccurate loss values due to abnormal results in the teacher identification results and the student identification results.
  • In addition, the student model is adjusted according to the loss degree to which the loss value belongs, which realizes a greater degree of adjustment when the error of the student model is greater, thereby improving the accuracy of the student model training.
  • the student model can have the data processing capability of the teacher model, and the accuracy of the student model can be improved.
  • FIG. 6 is a schematic structural diagram of a model distillation apparatus provided by an embodiment of the present application.
  • the above-mentioned model distillation apparatus may be a computer program (including program code) running in a computer device, for example, the model distillation apparatus is an application software; the apparatus may be used to execute corresponding steps in the methods provided in the embodiments of the present application.
  • the model distillation apparatus may include: a first acquisition module 11 , an identification module 12 , a second acquisition module 13 , and an adjustment module 14 .
  • the first acquisition module 11 is used for acquiring training sample data for training a preset student model
  • the identification module 12 is configured to use the preset student model and the preset teacher model to identify the training sample data respectively, to obtain the teacher identification result and the student identification result of the training sample data, wherein the preset student model is obtained through training guided by the preset teacher model;
  • the second obtaining module 13 is configured to obtain, from the teacher identification result, a weight parameter for adjusting the identification result of the preset student model;
  • the adjustment module 14 is used to calculate the logarithm between the teacher identification result and the student identification result, perform a weighted operation on the logarithm using the weight parameter, and use the calculated value as a loss value to adjust the preset student model.
  • In some embodiments, there are multiple teacher identification results, and each teacher identification result represents an identification probability;
  • the above-mentioned second acquisition module 13 includes:
  • an acquisition unit for acquiring a balance parameter for balancing the teacher identification result
  • the first grouping unit is configured to group the obtained multiple teacher identification results according to the balance parameter in the preset identification sequence of the teacher model, to obtain multiple teacher identification groups arranged in sequence, wherein each teacher identification group in the multiple teacher identification groups contains the same number of teacher identification results;
  • the first calculation unit is configured to calculate the average value of multiple teacher identification results in each teacher identification group respectively, and use the obtained multiple average values as weight parameters after balancing processing.
  • the above-mentioned adjustment module 14 includes:
  • the second grouping unit is configured to group the obtained multiple student identification results according to the balance parameter in the identification sequence, to obtain multiple student identification groups arranged in sequence, wherein each student identification group in the multiple student identification groups contains the same number of student identification results, and each teacher identification group corresponds one-to-one to a student identification group according to the identification sequence;
  • a second calculation unit, used for calculating the average value of the multiple student identification results in each student identification group;
  • the third calculation unit is used to calculate the logarithm of the mean value of each student identification group and the mean value of the corresponding teacher identification group respectively, and obtain a plurality of logarithms after balanced processing;
  • the first weighting operation unit is configured to perform weighting operation on the weight parameter after the balance processing and the logarithm after the balance processing.
  • the above acquisition unit is specifically used for:
  • determining, from a balance parameter library, a target balance parameter corresponding to the preset threshold range, and using the target balance parameter as the balance parameter for balancing the teacher identification result, wherein the balance parameter library includes at least one balance parameter and the corresponding relationship between each balance parameter in the at least one balance parameter and a preset threshold range.
  • the above-mentioned adjustment module 14 includes:
  • a verification unit, configured to verify whether the loss value satisfies the convergence condition;
  • a first determining unit configured to determine the degree of loss to which the loss value belongs if the loss value does not satisfy the convergence condition
  • a first adjustment unit configured to adjust parameters in the preset student model according to the loss degree.
  • the above verification unit is specifically used for:
  • verifying whether the loss value is less than a preset loss threshold, and if the loss value is greater than or equal to the preset loss threshold, determining that the loss value does not satisfy the convergence condition.
  • the preset teacher model includes multiple teacher distillation layers, and the preset student model includes multiple student distillation layers;
  • the above-mentioned adjustment module 14 includes:
  • a second determining unit configured to determine the teacher distillation layer corresponding to each student distillation layer in the plurality of student distillation layers
  • the fourth calculation unit is used to calculate the logarithm between the student identification result of each student distillation layer and the teacher identification result of the corresponding teacher distillation layer;
  • the second weighting operation unit is configured to use the weight parameter to perform a weighted operation on the logarithm between the student identification result of each student distillation layer and the teacher identification result of the corresponding teacher distillation layer, to obtain the loss value of each student distillation layer;
  • the second adjustment unit is configured to adjust the corresponding student distillation layer in the preset student model by using the loss value of each student distillation layer respectively.
  • In this embodiment of the present application, the training sample data used for training the preset student model is obtained, and the preset student model and the preset teacher model are used to identify the training sample data respectively, obtaining the teacher identification result and the student identification result of the training sample data.
  • The weight parameter used to adjust the recognition result of the preset student model is obtained from the teacher recognition result, the logarithm between the teacher recognition result and the student recognition result is calculated, and the weight parameter is used to perform a weighting operation on the logarithm.
  • The calculated value is then used as the loss value to adjust the preset student model.
  • According to the weight parameters, the adjustment weights used to adjust the student model can be reasonably allocated across the different results in the identification results of the training sample data, so that more attention is paid to the recognition results with higher probability values; this makes the obtained loss value more accurate and the adjusted student model more accurate.
  • a balance parameter is introduced to eliminate the abnormal results in the teacher identification results and the student identification results of the training sample data, so as to avoid inaccurate loss values due to abnormal results in the teacher identification results and the student identification results.
  • Adjusting the student model according to the loss degree to which the loss value belongs realizes a greater degree of adjustment when the error of the student model is greater, thereby improving the accuracy of the student model training.
  • the student model can have the data processing capability of the teacher model, and the accuracy of the student model can be improved.
  • step S101 shown in FIG. 1 may be performed by the first acquisition module 11 shown in FIG. 6 ;
  • step S102 shown in FIG. 1 may be performed by the identification module 12 shown in FIG. 6 ;
  • step S103 shown in FIG. 1 can be performed by the second acquisition module 13 in FIG. 6;
  • step S104 shown in FIG. 1 can be performed by the adjustment module 14 in FIG. 6 .
  • FIG. 7 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the computer device may include a processor and memory.
  • the computer device may further include a network interface and/or a user interface.
  • the above-mentioned computer device 1000 may include a processor 1001, a network interface 1004 and a memory 1005; in addition, the above-mentioned computer device 1000 may further include a user interface 1003 and at least one communication bus 1002.
  • the communication bus 1002 is used to realize the connection and communication between these components.
  • the user interface 1003 may include a display screen (Display) and a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 may include a standard wired interface and a wireless interface (eg, a WI-FI interface).
  • the memory 1005 can be a high-speed RAM memory, or a non-volatile memory, such as at least one disk memory.
  • the memory 1005 may also be at least one storage device located away from the aforementioned processor 1001 .
  • the memory 1005 as a computer-readable storage medium may include an operating system, a network communication module, a user interface module, and a device control application program.
  • the network interface 1004 can provide a network communication function;
  • the user interface 1003 is mainly used to provide an input interface for the user; and
  • the processor 1001 can be used to call the device control application program stored in the memory 1005 to implement the following:
  • the training sample data is identified by the preset student model and the preset teacher model respectively, and the teacher identification result and the student identification result of the training sample data are obtained, wherein the preset student model is obtained through training guided by the preset teacher model;
  • the processor 1001 can be used to call the device control application program stored in the memory 1005 to realize:
  • the obtained multiple teacher identification results are grouped according to the balance parameter to obtain multiple teacher identification groups arranged in sequence, wherein each teacher identification group in the multiple teacher identification groups contains the same number of teacher identification results;
  • the processor 1001 can be used to call the device control application program stored in the memory 1005 to realize:
  • each student identification group in the multiple student identification groups contains the same number of student identification results, and each teacher identification group corresponds one-to-one to a student identification group according to the identification sequence;
  • a weighting operation is performed on the weight parameter after the balance processing and the logarithm after the balance processing.
  • the processor 1001 can be used to call the device control application program stored in the memory 1005 to realize:
  • determining, from a balance parameter library, a target balance parameter corresponding to the preset threshold range, and using the target balance parameter as the balance parameter for balancing the teacher identification result, wherein the balance parameter library includes at least one balance parameter and the corresponding relationship between each balance parameter in the at least one balance parameter and a preset threshold range.
  • the processor 1001 can be used to call the device control application program stored in the memory 1005 to realize:
  • the parameters in the preset student model are adjusted according to the loss degree.
  • the processor 1001 can be used to call the device control application program stored in the memory 1005 to realize:
  • verifying whether the loss value is less than a preset loss threshold, and if the loss value is greater than or equal to the preset loss threshold, determining that the loss value does not satisfy the convergence condition.
  • the processor 1001 can be used to call the device control application program stored in the memory 1005 to realize:
  • the logarithm between the teacher identification result and the student identification result is calculated, the logarithm is weighted by using the weight parameter, and the calculated value is used as a loss value to adjust the preset student model, which includes:
  • the logarithm between the student identification result of each student distillation layer and the teacher identification result of the corresponding teacher distillation layer is weighted to obtain the loss value of each student distillation layer;
  • the corresponding student distillation layer in the preset student model is adjusted by using the loss value of each student distillation layer.
  • In this embodiment of the present application, the training sample data for training the preset student model is obtained, and the preset student model and the preset teacher model are used to identify the training sample data respectively, obtaining the teacher identification result and the student identification result of the training sample data.
  • The weight parameter used to adjust the recognition result of the preset student model is obtained from the teacher recognition result, the logarithm between the teacher recognition result and the student recognition result is calculated, and the weight parameter is used to perform a weighting operation on the logarithm.
  • The calculated value is then used as the loss value to adjust the preset student model.
  • According to the weight parameters, the adjustment weights used to adjust the student model can be reasonably allocated across the different results in the identification results of the training sample data, so that more attention is paid to the recognition results with higher probability values; this makes the obtained loss value more accurate and the adjusted student model more accurate.
  • a balance parameter is introduced to eliminate the abnormal results in the teacher identification results and the student identification results of the training sample data, so as to avoid inaccurate loss values due to abnormal results in the teacher identification results and the student identification results.
  • adjusting the student model according to the magnitude of the loss value enables a greater adjustment when the error of the student model is greater, thereby improving the accuracy of student model training.
  • the student model can have the data processing capability of the teacher model, and the accuracy of the student model can be improved.
  • the computer device 1000 described in this embodiment of the present application can execute the description of the model distillation method in the foregoing embodiments corresponding to FIG. 1 and FIG. 5, and can also execute the description of the model distillation apparatus in the foregoing embodiment, which will not be repeated here.
  • the description of the beneficial effects of using the same method will also not be repeated.
  • an embodiment of the present application also provides a computer-readable storage medium, and the computer-readable storage medium stores the computer program executed by the model distillation apparatus mentioned above.
  • the computer program includes program instructions.
  • when the processor executes the program instructions, the description of the model distillation method in the embodiment corresponding to FIG. 1 or FIG. 5 can be executed.
  • the description of the beneficial effects of using the same method will not be repeated.
  • for technical details not disclosed in the storage medium embodiments of the present application, please refer to the description of the method embodiments of the present application.
  • the storage medium involved in this application, such as a computer-readable storage medium, may be non-volatile or volatile.
  • the program instructions may be deployed and executed on one computer device, or on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network.
  • multiple computer devices distributed in multiple locations and interconnected by a communication network can form a blockchain network.
  • the above-mentioned storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM) or a random access memory (Random Access Memory, RAM) and the like.

Abstract

Disclosed are a model distillation method and apparatus, and a storage medium and a device. The method comprises: acquiring training sample data which is used for training a preset student model; respectively recognizing the training sample data by using the preset student model and a preset teacher model, so as to obtain a teacher recognition result and a student recognition result of the training sample data; acquiring, from the teacher recognition result, a weight parameter which is used for adjusting a recognition result of the preset student model; and calculating a logarithm between the teacher recognition result and the student recognition result, performing a weighted operation on the logarithm by using the weight parameter, and adjusting the preset student model by taking a calculated numerical value as a loss value. By means of the present application, a student model can have the data processing capability of a teacher model, thereby improving the accuracy of the student model.

Description

Model distillation method, device, storage medium and equipment

This application claims priority to the Chinese patent application No. 202011313330.8, filed with the China Patent Office on November 20, 2020 and entitled "Model Distillation Method, Apparatus, Storage Medium and Equipment", the entire contents of which are incorporated herein by reference.
Technical Field

The present application relates to the technical field of artificial intelligence, and in particular to a model distillation method, device, storage medium and equipment.

Background

As an important technical solution for model compression and acceleration, model distillation has attracted much attention in recent years and has played an important role in advancing the field of natural language processing. Model distillation (knowledge distillation) refers to using a teacher model with high accuracy but a complex structure to guide the training of a student model with lower accuracy but a simple structure, so as to improve the accuracy of the student model.

The inventors realized that although the student model can learn knowledge from the teacher model and thereby improve its accuracy, certain differences still exist between the teacher model and the student model in existing distillation architectures, resulting in poor expressiveness and low accuracy of the student model.
Summary of the Invention

The technical problem to be solved by the embodiments of the present application is to provide a model distillation method, apparatus, storage medium and device that can improve the accuracy and data processing capability of a student model.

In one aspect, an embodiment of the present application provides a model distillation method, including:

acquiring training sample data used for training a preset student model;

identifying the training sample data by using the preset student model and a preset teacher model respectively, to obtain a teacher identification result and a student identification result of the training sample data, wherein the preset student model is obtained through training guided by the preset teacher model;

obtaining, from the teacher identification result, a weight parameter used for adjusting the identification result of the preset student model; and

calculating the logarithm between the teacher identification result and the student identification result, performing a weighting operation on the logarithm by using the weight parameter, and adjusting the preset student model by taking the calculated value as a loss value.

In one aspect, an embodiment of the present application provides a model distillation apparatus, including:

a first acquisition module, configured to acquire training sample data used for training a preset student model;

an identification module, configured to identify the training sample data by using the preset student model and a preset teacher model respectively, to obtain a teacher identification result and a student identification result of the training sample data, wherein the preset student model is obtained through training guided by the preset teacher model;

a second acquisition module, configured to obtain, from the teacher identification result, a weight parameter used for adjusting the identification result of the preset student model; and

an adjustment module, configured to calculate the logarithm between the teacher identification result and the student identification result, perform a weighting operation on the logarithm by using the weight parameter, and adjust the preset student model by taking the calculated value as a loss value.
In one aspect, the present application provides a computer device, including a processor and a memory;

wherein the memory is used to store a computer program, and the processor is used to call the computer program to perform the following steps:

acquiring training sample data used for training a preset student model;

identifying the training sample data by using the preset student model and a preset teacher model respectively, to obtain a teacher identification result and a student identification result of the training sample data, wherein the preset student model is obtained through training guided by the preset teacher model;

obtaining, from the teacher identification result, a weight parameter used for adjusting the identification result of the preset student model; and

calculating the logarithm between the teacher identification result and the student identification result, performing a weighting operation on the logarithm by using the weight parameter, and adjusting the preset student model by taking the calculated value as a loss value.

In one aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program includes program instructions that, when executed by a processor, perform the following steps:

acquiring training sample data used for training a preset student model;

identifying the training sample data by using the preset student model and a preset teacher model respectively, to obtain a teacher identification result and a student identification result of the training sample data, wherein the preset student model is obtained through training guided by the preset teacher model;

obtaining, from the teacher identification result, a weight parameter used for adjusting the identification result of the preset student model; and

calculating the logarithm between the teacher identification result and the student identification result, performing a weighting operation on the logarithm by using the weight parameter, and adjusting the preset student model by taking the calculated value as a loss value.
According to the embodiments of the present application, adjustment weights for adjusting the student model can be reasonably allocated, according to the weight parameters, among the different results in the identification results and prediction results of the training sample data, which makes the adjusted student model more accurate. Through the present application, the student model can acquire the data processing capability of the teacher model, and the accuracy of the student model can be improved.
Brief Description of the Drawings

FIG. 1 is a schematic flowchart of a model distillation method provided by the present application;

FIG. 2 is a schematic diagram of a method for calculating the average value of multiple teacher identification results in each teacher identification group, provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of a method for obtaining the loss value of a preset student model, provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of model distillation provided by an embodiment of the present application;

FIG. 5 is a schematic flowchart of another model distillation method provided by the present application;

FIG. 6 is a schematic structural diagram of a model distillation apparatus provided by the present application;

FIG. 7 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.

The technical solutions of the present application relate to the field of artificial intelligence and/or big data technology. Optionally, the data involved in this application, such as sample data, identification results and/or loss values, may be stored in a database, or may be stored in a blockchain, for example through blockchain-based distributed storage, which is not limited in this application.
Please refer to FIG. 1, which is a schematic flowchart of a model distillation method provided by an embodiment of the present application. The method can be performed by a computer device, which may be a terminal or a server. The terminal may include, but is not limited to, a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms. As shown in FIG. 1, the model distillation method may include steps S101-S104.
S101: Acquire training sample data for training a preset student model.

Training sample data used for training the preset student model is acquired. The preset student model refers to the student model in model distillation; model distillation (knowledge distillation) means using a teacher model to guide the training of a student model, thereby improving the accuracy of the student model. The training sample data may be text data, image data, and the like.
S102: Identify the training sample data by using the preset student model and the preset teacher model respectively, to obtain a teacher identification result and a student identification result of the training sample data.

The preset teacher model is used to identify the training sample data to obtain the teacher identification result of the training sample data, and the preset student model is used to identify the training sample data to obtain the student identification result of the training sample data. The preset student model is obtained through training guided by the preset teacher model. Generally speaking, the teacher model is highly accurate but computationally complex and thus unsuitable for deployment on terminal devices, whereas the student model is computationally simple enough to meet the requirements of terminal devices but insufficiently accurate. Model distillation solves this problem: the preset teacher model guides the training of the preset student model, improving the accuracy of the preset student model. The teacher model and the student model perform the same type of data processing; the network in the teacher model is deeper or wider and has a larger resolution, that is, the data processing capability of the teacher model is higher than that of the student model. Specifically, when obtaining the teacher model and the student model, the selection can be made according to their similarity: the more similar the structures of the teacher model and the student model, the smaller the accuracy gap between the two after distillation. The teacher model and the student model can therefore be selected as the same type of model, for example both from the ResNet family; a ResNet (residual network) can easily be adjusted in width and depth to obtain networks with different expressive abilities. Then, according to the teacher identification result of the training sample data output by the preset teacher model and the student identification result of the training sample data output by the preset student model, the preset student model is adjusted, so that the knowledge in the preset teacher model is transferred to the preset student model, enabling the preset student model to have the data processing capability and accuracy of the preset teacher model.
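As a concrete illustration of pairing models from the same family, the following is a minimal sketch assuming a PyTorch/torchvision environment; the specific model choices and class count are illustrative assumptions, not part of the present application:

```python
# A minimal sketch of selecting a deeper teacher and a shallower student
# from the same ResNet family (illustrative; assumes torchvision is available).
import torch
from torchvision.models import resnet18, resnet50

teacher = resnet50(num_classes=10)   # deeper network: higher data processing capability
student = resnet18(num_classes=10)   # shallower network: suitable for terminal devices
teacher.eval()                       # the teacher only guides; it is not trained here

x = torch.randn(4, 3, 224, 224)      # a dummy batch standing in for training sample data
with torch.no_grad():
    teacher_result = teacher(x)      # teacher identification result
student_result = student(x)          # student identification result
```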
S103: Obtain, from the teacher identification result, a weight parameter for adjusting the identification result of the preset student model.

The teacher identification result can be used to generate a weight parameter for adjusting the identification result of the preset student model. The weight parameter determines the weight that the identification result of the preset student model carries when the loss value is generated: the larger the weight parameter, the greater that weight.
Optionally, when the weight parameter for adjusting the identification result of the preset student model is obtained from the teacher identification result, a balance parameter used for balancing the teacher identification results can be obtained. Following the identification order of the preset teacher model, the obtained multiple teacher identification results are grouped according to the balance parameter to obtain multiple teacher identification groups arranged in sequence, where each teacher identification group contains the same number of teacher identification results. The average value of the teacher identification results in each group is calculated, and the obtained averages are used as the balanced weight parameters.

The teacher identification result consists of multiple teacher identification results, each representing an identification probability, that is, the probability obtained when the preset teacher model identifies the training sample data. A balance parameter for balancing the teacher identification results can be obtained; the balance parameter may be a positive integer greater than or equal to 1 and is used to handle abnormal results among the teacher identification results, where an abnormal result is one that is far larger or far smaller than the normal results, i.e., anomalous compared with the other results. The multiple teacher identification results are sorted in the identification order of the preset teacher model, and the sorted results are grouped according to the balance parameter to obtain multiple teacher identification groups arranged in sequence, each containing the same number of teacher identification results. The average value of the teacher identification results in each group is then calculated, and the averages of the groups are used as the balanced weight parameters, which can be used to adjust the identification result of the preset student model. The larger the average value of the teacher identification results in a group, the larger the corresponding balanced weight parameter; since a teacher identification result represents an identification probability, a larger average probability in a group yields a larger balanced weight parameter. In this way, more attention is paid to the high-probability positions of the probability distribution, and prioritizing the correct matching of high-probability events in the distribution has practical value; the loss value of the preset student model calculated from such weight parameters is more accurate, which improves the accuracy of the preset student model more effectively. Grouping the teacher identification results according to the balance parameter and averaging within each group amounts to balancing each teacher identification result with the balance parameter, which smooths abnormal results among the teacher identification results, reduces the error introduced when generating the weight parameters, and improves the accuracy of the preset student model.
Optionally, when obtaining the balance parameter for balancing the teacher identification results, the number of teacher identification results can be obtained, and the preset threshold range to which this number belongs is determined. A target balance parameter corresponding to that preset threshold range is determined from a balance parameter library and used as the balance parameter for balancing the teacher identification results. The balance parameter library includes at least one balance parameter and the corresponding relationship between each balance parameter and a preset threshold range.

Since the teacher identification result consists of multiple identification probabilities obtained when the preset teacher model identifies the training sample data, the number of teacher identification results can be obtained and the preset threshold range to which this number belongs can be determined. After that, the target balance parameter corresponding to this range is determined from the balance parameter library, which includes at least one balance parameter and the correspondence between each balance parameter and a preset threshold range. For example, a balance parameter library may be preset in which a first threshold range corresponds to a first balance parameter, a second threshold range corresponds to a second balance parameter, and so on.
Optionally, since the student identification result also consists of multiple student identification results that correspond to the teacher identification results, the balance parameter may likewise be determined according to the number of student identification results: the preset threshold range to which the number of student identification results belongs is determined, and the target balance parameter corresponding to that range is determined from the balance parameter library and used as the balance parameter for balancing the identification results. It should be noted that the balance parameter C may be determined according to the number of teacher identification results, the number of student identification results, or other specific circumstances; its purpose is to balance, i.e., eliminate, the abnormal results among the teacher identification results or the student identification results.
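The following is a minimal sketch of such a lookup; the threshold ranges and parameter values in the library are hypothetical, chosen only to mirror the correspondence described above:

```python
# Hypothetical balance parameter library: each entry maps a preset threshold
# range (on the number of identification results) to a balance parameter C.
BALANCE_PARAMETER_LIBRARY = [
    (range(1, 10), 1),      # first threshold range  -> first balance parameter
    (range(10, 100), 3),    # second threshold range -> second balance parameter
    (range(100, 1000), 5),  # and so on
]

def lookup_balance_parameter(num_results):
    # Determine which preset threshold range the count belongs to,
    # and return the corresponding target balance parameter.
    for threshold_range, c in BALANCE_PARAMETER_LIBRARY:
        if num_results in threshold_range:
            return c
    return 1  # default: no smoothing
```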
FIG. 2 is a schematic diagram of a method for calculating the average value of multiple teacher identification results in each teacher identification group, provided by an embodiment of the present application. As shown in FIG. 2, take a distribution of 11 teacher identification results as an example. The 11 teacher identification results are sorted in the identification order of the preset teacher model, yielding the sorted results x1, x2, ..., x11. If the balance parameter is determined to be 3 according to the number of teacher identification results (11), the 11 sorted teacher identification results are grouped according to the balance parameter 3 to obtain teacher identification groups arranged in sequence: [x1, x2], [x1, x2, x3], [x2, x3, x4], [x3, x4, x5], ..., [x9, x10, x11], [x10, x11]. Since no teacher identification result is sorted before x1, the group corresponding to x1 can be [x1, x2] or [x1, x2, x3]; similarly, since no result is sorted after x11, the group corresponding to x11 can be [x10, x11] or [x9, x10, x11]. After the sequentially arranged groups are obtained, the average value of the teacher identification results in each group is calculated. For example, for the group [x1, x2, x3] corresponding to x2, the results x1, x2 and x3 are summed and divided by the group size 3 to obtain the average value for x2.
For example, the teacher identification result distribution of the training sample data is:

[0.1, 0.05, 0.0001, 0.02, 0.15, 0.28, 0.23, 0.06, 0.023, 0.05, 0.0369]

The balance parameter is determined to be 3 according to the number of teacher identification results, so the teacher identification groups obtained after sampling and grouping with C = 3 are:

[0.1, 0.05], [0.1, 0.05, 0.0001], [0.05, 0.0001, 0.02], [0.0001, 0.02, 0.15], [0.02, 0.15, 0.28], [0.15, 0.28, 0.23], [0.28, 0.23, 0.06], [0.23, 0.06, 0.023], [0.06, 0.023, 0.05], [0.023, 0.05, 0.0369], [0.05, 0.0369].

The average value of the teacher identification results in each group is calculated, giving:

[0.075, 0.05, 0.0234, 0.0567, 0.15, 0.22, 0.19, 0.1043, 0.0443, 0.0366, 0.0435]

The detailed calculation of each averaged teacher identification result is as follows:

result_0 = (0.1 + 0.05) / 2 = 0.0750; result_1 = (0.1 + 0.05 + 0.0001) / 3 = 0.0500; result_2 = (0.05 + 0.0001 + 0.02) / 3 = 0.0234; result_3 = (0.0001 + 0.02 + 0.15) / 3 = 0.0567; result_4 = (0.02 + 0.15 + 0.28) / 3 = 0.1500; result_5 = (0.15 + 0.28 + 0.23) / 3 = 0.2200; result_6 = (0.28 + 0.23 + 0.06) / 3 = 0.1900; result_7 = (0.23 + 0.06 + 0.023) / 3 = 0.1043; result_8 = (0.06 + 0.023 + 0.05) / 3 = 0.0443; result_9 = (0.023 + 0.05 + 0.0369) / 3 = 0.0366; result_10 = (0.05 + 0.0369) / 2 = 0.0435.

The third teacher identification result, 0.0001, becomes 0.0234 after the balancing process: the abnormal teacher identification result has been balanced, i.e., eliminated.
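The grouping-and-averaging step above can be read as a centered moving average of window size C that is truncated at the sequence boundaries. A minimal sketch under that reading (an interpretation of the worked example, not the patent's literal implementation):

```python
def balance(probs, C=3):
    # Centered moving average with window size C, truncated at the edges,
    # matching the grouping [x1,x2], [x1,x2,x3], ..., [x10,x11] above.
    half = C // 2
    balanced = []
    for i in range(len(probs)):
        window = probs[max(0, i - half): i + half + 1]
        balanced.append(sum(window) / len(window))
    return balanced

teacher_results = [0.1, 0.05, 0.0001, 0.02, 0.15, 0.28,
                   0.23, 0.06, 0.023, 0.05, 0.0369]
# Reproduces 0.0750, 0.0500, 0.0234, ..., 0.0435 (up to rounding).
print(balance(teacher_results, C=3))
```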
S104: Calculate the logarithm between the teacher identification result and the student identification result, perform a weighting operation on the logarithm by using the weight parameter, and adjust the preset student model by taking the calculated value as a loss value.

The logarithm between the teacher identification result and the student identification result is calculated, and the above weight parameter is used to perform a weighting operation on this logarithm; the resulting value is taken as the loss value of the preset student model. The preset student model is then adjusted according to this loss value to obtain an adjusted student model, which is taken as the target student model. The target student model is used to identify data to be processed, and the identification result it produces matches the identification result obtained when the preset teacher model identifies the same data, that is, the target student model has the data processing capability of the preset teacher model.
Optionally, when calculating the logarithm between the teacher identification result and the student identification result and performing the weighting operation on the logarithm with the weight parameter, the following formula (1) may be used:

$$D_{KL}(P\|Q)=\sum_{x\in X}P(x)\log\frac{P(x)}{Q(x)}\tag{1}$$

In formula (1), $D_{KL}(P\|Q)$ is the loss value of the preset student model; $P(x)$ is the teacher identification result of the training sample data, i.e., the result obtained when the teacher model identifies the training sample data; $Q(x)$ is the student identification result of the training sample data, i.e., the result obtained when the student model predicts the training sample data; $\log(P(x)/Q(x))$ is the logarithm between the teacher identification result and the student identification result; the weight parameter is $P(x)$; $x$ refers to any one of the teacher identification results and student identification results, and $X$ refers to the set of teacher identification results and student identification results. Formula (1) is known as the KL-divergence (relative entropy).
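For reference, a minimal sketch of formula (1) in Python follows; the small epsilon guard against log(0) is an added assumption, not part of the formula:

```python
import math

def kl_divergence(p_teacher, q_student, eps=1e-12):
    # Formula (1): D_KL(P||Q) = sum_x P(x) * log(P(x) / Q(x)),
    # i.e. each logarithm is weighted by the teacher probability P(x).
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(p_teacher, q_student))
```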
FIG. 3 is a schematic diagram of a method for obtaining the loss value of a preset student model, provided by an embodiment of the present application. As shown in FIG. 3, the method includes steps S21-S23.

S21: Following the above identification order, group the obtained multiple student identification results according to the balance parameter to obtain multiple student identification groups arranged in sequence, where each student identification group contains the same number of student identification results, and each teacher identification group corresponds one-to-one with a student identification group in identification order.

S22: Calculate the average value of the student identification results in each student identification group.

The student identification result consists of multiple student identification results, each representing an identification probability obtained when the preset student model identifies the training sample data. The identification order of the preset teacher model is the same as that of the preset student model, i.e., the training sample data are identified in the same arrangement order. The obtained student identification results are grouped according to the above balance parameter to obtain student identification groups arranged in sequence, where each group contains the same number of student identification results and corresponds one-to-one with a teacher identification group in identification order. The average value of the student identification results in each group is then calculated; that is, the balance parameter is used to balance each student identification result, yielding the balanced student identification results. In this way, abnormal results among the student identification results are balanced, greatly reducing the occurrence of extreme values, thereby improving the accuracy of the target student model and enabling the student model to acquire the data processing capability of the teacher model.

S23: Calculate the logarithm between the average value of each student identification group and the average value of the corresponding teacher identification group to obtain multiple balanced logarithms, and perform a weighting operation on the balanced logarithms by using the balanced weight parameters.

The average value of the student identification results in each student identification group is calculated, the logarithm between each student group average and the corresponding teacher group average is computed to obtain the balanced logarithms, and the balanced weight parameters are used to perform a weighting operation on the balanced logarithms.
Optionally, the following formulas (2), (3) and (4) can be used to compute the balanced weight parameters and to perform the weighting operation on the logarithm between the teacher identification result and the student identification result:

$$P'(x)=\frac{1}{z}\sum_{x\in e}P(x)\tag{2}$$

$$Q'(x)=\frac{1}{z}\sum_{x\in e}Q(x)\tag{3}$$

$$D_{DKL}(P\|Q)_C=\sum_{x\in X}P'(x)\log\frac{P'(x)}{Q'(x)}\tag{4}$$

In formulas (2) and (3), $z$ represents the number of teacher identification results in a teacher identification group or the number of student identification results in a student identification group, and $e$ represents the teacher identification group or the student identification group. In formula (4), $D_{DKL}(P\|Q)_C$ is the loss value; $P'(x)$ is the average value of a teacher identification group, i.e., the average obtained by balancing the teacher identification results; $Q'(x)$ is the average value of a student identification group, i.e., the average obtained by balancing the student identification results; $\log(P'(x)/Q'(x))$ is the logarithm between the average of a student identification group and the average of the corresponding teacher identification group, and the weight parameter is $P'(x)$. Here $x$ refers to any one of the teacher identification results and student identification results, and $X$ refers to the set of teacher identification results and student identification results. When $P(x)$ and $Q(x)$ are identical, $D_{DKL}(P\|Q)_C$ equals zero.
The present application introduces a balance parameter into the KL-divergence (relative entropy) function; the KL-divergence loss function with the balance parameter introduced can be called DKL-divergence. Using DKL-divergence, each of the multiple teacher identification results and each of the multiple student identification results can be balanced, eliminating the abnormal results among them.
Optionally, the preset teacher model includes multiple teacher distillation layers and the preset student model includes multiple student distillation layers. When calculating the logarithm between the teacher identification result and the student identification result, weighting the logarithm with the weight parameter, and adjusting the preset student model with the calculated value as the loss value, the teacher distillation layer corresponding to each student distillation layer can be determined, and the logarithm between the student identification result of each student distillation layer and the teacher identification result of the corresponding teacher distillation layer can be calculated. Using the weight parameter, a weighting operation is performed on this logarithm to obtain the loss value of each student distillation layer, and the loss value of each student distillation layer is used to adjust the corresponding student distillation layer in the preset student model.

The preset teacher model includes multiple teacher distillation layers, each with a corresponding output teacher identification result; the preset student model includes multiple student distillation layers, each with a corresponding output student identification result. Knowledge distillation can be performed between corresponding distillation layers of the preset teacher model and the preset student model: the teacher distillation layer corresponding to each student distillation layer is determined, and the weight parameter is used to weight the logarithm between the student identification result output by each student distillation layer and the teacher identification result output by the corresponding teacher distillation layer, yielding the loss value of each student distillation layer. Each student distillation layer in the preset student model is then adjusted with its own loss value, which adjusts each layer more accurately and improves the accuracy of the student model.
For example, FIG. 4 is a schematic diagram of model distillation provided by an embodiment of the present application. As shown in FIG. 4, the distillation model generally includes three parts: Transformer-layer distillation, Embedding-layer distillation and Prediction-layer distillation; each of the three distillation layers in the teacher model can be distilled into the corresponding distillation layer in the student model.

Knowledge distillation is performed between corresponding distillation layers of the teacher model and the student model. That is, knowledge distillation is performed on the Transformer layer in the student model according to the Transformer layer in the teacher model, the loss value of the Transformer layer is calculated, and the Transformer layer is adjusted according to this loss value. Knowledge distillation is performed on the Embedding layer in the student model according to the Embedding layer in the teacher model, the loss value of the Embedding layer is calculated, and the Embedding layer is adjusted with this loss value. Knowledge distillation is performed on the Prediction layer in the student model according to the Prediction layer in the teacher model, the loss value of the Prediction layer is calculated, and the Prediction layer is adjusted with this loss value. In this way, the distillation layers in the student model can be adjusted more accurately, improving the accuracy of the preset student model.
As shown in FIG. 4, the student model (new model) has M Transformer layers and the teacher model (original model) has N Transformer layers; n = g(m) indicates that the m-th layer of the student model obtains information from the n-th layer of the teacher model. The Embedding-layer distillation is set as layer 0, the Output-layer distillation as layer M+1, and the Transformer layers as layers 1 to M. The following formula (5) can be used to represent the distillation loss of knowledge transfer from the teacher to the student:

$$\mathcal{L}_{model}=\sum_{m=0}^{M+1}\lambda_m\,\mathcal{L}_{layer}\big(S_m,T_{g(m)}\big)\tag{5}$$

In formula (5), $\mathcal{L}_{layer}$ represents the loss function of a specified layer, where the specified layer may be the Transformer layer, the Embedding layer or the Prediction layer; $S_m$ denotes the m-th layer of the student model and $T_{g(m)}$ the matched teacher layer; m = 0 refers to the Embedding layer, m = M+1 refers to the Output layer, and m = 1, 2, ..., M indexes the Transformer layers of the teacher model that the student model plans to learn; $\lambda_m$ is a hyperparameter representing the loss weight of each layer; $\mathcal{L}_{model}$ represents the sum of the knowledge distillation losses of all layers. Corresponding loss functions can be set for Transformer-layer distillation, Embedding-layer distillation and Prediction-layer distillation, and each corresponding layer can be adjusted according to the loss value obtained from its loss function.
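A minimal sketch of formula (5), assuming the per-layer losses and the per-layer weight hyperparameters λ_m have already been computed (the names are illustrative):

```python
def model_distillation_loss(layer_losses, lambdas):
    # layer_losses[m] is L_layer for layer m (m = 0: embedding layer,
    # m = 1..M: Transformer layers, m = M+1: prediction/output layer);
    # lambdas[m] is the loss-weight hyperparameter of layer m.
    return sum(lam * loss for lam, loss in zip(lambdas, layer_losses))
```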
Transformer-layer distillation in model distillation includes self-attention-based distillation and hidden-state-based distillation. In the related art, the objective function of self-attention matrix distillation for the Transformer layer is the following formula (6):

$$\mathcal{L}_{attn}=\frac{1}{h}\sum_{i=1}^{h}\text{MSE}\big(A_i^S,A_i^T\big)\tag{6}$$

In formula (6), $h$ is the number of attention heads, $i$ denotes the i-th attention head, $A_i^S$ and $A_i^T$ represent the attention matrices of the student model and the teacher model respectively, and MSE refers to the mean squared error loss.

The objective function for fitting the output matrix of each Transformer layer is the following formula (7):

$$\mathcal{L}_{hidn}=\text{MSE}\big(H^S W_h,H^T\big)\tag{7}$$

In formula (7), $H^S\in R^{l\times d'}$ and $H^T\in R^{l\times d}$ refer to the hidden-state matrices of the student and the teacher respectively, where $R^{l\times d'}$ and $R^{l\times d}$ denote the respective matrix spaces, $l$ denotes the length of the training sample data (i.e., the length of the input sentence), and $d'$, $d$ denote the sizes of the hidden layers. $W_h\in R^{d'\times d}$ is a learnable linear transformation matrix that maps the student's hidden-state matrix into the same result space as the teacher's.
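A minimal sketch of formulas (6) and (7) in PyTorch follows (the tensor shapes are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def attention_loss(attn_student, attn_teacher):
    # Formula (6): mean squared error averaged over the h attention heads;
    # both tensors are assumed to have shape (h, l, l).
    return F.mse_loss(attn_student, attn_teacher)

def hidden_loss(h_student, h_teacher, w_h):
    # Formula (7): project the student's hidden states (l, d') into the
    # teacher's space (l, d) via the learnable matrix W_h, then apply MSE.
    return F.mse_loss(h_student @ w_h, h_teacher)
```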
In the related art, the loss function of the Embedding layer is the following formula (8):

$$\mathcal{L}_{embd}=\text{MSE}\big(E^S W_e,E^T\big)\tag{8}$$

In formula (8), $E^S$ and $E^T$ refer to the embedding matrices of the student and teacher models respectively, and $W_e$ is a linear transformation matrix similar to $W_h$.
In the related art, the output layer of Prediction-layer distillation adopts the soft cross-entropy loss of the following formula (9):

$$\mathcal{L}_{pred}=-\text{softmax}\big(z^T\big)\cdot\text{log\_softmax}\big(z^S/t\big)\tag{9}$$

In formula (9), $z^T$ and $z^S$ are the teacher identification result of the preset teacher model and the student identification result of the preset student model (i.e., the teacher and student logits) respectively, log_softmax() represents the log-likelihood, and $t$ refers to the distillation temperature.
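A minimal sketch of the soft cross-entropy of formula (9), assuming logits as inputs:

```python
import torch.nn.functional as F

def prediction_loss(z_student, z_teacher, t=1.0):
    # Formula (9): soft cross-entropy between the teacher's softmax
    # distribution and the student's log-softmax at distillation temperature t.
    soft_targets = F.softmax(z_teacher, dim=-1)
    log_probs = F.log_softmax(z_student / t, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```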
In summary, the overall objective function of model distillation in the related art can be expressed as the following formula (10):

$$\mathcal{L}_{layer}=\begin{cases}\mathcal{L}_{embd}, & m=0\\ \mathcal{L}_{hidn}+\mathcal{L}_{attn}, & M\ge m>0\\ \mathcal{L}_{pred}, & m=M+1\end{cases}\tag{10}$$

In formula (10), MSE (Mean Squared Error) is generally used to measure the deviation between a model's predicted values and the true values. Assuming the distribution of true values is observed, the distribution of predicted values is predicted, and the sample space size is $n$, the difference between the two distributions can be expressed as the following formula (11):

$$\text{MSE}=\frac{1}{n}\sum_{i=1}^{n}\big(\text{observed}_i-\text{predicted}_i\big)^2\tag{11}$$
In the related art, MSE is used to calculate the loss value of the student model. However, as can be seen from formula (11), MSE attends to all positions in the result distribution indiscriminately; the resulting loss value does not reflect the difference between the student model and the teacher model well, and therefore cannot adjust the preset student model accurately.
In the embodiments of the present application, by contrast, the above DKL-divergence is used to calculate the loss value of the student model. For example, when calculating the loss value corresponding to the Transformer layer, suppose the corresponding distillation layers of the teacher model and the student model are U and V respectively. For the knowledge distillation of each pair of corresponding layers, the attention matrices are $A^T$ and $A^S$, where T and S denote the teacher and the student respectively. For the teacher identification result and the student identification result corresponding to the training sample data, after the balance parameter C is determined, the attention matrices $A^T$ and $A^S$ can be sampled to obtain the sub-distributions $p^T$ and $q^S$ corresponding to the teacher identification groups and the student identification groups respectively. The distributions obtained after averaging, $\bar{p}^T(x)$ and $\bar{q}^S(x)$, are given by the following formulas (12) and (13):

$$\bar{p}^T(x)=\frac{1}{z}\sum_{x\in e}p^T(x)\tag{12}$$

$$\bar{q}^S(x)=\frac{1}{z}\sum_{x\in e}q^S(x)\tag{13}$$
其中,公式(12)和公式(13)中的z表示老师识别组中老师识别结果的数目和学生识别组中学生识别结果的数目,e表示老师识别组和学生识别组。Among them, z in formula (12) and formula (13) represents the number of teacher recognition results in the teacher recognition group and the number of student recognition results in the student recognition group, and e represents the teacher recognition group and the student recognition group.
The loss function of the student model, L_KL, is then expressed as the following formula (14):

L_KL = D_KL(P′ ‖ Q′) = Σ_{x∈χ} P′(x) · log( P′(x) / Q′(x) )   (14)
Here, χ in formula (14) denotes the probability space in which the attention matrix A ∈ R^{1×l} lies, which in practice is the result distribution of the training sample data, and l denotes the length of the training sample data. Then, for a Transformer (transformation layer) with h attention heads, the total loss over training sample data of length l is expressed as the following formula (15):

L_attn = Σ_{a=1}^{h} Σ_{t=1}^{l} D_KL( P′_{a,t} ‖ Q′_{a,t} )   (15)
Here, t in formula (15) indexes, within the training sample data of length l, a sub-training-sample of length t, and a indexes one of the h attention heads.
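By way of illustration, the computation in formulas (12) to (15) for one Transformer layer could be sketched as follows. This is a hedged reading, not the definitive implementation: it assumes that grouping by the balance parameter C amounts to taking consecutive runs of z results in identification order, and that the sub-distributions are rows of the attention matrices.

    import numpy as np

    def grouped_kl(p_teacher, q_student, z, eps=1e-12):
        # Formulas (12)-(14): average consecutive runs of z identification
        # results (one run per identification group e), then compute the
        # DKL-divergence between the averaged distributions P' and Q'.
        g = len(p_teacher) // z                      # number of complete groups
        p_prime = np.asarray(p_teacher, float)[: g * z].reshape(g, z).mean(axis=1)
        q_prime = np.asarray(q_student, float)[: g * z].reshape(g, z).mean(axis=1)
        return float(np.sum(p_prime * np.log((p_prime + eps) / (q_prime + eps))))

    def transformer_attn_loss(attn_teacher, attn_student, z):
        # Formula (15): sum the grouped DKL-divergence over the h attention
        # heads and the l rows of the attention matrices (shape (h, l, l)).
        h, l, _ = attn_teacher.shape
        return sum(grouped_kl(attn_teacher[a, t], attn_student[a, t], z)
                   for a in range(h) for t in range(l))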
The DKL-divergence in this solution uses P′(x) (a weight parameter generated from the teacher identification probabilities) to weight the logarithm of the ratio of P′(x) to Q′(x) (the logarithm relating the average of each student identification group to the average of the corresponding teacher identification group): the larger the value of P′(x), the larger the resulting DKL-divergence. In other words, the DKL-divergence pays more attention to the high-probability positions of the probability distribution, and there is practical value in correctly matching the genuinely high-probability events of the distribution first, whereas the MSE of the related art attends to all positions of the distribution indiscriminately. The DKL-divergence is therefore better suited than MSE for calculating the loss value of the student model. At the same time, in this solution a balance parameter can be used to eliminate abnormal results (that is, abnormal probabilities) in the teacher identification results and the student identification results corresponding to the training sample data. For example, if one teacher identification result has the value 0.2 while the corresponding student identification result is 0.0001, the calculated logarithm between P′(x) and Q′(x) becomes abnormally large, producing an extreme gradient gap; conversely, such mismatches may also cause the gradient to vanish. Introducing the balance parameter to balance possible abnormal probability values therefore greatly reduces the occurrence of extreme values. In this way, the accuracy and the data processing capability of the student model can be improved, so that the identification result obtained by the student model on data to be processed matches the identification result obtained by the teacher model on the same data; that is, the student model acquires the data processing capability of the teacher model.
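The damping effect of the balance parameter described above can be checked numerically. The pair 0.2 / 0.0001 comes from the example in the text; the neighbouring values and the group size z = 2 are assumptions for illustration.

    import numpy as np

    p = np.array([0.2, 0.15])      # two adjacent teacher identification results
    q = np.array([0.0001, 0.18])   # 0.0001 is an abnormal student identification result

    raw_log = np.log(p[0] / q[0])             # about 7.6: an abnormally large log-ratio
    p_prime, q_prime = p.mean(), q.mean()     # group averages with z = 2
    balanced_log = np.log(p_prime / q_prime)  # about 0.66: the extreme value is damped
    print(raw_log, balanced_log)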
In the embodiments of the present application, training sample data for training a preset student model is acquired, and the preset student model and a preset teacher model are used to identify the training sample data respectively, yielding a teacher identification result and a student identification result for the training sample data. A weight parameter for adjusting the identification result of the preset student model is obtained from the teacher identification result; the logarithm between the teacher identification result and the student identification result is calculated and weighted using the weight parameter; and the resulting value is used as a loss value to adjust the preset student model. In this way, the adjustment weights that the different identification results and prediction results of the training sample data contribute to adjusting the student model can be allocated reasonably according to the weight parameter, so that identification results with higher probability values receive more attention, which makes the obtained loss value more accurate and the adjusted student model more accurate. At the same time, a balance parameter is introduced to eliminate abnormal results in the teacher identification result and the student identification result of the training sample data, thereby avoiding an inaccurate loss value caused by such abnormal results. Through the present application, the student model can be endowed with the data processing capability of the teacher model, and the accuracy of the student model can be improved.
Please refer to FIG. 5, which is a schematic flowchart of another model distillation method provided by an embodiment of the present application. The method may be performed by a computer device. As shown in FIG. 5, this model distillation method may include steps S201 to S207.

S201: Acquire training sample data for training a preset student model.

S202: Use the preset student model and a preset teacher model to identify the training sample data respectively, to obtain a teacher identification result and a student identification result of the training sample data.

S203: Obtain, from the teacher identification result, a weight parameter for adjusting the identification result of the preset student model.

S204: Calculate the logarithm between the teacher identification result and the student identification result, and perform a weighting operation on the logarithm using the weight parameter.

For the specific content of steps S201 to S204 in this embodiment of the present application, reference may be made to the embodiment described in FIG. 1, which is not repeated here.
S205: Verify whether the loss value satisfies a convergence state condition.

After the loss value of the student model is obtained, it is determined whether the loss value satisfies the convergence state condition, where the convergence condition means that the loss value is smaller than a loss threshold preset by the user, or that the loss value equals the minimum value of the corresponding loss function.

Optionally, when verifying whether the loss value satisfies the convergence state condition, the minimum attainable value of the loss function used to calculate the loss value may be obtained; if the loss value differs from this minimum value, it is determined that the loss value does not satisfy the convergence condition. Alternatively, it is verified whether the loss value is smaller than a preset loss threshold; if the loss value is greater than or equal to the preset loss threshold, it is determined that the loss value does not satisfy the convergence condition. The preset loss threshold may be set according to the data processing type of the student model or according to other indicators.
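The two verification options of step S205 could be sketched as follows; the use of a small tolerance for the minimum-value comparison is an assumption, since exact floating-point equality is fragile.

    def satisfies_convergence(loss, loss_fn_minimum=None, preset_threshold=None, tol=1e-8):
        # Option 1: compare the loss value against the minimum value of the loss function.
        if loss_fn_minimum is not None:
            return abs(loss - loss_fn_minimum) <= tol
        # Option 2: compare the loss value against the user-preset loss threshold.
        return loss < preset_threshold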
S206: If the loss value does not satisfy the convergence condition, determine the loss degree to which the loss value belongs.

S207: Adjust the parameters in the preset student model according to the loss degree.

If the loss value does not satisfy the convergence condition, this indicates that the teacher identification result obtained by the teacher model on the training sample data differs substantially from the student identification result obtained by the student model on the same data; that is, the identification result the student model would produce on data to be processed does not match the identification result the teacher model would produce. In that case, the loss degree to which the loss value belongs is determined, and the parameters in the preset student model are adjusted according to that loss degree: the greater the loss degree, the larger the adjustment to the parameters in the preset student model; the smaller the loss degree, the smaller the adjustment. In this way, adjusting the preset student model based on the loss value makes it possible to adjust more aggressively when the student model's error is larger, which speeds up the convergence of the student model and improves training efficiency, while also making the adjustment of the student model more precise and thus improving the accuracy of student model training.
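The embodiments do not fix a concrete mechanism for scaling the adjustment with the loss degree; one hypothetical realization is to scale the update step size, as sketched below with invented thresholds and scale factors.

    def adjusted_step_size(base_step, loss):
        # Map the loss value to a loss degree and scale the parameter update
        # accordingly: the greater the loss degree, the larger the adjustment.
        if loss >= 1.0:         # severe loss degree
            return base_step * 4.0
        if loss >= 0.5:         # moderate loss degree
            return base_step * 2.0
        if loss >= 0.1:         # mild loss degree
            return base_step * 1.0
        return base_step * 0.5  # close to convergence: smallest adjustment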
For the specific content of this embodiment of the present application, reference may be made to the embodiment described in FIG. 1, which is not repeated here.
In the embodiments of the present application, training sample data for training a preset student model is acquired, and the preset student model and a preset teacher model are used to identify the training sample data respectively, yielding a teacher identification result and a student identification result for the training sample data. A weight parameter for adjusting the identification result of the preset student model is obtained from the teacher identification result; the logarithm between the teacher identification result and the student identification result is calculated and weighted using the weight parameter; and the resulting value is used as a loss value to adjust the preset student model. In this way, the adjustment weights that the different identification results and prediction results of the training sample data contribute to adjusting the student model can be allocated reasonably according to the weight parameter, so that identification results with higher probability values receive more attention, which makes the obtained loss value more accurate and the adjusted student model more accurate. At the same time, a balance parameter is introduced to eliminate abnormal results in the teacher identification result and the student identification result of the training sample data, thereby avoiding an inaccurate loss value caused by such abnormal results. The student model is further adjusted according to the loss degree of the loss value, so that a larger adjustment is made when the error of the student model is greater, thereby improving the accuracy of student model training. Through the present application, the student model can be endowed with the data processing capability of the teacher model, and the accuracy of the student model can be improved.
Please refer to FIG. 6, which is a schematic structural diagram of a model distillation apparatus provided by an embodiment of the present application. The model distillation apparatus may be a computer program (including program code) running in a computer device; for example, the model distillation apparatus is application software. The apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. As shown in FIG. 6, the model distillation apparatus may include: a first acquisition module 11, an identification module 12, a second acquisition module 13, and an adjustment module 14.
The first acquisition module 11 is configured to acquire training sample data for training a preset student model.

The identification module 12 is configured to use the preset student model and a preset teacher model to identify the training sample data respectively, to obtain a teacher identification result and a student identification result of the training sample data, wherein the preset student model is obtained through training guided by the preset teacher model.

The second acquisition module 13 is configured to obtain, from the teacher identification result, a weight parameter for adjusting the identification result of the preset student model.

The adjustment module 14 is configured to calculate the logarithm between the teacher identification result and the student identification result, perform a weighting operation on the logarithm using the weight parameter, and adjust the preset student model using the calculated value as a loss value.
There are a plurality of teacher identification results, and the teacher identification results represent identification probabilities.

The second acquisition module 13 includes:

an acquisition unit, configured to acquire a balance parameter for balancing the teacher identification results;

a first grouping unit, configured to group, in the identification order of the preset teacher model, the obtained plurality of teacher identification results according to the balance parameter, to obtain a plurality of sequentially arranged teacher identification groups, wherein each of the teacher identification groups contains the same number of teacher identification results; and

a first calculation unit, configured to calculate the average value of the plurality of teacher identification results in each teacher identification group, and use the obtained plurality of average values as the weight parameters after balancing.
There are a plurality of student identification results.

The adjustment module 14 includes:

a second grouping unit, configured to group, in the identification order, the obtained plurality of student identification results according to the balance parameter, to obtain a plurality of sequentially arranged student identification groups, wherein each of the student identification groups contains the same number of student identification results, and each teacher identification group corresponds one-to-one with a student identification group according to the identification order;

a second calculation unit, configured to calculate the average value of the plurality of student identification results in each student identification group;

a third calculation unit, configured to calculate the logarithm of the average value of each student identification group against the average value of the corresponding teacher identification group, to obtain a plurality of logarithms after balancing; and

a first weighting operation unit, configured to perform a weighting operation on the balanced weight parameters and the balanced logarithms.
The above acquisition unit is specifically configured to:

acquire the number of teacher identification results among the plurality of teacher identification results;

determine the preset threshold range to which the number of teacher identification results belongs; and

determine, from a balance parameter library, a target balance parameter corresponding to the preset threshold range, and use the target balance parameter as the balance parameter for balancing the teacher identification results, wherein the balance parameter library includes at least one balance parameter and a correspondence between each balance parameter and a preset threshold range.
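The threshold-range lookup performed by the acquisition unit could be sketched as follows; the ranges and balance parameter values in this library are invented for the example.

    BALANCE_PARAMETER_LIBRARY = [
        # (lower bound, upper bound) of a preset threshold range -> balance parameter
        ((1, 64), 2),
        ((65, 512), 8),
        ((513, 4096), 32),
    ]

    def lookup_balance_parameter(num_teacher_results):
        # Pick the target balance parameter whose preset threshold range
        # contains the number of teacher identification results.
        for (low, high), parameter in BALANCE_PARAMETER_LIBRARY:
            if low <= num_teacher_results <= high:
                return parameter
        raise ValueError("no preset threshold range covers this count")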
The adjustment module 14 includes:

a verification unit, configured to verify whether the loss value satisfies the convergence state condition;

a first determination unit, configured to determine, if the loss value does not satisfy the convergence condition, the loss degree to which the loss value belongs; and

a first adjustment unit, configured to adjust the parameters in the preset student model according to the loss degree.
The above verification unit is specifically configured to:

obtain the minimum attainable value of the loss function used to calculate the loss value, and if the loss value differs from the minimum value, determine that the loss value does not satisfy the convergence condition; or,

verify whether the loss value is smaller than a preset loss threshold, and if the loss value is greater than or equal to the preset loss threshold, determine that the loss value does not satisfy the convergence condition.
The preset teacher model includes a plurality of teacher distillation layers, and the preset student model includes a plurality of student distillation layers.

The adjustment module 14 includes:

a second determination unit, configured to determine the teacher distillation layer corresponding to each student distillation layer among the plurality of student distillation layers;

a fourth calculation unit, configured to calculate the logarithm between the student identification result of each student distillation layer and the teacher identification result of the corresponding teacher distillation layer;

a second weighting operation unit, configured to perform, using the weight parameter, a weighting operation on the logarithm between the student identification result of each student distillation layer and the teacher identification result of the corresponding teacher distillation layer, to obtain the loss value of each student distillation layer; and

a second adjustment unit, configured to adjust the corresponding student distillation layer in the preset student model using the loss value of each student distillation layer.
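As a sketch of the per-distillation-layer processing above, the following assumes a uniform pairing between student and teacher distillation layers and repeats the grouped-average DKL computation from the earlier sketch; the mapping rule and the group size z are assumptions for illustration.

    import numpy as np

    def per_layer_losses(student_rows, teacher_rows, z, eps=1e-12):
        # student_rows / teacher_rows: one attention row (1-D distribution) per
        # distillation layer; student layer m is paired with teacher layer
        # m * (len(teacher_rows) // len(student_rows)), a uniform mapping.
        step = len(teacher_rows) // len(student_rows)
        losses = []
        for m, q in enumerate(student_rows):
            p = np.asarray(teacher_rows[m * step], float)
            q = np.asarray(q, float)
            g = len(p) // z
            p_prime = p[: g * z].reshape(g, z).mean(axis=1)   # formula (12)
            q_prime = q[: g * z].reshape(g, z).mean(axis=1)   # formula (13)
            losses.append(float(np.sum(p_prime * np.log((p_prime + eps) / (q_prime + eps)))))
        return losses  # one loss value per student distillation layer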
In the embodiments of the present application, training sample data for training a preset student model is acquired, and the preset student model and a preset teacher model are used to identify the training sample data respectively, yielding a teacher identification result and a student identification result for the training sample data. A weight parameter for adjusting the identification result of the preset student model is obtained from the teacher identification result; the logarithm between the teacher identification result and the student identification result is calculated and weighted using the weight parameter; and the resulting value is used as a loss value to adjust the preset student model. In this way, the adjustment weights that the different identification results and prediction results of the training sample data contribute to adjusting the student model can be allocated reasonably according to the weight parameter, so that identification results with higher probability values receive more attention, which makes the obtained loss value more accurate and the adjusted student model more accurate. At the same time, a balance parameter is introduced to eliminate abnormal results in the teacher identification result and the student identification result of the training sample data, thereby avoiding an inaccurate loss value caused by such abnormal results. The student model is further adjusted according to the loss degree of the loss value, so that a larger adjustment is made when the error of the student model is greater, thereby improving the accuracy of student model training. Through the present application, the student model can be endowed with the data processing capability of the teacher model, and the accuracy of the student model can be improved.
According to an embodiment of the present application, the steps involved in the model distillation method shown in FIG. 1 or FIG. 5 may be performed by the respective modules of the model distillation apparatus shown in FIG. 6. For example, step S101 shown in FIG. 1 may be performed by the first acquisition module 11 in FIG. 6; step S102 shown in FIG. 1 may be performed by the identification module 12 in FIG. 6; step S103 shown in FIG. 1 may be performed by the second acquisition module 13 in FIG. 6; and step S104 shown in FIG. 1 may be performed by the adjustment module 14 in FIG. 6.
Please refer to FIG. 7, which is a schematic structural diagram of a computer device provided by an embodiment of the present application. The computer device may include a processor and a memory. Optionally, the computer device may further include a network interface and/or a user interface. For example, as shown in FIG. 7, the computer device 1000 may include a processor 1001, a network interface 1004 and a memory 1005; in addition, the computer device 1000 may further include a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display (Display) and a keyboard (Keyboard), and optionally may further include standard wired and wireless interfaces. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, for example at least one disk memory. Optionally, the memory 1005 may also be at least one storage device located remotely from the processor 1001. As shown in FIG. 7, the memory 1005, as a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in FIG. 7, the network interface 1004 can provide a network communication function, the user interface 1003 is mainly used to provide an input interface for the user, and the processor 1001 can be used to invoke the device control application program stored in the memory 1005 to implement the following:

acquiring training sample data for training a preset student model;

using the preset student model and a preset teacher model to identify the training sample data respectively, to obtain a teacher identification result and a student identification result of the training sample data, wherein the preset student model is obtained through training guided by the preset teacher model;

obtaining, from the teacher identification result, a weight parameter for adjusting the identification result of the preset student model; and

calculating the logarithm between the teacher identification result and the student identification result, performing a weighting operation on the logarithm using the weight parameter, and adjusting the preset student model using the calculated value as a loss value.
Optionally, the processor 1001 can be used to invoke the device control application program stored in the memory 1005 to implement:

acquiring a balance parameter for balancing the teacher identification results;

grouping, in the identification order of the preset teacher model, the obtained plurality of teacher identification results according to the balance parameter, to obtain a plurality of sequentially arranged teacher identification groups, wherein each of the teacher identification groups contains the same number of teacher identification results; and

calculating the average value of the plurality of teacher identification results in each teacher identification group, and using the obtained plurality of average values as the weight parameters after balancing.
Optionally, the processor 1001 can be used to invoke the device control application program stored in the memory 1005 to implement:

grouping, in the identification order, the obtained plurality of student identification results according to the balance parameter, to obtain a plurality of sequentially arranged student identification groups, wherein each of the student identification groups contains the same number of student identification results, and each teacher identification group corresponds one-to-one with a student identification group according to the identification order;

calculating the average value of the plurality of student identification results in each student identification group;

calculating the logarithm of the average value of each student identification group against the average value of the corresponding teacher identification group, to obtain a plurality of logarithms after balancing; and

performing a weighting operation on the balanced weight parameters and the balanced logarithms.
Optionally, the processor 1001 can be used to invoke the device control application program stored in the memory 1005 to implement:

acquiring the number of teacher identification results among the plurality of teacher identification results;

determining the preset threshold range to which the number of teacher identification results belongs; and

determining, from a balance parameter library, a target balance parameter corresponding to the preset threshold range, and using the target balance parameter as the balance parameter for balancing the teacher identification results, wherein the balance parameter library includes at least one balance parameter and a correspondence between each of the at least one balance parameter and a preset threshold range.
Optionally, the processor 1001 can be used to invoke the device control application program stored in the memory 1005 to implement:

verifying whether the loss value satisfies the convergence state condition;

if the loss value does not satisfy the convergence condition, determining the loss degree to which the loss value belongs; and

adjusting the parameters in the preset student model according to the loss degree.
Optionally, the processor 1001 can be used to invoke the device control application program stored in the memory 1005 to implement:

obtaining the minimum attainable value of the loss function used to calculate the loss value, and if the loss value differs from the minimum value, determining that the loss value does not satisfy the convergence condition; or,

verifying whether the loss value is smaller than a preset loss threshold, and if the loss value is greater than or equal to the preset loss threshold, determining that the loss value does not satisfy the convergence condition.
Optionally, the processor 1001 can be used to invoke the device control application program stored in the memory 1005 to implement the following, as part of calculating the logarithm between the teacher identification result and the student identification result, performing a weighting operation on the logarithm using the weight parameter, and adjusting the preset student model using the calculated value as a loss value:

determining the teacher distillation layer corresponding to each student distillation layer among the plurality of student distillation layers;

calculating the logarithm between the student identification result of each student distillation layer and the teacher identification result of the corresponding teacher distillation layer;

performing, using the weight parameter, a weighting operation on the logarithm between the student identification result of each student distillation layer and the teacher identification result of the corresponding teacher distillation layer, to obtain the loss value of each student distillation layer; and

adjusting the corresponding student distillation layer in the preset student model using the loss value of each student distillation layer.
In the embodiments of the present application, training sample data for training a preset student model is acquired, and the preset student model and a preset teacher model are used to identify the training sample data respectively, yielding a teacher identification result and a student identification result for the training sample data. A weight parameter for adjusting the identification result of the preset student model is obtained from the teacher identification result; the logarithm between the teacher identification result and the student identification result is calculated and weighted using the weight parameter; and the resulting value is used as a loss value to adjust the preset student model. In this way, the adjustment weights that the different identification results and prediction results of the training sample data contribute to adjusting the student model can be allocated reasonably according to the weight parameter, so that identification results with higher probability values receive more attention, which makes the obtained loss value more accurate and the adjusted student model more accurate. At the same time, a balance parameter is introduced to eliminate abnormal results in the teacher identification result and the student identification result of the training sample data, thereby avoiding an inaccurate loss value caused by such abnormal results. The student model is further adjusted according to the loss degree of the loss value, so that a larger adjustment is made when the error of the student model is greater, thereby improving the accuracy of student model training. Through the present application, the student model can be endowed with the data processing capability of the teacher model, and the accuracy of the student model can be improved.
It should be understood that the computer device 1000 described in this embodiment of the present application can carry out the description of the above model distillation method in the embodiments corresponding to FIG. 1 and FIG. 5, and can also carry out the description of the above model distillation apparatus in the embodiment corresponding to FIG. 6, which is not repeated here. In addition, the description of the beneficial effects of using the same method is also not repeated.
Furthermore, it should be pointed out that an embodiment of the present application also provides a computer-readable storage medium, in which the computer program executed by the aforementioned model distillation apparatus is stored; the computer program includes program instructions, and when the processor executes the program instructions, it can carry out the description of the above model distillation method in the embodiment corresponding to FIG. 1 or FIG. 5, which is therefore not repeated here. In addition, the description of the beneficial effects of using the same method is also not repeated. For technical details not disclosed in the computer-readable storage medium embodiments involved in the present application, please refer to the description of the method embodiments of the present application.
Optionally, the storage medium involved in the present application, such as the computer-readable storage medium, may be non-volatile or volatile.
As an example, the above program instructions may be deployed and executed on one computer device, or executed on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network; multiple computer devices distributed across multiple sites and interconnected by a communication network can form a blockchain network.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing the relevant hardware through a computer program; the program may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The above disclosure is merely a preferred embodiment of the present application and certainly cannot be used to limit the scope of the claims of the present application; therefore, equivalent changes made according to the claims of the present application still fall within the scope covered by the present application.

Claims (20)

1. A model distillation method, comprising:

acquiring training sample data for training a preset student model;

using the preset student model and a preset teacher model to identify the training sample data respectively, to obtain a teacher identification result and a student identification result of the training sample data, wherein the preset student model is obtained through training guided by the preset teacher model;

obtaining, from the teacher identification result, a weight parameter for adjusting the identification result of the preset student model; and

calculating the logarithm between the teacher identification result and the student identification result, performing a weighting operation on the logarithm using the weight parameter, and adjusting the preset student model using the calculated value as a loss value.
2. The method according to claim 1, wherein there are a plurality of teacher identification results, and the teacher identification results represent identification probabilities;

the obtaining, from the teacher identification result, a weight parameter for adjusting the identification result of the preset student model comprises:

acquiring a balance parameter for balancing the teacher identification results;

grouping, in the identification order of the preset teacher model, the obtained plurality of teacher identification results according to the balance parameter, to obtain a plurality of sequentially arranged teacher identification groups, wherein each of the teacher identification groups contains the same number of teacher identification results; and

calculating the average value of the plurality of teacher identification results in each teacher identification group, and using the obtained plurality of average values as the weight parameters after balancing.
3. The method according to claim 2, wherein there are a plurality of student identification results;

the calculating the logarithm between the teacher identification result and the student identification result and performing a weighting operation on the logarithm using the weight parameter comprises:

grouping, in the identification order, the obtained plurality of student identification results according to the balance parameter, to obtain a plurality of sequentially arranged student identification groups, wherein each of the student identification groups contains the same number of student identification results, and each teacher identification group corresponds one-to-one with a student identification group according to the identification order;

calculating the average value of the plurality of student identification results in each student identification group;

calculating the logarithm of the average value of each student identification group against the average value of the corresponding teacher identification group, to obtain a plurality of logarithms after balancing; and

performing a weighting operation on the balanced weight parameters and the balanced logarithms.
4. The method according to claim 2, wherein the acquiring a balance parameter for balancing the teacher identification results comprises:

acquiring the number of teacher identification results among the plurality of teacher identification results;

determining the preset threshold range to which the number of teacher identification results belongs; and

determining, from a balance parameter library, a target balance parameter corresponding to the preset threshold range, and using the target balance parameter as the balance parameter for balancing the teacher identification results, wherein the balance parameter library includes at least one balance parameter and a correspondence between each of the at least one balance parameter and a preset threshold range.
5. The method according to claim 1, wherein the adjusting the preset student model using the calculated value as a loss value comprises:

verifying whether the loss value satisfies a convergence state condition;

if the loss value does not satisfy the convergence condition, determining the loss degree to which the loss value belongs; and

adjusting the parameters in the preset student model according to the loss degree.
6. The method according to claim 5, wherein the verifying whether the loss value satisfies a convergence state condition comprises:

obtaining the minimum attainable value of the loss function used to calculate the loss value, and if the loss value differs from the minimum value, determining that the loss value does not satisfy the convergence condition; or,

verifying whether the loss value is smaller than a preset loss threshold, and if the loss value is greater than or equal to the preset loss threshold, determining that the loss value does not satisfy the convergence condition.
7. The method according to claim 1, wherein the preset teacher model includes a plurality of teacher distillation layers, and the preset student model includes a plurality of student distillation layers;

the calculating the logarithm between the teacher identification result and the student identification result, performing a weighting operation on the logarithm using the weight parameter, and adjusting the preset student model using the calculated value as a loss value comprises:

determining the teacher distillation layer corresponding to each student distillation layer among the plurality of student distillation layers;

calculating the logarithm between the student identification result of each student distillation layer and the teacher identification result of the corresponding teacher distillation layer;

performing, using the weight parameter, a weighting operation on the logarithm between the student identification result of each student distillation layer and the teacher identification result of the corresponding teacher distillation layer, to obtain a loss value of each student distillation layer; and

adjusting the corresponding student distillation layer in the preset student model using the loss value of each student distillation layer.
8. A model distillation apparatus, comprising:

a first acquisition module, configured to acquire training sample data for training a preset student model;

an identification module, configured to use the preset student model and a preset teacher model to identify the training sample data respectively, to obtain a teacher identification result and a student identification result of the training sample data, wherein the preset student model is obtained through training guided by the preset teacher model;

a second acquisition module, configured to obtain, from the teacher identification result, a weight parameter for adjusting the identification result of the preset student model; and

an adjustment module, configured to calculate the logarithm between the teacher identification result and the student identification result, perform a weighting operation on the logarithm using the weight parameter, and adjust the preset student model using the calculated value as a loss value.
9. A computer device, comprising a processor and a memory;

wherein the memory is configured to store program code, and the processor is configured to invoke the program code to perform the following method:

acquiring training sample data for training a preset student model;

using the preset student model and a preset teacher model to identify the training sample data respectively, to obtain a teacher identification result and a student identification result of the training sample data, wherein the preset student model is obtained through training guided by the preset teacher model;

obtaining, from the teacher identification result, a weight parameter for adjusting the identification result of the preset student model; and

calculating the logarithm between the teacher identification result and the student identification result, performing a weighting operation on the logarithm using the weight parameter, and adjusting the preset student model using the calculated value as a loss value.
10. The computer device according to claim 9, wherein there are a plurality of teacher identification results, and the teacher identification results represent identification probabilities;

performing the obtaining, from the teacher identification result, of the weight parameter for adjusting the identification result of the preset student model comprises:

acquiring a balance parameter for balancing the teacher identification results;

grouping, in the identification order of the preset teacher model, the obtained plurality of teacher identification results according to the balance parameter, to obtain a plurality of sequentially arranged teacher identification groups, wherein each of the teacher identification groups contains the same number of teacher identification results; and

calculating the average value of the plurality of teacher identification results in each teacher identification group, and using the obtained plurality of average values as the weight parameters after balancing.
11. The computer device according to claim 10, wherein there are a plurality of student identification results;

performing the calculating of the logarithm between the teacher identification result and the student identification result and the weighting operation on the logarithm using the weight parameter comprises:

grouping, in the identification order, the obtained plurality of student identification results according to the balance parameter, to obtain a plurality of sequentially arranged student identification groups, wherein each of the student identification groups contains the same number of student identification results, and each teacher identification group corresponds one-to-one with a student identification group according to the identification order;

calculating the average value of the plurality of student identification results in each student identification group;

calculating the logarithm of the average value of each student identification group against the average value of the corresponding teacher identification group, to obtain a plurality of logarithms after balancing; and

performing a weighting operation on the balanced weight parameters and the balanced logarithms.
  12. The computer device according to claim 10, wherein the obtaining a balance parameter for balancing the teacher identification results comprises:
    obtaining the number of the teacher identification results;
    determining a preset threshold range to which the number of the teacher identification results belongs; and
    determining, from a balance parameter library, a target balance parameter corresponding to the preset threshold range, and using the target balance parameter as the balance parameter for balancing the teacher identification results, wherein the balance parameter library comprises at least one balance parameter and a correspondence between each of the at least one balance parameter and a preset threshold range.
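A sketch of the balance parameter library of claim 12 (mirrored in claim 18). The threshold ranges and balance values below are invented for illustration only; the claim requires merely that each stored balance parameter correspond to a preset threshold range on the number of teacher identification results:

```python
# Hypothetical balance parameter library: each entry maps a preset
# threshold range on the result count to a balance parameter. All
# ranges and values here are invented for illustration.
BALANCE_LIBRARY = [
    ((1, 100), 2),
    ((101, 1000), 5),
    ((1001, 10000), 10),
]

def select_balance_parameter(num_teacher_results):
    for (low, high), balance in BALANCE_LIBRARY:
        if low <= num_teacher_results <= high:
            return balance  # target balance parameter for this range
    raise ValueError(f"no preset threshold range covers {num_teacher_results}")
```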
  13. The computer device according to claim 9, wherein the adjusting the preset student model by using the calculated value as a loss value comprises:
    verifying whether the loss value satisfies a convergence condition;
    if the loss value does not satisfy the convergence condition, determining a loss degree to which the loss value belongs; and
    adjusting parameters in the preset student model according to the loss degree.
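The convergence check and degree-based adjustment of claim 13 (mirrored in claim 19) might look like the following sketch; the convergence threshold, the two-level loss degree, and the gradient-style update are all assumptions, since the claim leaves these mappings open:

```python
def adjust_student(params, grads, loss, convergence_threshold=1e-3):
    # Verify whether the loss value satisfies the convergence condition
    # (a simple threshold test is assumed).
    if loss < convergence_threshold:
        return params  # converged: no adjustment needed
    # Map the loss value to a loss degree, and the degree to a step
    # size; this two-level mapping is an illustrative assumption.
    step = 0.1 if loss > 1.0 else 0.01
    # Adjust the parameters of the preset student model accordingly.
    return {name: value - step * grads[name] for name, value in params.items()}
```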
  14. The computer device according to claim 9, wherein the preset teacher model comprises a plurality of teacher distillation layers, and the preset student model comprises a plurality of student distillation layers;
    the calculating a logarithm between the teacher identification result and the student identification result, performing a weighted operation on the logarithm by using the weight parameter, and adjusting the preset student model by using the calculated value as a loss value comprises:
    determining, for each of the plurality of student distillation layers, a corresponding teacher distillation layer;
    calculating a logarithm between the student identification result of each student distillation layer and the teacher identification result of the corresponding teacher distillation layer;
    performing, by using the weight parameter, a weighted operation on the logarithm between the student identification result of each student distillation layer and the teacher identification result of the corresponding teacher distillation layer, to obtain a loss value of each student distillation layer; and
    adjusting the corresponding student distillation layer in the preset student model by using the loss value of each student distillation layer.
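A sketch of the layer-wise variant of claim 14 (mirrored in claim 20), assuming the student and teacher distillation layers correspond one-to-one by index and that the weight parameter broadcasts against each layer's identification results:

```python
import numpy as np

def layerwise_losses(teacher_layer_results, student_layer_results,
                     weights, eps=1e-12):
    w = np.asarray(weights, dtype=float)
    losses = []
    # Pairing by index assumes each student distillation layer has been
    # matched to its corresponding teacher distillation layer.
    for t_out, s_out in zip(teacher_layer_results, student_layer_results):
        logs = np.log((np.asarray(t_out, dtype=float) + eps) /
                      (np.asarray(s_out, dtype=float) + eps))
        # The weighted operation gives one loss value per student layer,
        # used to adjust that layer alone.
        losses.append(float(np.sum(w * logs)))
    return losses
```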
  15. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a processor, perform the following method:
    obtaining training sample data for training a preset student model;
    identifying the training sample data by using the preset student model and a preset teacher model, respectively, to obtain a teacher identification result and a student identification result of the training sample data, wherein the preset student model is obtained through training guided by the preset teacher model;
    obtaining, from the teacher identification result, a weight parameter for adjusting the identification result of the preset student model; and
    calculating a logarithm between the teacher identification result and the student identification result, performing a weighted operation on the logarithm by using the weight parameter, and adjusting the preset student model by using the calculated value as a loss value.
  16. The computer-readable storage medium according to claim 15, wherein there are a plurality of teacher identification results, and each teacher identification result represents an identification probability;
    the obtaining, from the teacher identification result, a weight parameter for adjusting the identification result of the preset student model comprises:
    obtaining a balance parameter for balancing the teacher identification results;
    grouping, in the identification order of the preset teacher model, the plurality of teacher identification results according to the balance parameter, to obtain a plurality of sequentially arranged teacher identification groups, wherein each teacher identification group contains the same number of teacher identification results; and
    calculating an average value of the teacher identification results in each teacher identification group, and using the obtained average values as balanced weight parameters.
  17. The computer-readable storage medium according to claim 16, wherein there are a plurality of student identification results;
    the calculating a logarithm between the teacher identification result and the student identification result, and performing a weighted operation on the logarithm by using the weight parameter comprises:
    grouping, in the identification order, the plurality of student identification results according to the balance parameter, to obtain a plurality of sequentially arranged student identification groups, wherein each student identification group contains the same number of student identification results, and the teacher identification groups correspond one-to-one to the student identification groups in the identification order;
    calculating an average value of the student identification results in each student identification group;
    calculating a logarithm of the average value of each student identification group and the average value of the corresponding teacher identification group, to obtain a plurality of balanced logarithms; and
    performing a weighted operation on the balanced logarithms by using the balanced weight parameters.
  18. The computer-readable storage medium according to claim 16, wherein the obtaining a balance parameter for balancing the teacher identification results comprises:
    obtaining the number of the teacher identification results;
    determining a preset threshold range to which the number of the teacher identification results belongs; and
    determining, from a balance parameter library, a target balance parameter corresponding to the preset threshold range, and using the target balance parameter as the balance parameter for balancing the teacher identification results, wherein the balance parameter library comprises at least one balance parameter and a correspondence between each of the at least one balance parameter and a preset threshold range.
  19. The computer-readable storage medium according to claim 15, wherein the adjusting the preset student model by using the calculated value as a loss value comprises:
    verifying whether the loss value satisfies a convergence condition;
    if the loss value does not satisfy the convergence condition, determining a loss degree to which the loss value belongs; and
    adjusting parameters in the preset student model according to the loss degree.
  20. The computer-readable storage medium according to claim 15, wherein the preset teacher model comprises a plurality of teacher distillation layers, and the preset student model comprises a plurality of student distillation layers;
    the calculating a logarithm between the teacher identification result and the student identification result, performing a weighted operation on the logarithm by using the weight parameter, and adjusting the preset student model by using the calculated value as a loss value comprises:
    determining, for each of the plurality of student distillation layers, a corresponding teacher distillation layer;
    calculating a logarithm between the student identification result of each student distillation layer and the teacher identification result of the corresponding teacher distillation layer;
    performing, by using the weight parameter, a weighted operation on the logarithm between the student identification result of each student distillation layer and the teacher identification result of the corresponding teacher distillation layer, to obtain a loss value of each student distillation layer; and
    adjusting the corresponding student distillation layer in the preset student model by using the loss value of each student distillation layer.
PCT/CN2021/096649 2020-11-20 2021-05-28 Model distillation method and apparatus, and storage medium and device WO2022105173A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011313330.8 2020-11-20
CN202011313330.8A CN112465138A (en) 2020-11-20 2020-11-20 Model distillation method, device, storage medium and equipment

Publications (1)

Publication Number Publication Date
WO2022105173A1 true WO2022105173A1 (en) 2022-05-27

Family

ID=74798380

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096649 WO2022105173A1 (en) 2020-11-20 2021-05-28 Model distillation method and apparatus, and storage medium and device

Country Status (2)

Country Link
CN (1) CN112465138A (en)
WO (1) WO2022105173A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112465138A (en) * 2020-11-20 2021-03-09 平安科技(深圳)有限公司 Model distillation method, device, storage medium and equipment
JP7381814B2 (en) * 2020-12-15 2023-11-16 之江実験室 Automatic compression method and platform for pre-trained language models for multitasking
CN112990296B (en) * 2021-03-10 2022-10-11 中科人工智能创新技术研究院(青岛)有限公司 Image-text matching model compression and acceleration method and system based on orthogonal similarity distillation
US11200497B1 (en) * 2021-03-16 2021-12-14 Moffett Technologies Co., Limited System and method for knowledge-preserving neural network pruning
CN113239176B (en) * 2021-06-21 2022-08-23 中国平安人寿保险股份有限公司 Semantic matching model training method, device, equipment and storage medium
CN113807214B (en) * 2021-08-31 2024-01-05 中国科学院上海微系统与信息技术研究所 Small target face recognition method based on deit affiliated network knowledge distillation
CN114565759A (en) * 2022-02-22 2022-05-31 北京百度网讯科技有限公司 Image semantic segmentation model optimization method and device, electronic equipment and storage medium
CN115019060A (en) * 2022-07-12 2022-09-06 北京百度网讯科技有限公司 Target recognition method, and training method and device of target recognition model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105008A (en) * 2018-10-29 2020-05-05 富士通株式会社 Model training method, data recognition method and data recognition device
EP3736749A1 (en) * 2019-05-09 2020-11-11 Siemens Aktiengesellschaft Method and device for controlling a device using a dataset
CN110852426A (en) * 2019-11-19 2020-02-28 成都晓多科技有限公司 Pre-training model integration acceleration method and device based on knowledge distillation
CN110909815A (en) * 2019-11-29 2020-03-24 深圳市商汤科技有限公司 Neural network training method, neural network training device, neural network processing device, neural network training device, image processing device and electronic equipment
CN112465138A (en) * 2020-11-20 2021-03-09 平安科技(深圳)有限公司 Model distillation method, device, storage medium and equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115099988A (en) * 2022-06-28 2022-09-23 腾讯科技(深圳)有限公司 Model training method, data processing method, device and computer medium
CN115170455A (en) * 2022-08-17 2022-10-11 荣耀终端有限公司 Image processing method and related device
CN115170455B (en) * 2022-08-17 2023-02-07 荣耀终端有限公司 Image processing method and related device

Also Published As

Publication number Publication date
CN112465138A (en) 2021-03-09

Similar Documents

Publication Publication Date Title
WO2022105173A1 (en) Model distillation method and apparatus, and storage medium and device
US20230100376A1 (en) Text sentence processing method and apparatus, computer device, and storage medium
US11081105B2 (en) Model learning device, method and recording medium for learning neural network model
EP2727103B1 (en) Speech recognition using variable-length context
US11240121B2 (en) Methods and systems for controlling data backup
WO2022116441A1 (en) Bert model fine-tuning method and apparatus based on convolutional neural network
CN110929515A (en) Reading understanding method and system based on cooperative attention and adaptive adjustment
WO2020160252A1 (en) Task-aware neural network architecture search
CN111400470A (en) Question processing method and device, computer equipment and storage medium
CN112101010B (en) Telecom industry OA office automation manuscript auditing method based on BERT
US11380301B2 (en) Learning apparatus, speech recognition rank estimating apparatus, methods thereof, and program
CN108052625A (en) A kind of entity sophisticated category method
CN114169442A (en) Remote sensing image small sample scene classification method based on double prototype network
JP2022529268A (en) Voice recognition methods and devices
CN111667069A (en) Pre-training model compression method and device and electronic equipment
WO2020154373A1 (en) Neural network training using the soft nearest neighbor loss
CN117539977A (en) Training method and device for language model
WO2020216286A1 (en) Method for training teaching style prediction model, and computer storage medium
CN116361655A (en) Model training method, standard problem prediction method, device, equipment and medium
CN113555005B (en) Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium
CN110245331A (en) A kind of sentence conversion method, device, server and computer storage medium
CN114913871A (en) Target object classification method, system, electronic device and storage medium
CN116266266B (en) Multi-tone word disambiguation method, device, equipment and storage medium
Chen The Prediction of English Online Network Performance Based on the XGBoost Algorithm
CN116737888B (en) Training method of dialogue generation model and method and device for determining reply text

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21893328

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21893328

Country of ref document: EP

Kind code of ref document: A1