WO2023273237A1 - Model compression method and system, electronic device, and storage medium - Google Patents

Model compression method and system, electronic device, and storage medium Download PDF

Info

Publication number
WO2023273237A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
logit
loss function
value
student
Prior art date
Application number
PCT/CN2021/140780
Other languages
French (fr)
Chinese (zh)
Inventor
陈贝
Original Assignee
达闼机器人股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 达闼机器人股份有限公司
Publication of WO2023273237A1 publication Critical patent/WO2023273237A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • The embodiments of the present application relate to the technical field of machine learning, and in particular to a model compression method, system, electronic device, and storage medium.
  • Text similarity matching is widely used. For example, in information retrieval, a retrieval system can use similarity to identify words related to the search terms and thereby recall more similar results, improving the recall rate. In automatic question answering with natural language interaction, similarity can be used to measure how well a user's natural-language question matches the questions in a corpus, and the answer corresponding to the best-matching question is returned as the response.
  • The purpose of the embodiments of the present application is to provide a model compression method, an electronic device, and a storage medium that can improve the prediction accuracy of a trained student model.
  • An embodiment of the present application provides a model compression method, including: providing N types of trained complex models, N being an integer greater than or equal to 2; fusing the N types of complex models to obtain a trained teacher model; and training a student model based on training samples, the teacher model, and a loss function of the student model. The loss function of the student model is obtained by fusing a first loss function and a second loss function; the first loss function is used to calculate the loss between the predicted value and the true value of the student model, and the second loss function is used to calculate the loss between the logit value of the student model and the logit value of the teacher model. Each training sample includes a sample input and a sample output; after receiving the sample input, the student model outputs the predicted value and the logit layer in the student model outputs its logit value, and the true value is the sample output; after the teacher model receives the sample input, the logit layer in the teacher model outputs its logit value.
  • An embodiment of the present application also provides a model compression system, including: a complex model training unit, configured to provide N types of trained complex models, N being an integer greater than or equal to 2; a teacher model acquisition unit, configured to fuse the N types of complex models to obtain a trained teacher model; and a student model training unit, configured to train a student model based on training samples, the teacher model, and a loss function of the student model. The loss function of the student model is obtained by fusing a first loss function and a second loss function; the first loss function is used to calculate the loss between the predicted value and the true value of the student model, and the second loss function is used to calculate the loss between the logit value of the student model and the logit value of the teacher model. Each training sample includes a sample input and a sample output; after receiving the sample input, the student model outputs the predicted value and the logit layer in the student model outputs its logit value, and the true value is the sample output; after the teacher model receives the sample input, the logit layer in the teacher model outputs its logit value.
  • An embodiment of the present application also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the model compression method described above.
  • An embodiment of the present application also provides a computer-readable storage medium storing a computer program, where the computer program implements the above model compression method when executed by a processor.
  • An embodiment of the present application further provides a computer program, wherein the computer program implements the above-mentioned model compression method when executed by a processor.
  • In the knowledge-distillation-based model compression of the embodiments of the present application, the teacher model is obtained by fusing N types of complex models, so the strengths of multiple types of complex models can be absorbed and the teacher model becomes more comprehensive. The loss function of the student model is likewise obtained by fusing a first loss function and a second loss function: the first loss function calculates the loss between the predicted value and the true value of the student model, realizing training on hard targets, while the second loss function calculates the loss between the logit value of the student model and the logit value of the teacher model, realizing training on soft targets. Because the loss function of the student model combines hard-target and soft-target training, the training accuracy is better. Therefore, the model compression method of the embodiments of the present application can improve the prediction accuracy of the trained student model.
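  • The fused loss can be written as a weighted sum of a hard-target term and a soft-target term. The sketch below is a minimal PyTorch-style illustration, assuming the cross-entropy first loss, squared-error second loss, and weighting preference described later in this document; the weight values w1 and w2 are illustrative placeholders, not values prescribed by the application.

```python
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()   # first loss: student prediction vs. true label (hard target)
mse_loss = nn.MSELoss()           # second loss: student logits vs. teacher logits (soft target)

def student_loss(student_logits, labels, teacher_logits, w1=0.3, w2=0.7):
    # Weighted fusion of the two losses; w2 > w1 emphasizes soft-target training.
    hard = ce_loss(student_logits, labels)
    soft = mse_loss(student_logits, teacher_logits.detach())  # teacher is fixed, no gradient
    return w1 * hard + w2 * soft
```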
  • Fig. 1 is a flowchart of a model compression method according to one embodiment of the present application.
  • Fig. 2 is a flowchart of a model compression method according to another embodiment of the present application.
  • Fig. 3 is a block diagram of a model compression system according to one embodiment of the present application.
  • Fig. 4 is a block diagram of an electronic device according to one embodiment of the present application.
  • An embodiment of the present application relates to a model compression method, and the specific process is shown in Fig. 1.
  • Step 101: provide N types of trained complex models; N is an integer greater than or equal to 2.
  • Step 102: fuse the N types of complex models to obtain a trained teacher model.
  • Step 103: train the student model based on the training samples, the teacher model, and the loss function of the student model.
  • The loss function of the student model is obtained by fusing the first loss function and the second loss function. The first loss function is used to calculate the loss between the predicted value and the true value of the student model, and the second loss function is used to calculate the loss between the logit value of the student model and the logit value of the teacher model. Each training sample includes a sample input and a sample output; the student model outputs the predicted value after receiving the sample input, the logit layer in the student model outputs its logit value, and the true value is the sample output. After the teacher model receives the sample input, the logit layer in the teacher model outputs its logit value.
  • The logit layer in the student model is the fully connected layer in the student model, and the logit layer in the teacher model is the fully connected layer in the teacher model.
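  • As a concrete illustration of where the logit value comes from, the sketch below shows a toy Siamese-style text-pair student whose final fully connected layer is the logit layer; the architecture, layer sizes, and class names here are assumptions for illustration only and are not taken from the application.

```python
import torch
import torch.nn as nn

class TextPairStudent(nn.Module):
    """Illustrative lightweight student; only the final Linear (the "logit layer") matters here."""
    def __init__(self, vocab_size=30000, emb_dim=128, hidden=128, num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.logit_layer = nn.Linear(2 * hidden, num_classes)   # fully connected logit layer

    def forward(self, sent_a, sent_b):
        _, ha = self.gru(self.emb(sent_a))            # encode first sentence (token-id tensor)
        _, hb = self.gru(self.emb(sent_b))            # encode second sentence
        feats = torch.cat([ha[-1], hb[-1]], dim=-1)   # pair representation
        logits = self.logit_layer(feats)              # logit value used for distillation
        return logits, torch.softmax(logits, dim=-1)  # (logit value, predicted value)
```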
  • In the embodiment of the present application, during knowledge-distillation-based model compression, the teacher model is obtained by fusing N types of complex models, so the strengths of multiple types of complex models can be absorbed, making the teacher model more comprehensive. The loss function of the student model is likewise obtained by fusing the first loss function and the second loss function: the first loss function calculates the loss between the predicted value and the true value of the student model, realizing training on hard targets, while the second loss function calculates the loss between the logit value of the student model and the logit value of the teacher model, realizing training on soft targets. Since the loss function of the student model combines hard-target and soft-target training, the training accuracy is better. Therefore, the model compression method of the embodiment of the present application can improve the prediction accuracy of the trained student model.
  • The model compression method of the embodiment of the present application uses knowledge distillation to compress complex models into lightweight models that are better suited to industrial applications.
  • The lightweight model is, for example, a model needed in the field of natural language processing, such as a text similarity matching model.
  • The model compression method can be executed by an electronic device, such as a server, a personal computer, or any other device with the processing capability needed to execute the method.
  • N may be 3, and the three types of complex models are, for example, the BERT-wwm-ext model, the Ernie-1.0 model, and the RoBERTa-large-pair model.
  • This embodiment does not limit the value of N; N can be determined as required.
  • The student model is, for example, the SiaGRU model.
  • Each type of complex model can be obtained by training, and in that case the trained models of each type include one complex model of that type.
  • Step 102 in this embodiment may specifically be: fuse the N logit layers of the N complex models to serve as the logit layer of the teacher model. The fusion may be done by adding the N logit values output by the N logit layers of the N complex models and taking the average as the logit value output by the logit layer of the teacher model. It is not limited to this, however; the fusion may also, for example, weight and fuse the N logit values output by the N logit layers of the N complex models, and use the weighted fused value as the logit value output by the logit layer of the teacher model.
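  • A minimal sketch of the fusion just described: the teacher's logit is obtained by averaging (or weighted-averaging) the logits produced by the N complex models for the same sample input. The helper below is a hypothetical illustration of the arithmetic only.

```python
import torch

def fuse_logits(logits_list, weights=None):
    """logits_list: N tensors of identical shape, one per complex model."""
    stacked = torch.stack(logits_list, dim=0)      # (N, batch, num_classes)
    if weights is None:
        return stacked.mean(dim=0)                 # add the N logits and take the average
    w = torch.tensor(weights).view(-1, 1, 1)       # pre-assigned weight per model
    return (stacked * w).sum(dim=0)                # weighted fusion
```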
  • Each type of complex model may be obtained based on K-fold cross-validation training, in which case the trained models of each type include K trained complex models of that type, K being an integer greater than or equal to 2. The K trained complex models of the same type have different values for their internal parameters. In other examples, the complex models can also be trained with the hold-out method, the bootstrap method, or other training schemes, in which case the number of trained complex models of each type is one.
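  • The K-fold setup can be sketched as follows, using scikit-learn's KFold to produce the K train/test splits; train_one_model is a hypothetical routine standing in for fine-tuning one complex model (for example one BERT-wwm-ext instance) on a split.

```python
from sklearn.model_selection import KFold

def train_k_models(train_one_model, samples, k=10):
    """Returns K models of one type, each trained on a different K-fold split."""
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    models = []
    for train_idx, test_idx in kf.split(samples):
        train_part = [samples[i] for i in train_idx]   # K-1 folds for training
        test_part = [samples[i] for i in test_idx]     # 1 fold for testing
        models.append(train_one_model(train_part, test_part))
    return models
```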
  • As shown in Fig. 2, the model compression method includes: Step 201, provide N types of trained complex models; Step 202, fuse the N types of complex models to obtain a trained teacher model; Step 203, train the student model based on the training samples, the teacher model, and the loss function of the student model. The loss function of the student model is obtained by fusing the first loss function and the second loss function; the first loss function is used to calculate the loss between the predicted value and the true value of the student model, and the second loss function is used to calculate the loss between the logit value of the student model and the logit value of the teacher model. Steps 201 and 203 are similar to steps 101 and 103 in Fig. 1, respectively, and are not repeated here.
  • In step 202 of this embodiment, the N types of complex models are fused to obtain a trained teacher model, which specifically includes: Step 2021, for each type of complex model, fuse the K logit layers of the K complex models to obtain the logit layer of that type of complex model; Step 2022, fuse the N logit layers of the N types of complex models to serve as the logit layer of the teacher model.
  • The fusion of the logit layers can be: add the logit values output by the logit layers and take the average. That is, in step 2021, for each type of complex model, the K logit values output by the K logit layers of the K complex models are added and averaged to give the logit value output by the logit layer of that type of complex model; in step 2022, the N logit values output by the N logit layers of the N types of complex models are added and averaged to give the logit value output by the logit layer of the teacher model.
  • As an example, there are 3 types of complex models, namely the BERT-wwm-ext model, the Ernie-1.0 model, and the RoBERTa-large-pair model; if K is 10, then after training there are 10 BERT-wwm-ext models, 10 Ernie-1.0 models, and 10 RoBERTa-large-pair models.
  • The first logit value, the second logit value, and the third logit value are added and averaged, and the result is used as the logit value output by the logit layer of the teacher model.
  • The fusion of the logit layers may also be: assign a weight in advance to the logit value output by each logit layer, multiply the logit values output by the logit layers by their respective weights, and then add them. The fusion method for the logit layers can be set as required.
  • The training samples may be a massive set of training samples. Massive training samples can be obtained from an existing database; for example, in some intelligent question-answering scenarios, the existing database contains a large number of questions grouped into categories by semantics, so question pairs of the same category can serve as the sample inputs of training samples, and the answer sentences corresponding to the question pairs of the same category serve as the sample outputs. Massive training samples can also be obtained from daily online logs; for example, in some intelligent question-answering scenarios, a large number of online logs are generated during actual question answering, and these logs can be used as training samples after being annotated by a labeling team. A large number of training samples can also be obtained from public data sets on the Internet such as LCQMC and BQ Corpus.
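  • For instance, question pairs from a public paraphrase corpus such as LCQMC can be read into (sample input, sample output) pairs for the similarity-matching task. The loader below assumes a hypothetical tab-separated file with two sentences and a 0/1 label per line; the exact format of any given corpus may differ.

```python
def load_pairs(path):
    """Each training sample: sample input = (sentence_a, sentence_b), sample output = label."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            sent_a, sent_b, label = line.rstrip("\n").split("\t")
            samples.append(((sent_a, sent_b), int(label)))
    return samples
```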
  • During the training of the student model, a different training sample can be selected for each training iteration. Each training sample includes a sample input and a sample output. In each training iteration, the sample input of the training sample can be fed to the student model and to the teacher model; the student model then outputs a predicted value, the logit layer of the student model outputs a logit value, and the logit layer of the teacher model outputs a logit value.
  • The predicted value output by the student model and the true value are used as the input of the first loss function, which computes a first loss value; the logit value output by the student model and the logit value output by the teacher model are used as the input of the second loss function, which computes a second loss value. The first loss value and the second loss value are fused to give the loss value of the student model under that training sample. The electronic device then judges whether the loss value of the student model satisfies a preset training-completion condition; if it does not, another training sample is selected and the student model is trained again, until, after some iteration, the loss value of the student model satisfies the training-completion condition, at which point the training of the student model ends.
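  • As an illustration of the iteration just described, the sketch below strings the earlier pieces together. It is only a sketch: pick_sample and training_complete are hypothetical helpers, the student and teacher callables are assumed to take already-tokenized tensors, and student_loss is the fused loss from the earlier sketch.

```python
import torch

def train_student(student, teacher, samples, optimizer, max_steps=100000):
    for step in range(max_steps):
        (sent_a, sent_b), label = pick_sample(samples)     # a (possibly different) sample each iteration
        student_logits, _ = student(sent_a, sent_b)        # logit value and predicted value
        with torch.no_grad():
            teacher_logits = teacher(sent_a, sent_b)       # teacher logit under this sample
        loss = student_loss(student_logits, torch.tensor([label]), teacher_logits)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if training_complete(loss.item()):                 # preset training-completion condition
            break
```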
  • Iteratively training the student model on massive training samples makes the prediction accuracy of the trained student model higher. In other embodiments, the training samples may be few in number, or there may even be only one training sample.
  • Whether the training samples are massive or few, a training sample can be reused. Since the teacher model has already been trained, its logit value is the same under the same training sample; that is, whenever the teacher model receives the sample input of the same training sample, the logit layer of the teacher model outputs the same logit value. Therefore, if a training sample is reused during iterative training, it is not necessary to recompute the teacher model's logit value for that sample every time.
  • Specifically, if a training sample is selected for training for the first time, its sample input is fed to the teacher model to obtain the teacher model's logit value, and that logit value is saved in a preset storage unit; if the training sample is not being selected for the first time, the teacher model's logit value is obtained from the storage unit.
  • The storage unit can hold the correspondence between the identifier of a training sample and the teacher model's logit value; in this way, when a training sample is selected, the corresponding logit value of the teacher model can be retrieved from the storage unit according to the identifier of the training sample.
  • The identifier of a training sample can be, for example, the sample number of the training sample.
  • When executing the step of training the student model, the electronic device can first feed the sample input of each training sample to the teacher model to obtain the teacher model's logit value under each training sample, and store those logit values in the storage unit; that is, the storage unit holds the correspondence between the identifiers of the training samples and the teacher model's logit values. Later, when the student model is being trained on a certain training sample and the teacher model's logit value under that sample is needed to compute the second loss function, the corresponding logit value can be fetched directly from the storage unit according to the identifier of the training sample.
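  • A minimal sketch of this caching, keyed by the sample's identifier (for example its sample number). The dictionary stands in for the "storage unit", and get_teacher_logit is a hypothetical wrapper around the trained teacher, not an API from the application.

```python
import torch

teacher_logit_cache = {}   # maps sample identifier -> teacher logit value

def get_teacher_logit(teacher, sample_id, sample_input):
    if sample_id not in teacher_logit_cache:                 # first time this sample is selected
        with torch.no_grad():
            teacher_logit_cache[sample_id] = teacher(*sample_input)
    return teacher_logit_cache[sample_id]                    # reused directly on later iterations
```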
  • The fusion of the first loss function and the second loss function is a weighted fusion; that is, weights are assigned to the first loss function and the second loss function in advance, and the loss function of the student model is the sum of the first loss function and the second loss function each multiplied by its own weight. The weights of the first and second loss functions can be set according to the actual situation; for example, weights can be chosen that give the trained student model higher prediction accuracy. Preferably, the weight of the second loss function is greater than the weight of the first loss function; that is, the training of the student model leans more toward training on soft targets. In this way the teacher model has a greater influence on the student model, so that the generalization ability of the trained student model is better.
  • In one embodiment, the first loss function is a cross-entropy loss function and the second loss function is a squared-error loss function. In other embodiments, the first loss function may be a negative log-likelihood loss function and the second loss function may be a KL-divergence loss function.
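  • In that alternative, the soft-target term becomes a KL divergence between the two logit distributions and the hard-target term a negative log-likelihood loss. The sketch below is illustrative; the temperature T used to soften the logits is a common distillation convention assumed here, not something stated in the application.

```python
import torch.nn.functional as F

def student_loss_kl(student_logits, labels, teacher_logits, w1=0.3, w2=0.7, T=2.0):
    hard = F.nll_loss(F.log_softmax(student_logits, dim=-1), labels)   # negative log-likelihood
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean")                             # KL divergence on softened logits
    return w1 * hard + w2 * soft
```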
  • The following is a complete example of the model compression method of the present application. The teacher model is obtained by fusing three types of complex models: the BERT-wwm-ext model, the Ernie-1.0 model, and the RoBERTa-large-pair model. K is 10, i.e., each type of complex model is trained with 10-fold cross-validation. The training data includes multiple training samples and can be divided into 10 parts, each containing several training samples. These 10 parts are combined in turn: 9 of the 10 parts are used for model training and the remaining 1 part is used for model testing, so 10 groups of data can be formed, each group containing 9 training parts and 1 test part.
  • The three types of complex models are then fused. Specifically, the 10 logit values output by the 10 logit layers of the 10 BERT-wwm-ext models are first added and averaged, and the result is used as the logit value output by the logit layer of the BERT-wwm-ext type of model, recorded as the first logit value; the 10 logit values output by the 10 logit layers of the 10 Ernie-1.0 models are added and averaged, and the result is used as the logit value output by the logit layer of the Ernie-1.0 type of model, recorded as the second logit value; the 10 logit values output by the 10 logit layers of the 10 RoBERTa-large-pair models are added and averaged, and the result is used as the logit value output by the logit layer of the RoBERTa-large-pair type of model, recorded as the third logit value. Next, the first logit value, the second logit value, and the third logit value are added and averaged, and the result is used as the logit value output by the logit layer of the teacher model. In this way the three types of complex models are fused to obtain the teacher model.
  • The next step is to train the student model, for example a SiaGRU model; the first loss function is a cross-entropy loss function and the second loss function is a squared-error loss function.
  • The training data used to train the student model may be the same as the training data used to train the complex models described above; that is, the multiple training samples included in that training data are used to train the student model, as detailed below.
  • Each training sample includes a sample input and a sample output. The sample input of the first training sample is fed into the student model, which outputs a predicted value while its logit layer outputs a logit value; these are recorded as the predicted value of the student model and the logit value of the student model under the first training sample. The first training sample is also fed into the trained teacher model, whose logit layer outputs a logit value, recorded as the logit value of the teacher model under the first training sample.
  • The process of training the student model with each subsequent training sample is the same, i.e., similar to the process of training the student model with the first training sample above, and is not repeated here.
  • The condition that the function value of the loss function meets the preset requirement, as mentioned here, is for example that the function value of the loss function is greater than or equal to a preset value.
  • The division of the above methods into steps is only for clarity of description. In implementation, steps may be combined into one step, or a step may be split into multiple steps; as long as the same logical relationship is included, they fall within the protection scope of this patent. Adding insignificant modifications to an algorithm or process, or introducing insignificant designs, without changing the core design of the algorithm or process, also falls within the protection scope of this patent.
  • An embodiment of the present application also relates to a model compression system, as shown in Fig. 3. The model compression system includes a complex model training unit 301, a teacher model acquisition unit 302, and a student model training unit 303.
  • The complex model training unit 301 is used for providing N types of trained complex models; N is an integer greater than or equal to 2. The teacher model acquisition unit 302 is used to fuse the N types of complex models to obtain a trained teacher model. The student model training unit 303 is used to train the student model based on the training samples, the teacher model, and the loss function of the student model; the loss function of the student model is obtained by fusing the first loss function and the second loss function, the first loss function is used to calculate the loss between the predicted value and the true value of the student model, and the second loss function is used to calculate the loss between the logit value of the student model and the logit value of the teacher model.
  • Each training sample includes a sample input and a sample output; the student model outputs the predicted value after receiving the sample input, the logit layer in the student model outputs its logit value, and the true value is the sample output. After the teacher model receives the sample input, the logit layer in the teacher model outputs its logit value.
  • In one embodiment, each type of complex model is obtained based on K-fold cross-validation training, and the trained models of each type include K trained complex models of that type; K is an integer greater than or equal to 2.
  • Fusing the N types of complex models to obtain a trained teacher model includes: for each type of complex model, fusing the K logit layers of the K complex models to obtain the logit layer of that type of complex model; and fusing the N logit layers of the N types of complex models to serve as the logit layer of the teacher model.
  • Fusing the K logit layers of the K complex models to obtain the logit layer of each type of complex model includes: for each type of complex model, adding the K logit values output by the K logit layers of the K complex models and taking the average as the logit value output by the logit layer of that type of complex model.
  • If a training sample is selected for the first time, its sample input is fed to the teacher model to obtain the teacher model's logit value, and that logit value is saved in a preset storage unit; if the training sample is not being selected for the first time, the teacher model's logit value is obtained from the storage unit.
  • The fusion of the first loss function and the second loss function is a weighted fusion, and the weight of the second loss function is greater than the weight of the first loss function.
  • The first loss function is a cross-entropy loss function, and the second loss function is a squared-error loss function. The student model is a SiaGRU model; and/or N is three, and the three types of complex models are the BERT-wwm-ext model, the Ernie-1.0 model, and the RoBERTa-large-pair model.
  • The embodiment of the model compression system corresponds to the embodiment of the model compression method described above, and the relevant technical details mentioned in the method embodiment remain valid in the system embodiment; to reduce repetition they are not repeated here. Correspondingly, the relevant technical details mentioned in the system embodiment can also be applied to the method embodiment.
  • Each module involved in the embodiment of the model compression system is a logical module. A logical unit can be one physical unit, part of a physical unit, or a combination of multiple physical units. Units that are not closely related to solving the technical problem proposed in the present application are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
  • An embodiment of the present application also relates to an electronic device, as shown in Fig. 4, including: at least one processor 401; and a memory 402 communicatively connected to the at least one processor 401. The memory 402 stores instructions executable by the at least one processor 401, and the instructions are executed by the at least one processor 401 so that the at least one processor 401 can execute the above model compression method.
  • The memory and the processor are connected by a bus. The bus may include any number of interconnected buses and bridges, and links one or more processors and various circuits of the memory together. The bus may also connect various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore not described further here.
  • A bus interface provides an interface between the bus and the transceiver. The transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing a means for communicating with various other devices over a transmission medium. Data processed by the processor is transmitted over the wireless medium through the antenna; the antenna also receives data and passes it to the processor.
  • The processor is responsible for managing the bus and general processing, and can also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions, while the memory may be used to store data that the processor uses when performing operations.
  • An embodiment of the present application also relates to a computer-readable storage medium storing a computer program. The above method embodiments are implemented when the computer program is executed by a processor.
  • An embodiment of the present application also relates to a computer program. The above method embodiments are implemented when the computer program is executed by a processor.
  • The program is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of the present application relate to the technical field of machine learning. Disclosed are a model compression method and system, an electronic device, and a storage medium. The model compression method comprises: providing N types of trained complex models; fusing the N types of complex models to obtain a trained teacher model; and training a student model on the basis of a training sample, the teacher model, and a loss function of the student model, wherein the loss function of the student model is obtained by fusing a first loss function with a second loss function; the first loss function is used for calculating the loss between a predicted value and a real value of the student model; the second loss function is used for calculating the loss between a logit value of the student model and a logit value of the teacher model. According to the technical solution provided in the embodiments of the present application, the prediction precision of the student model obtained by training can be improved.

Description

Model compression method, system, electronic device, and storage medium
Cross Reference
This application is filed based on the Chinese patent application with application number 2021107322788, filed on June 29, 2021, and claims priority to that Chinese patent application, the entire content of which is incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the technical field of machine learning, and in particular to a model compression method, system, electronic device, and storage medium.
Background
Text similarity matching is widely used. For example, in information retrieval, a retrieval system can use similarity to identify words related to the search terms so that more results similar to the search terms are recalled, improving the recall rate. In automatic question answering with natural language interaction, similarity can be used to measure how well a user's natural-language question matches the questions in a corpus, and the answer corresponding to the best-matching question is returned as the response.
In recent years, the emergence of the BERT model has set new records on multiple natural language processing tasks such as text classification, text similarity, and machine translation, and many artificial intelligence companies are gradually applying the BERT model to actual engineering projects. Although BERT performs well, the model is so large that it not only places high demands on hardware performance but also takes a long time to process data. Consequently, knowledge distillation has been used to obtain a lightweight model and thereby overcome the problems of high hardware requirements and long processing time caused by an overly large model. In the existing knowledge distillation approach, a single trained complex model serves as the teacher model, and this teacher model is used to guide the learning of a lightweight student model, thereby transferring the dark knowledge in the teacher model to the student model.
Summary of the Invention
The purpose of the embodiments of the present application is to provide a model compression method, an electronic device, and a storage medium that can improve the prediction accuracy of a trained student model.
An embodiment of the present application provides a model compression method, including: providing N types of trained complex models, N being an integer greater than or equal to 2; fusing the N types of complex models to obtain a trained teacher model; and training a student model based on training samples, the teacher model, and a loss function of the student model. The loss function of the student model is obtained by fusing a first loss function and a second loss function; the first loss function is used to calculate the loss between the predicted value and the true value of the student model, and the second loss function is used to calculate the loss between the logit value of the student model and the logit value of the teacher model. Each training sample includes a sample input and a sample output; after receiving the sample input, the student model outputs the predicted value and the logit layer in the student model outputs its logit value, and the true value is the sample output; after the teacher model receives the sample input, the logit layer in the teacher model outputs its logit value.
An embodiment of the present application also provides a model compression system, including: a complex model training unit, configured to provide N types of trained complex models, N being an integer greater than or equal to 2; a teacher model acquisition unit, configured to fuse the N types of complex models to obtain a trained teacher model; and a student model training unit, configured to train a student model based on training samples, the teacher model, and a loss function of the student model. The loss function of the student model is obtained by fusing a first loss function and a second loss function; the first loss function is used to calculate the loss between the predicted value and the true value of the student model, and the second loss function is used to calculate the loss between the logit value of the student model and the logit value of the teacher model. Each training sample includes a sample input and a sample output; after receiving the sample input, the student model outputs the predicted value and the logit layer in the student model outputs its logit value, and the true value is the sample output; after the teacher model receives the sample input, the logit layer in the teacher model outputs its logit value.
An embodiment of the present application also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the above model compression method.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program, where the computer program implements the above model compression method when executed by a processor.
An embodiment of the present application further provides a computer program that implements the above model compression method when executed by a processor.
In the embodiments of the present application, during knowledge-distillation-based model compression, the teacher model is obtained by fusing N types of complex models, so the strengths of multiple types of complex models can be absorbed and the teacher model becomes more comprehensive. The loss function of the student model is likewise obtained by fusing a first loss function and a second loss function: the first loss function calculates the loss between the predicted value and the true value of the student model, realizing training on hard targets, while the second loss function calculates the loss between the logit value of the student model and the logit value of the teacher model, realizing training on soft targets. Because the loss function of the student model combines hard-target and soft-target training, the training accuracy is better. Therefore, the model compression method of the embodiments of the present application can improve the prediction accuracy of the trained student model.
Description of the Drawings
One or more embodiments are illustrated by the figures in the corresponding drawings; these illustrations do not limit the embodiments. Elements with the same reference numerals in the drawings denote similar elements, and unless otherwise stated, the figures are not drawn to scale.
Fig. 1 is a flowchart of a model compression method according to one embodiment of the present application;
Fig. 2 is a flowchart of a model compression method according to another embodiment of the present application;
Fig. 3 is a block diagram of a model compression system according to one embodiment of the present application;
Fig. 4 is a block diagram of an electronic device according to one embodiment of the present application.
Specific Embodiments
To make the purpose, technical solutions, and advantages of the embodiments of the present application clearer, the embodiments of the present application are described in detail below with reference to the drawings. However, those of ordinary skill in the art can understand that many technical details are given in the embodiments so that readers can better understand the present application; even without these technical details and the various changes and modifications based on the following embodiments, the technical solutions claimed in the present application can still be realized. The division into the following embodiments is for convenience of description and should not constitute any limitation on the specific implementation of the present application, and the embodiments can be combined with and refer to each other provided they do not contradict one another.
An embodiment of the present application relates to a model compression method, and the specific process is shown in Fig. 1.
Step 101: provide N types of trained complex models; N is an integer greater than or equal to 2.
Step 102: fuse the N types of complex models to obtain a trained teacher model.
Step 103: train the student model based on the training samples, the teacher model, and the loss function of the student model. The loss function of the student model is obtained by fusing the first loss function and the second loss function; the first loss function is used to calculate the loss between the predicted value and the true value of the student model, and the second loss function is used to calculate the loss between the logit value of the student model and the logit value of the teacher model. Each training sample includes a sample input and a sample output; the student model outputs the predicted value after receiving the sample input, the logit layer in the student model outputs its logit value, and the true value is the sample output; after the teacher model receives the sample input, the logit layer in the teacher model outputs its logit value. The logit layer in the student model is the fully connected layer in the student model, and the logit layer in the teacher model is the fully connected layer in the teacher model.
In the embodiment of the present application, during knowledge-distillation-based model compression, the teacher model is obtained by fusing N types of complex models, so the strengths of multiple types of complex models can be absorbed and the teacher model becomes more comprehensive. The loss function of the student model is likewise obtained by fusing the first loss function and the second loss function: the first loss function calculates the loss between the predicted value and the true value of the student model, realizing training on hard targets, while the second loss function calculates the loss between the logit value of the student model and the logit value of the teacher model, realizing training on soft targets. Because the loss function of the student model combines hard-target and soft-target training, the training accuracy is better. Therefore, the model compression method of the embodiment of the present application can improve the prediction accuracy of the trained student model.
The model compression method of the embodiment of the present application uses knowledge distillation to compress complex models into lightweight models that are better suited to industrial applications. The lightweight model is, for example, a model needed in the field of natural language processing, such as a text similarity matching model. The model compression method can be executed by an electronic device, such as a server, a personal computer, or any other device with the processing capability needed to execute the method.
In one embodiment, N may be 3, and the three types of complex models are, for example, the BERT-wwm-ext model, the Ernie-1.0 model, and the RoBERTa-large-pair model. This embodiment does not limit the value of N; N can be determined as required. The student model is, for example, the SiaGRU model. Each type of complex model can be obtained by training, and in that case the trained models of each type include one complex model of that type. Step 102 in this embodiment may specifically be: fuse the N logit layers of the N complex models to serve as the logit layer of the teacher model. The fusion may be done by adding the N logit values output by the N logit layers of the N complex models and taking the average as the logit value output by the logit layer of the teacher model; it is not limited to this, however, and the fusion may also, for example, weight and fuse the N logit values output by the N logit layers of the N complex models, and use the weighted fused value as the logit value output by the logit layer of the teacher model.
In one embodiment, each type of complex model is obtained based on K-fold cross-validation training, and the trained models of each type include K trained complex models of that type; K is an integer greater than or equal to 2. The K trained complex models of the same type have different values for their internal parameters. In other examples, the complex models can also be trained with the hold-out method, the bootstrap method, or other training schemes, in which case the number of trained complex models of each type is one.
As shown in Fig. 2, the model compression method includes: Step 201, provide N types of trained complex models; Step 202, fuse the N types of complex models to obtain a trained teacher model; Step 203, train the student model based on the training samples, the teacher model, and the loss function of the student model. The loss function of the student model is obtained by fusing the first loss function and the second loss function; the first loss function is used to calculate the loss between the predicted value and the true value of the student model, and the second loss function is used to calculate the loss between the logit value of the student model and the logit value of the teacher model. Steps 201 and 203 are similar to steps 101 and 103 in Fig. 1, respectively, and are not repeated here. In step 202 of this embodiment, the N types of complex models are fused to obtain a trained teacher model, which specifically includes: Step 2021, for each type of complex model, fuse the K logit layers of the K complex models to obtain the logit layer of that type of complex model; Step 2022, fuse the N logit layers of the N types of complex models to serve as the logit layer of the teacher model.
In one embodiment, the fusion of the logit layers can be: add the logit values output by the logit layers and take the average. That is, in step 2021, for each type of complex model, the K logit values output by the K logit layers of the K complex models are added and averaged to give the logit value output by the logit layer of that type of complex model; in step 2022, the N logit values output by the N logit layers of the N types of complex models are added and averaged to give the logit value output by the logit layer of the teacher model.
As an example, there are 3 types of complex models, namely the BERT-wwm-ext model, the Ernie-1.0 model, and the RoBERTa-large-pair model; if K is 10, then after training there are 10 BERT-wwm-ext models, 10 Ernie-1.0 models, and 10 RoBERTa-large-pair models.
First, the 10 logit values output by the 10 logit layers of the 10 BERT-wwm-ext models are added and averaged, and the result is used as the logit value output by the logit layer of the BERT-wwm-ext type of model, recorded as the first logit value; the 10 logit values output by the 10 logit layers of the 10 Ernie-1.0 models are added and averaged, and the result is used as the logit value output by the logit layer of the Ernie-1.0 type of model, recorded as the second logit value; the 10 logit values output by the 10 logit layers of the 10 RoBERTa-large-pair models are added and averaged, and the result is used as the logit value output by the logit layer of the RoBERTa-large-pair type of model, recorded as the third logit value.
Next, the first logit value, the second logit value, and the third logit value are added and averaged, and the result is used as the logit value output by the logit layer of the teacher model.
In other embodiments, the fusion of the logit layers may also be: assign a weight in advance to the logit value output by each logit layer, multiply the logit values output by the logit layers by their respective weights, and then add them. The fusion method for the logit layers can be set as required.
In one embodiment, the training samples may be a massive set of training samples. Massive training samples can be obtained from an existing database; for example, in some intelligent question-answering scenarios, the existing database contains a large number of questions grouped into categories by semantics, so question pairs of the same category can serve as the sample inputs of training samples, and the answer sentences corresponding to the question pairs of the same category serve as the sample outputs. Massive training samples can also be obtained from daily online logs; for example, in some intelligent question-answering scenarios, a large number of online logs are generated during actual question answering, and these logs can be used as training samples after being annotated by a labeling team. A large number of training samples can also be obtained from public data sets on the Internet such as LCQMC and BQ Corpus. During the training of the student model, a different training sample can be selected for each training iteration. Each training sample includes a sample input and a sample output; in each iteration, the sample input of the training sample can be fed to the student model and to the teacher model, at which point the student model outputs a predicted value, the logit layer of the student model outputs a logit value, and the logit layer of the teacher model outputs a logit value. The predicted value output by the student model and the true value are used as the input of the first loss function, which computes a first loss value; the logit value output by the student model and the logit value output by the teacher model are used as the input of the second loss function, which computes a second loss value. The first loss value and the second loss value are fused to give the loss value of the student model under that training sample. The electronic device then judges whether the loss value of the student model satisfies a preset training-completion condition; if it does not, another training sample is selected and the student model is trained again, until, after some iteration, the loss value of the student model satisfies the training-completion condition, at which point the training of the student model ends. Iteratively training the student model on massive training samples makes the prediction accuracy of the trained student model higher. In other embodiments, the training samples may be few in number, or there may even be only one training sample.
在对学生模型进行迭代训练中,无论训练样本是海量的,还是少量的,一个训练样本是可以被重复使用。由于教师模型是已经训练完成的,所以,在同一个训练样本下,该教师模型的logit值是相同的;即,当该教师模型接收同一个训练样本中的样本输入后,该教师模型的logit层输出的logit值是相同的。因此,如果该训练样本在迭代训练中被重复用到,没有必要每次都重新计算该训练样本下该教师模型的logit值。具体的,如果该训练样本首次被选择用于训练,那么,向该教师模型输入该训练样本的样本输入,以得到该教师模型的logit值,并将该教师模型的logit值保存在预设的存储单元;若该训练样本非首次被选择,从该存储单元获取该教师模型的logit值。其中,存储单元中可以保存训练样本的识别标识与教师模型的logit值的对应关系;这样,当该训练样本被选择时,可以根据该训练样本的标识从存储单元中获取对应的教师模型的logit值。训练样本的标识例如可以是训练样本的样本编号。由于从存储单元直接获取教师模型的logit值显然比通过教师模型得到教师模型的logit值的数据处理量更小、速度更快,因此,在训练样本需要重复使用的情况下,将首次计算出的该训练样本下的教师模型的logit值存储起来以便后续使用时直接获取,能够减轻模型训练负担、且提高模型训练速度。In the iterative training of the student model, no matter whether the training samples are large or small, a training sample can be reused. Since the teacher model has been trained, the logit value of the teacher model is the same under the same training sample; that is, when the teacher model receives the sample input in the same training sample, the logit value of the teacher model The logit values output by the layers are the same. Therefore, if the training sample is used repeatedly in iterative training, it is not necessary to recalculate the logit value of the teacher model under the training sample every time. Specifically, if the training sample is selected for training for the first time, then input the sample input of the training sample to the teacher model to obtain the logit value of the teacher model, and save the logit value of the teacher model in the preset A storage unit; if the training sample is not selected for the first time, obtain the logit value of the teacher model from the storage unit. Wherein, the corresponding relationship between the identification mark of the training sample and the logit value of the teacher model can be saved in the storage unit; like this, when the training sample is selected, the logit of the corresponding teacher model can be obtained from the storage unit according to the identification of the training sample. value. The identifier of the training sample can be, for example, the sample number of the training sample. Since the logit value of the teacher model directly obtained from the storage unit is obviously smaller and faster than the logit value of the teacher model obtained through the teacher model, therefore, when the training samples need to be reused, the first calculated The logit value of the teacher model under the training sample is stored for direct acquisition in subsequent use, which can reduce the burden of model training and improve the speed of model training.
In one embodiment, when performing the step of training the student model, the electronic device may first feed the sample inputs of all training samples into the teacher model to obtain the teacher model's logit value under each training sample, and store these logit values in the storage unit; that is, the storage unit stores the correspondence between the identifier of each training sample and the logit value of the teacher model. Later, when the student model is being trained on a given training sample and the teacher model's logit value under that sample is needed to evaluate the second loss function, the corresponding logit value of the teacher model can be fetched directly from the storage unit according to the identifier of that training sample.
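Under the same assumptions, this precomputation could be sketched as follows (the sample format of an identifier, a sample input, and a sample output is assumed):

```python
import torch

def precompute_teacher_logits(teacher, training_samples):
    # Build the storage unit up front: identifier of each training sample -> teacher logit value.
    cache = {}
    with torch.no_grad():
        for sample_id, sample_input, _sample_output in training_samples:
            cache[sample_id] = teacher(sample_input)
    return cache
```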
In one embodiment, the first loss function and the second loss function are fused by weighted fusion; that is, weights are assigned to the two loss functions in advance, and the loss function of the student model is the sum of the first loss function and the second loss function each multiplied by its respective weight. The weights of the first loss function and the second loss function can be set according to the actual situation, for example chosen so that the trained student model achieves higher prediction accuracy. Preferably, the weight of the second loss function is greater than the weight of the first loss function; that is, training of the student model leans more heavily on the soft targets. In this way the teacher model exerts a larger influence on the student model, giving the trained student model better generalization ability.
In one embodiment, the first loss function is a cross-entropy loss function and the second loss function is a squared-difference loss function. In other embodiments, the first loss function may be a negative log-likelihood loss function and the second loss function may be a KL-divergence loss function.
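As an illustrative sketch of these two pairs of choices, assuming PyTorch-style tensors (the function names here are assumptions):

```python
import torch.nn.functional as F

# Embodiment 1: cross-entropy as the first loss, squared difference as the second loss.
def first_loss_cross_entropy(student_logits, true_labels):
    return F.cross_entropy(student_logits, true_labels)

def second_loss_squared_difference(student_logits, teacher_logits):
    return F.mse_loss(student_logits, teacher_logits)

# Other embodiments: negative log-likelihood as the first loss, KL divergence as the second loss.
def first_loss_nll(student_log_probs, true_labels):
    return F.nll_loss(student_log_probs, true_labels)

def second_loss_kl(student_logits, teacher_logits):
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
```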
The following is a complete example of the model compression method of the present application.
The teacher model is obtained by fusing three types of complex models: a BERT-wwm-ext model, an Ernie-1.0 model, and a RoBERTa-large-pair model. K is 10, i.e. each type of complex model is trained based on 10-fold cross-validation. The training data includes multiple training samples and can be divided into 10 parts, each part containing several training samples. These 10 parts are combined in turn: 9 of the 10 parts are used for model training and the remaining 1 part for model testing. In this way 10 groups of data can be formed, each group comprising 9 parts of training data and 1 part of test data.
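A minimal sketch of this 10-fold grouping, using scikit-learn's KFold purely for illustration (the method itself does not prescribe any particular library):

```python
from sklearn.model_selection import KFold

def ten_fold_groups(training_data):
    # Divide the training data into 10 parts and combine them in turn:
    # 9 parts for model training, the remaining 1 part for model testing.
    kfold = KFold(n_splits=10, shuffle=True, random_state=0)
    groups = []
    for train_idx, test_idx in kfold.split(training_data):
        train_parts = [training_data[i] for i in train_idx]
        test_part = [training_data[i] for i in test_idx]
        groups.append((train_parts, test_part))
    return groups  # 10 groups, each with 9 parts of training data and 1 part of test data
```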
Training the BERT-wwm-ext model with these 10 groups of data yields 10 trained BERT-wwm-ext models with different internal parameter values. Likewise, training the Ernie-1.0 model with the 10 groups of data yields 10 trained Ernie-1.0 models with different internal parameter values, and training the RoBERTa-large-pair model with the 10 groups of data yields 10 trained RoBERTa-large-pair models with different internal parameter values.
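One way to sketch this per-type training over the 10 groups (make_model, fit, and evaluate are placeholders for whatever training and testing routine is actually used, not real library calls):

```python
def train_type_over_groups(make_model, groups):
    # groups: the 10 (training parts, test part) pairs obtained above (assumed structure).
    trained_models = []
    for train_parts, test_part in groups:
        model = make_model()       # fresh BERT-wwm-ext / Ernie-1.0 / RoBERTa-large-pair instance
        model.fit(train_parts)     # placeholder training call
        model.evaluate(test_part)  # placeholder testing call
        trained_models.append(model)
    return trained_models          # 10 trained models with different internal parameter values
```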
The three types of complex models are then fused. Specifically, first, the 10 logit values output by the 10 logit layers of the 10 BERT-wwm-ext models are summed and averaged, and the result serves as the logit value output by the logit layer of the BERT-wwm-ext type of model; it is denoted the first logit value. The 10 logit values output by the 10 logit layers of the 10 Ernie-1.0 models are summed and averaged to serve as the logit value output by the logit layer of the Ernie-1.0 type of model, denoted the second logit value. The 10 logit values output by the 10 logit layers of the 10 RoBERTa-large-pair models are summed and averaged to serve as the logit value output by the logit layer of the RoBERTa-large-pair type of model, denoted the third logit value. Second, the first logit value, the second logit value, and the third logit value are summed and averaged, and the result serves as the logit value output by the logit layer of the teacher model. In this way the three types of complex models are fused to obtain the teacher model.
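A sketch of this two-level fusion, assuming each trained model returns its logit-layer output when called on a sample input (the averaging via torch.stack is an illustrative choice):

```python
import torch

def fuse_type_logits(models_of_one_type, sample_input):
    # Sum and average the K logit values output by the K models of one type (e.g. 10 BERT-wwm-ext models).
    logits = [model(sample_input) for model in models_of_one_type]
    return torch.stack(logits).mean(dim=0)

def teacher_logit(model_types, sample_input):
    # model_types: e.g. [bert_wwm_ext_models, ernie_models, roberta_large_pair_models],
    # each entry being the list of 10 trained models of that type (assumed structure).
    type_logits = [fuse_type_logits(models, sample_input) for models in model_types]
    # Sum and average the N per-type logit values to obtain the logit value of the teacher model.
    return torch.stack(type_logits).mean(dim=0)
```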
Next, the student model is trained. The student model is, for example, a SiaGRU model; the first loss function is a cross-entropy loss function and the second loss function is a squared-difference loss function. The training data used to train the student model may be the same as the training data used above to train the complex models, i.e. the multiple training samples included in that training data are used for training. The details are as follows.
The student model is first trained with the 1st training sample, where each training sample includes a sample input and a sample output.
First, after the sample input of the 1st training sample is fed into the student model, the student model outputs a predicted value and the logit layer of the student model outputs a logit value; these are denoted the predicted value of the student model and the logit value of the student model under the 1st training sample. The sample input of the 1st training sample is also fed into the trained teacher model, whose logit layer outputs a logit value, denoted the logit value of the teacher model under the 1st training sample.
Second, the function value of the cross-entropy loss function is computed from the predicted value of the student model and the true value (i.e. the sample output of the training sample), and the function value of the squared-difference loss function is computed from the logit value of the student model and the logit value of the teacher model. The two function values are then fused by weighting to obtain the function value of the student model's loss function. If, under the 1st training sample, the function value of the student model's loss function does not meet the preset requirement, the student model is trained with the 2nd training sample; if it still does not meet the preset requirement under the 2nd training sample, the student model is trained with the 3rd training sample, and so on, until under some training sample the function value of the student model's loss function meets the preset requirement. The process of training the student model with each training sample is the same as the process described above for the 1st training sample and is not repeated here. Meeting the preset requirement means, for example, that the function value of the loss function is less than or equal to a preset value.
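The sample-by-sample loop described above could be sketched as follows (preset_value, step_fn, and the sample format are assumptions; step_fn performs one training iteration, such as the distillation step sketched earlier, and returns the fused loss value):

```python
def train_until_satisfied(training_samples, step_fn, preset_value):
    # Train with the 1st training sample, then the 2nd, and so on, until under some
    # training sample the function value of the student model's loss function
    # meets the preset requirement.
    loss_value = None
    for sample_id, sample_input, sample_output in training_samples:
        loss_value = step_fn(sample_input, sample_output)
        if loss_value <= preset_value:  # preset requirement met: end training of the student model
            break
    return loss_value
```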
The step divisions of the various methods above are only for clarity of description; in implementation, steps may be combined into a single step, or a step may be split into multiple steps, and as long as the same logical relationship is included, they fall within the protection scope of this patent. Adding insignificant modifications to an algorithm or flow, or introducing insignificant designs, without changing the core design of the algorithm and flow also falls within the protection scope of this patent.
The embodiments of the present application also relate to a model compression system, as shown in FIG. 3. The model compression system includes a complex model training unit 301, a teacher model acquisition unit 302, and a student model training unit 303.
The complex model training unit 301 is configured to provide N types of trained complex models, N being an integer greater than or equal to 2.
The teacher model acquisition unit 302 is configured to fuse the N types of complex models to obtain a trained teacher model.
The student model training unit 303 is configured to train the student model based on the training samples, the teacher model, and the loss function of the student model. The loss function of the student model is obtained by fusing a first loss function and a second loss function; the first loss function is used to compute the loss between the predicted value of the student model and the true value, and the second loss function is used to compute the loss between the logit value of the student model and the logit value of the teacher model.
The training sample includes a sample input and a sample output; the student model outputs the predicted value after receiving the sample input and the logit layer in the student model outputs the logit value, the true value being the sample output; after the teacher model receives the sample input, the logit layer in the teacher model outputs its logit value.
In one embodiment, each type of complex model is trained based on K-fold cross-validation, and the trained complex models of each type include K trained complex models belonging to that type, K being an integer greater than or equal to 2.
Fusing the N types of complex models to obtain the trained teacher model includes: for each type of complex model, fusing the K logit layers of the K complex models to obtain the logit layer of that type of complex model; and fusing the N logit layers of the N types of complex models to serve as the logit layer of the teacher model.
In one embodiment, fusing the K logit layers of the K complex models of each type to obtain the logit layer of that type includes: summing and averaging the K logit values output by the K logit layers of the K complex models, the result serving as the logit value output by the logit layer of that type of complex model. Fusing the N logit layers of the N types of complex models as the logit layer of the teacher model includes: summing and averaging the N logit values output by the N logit layers of the N types of complex models, the result serving as the logit value output by the logit layer of the teacher model.
In one embodiment, during training of the student model, if a training sample is selected for the first time, the sample input is fed into the teacher model to obtain the teacher model's logit value, and that logit value is saved in a preset storage unit; if the training sample is not being selected for the first time, the teacher model's logit value is obtained from the storage unit.
In one embodiment, the first loss function and the second loss function are fused by weighted fusion, and the weight of the second loss function is greater than the weight of the first loss function.
In one embodiment, the first loss function is a cross-entropy loss function and the second loss function is a squared-difference loss function.
In one embodiment, the student model is a SiaGRU model; and/or N is three, and the three types of complex models are a BERT-wwm-ext model, an Ernie-1.0 model, and a RoBERTa-large-pair model, with the student model being a SiaGRU model.
It is not difficult to see that the embodiments of the model compression system correspond to the embodiments of the model compression method above; the technical details described in the method embodiments remain valid in the system embodiments and, to reduce repetition, are not repeated here. Correspondingly, the technical details described in the system embodiments can also be applied in the method embodiments above.
It is worth mentioning that the modules involved in the embodiments of the model compression system are all logical modules. In practical applications, a logical unit may be one physical unit, part of a physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present application, units that are not closely related to solving the technical problem raised by the present application are not introduced in this embodiment, which does not mean that no other units exist in this embodiment.
The embodiments of the present application also relate to an electronic device, as shown in FIG. 4, including: at least one processor 401; and a memory 402 communicatively connected to the at least one processor 401, where the memory 402 stores instructions executable by the at least one processor 401, and the instructions are executed by the at least one processor 401 so that the at least one processor 401 can perform the model compression method described above.
The memory and the processor are connected by a bus. The bus may include any number of interconnected buses and bridges, linking one or more processors and the various circuits of the memory together. The bus may also connect together various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore are not further described here. A bus interface provides an interface between the bus and the transceiver. The transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other devices over a transmission medium. Data processed by the processor is transmitted over a wireless medium via an antenna; the antenna also receives data and passes it to the processor.
The processor is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfacing, voltage regulation, power management, and other control functions, while the memory may be used to store data used by the processor when performing operations.
The embodiments of the present application also relate to a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the above method embodiments are implemented.
The embodiments of the present application also relate to a computer program. When the computer program is executed by a processor, the above method embodiments are implemented.
That is, those skilled in the art can understand that all or part of the steps in the methods of the above embodiments can be implemented by instructing the relevant hardware through a program. The program is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Those of ordinary skill in the art can understand that the above embodiments are specific embodiments for implementing the present application, and that in practical applications various changes may be made to them in form and detail without departing from the spirit and scope of the present application.

Claims (11)

  1. A model compression method, characterized by comprising:
    providing N types of trained complex models, N being an integer greater than or equal to 2;
    fusing the N types of complex models to obtain a trained teacher model; and
    training a student model based on training samples, the teacher model, and a loss function of the student model, the loss function of the student model being obtained by fusing a first loss function and a second loss function, wherein the first loss function is used to compute a loss between a predicted value of the student model and a true value, and the second loss function is used to compute a loss between a logit value of the student model and a logit value of the teacher model;
    wherein the training sample comprises a sample input and a sample output; the student model outputs the predicted value after receiving the sample input and a logit layer in the student model outputs the logit value, the true value being the sample output; and after the teacher model receives the sample input, a logit layer in the teacher model outputs the logit value of the teacher model.
  2. The model compression method according to claim 1, characterized in that each type of complex model is trained based on K-fold cross-validation, the trained complex models of each type comprising K trained complex models belonging to that type, K being an integer greater than or equal to 2;
    wherein fusing the N types of complex models to obtain the trained teacher model comprises:
    for each type of complex model, fusing the K logit layers of the K complex models to obtain the logit layer of that type of complex model; and
    fusing the N logit layers of the N types of complex models to serve as the logit layer of the teacher model.
  3. The model compression method according to claim 2, characterized in that:
    fusing, for each type of complex model, the K logit layers of the K complex models to obtain the logit layer of that type of complex model comprises: for each type of complex model, summing and averaging the K logit values output by the K logit layers of the K complex models, the result serving as the logit value output by the logit layer of that type of complex model; and
    fusing the N logit layers of the N types of complex models to serve as the logit layer of the teacher model comprises: summing and averaging the N logit values output by the N logit layers of the N types of complex models, the result serving as the logit value output by the logit layer of the teacher model.
  4. The model compression method according to claim 1, characterized in that, during training of the student model, if the training sample is selected for the first time, the sample input is fed into the teacher model to obtain the logit value of the teacher model, and the logit value of the teacher model is saved in a preset storage unit; and if the training sample is not being selected for the first time, the logit value of the teacher model is obtained from the storage unit.
  5. The model compression method according to claim 1, characterized in that the first loss function and the second loss function are fused by weighted fusion, and the weight of the second loss function is greater than the weight of the first loss function.
  6. The model compression method according to claim 1, characterized in that the first loss function is a cross-entropy loss function and the second loss function is a squared-difference loss function.
  7. The model compression method according to any one of claims 1 to 6, characterized in that the student model is a SiaGRU model; and/or
    N is three, and the three types of complex models are a BERT-wwm-ext model, an Ernie-1.0 model, and a RoBERTa-large-pair model, the student model being a SiaGRU model.
  8. A model compression system, characterized by comprising:
    a complex model training unit, configured to provide N types of trained complex models, N being an integer greater than or equal to 2;
    a teacher model acquisition unit, configured to fuse the N types of complex models to obtain a trained teacher model; and
    a student model training unit, configured to train a student model based on training samples, the teacher model, and a loss function of the student model, the loss function of the student model being obtained by fusing a first loss function and a second loss function, wherein the first loss function is used to compute a loss between a predicted value of the student model and a true value, and the second loss function is used to compute a loss between a logit value of the student model and a logit value of the teacher model;
    wherein the training sample comprises a sample input and a sample output; the student model outputs the predicted value after receiving the sample input and a logit layer in the student model outputs the logit value, the true value being the sample output; and after the teacher model receives the sample input, a logit layer in the teacher model outputs the logit value of the teacher model.
  9. An electronic device, characterized by comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor;
    wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the model compression method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the model compression method according to any one of claims 1 to 7.
  11. A computer program, characterized in that the computer program, when executed by a processor, implements the model compression method according to any one of claims 1 to 7.
PCT/CN2021/140780 2021-06-29 2021-12-23 Model compression method and system, electronic device, and storage medium WO2023273237A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110732278.8 2021-06-29
CN202110732278.8A CN115238903B (en) 2021-06-29 2021-06-29 Model compression method, system, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2023273237A1 true WO2023273237A1 (en) 2023-01-05

Family

ID=83666651

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/140780 WO2023273237A1 (en) 2021-06-29 2021-12-23 Model compression method and system, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN115238903B (en)
WO (1) WO2023273237A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021095176A1 (en) * 2019-11-13 2021-05-20 日本電気株式会社 Learning device, learning method, and recording medium
CN112182362A (en) * 2020-08-31 2021-01-05 华为技术有限公司 Method and device for training model for online click rate prediction and recommendation system
CN112418343A (en) * 2020-12-08 2021-02-26 中山大学 Multi-teacher self-adaptive joint knowledge distillation

Also Published As

Publication number Publication date
CN115238903B (en) 2023-10-03
CN115238903A (en) 2022-10-25

Similar Documents

Publication Publication Date Title
US10720071B2 (en) Dynamic identification and validation of test questions from a corpus
CN111125309A (en) Natural language processing method and device, computing equipment and storage medium
KR102259390B1 (en) System and method for ensemble question-answering
US20210390873A1 (en) Deep knowledge tracing with transformers
CN111753076B (en) Dialogue method, dialogue device, electronic equipment and readable storage medium
Xue et al. Generative adversarial learning for optimizing ontology alignment
CN117149989B (en) Training method for large language model, text processing method and device
CN113806487A (en) Semantic search method, device, equipment and storage medium based on neural network
CN116136870A (en) Intelligent social conversation method and conversation system based on enhanced entity representation
Hai Chatgpt: The evolution of natural language processing
WO2023273237A1 (en) Model compression method and system, electronic device, and storage medium
CN116610795A (en) Text retrieval method and device
CN116308757A (en) Credit wind control model training method and device based on knowledge distillation, electronic equipment and computer medium
Kumari et al. Domain-Specific Chatbot Development Using the Deep Learning-Based RASA Framework
US11886821B2 (en) Method and system for inferring answers from knowledge graphs
US11605307B2 (en) Assessing student understanding
Chen Measurement, evaluation, and model construction of mathematical literacy based on iot and pisa
Kazi et al. A survey of deep learning techniques for machine reading comprehension
Du et al. Semantic-enhanced reasoning question answering over temporal knowledge graphs
EP4328805A1 (en) Method and apparatus for generating target deep learning model
CN117272937B (en) Text coding model training method, device, equipment and storage medium
Ayana et al. Reinforced Zero-Shot Cross-Lingual Neural Headline Generation
CN116306917B (en) Task processing method, device, equipment and computer storage medium
CN112380353B (en) Knowledge engineering-based spacecraft overall design method, system and storage medium
Ali et al. SWFQA Semantic Web Based Framework for Question Answering

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21948162

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE