WO2023273237A1 - Model compression method and system, electronic device, and storage medium - Google Patents

Model compression method and system, electronic device, and storage medium Download PDF

Info

Publication number
WO2023273237A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
logit
loss function
value
student
Prior art date
Application number
PCT/CN2021/140780
Other languages
French (fr)
Chinese (zh)
Inventor
陈贝
Original Assignee
达闼机器人股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 达闼机器人股份有限公司
Publication of WO2023273237A1 publication Critical patent/WO2023273237A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • The embodiments of the present application relate to the technical field of machine learning, and in particular to a model compression method, system, electronic device, and storage medium.
  • Text similarity matching is widely used. For example, in information retrieval, a retrieval system can use similarity to identify words related to the search terms and thereby recall more similar results, improving the recall rate. In automatic question answering with natural language interaction, similarity can be used to measure how well a user's natural-language question matches the questions in a corpus, and the answer corresponding to the best-matching question is returned as the response.
  • The purpose of the embodiments of the present application is to provide a model compression method, an electronic device, and a storage medium that can improve the prediction accuracy of a trained student model.
  • An embodiment of the present application provides a model compression method, including: providing N types of trained complex models, N being an integer greater than or equal to 2; fusing the N types of complex models to obtain a trained teacher model; and training a student model based on training samples, the teacher model, and a loss function of the student model. The loss function of the student model is obtained by fusing a first loss function and a second loss function; the first loss function is used to calculate the loss between the predicted value and the true value of the student model, and the second loss function is used to calculate the loss between the logit value of the student model and the logit value of the teacher model. Each training sample includes a sample input and a sample output; after receiving the sample input, the student model outputs the predicted value and the logit layer in the student model outputs its logit value, and the true value is the sample output; after the teacher model receives the sample input, the logit layer in the teacher model outputs its logit value.
  • An embodiment of the present application also provides a model compression system, including: a complex model training unit, configured to provide N types of trained complex models, N being an integer greater than or equal to 2; a teacher model acquisition unit, configured to fuse the N types of complex models to obtain a trained teacher model; and a student model training unit, configured to train a student model based on training samples, the teacher model, and a loss function of the student model. The loss function of the student model is obtained by fusing a first loss function and a second loss function; the first loss function is used to calculate the loss between the predicted value and the true value of the student model, and the second loss function is used to calculate the loss between the logit value of the student model and the logit value of the teacher model. Each training sample includes a sample input and a sample output; after receiving the sample input, the student model outputs the predicted value and the logit layer in the student model outputs its logit value, and the true value is the sample output; after the teacher model receives the sample input, the logit layer in the teacher model outputs its logit value.
  • An embodiment of the present application also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the model compression method described above.
  • An embodiment of the present application also provides a computer-readable storage medium storing a computer program, where the computer program implements the above model compression method when executed by a processor.
  • An embodiment of the present application further provides a computer program, wherein the computer program implements the above-mentioned model compression method when executed by a processor.
  • In the knowledge-distillation-based model compression of the embodiments of the present application, the teacher model is obtained by fusing N types of complex models, so the strengths of multiple types of complex models can be absorbed and the teacher model becomes more comprehensive. The loss function of the student model is likewise obtained by fusing a first loss function and a second loss function: the first loss function calculates the loss between the predicted value and the true value of the student model, realizing training on hard targets, while the second loss function calculates the loss between the logit value of the student model and the logit value of the teacher model, realizing training on soft targets. Because the loss function of the student model combines hard-target and soft-target training, the training accuracy is better. Therefore, the model compression method of the embodiments of the present application can improve the prediction accuracy of the trained student model.
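  • The fused loss can be written as a weighted sum of a hard-target term and a soft-target term. The sketch below is a minimal PyTorch-style illustration, assuming the cross-entropy first loss, squared-error second loss, and weighting preference described later in this document; the weight values w1 and w2 are illustrative placeholders, not values prescribed by the application.

```python
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()   # first loss: student prediction vs. true label (hard target)
mse_loss = nn.MSELoss()           # second loss: student logits vs. teacher logits (soft target)

def student_loss(student_logits, labels, teacher_logits, w1=0.3, w2=0.7):
    # Weighted fusion of the two losses; w2 > w1 emphasizes soft-target training.
    hard = ce_loss(student_logits, labels)
    soft = mse_loss(student_logits, teacher_logits.detach())  # teacher is fixed, no gradient
    return w1 * hard + w2 * soft
```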
  • Fig. 1 is a flowchart of a model compression method according to one embodiment of the present application.
  • Fig. 2 is a flowchart of a model compression method according to another embodiment of the present application.
  • Fig. 3 is a block diagram of a model compression system according to one embodiment of the present application.
  • Fig. 4 is a block diagram of an electronic device according to one embodiment of the present application.
  • An embodiment of the present application relates to a model compression method, and the specific process is shown in Fig. 1.
  • Step 101: provide N types of trained complex models; N is an integer greater than or equal to 2.
  • Step 102: fuse the N types of complex models to obtain a trained teacher model.
  • Step 103: train the student model based on the training samples, the teacher model, and the loss function of the student model.
  • The loss function of the student model is obtained by fusing the first loss function and the second loss function. The first loss function is used to calculate the loss between the predicted value and the true value of the student model, and the second loss function is used to calculate the loss between the logit value of the student model and the logit value of the teacher model. Each training sample includes a sample input and a sample output; the student model outputs the predicted value after receiving the sample input, the logit layer in the student model outputs its logit value, and the true value is the sample output. After the teacher model receives the sample input, the logit layer in the teacher model outputs its logit value.
  • The logit layer in the student model is the fully connected layer in the student model, and the logit layer in the teacher model is the fully connected layer in the teacher model.
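  • As a concrete illustration of where the logit value comes from, the sketch below shows a toy Siamese-style text-pair student whose final fully connected layer is the logit layer; the architecture, layer sizes, and class names here are assumptions for illustration only and are not taken from the application.

```python
import torch
import torch.nn as nn

class TextPairStudent(nn.Module):
    """Illustrative lightweight student; only the final Linear (the "logit layer") matters here."""
    def __init__(self, vocab_size=30000, emb_dim=128, hidden=128, num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.logit_layer = nn.Linear(2 * hidden, num_classes)   # fully connected logit layer

    def forward(self, sent_a, sent_b):
        _, ha = self.gru(self.emb(sent_a))            # encode first sentence (token-id tensor)
        _, hb = self.gru(self.emb(sent_b))            # encode second sentence
        feats = torch.cat([ha[-1], hb[-1]], dim=-1)   # pair representation
        logits = self.logit_layer(feats)              # logit value used for distillation
        return logits, torch.softmax(logits, dim=-1)  # (logit value, predicted value)
```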
  • In the embodiment of the present application, during knowledge-distillation-based model compression, the teacher model is obtained by fusing N types of complex models, so the strengths of multiple types of complex models can be absorbed, making the teacher model more comprehensive. The loss function of the student model is likewise obtained by fusing the first loss function and the second loss function: the first loss function calculates the loss between the predicted value and the true value of the student model, realizing training on hard targets, while the second loss function calculates the loss between the logit value of the student model and the logit value of the teacher model, realizing training on soft targets. Since the loss function of the student model combines hard-target and soft-target training, the training accuracy is better. Therefore, the model compression method of the embodiment of the present application can improve the prediction accuracy of the trained student model.
  • The model compression method of the embodiment of the present application uses knowledge distillation to compress complex models into lightweight models that are better suited to industrial applications.
  • The lightweight model is, for example, a model needed in the field of natural language processing, such as a text similarity matching model.
  • The model compression method can be executed by an electronic device, such as a server, a personal computer, or any other device with the processing capability needed to execute the method.
  • N may be 3, and the three types of complex models are, for example, the BERT-wwm-ext model, the Ernie-1.0 model, and the RoBERTa-large-pair model.
  • This embodiment does not limit the value of N; N can be determined as required.
  • The student model is, for example, the SiaGRU model.
  • Each type of complex model can be obtained by training, and in that case the trained models of each type include one complex model of that type.
  • Step 102 in this embodiment may specifically be: fuse the N logit layers of the N complex models to serve as the logit layer of the teacher model. The fusion may be done by adding the N logit values output by the N logit layers of the N complex models and taking the average as the logit value output by the logit layer of the teacher model. It is not limited to this, however; the fusion may also, for example, weight and fuse the N logit values output by the N logit layers of the N complex models, and use the weighted fused value as the logit value output by the logit layer of the teacher model.
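  • A minimal sketch of the fusion just described: the teacher's logit is obtained by averaging (or weighted-averaging) the logits produced by the N complex models for the same sample input. The helper below is a hypothetical illustration of the arithmetic only.

```python
import torch

def fuse_logits(logits_list, weights=None):
    """logits_list: N tensors of identical shape, one per complex model."""
    stacked = torch.stack(logits_list, dim=0)      # (N, batch, num_classes)
    if weights is None:
        return stacked.mean(dim=0)                 # add the N logits and take the average
    w = torch.tensor(weights).view(-1, 1, 1)       # pre-assigned weight per model
    return (stacked * w).sum(dim=0)                # weighted fusion
```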
  • Each type of complex model may be obtained based on K-fold cross-validation training, in which case the trained models of each type include K trained complex models of that type, K being an integer greater than or equal to 2. The K trained complex models of the same type have different values for their internal parameters. In other examples, the complex models can also be trained with the hold-out method, the bootstrap method, or other training schemes, in which case the number of trained complex models of each type is one.
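  • The K-fold setup can be sketched as follows, using scikit-learn's KFold to produce the K train/test splits; train_one_model is a hypothetical routine standing in for fine-tuning one complex model (for example one BERT-wwm-ext instance) on a split.

```python
from sklearn.model_selection import KFold

def train_k_models(train_one_model, samples, k=10):
    """Returns K models of one type, each trained on a different K-fold split."""
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    models = []
    for train_idx, test_idx in kf.split(samples):
        train_part = [samples[i] for i in train_idx]   # K-1 folds for training
        test_part = [samples[i] for i in test_idx]     # 1 fold for testing
        models.append(train_one_model(train_part, test_part))
    return models
```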
  • As shown in Fig. 2, the model compression method includes: Step 201, provide N types of trained complex models; Step 202, fuse the N types of complex models to obtain a trained teacher model; Step 203, train the student model based on the training samples, the teacher model, and the loss function of the student model. The loss function of the student model is obtained by fusing the first loss function and the second loss function; the first loss function is used to calculate the loss between the predicted value and the true value of the student model, and the second loss function is used to calculate the loss between the logit value of the student model and the logit value of the teacher model. Steps 201 and 203 are similar to steps 101 and 103 in Fig. 1, respectively, and are not repeated here.
  • In step 202 of this embodiment, the N types of complex models are fused to obtain a trained teacher model, which specifically includes: Step 2021, for each type of complex model, fuse the K logit layers of the K complex models to obtain the logit layer of that type of complex model; Step 2022, fuse the N logit layers of the N types of complex models to serve as the logit layer of the teacher model.
  • The fusion of the logit layers can be: add the logit values output by the logit layers and take the average. That is, in step 2021, for each type of complex model, the K logit values output by the K logit layers of the K complex models are added and averaged to give the logit value output by the logit layer of that type of complex model; in step 2022, the N logit values output by the N logit layers of the N types of complex models are added and averaged to give the logit value output by the logit layer of the teacher model.
  • As an example, there are 3 types of complex models, namely the BERT-wwm-ext model, the Ernie-1.0 model, and the RoBERTa-large-pair model; if K is 10, then after training there are 10 BERT-wwm-ext models, 10 Ernie-1.0 models, and 10 RoBERTa-large-pair models.
  • The first logit value, the second logit value, and the third logit value are added and averaged, and the result is used as the logit value output by the logit layer of the teacher model.
  • The fusion of the logit layers may also be: assign a weight in advance to the logit value output by each logit layer, multiply the logit values output by the logit layers by their respective weights, and then add them. The fusion method for the logit layers can be set as required.
  • The training samples may be a massive set of training samples. Massive training samples can be obtained from an existing database; for example, in some intelligent question-answering scenarios, the existing database contains a large number of questions grouped into categories by semantics, so question pairs of the same category can serve as the sample inputs of training samples, and the answer sentences corresponding to the question pairs of the same category serve as the sample outputs. Massive training samples can also be obtained from daily online logs; for example, in some intelligent question-answering scenarios, a large number of online logs are generated during actual question answering, and these logs can be used as training samples after being annotated by a labeling team. A large number of training samples can also be obtained from public data sets on the Internet such as LCQMC and BQ Corpus.
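  • For instance, question pairs from a public paraphrase corpus such as LCQMC can be read into (sample input, sample output) pairs for the similarity-matching task. The loader below assumes a hypothetical tab-separated file with two sentences and a 0/1 label per line; the exact format of any given corpus may differ.

```python
def load_pairs(path):
    """Each training sample: sample input = (sentence_a, sentence_b), sample output = label."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            sent_a, sent_b, label = line.rstrip("\n").split("\t")
            samples.append(((sent_a, sent_b), int(label)))
    return samples
```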
  • During the training of the student model, a different training sample can be selected for each training iteration. Each training sample includes a sample input and a sample output. In each training iteration, the sample input of the training sample can be fed to the student model and to the teacher model; the student model then outputs a predicted value, the logit layer of the student model outputs a logit value, and the logit layer of the teacher model outputs a logit value.
  • The predicted value output by the student model and the true value are used as the input of the first loss function, which computes a first loss value; the logit value output by the student model and the logit value output by the teacher model are used as the input of the second loss function, which computes a second loss value. The first loss value and the second loss value are fused to give the loss value of the student model under that training sample. The electronic device then judges whether the loss value of the student model satisfies a preset training-completion condition; if it does not, another training sample is selected and the student model is trained again, until, after some iteration, the loss value of the student model satisfies the training-completion condition, at which point the training of the student model ends.
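  • As an illustration of the iteration just described, the sketch below strings the earlier pieces together. It is only a sketch: pick_sample and training_complete are hypothetical helpers, the student and teacher callables are assumed to take already-tokenized tensors, and student_loss is the fused loss from the earlier sketch.

```python
import torch

def train_student(student, teacher, samples, optimizer, max_steps=100000):
    for step in range(max_steps):
        (sent_a, sent_b), label = pick_sample(samples)     # a (possibly different) sample each iteration
        student_logits, _ = student(sent_a, sent_b)        # logit value and predicted value
        with torch.no_grad():
            teacher_logits = teacher(sent_a, sent_b)       # teacher logit under this sample
        loss = student_loss(student_logits, torch.tensor([label]), teacher_logits)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if training_complete(loss.item()):                 # preset training-completion condition
            break
```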
  • Iteratively training the student model on massive training samples makes the prediction accuracy of the trained student model higher. In other embodiments, the training samples may be few in number, or there may even be only one training sample.
  • Whether the training samples are massive or few, a training sample can be reused. Since the teacher model has already been trained, its logit value is the same under the same training sample; that is, whenever the teacher model receives the sample input of the same training sample, the logit layer of the teacher model outputs the same logit value. Therefore, if a training sample is reused during iterative training, it is not necessary to recompute the teacher model's logit value for that sample every time.
  • Specifically, if a training sample is selected for training for the first time, its sample input is fed to the teacher model to obtain the teacher model's logit value, and that logit value is saved in a preset storage unit; if the training sample is not being selected for the first time, the teacher model's logit value is obtained from the storage unit.
  • The storage unit can hold the correspondence between the identifier of a training sample and the teacher model's logit value; in this way, when a training sample is selected, the corresponding logit value of the teacher model can be retrieved from the storage unit according to the identifier of the training sample.
  • The identifier of a training sample can be, for example, the sample number of the training sample.
  • When executing the step of training the student model, the electronic device can first feed the sample input of each training sample to the teacher model to obtain the teacher model's logit value under each training sample, and store those logit values in the storage unit; that is, the storage unit holds the correspondence between the identifiers of the training samples and the teacher model's logit values. Later, when the student model is being trained on a certain training sample and the teacher model's logit value under that sample is needed to compute the second loss function, the corresponding logit value can be fetched directly from the storage unit according to the identifier of the training sample.
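  • A minimal sketch of this caching, keyed by the sample's identifier (for example its sample number). The dictionary stands in for the "storage unit", and get_teacher_logit is a hypothetical wrapper around the trained teacher, not an API from the application.

```python
import torch

teacher_logit_cache = {}   # maps sample identifier -> teacher logit value

def get_teacher_logit(teacher, sample_id, sample_input):
    if sample_id not in teacher_logit_cache:                 # first time this sample is selected
        with torch.no_grad():
            teacher_logit_cache[sample_id] = teacher(*sample_input)
    return teacher_logit_cache[sample_id]                    # reused directly on later iterations
```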
  • The fusion of the first loss function and the second loss function is a weighted fusion; that is, weights are assigned to the first loss function and the second loss function in advance, and the loss function of the student model is the sum of the first loss function and the second loss function each multiplied by its own weight. The weights of the first and second loss functions can be set according to the actual situation; for example, weights can be chosen that give the trained student model higher prediction accuracy. Preferably, the weight of the second loss function is greater than the weight of the first loss function; that is, the training of the student model leans more toward training on soft targets. In this way the teacher model has a greater influence on the student model, so that the generalization ability of the trained student model is better.
  • In one embodiment, the first loss function is a cross-entropy loss function and the second loss function is a squared-error loss function. In other embodiments, the first loss function may be a negative log-likelihood loss function and the second loss function may be a KL-divergence loss function.
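  • In that alternative, the soft-target term becomes a KL divergence between the two logit distributions and the hard-target term a negative log-likelihood loss. The sketch below is illustrative; the temperature T used to soften the logits is a common distillation convention assumed here, not something stated in the application.

```python
import torch.nn.functional as F

def student_loss_kl(student_logits, labels, teacher_logits, w1=0.3, w2=0.7, T=2.0):
    hard = F.nll_loss(F.log_softmax(student_logits, dim=-1), labels)   # negative log-likelihood
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean")                             # KL divergence on softened logits
    return w1 * hard + w2 * soft
```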
  • The following is a complete example of the model compression method of the present application. The teacher model is obtained by fusing three types of complex models: the BERT-wwm-ext model, the Ernie-1.0 model, and the RoBERTa-large-pair model. K is 10, i.e., each type of complex model is trained with 10-fold cross-validation. The training data includes multiple training samples and can be divided into 10 parts, each containing several training samples. These 10 parts are combined in turn: 9 of the 10 parts are used for model training and the remaining 1 part is used for model testing, so 10 groups of data can be formed, each group containing 9 training parts and 1 test part.
  • The three types of complex models are then fused. Specifically, the 10 logit values output by the 10 logit layers of the 10 BERT-wwm-ext models are first added and averaged, and the result is used as the logit value output by the logit layer of the BERT-wwm-ext type of model, recorded as the first logit value; the 10 logit values output by the 10 logit layers of the 10 Ernie-1.0 models are added and averaged, and the result is used as the logit value output by the logit layer of the Ernie-1.0 type of model, recorded as the second logit value; the 10 logit values output by the 10 logit layers of the 10 RoBERTa-large-pair models are added and averaged, and the result is used as the logit value output by the logit layer of the RoBERTa-large-pair type of model, recorded as the third logit value. Next, the first logit value, the second logit value, and the third logit value are added and averaged, and the result is used as the logit value output by the logit layer of the teacher model. In this way the three types of complex models are fused to obtain the teacher model.
  • The next step is to train the student model, for example a SiaGRU model; the first loss function is a cross-entropy loss function and the second loss function is a squared-error loss function.
  • The training data used to train the student model may be the same as the training data used to train the complex models described above; that is, the multiple training samples included in that training data are used to train the student model, as detailed below.
  • Each training sample includes a sample input and a sample output. The sample input of the first training sample is fed into the student model, which outputs a predicted value while its logit layer outputs a logit value; these are recorded as the predicted value of the student model and the logit value of the student model under the first training sample. The first training sample is also fed into the trained teacher model, whose logit layer outputs a logit value, recorded as the logit value of the teacher model under the first training sample.
  • The process of training the student model with each subsequent training sample is the same, i.e., similar to the process of training the student model with the first training sample above, and is not repeated here.
  • The condition that the function value of the loss function meets the preset requirement, as mentioned here, is for example that the function value of the loss function is greater than or equal to a preset value.
  • The division of the above methods into steps is only for clarity of description. In implementation, steps may be combined into one step, or a step may be split into multiple steps; as long as the same logical relationship is included, they fall within the protection scope of this patent. Adding insignificant modifications to an algorithm or process, or introducing insignificant designs, without changing the core design of the algorithm or process, also falls within the protection scope of this patent.
  • An embodiment of the present application also relates to a model compression system, as shown in Fig. 3. The model compression system includes a complex model training unit 301, a teacher model acquisition unit 302, and a student model training unit 303.
  • The complex model training unit 301 is used for providing N types of trained complex models; N is an integer greater than or equal to 2. The teacher model acquisition unit 302 is used to fuse the N types of complex models to obtain a trained teacher model. The student model training unit 303 is used to train the student model based on the training samples, the teacher model, and the loss function of the student model; the loss function of the student model is obtained by fusing the first loss function and the second loss function, the first loss function is used to calculate the loss between the predicted value and the true value of the student model, and the second loss function is used to calculate the loss between the logit value of the student model and the logit value of the teacher model.
  • Each training sample includes a sample input and a sample output; the student model outputs the predicted value after receiving the sample input, the logit layer in the student model outputs its logit value, and the true value is the sample output. After the teacher model receives the sample input, the logit layer in the teacher model outputs its logit value.
  • In one embodiment, each type of complex model is obtained based on K-fold cross-validation training, and the trained models of each type include K trained complex models of that type; K is an integer greater than or equal to 2.
  • Fusing the N types of complex models to obtain a trained teacher model includes: for each type of complex model, fusing the K logit layers of the K complex models to obtain the logit layer of that type of complex model; and fusing the N logit layers of the N types of complex models to serve as the logit layer of the teacher model.
  • Fusing the K logit layers of the K complex models to obtain the logit layer of each type of complex model includes: for each type of complex model, adding the K logit values output by the K logit layers of the K complex models and taking the average as the logit value output by the logit layer of that type of complex model.
  • If a training sample is selected for the first time, its sample input is fed to the teacher model to obtain the teacher model's logit value, and that logit value is saved in a preset storage unit; if the training sample is not being selected for the first time, the teacher model's logit value is obtained from the storage unit.
  • The fusion of the first loss function and the second loss function is a weighted fusion, and the weight of the second loss function is greater than the weight of the first loss function.
  • The first loss function is a cross-entropy loss function, and the second loss function is a squared-error loss function. The student model is a SiaGRU model; and/or N is three, and the three types of complex models are the BERT-wwm-ext model, the Ernie-1.0 model, and the RoBERTa-large-pair model.
  • The embodiment of the model compression system corresponds to the embodiment of the model compression method described above, and the relevant technical details mentioned in the method embodiment remain valid in the system embodiment; to reduce repetition they are not repeated here. Correspondingly, the relevant technical details mentioned in the system embodiment can also be applied to the method embodiment.
  • Each module involved in the embodiment of the model compression system is a logical module. A logical unit can be one physical unit, part of a physical unit, or a combination of multiple physical units. Units that are not closely related to solving the technical problem proposed in the present application are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
  • An embodiment of the present application also relates to an electronic device, as shown in Fig. 4, including: at least one processor 401; and a memory 402 communicatively connected to the at least one processor 401. The memory 402 stores instructions executable by the at least one processor 401, and the instructions are executed by the at least one processor 401 so that the at least one processor 401 can execute the above model compression method.
  • The memory and the processor are connected by a bus. The bus may include any number of interconnected buses and bridges, and links one or more processors and various circuits of the memory together. The bus may also connect various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore not described further here.
  • A bus interface provides an interface between the bus and the transceiver. The transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing a means for communicating with various other devices over a transmission medium. Data processed by the processor is transmitted over the wireless medium through the antenna; the antenna also receives data and passes it to the processor.
  • The processor is responsible for managing the bus and general processing, and can also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions, while the memory may be used to store data that the processor uses when performing operations.
  • An embodiment of the present application also relates to a computer-readable storage medium storing a computer program. The above method embodiments are implemented when the computer program is executed by a processor.
  • An embodiment of the present application also relates to a computer program. The above method embodiments are implemented when the computer program is executed by a processor.
  • The program is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of the present application relate to the technical field of machine learning. Disclosed are a model compression method and system, an electronic device, and a storage medium. The model compression method comprises: providing N types of trained complex models; fusing the N types of complex models to obtain a trained teacher model; and training a student model on the basis of a training sample, the teacher model, and a loss function of the student model, wherein the loss function of the student model is obtained by fusing a first loss function with a second loss function; the first loss function is used for calculating the loss between a predicted value and a real value of the student model; the second loss function is used for calculating the loss between a logit value of the student model and a logit value of the teacher model. According to the technical solution provided in the embodiments of the present application, the prediction precision of the student model obtained by training can be improved.

Description

Model compression method, system, electronic device, and storage medium
Cross Reference
This application is filed based on the Chinese patent application with application number 2021107322788, filed on June 29, 2021, and claims priority to that Chinese patent application, the entire content of which is incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the technical field of machine learning, and in particular to a model compression method, system, electronic device, and storage medium.
Background
Text similarity matching is widely used. For example, in information retrieval, a retrieval system can use similarity to identify words related to the search terms so that more results similar to the search terms are recalled, improving the recall rate. In automatic question answering with natural language interaction, similarity can be used to measure how well a user's natural-language question matches the questions in a corpus, and the answer corresponding to the best-matching question is returned as the response.
In recent years, the emergence of the BERT model has set new records on multiple natural language processing tasks such as text classification, text similarity, and machine translation, and many artificial intelligence companies are gradually applying the BERT model to actual engineering projects. Although BERT performs well, the model is so large that it not only places high demands on hardware performance but also takes a long time to process data. Consequently, knowledge distillation has been used to obtain a lightweight model and thereby overcome the problems of high hardware requirements and long processing time caused by an overly large model. In the existing knowledge distillation approach, a single trained complex model serves as the teacher model, and this teacher model is used to guide the learning of a lightweight student model, thereby transferring the dark knowledge in the teacher model to the student model.
Summary of the Invention
The purpose of the embodiments of the present application is to provide a model compression method, an electronic device, and a storage medium that can improve the prediction accuracy of a trained student model.
An embodiment of the present application provides a model compression method, including: providing N types of trained complex models, N being an integer greater than or equal to 2; fusing the N types of complex models to obtain a trained teacher model; and training a student model based on training samples, the teacher model, and a loss function of the student model. The loss function of the student model is obtained by fusing a first loss function and a second loss function; the first loss function is used to calculate the loss between the predicted value and the true value of the student model, and the second loss function is used to calculate the loss between the logit value of the student model and the logit value of the teacher model. Each training sample includes a sample input and a sample output; after receiving the sample input, the student model outputs the predicted value and the logit layer in the student model outputs its logit value, and the true value is the sample output; after the teacher model receives the sample input, the logit layer in the teacher model outputs its logit value.
An embodiment of the present application also provides a model compression system, including: a complex model training unit, configured to provide N types of trained complex models, N being an integer greater than or equal to 2; a teacher model acquisition unit, configured to fuse the N types of complex models to obtain a trained teacher model; and a student model training unit, configured to train a student model based on training samples, the teacher model, and a loss function of the student model. The loss function of the student model is obtained by fusing a first loss function and a second loss function; the first loss function is used to calculate the loss between the predicted value and the true value of the student model, and the second loss function is used to calculate the loss between the logit value of the student model and the logit value of the teacher model. Each training sample includes a sample input and a sample output; after receiving the sample input, the student model outputs the predicted value and the logit layer in the student model outputs its logit value, and the true value is the sample output; after the teacher model receives the sample input, the logit layer in the teacher model outputs its logit value.
An embodiment of the present application also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the above model compression method.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program, where the computer program implements the above model compression method when executed by a processor.
An embodiment of the present application further provides a computer program that implements the above model compression method when executed by a processor.
In the embodiments of the present application, during knowledge-distillation-based model compression, the teacher model is obtained by fusing N types of complex models, so the strengths of multiple types of complex models can be absorbed and the teacher model becomes more comprehensive. The loss function of the student model is likewise obtained by fusing a first loss function and a second loss function: the first loss function calculates the loss between the predicted value and the true value of the student model, realizing training on hard targets, while the second loss function calculates the loss between the logit value of the student model and the logit value of the teacher model, realizing training on soft targets. Because the loss function of the student model combines hard-target and soft-target training, the training accuracy is better. Therefore, the model compression method of the embodiments of the present application can improve the prediction accuracy of the trained student model.
Description of the Drawings
One or more embodiments are illustrated by the figures in the corresponding drawings; these illustrations do not limit the embodiments. Elements with the same reference numerals in the drawings denote similar elements, and unless otherwise stated, the figures are not drawn to scale.
Fig. 1 is a flowchart of a model compression method according to one embodiment of the present application;
Fig. 2 is a flowchart of a model compression method according to another embodiment of the present application;
Fig. 3 is a block diagram of a model compression system according to one embodiment of the present application;
Fig. 4 is a block diagram of an electronic device according to one embodiment of the present application.
Specific Embodiments
To make the purpose, technical solutions, and advantages of the embodiments of the present application clearer, the embodiments of the present application are described in detail below with reference to the drawings. However, those of ordinary skill in the art can understand that many technical details are given in the embodiments so that readers can better understand the present application; even without these technical details and the various changes and modifications based on the following embodiments, the technical solutions claimed in the present application can still be realized. The division into the following embodiments is for convenience of description and should not constitute any limitation on the specific implementation of the present application, and the embodiments can be combined with and refer to each other provided they do not contradict one another.
An embodiment of the present application relates to a model compression method, and the specific process is shown in Fig. 1.
Step 101: provide N types of trained complex models; N is an integer greater than or equal to 2.
Step 102: fuse the N types of complex models to obtain a trained teacher model.
Step 103: train the student model based on the training samples, the teacher model, and the loss function of the student model. The loss function of the student model is obtained by fusing the first loss function and the second loss function; the first loss function is used to calculate the loss between the predicted value and the true value of the student model, and the second loss function is used to calculate the loss between the logit value of the student model and the logit value of the teacher model. Each training sample includes a sample input and a sample output; the student model outputs the predicted value after receiving the sample input, the logit layer in the student model outputs its logit value, and the true value is the sample output; after the teacher model receives the sample input, the logit layer in the teacher model outputs its logit value. The logit layer in the student model is the fully connected layer in the student model, and the logit layer in the teacher model is the fully connected layer in the teacher model.
In the embodiment of the present application, during knowledge-distillation-based model compression, the teacher model is obtained by fusing N types of complex models, so the strengths of multiple types of complex models can be absorbed and the teacher model becomes more comprehensive. The loss function of the student model is likewise obtained by fusing the first loss function and the second loss function: the first loss function calculates the loss between the predicted value and the true value of the student model, realizing training on hard targets, while the second loss function calculates the loss between the logit value of the student model and the logit value of the teacher model, realizing training on soft targets. Because the loss function of the student model combines hard-target and soft-target training, the training accuracy is better. Therefore, the model compression method of the embodiment of the present application can improve the prediction accuracy of the trained student model.
The model compression method of the embodiment of the present application uses knowledge distillation to compress complex models into lightweight models that are better suited to industrial applications. The lightweight model is, for example, a model needed in the field of natural language processing, such as a text similarity matching model. The model compression method can be executed by an electronic device, such as a server, a personal computer, or any other device with the processing capability needed to execute the method.
In one embodiment, N may be 3, and the three types of complex models are, for example, the BERT-wwm-ext model, the Ernie-1.0 model, and the RoBERTa-large-pair model. This embodiment does not limit the value of N; N can be determined as required. The student model is, for example, the SiaGRU model. Each type of complex model can be obtained by training, and in that case the trained models of each type include one complex model of that type. Step 102 in this embodiment may specifically be: fuse the N logit layers of the N complex models to serve as the logit layer of the teacher model. The fusion may be done by adding the N logit values output by the N logit layers of the N complex models and taking the average as the logit value output by the logit layer of the teacher model; it is not limited to this, however, and the fusion may also, for example, weight and fuse the N logit values output by the N logit layers of the N complex models, and use the weighted fused value as the logit value output by the logit layer of the teacher model.
In one embodiment, each type of complex model is obtained based on K-fold cross-validation training, and the trained models of each type include K trained complex models of that type; K is an integer greater than or equal to 2. The K trained complex models of the same type have different values for their internal parameters. In other examples, the complex models can also be trained with the hold-out method, the bootstrap method, or other training schemes, in which case the number of trained complex models of each type is one.
As shown in Fig. 2, the model compression method includes: Step 201, provide N types of trained complex models; Step 202, fuse the N types of complex models to obtain a trained teacher model; Step 203, train the student model based on the training samples, the teacher model, and the loss function of the student model. The loss function of the student model is obtained by fusing the first loss function and the second loss function; the first loss function is used to calculate the loss between the predicted value and the true value of the student model, and the second loss function is used to calculate the loss between the logit value of the student model and the logit value of the teacher model. Steps 201 and 203 are similar to steps 101 and 103 in Fig. 1, respectively, and are not repeated here. In step 202 of this embodiment, the N types of complex models are fused to obtain a trained teacher model, which specifically includes: Step 2021, for each type of complex model, fuse the K logit layers of the K complex models to obtain the logit layer of that type of complex model; Step 2022, fuse the N logit layers of the N types of complex models to serve as the logit layer of the teacher model.
In one embodiment, the fusion of the logit layers can be: add the logit values output by the logit layers and take the average. That is, in step 2021, for each type of complex model, the K logit values output by the K logit layers of the K complex models are added and averaged to give the logit value output by the logit layer of that type of complex model; in step 2022, the N logit values output by the N logit layers of the N types of complex models are added and averaged to give the logit value output by the logit layer of the teacher model.
As an example, there are 3 types of complex models, namely the BERT-wwm-ext model, the Ernie-1.0 model, and the RoBERTa-large-pair model; if K is 10, then after training there are 10 BERT-wwm-ext models, 10 Ernie-1.0 models, and 10 RoBERTa-large-pair models.
First, the 10 logit values output by the 10 logit layers of the 10 BERT-wwm-ext models are added and averaged, and the result is used as the logit value output by the logit layer of the BERT-wwm-ext type of model, recorded as the first logit value; the 10 logit values output by the 10 logit layers of the 10 Ernie-1.0 models are added and averaged, and the result is used as the logit value output by the logit layer of the Ernie-1.0 type of model, recorded as the second logit value; the 10 logit values output by the 10 logit layers of the 10 RoBERTa-large-pair models are added and averaged, and the result is used as the logit value output by the logit layer of the RoBERTa-large-pair type of model, recorded as the third logit value.
Next, the first logit value, the second logit value, and the third logit value are added and averaged, and the result is used as the logit value output by the logit layer of the teacher model.
In other embodiments, the fusion of the logit layers may also be: assign a weight in advance to the logit value output by each logit layer, multiply the logit values output by the logit layers by their respective weights, and then add them. The fusion method for the logit layers can be set as required.
In one embodiment, the training samples may be a massive set of training samples. Massive training samples can be obtained from an existing database; for example, in some intelligent question-answering scenarios, the existing database contains a large number of questions grouped into categories by semantics, so question pairs of the same category can serve as the sample inputs of training samples, and the answer sentences corresponding to the question pairs of the same category serve as the sample outputs. Massive training samples can also be obtained from daily online logs; for example, in some intelligent question-answering scenarios, a large number of online logs are generated during actual question answering, and these logs can be used as training samples after being annotated by a labeling team. A large number of training samples can also be obtained from public data sets on the Internet such as LCQMC and BQ Corpus. During the training of the student model, a different training sample can be selected for each training iteration. Each training sample includes a sample input and a sample output; in each iteration, the sample input of the training sample can be fed to the student model and to the teacher model, at which point the student model outputs a predicted value, the logit layer of the student model outputs a logit value, and the logit layer of the teacher model outputs a logit value. The predicted value output by the student model and the true value are used as the input of the first loss function, which computes a first loss value; the logit value output by the student model and the logit value output by the teacher model are used as the input of the second loss function, which computes a second loss value. The first loss value and the second loss value are fused to give the loss value of the student model under that training sample. The electronic device then judges whether the loss value of the student model satisfies a preset training-completion condition; if it does not, another training sample is selected and the student model is trained again, until, after some iteration, the loss value of the student model satisfies the training-completion condition, at which point the training of the student model ends. Iteratively training the student model on massive training samples makes the prediction accuracy of the trained student model higher. In other embodiments, the training samples may be few in number, or there may even be only one training sample.
在对学生模型进行迭代训练中,无论训练样本是海量的,还是少量的,一个训练样本是可以被重复使用。由于教师模型是已经训练完成的,所以,在同一个训练样本下,该教师模型的logit值是相同的;即,当该教师模型接收同一个训练样本中的样本输入后,该教师模型的logit层输出的logit值是相同的。因此,如果该训练样本在迭代训练中被重复用到,没有必要每次都重新计算该训练样本下该教师模型的logit值。具体的,如果该训练样本首次被选择用于训练,那么,向该教师模型输入该训练样本的样本输入,以得到该教师模型的logit值,并将该教师模型的logit值保存在预设的存储单元;若该训练样本非首次被选择,从该存储单元获取该教师模型的logit值。其中,存储单元中可以保存训练样本的识别标识与教师模型的logit值的对应关系;这样,当该训练样本被选择时,可以根据该训练样本的标识从存储单元中获取对应的教师模型的logit值。训练样本的标识例如可以是训练样本的样本编号。由于从存储单元直接获取教师模型的logit值显然比通过教师模型得到教师模型的logit值的数据处理量更小、速度更快,因此,在训练样本需要重复使用的情况下,将首次计算出的该训练样本下的教师模型的logit值存储起来以便后续使用时直接获取,能够减轻模型训练负担、且提高模型训练速度。In the iterative training of the student model, no matter whether the training samples are large or small, a training sample can be reused. Since the teacher model has been trained, the logit value of the teacher model is the same under the same training sample; that is, when the teacher model receives the sample input in the same training sample, the logit value of the teacher model The logit values output by the layers are the same. Therefore, if the training sample is used repeatedly in iterative training, it is not necessary to recalculate the logit value of the teacher model under the training sample every time. Specifically, if the training sample is selected for training for the first time, then input the sample input of the training sample to the teacher model to obtain the logit value of the teacher model, and save the logit value of the teacher model in the preset A storage unit; if the training sample is not selected for the first time, obtain the logit value of the teacher model from the storage unit. Wherein, the corresponding relationship between the identification mark of the training sample and the logit value of the teacher model can be saved in the storage unit; like this, when the training sample is selected, the logit of the corresponding teacher model can be obtained from the storage unit according to the identification of the training sample. value. The identifier of the training sample can be, for example, the sample number of the training sample. Since the logit value of the teacher model directly obtained from the storage unit is obviously smaller and faster than the logit value of the teacher model obtained through the teacher model, therefore, when the training samples need to be reused, the first calculated The logit value of the teacher model under the training sample is stored for direct acquisition in subsequent use, which can reduce the burden of model training and improve the speed of model training.
In one embodiment, when performing the step of training the student model, the electronic device may first feed the sample inputs of all training samples into the teacher model to obtain the teacher model's logit value under each training sample, and store these logit values in the storage unit; that is, the storage unit stores the correspondence between the identifier of each training sample and the logit value of the teacher model. Later, when the student model is being trained on a given training sample and the teacher model's logit value under that sample is needed to evaluate the second loss function, the corresponding logit value of the teacher model can be fetched directly from the storage unit according to the identifier of that training sample.
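Under the same assumptions, this precomputation could be sketched as follows (the sample format of an identifier, a sample input, and a sample output is assumed):

```python
import torch

def precompute_teacher_logits(teacher, training_samples):
    # Build the storage unit up front: identifier of each training sample -> teacher logit value.
    cache = {}
    with torch.no_grad():
        for sample_id, sample_input, _sample_output in training_samples:
            cache[sample_id] = teacher(sample_input)
    return cache
```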
In one embodiment, the first loss function and the second loss function are fused by weighted fusion; that is, weights are assigned to the two loss functions in advance, and the loss function of the student model is the sum of the first loss function and the second loss function each multiplied by its respective weight. The weights of the first loss function and the second loss function can be set according to the actual situation, for example chosen so that the trained student model achieves higher prediction accuracy. Preferably, the weight of the second loss function is greater than the weight of the first loss function; that is, training of the student model leans more heavily on the soft targets. In this way the teacher model exerts a larger influence on the student model, giving the trained student model better generalization ability.
In one embodiment, the first loss function is a cross-entropy loss function and the second loss function is a squared-difference loss function. In other embodiments, the first loss function may be a negative log-likelihood loss function and the second loss function may be a KL-divergence loss function.
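As an illustrative sketch of these two pairs of choices, assuming PyTorch-style tensors (the function names here are assumptions):

```python
import torch.nn.functional as F

# Embodiment 1: cross-entropy as the first loss, squared difference as the second loss.
def first_loss_cross_entropy(student_logits, true_labels):
    return F.cross_entropy(student_logits, true_labels)

def second_loss_squared_difference(student_logits, teacher_logits):
    return F.mse_loss(student_logits, teacher_logits)

# Other embodiments: negative log-likelihood as the first loss, KL divergence as the second loss.
def first_loss_nll(student_log_probs, true_labels):
    return F.nll_loss(student_log_probs, true_labels)

def second_loss_kl(student_logits, teacher_logits):
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
```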
The following is a complete example of the model compression method of the present application.
The teacher model is obtained by fusing three types of complex models: a BERT-wwm-ext model, an Ernie-1.0 model, and a RoBERTa-large-pair model. K is 10, i.e. each type of complex model is trained based on 10-fold cross-validation. The training data includes multiple training samples and can be divided into 10 parts, each part containing several training samples. These 10 parts are combined in turn: 9 of the 10 parts are used for model training and the remaining 1 part for model testing. In this way 10 groups of data can be formed, each group comprising 9 parts of training data and 1 part of test data.
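A minimal sketch of this 10-fold grouping, using scikit-learn's KFold purely for illustration (the method itself does not prescribe any particular library):

```python
from sklearn.model_selection import KFold

def ten_fold_groups(training_data):
    # Divide the training data into 10 parts and combine them in turn:
    # 9 parts for model training, the remaining 1 part for model testing.
    kfold = KFold(n_splits=10, shuffle=True, random_state=0)
    groups = []
    for train_idx, test_idx in kfold.split(training_data):
        train_parts = [training_data[i] for i in train_idx]
        test_part = [training_data[i] for i in test_idx]
        groups.append((train_parts, test_part))
    return groups  # 10 groups, each with 9 parts of training data and 1 part of test data
```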
Training the BERT-wwm-ext model with these 10 groups of data yields 10 trained BERT-wwm-ext models with different internal parameter values. Likewise, training the Ernie-1.0 model with the 10 groups of data yields 10 trained Ernie-1.0 models with different internal parameter values, and training the RoBERTa-large-pair model with the 10 groups of data yields 10 trained RoBERTa-large-pair models with different internal parameter values.
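One way to sketch this per-type training over the 10 groups (make_model, fit, and evaluate are placeholders for whatever training and testing routine is actually used, not real library calls):

```python
def train_type_over_groups(make_model, groups):
    # groups: the 10 (training parts, test part) pairs obtained above (assumed structure).
    trained_models = []
    for train_parts, test_part in groups:
        model = make_model()       # fresh BERT-wwm-ext / Ernie-1.0 / RoBERTa-large-pair instance
        model.fit(train_parts)     # placeholder training call
        model.evaluate(test_part)  # placeholder testing call
        trained_models.append(model)
    return trained_models          # 10 trained models with different internal parameter values
```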
The three types of complex models are then fused. Specifically, first, the 10 logit values output by the 10 logit layers of the 10 BERT-wwm-ext models are summed and averaged, and the result serves as the logit value output by the logit layer of the BERT-wwm-ext type of model; it is denoted the first logit value. The 10 logit values output by the 10 logit layers of the 10 Ernie-1.0 models are summed and averaged to serve as the logit value output by the logit layer of the Ernie-1.0 type of model, denoted the second logit value. The 10 logit values output by the 10 logit layers of the 10 RoBERTa-large-pair models are summed and averaged to serve as the logit value output by the logit layer of the RoBERTa-large-pair type of model, denoted the third logit value. Second, the first logit value, the second logit value, and the third logit value are summed and averaged, and the result serves as the logit value output by the logit layer of the teacher model. In this way the three types of complex models are fused to obtain the teacher model.
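A sketch of this two-level fusion, assuming each trained model returns its logit-layer output when called on a sample input (the averaging via torch.stack is an illustrative choice):

```python
import torch

def fuse_type_logits(models_of_one_type, sample_input):
    # Sum and average the K logit values output by the K models of one type (e.g. 10 BERT-wwm-ext models).
    logits = [model(sample_input) for model in models_of_one_type]
    return torch.stack(logits).mean(dim=0)

def teacher_logit(model_types, sample_input):
    # model_types: e.g. [bert_wwm_ext_models, ernie_models, roberta_large_pair_models],
    # each entry being the list of 10 trained models of that type (assumed structure).
    type_logits = [fuse_type_logits(models, sample_input) for models in model_types]
    # Sum and average the N per-type logit values to obtain the logit value of the teacher model.
    return torch.stack(type_logits).mean(dim=0)
```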
Next, the student model is trained. The student model is, for example, a SiaGRU model; the first loss function is a cross-entropy loss function and the second loss function is a squared-difference loss function. The training data used to train the student model may be the same as the training data used above to train the complex models, i.e. the multiple training samples included in that training data are used for training. The details are as follows.
The student model is first trained with the 1st training sample, where each training sample includes a sample input and a sample output.
First, after the sample input of the 1st training sample is fed into the student model, the student model outputs a predicted value and the logit layer of the student model outputs a logit value; these are denoted the predicted value of the student model and the logit value of the student model under the 1st training sample. The sample input of the 1st training sample is also fed into the trained teacher model, whose logit layer outputs a logit value, denoted the logit value of the teacher model under the 1st training sample.
Second, the function value of the cross-entropy loss function is computed from the predicted value of the student model and the true value (i.e. the sample output of the training sample), and the function value of the squared-difference loss function is computed from the logit value of the student model and the logit value of the teacher model. The two function values are then fused by weighting to obtain the function value of the student model's loss function. If, under the 1st training sample, the function value of the student model's loss function does not meet the preset requirement, the student model is trained with the 2nd training sample; if it still does not meet the preset requirement under the 2nd training sample, the student model is trained with the 3rd training sample, and so on, until under some training sample the function value of the student model's loss function meets the preset requirement. The process of training the student model with each training sample is the same as the process described above for the 1st training sample and is not repeated here. Meeting the preset requirement means, for example, that the function value of the loss function is less than or equal to a preset value.
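The sample-by-sample loop described above could be sketched as follows (preset_value, step_fn, and the sample format are assumptions; step_fn performs one training iteration, such as the distillation step sketched earlier, and returns the fused loss value):

```python
def train_until_satisfied(training_samples, step_fn, preset_value):
    # Train with the 1st training sample, then the 2nd, and so on, until under some
    # training sample the function value of the student model's loss function
    # meets the preset requirement.
    loss_value = None
    for sample_id, sample_input, sample_output in training_samples:
        loss_value = step_fn(sample_input, sample_output)
        if loss_value <= preset_value:  # preset requirement met: end training of the student model
            break
    return loss_value
```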
The step divisions of the various methods above are only for clarity of description; in implementation, steps may be combined into a single step, or a step may be split into multiple steps, and as long as the same logical relationship is included, they fall within the protection scope of this patent. Adding insignificant modifications to an algorithm or flow, or introducing insignificant designs, without changing the core design of the algorithm and flow also falls within the protection scope of this patent.
The embodiments of the present application also relate to a model compression system, as shown in FIG. 3. The model compression system includes a complex model training unit 301, a teacher model acquisition unit 302, and a student model training unit 303.
The complex model training unit 301 is configured to provide N types of trained complex models, N being an integer greater than or equal to 2.
The teacher model acquisition unit 302 is configured to fuse the N types of complex models to obtain a trained teacher model.
The student model training unit 303 is configured to train the student model based on the training samples, the teacher model, and the loss function of the student model. The loss function of the student model is obtained by fusing a first loss function and a second loss function; the first loss function is used to compute the loss between the predicted value of the student model and the true value, and the second loss function is used to compute the loss between the logit value of the student model and the logit value of the teacher model.
The training sample includes a sample input and a sample output; the student model outputs the predicted value after receiving the sample input and the logit layer in the student model outputs the logit value, the true value being the sample output; after the teacher model receives the sample input, the logit layer in the teacher model outputs its logit value.
In one embodiment, each type of complex model is trained based on K-fold cross-validation, and the trained complex models of each type include K trained complex models belonging to that type, K being an integer greater than or equal to 2.
Fusing the N types of complex models to obtain the trained teacher model includes: for each type of complex model, fusing the K logit layers of the K complex models to obtain the logit layer of that type of complex model; and fusing the N logit layers of the N types of complex models to serve as the logit layer of the teacher model.
In one embodiment, fusing the K logit layers of the K complex models of each type to obtain the logit layer of that type includes: summing and averaging the K logit values output by the K logit layers of the K complex models, the result serving as the logit value output by the logit layer of that type of complex model. Fusing the N logit layers of the N types of complex models as the logit layer of the teacher model includes: summing and averaging the N logit values output by the N logit layers of the N types of complex models, the result serving as the logit value output by the logit layer of the teacher model.
In one embodiment, during training of the student model, if a training sample is selected for the first time, the sample input is fed into the teacher model to obtain the teacher model's logit value, and that logit value is saved in a preset storage unit; if the training sample is not being selected for the first time, the teacher model's logit value is obtained from the storage unit.
In one embodiment, the first loss function and the second loss function are fused by weighted fusion, and the weight of the second loss function is greater than the weight of the first loss function.
In one embodiment, the first loss function is a cross-entropy loss function and the second loss function is a squared-difference loss function.
In one embodiment, the student model is a SiaGRU model; and/or N is three, and the three types of complex models are a BERT-wwm-ext model, an Ernie-1.0 model, and a RoBERTa-large-pair model, with the student model being a SiaGRU model.
It is not difficult to see that the embodiments of the model compression system correspond to the embodiments of the model compression method above; the technical details described in the method embodiments remain valid in the system embodiments and, to reduce repetition, are not repeated here. Correspondingly, the technical details described in the system embodiments can also be applied in the method embodiments above.
It is worth mentioning that the modules involved in the embodiments of the model compression system are all logical modules. In practical applications, a logical unit may be one physical unit, part of a physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present application, units that are not closely related to solving the technical problem raised by the present application are not introduced in this embodiment, which does not mean that no other units exist in this embodiment.
The embodiments of the present application also relate to an electronic device, as shown in FIG. 4, including: at least one processor 401; and a memory 402 communicatively connected to the at least one processor 401, where the memory 402 stores instructions executable by the at least one processor 401, and the instructions are executed by the at least one processor 401 so that the at least one processor 401 can perform the model compression method described above.
The memory and the processor are connected by a bus. The bus may include any number of interconnected buses and bridges, linking one or more processors and the various circuits of the memory together. The bus may also connect together various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore are not further described here. A bus interface provides an interface between the bus and the transceiver. The transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other devices over a transmission medium. Data processed by the processor is transmitted over a wireless medium via an antenna; the antenna also receives data and passes it to the processor.
The processor is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfacing, voltage regulation, power management, and other control functions, while the memory may be used to store data used by the processor when performing operations.
The embodiments of the present application also relate to a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the above method embodiments are implemented.
The embodiments of the present application also relate to a computer program. When the computer program is executed by a processor, the above method embodiments are implemented.
That is, those skilled in the art can understand that all or part of the steps in the methods of the above embodiments can be implemented by instructing the relevant hardware through a program. The program is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Those of ordinary skill in the art can understand that the above embodiments are specific embodiments for implementing the present application, and that in practical applications various changes may be made to them in form and detail without departing from the spirit and scope of the present application.

Claims (11)

  1. A model compression method, characterized by comprising:
    providing N types of trained complex models, N being an integer greater than or equal to 2;
    fusing the N types of complex models to obtain a trained teacher model; and
    training a student model based on training samples, the teacher model, and a loss function of the student model, the loss function of the student model being obtained by fusing a first loss function and a second loss function, wherein the first loss function is used to compute a loss between a predicted value of the student model and a true value, and the second loss function is used to compute a loss between a logit value of the student model and a logit value of the teacher model;
    wherein the training sample comprises a sample input and a sample output; the student model outputs the predicted value after receiving the sample input and a logit layer in the student model outputs the logit value, the true value being the sample output; and after the teacher model receives the sample input, a logit layer in the teacher model outputs the logit value of the teacher model.
  2. The model compression method according to claim 1, characterized in that each type of complex model is trained based on K-fold cross-validation, the trained complex models of each type comprising K trained complex models belonging to that type, K being an integer greater than or equal to 2;
    wherein fusing the N types of complex models to obtain the trained teacher model comprises:
    for each type of complex model, fusing the K logit layers of the K complex models to obtain the logit layer of that type of complex model; and
    fusing the N logit layers of the N types of complex models to serve as the logit layer of the teacher model.
  3. The model compression method according to claim 2, characterized in that:
    fusing, for each type of complex model, the K logit layers of the K complex models to obtain the logit layer of that type of complex model comprises: for each type of complex model, summing and averaging the K logit values output by the K logit layers of the K complex models, the result serving as the logit value output by the logit layer of that type of complex model; and
    fusing the N logit layers of the N types of complex models to serve as the logit layer of the teacher model comprises: summing and averaging the N logit values output by the N logit layers of the N types of complex models, the result serving as the logit value output by the logit layer of the teacher model.
  4. The model compression method according to claim 1, characterized in that, during training of the student model, if the training sample is selected for the first time, the sample input is fed into the teacher model to obtain the logit value of the teacher model, and the logit value of the teacher model is saved in a preset storage unit; and if the training sample is not being selected for the first time, the logit value of the teacher model is obtained from the storage unit.
  5. The model compression method according to claim 1, characterized in that the first loss function and the second loss function are fused by weighted fusion, and the weight of the second loss function is greater than the weight of the first loss function.
  6. The model compression method according to claim 1, characterized in that the first loss function is a cross-entropy loss function and the second loss function is a squared-difference loss function.
  7. The model compression method according to any one of claims 1 to 6, characterized in that the student model is a SiaGRU model; and/or
    N is three, and the three types of complex models are a BERT-wwm-ext model, an Ernie-1.0 model, and a RoBERTa-large-pair model, the student model being a SiaGRU model.
  8. A model compression system, characterized by comprising:
    a complex model training unit, configured to provide N types of trained complex models, N being an integer greater than or equal to 2;
    a teacher model acquisition unit, configured to fuse the N types of complex models to obtain a trained teacher model; and
    a student model training unit, configured to train a student model based on training samples, the teacher model, and a loss function of the student model, the loss function of the student model being obtained by fusing a first loss function and a second loss function, wherein the first loss function is used to compute a loss between a predicted value of the student model and a true value, and the second loss function is used to compute a loss between a logit value of the student model and a logit value of the teacher model;
    wherein the training sample comprises a sample input and a sample output; the student model outputs the predicted value after receiving the sample input and a logit layer in the student model outputs the logit value, the true value being the sample output; and after the teacher model receives the sample input, a logit layer in the teacher model outputs the logit value of the teacher model.
  9. An electronic device, characterized by comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor;
    wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the model compression method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the model compression method according to any one of claims 1 to 7.
  11. A computer program, characterized in that the computer program, when executed by a processor, implements the model compression method according to any one of claims 1 to 7.
PCT/CN2021/140780 2021-06-29 2021-12-23 Model compression method and system, electronic device, and storage medium WO2023273237A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110732278.8 2021-06-29
CN202110732278.8A CN115238903B (en) 2021-06-29 2021-06-29 Model compression method, system, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2023273237A1 true WO2023273237A1 (en) 2023-01-05

Family

ID=83666651

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/140780 WO2023273237A1 (en) 2021-06-29 2021-12-23 Model compression method and system, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN115238903B (en)
WO (1) WO2023273237A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021095176A1 (en) * 2019-11-13 2021-05-20 日本電気株式会社 Learning device, learning method, and recording medium
CN112182362A (en) * 2020-08-31 2021-01-05 华为技术有限公司 Method and device for training model for online click rate prediction and recommendation system
CN112418343A (en) * 2020-12-08 2021-02-26 中山大学 Multi-teacher self-adaptive joint knowledge distillation

Also Published As

Publication number Publication date
CN115238903B (en) 2023-10-03
CN115238903A (en) 2022-10-25

Similar Documents

Publication Publication Date Title
US10720071B2 (en) Dynamic identification and validation of test questions from a corpus
CN111125309A (en) Natural language processing method and device, computing equipment and storage medium
KR102259390B1 (en) System and method for ensemble question-answering
US20210390873A1 (en) Deep knowledge tracing with transformers
CN111753076B (en) Dialogue method, dialogue device, electronic equipment and readable storage medium
Xue et al. Generative adversarial learning for optimizing ontology alignment
CN117149989B (en) Training method for large language model, text processing method and device
CN113806487A (en) Semantic search method, device, equipment and storage medium based on neural network
CN116136870A (en) Intelligent social conversation method and conversation system based on enhanced entity representation
Hai Chatgpt: The evolution of natural language processing
WO2023273237A1 (en) Model compression method and system, electronic device, and storage medium
CN116610795A (en) Text retrieval method and device
CN116308757A (en) Credit wind control model training method and device based on knowledge distillation, electronic equipment and computer medium
Kumari et al. Domain-Specific Chatbot Development Using the Deep Learning-Based RASA Framework
US11886821B2 (en) Method and system for inferring answers from knowledge graphs
US11605307B2 (en) Assessing student understanding
Chen Measurement, evaluation, and model construction of mathematical literacy based on iot and pisa
Kazi et al. A survey of deep learning techniques for machine reading comprehension
Du et al. Semantic-enhanced reasoning question answering over temporal knowledge graphs
EP4328805A1 (en) Method and apparatus for generating target deep learning model
CN117272937B (en) Text coding model training method, device, equipment and storage medium
Ayana et al. Reinforced Zero-Shot Cross-Lingual Neural Headline Generation
CN116306917B (en) Task processing method, device, equipment and computer storage medium
CN112380353B (en) Knowledge engineering-based spacecraft overall design method, system and storage medium
Ali et al. SWFQA Semantic Web Based Framework for Question Answering

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21948162

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE