WO2022105121A1 - Distillation method and apparatus applied to bert model, device, and storage medium - Google Patents

Distillation method and apparatus applied to bert model, device, and storage medium Download PDF

Info

Publication number
WO2022105121A1
WO2022105121A1 (Application No. PCT/CN2021/090524)
Authority
WO
WIPO (PCT)
Prior art keywords
model
original
distillation
target
layer
Prior art date
Application number
PCT/CN2021/090524
Other languages
French (fr)
Chinese (zh)
Inventor
朱桂良
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022105121A1 publication Critical patent/WO2022105121A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks

Definitions

  • the present application relates to the technical field of deep learning, and in particular, to a distillation method, apparatus, computer equipment and storage medium applied to a BERT model.
  • the purpose of the embodiments of the present application is to propose a distillation method, device, computer equipment and storage medium applied to the BERT model, so as to solve the problem that it is difficult to balance the loss parameters in the traditional deep model distillation method.
  • the embodiment of the present application provides a distillation method applied to the BERT model, which adopts the following technical solutions:
  • receiving a model distillation request sent by the user terminal, where the model distillation request carries at least a distillation object identifier and a distillation coefficient;
  • a model training operation is performed on the intermediate reduced model based on the training data to obtain a target reduced model.
  • the embodiment of the present application also provides a distillation device applied to the BERT model, which adopts the following technical solutions:
  • a request receiving module configured to receive a model distillation request sent by a user terminal, where the model distillation request at least carries a distillation object identifier and a distillation coefficient;
  • the original model acquisition module is used to read the local database and obtain, from the local database, the trained original BERT model corresponding to the distillation object identifier, where the loss function of the original BERT model is cross entropy;
  • the default model building module is used to construct a default reduced model to be trained that is consistent with the trained original BERT model structure, and the loss function of the default reduced model is cross entropy;
  • a distillation operation module configured to perform a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model
  • a training data acquisition module used for acquiring the training data of the intermediate reduced model in the local database
  • a model training module configured to perform a model training operation on the intermediate reduced model based on the training data to obtain a target reduced model.
  • the embodiment of the present application also provides a computer device, which adopts the following technical solutions:
  • receiving a model distillation request sent by the user terminal, where the model distillation request carries at least a distillation object identifier and a distillation coefficient;
  • a model training operation is performed on the intermediate reduced model based on the training data to obtain a target reduced model.
  • the embodiments of the present application also provide a computer-readable storage medium, which adopts the following technical solutions:
  • the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the steps of the distillation method applied to the BERT model described below are implemented:
  • receiving a model distillation request sent by the user terminal, where the model distillation request carries at least a distillation object identifier and a distillation coefficient;
  • a model training operation is performed on the intermediate reduced model based on the training data to obtain a target reduced model.
  • the distillation method, device, computer equipment and storage medium applied to the BERT model provided by the embodiments of the present application mainly have the following beneficial effects:
  • the embodiment of the present application provides a distillation method applied to a BERT model: receiving a model distillation request sent by a user terminal, where the model distillation request carries at least a distillation object identifier and a distillation coefficient; reading a local database and obtaining, from the local database, a trained original BERT model corresponding to the distillation object identifier, where the loss function of the original BERT model is cross entropy; constructing a default reduced model to be trained that is consistent with the structure of the trained original BERT model, where the loss function of the default reduced model is cross entropy; performing a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model; obtaining the training data of the intermediate reduced model from the local database; and performing a model training operation on the intermediate reduced model based on the training data to obtain a target reduced model.
  • because the simplified BERT model retains the same model structure as the original BERT model and differs only in the number of layers, the amount of code change is small, the prediction codes of the large model and the small model are consistent, and the original code can be reused;
  • during the distillation process there is no need to balance the weights of each loss parameter, thereby reducing the difficulty of the deep model distillation method;
  • the tasks in each stage of training the simplified BERT model remain consistent, which makes the convergence of the simplified BERT model more stable.
  • Fig. 1 is a flowchart of the implementation of the distillation method applied to the BERT model provided by the first embodiment of the present application;
  • Fig. 2 is a flowchart of the implementation of step S104 in Fig. 1;
  • Fig. 3 is a flowchart of the implementation of step S105 in Fig. 1;
  • Fig. 4 is a flowchart of the implementation of the parameter optimization operation provided by Embodiment 1 of the present application;
  • Fig. 5 is a flowchart of the implementation of step S403 in Fig. 4;
  • Fig. 6 is a schematic structural diagram of the distillation apparatus applied to the BERT model provided by the second embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • FIG. 1 shows the implementation flow chart of the distillation method applied to the BERT model provided according to the first embodiment of the present application. For the convenience of description, only the part related to the present application is shown.
  • In step S101, a model distillation request sent by a user terminal is received, where the model distillation request carries at least a distillation object identifier and a distillation coefficient.
  • a user terminal refers to a terminal device used to execute the distillation method applied to the BERT model provided by the present application.
  • the user terminal may be, for example, a mobile terminal such as a mobile phone, a smartphone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player) or a navigation device, or a stationary terminal such as a digital TV or a desktop computer.
  • the examples are only for the convenience of understanding, and are not used to limit the present application.
  • the distillation object identifier is mainly used to uniquely identify the model object that needs to be distilled, and the distillation object identifier may be named based on the model name, for example, a visual recognition model, a speech recognition model, etc.
  • the identifier may also be named based on the abbreviation of the name, for example: sjsbmx, yysbmx, etc.; or the distillation object identifier may be named by a serial number, for example: 001, 002, etc.
  • it should be understood that the examples of distillation object identifiers here are only for convenience of understanding and are not intended to limit the present application.
  • the distillation coefficient is mainly used to determine the factor by which the number of layers of the original BERT model is reduced.
  • as an example, the distillation coefficient may be 3. It should be understood that the examples of distillation coefficients here are only for convenience of understanding and are not intended to limit the present application.
  • In step S102, the local database is read, the trained original BERT model corresponding to the distillation object identifier is obtained from the local database, and the loss function of the original BERT model is cross entropy.
  • the local database refers to a database resident on a machine running a client application.
  • the local database provides the fastest response time because there is no network transfer between the client (application) and the server.
  • the local database pre-stores a variety of trained original BERT models to solve problems in many fields such as computer vision and speech recognition.
  • the BERT model can be divided into an embedding layer, a transformer layer, and a prediction layer, each of which is a different representation of knowledge.
  • the original BERT model consists of a 12-layer transformer (a model based on an "encoder-decoder" structure), and the original BERT model uses cross-entropy as the loss function.
  • the cross entropy is mainly used to measure the difference information between two probability distributions.
  • the performance of language models is usually measured by cross-entropy and perplexity.
  • the meaning of cross-entropy is the difficulty of text recognition with the model, or from a compression point of view, how many bits are used to encode each word on average.
  • the meaning of perplexity is the average number of branches of the text as represented by the model, and its inverse can be regarded as the average probability of each word.
  • Smoothing refers to assigning a probability value to the unobserved N-gram combination to ensure that the word sequence can always obtain a probability value through the language model.
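  • as a concrete illustration of the relationship between cross entropy and perplexity described above, the following Python sketch (not part of the original disclosure; the per-token probabilities are hypothetical) computes both quantities for a short token sequence:

```python
import math

# Hypothetical per-token probabilities assigned by a language model
# to the observed next tokens of a short text.
token_probs = [0.25, 0.10, 0.60, 0.05]

# Cross entropy: average number of bits needed to encode each token.
cross_entropy = -sum(math.log2(p) for p in token_probs) / len(token_probs)

# Perplexity: the average branching factor of the text under the model;
# its inverse is the average per-token probability.
perplexity = 2 ** cross_entropy

print(f"cross entropy = {cross_entropy:.3f} bits/token")
print(f"perplexity    = {perplexity:.3f}")
```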
  • In step S103, a default reduced model to be trained that is consistent with the structure of the trained original BERT model is constructed, and the loss function of the default reduced model is cross entropy.
  • the constructed default reduced model retains the same model structure as BERT; the difference lies in the number of transformer layers.
  • In step S104, a distillation operation is performed on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model.
  • the distillation operation specifically includes the distillation of the transformer layer and parameter initialization.
  • distilling the transformer layers means that, if the distillation coefficient is 3, the first to third layers of the trained original BERT model are mapped to the first layer of the default reduced model; the fourth to sixth layers of the trained original BERT model are mapped to the second layer of the default reduced model; the seventh to ninth layers of the trained original BERT model are mapped to the third layer of the default reduced model; and the tenth to twelfth layers are mapped to the fourth layer of the default reduced model.
  • the probability of each layer being replaced may be determined by using the Bernoulli distribution probability.
  • parameter initialization refers to copying the parameters of the embedding, pooler, and fully connected layers of the trained original BERT model into the corresponding parameter positions of the default reduced model.
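  • as a rough illustration of the layer mapping and parameter initialization described above (a sketch only; the index mapping helper and parameter names are assumptions for illustration, not the exact implementation of the disclosure), the following Python snippet maps a 12-layer original model onto a 4-layer reduced model for a distillation coefficient of 3:

```python
def build_layer_mapping(num_original_layers: int, distillation_coefficient: int):
    """Map each original transformer layer onto a reduced-model layer.

    With 12 original layers and a coefficient of 3, layers 1-3 map to
    reduced layer 1, layers 4-6 to reduced layer 2, and so on.
    """
    mapping = {}
    for original_layer in range(1, num_original_layers + 1):
        reduced_layer = (original_layer - 1) // distillation_coefficient + 1
        mapping[original_layer] = reduced_layer
    return mapping

# Hypothetical parameter dictionaries keyed by component name.
original_params = {"embedding": "...", "pooler": "...", "fully_connected": "..."}
reduced_params = {}

# Parameter initialization: copy embedding, pooler, and fully connected
# layer parameters from the trained original model into the reduced model.
for name in ("embedding", "pooler", "fully_connected"):
    reduced_params[name] = original_params[name]

print(build_layer_mapping(12, 3))
# {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 3, 8: 3, 9: 3, 10: 4, 11: 4, 12: 4}
```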
  • In step S105, the training data of the intermediate reduced model is obtained from the local database.
  • the training data of the reduced model may be labeled data obtained by training the above-mentioned original BERT model, or may be additional unlabeled data.
  • the original training data used for training the original BERT model can be obtained; the temperature parameter of the softmax layer of the original BERT model can be increased to obtain an increased BERT model; the original training data can be input into the increased BERT model for a prediction operation to obtain the mean result label; a screening operation can be performed on the original training data based on the label information to obtain the labelled screening result label; and the training data of the reduced model can be selected based on the enlarged training data and the screened training data.
  • In step S106, a model training operation is performed on the intermediate reduced model based on the training data to obtain the target reduced model.
  • in the embodiment of the present application, a distillation method applied to a BERT model is provided, which receives a model distillation request sent by a user terminal, the model distillation request carrying at least a distillation object identifier and a distillation coefficient, and obtains, from the local database, the trained original BERT model corresponding to the distillation object identifier.
  • the loss function of the original BERT model is cross entropy; a default reduced model to be trained that is consistent with the structure of the trained original BERT model is constructed, and the loss function of the default reduced model is cross entropy.
  • a distillation operation is performed on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model; the training data of the intermediate reduced model is obtained from the local database; and a model training operation is performed on the intermediate reduced model based on the training data to obtain the target reduced model. Since the simplified BERT model retains the same model structure as the original BERT model and differs only in the number of layers, the amount of code change is small, the prediction codes of the large model and the small model are consistent, and the original code can be reused, so that during distillation there is no need to balance the weights of each loss parameter, thereby reducing the difficulty of the deep model distillation method. At the same time, the tasks in each stage of training the simplified BERT model remain consistent, which makes the convergence of the simplified BERT model more stable.
  • Referring to FIG. 2, a flowchart of the implementation of step S104 in FIG. 1 is shown. For the convenience of description, only the parts related to the present application are shown.
  • Step S104 specifically includes: step S201, step S202, and step S203.
  • In step S201, a grouping operation is performed on the transformer layers of the original BERT model based on the distillation coefficient to obtain grouped transformer layers.
  • the grouping operation refers to grouping the transformer layers according to the distillation coefficient. For example, if the number of transformer layers is 12 and the distillation coefficient is 3, the grouping operation divides the 12 transformer layers into 4 groups.
  • In step S202, extraction operations are respectively performed on the grouped transformer layers based on the Bernoulli distribution to obtain the transformer layers to be replaced.
  • In step S203, the transformer layers to be replaced are respectively substituted into the default reduced model to obtain the intermediate reduced model.
  • in the embodiment of the present application, the distillation method based on layer replacement retains the same model structure as BERT; the difference is the number of layers, so the amount of code change is small and the prediction codes of the large model and the small model are consistent,
  • which means the original code can be reused. Because, during distillation, some layers of the small model are initialized with the weights of the mapped layers of the trained large model selected by Bernoulli sampling, the model converges faster and the number of training rounds is reduced.
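  • a minimal sketch of the grouping and Bernoulli-based extraction of steps S201 to S203 might look as follows (illustrative only; the uniform selection probability within each group and the fallback to the last layer are assumptions, not the disclosed procedure):

```python
import random

def select_layers_to_replace(num_layers: int, distillation_coefficient: int, seed: int = 0):
    """Group the original transformer layers and draw one layer per group.

    Each layer in a group is subjected to a Bernoulli trial; the first layer
    that succeeds is selected, so every group contributes one layer whose
    weights initialize the corresponding reduced-model layer.
    """
    rng = random.Random(seed)
    layers = list(range(1, num_layers + 1))
    groups = [layers[i:i + distillation_coefficient]
              for i in range(0, num_layers, distillation_coefficient)]
    selected = []
    for group in groups:
        chosen = None
        for layer in group:
            if rng.random() < 1.0 / len(group):  # Bernoulli trial per layer
                chosen = layer
                break
        if chosen is None:                       # fallback: last layer in the group
            chosen = group[-1]
        selected.append(chosen)
    return selected

# 12 layers, coefficient 3 -> one selected layer per group of three.
print(select_layers_to_replace(12, 3))
```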
  • Referring to FIG. 3, a flowchart of the implementation of step S105 in FIG. 1 is shown. For the convenience of description, only the parts related to the present application are shown.
  • Step S105 specifically includes: step S301, step S302, step S303, step S304, and step S305.
  • In step S301, the original training data used to train the original BERT model is obtained.
  • the original training data refers to the training data that was input into the untrained original BERT model in order to obtain the trained original BERT model.
  • In step S302, the temperature parameter of the softmax layer of the original BERT model is increased to obtain an increased BERT model.
  • In step S303, the original training data is input into the increased BERT model for a prediction operation, and the mean result label is obtained.
  • each piece of original training data can obtain its final classification probability vector from the original BERT model, and selecting the maximum probability gives the model's judgment result for the current original training data.
  • for each piece of original training data, t probability vectors can be output, and the average of the t probability vectors can be calculated as the final probability output vector of the current original training data; after all the original training data have completed the prediction operation, the corresponding mean result labels are obtained.
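  • the mean result label described above can be illustrated with the following sketch (the temperature value, the logits, and the number of runs t are hypothetical placeholders, not values from the disclosure):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Soften the output distribution by dividing the logits by a temperature > 1."""
    scaled = np.asarray(logits, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

# Hypothetical logits produced by the increased BERT model for one training
# sample across t = 3 prediction runs.
logit_runs = [[2.0, 0.5, -1.0], [1.8, 0.7, -0.9], [2.2, 0.4, -1.1]]

temperature = 4.0  # raised temperature of the softmax layer
prob_vectors = [softmax_with_temperature(l, temperature) for l in logit_runs]

# Mean result label: the average of the t probability vectors.
mean_result_label = np.mean(prob_vectors, axis=0)
print(mean_result_label)
```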
  • In step S304, a screening operation is performed on the original training data based on the label information to obtain the labelled screening result label.
  • In step S305, the training data of the reduced model is selected based on the enlarged training data and the screened training data.
  • the label finally selected for the training data of the reduced model can be expressed as follows:
  • Target represents the label that is finally used as the training data of the intermediate reduced model;
  • hard_target represents the screening result label;
  • soft_target represents the mean result label;
  • a and b represent the weights controlling label fusion.
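  • the fusion formula itself is not reproduced in this text; given the variables listed above, a weighted combination of the two labels is the natural reading, and the following hedged Python sketch shows that assumed form (the weights and label values are placeholders):

```python
import numpy as np

# Hypothetical labels for a single training sample.
hard_target = np.array([0.0, 1.0, 0.0])   # screening result label (one-hot)
soft_target = np.array([0.2, 0.7, 0.1])   # mean result label from the increased model

a, b = 0.5, 0.5   # weights controlling label fusion (assumed values)

# Assumed fusion consistent with the variable definitions above:
# Target = a * hard_target + b * soft_target
target = a * hard_target + b * soft_target
print(target)
```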
  • Referring to FIG. 4, a flowchart of the implementation of the parameter optimization operation provided in Embodiment 1 of the present application is shown. For the convenience of description, only the part related to the present application is shown.
  • the foregoing method further includes: step S401, step S402, step S403, and step S404.
  • In step S401, the optimized training data is obtained from the local database.
  • the optimized training data is mainly used to optimize the parameters of the target reduced model.
  • the optimized training data is input into the trained original BERT model and the target reduced model respectively, so as to obtain the difference between the output of each transformer layer of the original BERT model and that of the target reduced model.
  • In step S402, the optimized training data is input into the trained original BERT model and the target reduced model respectively, and the original transformer layer output data and the target transformer layer output data are obtained respectively.
  • In step S403, the distillation loss data between the output data of the original transformer layer and the output data of the target transformer layer is calculated based on the earth mover's distance (EMD).
  • the earth mover's distance is a measure of the distance between two probability distributions over a region D.
  • the attention matrix data output by the original transformer layer and the target transformer layer can be obtained respectively, and the attention EMD distance between the two attention matrices can be calculated; then the FFN (fully connected feedforward neural network) hidden layer matrices output by the original transformer layer and the target transformer layer can be obtained respectively, and the FFN hidden layer EMD distance can be calculated.
  • In step S404, a parameter optimization operation is performed on the target reduced model according to the distillation loss data to obtain an optimized reduced model.
  • the parameters in the target reduced model are optimized until the distillation loss data is less than a preset value or the number of training rounds reaches a preset number, so as to obtain the optimized reduced model.
  • because the transformer layers of the target reduced model are selected based on Bernoulli distribution probabilities, there is a certain error in the parameters of the target reduced model. Since the transformer layer in the BERT model contributes the most to the model and contains the richest information, the learning ability of the reduced model at this layer is also the most important. Therefore, the loss data between the output of the transformer layer of the original BERT model and the output of the transformer layer of the target reduced model is calculated using the earth mover's distance (EMD), and the parameters of the target reduced model are optimized based on this loss data to improve the accuracy of the target reduced model, which ensures that the target model learns more of the knowledge of the original model.
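  • the stopping criterion described above (loss below a preset value or a preset number of training rounds reached) can be sketched as follows (a minimal illustration; the threshold, round count, and placeholder training step are assumptions, not the disclosed values):

```python
def optimize_reduced_model(step_fn, max_rounds=10, loss_threshold=0.01):
    """Optimize until the distillation loss falls below a preset value or the
    preset number of training rounds is reached.

    step_fn is assumed to perform one optimization step and return the
    current distillation loss (a placeholder for the real training step).
    """
    loss = float("inf")
    for _ in range(max_rounds):
        loss = step_fn()
        if loss < loss_threshold:
            break
    return loss

# Hypothetical training step that halves the loss on each round.
state = {"loss": 1.0}
def fake_step():
    state["loss"] *= 0.5
    return state["loss"]

print(optimize_reduced_model(fake_step))
```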
  • Referring to FIG. 5, a flowchart of the implementation of step S403 in FIG. 4 is shown. For convenience of description, only the part related to the present application is shown.
  • Step S403 specifically includes: step S501, step S502, step S503, step S504, and step S505.
  • In step S501, the original attention matrix output by the original transformer layer and the target attention matrix output by the target transformer layer are obtained.
  • In step S502, the attention EMD distance is calculated according to the original attention matrix and the target attention matrix.
  • the attention EMD distance is expressed as:
  • L_attn represents the attention EMD distance;
  • A^T represents the original attention matrix;
  • A^S represents the target attention matrix;
  • f_ij represents the amount of knowledge migrated from the i-th original transformer layer to the j-th target transformer layer;
  • M represents the number of layers of the original transformer;
  • N represents the number of layers of the target transformer.
  • In step S503, the original FFN hidden layer matrix output by the original transformer layer and the target FFN hidden layer matrix output by the target transformer layer are obtained.
  • In step S504, the FFN hidden layer EMD distance is calculated according to the original FFN hidden layer matrix and the target FFN hidden layer matrix.
  • the EMD distance of the FFN hidden layer is expressed as:
  • L_ffn represents the FFN hidden layer EMD distance;
  • H^T represents the original FFN hidden layer matrix of the original transformer layer;
  • H^S represents the target FFN hidden layer matrix of the target transformer layer;
  • W_h represents the transformation matrix;
  • f_ij represents the amount of knowledge migrated from the i-th original transformer layer to the j-th target transformer layer;
  • M represents the number of layers of the original transformer;
  • N represents the number of layers of the target transformer.
  • In step S505, the distillation loss data is obtained based on the attention EMD distance and the FFN hidden layer EMD distance.
  • the transformer layer is an important part of the BERT model, and long-distance dependencies can be captured through the self-attention mechanism.
  • a standard transformer layer mainly includes two parts: a multi-head attention mechanism (MHA) and a fully connected feedforward neural network (FFN).
  • EMD is a method of calculating the optimal distance between two distributions using linear programming, which can make the distillation of knowledge more reasonable.
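  • to make the role of the linear-programming-based EMD concrete, the following Python sketch (an illustration only, not the implementation of the present disclosure; the cost matrix and layer weights are placeholder values, and scipy is assumed to be available) solves the transport problem between M original layers and N target layers and returns the minimal total transport cost:

```python
import numpy as np
from scipy.optimize import linprog

def emd_distance(cost, weights_t, weights_s):
    """Earth mover's distance between two weighted sets of layers.

    cost[i, j] is the per-pair transport cost (e.g. a distance between the
    attention matrix of original layer i and that of target layer j);
    weights_t and weights_s are the normalized layer weights.
    """
    m, n = cost.shape
    c = cost.flatten()
    a_eq, b_eq = [], []
    for i in range(m):                       # each original layer ships all of its weight
        row = np.zeros(m * n)
        row[i * n:(i + 1) * n] = 1.0
        a_eq.append(row)
        b_eq.append(weights_t[i])
    for j in range(n):                       # each target layer receives all of its weight
        row = np.zeros(m * n)
        row[j::n] = 1.0
        a_eq.append(row)
        b_eq.append(weights_s[j])
    result = linprog(c, A_eq=np.array(a_eq), b_eq=np.array(b_eq),
                     bounds=(0, None), method="highs")
    return result.fun                        # minimal total transport cost

# Hypothetical costs between M = 4 original layers and N = 2 target layers.
rng = np.random.default_rng(0)
cost = rng.random((4, 2))
weights_t = np.full(4, 1 / 4)
weights_s = np.full(2, 1 / 2)
print(emd_distance(cost, weights_t, weights_s))
```

  • in the distillation setting described here, cost[i, j] would play the role of the distance between the attention matrices (or the FFN hidden layer matrices) of original layer i and target layer j, and the resulting minimum would play the role of the attention or FFN hidden layer EMD distance.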
  • the attention EMD distance is expressed as:
  • L_attn represents the attention EMD distance;
  • A^T represents the original attention matrix;
  • A^S represents the target attention matrix;
  • f_ij represents the amount of knowledge migrated from the i-th original transformer layer to the j-th target transformer layer;
  • M represents the number of layers of the original transformer;
  • N represents the number of layers of the target transformer.
  • the FFN hidden layer EMD distance is expressed as:
  • L_ffn represents the FFN hidden layer EMD distance;
  • H^T represents the original FFN hidden layer matrix of the original transformer layer;
  • H^S represents the target FFN hidden layer matrix of the target transformer layer;
  • W_h represents the transformation matrix;
  • f_ij represents the amount of knowledge migrated from the i-th original transformer layer to the j-th target transformer layer;
  • M represents the number of layers of the original transformer;
  • N represents the number of layers of the target transformer.
  • Embodiment 1 of the present application provides a distillation method applied to a BERT model: receiving a model distillation request sent by a user terminal, the model distillation request carrying at least a distillation object identifier and a distillation coefficient; obtaining, from a local database, the trained original BERT model corresponding to the distillation object identifier, the loss function of the original BERT model being cross entropy; constructing a default reduced model to be trained that is consistent with the structure of the trained original BERT model, the loss function of the default reduced model being cross entropy; performing a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model; obtaining the training data of the intermediate reduced model from the local database; and performing a model training operation on the intermediate reduced model based on the training data to obtain the target reduced model.
  • because the simplified BERT model retains the same model structure as the original BERT model and differs only in the number of layers, the amount of code change is small, the prediction codes of the large model and the small model are consistent, and the original code can be reused;
  • during the distillation process there is no need to balance the weights of each loss parameter, thereby reducing the difficulty of the deep model distillation method;
  • the tasks in each stage of training the simplified BERT model remain consistent, which makes the convergence of the simplified BERT model more stable.
  • the distillation method based on layer replacement retains the same model structure as BERT.
  • the difference is the number of layers, which makes the code changes smaller, and the prediction codes of the large model and the small model are consistent, so the original code can be reused; during distillation, some layers of the small model are initialized with the weights of the mapped layers of the trained large model selected by Bernoulli sampling, which makes the model converge faster and reduces the number of training rounds.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.
  • the present application provides an embodiment of a distillation apparatus applied to a BERT model, and the apparatus embodiment corresponds to the method embodiment shown in FIG. 1. Specifically, the apparatus can be applied to various electronic devices.
  • the distillation apparatus 100 applied to the BERT model in this embodiment includes: a request receiving module 110, an original model acquisition module 120, a default model building module 130, a distillation operation module 140, a training data acquisition module 150, and a model training module 160, wherein:
  • the request receiving module 110 is configured to receive a model distillation request sent by the user terminal, where the model distillation request at least carries a distillation object identifier and a distillation coefficient;
  • the original model obtaining module 120 is used to read the local database, and obtain the trained original BERT model corresponding to the distillation object identifier in the local database, and the loss function of the original BERT model is cross entropy;
  • the default model building module 130 is used to construct a default reduced model to be trained that is consistent with the trained original BERT model structure, and the loss function of the default reduced model is cross entropy;
  • a distillation operation module 140 configured to perform a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model
  • a training data acquisition module 150 configured to acquire the training data of the intermediate reduced model in the local database
  • the model training module 160 is configured to perform a model training operation on the intermediate reduced model based on the training data to obtain the target reduced model.
  • a user terminal refers to a terminal device used to execute the distillation method applied to the BERT model provided by the present application.
  • the user terminal may be, for example, a mobile terminal such as a mobile phone, a smartphone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player) or a navigation device, or a stationary terminal such as a digital TV or a desktop computer.
  • the examples are only for the convenience of understanding, and are not used to limit the present application.
  • the distillation object identifier is mainly used to uniquely identify the model object that needs to be distilled.
  • the distillation object identifier may be named based on the model name, for example, a visual recognition model, a speech recognition model, etc.; it may also be named based on the abbreviation of the name, for example: sjsbmx, yysbmx, etc.; or it may be named by a serial number, for example: 001, 002, etc.
  • it should be understood that the examples of distillation object identifiers here are only for convenience of understanding and are not intended to limit the present application.
  • the distillation coefficient is mainly used to determine the factor by which the number of layers of the original BERT model is reduced.
  • as an example, the distillation coefficient may be 3. It should be understood that the examples of distillation coefficients here are only for convenience of understanding and are not intended to limit the present application.
  • the local database refers to a database resident on a machine running a client application.
  • the local database provides the fastest response time because there is no network transfer between the client (application) and the server.
  • the local database pre-stores a variety of trained original BERT models to solve problems in many fields such as computer vision and speech recognition.
  • the BERT model can be divided into an embedding layer, a transformer layer, and a prediction layer, each of which is a different representation of knowledge.
  • the original BERT model consists of a 12-layer transformer (a model based on an "encoder-decoder" structure), and the original BERT model uses cross-entropy as the loss function.
  • the cross entropy is mainly used to measure the difference information between two probability distributions.
  • the performance of language models is usually measured by cross-entropy and perplexity.
  • the meaning of cross-entropy is the difficulty of text recognition with the model, or from a compression point of view, how many bits are used to encode each word on average.
  • the meaning of perplexity is the average number of branches of the text as represented by the model, and its inverse can be regarded as the average probability of each word.
  • Smoothing refers to assigning a probability value to the unobserved N-gram combination to ensure that the word sequence can always obtain a probability value through the language model.
  • the constructed default reduced model retains the same model structure as BERT; the difference lies in the number of transformer layers.
  • the distillation operation specifically includes the distillation of the transformer layer and parameter initialization.
  • distilling the transformer layers means that, if the distillation coefficient is 3, the first to third layers of the trained original BERT model are mapped to the first layer of the default reduced model; the fourth to sixth layers of the trained original BERT model are mapped to the second layer of the default reduced model; the seventh to ninth layers of the trained original BERT model are mapped to the third layer of the default reduced model; and the tenth to twelfth layers are mapped to the fourth layer of the default reduced model.
  • the probability of each layer being replaced may be determined by using the Bernoulli distribution probability.
  • parameter initialization refers to copying the parameters of the embedding, pooler, and fully connected layers of the trained original BERT model into the corresponding parameter positions of the default reduced model.
  • the training data of the reduced model may be labeled data obtained by training the above-mentioned original BERT model, or may be additional unlabeled data.
  • the original training data used for training the original BERT model can be obtained; the temperature parameter of the softmax layer of the original BERT model can be increased to obtain an increased BERT model; the original training data can be input into the increased BERT model for a prediction operation to obtain the mean result label; a screening operation can be performed on the original training data based on the label information to obtain the labelled screening result label; and the training data of the reduced model can be selected based on the enlarged training data and the screened training data.
  • in this embodiment, a distillation apparatus applied to the BERT model is provided. Since the simplified BERT model retains the same model structure as the original BERT model and differs only in the number of layers, the amount of code change is small,
  • the prediction codes of the large model and the small model are consistent, and the original code can be reused, so that the model does not need to balance the weight of each loss parameter during the distillation process, thereby reducing the difficulty of the deep model distillation method;
  • at the same time, the tasks in each stage of training the simplified BERT model are all consistent, which makes the convergence of the simplified BERT model more stable.
  • the above-mentioned distillation operation module 140 specifically includes: a grouping operation sub-module, an extraction operation sub-module, and a replacement operation sub-module, wherein:
  • the grouping operation sub-module is used to group the transformer layer of the original BERT model based on the distillation coefficient to obtain the grouped transformer layer;
  • the extraction operation sub-module is used to perform extraction operations in the grouped transformer layers based on the Bernoulli distribution to obtain the transformer layers to be replaced;
  • the replacement operation sub-module is used to replace the transformer layer to be replaced with the default reduced model respectively to obtain the intermediate reduced model.
  • the above-mentioned training data acquisition module 150 specifically includes: an original training data acquisition sub-module, a parameter adjustment sub-module, a prediction operation sub-module, a screening operation sub-module, and a training data acquisition sub-module, wherein:
  • the original training data acquisition sub-module is used to obtain the original training data used to train the original BERT model;
  • the parameter adjustment sub-module is used to increase the temperature parameter of the softmax layer of the original BERT model to obtain an increased BERT model;
  • the prediction operation sub-module is used to input the original training data into the increased BERT model for a prediction operation, and obtain the mean result label;
  • the screening operation sub-module is used to perform the screening operation on the original training data based on the label information, and obtain the labelled screening result label;
  • the training data acquisition sub-module is used to select the reduced model training data based on the enlarged training data and the screened training data.
  • the above-mentioned distillation apparatus 100 applied to the BERT model further includes: an optimized training data acquisition module, an optimized training data input module, a distillation loss data calculation module, and a parameter optimization module, wherein:
  • the optimized training data acquisition module is used to obtain optimized training data in the local database
  • the optimized training data input module is used to input the optimized training data into the trained original BERT model and the target reduced model, respectively, to obtain the original transformer layer output data and the target transformer layer output data;
  • the distillation loss data calculation module is used to calculate the distillation loss data between the output data of the original transformer layer and the output data of the target transformer layer based on the earth mover's distance (EMD);
  • the parameter optimization module is used to optimize the parameters of the target reduced model according to the distillation loss data to obtain the optimized reduced model.
  • the above-mentioned distillation loss data calculation module specifically includes: a target attention matrix acquisition sub-module, an attention EMD distance calculation sub-module, a target FFN hidden layer matrix acquisition sub-module, an FFN hidden layer EMD distance calculation sub-module, and a distillation loss data acquisition sub-module, wherein:
  • the target attention matrix acquisition sub-module is used to obtain the original attention matrix output by the original transformer layer and the target attention matrix output by the target transformer layer;
  • the attention EMD distance calculation sub-module is used to calculate the attention EMD distance according to the original attention matrix and the target attention matrix
  • the target FFN hidden layer matrix acquisition sub-module is used to obtain the original FFN hidden layer matrix output by the original transformer layer and the target FFN hidden layer matrix output by the target transformer layer;
  • the FFN hidden layer EMD distance calculation sub-module is used to calculate the FFN hidden layer EMD distance according to the original FFN hidden layer matrix and the target FFN hidden layer matrix;
  • the distillation loss data acquisition sub-module is used to obtain distillation loss data based on the attention EMD distance and the FFN hidden layer EMD distance.
  • the attention EMD distance is expressed as:
  • L_attn represents the attention EMD distance;
  • A^T represents the original attention matrix;
  • A^S represents the target attention matrix;
  • f_ij represents the amount of knowledge migrated from the i-th original transformer layer to the j-th target transformer layer;
  • M represents the number of layers of the original transformer;
  • N represents the number of layers of the target transformer.
  • the FFN hidden layer EMD distance is expressed as:
  • L_ffn represents the FFN hidden layer EMD distance;
  • H^T represents the original FFN hidden layer matrix of the original transformer layer;
  • H^S represents the target FFN hidden layer matrix of the target transformer layer;
  • W_h represents the transformation matrix;
  • f_ij represents the amount of knowledge migrated from the i-th original transformer layer to the j-th target transformer layer;
  • M represents the number of layers of the original transformer;
  • N represents the number of layers of the target transformer.
  • the second embodiment of the present application provides a distillation apparatus applied to the BERT model. Since the simplified BERT model retains the same model structure as the original BERT model and differs only in the number of layers, the amount of code change is small, the prediction codes of the large model and the small model are consistent, and the original code can be reused, so that the model does not need to balance the weight of each loss parameter in the process of distillation, thereby reducing the difficulty of the deep model distillation method; at the same time, the tasks of each stage of training the simplified BERT model are kept consistent, which makes the convergence of the simplified BERT model more stable. In addition, the distillation method based on layer replacement retains the same model structure as BERT;
  • the difference is the number of layers, which makes the code changes smaller, and the prediction codes of the large model and the small model are consistent, so the original code can be reused; during distillation, some layers of the small model are initialized with the weights of the mapped layers of the trained large model selected by Bernoulli sampling, which makes the model converge faster and reduces the number of training rounds.
  • FIG. 7 is a block diagram of the basic structure of a computer device according to this embodiment.
  • the computer device 200 includes a memory 210, a processor 220, and a network interface 230 that communicate with each other through a system bus. It should be noted that only the computer device 200 with components 210-230 is shown in the figure, but it should be understood that implementation of all of the shown components is not required, and more or fewer components may be implemented instead.
  • the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, etc.
  • the computer equipment may be a desktop computer, a notebook computer, a palmtop computer, a cloud server and other computing equipment.
  • the computer device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad or a voice control device.
  • the memory 210 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (eg, SD or DX memory, etc.), random access memory (RAM), static Random Access Memory (SRAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Programmable Read Only Memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
  • the computer-readable storage medium can be non-volatile or volatile.
  • the memory 210 may be an internal storage unit of the computer device 200 , such as a hard disk or a memory of the computer device 200 .
  • the memory 210 may also be an external storage device of the computer device 200, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 210 may also include both the internal storage unit of the computer device 200 and its external storage device.
  • the memory 210 is generally used to store the operating system and various application software installed on the computer device 200 , such as computer-readable instructions applied to the distillation method of the BERT model.
  • the memory 210 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 220 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 220 is typically used to control the overall operation of the computer device 200 .
  • the processor 220 is configured to execute the computer-readable instructions stored in the memory 210 or process data, for example, the computer-readable instructions for executing the distillation method applied to the BERT model.
  • the network interface 230 may include a wireless network interface or a wired network interface, and the network interface 230 is generally used to establish a communication connection between the computer device 200 and other electronic devices.
  • the steps of the above distillation method applied to the BERT model include:
  • receiving a model distillation request sent by the user terminal, where the model distillation request carries at least a distillation object identifier and a distillation coefficient;
  • a model training operation is performed on the intermediate reduced model based on the training data to obtain a target reduced model.
  • in the distillation method applied to the BERT model provided by this application, because the simplified BERT model retains the same model structure as the original BERT model and differs only in the number of layers, the amount of code change is small, the prediction codes of the large model and the small model are consistent, and the original code can be reused, so that the model does not need to balance the weight of each loss parameter in the process of distillation, thereby reducing the difficulty of the deep model distillation method; at the same time, the tasks in each stage of training the simplified BERT model remain consistent, making the convergence of the simplified BERT model more stable.
  • the present application also provides another embodiment, that is, a computer-readable storage medium is provided, where the computer-readable storage medium stores computer-readable instructions, and the computer-readable instructions can be executed by at least one processor to cause the at least one processor to perform the steps of the distillation method applied to the BERT model as follows:
  • receiving a model distillation request sent by the user terminal, where the model distillation request carries at least a distillation object identifier and a distillation coefficient;
  • a model training operation is performed on the intermediate reduced model based on the training data to obtain a target reduced model.
  • in the distillation method applied to the BERT model provided by this application, because the simplified BERT model retains the same model structure as the original BERT model and differs only in the number of layers, the amount of code change is small, the prediction codes of the large model and the small model are consistent, and the original code can be reused, so that the model does not need to balance the weight of each loss parameter in the process of distillation, thereby reducing the difficulty of the deep model distillation method; at the same time, the tasks in each stage of training the simplified BERT model remain consistent, making the convergence of the simplified BERT model more stable.
  • the method of the above embodiment can be implemented by means of software plus a necessary general hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application can be embodied in the form of a software product in essence or in a part that contributes to the prior art, and the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, CD-ROM), including several instructions to make a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods described in the various embodiments of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Mathematics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Feedback Control In General (AREA)

Abstract

A distillation method and apparatus applied to a BERT model, a computer device, and a storage medium, which relate to the technical field of deep learning. In the method, because a refined BERT model retains the same model structure as a raw BERT model, the only difference being the number of layers, the amount of change in the code is relatively small. Moreover, the prediction codes of the large model and the small model are consistent and the source code may be reused, so that the weights of the loss parameters do not need to be balanced while a model is being distilled, thereby reducing the level of difficulty of the deep model distillation method. Meanwhile, the tasks of each stage of training the refined BERT model are consistent, so that the convergence of the refined BERT model is more stable.

Description

A distillation method, apparatus, device and storage medium applied to a BERT model
This application is based on the Chinese invention patent application No. 202011288877.7, filed on November 17, 2020 and entitled "A distillation method, apparatus, device and storage medium applied to a BERT model", and claims its priority.
Technical Field
The present application relates to the technical field of deep learning, and in particular, to a distillation method, apparatus, computer device and storage medium applied to a BERT model.
Background
In recent years, in many fields such as computer vision and speech recognition, people tend to design more complex networks and collect more data when using deep networks to solve problems, in order to obtain better results. However, the complexity of the model rises sharply as a result: the model has more and more parameters, its scale becomes larger and larger, and the hardware resources (memory, GPU) it requires become higher and higher, which is not conducive to deploying the model or extending its application to mobile terminals.
An existing deep model distillation method uses the advantages of the distillation model to match the data between the intermediate layers during model distillation, so as to achieve the purpose of compressing the model.
However, the applicant has realized that traditional deep model distillation methods are generally not intelligent: when matching the intermediate layer outputs during distillation, many loss parameters often need to be balanced, for example the downstream task loss, the intermediate layer output loss, the correlation matrix loss, the attention matrix loss, and so on, so that traditional deep model distillation methods have the problem that balancing the loss parameters is difficult.
Summary of the Invention
The purpose of the embodiments of the present application is to propose a distillation method, apparatus, computer device and storage medium applied to the BERT model, so as to solve the problem that it is difficult to balance the loss parameters in traditional deep model distillation methods.
In order to solve the above technical problem, an embodiment of the present application provides a distillation method applied to the BERT model, which adopts the following technical solution:
receiving a model distillation request sent by a user terminal, where the model distillation request carries at least a distillation object identifier and a distillation coefficient;
reading a local database, and obtaining, from the local database, a trained original BERT model corresponding to the distillation object identifier, where the loss function of the original BERT model is cross entropy;
constructing a default reduced model to be trained that is consistent with the structure of the trained original BERT model, where the loss function of the default reduced model is cross entropy;
performing a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model;
obtaining training data of the intermediate reduced model from the local database;
performing a model training operation on the intermediate reduced model based on the training data to obtain a target reduced model.
In order to solve the above technical problem, an embodiment of the present application further provides a distillation apparatus applied to the BERT model, which adopts the following technical solution:
a request receiving module, configured to receive a model distillation request sent by a user terminal, where the model distillation request carries at least a distillation object identifier and a distillation coefficient;
an original model acquisition module, configured to read a local database and obtain, from the local database, a trained original BERT model corresponding to the distillation object identifier, where the loss function of the original BERT model is cross entropy;
a default model building module, configured to construct a default reduced model to be trained that is consistent with the structure of the trained original BERT model, where the loss function of the default reduced model is cross entropy;
a distillation operation module, configured to perform a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model;
a training data acquisition module, configured to obtain training data of the intermediate reduced model from the local database;
a model training module, configured to perform a model training operation on the intermediate reduced model based on the training data to obtain a target reduced model.
为了解决上述技术问题,本申请实施例还提供一种计算机设备,采用了如下所述的技术方案:In order to solve the above-mentioned technical problems, the embodiment of the present application also provides a computer device, which adopts the following technical solutions:
包括存储器和处理器,所述存储器中存储有计算机可读指令,所述处理器执行所述计算机可读指令时实现如下所述的应用于BERT模型的蒸馏方法的步骤;comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and when the processor executes the computer-readable instructions, the steps of the distillation method applied to the BERT model as described below are implemented;
接收用户终端发送的模型蒸馏请求,所述模型蒸馏请求至少携带有蒸馏对象标识以及蒸馏系数;Receive a model distillation request sent by the user terminal, where the model distillation request at least carries a distillation object identifier and a distillation coefficient;
读取本地数据库,在所述本地数据库中获取与所述蒸馏对象标识相对应的训练好的原始BERT模型,所述原始BERT模型的损失函数为交叉熵;Read the local database, obtain the trained original BERT model corresponding to the distillation object identifier in the local database, and the loss function of the original BERT model is cross entropy;
构建与所述训练好的原始BERT模型结构一致的待训练的默认精简模型,所述默认精简模型的损失函数为交叉熵;Build a default reduced model to be trained that is consistent with the trained original BERT model structure, and the loss function of the default reduced model is cross entropy;
基于所述蒸馏系数对所述默认精简模型进行蒸馏操作,得到中间精简模型;Perform a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model;
在所述本地数据库中获取所述中间精简模型的训练数据;Acquiring training data of the intermediate reduced model in the local database;
基于所述训练数据对所述中间精简模型进行模型训练操作,得到目标精简模型。A model training operation is performed on the intermediate reduced model based on the training data to obtain a target reduced model.
为了解决上述技术问题,本申请实施例还提供一种计算机可读存储介质,采用了如下所述的技术方案:In order to solve the above technical problems, the embodiments of the present application also provide a computer-readable storage medium, which adopts the following technical solutions:
所述计算机可读存储介质上存储有计算机可读指令,所述计算机可读指令被处理器执行时实现如下所述的应用于BERT模型的蒸馏方法的步骤:The computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, implements the steps of the distillation method applied to the BERT model as described below:
接收用户终端发送的模型蒸馏请求,所述模型蒸馏请求至少携带有蒸馏对象标识以及蒸馏系数;Receive a model distillation request sent by the user terminal, where the model distillation request at least carries a distillation object identifier and a distillation coefficient;
读取本地数据库,在所述本地数据库中获取与所述蒸馏对象标识相对应的训练好的原始BERT模型,所述原始BERT模型的损失函数为交叉熵;Read the local database, obtain the trained original BERT model corresponding to the distillation object identifier in the local database, and the loss function of the original BERT model is cross entropy;
构建与所述训练好的原始BERT模型结构一致的待训练的默认精简模型,所述默认精简模型的损失函数为交叉熵;Build a default reduced model to be trained that is consistent with the trained original BERT model structure, and the loss function of the default reduced model is cross entropy;
基于所述蒸馏系数对所述默认精简模型进行蒸馏操作,得到中间精简模型;Perform a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model;
在所述本地数据库中获取所述中间精简模型的训练数据;Acquiring training data of the intermediate reduced model in the local database;
基于所述训练数据对所述中间精简模型进行模型训练操作,得到目标精简模型。A model training operation is performed on the intermediate reduced model based on the training data to obtain a target reduced model.
与现有技术相比,本申请实施例提供的应用于BERT模型的蒸馏方法、装置、计算机设备及存储介质主要有以下有益效果:Compared with the prior art, the distillation method, device, computer equipment and storage medium applied to the BERT model provided by the embodiments of the present application mainly have the following beneficial effects:
本申请实施例提供了一种应用于BERT模型的蒸馏方法,接收用户终端发送的模型蒸馏请求,所述模型蒸馏请求至少携带有蒸馏对象标识以及蒸馏系数;读取本地数据库,在所述本地数据库中获取与所述蒸馏对象标识相对应的训练好的原始BERT模型,所述原始BERT模型的损失函数为交叉熵;构建与所述训练好的原始BERT模型结构一致的待训练的默认精简模型,所述默认精简模型的损失函数为交叉熵;基于所述蒸馏系数对所述默认精简模型进行蒸馏操作,得到中间精简模型;在所述本地数据库中获取所述中间精简模型的训练数据;基于所述训练数据对所述中间精简模型进行模型训练操作,得到目标精简模型。由于精简BERT模型保留了与原始BERT模型相同的模型结构,差异是层数的不同,使得代码改动量较小,而且大模型与小模型的预测代码是一致的,可以复用原代码,使得模型在蒸馏的过程中,无需平衡各个loss参数的权重,进而降低深度模型蒸馏方法的困难程度,同时,训练精简BERT模型各个阶段的任务均保持一致性,使得精简BERT模型收敛得更加稳定。The embodiment of the present application provides a distillation method applied to a BERT model: receiving a model distillation request sent by a user terminal, where the model distillation request carries at least a distillation object identifier and a distillation coefficient; reading a local database and obtaining, from the local database, a trained original BERT model corresponding to the distillation object identifier, the loss function of the original BERT model being cross entropy; constructing a default reduced model to be trained whose structure is consistent with that of the trained original BERT model, the loss function of the default reduced model being cross entropy; performing a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model; obtaining training data of the intermediate reduced model from the local database; and performing a model training operation on the intermediate reduced model based on the training data to obtain a target reduced model. Since the reduced BERT model retains the same model structure as the original BERT model and differs only in the number of layers, the amount of code change is small, and the prediction code of the large model and the small model is the same, so the original code can be reused. As a result, there is no need to balance the weights of the individual loss parameters during distillation, which reduces the difficulty of the deep model distillation method; at the same time, the tasks at each stage of training the reduced BERT model remain consistent, so that the reduced BERT model converges more stably.
附图说明Description of drawings
为了更清楚地说明本申请中的方案,下面将对本申请实施例描述中所需要使用的附图作一个简单介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to illustrate the solutions in the present application more clearly, the following will briefly introduce the accompanying drawings used in the description of the embodiments of the present application. For those of ordinary skill, other drawings can also be obtained from these drawings without any creative effort.
图1是本申请实施例一提供的应用于BERT模型的蒸馏方法的实现流程图;Fig. 1 is the realization flow chart of the distillation method applied to the BERT model provided by the first embodiment of the present application;
图2是图1中步骤S104的实现流程图;Fig. 2 is the realization flow chart of step S104 in Fig. 1;
图3是图1中步骤S105的实现流程图;Fig. 3 is the realization flow chart of step S105 in Fig. 1;
图4是本申请实施例一提供的参数优化操作的实现流程图;Fig. 4 is the realization flow chart of the parameter optimization operation provided by Embodiment 1 of the present application;
图5是图4中步骤S403的实现流程图;Fig. 5 is the realization flow chart of step S403 in Fig. 4;
图6是本申请实施例二提供的应用于BERT模型的蒸馏装置的结构示意图;Fig. 6 is the structural representation of the distillation apparatus applied to the BERT model provided by the second embodiment of the present application;
图7是根据本申请的计算机设备的一个实施例的结构示意图。FIG. 7 is a schematic structural diagram of an embodiment of a computer device according to the present application.
具体实施方式DETAILED DESCRIPTION OF EMBODIMENTS
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同;本文中在申请的说明书中所使用的术语只是为了描述具体的实施例的目的,不是旨在于限制本申请;本申请的说明书和权利要求书及上述附图说明中的术语“包括”和“具有”以及它们的任何变形,意图在于覆盖不排他的包含。本申请的说明书和权利要求书或上述附图中的术语“第一”、“第二”等是用于区别不同对象,而不是用于描述特定顺序。Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the technical field of this application; the terms used herein in the specification of the application are for the purpose of describing specific embodiments only It is not intended to limit the application; the terms "comprising" and "having" and any variations thereof in the description and claims of this application and the above description of the drawings are intended to cover non-exclusive inclusion. The terms "first", "second" and the like in the description and claims of the present application or the above drawings are used to distinguish different objects, rather than to describe a specific order.
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor a separate or alternative embodiment that is mutually exclusive of other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.
为了使本技术领域的人员更好地理解本申请方案,下面将结合附图,对本申请实施例中的技术方案进行清楚、完整地描述。In order to make those skilled in the art better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings.
实施例一Example 1
如图1所示,示出了根据本申请实施例一提供的应用于BERT模型的蒸馏方法的实现流程图,为了便于说明,仅示出与本申请相关的部分。As shown in FIG. 1 , it shows the implementation flow chart of the distillation method applied to the BERT model provided according to the first embodiment of the present application. For the convenience of description, only the part related to the present application is shown.
在步骤S101中,接收用户终端发送的模型蒸馏请求,模型蒸馏请求至少携带有蒸馏对象标识以及蒸馏系数。In step S101, a model distillation request sent by a user terminal is received, where the model distillation request at least carries a distillation object identifier and a distillation coefficient.
在本申请实施例中,用户终端指的是用于执行本申请提供的应用于BERT模型的蒸馏方法的终端设备,该用户终端可以是诸如移动电话、智能电话、笔记本电脑、数字广播接收器、PDA(个人数字助理)、PAD(平板电脑)、PMP(便携式多媒体播放器)、导航装置等等的移动终端以及诸如数字TV、台式计算机等等的固定终端,应当理解,此处对用户终端的举例仅为方便理解,不用于限定本申请。In this embodiment of the present application, the user terminal refers to the terminal device used to execute the distillation method applied to the BERT model provided by the present application. The user terminal may be a mobile terminal such as a mobile phone, a smart phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), or a navigation device, or a fixed terminal such as a digital TV or a desktop computer. It should be understood that the examples of user terminals here are only for ease of understanding and are not used to limit the present application.
在本申请实施例中,蒸馏对象标识主要用于唯一标识需要蒸馏的模型对象,该蒸馏对象标识可以是基于模型名称命名,作为示例,例如:视觉识别模型、语音识别模型等等;该蒸馏对象标识可以是基于名称简称进行命名,作为示例,例如:sjsbmx、yysbmx等等;该蒸馏对象标识还可以是序号进行命名,作为示例,例如:001、002等等,应当理解,此处对蒸馏对象标识的举例仅为方便理解,不用于限定本申请。In this embodiment of the present application, the distillation object identifier is mainly used to uniquely identify the model object that needs to be distilled. The distillation object identifier may be named after the model name, for example, a visual recognition model, a speech recognition model, and so on; it may be named after an abbreviated name, for example, sjsbmx, yysbmx, and so on; or it may be named with a serial number, for example, 001, 002, and so on. It should be understood that the examples of distillation object identifiers here are only for ease of understanding and are not used to limit the present application.
在本申请实施例中,蒸馏系数主要用于确认将原始BERT模型的层数缩小的倍数,作为示例,例如:需要将BERT模型从12层蒸馏至4层,那么该蒸馏系数则为3,应当理解,此处对蒸馏系数的举例仅为方便理解,不用于限定本申请。In the embodiment of this application, the distillation coefficient is mainly used to confirm the multiple of reducing the number of layers of the original BERT model. As an example, for example, if the BERT model needs to be distilled from 12 layers to 4 layers, then the distillation coefficient is 3, which should be It is understood that the examples of distillation coefficients here are only for convenience of understanding, and are not intended to limit the present application.
在步骤S102中,读取本地数据库,在本地数据库中获取与蒸馏对象标识相对应的训练好的原始BERT模型,原始BERT模型的损失函数为交叉熵。In step S102, the local database is read, and the trained original BERT model corresponding to the distillation object identifier is obtained in the local database, and the loss function of the original BERT model is cross entropy.
在本申请实施例中,本地数据库是指驻留于运行客户应用程序的机器的数据库。本地数据库提供最快的响应时间,因为在客户(应用程序)和服务器之间没有网络传输。该本地数据库预先存储有各式各样的训练好的原始BERT模型,以解决在计算机视觉、语音识别等诸多领域存在的问题。In this embodiment of the present application, the local database refers to a database residing on the machine that runs the client application. The local database provides the fastest response time because there is no network transmission between the client (application) and the server. The local database pre-stores a variety of trained original BERT models to solve problems in many fields such as computer vision and speech recognition.
在本申请实施例中,Bert模型可以分为向量(embedding)层、转换器(transformer)层和预测(prediction)层,每种层是知识的不同表示形式。该原始BERT模型由12层transformer(一种基于“encoder-decoder”结构的模型)组成,该原始BERT模型选用的是交叉熵作为损失函数。该交叉熵主要用于度量两个概率分布间的差异性信息。语言模型的性能通常用交叉熵和复杂度(perplexity)来衡量。交叉熵的意义是用该模型对文本识别的难度,或者从压缩的角度来看,每个词平均要用几个位来编码。复杂度的意义是用该模型表示这一文本平均的分支数,其倒数可视为每个词的平均概率。平滑是指对没观察到的N元组合赋予一个概率值,以保证词序列总能通过语言模型得到一个概率值。In this embodiment of the present application, the Bert model can be divided into a vector (embedding) layer, a transformer (transformer) layer, and a prediction (prediction) layer, each of which is a different representation of knowledge. The original BERT model consists of a 12-layer transformer (a model based on an "encoder-decoder" structure), and the original BERT model uses cross-entropy as the loss function. The cross entropy is mainly used to measure the difference information between two probability distributions. The performance of language models is usually measured by cross-entropy and perplexity. The meaning of cross-entropy is the difficulty of text recognition with the model, or from a compression point of view, how many bits are used to encode each word on average. The meaning of complexity is to use the model to represent the average number of branches of this text, and its inverse can be regarded as the average probability of each word. Smoothing refers to assigning a probability value to the unobserved N-gram combination to ensure that the word sequence can always obtain a probability value through the language model.
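As a brief, non-limiting illustration of the cross-entropy loss mentioned above, the following is a minimal Python/PyTorch sketch; the variable names logits and labels are placeholders assumed for illustration and do not come from the application.

    import torch
    import torch.nn.functional as F

    # Cross entropy between the model's predicted distribution and the true labels;
    # both the original BERT model and the reduced model use this as their loss function.
    logits = torch.randn(8, 2)            # assumed batch of 8 samples, 2 classes
    labels = torch.randint(0, 2, (8,))    # assumed ground-truth labels
    loss = F.cross_entropy(logits, labels)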
在步骤S103中,构建与训练好的原始BERT模型结构一致的待训练的默认精简模型,默认精简模型的损失函数为交叉熵。In step S103, a default reduced model to be trained that is consistent with the trained original BERT model structure is constructed, and the loss function of the default reduced model is cross entropy.
在本申请实施例中,构建出来的默认精简模型保留了与BERT相同的模型结构,不同之处在于transformer层的数量。In the embodiment of the present application, the constructed default reduced model retains the same model structure as BERT, the difference lies in the number of transformer layers.
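For illustration only, a minimal sketch of constructing such a default reduced model is given below, assuming PyTorch and the Hugging Face transformers library; the checkpoint name and label count are assumptions and are not part of the application.

    from transformers import BertConfig, BertForSequenceClassification

    # Teacher: a trained 12-layer original BERT model (checkpoint name is an assumption).
    teacher = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

    # Student: identical structure, only the number of transformer layers differs (12 -> 4);
    # the loss function remains cross entropy.
    student_config = BertConfig.from_pretrained("bert-base-chinese", num_labels=2)
    student_config.num_hidden_layers = 4
    student = BertForSequenceClassification(student_config)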
在步骤S104中,基于蒸馏系数对默认精简模型进行蒸馏操作,得到中间精简模型。In step S104, a distillation operation is performed on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model.
在本申请实施例中,蒸馏操作具体包括蒸馏transformer层以及参数初始化。In this embodiment of the present application, the distillation operation specifically includes the distillation of the transformer layer and parameter initialization.
在本申请实施例中,蒸馏transformer层指的是倘若蒸馏系数为3,那么训练好的原始BERT模型的第一至第三层将替换至默认精简模型的第一层;训练好的原始BERT模型的第四至第六层将替换至默认精简模型的第二层;训练好的原始BERT模型的第七至第九层将替换至默认精简模型的第三层;训练好的原始BERT模型的第十至第十二层将替换至默认精简模型的第四层。In this embodiment of the present application, distilling the transformer layers means that, if the distillation coefficient is 3, the first to third layers of the trained original BERT model are used to replace the first layer of the default reduced model; the fourth to sixth layers of the trained original BERT model are used to replace the second layer of the default reduced model; the seventh to ninth layers of the trained original BERT model are used to replace the third layer of the default reduced model; and the tenth to twelfth layers of the trained original BERT model are used to replace the fourth layer of the default reduced model.
在本申请实施例中,在进行蒸馏替换的过程中,可采用伯努利分布概率确定每一层被替换的概率。In this embodiment of the present application, in the process of distillation replacement, the probability of each layer being replaced may be determined by using the Bernoulli distribution probability.
在本申请实施例中,参数初始化指的是embedding、pooler、全连接层参数依据训练好的原始BERT模型中各层级的参数,替换至默认精简模型对应的参数位置。In the embodiment of the present application, parameter initialization refers to replacing the parameters of the embedding, pooler, and fully connected layers to the parameter positions corresponding to the default simplified model according to the parameters of each level in the trained original BERT model.
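Continuing the sketch above, the embedding, pooler and fully connected (classifier) parameters of the reduced model could, for example, be initialized from the trained teacher as follows; this is illustrative only, and the attribute names follow the Hugging Face BERT implementation rather than the application itself.

    # Copy embedding, pooler and classifier (fully connected) weights from teacher to student.
    student.bert.embeddings.load_state_dict(teacher.bert.embeddings.state_dict())
    student.bert.pooler.load_state_dict(teacher.bert.pooler.state_dict())
    student.classifier.load_state_dict(teacher.classifier.state_dict())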
在步骤S105中,在本地数据库中获取中间精简模型的训练数据。In step S105, the training data of the intermediate reduced model is obtained from the local database.
在本申请实施例中,精简模型训练数据可以采用训练上述原始BERT模型得到的有标签数据,也可以是额外的无标签数据。In the embodiment of the present application, the training data of the reduced model may be labeled data obtained by training the above-mentioned original BERT model, or may be additional unlabeled data.
在本申请实施例中,可获取原始BERT模型训练后的原始训练数据;调高原始BERT模型softmax层的温度参数,得到调高BERT模型,将原始训练数据输入至调高BERT模型进行预测操作,得到均值结果标签;基于标签信息在原始训练数据上进行筛选操作,得到带标签的筛选结果标签;基于放大训练数据以及筛选训练数据选取精简模型训练数据。In this embodiment of the present application, the original training data used to train the original BERT model may be obtained; the temperature parameter of the softmax layer of the original BERT model is raised to obtain a temperature-raised BERT model, and the original training data is input into the temperature-raised BERT model for a prediction operation to obtain mean-result labels; a screening operation is performed on the original training data based on the label information to obtain labeled screening-result labels; and the reduced-model training data is selected based on the amplified training data and the screened training data.
在步骤S106中,基于训练数据对中间精简模型进行模型训练操作,得到目标精简模型。In step S106, a model training operation is performed on the intermediate reduced model based on the training data to obtain the target reduced model.
在本申请实施例中,提供了一种应用于BERT模型的蒸馏方法,接收用户终端发送的模型蒸馏请求,模型蒸馏请求至少携带有蒸馏对象标识以及蒸馏系数;读取本地数据库,在本地数据库中获取与蒸馏对象标识相对应的训练好的原始BERT模型,原始BERT模型的损失函数为交叉熵;构建与训练好的原始BERT模型结构一致的待训练的默认精简模型,默认精简模型的损失函数为交叉熵;基于蒸馏系数对默认精简模型进行蒸馏操作,得到中间精简模型;在本地数据库中获取中间精简模型的训练数据;基于训练数据对中间精简模型进行模型训练操作,得到目标精简模型。由于精简BERT模型保留了与原始BERT模型相同的模型结构,差异是层数的不同,使得代码改动量较小,而且大模型与小模型的预测代码是一致的,可以复用原代码,使得模型在蒸馏的过程中,无需平衡各个loss参数的权重,进而降低深度模型蒸馏方法的困难程度,同时,训练精简BERT模型各个阶段的任务均保持一致性,使得精简BERT模型收敛得更加稳定。In the embodiment of the present application, a distillation method applied to a BERT model is provided: receiving a model distillation request sent by a user terminal, where the model distillation request carries at least a distillation object identifier and a distillation coefficient; reading a local database and obtaining, from the local database, a trained original BERT model corresponding to the distillation object identifier, the loss function of the original BERT model being cross entropy; constructing a default reduced model to be trained whose structure is consistent with that of the trained original BERT model, the loss function of the default reduced model being cross entropy; performing a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model; obtaining training data of the intermediate reduced model from the local database; and performing a model training operation on the intermediate reduced model based on the training data to obtain a target reduced model. Since the reduced BERT model retains the same model structure as the original BERT model and differs only in the number of layers, the amount of code change is small, and the prediction code of the large model and the small model is the same, so the original code can be reused; as a result, there is no need to balance the weights of the individual loss parameters during distillation, which reduces the difficulty of the deep model distillation method, and at the same time the tasks at each stage of training the reduced BERT model remain consistent, so that the reduced BERT model converges more stably.
继续参阅图2,示出了图1中步骤S104的实现流程图,为了便于说明,仅示出与本申请相关的部分。Continuing to refer to FIG. 2 , a flowchart of the implementation of step S104 in FIG. 1 is shown. For the convenience of description, only the parts related to the present application are shown.
在本申请实施例一的一些可选的实现方式中,上述步骤S104具体包括:步骤S201、步骤S202以及步骤S203。In some optional implementation manners of Embodiment 1 of the present application, the foregoing step S104 specifically includes: step S201 , step S202 and step S203 .
在步骤S201中,基于蒸馏系数对原始BERT模型的transformer层进行分组操作,得到分组transformer层。In step S201, a grouping operation is performed on the transformer layer of the original BERT model based on the distillation coefficient to obtain a grouped transformer layer.
在本申请实施例中,分组操作指的是transformer层数按照蒸馏系数进行分组,作为示例,例如:transformer层数为12,蒸馏系数为3,分组操作则将12个transformer层划分成4组。In the embodiment of this application, the grouping operation refers to that the number of transformer layers is grouped according to the distillation coefficient. For example, for example, the number of transformer layers is 12 and the distillation coefficient is 3. The grouping operation divides the 12 transformer layers into 4 groups.
在步骤S202中,基于伯努利分布分别在分组transformer层中进行提取操作,得到待替换transformer层。In step S202, extraction operations are respectively performed in the grouped transformer layers based on the Bernoulli distribution to obtain the transformer layers to be replaced.
在本申请实施例中,伯努利分布指的是:对于参数为p(0<p<1)的随机变量X,它分别以概率p和1-p取1和0为值,EX=p,DX=p(1-p)。伯努利试验成功的次数服从伯努利分布,参数p是试验成功的概率。伯努利分布是一个离散型概率分布,是N=1时二项分布的特殊情况。In this embodiment of the present application, the Bernoulli distribution means that a random variable X with parameter p (0<p<1) takes the value 1 with probability p and the value 0 with probability 1-p, so that E(X)=p and D(X)=p(1-p). The number of successes of a Bernoulli trial follows the Bernoulli distribution, where the parameter p is the probability of success. The Bernoulli distribution is a discrete probability distribution and is the special case of the binomial distribution when N=1.
在步骤S203中,将待替换transformer层分别替换至默认精简模型,得到中间精简模型。In step S203, the transformer layers to be replaced are respectively replaced with default reduced models to obtain intermediate reduced models.
在本申请实施例中,基于层替换的蒸馏方式,保留了与BERT相同的模型结构,差异是层数的不同,使得代码改动量较小,而且大模型与小模型的预测代码是一致的,可以复用原代码,由于蒸馏时,小模型的部分层基于伯努利采样,随机初始化成训练好的大模型映射层的权重,使模型收敛更快,减少训练轮数。In the embodiment of the present application, the distillation method based on layer replacement retains the same model structure as BERT, the difference is the number of layers, so that the amount of code changes is small, and the prediction codes of the large model and the small model are consistent, The original code can be reused, because during distillation, some layers of the small model are randomly initialized to the weight of the trained large model mapping layer based on Bernoulli sampling, which makes the model converge faster and reduces the number of training rounds.
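The following is a minimal sketch of the grouping, Bernoulli-based extraction and layer replacement described in steps S201 to S203, continuing the teacher/student sketch above; the sampling probability p and the fallback rule are assumptions, since the exact sampling scheme is not spelled out here.

    import torch

    coeff, p = 3, 0.5                                 # distillation coefficient and an assumed Bernoulli p
    teacher_layers = teacher.bert.encoder.layer       # 12 original transformer layers
    student_layers = student.bert.encoder.layer       # 4 target transformer layers

    for j in range(len(student_layers)):
        group = list(range(j * coeff, (j + 1) * coeff))   # e.g. teacher layers 0-2 map to student layer 0
        # One Bernoulli trial per candidate layer; fall back to the last layer of the group.
        chosen = next((i for i in group if torch.bernoulli(torch.tensor(p)).item() == 1.0), group[-1])
        student_layers[j].load_state_dict(teacher_layers[chosen].state_dict())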
继续参阅图3,示出了图1中步骤S105的实现流程图,为了便于说明,仅示出与本申请相关的部分。Continuing to refer to FIG. 3 , a flowchart of the implementation of step S105 in FIG. 1 is shown. For the convenience of description, only the parts related to the present application are shown.
在本申请实施例一的一些可选的实现方式中,上述步骤S105具体包括:步骤S301、步骤S302、步骤S303、步骤S304以及步骤S305。In some optional implementation manners of Embodiment 1 of the present application, the foregoing step S105 specifically includes: step S301 , step S302 , step S303 , step S304 and step S305 .
在步骤S301中,获取原始BERT模型训练后的原始训练数据。In step S301, the original training data after the original BERT model training is obtained.
在本申请实施例中,原始训练数据指的是在获得训练后的原始BERT模型之前,将训练数据输入至未训练的原始BERT模型的训练数据。In the embodiment of the present application, the original training data refers to the training data of inputting the training data into the untrained original BERT model before obtaining the trained original BERT model.
在步骤S302中,调高原始BERT模型softmax层的温度参数,得到调高BERT模型。In step S302, the temperature parameter of the softmax layer of the original BERT model is raised to obtain a temperature-raised BERT model.
在本申请实施例中,可将温度参数T调高至一个较大值,作为示例,例如:T=20,应当理解,此处对调高温度参数的举例仅为方便理解,不用于限定本申请。In the embodiment of the present application, the temperature parameter T can be increased to a larger value, for example, T=20. It should be understood that the example of increasing the temperature parameter here is only for convenience and is not intended to limit the present application .
在步骤S303中,将原始训练数据输入至调高BERT模型进行预测操作,得到均值结果标签。In step S303, the original training data is input into the temperature-raised BERT model for a prediction operation to obtain mean-result labels.
在本申请实施例中,每一个原始训练数据在每一个原始BERT模型可以得到其最终的分类概率向量,选取其中概率值最大的即为该模型对于当前原始训练数据的判定结果。对于t个原始BERT模型就可以输出t个概率向量,然后对t个概率向量求取均值作为当前原始训练数据最后的概率输出向量,当所有原始训练数据完成预测操作之后,得到该原始训练数据对应的均值结果标签。In this embodiment of the present application, each piece of original training data obtains its final classification probability vector from each original BERT model, and the class with the largest probability is taken as that model's judgment result for the current original training data. For t original BERT models, t probability vectors are output, and the mean of the t probability vectors is taken as the final probability output vector of the current original training data. After the prediction operation is completed for all the original training data, the mean-result labels corresponding to the original training data are obtained.
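A minimal sketch of the temperature-raised prediction and averaging described above is given below (PyTorch); teachers and batch are assumed placeholders, and T = 20 follows the example given earlier.

    import torch
    import torch.nn.functional as F

    T = 20.0                                            # raised softmax temperature
    with torch.no_grad():
        # One softened probability vector per teacher model, then the element-wise mean.
        probs = [F.softmax(model(**batch).logits / T, dim=-1) for model in teachers]
    soft_target = torch.stack(probs).mean(dim=0)        # the mean-result (soft) labels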
在步骤S304中,基于标签信息在原始训练数据进行筛选操作,得到带标签的筛选结果标签。In step S304, a screening operation is performed on the original training data based on the label information to obtain a labelled screening result label.
在本申请实施例中,由于在训练原始BERT模型时,会对部分样本数据附上标签数据,为获得有映射关系的训练数据,需要根据是否携带标签数据为条件对原始训练数据进行筛选操作,以得到有映射关系的训练数据,作为该筛选结果标签。In the embodiment of the present application, since label data will be attached to some sample data when training the original BERT model, in order to obtain training data with a mapping relationship, it is necessary to perform a screening operation on the original training data according to whether the label data is carried as a condition, In order to obtain the training data with the mapping relationship as the label of the screening result.
在步骤S305中,基于放大训练数据以及筛选训练数据选取精简模型训练数据。In step S305, the reduced model training data is selected based on the enlarged training data and the filtered training data.
在本申请实施例中,选取到的精简模型训练数据可表示为:In the embodiment of the present application, the selected training data of the reduced model can be expressed as:
Target = a * hard_target + b * soft_target, (a + b = 1)
其中,Target表示最终作为中间精简模型训练数据的标签;hard_target表示筛选结果标签;soft_target表示均值结果标签;a、b表示控制标签融合的权重。Among them, Target represents the label that is finally used as the training data of the intermediate reduced model; hard_target represents the label of the screening result; soft_target represents the label of the mean result; a and b represent the weight of the control label fusion.
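For illustration, the label fusion above could be computed as follows; the weights a and b are example values rather than values from the application, hard_target is assumed to be the one-hot screening-result label, and soft_target is the mean-result label from the sketch above.

    a, b = 0.7, 0.3                              # control weights, a + b = 1 (assumed example values)
    target = a * hard_target + b * soft_target   # final training label for the intermediate reduced model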
继续参阅图4,示出了本申请实施例一提供的参数优化操作的实现流程图,为了便于说明,仅示出与本申请相关的部分。Continuing to refer to FIG. 4 , a flowchart for realizing the parameter optimization operation provided in Embodiment 1 of the present application is shown. For the convenience of description, only the part related to the present application is shown.
在本申请实施例一的一些可选的实现方式中,在上述步骤S106之后,上述方法还包括:步骤S401、步骤S402、步骤S403以及步骤S404。In some optional implementation manners of Embodiment 1 of the present application, after the foregoing step S106, the foregoing method further includes: step S401, step S402, step S403, and step S404.
在步骤S401中,在本地数据库中获取优化训练数据。In step S401, the optimized training data is obtained from the local database.
在本申请实施例中,优化训练数据主要用于优化目标精简模型的参数,该优化训练数据分别输入至训练好的原始BERT模型和目标精简模型,在保证输入数据一致的前提下,可获知原始BERT模型和目标精简模型各个transformer层输出的差异。In this embodiment of the present application, the optimization training data is mainly used to optimize the parameters of the target reduced model. The optimization training data is input into the trained original BERT model and the target reduced model respectively, and on the premise that the input data is the same, the difference between the outputs of each transformer layer of the original BERT model and of the target reduced model can be obtained.
在步骤S402中,将优化训练数据分别输入至训练好的原始BERT模型以及目标精简模型中,分别得到原始transformer层输出数据以及目标transformer层输出数据。In step S402, the optimized training data is input into the trained original BERT model and the target reduced model respectively, and the original transformer layer output data and the target transformer layer output data are obtained respectively.
在步骤S403中,基于搬土距离计算原始transformer层输出数据以及目标transformer层输出数据的蒸馏损失数据。In step S403, the distillation loss data of the output data of the original transformer layer and the output data of the target transformer layer are calculated based on the soil removal distance.
在本申请实施例中,搬土距离(EMD)是在一个区域D上两个概率分布之间的距离的度量。可分别获取原始transformer层和目标transformer层分别输出的attention(注意力)矩阵数据,并计算二者attention(注意力)矩阵数据的注意力EMD距离;再获取原始transformer层和目标transformer层分别输出的FFN(全连接前馈神经网络)隐层矩阵数据,并计算二者FFN隐层矩阵数据的FFN隐层EMD距离,以得到该蒸馏损失数据。In the embodiment of the present application, the earth removal distance (EMD) is a measure of the distance between two probability distributions on a region D. The attention (attention) matrix data output by the original transformer layer and the target transformer layer can be obtained respectively, and the attention EMD distance of the attention (attention) matrix data of the two can be calculated; then the original transformer layer and the target transformer layer output respectively. FFN (Fully Connected Feedforward Neural Network) hidden layer matrix data, and calculate the FFN hidden layer EMD distance of the two FFN hidden layer matrix data to obtain the distillation loss data.
在步骤S404中,根据蒸馏损失数据对目标精简模型进行参数优化操作,得到优化精简模型。In step S404, a parameter optimization operation is performed on the target reduced model according to the distillation loss data to obtain an optimized reduced model.
在本申请实施例中,在获知蒸馏损失数据(即原始transformer层输出数据以及目标transformer层输出数据的距离度量)后,对目标精简模型中的参数进行优化,直至蒸馏损失数据小于预设值,或者训练的次数满足预设次数,从而获得该优化精简模型。In this embodiment of the present application, after the distillation loss data (i.e., the distance metric between the output data of the original transformer layers and the output data of the target transformer layers) is obtained, the parameters of the target reduced model are optimized until the distillation loss data is smaller than a preset value or the number of training rounds reaches a preset number, thereby obtaining the optimized reduced model.
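A minimal sketch of this parameter optimization loop is shown below (PyTorch, continuing the teacher/student sketches above); distillation_loss, optimization_batches, max_epochs and loss_threshold are assumed placeholders standing in for the EMD-based loss and the preset values described here.

    import torch

    optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)
    for epoch in range(max_epochs):
        for batch in optimization_batches:
            loss = distillation_loss(teacher, student, batch)   # EMD-based loss, see below
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if loss.item() < loss_threshold:                        # stop once the preset value is reached
            break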
在本申请实施例中,由于目标精简模型的transformer层是基于伯努利分布概率进行选取的,从而导致该目标精简模型的参数存在一定的误差,由于Bert模型中的transformer层对模型的贡献最大,包含的信息最丰富,精简模型在该层的学习能力也最为重要,因此通过采用"搬土距离EMD"计算原始BERT模型transformer层的输出以及目标精简模型transformer层的输出之间的损失数据,并基于该损失数据对该目标精简模型的参数进行优化,以提高该目标精简模型的准确率,能够保证目标模型学习到更多的原始模型的知识。In this embodiment of the present application, since the transformer layers of the target reduced model are selected based on Bernoulli distribution probabilities, the parameters of the target reduced model contain a certain error. Because the transformer layers in the BERT model contribute the most to the model, contain the richest information, and are where the learning ability of the reduced model matters most, the loss data between the output of the transformer layers of the original BERT model and the output of the transformer layers of the target reduced model is calculated by using the earth mover's distance (EMD), and the parameters of the target reduced model are optimized based on this loss data to improve the accuracy of the target reduced model, which ensures that the target model learns more of the original model's knowledge.
继续参阅图5,示出了图4中步骤S403的实现流程图,为了便于说明,仅示出与本申请相关的部分。Continuing to refer to FIG. 5 , a flowchart of the implementation of step S403 in FIG. 4 is shown. For convenience of description, only the part related to the present application is shown.
在本申请实施例一的一些可选的实现方式中,上述步骤S403具体包括:步骤S501、步骤S502、步骤503、步骤S504以及步骤S505。In some optional implementation manners of Embodiment 1 of the present application, the foregoing step S403 specifically includes: step S501 , step S502 , step 503 , step S504 and step S505 .
在步骤S501中,获取原始transformer层输出的原始注意力矩阵以及目标transformer层输出的目标注意力矩阵。In step S501, the original attention matrix output by the original transformer layer and the target attention matrix output by the target transformer layer are obtained.
在步骤S502中,根据原始注意力矩阵以及目标注意力矩阵计算注意力EMD距离。In step S502, the attention EMD distance is calculated according to the original attention matrix and the target attention matrix.
在本申请实施例中,注意力EMD距离表示为:In this embodiment of the present application, the attention EMD distance is expressed as:
L_{attn} = \sum_{i=1}^{M} \sum_{j=1}^{N} f_{ij} \, \mathrm{MSE}\big(A_i^T, A_j^S\big)
其中,L_attn表示注意力EMD距离;A^T表示原始注意力矩阵,A^S表示目标注意力矩阵;MSE(A_i^T, A_j^S)表示第i层原始transformer层的原始注意力矩阵A_i^T与第j层目标transformer层的目标注意力矩阵A_j^S之间的均方误差;f_ij表示从第i层原始transformer层迁移到第j层目标transformer层的知识量;M表示原始transformer层的层数;N表示目标transformer层的层数。Here, L_attn denotes the attention EMD distance; A^T denotes the original attention matrices and A^S the target attention matrices; MSE(A_i^T, A_j^S) denotes the mean squared error between the original attention matrix A_i^T of the i-th original transformer layer and the target attention matrix A_j^S of the j-th target transformer layer; f_ij denotes the amount of knowledge transferred from the i-th original transformer layer to the j-th target transformer layer; M denotes the number of original transformer layers; and N denotes the number of target transformer layers.
在步骤S503中,获取原始transformer层输出的原始FFN隐层矩阵以及目标transformer层输出的目标FFN隐层矩阵。In step S503, the original FFN hidden layer matrix output by the original transformer layer and the target FFN hidden layer matrix output by the target transformer layer are obtained.
在步骤S504中,根据原始FFN隐层矩阵以及目标FFN隐层矩阵计算FFN隐层EMD距离。In step S504, the FFN hidden layer EMD distance is calculated according to the original FFN hidden layer matrix and the target FFN hidden layer matrix.
在本申请实施例中,FFN隐层EMD距离表示为:In this embodiment of the present application, the EMD distance of the FFN hidden layer is expressed as:
L_{ffn} = \sum_{i=1}^{M} \sum_{j=1}^{N} f_{ij} \, \mathrm{MSE}\big(H_j^S W_h, H_i^T\big)
其中,L_ffn表示FFN隐层EMD距离;H^T表示原始transformer层的原始FFN隐层矩阵,H^S表示目标transformer层的目标FFN隐层矩阵;W_h表示转换矩阵;MSE(H_j^S W_h, H_i^T)表示第j层目标transformer层的目标FFN隐层矩阵H_j^S经转换矩阵W_h变换后与第i层原始transformer层的原始FFN隐层矩阵H_i^T之间的均方误差;f_ij表示从第i层原始transformer层迁移到第j层目标transformer层的知识量;M表示原始transformer层的层数;N表示目标transformer层的层数。Here, L_ffn denotes the FFN hidden-layer EMD distance; H^T denotes the original FFN hidden-layer matrices of the original transformer layers and H^S the target FFN hidden-layer matrices of the target transformer layers; W_h denotes a transformation matrix; MSE(H_j^S W_h, H_i^T) denotes the mean squared error between the target FFN hidden-layer matrix H_j^S of the j-th target transformer layer, transformed by W_h, and the original FFN hidden-layer matrix H_i^T of the i-th original transformer layer; f_ij denotes the amount of knowledge transferred from the i-th original transformer layer to the j-th target transformer layer; M denotes the number of original transformer layers; and N denotes the number of target transformer layers.
在步骤S505中,基于注意力EMD距离以及FFN隐层EMD距离获得蒸馏损失数据。In step S505, the distillation loss data is obtained based on the attention EMD distance and the FFN hidden layer EMD distance.
在本申请实施例中,transformer层是Bert模型中的重要组成部分,通过自注意力机制可以捕获长距离依赖关系,一个标准的transformer主要包括两部分:多头注意力机制(Multi-Head Attention,MHA)和全连接前馈神经网络(FFN)。EMD是使用线性规划计算两个分布之间最优距离的方法,可以使知识的蒸馏更加合理。In this embodiment of the present application, the transformer layer is an important component of the BERT model, and long-distance dependencies can be captured through the self-attention mechanism. A standard transformer mainly includes two parts: the multi-head attention mechanism (Multi-Head Attention, MHA) and the fully connected feed-forward network (FFN). EMD is a method that uses linear programming to calculate the optimal distance between two distributions, which makes the distillation of knowledge more reasonable.
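As an illustration of how such an EMD-based layer-matching loss could be computed with linear programming, the following sketch assumes PyTorch, NumPy and SciPy; the uniform layer weights and the helper names are assumptions, and the transformation matrix W_h is omitted on the assumption that teacher and student share the same hidden size.

    import numpy as np
    import torch
    import torch.nn.functional as F
    from scipy.optimize import linprog

    def emd_flow(cost, w_t, w_s):
        # Transportation problem: minimize sum_ij f_ij * cost_ij subject to row sums w_t
        # (teacher layer weights) and column sums w_s (student layer weights), f_ij >= 0.
        M, N = cost.shape
        A_eq, b_eq = [], []
        for i in range(M):
            row = np.zeros(M * N); row[i * N:(i + 1) * N] = 1.0
            A_eq.append(row); b_eq.append(w_t[i])
        for j in range(N):
            col = np.zeros(M * N); col[j::N] = 1.0
            A_eq.append(col); b_eq.append(w_s[j])
        res = linprog(cost.reshape(-1), A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, None))
        return res.x.reshape(M, N)                      # the knowledge flow f_ij

    def emd_loss(teacher_mats, student_mats):
        # Cost matrix: MSE between every teacher-layer output and every student-layer output.
        M, N = len(teacher_mats), len(student_mats)
        cost = torch.stack([torch.stack([F.mse_loss(t, s) for s in student_mats]) for t in teacher_mats])
        flow = emd_flow(cost.detach().cpu().numpy(), np.full(M, 1.0 / M), np.full(N, 1.0 / N))
        return (torch.as_tensor(flow, dtype=cost.dtype) * cost).sum()

    # total_loss = emd_loss(teacher_attentions, student_attentions) + emd_loss(teacher_hiddens, student_hiddens)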
在本申请实施例一的一些可选的实现方式中,注意力EMD距离表示为:In some optional implementations of Embodiment 1 of the present application, the attention EMD distance is expressed as:
L_{attn} = \sum_{i=1}^{M} \sum_{j=1}^{N} f_{ij} \, \mathrm{MSE}\big(A_i^T, A_j^S\big)
其中,L_attn表示注意力EMD距离;A^T表示原始注意力矩阵,A^S表示目标注意力矩阵;MSE(A_i^T, A_j^S)表示第i层原始transformer层的原始注意力矩阵A_i^T与第j层目标transformer层的目标注意力矩阵A_j^S之间的均方误差;f_ij表示从第i层原始transformer层迁移到第j层目标transformer层的知识量;M表示原始transformer层的层数;N表示目标transformer层的层数。Here, L_attn denotes the attention EMD distance; A^T denotes the original attention matrices and A^S the target attention matrices; MSE(A_i^T, A_j^S) denotes the mean squared error between the original attention matrix A_i^T of the i-th original transformer layer and the target attention matrix A_j^S of the j-th target transformer layer; f_ij denotes the amount of knowledge transferred from the i-th original transformer layer to the j-th target transformer layer; M denotes the number of original transformer layers; and N denotes the number of target transformer layers.
在本申请实施例一的一些可选的实现方式中,FFN隐层EMD距离表示为:In some optional implementations of Embodiment 1 of the present application, the FFN hidden layer EMD distance is expressed as:
L_{ffn} = \sum_{i=1}^{M} \sum_{j=1}^{N} f_{ij} \, \mathrm{MSE}\big(H_j^S W_h, H_i^T\big)
其中,L_ffn表示FFN隐层EMD距离;H^T表示原始transformer层的原始FFN隐层矩阵,H^S表示目标transformer层的目标FFN隐层矩阵;W_h表示转换矩阵;MSE(H_j^S W_h, H_i^T)表示第j层目标transformer层的目标FFN隐层矩阵H_j^S经转换矩阵W_h变换后与第i层原始transformer层的原始FFN隐层矩阵H_i^T之间的均方误差;f_ij表示从第i层原始transformer层迁移到第j层目标transformer层的知识量;M表示原始transformer层的层数;N表示目标transformer层的层数。Here, L_ffn denotes the FFN hidden-layer EMD distance; H^T denotes the original FFN hidden-layer matrices of the original transformer layers and H^S the target FFN hidden-layer matrices of the target transformer layers; W_h denotes a transformation matrix; MSE(H_j^S W_h, H_i^T) denotes the mean squared error between the target FFN hidden-layer matrix H_j^S of the j-th target transformer layer, transformed by W_h, and the original FFN hidden-layer matrix H_i^T of the i-th original transformer layer; f_ij denotes the amount of knowledge transferred from the i-th original transformer layer to the j-th target transformer layer; M denotes the number of original transformer layers; and N denotes the number of target transformer layers.
综上,本申请实施例一提供了一种应用于BERT模型的蒸馏方法,接收用户终端发送的模型蒸馏请求,模型蒸馏请求至少携带有蒸馏对象标识以及蒸馏系数;读取本地数据库,在本地数据库中获取与蒸馏对象标识相对应的训练好的原始BERT模型,原始BERT模型的损失函数为交叉熵;构建与训练好的原始BERT模型结构一致的待训练的默认精简模型,默认精简模型的损失函数为交叉熵;基于蒸馏系数对默认精简模型进行蒸馏操作,得到中间精简模型;在本地数据库中获取中间精简模型的训练数据;基于训练数据对中间精简模型进行模型训练操作,得到目标精简模型。由于精简BERT模型保留了与原始BERT模型相同的模型结构,差异是层数的不同,使得代码改动量较小,而且大模型与小模型的预测代码是一致的,可以复用原代码,使得模型在蒸馏的过程中,无需平衡各个loss参数的权重,进而降低深度模型蒸馏方法的困难程度,同时,训练精简BERT模型各个阶段的任务均保持一致性,使得精简BERT模型收敛得更加稳定。另外,基于层替换的蒸馏方式,保留了与BERT相同的模型结构,差异是层数的不同,使得代码改动量较小,而且大模型与小模型的预测代码是一致的,可以复用原代码,由于蒸馏时,小模型的部分层基于伯努利采样,随机初始化成训练好的大模型映射层的权重,使模型收敛更快,减少训练轮数。To sum up, Embodiment 1 of the present application provides a distillation method applied to a BERT model: receiving a model distillation request sent by a user terminal, where the model distillation request carries at least a distillation object identifier and a distillation coefficient; reading a local database and obtaining, from the local database, a trained original BERT model corresponding to the distillation object identifier, the loss function of the original BERT model being cross entropy; constructing a default reduced model to be trained whose structure is consistent with that of the trained original BERT model, the loss function of the default reduced model being cross entropy; performing a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model; obtaining training data of the intermediate reduced model from the local database; and performing a model training operation on the intermediate reduced model based on the training data to obtain a target reduced model. Since the reduced BERT model retains the same model structure as the original BERT model and differs only in the number of layers, the amount of code change is small, the prediction code of the large model and the small model is the same, and the original code can be reused, so there is no need to balance the weights of the individual loss parameters during distillation, which reduces the difficulty of the deep model distillation method; at the same time, the tasks at each stage of training the reduced BERT model remain consistent, so that the reduced BERT model converges more stably. In addition, the layer-replacement based distillation retains the same model structure as BERT and differs only in the number of layers, so the amount of code change is small, the prediction code of the large and small models is the same, and the original code can be reused; and because, during distillation, some layers of the small model are randomly initialized, based on Bernoulli sampling, to the weights of the mapped layers of the trained large model, the model converges faster and the number of training rounds is reduced.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,该计算机可读指令可存储于一计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。其中,前述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)等非易失性存储介质,或随机存储记忆体(Random Access Memory,RAM)等。Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through computer-readable instructions, and the computer-readable instructions can be stored in a computer-readable storage medium. , when the computer-readable instructions are executed, the processes of the above-mentioned method embodiments may be included. Wherein, the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.
应该理解的是,虽然附图的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,其可以以其他的顺序执行。而且,附图的流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,其执行顺序也不必然是依次进行,而是可以与其他步骤或者其他步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。It should be understood that although the various steps in the flowchart of the accompanying drawings are sequentially shown in the order indicated by the arrows, these steps are not necessarily executed in sequence in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited to the order and may be performed in other orders. Moreover, at least a part of the steps in the flowchart of the accompanying drawings may include multiple sub-steps or multiple stages, and these sub-steps or stages are not necessarily executed at the same time, but may be executed at different times, and the execution sequence is also It does not have to be performed sequentially, but may be performed alternately or alternately with other steps or at least a portion of sub-steps or stages of other steps.
实施例二Embodiment 2
进一步参考图6,作为对上述图1所示方法的实现,本申请提供了一种应用于BERT模型的蒸馏装置的一个实施例,该装置实施例与图1所示的方法实施例相对应,该装置具体可以应用于各种电子设备中。Further referring to FIG. 6 , as an implementation of the method shown in FIG. 1 above, the present application provides an embodiment of a distillation apparatus applied to a BERT model, and the apparatus embodiment corresponds to the method embodiment shown in FIG. 1 , Specifically, the device can be applied to various electronic devices.
如图6所示,本实施例的应用于BERT模型的蒸馏装置100包括:请求接收模块110、原始模型获取模块120、默认模型构建模块130、蒸馏操作模块140、训练数据获取模块150以及模型训练模块160。其中:As shown in FIG. 6, the distillation apparatus 100 applied to the BERT model in this embodiment includes: a request receiving module 110, an original model acquisition module 120, a default model construction module 130, a distillation operation module 140, a training data acquisition module 150, and a model training module 160, wherein:
请求接收模块110,用于接收用户终端发送的模型蒸馏请求,模型蒸馏请求至少携带有蒸馏对象标识以及蒸馏系数;The request receiving module 110 is configured to receive a model distillation request sent by the user terminal, where the model distillation request at least carries a distillation object identifier and a distillation coefficient;
原始模型获取模块120,用于读取本地数据库,在本地数据库中获取与蒸馏对象标识相对应的训练好的原始BERT模型,原始BERT模型的损失函数为交叉熵;The original model obtaining module 120 is used to read the local database, and obtain the trained original BERT model corresponding to the distillation object identifier in the local database, and the loss function of the original BERT model is cross entropy;
默认模型构建模块130,用于构建与训练好的原始BERT模型结构一致的待训练的默认精简模型,默认精简模型的损失函数为交叉熵;The default model building module 130 is used to construct a default reduced model to be trained that is consistent with the trained original BERT model structure, and the loss function of the default reduced model is cross entropy;
蒸馏操作模块140,用于基于蒸馏系数对默认精简模型进行蒸馏操作,得到中间精简模型;a distillation operation module 140, configured to perform a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model;
训练数据获取模块150,用于在本地数据库中获取中间精简模型的训练数据;A training data acquisition module 150, configured to acquire the training data of the intermediate reduced model in the local database;
模型训练模块160,用于基于训练数据对中间精简模型进行模型训练操作,得到目标精简模型。The model training module 160 is configured to perform a model training operation on the intermediate reduced model based on the training data to obtain the target reduced model.
在本申请实施例中,用户终端指的是用于执行本申请提供的应用于BERT模型的蒸馏方法的终端设备,该用户终端可以是诸如移动电话、智能电话、笔记本电脑、数字广播接收器、PDA(个人数字助理)、PAD(平板电脑)、PMP(便携式多媒体播放器)、导航装置等等的移动终端以及诸如数字TV、台式计算机等等的固定终端,应当理解,此处对用户终端的举例仅为方便理解,不用于限定本申请。In this embodiment of the present application, the user terminal refers to the terminal device used to execute the distillation method applied to the BERT model provided by the present application. The user terminal may be a mobile terminal such as a mobile phone, a smart phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), or a navigation device, or a fixed terminal such as a digital TV or a desktop computer. It should be understood that the examples of user terminals here are only for ease of understanding and are not used to limit the present application.
在本申请实施例中,蒸馏对象标识主要用于唯一标识需要蒸馏的模型对象,该蒸馏对象标识可以是基于模型名称命名,作为示例,例如:视觉识别模型、语音识别模型等等;该蒸馏对象标识可以是基于名称简称进行命名,作为示例,例如:sjsbmx、yysbmx等等;该蒸馏对象标识还可以是序号进行命名,作为示例,例如:001、002等等,应当理解,此处对蒸馏对象标识的举例仅为方便理解,不用于限定本申请。In this embodiment of the present application, the distillation object identifier is mainly used to uniquely identify the model object that needs to be distilled. The distillation object identifier may be named after the model name, for example, a visual recognition model, a speech recognition model, and so on; it may be named after an abbreviated name, for example, sjsbmx, yysbmx, and so on; or it may be named with a serial number, for example, 001, 002, and so on. It should be understood that the examples of distillation object identifiers here are only for ease of understanding and are not used to limit the present application.
在本申请实施例中,蒸馏系数主要用于确认将原始BERT模型的层数缩小的倍数,作为示例,例如:需要将BERT模型从12层蒸馏至4层,那么该蒸馏系数则为3,应当理解,此处对蒸馏系数的举例仅为方便理解,不用于限定本申请。In the embodiment of this application, the distillation coefficient is mainly used to confirm the multiple of reducing the number of layers of the original BERT model. As an example, for example, if the BERT model needs to be distilled from 12 layers to 4 layers, then the distillation coefficient is 3, which should be It is understood that the examples of distillation coefficients here are only for convenience of understanding, and are not intended to limit the present application.
在本申请实施例中,本地数据库是指驻留于运行客户应用程序的机器的数据库。本地数据库提供最快的响应时间,因为在客户(应用程序)和服务器之间没有网络传输。该本地数据库预先存储有各式各样的训练好的原始BERT模型,以解决在计算机视觉、语音识别等诸多领域存在的问题。In this embodiment of the present application, the local database refers to a database residing on the machine that runs the client application. The local database provides the fastest response time because there is no network transmission between the client (application) and the server. The local database pre-stores a variety of trained original BERT models to solve problems in many fields such as computer vision and speech recognition.
在本申请实施例中,Bert模型可以分为向量(embedding)层、转换器(transformer)层和预测(prediction)层,每种层是知识的不同表示形式。该原始BERT模型由12层transformer(一种基于“encoder-decoder”结构的模型)组成,该原始BERT模型选用的是交叉熵作为损失函数。该交叉熵主要用于度量两个概率分布间的差异性信息。语言模型的性能通常用交叉熵和复杂度(perplexity)来衡量。交叉熵的意义是用该模型对文本识别的难度,或者从压缩的角度来看,每个词平均要用几个位来编码。复杂度的意义是用该模型表示这一文本平均的分支数,其倒数可视为每个词的平均概率。平滑是指对没观察到的N元组合赋予一个概率值,以保证词序列总能通过语言模型得到一个概率值。In this embodiment of the present application, the Bert model can be divided into a vector (embedding) layer, a transformer (transformer) layer, and a prediction (prediction) layer, each of which is a different representation of knowledge. The original BERT model consists of a 12-layer transformer (a model based on an "encoder-decoder" structure), and the original BERT model uses cross-entropy as the loss function. The cross entropy is mainly used to measure the difference information between two probability distributions. The performance of language models is usually measured by cross-entropy and perplexity. The meaning of cross-entropy is the difficulty of text recognition with the model, or from a compression point of view, how many bits are used to encode each word on average. The meaning of complexity is to use the model to represent the average number of branches of this text, and its inverse can be regarded as the average probability of each word. Smoothing refers to assigning a probability value to the unobserved N-gram combination to ensure that the word sequence can always obtain a probability value through the language model.
在本申请实施例中,构建出来的默认精简模型保留了与BERT相同的模型结构,不同之处在于transformer层的数量。In the embodiment of the present application, the constructed default reduced model retains the same model structure as BERT, the difference lies in the number of transformer layers.
在本申请实施例中,蒸馏操作具体包括蒸馏transformer层以及参数初始化。In this embodiment of the present application, the distillation operation specifically includes the distillation of the transformer layer and parameter initialization.
在本申请实施例中,蒸馏transformer层指的是倘若蒸馏系数为3,那么训练好的原始BERT模型的第一至第三层将替换至默认精简模型的第一层;训练好的原始BERT模型的第四至第六层将替换至默认精简模型的第二层;训练好的原始BERT模型的第七至第九层将替换至默认精简模型的第三层;训练好的原始BERT模型的第十至第十二层将替换至默认精简模型的第四层。In this embodiment of the present application, distilling the transformer layers means that, if the distillation coefficient is 3, the first to third layers of the trained original BERT model are used to replace the first layer of the default reduced model; the fourth to sixth layers of the trained original BERT model are used to replace the second layer of the default reduced model; the seventh to ninth layers of the trained original BERT model are used to replace the third layer of the default reduced model; and the tenth to twelfth layers of the trained original BERT model are used to replace the fourth layer of the default reduced model.
在本申请实施例中,在进行蒸馏替换的过程中,可采用伯努利分布概率确定每一层被替换的概率。In this embodiment of the present application, in the process of distillation replacement, the probability of each layer being replaced may be determined by using the Bernoulli distribution probability.
在本申请实施例中,参数初始化指的是embedding、pooler、全连接层参数依据训练好的原始BERT模型中各层级的参数,替换至默认精简模型对应的参数位置。In the embodiment of the present application, parameter initialization refers to replacing the parameters of the embedding, pooler, and fully connected layers to the parameter positions corresponding to the default simplified model according to the parameters of each level in the trained original BERT model.
在本申请实施例中,精简模型训练数据可以采用训练上述原始BERT模型得到的有标签数据,也可以是额外的无标签数据。In the embodiment of the present application, the training data of the reduced model may be labeled data obtained by training the above-mentioned original BERT model, or may be additional unlabeled data.
在本申请实施例中,可获取原始BERT模型训练后的原始训练数据;调高原始BERT模型softmax层的温度参数,得到调高BERT模型,将原始训练数据输入至调高BERT模型进行预测操作,得到均值结果标签;基于标签信息在原始训练数据上进行筛选操作,得到带标签的筛选结果标签;基于放大训练数据以及筛选训练数据选取精简模型训练数据。In this embodiment of the present application, the original training data used to train the original BERT model may be obtained; the temperature parameter of the softmax layer of the original BERT model is raised to obtain a temperature-raised BERT model, and the original training data is input into the temperature-raised BERT model for a prediction operation to obtain mean-result labels; a screening operation is performed on the original training data based on the label information to obtain labeled screening-result labels; and the reduced-model training data is selected based on the amplified training data and the screened training data.
在本申请实施例中,提供了一种应用于BERT模型的蒸馏装置,由于精简BERT模型保留了与原始BERT模型相同的模型结构,差异是层数的不同,使得代码改动量较小,而且大模型与小模型的预测代码是一致的,可以复用原代码,使得模型在蒸馏的过程中,无需平衡各个loss参数的权重,进而降低深度模型蒸馏方法的困难程度,同时,训练精简BERT模型各个阶段的任务均保持一致性,使得精简BERT模型收敛得更加稳定。In the embodiment of the present application, a distillation apparatus applied to a BERT model is provided. Since the reduced BERT model retains the same model structure as the original BERT model and differs only in the number of layers, the amount of code change is small, and the prediction code of the large model and the small model is the same, so the original code can be reused; as a result, there is no need to balance the weights of the individual loss parameters during distillation, which reduces the difficulty of the deep model distillation method, and at the same time the tasks at each stage of training the reduced BERT model remain consistent, so that the reduced BERT model converges more stably.
在本申请实施例二的一些可选的实现方式中,上述蒸馏操作模块140具体包括:分组操作子模块、提取操作子模块以及替换操作子模块。其中:In some optional implementations of the second embodiment of the present application, the above-mentioned distillation operation module 140 specifically includes: a grouping operation sub-module, an extraction operation sub-module, and a replacement operation sub-module. in:
分组操作子模块,用于基于蒸馏系数对原始BERT模型的transformer层进行分组操作,得到分组transformer层;The grouping operation sub-module is used to group the transformer layer of the original BERT model based on the distillation coefficient to obtain the grouped transformer layer;
提取操作子模块,用于基于伯努利分布分别在分组transformer层中进行提取操作,得到待替换transformer层;The extraction operation sub-module is used to perform extraction operations in the grouped transformer layers based on the Bernoulli distribution to obtain the transformer layers to be replaced;
替换操作子模块,用于将待替换transformer层分别替换至默认精简模型,得到中间 精简模型。The replacement operation sub-module is used to replace the transformer layer to be replaced with the default reduced model respectively to obtain the intermediate reduced model.
In some optional implementations of the second embodiment of the present application, the above-mentioned training data acquisition module 150 specifically includes: an original training data acquisition sub-module, a parameter raising sub-module, a prediction operation sub-module, a screening operation sub-module, and a training data acquisition sub-module. Wherein:
原始训练数据获取子模块,用于获取原始BERT模型训练后的原始训练数据;The original training data acquisition sub-module is used to obtain the original training data after the original BERT model training;
The parameter raising sub-module is used to raise the temperature parameter of the softmax layer of the original BERT model to obtain a raised-temperature BERT model;
The prediction operation sub-module is used to input the original training data into the raised-temperature BERT model for a prediction operation to obtain mean result labels;
The screening operation sub-module is used to perform a screening operation on the original training data based on the label information to obtain labeled screening results;
The training data acquisition sub-module is used to select the reduced-model training data based on the amplified training data and the screened training data.
In some optional implementations of the second embodiment of the present application, the above-mentioned distillation apparatus 100 applied to the BERT model further includes: an optimized training data acquisition module, a distillation loss data calculation module, and a parameter optimization module. Wherein:
优化训练数据获取模块,用于在本地数据库中获取优化训练数据;The optimized training data acquisition module is used to obtain optimized training data in the local database;
优化训练数据输入模块,用于将优化训练数据分别输入至训练好的原始BERT模型以及目标精简模型中,分别得到原始transformer层输出数据以及目标transformer层输出数据;The optimized training data input module is used to input the optimized training data into the trained original BERT model and the target reduced model, respectively, to obtain the original transformer layer output data and the target transformer layer output data;
The distillation loss data calculation module is used to calculate the distillation loss data of the original transformer layer output data and the target transformer layer output data based on the earth mover's distance (EMD);
参数优化模块,用于根据蒸馏损失数据对目标精简模型进行参数优化操作,得到优化精简模型。The parameter optimization module is used to optimize the parameters of the target reduced model according to the distillation loss data to obtain the optimized reduced model.
In some optional implementations of the second embodiment of the present application, the above-mentioned distillation loss data calculation module specifically includes: a target attention matrix acquisition sub-module, an attention EMD distance calculation sub-module, a target FFN hidden layer matrix acquisition sub-module, an FFN hidden layer EMD distance calculation sub-module, and a distillation loss data acquisition sub-module. Wherein:
目标注意力矩阵获取子模块,用于获取原始transformer层输出的原始注意力矩阵以及目标transformer层输出的目标注意力矩阵;The target attention matrix acquisition sub-module is used to obtain the original attention matrix output by the original transformer layer and the target attention matrix output by the target transformer layer;
注意力EMD距离计算子模块,用于根据原始注意力矩阵以及目标注意力矩阵计算注意力EMD距离;The attention EMD distance calculation sub-module is used to calculate the attention EMD distance according to the original attention matrix and the target attention matrix;
目标FFN隐层矩阵获取子模块,用于获取原始transformer层输出的原始FFN隐层矩阵以及目标transformer层输出的目标FFN隐层矩阵;The target FFN hidden layer matrix acquisition sub-module is used to obtain the original FFN hidden layer matrix output by the original transformer layer and the target FFN hidden layer matrix output by the target transformer layer;
FFN隐层EMD距离计算子模块,用于根据原始FFN隐层矩阵以及目标FFN隐层矩阵计算FFN隐层EMD距离;The FFN hidden layer EMD distance calculation sub-module is used to calculate the FFN hidden layer EMD distance according to the original FFN hidden layer matrix and the target FFN hidden layer matrix;
蒸馏损失数据获取子模块,用于基于注意力EMD距离以及FFN隐层EMD距离获得蒸馏损失数据。The distillation loss data acquisition sub-module is used to obtain distillation loss data based on the attention EMD distance and the FFN hidden layer EMD distance.
在本申请实施例二的一些可选的实现方式中,注意力EMD距离表示为:In some optional implementations of the second embodiment of the present application, the attention EMD distance is expressed as:
L_{attn} = \frac{\sum_{i=1}^{M}\sum_{j=1}^{N} f_{ij}\, D_{ij}^{A}}{\sum_{i=1}^{M}\sum_{j=1}^{N} f_{ij}}, \qquad D_{ij}^{A} = \mathrm{MSE}\left(A_{i}^{T}, A_{j}^{S}\right)

where L_{attn} denotes the attention EMD distance; A^{T} denotes the original attention matrices and A^{S} the target attention matrices; D_{ij}^{A} denotes the mean squared error between the original attention matrix A_{i}^{T} of the i-th original transformer layer and the target attention matrix A_{j}^{S} of the j-th target transformer layer; f_{ij} denotes the amount of knowledge transferred from the i-th original transformer layer to the j-th target transformer layer; M denotes the number of original transformer layers; and N denotes the number of target transformer layers.
在本申请实施例二的一些可选的实现方式中,FFN隐层EMD距离表示为:In some optional implementation manners of the second embodiment of the present application, the FFN hidden layer EMD distance is expressed as:
L_{ffn} = \frac{\sum_{i=1}^{M}\sum_{j=1}^{N} f_{ij}\, D_{ij}^{H}}{\sum_{i=1}^{M}\sum_{j=1}^{N} f_{ij}}, \qquad D_{ij}^{H} = \mathrm{MSE}\left(H_{j}^{S} W_{h}, H_{i}^{T}\right)

where L_{ffn} denotes the FFN hidden layer EMD distance; H^{T} denotes the original FFN hidden layer matrices of the original transformer layers and H^{S} the target FFN hidden layer matrices of the target transformer layers; W_{h} denotes the transformation matrix; D_{ij}^{H} denotes the mean squared error between the original FFN hidden layer matrix H_{i}^{T} of the i-th original transformer layer and the transformed target FFN hidden layer matrix H_{j}^{S} W_{h} of the j-th target transformer layer; f_{ij} denotes the amount of knowledge transferred from the i-th original transformer layer to the j-th target transformer layer; M denotes the number of original transformer layers; and N denotes the number of target transformer layers.
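For concreteness, one way the two EMD terms defined above could be computed is sketched below: the ground cost is the mean squared error between per-layer matrices (assumed to have matching shapes), uniform layer weights are assumed, and the flow f_ij is obtained by solving the underlying transportation problem with SciPy's linear-programming solver. This is an illustrative sketch under those assumptions, not the patented implementation.

```python
import numpy as np
from scipy.optimize import linprog

def emd_distance(teacher_mats, student_mats):
    """EMD between per-layer teacher/student matrices with an MSE ground cost."""
    M, N = len(teacher_mats), len(student_mats)
    D = np.array([[np.mean((t - s) ** 2) for s in student_mats] for t in teacher_mats])
    w_t, w_s = np.full(M, 1.0 / M), np.full(N, 1.0 / N)   # uniform layer weights (assumption)
    A_eq, b_eq = [], []
    for i in range(M):                                    # teacher-side marginals
        row = np.zeros(M * N); row[i * N:(i + 1) * N] = 1.0
        A_eq.append(row); b_eq.append(w_t[i])
    for j in range(N):                                    # student-side marginals
        col = np.zeros(M * N); col[j::N] = 1.0
        A_eq.append(col); b_eq.append(w_s[j])
    res = linprog(D.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * (M * N), method="highs")
    f = res.x                                             # optimal flow f_ij, flattened
    return float(f @ D.ravel() / f.sum())

def distillation_loss(attn_T, attn_S, ffn_T, ffn_S, W_h=None):
    """L = L_attn + L_ffn; W_h (if given) maps the target FFN hidden states to the
    original hidden size before the MSE cost is computed."""
    ffn_S_proj = [h @ W_h for h in ffn_S] if W_h is not None else ffn_S
    return emd_distance(attn_T, attn_S) + emd_distance(ffn_T, ffn_S_proj)
```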
In summary, the second embodiment of the present application provides a distillation apparatus applied to the BERT model. Since the reduced BERT model retains the same model structure as the original BERT model and differs only in the number of layers, the amount of code change is small, the prediction code of the large model and the small model is the same, and the original code can be reused, so that the weights of the individual loss parameters do not need to be balanced during distillation, which reduces the difficulty of the deep-model distillation method; at the same time, the tasks in each stage of training the reduced BERT model remain consistent, so that the reduced BERT model converges more stably. In addition, the layer-replacement-based distillation retains the same model structure as BERT, differing only in the number of layers, so the code change is small and the prediction code of the large and small models is the same, allowing the original code to be reused; and because, during distillation, some layers of the small model are randomly initialized with the weights of the mapped layers of the trained large model based on Bernoulli sampling, the model converges faster and fewer training epochs are needed.
为解决上述技术问题,本申请实施例还提供计算机设备。具体请参阅图7,图7为本实施例计算机设备基本结构框图。To solve the above technical problems, the embodiments of the present application also provide computer equipment. For details, please refer to FIG. 7 , which is a block diagram of the basic structure of a computer device according to this embodiment.
所述计算机设备200包括通过系统总线相互通信连接存储器210、处理器220、网络接口230。需要指出的是,图中仅示出了具有组件210-230的计算机设备200,但是应理解的是,并不要求实施所有示出的组件,可以替代的实施更多或者更少的组件。其中,本技术领域技术人员可以理解,这里的计算机设备是一种能够按照事先设定或存储的指令,自动进行数值计算和/或信息处理的设备,其硬件包括但不限于微处理器、专用集成电路(Application Specific Integrated Circuit,ASIC)、可编程门阵列(Field-Programmable Gate Array,FPGA)、数字处理器(Digital Signal Processor,DSP)、嵌入式设备等。The computer device 200 includes a memory 210 , a processor 220 , and a network interface 230 that communicate with each other through a system bus. It should be noted that only the computer device 200 with components 210-230 is shown in the figure, but it should be understood that implementation of all of the shown components is not required, and more or less components may be implemented instead. Among them, those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to pre-set or stored instructions, and its hardware includes but is not limited to microprocessors, special-purpose Integrated circuit (Application Specific Integrated Circuit, ASIC), programmable gate array (Field-Programmable Gate Array, FPGA), digital processor (Digital Signal Processor, DSP), embedded equipment, etc.
所述计算机设备可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。所述计算机设备可以与用户通过键盘、鼠标、遥控器、触摸板或声控设备等方式进行人机交互。The computer equipment may be a desktop computer, a notebook computer, a palmtop computer, a cloud server and other computing equipment. The computer device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad or a voice control device.
所述存储器210至少包括一种类型的可读存储介质,所述可读存储介质包括闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、随机访问存储器(RAM)、静态随机访问存储器(SRAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、可编程只读存储器(PROM)、磁性存储器、磁盘、光盘等,所述计算机可读存储介质可以是非易失性,也可以是易失性。在一些实施例中,所述存储器210可以是所述计算机设备200的内部存储单元,例如该计算机设备200的硬盘或内存。在另一些实施例中,所述存储器210也可以是所述计算机设备200的外部存储设备,例如该计算机设备200上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。当然,所述存储器210还可以既包括所述计算机设备200的内部存储单元也包括其外部存储设备。本实施例中,所述存储器210通常用于存储安装于所述计算机设备200的操作系统和各类应用软件,例如应用于BERT模型的蒸馏方法的计算机可读指令等。此外,所述存储器210还可以用于暂时地存储已经输出或者将要输出的各类数据。The memory 210 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (eg, SD or DX memory, etc.), random access memory (RAM), static Random Access Memory (SRAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Programmable Read Only Memory (PROM), magnetic memory, magnetic disks, optical disks, etc., the computer readable storage Media can be non-volatile or volatile. In some embodiments, the memory 210 may be an internal storage unit of the computer device 200 , such as a hard disk or a memory of the computer device 200 . In other embodiments, the memory 210 may also be an external storage device of the computer device 200, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc. Of course, the memory 210 may also include both the internal storage unit of the computer device 200 and its external storage device. In this embodiment, the memory 210 is generally used to store the operating system and various application software installed on the computer device 200 , such as computer-readable instructions applied to the distillation method of the BERT model. In addition, the memory 210 can also be used to temporarily store various types of data that have been output or will be output.
所述处理器220在一些实施例中可以是中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。该处理器220通常用于控制所述计算机设备200的总体操作。本实施例中,所述处理器220用于运行所述存储器210中存储的计算机可读指令或者处理数据,例如运行所述应用于BERT模型的蒸馏方法的计算机可 读指令。The processor 220 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments. The processor 220 is typically used to control the overall operation of the computer device 200 . In this embodiment, the processor 220 is configured to execute the computer-readable instructions stored in the memory 210 or process data, for example, the computer-readable instructions for executing the distillation method applied to the BERT model.
所述网络接口230可包括无线网络接口或有线网络接口,该网络接口230通常用于在所述计算机设备200与其他电子设备之间建立通信连接。The network interface 230 may include a wireless network interface or a wired network interface, and the network interface 230 is generally used to establish a communication connection between the computer device 200 and other electronic devices.
上述应用于BERT模型的蒸馏方法的步骤包括:The steps of the above distillation method applied to the BERT model include:
接收用户终端发送的模型蒸馏请求,所述模型蒸馏请求至少携带有蒸馏对象标识以及蒸馏系数;Receive a model distillation request sent by the user terminal, where the model distillation request at least carries a distillation object identifier and a distillation coefficient;
读取本地数据库,在所述本地数据库中获取与所述蒸馏对象标识相对应的训练好的原始BERT模型,所述原始BERT模型的损失函数为交叉熵;Read the local database, obtain the trained original BERT model corresponding to the distillation object identifier in the local database, and the loss function of the original BERT model is cross entropy;
构建与所述训练好的原始BERT模型结构一致的待训练的默认精简模型,所述默认精简模型的损失函数为交叉熵;Build a default reduced model to be trained that is consistent with the trained original BERT model structure, and the loss function of the default reduced model is cross entropy;
基于所述蒸馏系数对所述默认精简模型进行蒸馏操作,得到中间精简模型;Perform a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model;
在所述本地数据库中获取所述中间精简模型的训练数据;Acquiring training data of the intermediate reduced model in the local database;
基于所述训练数据对所述中间精简模型进行模型训练操作,得到目标精简模型。A model training operation is performed on the intermediate reduced model based on the training data to obtain a target reduced model.
With the distillation method applied to the BERT model provided by the present application, the reduced BERT model retains the same model structure as the original BERT model and differs only in the number of layers, so the amount of code change is small, the prediction code of the large model and the small model is the same, and the original code can be reused; the model therefore does not need to balance the weights of the individual loss parameters during distillation, which reduces the difficulty of the deep-model distillation method. At the same time, the tasks in each stage of training the reduced BERT model remain consistent, so that the reduced BERT model converges more stably.
本申请还提供了另一种实施方式,即提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机可读指令,所述计算机可读指令可被至少一个处理器执行,以使所述至少一个处理器执行如下述的应用于BERT模型的蒸馏方法的步骤:The present application also provides another embodiment, that is, to provide a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and the computer-readable instructions can be executed by at least one processor to causing the at least one processor to perform the steps of the distillation method applied to the BERT model as follows:
接收用户终端发送的模型蒸馏请求,所述模型蒸馏请求至少携带有蒸馏对象标识以及蒸馏系数;Receive a model distillation request sent by the user terminal, where the model distillation request at least carries a distillation object identifier and a distillation coefficient;
读取本地数据库,在所述本地数据库中获取与所述蒸馏对象标识相对应的训练好的原始BERT模型,所述原始BERT模型的损失函数为交叉熵;Read the local database, obtain the trained original BERT model corresponding to the distillation object identifier in the local database, and the loss function of the original BERT model is cross entropy;
构建与所述训练好的原始BERT模型结构一致的待训练的默认精简模型,所述默认精简模型的损失函数为交叉熵;Build a default reduced model to be trained that is consistent with the trained original BERT model structure, and the loss function of the default reduced model is cross entropy;
基于所述蒸馏系数对所述默认精简模型进行蒸馏操作,得到中间精简模型;Perform a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model;
在所述本地数据库中获取所述中间精简模型的训练数据;Acquiring training data of the intermediate reduced model in the local database;
基于所述训练数据对所述中间精简模型进行模型训练操作,得到目标精简模型。A model training operation is performed on the intermediate reduced model based on the training data to obtain a target reduced model.
With the distillation method applied to the BERT model provided by the present application, the reduced BERT model retains the same model structure as the original BERT model and differs only in the number of layers, so the amount of code change is small, the prediction code of the large model and the small model is the same, and the original code can be reused; the model therefore does not need to balance the weights of the individual loss parameters during distillation, which reduces the difficulty of the deep-model distillation method. At the same time, the tasks in each stage of training the reduced BERT model remain consistent, so that the reduced BERT model converges more stably.
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端设备(可以是手机,计算机,服务器,空调器,或者网络设备等)执行本申请各个实施例所述的方法。From the description of the above embodiments, those skilled in the art can clearly understand that the method of the above embodiment can be implemented by means of software plus a necessary general hardware platform, and of course can also be implemented by hardware, but in many cases the former is better implementation. Based on this understanding, the technical solution of the present application can be embodied in the form of a software product in essence or in a part that contributes to the prior art, and the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, CD-ROM), including several instructions to make a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods described in the various embodiments of this application.
显然,以上所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例,附图中给出了本申请的较佳实施例,但并不限制本申请的专利范围。本申请可以以许多不同的形式来实现,相反地,提供这些实施例的目的是使对本申请的公开内容的理解更加透彻全面。尽管参照前述实施例对本申请进行了详细的说明,对于本领域的技术人员来而言,其依然可以对前述各具体实施方式所记载的技术方案进行修改,或者对其中部分技术特征进 行等效替换。凡是利用本申请说明书及附图内容所做的等效结构,直接或间接运用在其他相关的技术领域,均同理在本申请专利保护范围之内。Obviously, the above-described embodiments are only a part of the embodiments of the present application, rather than all of the embodiments. The accompanying drawings show the preferred embodiments of the present application, but do not limit the scope of the patent of the present application. This application may be embodied in many different forms, rather these embodiments are provided so that a thorough and complete understanding of the disclosure of this application is provided. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments, or perform equivalent replacements for some of the technical features. . Any equivalent structure made by using the contents of the description and drawings of the present application, which is directly or indirectly used in other related technical fields, is also within the scope of protection of the patent of the present application.

Claims (20)

  1. 一种应用于BERT模型的蒸馏方法,其中,包括下述步骤:A distillation method applied to a BERT model, comprising the following steps:
    接收用户终端发送的模型蒸馏请求,所述模型蒸馏请求至少携带有蒸馏对象标识以及蒸馏系数;Receive a model distillation request sent by the user terminal, where the model distillation request at least carries a distillation object identifier and a distillation coefficient;
    读取本地数据库,在所述本地数据库中获取与所述蒸馏对象标识相对应的训练好的原始BERT模型,所述原始BERT模型的损失函数为交叉熵;Read the local database, obtain the trained original BERT model corresponding to the distillation object identifier in the local database, and the loss function of the original BERT model is cross entropy;
    构建与所述训练好的原始BERT模型结构一致的待训练的默认精简模型,所述默认精简模型的损失函数为交叉熵;Build a default reduced model to be trained that is consistent with the trained original BERT model structure, and the loss function of the default reduced model is cross entropy;
    基于所述蒸馏系数对所述默认精简模型进行蒸馏操作,得到中间精简模型;Perform a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model;
    在所述本地数据库中获取所述中间精简模型的训练数据;Acquiring training data of the intermediate reduced model in the local database;
    基于所述训练数据对所述中间精简模型进行模型训练操作,得到目标精简模型。A model training operation is performed on the intermediate reduced model based on the training data to obtain a target reduced model.
  2. 根据权利要求1所述的应用于BERT模型的蒸馏方法,其中,所述基于所述蒸馏系数对所述默认精简模型进行蒸馏操作,得到中间精简模型的步骤,具体包括:The distillation method applied to the BERT model according to claim 1, wherein the step of performing a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model specifically includes:
    基于所述蒸馏系数对所述原始BERT模型的transformer层进行分组操作,得到分组transformer层;Perform a grouping operation on the transformer layer of the original BERT model based on the distillation coefficient to obtain a grouped transformer layer;
    基于伯努利分布分别在所述分组transformer层中进行提取操作,得到待替换transformer层;Based on the Bernoulli distribution, extracting operations are performed in the grouped transformer layers to obtain the transformer layers to be replaced;
    将所述待替换transformer层分别替换至所述默认精简模型,得到所述中间精简模型。The to-be-replaced transformer layers are respectively replaced with the default reduced model to obtain the intermediate reduced model.
  3. 根据权利要求1所述的应用于BERT模型的蒸馏方法,其中,所述在所述本地数据库中获取所述中间精简模型的训练数据的步骤,具体包括:The distillation method applied to the BERT model according to claim 1, wherein the step of acquiring the training data of the intermediate reduced model in the local database specifically includes:
    获取所述原始BERT模型训练后的原始训练数据;Obtain the original training data after the original BERT model is trained;
    调高所述原始BERT模型softmax层的温度参数,得到调高BERT模型;Increase the temperature parameter of the softmax layer of the original BERT model to obtain an increased BERT model;
    Inputting the original training data into the increased BERT model to perform a prediction operation to obtain mean result labels;
    基于标签信息在所述原始训练数据进行筛选操作,得到带标签的筛选结果标签;Perform a screening operation on the original training data based on the tag information to obtain a tagged screening result tag;
    基于所述放大训练数据以及所述筛选训练数据选取所述精简模型训练数据。The reduced model training data is selected based on the enlarged training data and the filtered training data.
  4. 根据权利要求1所述的应用于BERT模型的蒸馏方法,其中,在所述基于所述训练数据对所述中间精简模型进行模型训练操作,得到目标精简模型的步骤之后还包括:The distillation method applied to the BERT model according to claim 1, wherein after the step of performing a model training operation on the intermediate reduced model based on the training data to obtain a target reduced model, the method further comprises:
    在所述本地数据库中获取优化训练数据;obtaining optimized training data in the local database;
    将所述优化训练数据分别输入至所述训练好的原始BERT模型以及所述目标精简模型中,分别得到原始transformer层输出数据以及目标transformer层输出数据;Inputting the optimized training data into the trained original BERT model and the target reduced model, respectively, to obtain the original transformer layer output data and the target transformer layer output data;
    Calculating distillation loss data of the original transformer layer output data and the target transformer layer output data based on the earth mover's distance;
    根据所述蒸馏损失数据对所述目标精简模型进行参数优化操作,得到优化精简模型。A parameter optimization operation is performed on the target reduced model according to the distillation loss data to obtain an optimized reduced model.
  5. The distillation method applied to the BERT model according to claim 4, wherein the step of calculating the distillation loss data of the original transformer layer output data and the target transformer layer output data based on the earth mover's distance specifically includes:
    获取所述原始transformer层输出的原始注意力矩阵以及所述目标transformer层输出的目标注意力矩阵;Obtain the original attention matrix output by the original transformer layer and the target attention matrix output by the target transformer layer;
    根据所述原始注意力矩阵以及所述目标注意力矩阵计算注意力EMD距离;Calculate the attention EMD distance according to the original attention matrix and the target attention matrix;
    获取所述原始transformer层输出的原始FFN隐层矩阵以及所述目标transformer层输出的目标FFN隐层矩阵;Obtain the original FFN hidden layer matrix output by the original transformer layer and the target FFN hidden layer matrix output by the target transformer layer;
    根据所述原始FFN隐层矩阵以及所述目标FFN隐层矩阵计算FFN隐层EMD距离;Calculate the FFN hidden layer EMD distance according to the original FFN hidden layer matrix and the target FFN hidden layer matrix;
    基于所述注意力EMD距离以及所述FFN隐层EMD距离获得所述蒸馏损失数据。The distillation loss data is obtained based on the attention EMD distance and the FFN hidden layer EMD distance.
  6. 根据权利要求5所述的应用于BERT模型的蒸馏方法,其中,所述注意力EMD距离表示为:The distillation method applied to a BERT model according to claim 5, wherein the attention EMD distance is expressed as:
    L_{attn} = \frac{\sum_{i=1}^{M}\sum_{j=1}^{N} f_{ij}\, D_{ij}^{A}}{\sum_{i=1}^{M}\sum_{j=1}^{N} f_{ij}}, \qquad D_{ij}^{A} = \mathrm{MSE}\left(A_{i}^{T}, A_{j}^{S}\right)

    where L_{attn} denotes the attention EMD distance; A^{T} denotes the original attention matrices and A^{S} the target attention matrices; D_{ij}^{A} denotes the mean squared error between the original attention matrix A_{i}^{T} of the i-th original transformer layer and the target attention matrix A_{j}^{S} of the j-th target transformer layer; f_{ij} denotes the amount of knowledge transferred from the i-th original transformer layer to the j-th target transformer layer; M denotes the number of original transformer layers; and N denotes the number of target transformer layers.
  7. 根据权利要求5所述的应用于BERT模型的蒸馏方法,其中,所述FFN隐层EMD距离表示为:The distillation method applied to the BERT model according to claim 5, wherein the FFN hidden layer EMD distance is expressed as:
    L_{ffn} = \frac{\sum_{i=1}^{M}\sum_{j=1}^{N} f_{ij}\, D_{ij}^{H}}{\sum_{i=1}^{M}\sum_{j=1}^{N} f_{ij}}, \qquad D_{ij}^{H} = \mathrm{MSE}\left(H_{j}^{S} W_{h}, H_{i}^{T}\right)

    where L_{ffn} denotes the FFN hidden layer EMD distance; H^{T} denotes the original FFN hidden layer matrices of the original transformer layers and H^{S} the target FFN hidden layer matrices of the target transformer layers; W_{h} denotes the transformation matrix; D_{ij}^{H} denotes the mean squared error between the original FFN hidden layer matrix H_{i}^{T} of the i-th original transformer layer and the transformed target FFN hidden layer matrix H_{j}^{S} W_{h} of the j-th target transformer layer; f_{ij} denotes the amount of knowledge transferred from the i-th original transformer layer to the j-th target transformer layer; M denotes the number of original transformer layers; and N denotes the number of target transformer layers.
  8. 一种应用于BERT模型的蒸馏装置,其中,包括:A distillation apparatus applied to a BERT model, including:
    请求接收模块,用于接收用户终端发送的模型蒸馏请求,所述模型蒸馏请求至少携带有蒸馏对象标识以及蒸馏系数;a request receiving module, configured to receive a model distillation request sent by a user terminal, where the model distillation request at least carries a distillation object identifier and a distillation coefficient;
    原始模型获取模块,用于读取本地数据库,在所述本地数据库中获取与所述蒸馏对象标识相对应的训练好的原始BERT模型,所述原始BERT模型的损失函数为交叉熵;The original model acquisition module is used for reading the local database, and in the local database, the trained original BERT model corresponding to the distillation object identifier is obtained, and the loss function of the original BERT model is cross entropy;
    默认模型构建模块,用于构建与所述训练好的原始BERT模型结构一致的待训练的默认精简模型,所述默认精简模型的损失函数为交叉熵;The default model building module is used to construct a default reduced model to be trained that is consistent with the trained original BERT model structure, and the loss function of the default reduced model is cross entropy;
    蒸馏操作模块,用于基于所述蒸馏系数对所述默认精简模型进行蒸馏操作,得到中间精简模型;a distillation operation module, configured to perform a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model;
    训练数据获取模块,用于在所述本地数据库中获取所述中间精简模型的训练数据;a training data acquisition module, used for acquiring the training data of the intermediate reduced model in the local database;
    模型训练模块,用于基于所述训练数据对所述中间精简模型进行模型训练操作,得到目标精简模型。A model training module, configured to perform a model training operation on the intermediate reduced model based on the training data to obtain a target reduced model.
  9. 一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述处理器执行所述计算机可读指令时实现如下所述的应用于BERT模型的蒸馏方法的步骤:A computer device, comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, and when the processor executes the computer-readable instructions, the steps of the distillation method applied to the BERT model as described below are implemented:
    接收用户终端发送的模型蒸馏请求,所述模型蒸馏请求至少携带有蒸馏对象标识以及蒸馏系数;Receive a model distillation request sent by the user terminal, where the model distillation request at least carries a distillation object identifier and a distillation coefficient;
    读取本地数据库,在所述本地数据库中获取与所述蒸馏对象标识相对应的训练好的原始BERT模型,所述原始BERT模型的损失函数为交叉熵;Read the local database, obtain the trained original BERT model corresponding to the distillation object identifier in the local database, and the loss function of the original BERT model is cross entropy;
    构建与所述训练好的原始BERT模型结构一致的待训练的默认精简模型,所述默认精简模型的损失函数为交叉熵;Build a default reduced model to be trained that is consistent with the trained original BERT model structure, and the loss function of the default reduced model is cross entropy;
    基于所述蒸馏系数对所述默认精简模型进行蒸馏操作,得到中间精简模型;Perform a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model;
    在所述本地数据库中获取所述中间精简模型的训练数据;Acquiring training data of the intermediate reduced model in the local database;
    基于所述训练数据对所述中间精简模型进行模型训练操作,得到目标精简模型。A model training operation is performed on the intermediate reduced model based on the training data to obtain a target reduced model.
  10. 根据权利要求9所述的计算机设备,其中,所述基于所述蒸馏系数对所述默认精简模型进行蒸馏操作,得到中间精简模型的步骤,具体包括:The computer device according to claim 9, wherein the step of performing a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model specifically includes:
    基于所述蒸馏系数对所述原始BERT模型的transformer层进行分组操作,得到分组transformer层;Perform a grouping operation on the transformer layer of the original BERT model based on the distillation coefficient to obtain a grouped transformer layer;
    基于伯努利分布分别在所述分组transformer层中进行提取操作,得到待替换 transformer层;Perform extraction operations in the grouped transformer layers based on the Bernoulli distribution to obtain the transformer layer to be replaced;
    将所述待替换transformer层分别替换至所述默认精简模型,得到所述中间精简模型。The to-be-replaced transformer layers are respectively replaced with the default reduced model to obtain the intermediate reduced model.
  11. 根据权利要求9所述的计算机设备,其中,所述在所述本地数据库中获取所述中间精简模型的训练数据的步骤,具体包括:The computer device according to claim 9, wherein the step of acquiring the training data of the intermediate reduced model in the local database specifically includes:
    获取所述原始BERT模型训练后的原始训练数据;Obtain the original training data after the original BERT model is trained;
    调高所述原始BERT模型softmax层的温度参数,得到调高BERT模型;Increase the temperature parameter of the softmax layer of the original BERT model to obtain an increased BERT model;
    Inputting the original training data into the increased BERT model to perform a prediction operation to obtain mean result labels;
    基于标签信息在所述原始训练数据进行筛选操作,得到带标签的筛选结果标签;Perform a screening operation on the original training data based on the tag information to obtain a tagged screening result tag;
    基于所述放大训练数据以及所述筛选训练数据选取所述精简模型训练数据。The reduced model training data is selected based on the enlarged training data and the filtered training data.
  12. 根据权利要求9所述的计算机设备,其中,在所述基于所述训练数据对所述中间精简模型进行模型训练操作,得到目标精简模型的步骤之后还包括:The computer device according to claim 9, wherein after the step of performing a model training operation on the intermediate reduced model based on the training data to obtain a target reduced model, it further comprises:
    在所述本地数据库中获取优化训练数据;obtaining optimized training data in the local database;
    将所述优化训练数据分别输入至所述训练好的原始BERT模型以及所述目标精简模型中,分别得到原始transformer层输出数据以及目标transformer层输出数据;Inputting the optimized training data into the trained original BERT model and the target reduced model, respectively, to obtain the original transformer layer output data and the target transformer layer output data;
    Calculating distillation loss data of the original transformer layer output data and the target transformer layer output data based on the earth mover's distance;
    根据所述蒸馏损失数据对所述目标精简模型进行参数优化操作,得到优化精简模型。A parameter optimization operation is performed on the target reduced model according to the distillation loss data to obtain an optimized reduced model.
  13. The computer device according to claim 12, wherein the step of calculating the distillation loss data of the original transformer layer output data and the target transformer layer output data based on the earth mover's distance specifically includes:
    获取所述原始transformer层输出的原始注意力矩阵以及所述目标transformer层输出的目标注意力矩阵;Obtain the original attention matrix output by the original transformer layer and the target attention matrix output by the target transformer layer;
    根据所述原始注意力矩阵以及所述目标注意力矩阵计算注意力EMD距离;Calculate the attention EMD distance according to the original attention matrix and the target attention matrix;
    获取所述原始transformer层输出的原始FFN隐层矩阵以及所述目标transformer层输出的目标FFN隐层矩阵;Obtain the original FFN hidden layer matrix output by the original transformer layer and the target FFN hidden layer matrix output by the target transformer layer;
    根据所述原始FFN隐层矩阵以及所述目标FFN隐层矩阵计算FFN隐层EMD距离;Calculate the FFN hidden layer EMD distance according to the original FFN hidden layer matrix and the target FFN hidden layer matrix;
    基于所述注意力EMD距离以及所述FFN隐层EMD距离获得所述蒸馏损失数据。The distillation loss data is obtained based on the attention EMD distance and the FFN hidden layer EMD distance.
  14. 根据权利要求13所述的计算机设备,其中,所述注意力EMD距离表示为:The computer device of claim 13, wherein the attention EMD distance is expressed as:
    L_{attn} = \frac{\sum_{i=1}^{M}\sum_{j=1}^{N} f_{ij}\, D_{ij}^{A}}{\sum_{i=1}^{M}\sum_{j=1}^{N} f_{ij}}, \qquad D_{ij}^{A} = \mathrm{MSE}\left(A_{i}^{T}, A_{j}^{S}\right)

    where L_{attn} denotes the attention EMD distance; A^{T} denotes the original attention matrices and A^{S} the target attention matrices; D_{ij}^{A} denotes the mean squared error between the original attention matrix A_{i}^{T} of the i-th original transformer layer and the target attention matrix A_{j}^{S} of the j-th target transformer layer; f_{ij} denotes the amount of knowledge transferred from the i-th original transformer layer to the j-th target transformer layer; M denotes the number of original transformer layers; and N denotes the number of target transformer layers.
  15. 一种计算机可读存储介质,其中,所述计算机可读存储介质上存储有计算机可读指令,所述计算机可读指令被处理器执行时实现如下所述的应用于BERT模型的蒸馏方法的步骤:A computer-readable storage medium, wherein computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are executed by a processor, the steps of the distillation method applied to the BERT model as described below are implemented :
    接收用户终端发送的模型蒸馏请求,所述模型蒸馏请求至少携带有蒸馏对象标识以及蒸馏系数;Receive a model distillation request sent by the user terminal, where the model distillation request at least carries a distillation object identifier and a distillation coefficient;
    读取本地数据库,在所述本地数据库中获取与所述蒸馏对象标识相对应的训练好的原始BERT模型,所述原始BERT模型的损失函数为交叉熵;Read the local database, obtain the trained original BERT model corresponding to the distillation object identifier in the local database, and the loss function of the original BERT model is cross entropy;
    构建与所述训练好的原始BERT模型结构一致的待训练的默认精简模型,所述默认精简模型的损失函数为交叉熵;Build a default reduced model to be trained that is consistent with the trained original BERT model structure, and the loss function of the default reduced model is cross entropy;
    基于所述蒸馏系数对所述默认精简模型进行蒸馏操作,得到中间精简模型;Perform a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model;
    在所述本地数据库中获取所述中间精简模型的训练数据;Acquiring training data of the intermediate reduced model in the local database;
    基于所述训练数据对所述中间精简模型进行模型训练操作,得到目标精简模型。A model training operation is performed on the intermediate reduced model based on the training data to obtain a target reduced model.
  16. 根据权利要求15所述的计算机可读存储介质,其中,所述基于所述蒸馏系数对所述默认精简模型进行蒸馏操作,得到中间精简模型的步骤,具体包括:The computer-readable storage medium according to claim 15, wherein the step of performing a distillation operation on the default reduced model based on the distillation coefficient to obtain an intermediate reduced model specifically includes:
    基于所述蒸馏系数对所述原始BERT模型的transformer层进行分组操作,得到分组transformer层;Perform a grouping operation on the transformer layer of the original BERT model based on the distillation coefficient to obtain a grouped transformer layer;
    基于伯努利分布分别在所述分组transformer层中进行提取操作,得到待替换transformer层;Based on the Bernoulli distribution, extracting operations are performed in the grouped transformer layers to obtain the transformer layers to be replaced;
    将所述待替换transformer层分别替换至所述默认精简模型,得到所述中间精简模型。The to-be-replaced transformer layers are respectively replaced with the default reduced model to obtain the intermediate reduced model.
  17. 根据权利要求15所述的计算机可读存储介质,其中,所述在所述本地数据库中获取所述中间精简模型的训练数据的步骤,具体包括:The computer-readable storage medium according to claim 15, wherein the step of acquiring the training data of the intermediate reduced model in the local database specifically comprises:
    获取所述原始BERT模型训练后的原始训练数据;Obtain the original training data after the original BERT model is trained;
    调高所述原始BERT模型softmax层的温度参数,得到调高BERT模型;Increase the temperature parameter of the softmax layer of the original BERT model to obtain an increased BERT model;
    Inputting the original training data into the increased BERT model to perform a prediction operation to obtain mean result labels;
    基于标签信息在所述原始训练数据进行筛选操作,得到带标签的筛选结果标签;Perform a screening operation on the original training data based on the tag information to obtain a tagged screening result tag;
    基于所述放大训练数据以及所述筛选训练数据选取所述精简模型训练数据。The reduced model training data is selected based on the enlarged training data and the filtered training data.
  18. 根据权利要求15所述的计算机可读存储介质,其中,在所述基于所述训练数据对所述中间精简模型进行模型训练操作,得到目标精简模型的步骤之后还包括:The computer-readable storage medium according to claim 15, wherein after the step of performing a model training operation on the intermediate reduced model based on the training data to obtain a target reduced model, the method further comprises:
    在所述本地数据库中获取优化训练数据;Obtaining optimized training data in the local database;
    将所述优化训练数据分别输入至所述训练好的原始BERT模型以及所述目标精简模型中,分别得到原始transformer层输出数据以及目标transformer层输出数据;Inputting the optimized training data into the trained original BERT model and the target reduced model, respectively, to obtain the original transformer layer output data and the target transformer layer output data;
    Calculating distillation loss data of the original transformer layer output data and the target transformer layer output data based on the earth mover's distance;
    根据所述蒸馏损失数据对所述目标精简模型进行参数优化操作,得到优化精简模型。A parameter optimization operation is performed on the target reduced model according to the distillation loss data to obtain an optimized reduced model.
  19. The computer-readable storage medium according to claim 18, wherein the step of calculating the distillation loss data of the original transformer layer output data and the target transformer layer output data based on the earth mover's distance specifically includes:
    获取所述原始transformer层输出的原始注意力矩阵以及所述目标transformer层输出的目标注意力矩阵;Obtain the original attention matrix output by the original transformer layer and the target attention matrix output by the target transformer layer;
    根据所述原始注意力矩阵以及所述目标注意力矩阵计算注意力EMD距离;Calculate the attention EMD distance according to the original attention matrix and the target attention matrix;
    获取所述原始transformer层输出的原始FFN隐层矩阵以及所述目标transformer层输出的目标FFN隐层矩阵;Obtain the original FFN hidden layer matrix output by the original transformer layer and the target FFN hidden layer matrix output by the target transformer layer;
    根据所述原始FFN隐层矩阵以及所述目标FFN隐层矩阵计算FFN隐层EMD距离;Calculate the FFN hidden layer EMD distance according to the original FFN hidden layer matrix and the target FFN hidden layer matrix;
    基于所述注意力EMD距离以及所述FFN隐层EMD距离获得所述蒸馏损失数据。The distillation loss data is obtained based on the attention EMD distance and the FFN hidden layer EMD distance.
  20. 根据权利要求19所述的计算机可读存储介质,其中,所述注意力EMD距离表示为:The computer-readable storage medium of claim 19, wherein the attention EMD distance is represented as:
    L_{attn} = \frac{\sum_{i=1}^{M}\sum_{j=1}^{N} f_{ij}\, D_{ij}^{A}}{\sum_{i=1}^{M}\sum_{j=1}^{N} f_{ij}}, \qquad D_{ij}^{A} = \mathrm{MSE}\left(A_{i}^{T}, A_{j}^{S}\right)

    where L_{attn} denotes the attention EMD distance; A^{T} denotes the original attention matrices and A^{S} the target attention matrices; D_{ij}^{A} denotes the mean squared error between the original attention matrix A_{i}^{T} of the i-th original transformer layer and the target attention matrix A_{j}^{S} of the j-th target transformer layer; f_{ij} denotes the amount of knowledge transferred from the i-th original transformer layer to the j-th target transformer layer; M denotes the number of original transformer layers; and N denotes the number of target transformer layers.
PCT/CN2021/090524 2020-11-17 2021-04-28 Distillation method and apparatus applied to bert model, device, and storage medium WO2022105121A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011288877.7A CN112418291A (en) 2020-11-17 2020-11-17 Distillation method, device, equipment and storage medium applied to BERT model
CN202011288877.7 2020-11-17

Publications (1)

Publication Number Publication Date
WO2022105121A1 true WO2022105121A1 (en) 2022-05-27

Family

ID=74832129

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/090524 WO2022105121A1 (en) 2020-11-17 2021-04-28 Distillation method and apparatus applied to bert model, device, and storage medium

Country Status (2)

Country Link
CN (1) CN112418291A (en)
WO (1) WO2022105121A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402811A (en) * 2023-06-05 2023-07-07 长沙海信智能系统研究院有限公司 Fighting behavior identification method and electronic equipment

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112418291A (en) * 2020-11-17 2021-02-26 平安科技(深圳)有限公司 Distillation method, device, equipment and storage medium applied to BERT model
GB2619569A (en) * 2020-12-15 2023-12-13 Zhejiang Lab Method and platform for automatically compressing multi-task-oriented pre-training language model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10607598B1 (en) * 2019-04-05 2020-03-31 Capital One Services, Llc Determining input data for speech processing
CN111553479A (en) * 2020-05-13 2020-08-18 鼎富智能科技有限公司 Model distillation method, text retrieval method and text retrieval device
CN111767711A (en) * 2020-09-02 2020-10-13 之江实验室 Compression method and platform of pre-training language model based on knowledge distillation
CN112418291A (en) * 2020-11-17 2021-02-26 平安科技(深圳)有限公司 Distillation method, device, equipment and storage medium applied to BERT model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188360B (en) * 2019-06-06 2023-04-25 北京百度网讯科技有限公司 Model training method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10607598B1 (en) * 2019-04-05 2020-03-31 Capital One Services, Llc Determining input data for speech processing
CN111553479A (en) * 2020-05-13 2020-08-18 鼎富智能科技有限公司 Model distillation method, text retrieval method and text retrieval device
CN111767711A (en) * 2020-09-02 2020-10-13 之江实验室 Compression method and platform of pre-training language model based on knowledge distillation
CN112418291A (en) * 2020-11-17 2021-02-26 平安科技(深圳)有限公司 Distillation method, device, equipment and storage medium applied to BERT model

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402811A (en) * 2023-06-05 2023-07-07 长沙海信智能系统研究院有限公司 Fighting behavior identification method and electronic equipment
CN116402811B (en) * 2023-06-05 2023-08-18 长沙海信智能系统研究院有限公司 Fighting behavior identification method and electronic equipment

Also Published As

Publication number Publication date
CN112418291A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
WO2022105121A1 (en) Distillation method and apparatus applied to bert model, device, and storage medium
US11030522B2 (en) Reducing the size of a neural network through reduction of the weight matrices
CN109190120B (en) Neural network training method and device and named entity identification method and device
WO2020232861A1 (en) Named entity recognition method, electronic device and storage medium
WO2021068329A1 (en) Chinese named-entity recognition method, device, and computer-readable storage medium
WO2021121198A1 (en) Semantic similarity-based entity relation extraction method and apparatus, device and medium
WO2020108063A1 (en) Feature word determining method, apparatus, and server
CN109697451B (en) Similar image clustering method and device, storage medium and electronic equipment
CN113792854A (en) Model training and word stock establishing method, device, equipment and storage medium
WO2020215683A1 (en) Semantic recognition method and apparatus based on convolutional neural network, and non-volatile readable storage medium and computer device
WO2023124005A1 (en) Map point of interest query method and apparatus, device, storage medium, and program product
CN113837308B (en) Knowledge distillation-based model training method and device and electronic equipment
WO2022110640A1 (en) Model optimization method and apparatus, computer device and storage medium
WO2023168909A1 (en) Pre-training method and model fine-tuning method for geographical pre-training model
CN112084752B (en) Sentence marking method, device, equipment and storage medium based on natural language
WO2023138188A1 (en) Feature fusion model training method and apparatus, sample retrieval method and apparatus, and computer device
CN115482395B (en) Model training method, image classification device, electronic equipment and medium
CN113190702B (en) Method and device for generating information
JP2022169743A (en) Information extraction method and device, electronic equipment, and storage medium
CN114780746A (en) Knowledge graph-based document retrieval method and related equipment thereof
WO2023040742A1 (en) Text data processing method, neural network training method, and related devices
CN114781611A (en) Natural language processing method, language model training method and related equipment
CN115730597A (en) Multi-level semantic intention recognition method and related equipment thereof
CN114861758A (en) Multi-modal data processing method and device, electronic equipment and readable storage medium
CN113435523B (en) Method, device, electronic equipment and storage medium for predicting content click rate

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21893277

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21893277

Country of ref document: EP

Kind code of ref document: A1