CN113706347A - Multi-task model distillation method, system, medium, and electronic terminal

Multi-task model distillation method, system, medium, and electronic terminal

Info

Publication number
CN113706347A
CN113706347A (application CN202111009408.1A)
Authority
CN
China
Prior art keywords
model
layer
training
parameter
student
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111009408.1A
Other languages
Chinese (zh)
Inventor
何哲宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai filed Critical OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202111009408.1A priority Critical patent/CN113706347A/en
Publication of CN113706347A publication Critical patent/CN113706347A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/20 Education
    • G06Q50/205 Education administration or guidance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Strategic Management (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Biophysics (AREA)
  • Tourism & Hospitality (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • General Business, Economics & Management (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to the field of artificial intelligence, and in particular to a multi-task model distillation method, system, medium, and electronic terminal. The method comprises: when a multi-task model undergoes a first round of alternating training over a plurality of tasks according to a preset task training order, freezing the intermediate parameter layer of the multi-task model while retaining the embedding layer and the plurality of classification layers corresponding to the tasks, wherein the intermediate parameter layer comprises a plurality of first sub-layers and freezing it means freezing all or part of its parameters; taking the alternately trained embedding layer, the frozen intermediate parameter layer, and any one task-specific classification layer as a teacher model; performing model distillation with the teacher models to obtain a plurality of distilled student models; performing a second round of alternating training on the distilled student models; and determining a final model, thereby avoiding antagonism among the multiple tasks.

Description

Multi-task model distillation method, system, medium, and electronic terminal
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a multi-task model distillation method, system, medium, and electronic terminal.
Background
New-generation algorithm models based on the Transformer, an attention-based Encoder-Decoder structure, tend to consume large amounts of computing power and computation time, which is unacceptable for applications that must control time and economic cost. Model distillation therefore emerged as a model-compression technique: a teacher-student training framework is designed so that a model with a small number of parameters can, under the supervision of a model with a large number of parameters, obtain an expressive capacity similar to that of the large model; here, teacher-student refers to the training framework formed by the teacher model and the student model during distillation. In addition, a multi-task model design can also compress model size. It is therefore natural to perform distillation on the basis of a multi-task model to achieve further compression.
However, in practice the training data of a multi-task model is usually imbalanced, and may even consist of cross-domain corpora. Because the label sets of the downstream tasks differ, the tasks can antagonize each other during distillation of the multi-task model, and the mutually reinforcing training effect cannot be achieved.
Disclosure of Invention
The invention provides a multi-task model distillation method, system, medium, and electronic terminal, aiming to solve the problems that the imbalanced training data of a multi-task model easily causes antagonism among the tasks during model distillation, that a mutually reinforcing training effect cannot be well achieved, and that the distilled multi-task model has low accuracy.
The invention provides a multi-task model distillation method, comprising the following steps:
when a first round of alternating training over a plurality of tasks is performed on the multi-task model according to a preset task training order, freezing the intermediate parameter layer of the multi-task model and retaining the embedding layer and the plurality of classification layers corresponding to the tasks, wherein the intermediate parameter layer of the multi-task model comprises a plurality of first sub-layers, and the step of freezing the intermediate parameter layer comprises: freezing all or part of the parameters in the intermediate parameter layer of the multi-task model, the part of the parameters comprising the parameters of several consecutive first sub-layers starting from the first sub-layer closest to the embedding layer;
taking the embedding layer that has undergone the multi-task alternating training, the frozen intermediate parameter layer, and any one classification layer corresponding to a task as a teacher model, and performing model distillation with the teacher models to obtain a plurality of distilled student models;
performing a second round of alternating training on the plurality of distilled student models according to the task training order to obtain student models that have undergone the second round of alternating training;
and determining a final model according to the student models that have undergone the second round of alternating training.
Optionally, the step of performing the second round of alternating training on the plurality of distilled student models comprises:
freezing the intermediate parameter layer of the distilled student model, wherein this step comprises: freezing all or part of the parameters in the intermediate parameter layer of the distilled student model, the intermediate parameter layer comprising a plurality of second sub-layers and the part of the parameters comprising the parameters of several consecutive second sub-layers starting from the second sub-layer closest to the embedding layer of the distilled student model;
and retaining the embedding layer and the corresponding classification layer of the distilled student model, thereby obtaining the student model after the second round of alternating training.
Optionally, the step of freezing all or part of the parameters in the intermediate parameter layer of the multi-task model comprises:
acquiring the first sub-layers to be frozen, starting from the first sub-layer closest to the embedding layer of the multi-task model, according to a preset first number of layers to freeze;
determining the first parameters to be frozen according to the first sub-layers to be frozen;
and freezing the first parameters to be frozen.
Optionally, the step of freezing all or part of the parameters in the intermediate parameter layer of the distilled student model comprises:
acquiring the second sub-layers to be frozen, starting from the second sub-layer closest to the embedding layer of the distilled student model, according to a preset second number of layers to freeze;
determining the second parameters to be frozen according to the second sub-layers to be frozen;
and freezing the second parameters to be frozen.
Optionally, the step of freezing the first parameters to be frozen comprises:
updating the parameter attribute of the first parameters to be frozen according to a preset freezing attribute;
adding a parameter filter to the optimizer of the multi-task model;
during the first round of alternating training, having the parameter filter filter out the first parameters to be frozen according to their updated parameter attribute, thereby completing the freezing of the first parameters to be frozen.
Optionally, the step of performing model distillation using the teacher model comprises:
acquiring a training data set comprising a plurality of training samples and the labels corresponding to the training samples;
inputting the training samples of the training data set into the teacher model and the obtained student model respectively for prediction, to obtain a teacher prediction result and a student prediction result;
acquiring a first loss of the teacher model according to the teacher prediction result and a preset first loss function;
acquiring a second loss of the student model according to the student prediction result and a preset second loss function;
and acquiring a third loss according to the first loss, the second loss, and a preset weight, and training and optimizing the student model with the third loss to obtain the distilled student model.
Optionally, the step of determining the final model according to the student models after the second round of alternating training comprises:
combining the embedding layers, the intermediate parameter layers, and the classification layers of the plurality of student models that have undergone the second round of alternating training according to a preset combination rule to obtain the final model, wherein the plurality of embedding layers correspond to the plurality of intermediate parameter layers, and the plurality of intermediate parameter layers correspond to the plurality of classification layers.
The present invention also provides a multi-task model distillation system, comprising:
a first alternating-training module, configured to freeze the intermediate parameter layer of the multi-task model when a first round of alternating training over a plurality of tasks is performed on the multi-task model according to a preset task training order, and to retain the embedding layer and the plurality of classification layers corresponding to the tasks, wherein the intermediate parameter layer comprises a plurality of first sub-layers and freezing it comprises freezing all or part of its parameters, the part of the parameters comprising the parameters of several consecutive first sub-layers starting from the first sub-layer closest to the embedding layer;
a distillation module, configured to take the embedding layer that has undergone the multi-task alternating training, the frozen intermediate parameter layer, and any one classification layer corresponding to a task as a teacher model, and to perform model distillation with the teacher models to obtain a plurality of distilled student models;
a second alternating-training module, configured to perform a second round of alternating training on the plurality of distilled student models according to the task training order to obtain student models that have undergone the second round of alternating training;
and a processing module, configured to determine the final model according to the student models that have undergone the second round of alternating training.
The invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method as defined in any one of the above.
The present invention also provides an electronic terminal, comprising: a processor and a memory;
the memory is adapted to store a computer program and the processor is adapted to execute the computer program stored by the memory to cause the terminal to perform the method as defined in any one of the above.
The invention has the following beneficial effects. In the multi-task model distillation method, system, medium, and electronic terminal, a first round of alternating training over a plurality of tasks is performed on the multi-task model; during this first round, the intermediate parameter layer of the multi-task model is frozen while the embedding layer and the plurality of classification layers corresponding to the tasks are retained, the intermediate parameter layer comprising a plurality of first sub-layers and freezing it meaning freezing all or part of its parameters, the part of the parameters comprising the parameters of several consecutive first sub-layers starting from the first sub-layer closest to the embedding layer. The embedding layer that has undergone the multi-task alternating training, the frozen intermediate parameter layer, and any one classification layer corresponding to a task are taken as a teacher model; model distillation is performed with the teacher models to obtain a plurality of distilled student models; a second round of alternating training is performed on the distilled student models according to the task training order; and the final model is determined according to the student models after the second round of alternating training. During model training, the intermediate parameter layer therefore cannot overfit to any single task because of imbalanced multi-task training data, antagonism among the tasks is avoided, the influence of sample imbalance among the tasks is eliminated, the resource consumption of model training is greatly reduced, the tasks reinforce one another during multi-task learning and training, and the accuracy is high.
Drawings
FIG. 1 is a schematic flow diagram of the multi-task model distillation method in an embodiment of the invention.
FIG. 2 is a schematic structural diagram of the multi-task model in an embodiment of the invention.
FIG. 3 is a schematic flow diagram of the second round of alternating training in the multi-task model distillation method in an embodiment of the invention.
FIG. 4 is a schematic flow diagram of freezing all or part of the parameters in the intermediate parameter layer of the multi-task model in the multi-task model distillation method in an embodiment of the invention.
FIG. 5 is a schematic flow diagram of performing model distillation in an embodiment of the invention.
FIG. 6 is a schematic flow diagram of determining the final model in the multi-task model distillation method in an embodiment of the invention.
FIG. 7 is a schematic structural diagram of the multi-task model distillation system in an embodiment of the invention.
FIG. 8 is a schematic structural diagram of an electronic terminal for multi-task model distillation in an embodiment of the invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
As shown in fig. 1, the multi-task model distillation method in the present embodiment comprises:
S101: when a first round of alternating training over a plurality of tasks is performed on the multi-task model according to a preset task training order, freeze the intermediate parameter layer of the multi-task model and retain the embedding layer and the plurality of classification layers corresponding to the tasks, wherein the intermediate parameter layer comprises a plurality of first sub-layers, and freezing the intermediate parameter layer comprises freezing all or part of its parameters, the part of the parameters comprising the parameters of several consecutive first sub-layers starting from the first sub-layer closest to the embedding layer. Freezing all or part of the parameters of the shared intermediate parameter layer during the first round of alternating training means that these parameters no longer participate in back-propagation and parameter updates, which effectively reduces the resources consumed by training; alternating training over the tasks while retaining the shared embedding layer and the task-specific classification layers makes good use of the shallow representations of the embedding layer and avoids conflicts and antagonism between the different downstream tasks of the multi-task model.
For example, a multi-task model with a large number of parameters is selected for a first round of alternating training over a plurality of tasks. The task training order can be set according to the actual situation, e.g. task one, task two, task three, task four, and so on. The selected multi-task model then undergoes the first round of alternating training in this order, with all or part of the parameters of its intermediate parameter layer frozen during the round, which avoids the antagonism that imbalanced training data across the tasks would otherwise cause during training. Alternating training over a plurality of tasks means training cyclically in the order task one -> task two -> task three -> task four -> ... -> task one.
It is understood that a multi-task model is a model that learns multiple tasks simultaneously (as opposed to a single-task model) and comprises an embedding layer, an intermediate parameter layer with a Transformer structure, and N classification layers (classifier layers) corresponding to the tasks, where the Transformer is an attention-based Encoder-Decoder structure. When distilling and training a multi-task model, the training samples of the different tasks are often imbalanced and may even be cross-domain corpora; for instance, a cat-classification task and a dog-classification task have very different training data. Different tasks then easily antagonize each other during training, and a mutually reinforcing training effect cannot be achieved. Therefore, all or part of the parameters of the intermediate parameter layer are frozen during the first round of training while the embedding layer and the classification layers are retained, so that the frozen parameters remain unchanged during the round; this avoids antagonism between tasks caused by the imbalanced training data set while still training the embedding layer and classification layers of the multi-task model.
As shown in fig. 2, in step S101 the intermediate parameter layer consists of a plurality of stacked Transformer blocks, and each first sub-layer is a Transformer block. The embedding layer reduces the dimensionality of the input training data and vectorizes it to obtain a corresponding, expressive mapping vector; the intermediate parameter layer computes on the mapping vector to obtain a computation result; and each classification layer corresponds to one task and classifies and decides on the computation result output by the intermediate parameter layer. When the multi-task model undergoes the first round of alternating training, all or part of the parameters of the Transformer-structured intermediate parameter layer are frozen and the embedding layer and the task-specific classifier layers are retained, yielding a trained multi-task model with a 1 + 1 + N structure (one shared embedding layer, one shared Transformer layer, and classifier layers for N different tasks), where N is the total number of tasks.
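Purely as an illustration (not taken from the patent), the following PyTorch-style sketch shows one way such a 1 + 1 + N structure could be organized; the class name MultiTaskModel, the mean pooling, the encoder-only trunk, and all dimensions are assumptions introduced here.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Shared embedding + shared Transformer trunk + one classifier head per task (1 + 1 + N)."""
    def __init__(self, vocab_size, hidden_dim, num_layers, num_heads, task_num_classes):
        super().__init__()
        # Shared embedding layer: maps token indices to dense vectors.
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        # Shared intermediate parameter layer: a stack of Transformer blocks (the first sub-layers).
        block = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(block, num_layers=num_layers)
        # One classification layer per task (N heads).
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, n) for n in task_num_classes)

    def forward(self, token_ids, task_id):
        x = self.embedding(token_ids)       # (batch, seq_len, hidden_dim)
        x = self.trunk(x)                   # shared Transformer computation
        pooled = x.mean(dim=1)              # simple pooling over the sequence
        return self.heads[task_id](pooled)  # logits of the selected task
```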
Specifically, the step of reducing the dimensionality of the input training data and vectorizing it comprises: first, encoding the input training data to obtain index values of the training data; then performing vector conversion according to the index values to obtain the corresponding word-vector matrix; and finally obtaining a mapping matrix from the word-vector matrix and a preset target matrix, i.e. taking the product of the word-vector matrix and the target matrix as the mapping matrix, and thereby obtaining the mapping vector. This achieves the dimensionality reduction and vectorization of the training data.
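As a toy numerical illustration of this embedding step (the vocabulary size, hidden dimension, and token indices are made-up values, and the word-vector matrix is assumed to be one-hot, which is one possible reading of the description above; an embedding lookup table computes the same product implicitly):

```python
import torch
import torch.nn.functional as F

vocab_size, hidden_dim = 30000, 256                       # assumed dimensions
target_matrix = torch.randn(vocab_size, hidden_dim)       # preset target matrix
token_ids = torch.tensor([[17, 942, 6]])                  # encoded training data: index values
word_vectors = F.one_hot(token_ids, vocab_size).float()   # word-vector matrix (one-hot rows)
mapping_vectors = word_vectors @ target_matrix            # product with the target matrix
print(mapping_vectors.shape)                              # torch.Size([1, 3, 256]): reduced dimension
```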
The intermediate parameter layer comprises an Encoder layer and a Decoder layer; the Encoder layer comprises a plurality of Encoder blocks connected in sequence, and the Decoder layer comprises a plurality of Decoder blocks connected in sequence. The computation performed by the intermediate parameter layer on the mapping vector comprises: receiving the mapping vectors of the different tasks output by the embedding layer; inputting the mapping vectors into the Encoder layer for encoding to obtain an encoding matrix; and inputting the encoding matrix into the Decoder layer for prediction to obtain the computation result, which is the prediction value output by the Decoder layer.
The step of classifying and deciding on the computation result output by the intermediate parameter layer comprises classifying the computation result with a preset classification function. The classification function may be a softmax classification function or a sigmoid classification function, which is not elaborated here.
S102: take the embedding layer that has undergone the multi-task alternating training, the frozen intermediate parameter layer, and any one classification layer corresponding to a task as a teacher model, and perform model distillation with the teacher models to obtain a plurality of distilled student models.
Because the multi-task model has a plurality of tasks, any one task is selected and its classification layer obtained, and the embedding layer and intermediate parameter layer that underwent the first round of alternating training, together with that classification layer, are taken as a teacher model. A plurality of corresponding teacher models are obtained from the plurality of task-specific classification layers, and model distillation is performed with them to obtain a plurality of distilled student models. Using the alternately trained embedding layer and intermediate parameter layer plus any one task-specific classification layer as the teacher model allows the different tasks to be distilled independently and avoids the antagonism and conflict caused by imbalanced training data among the tasks.
S103: perform a second round of alternating training on the plurality of distilled student models according to the task training order to obtain student models that have undergone the second round of alternating training. The second round of alternating training over the distilled student models avoids the antagonism, and hence the training loss, that imbalanced training data would otherwise cause among the student models corresponding to the tasks.
S104: determine the final model according to the student models that have undergone the second round of alternating training. Training the distilled student models in two alternating rounds and determining the final model from the twice-trained student models distills the multi-task model well: conflicts and antagonism between different tasks during distillation and training caused by imbalanced training data are avoided, the influence of an imbalanced training data set on multi-task distillation and training is reduced, the mutual reinforcement among multi-task learning and training is exploited, the feasibility of execution is strong, and the cost is low.
Referring to fig. 3, in order to avoid the overfitting that an imbalanced training data set would cause during the second round of alternating training, the inventors propose that the step of performing the second round of alternating training on the plurality of distilled student models comprises:
S301: freeze the intermediate parameter layer of the distilled student model, i.e. freeze all or part of its parameters, the intermediate parameter layer comprising a plurality of second sub-layers and the part of the parameters comprising the parameters of several consecutive second sub-layers starting from the second sub-layer closest to the embedding layer of the distilled student model; retain the embedding layer and the corresponding classification layer of the distilled student model, thereby obtaining the student model after the second round of alternating training. During the second round, freezing all or part of the parameters of the intermediate parameter layer keeps them unchanged while the embedding layers and classification layers of the distilled student models are iteratively updated with the training samples of the training data set, so that the embedding and classification layers are trained and the poor training effect that an imbalanced multi-task training data set would cause is avoided. In some embodiments, the plurality of distilled student models correspond one to one with the plurality of tasks, e.g. task one corresponds to distilled student model one and task two corresponds to distilled student model two.
The second round of alternating training is performed on the plurality of distilled student models according to the preset task training order, e.g. distilled student model one -> distilled student model two -> distilled student model three -> ... -> distilled student model one. During this round, all or part of the parameters of the intermediate parameter layer are frozen and the embedding layer and corresponding classification layer are retained, which prevents an imbalanced training data set from degrading the training of the distilled student models. In addition, the intermediate parameter layers of the distilled student models no longer participate in back-propagation and parameter iteration, reducing the consumption and cost of training resources; the retained embedding layers make good use of the shallow representations of the training samples, and iteratively updating the embedding-layer parameters does not cause conflicts between the different downstream tasks, so the training effect is not harmed by the imbalanced training data set.
As shown in fig. 4, in some embodiments, since the Transformer-structured intermediate parameter layer of the multi-task model is a stacked structure comprising a plurality of first sub-layers, each first sub-layer being a Transformer block, the step of freezing all or part of the parameters in the intermediate parameter layer of the multi-task model comprises:
S401: according to a preset first number of layers to freeze, acquire the first sub-layers to be frozen starting from the first sub-layer closest to the embedding layer of the multi-task model, and determine the first parameters to be frozen from those sub-layers. For example, when the first number of layers to freeze is i, the i first sub-layers nearest the embedding layer are taken as the first sub-layers to be frozen, and the parameters in them are the first parameters to be frozen.
S402: freeze the first parameters to be frozen. Freezing these parameters of the shared intermediate parameter layer during the first round of alternating training means they no longer participate in back-propagation and parameter updates, effectively reducing the resources consumed by training; alternating training over the tasks while retaining the shared embedding layer and the task-specific classification layers makes good use of the shallow representations of the embedding layer and avoids conflicts and antagonism between the different downstream tasks of the pre-trained model.
For example, according to the preset first number of layers to freeze, the parameters of some layers of the intermediate parameter layer are frozen starting from the side nearest the embedding layer, so that the parameters of the first sub-layers closest to the embedding layer do not participate in parameter iteration and updating during the first round of alternating training; because those layers capture the semantic features common to different natural-language tasks, this prevents the multi-task training from overfitting. The number of layers to freeze can be set according to the actual situation: if multi-task overfitting is severe during training of the multi-task model, increase the number of frozen layers; if no overfitting occurs, reduce it appropriately. Details are not repeated here.
To better freeze the parameters to be frozen, so that they remain unchanged during the first round of alternating training, i.e. participate only in the forward loss computation and not in back-propagation and updating, the inventors propose that the step of freezing the first parameters to be frozen comprises:
updating the parameter attribute of the first parameters to be frozen according to a preset freezing attribute;
adding a parameter filter to the optimizer of the multi-task model;
during the first round of alternating training, having the parameter filter filter out the first parameters to be frozen according to their updated parameter attribute, thereby completing the freezing of the first parameters to be frozen. For example, the attribute requires_grad of the first parameters to be frozen is set to False, a parameter filter (filter) is added to the optimizer, and during model training the filter performs parameter filtering according to the requires_grad attribute of the parameters to be frozen, so that the first parameters to be frozen are frozen. This better guarantees that the first parameters to be frozen do not participate in back-propagation and updating during the first round of alternating training.
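A minimal PyTorch-style sketch of this freezing mechanism follows, assuming the hypothetical MultiTaskModel sketched earlier, whose trunk exposes its Transformer blocks as trunk.layers; the layer count and hyperparameters below are made up. Only parameters whose requires_grad attribute is still True are handed to the optimizer, so the parameter filter keeps the frozen parameters out of back-propagation and updates.

```python
import torch

def freeze_first_sublayers(model, num_frozen_layers):
    """Freeze the parameters of the first `num_frozen_layers` Transformer blocks,
    i.e. the sub-layers closest to the embedding layer."""
    for block in model.trunk.layers[:num_frozen_layers]:
        for param in block.parameters():
            param.requires_grad = False      # preset "freezing attribute"

model = MultiTaskModel(vocab_size=30000, hidden_dim=256, num_layers=6,
                       num_heads=8, task_num_classes=[3, 5, 2])
freeze_first_sublayers(model, num_frozen_layers=2)

# Parameter filter added to the optimizer: frozen parameters are filtered out,
# so they take part in the forward pass only, not in back-propagation or updates.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```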
In some embodiments, the step of freezing all or part of the parameters in the intermediate parameter layer of the distilled student model during the second round of alternating training comprises: according to a preset second number of layers to freeze, acquiring the second sub-layers to be frozen starting from the second sub-layer closest to the embedding layer of the distilled student model; determining the second parameters to be frozen from those sub-layers; and freezing the second parameters to be frozen. For example, all parameters of the intermediate parameter layer of the student model are frozen, or only the parameters in some second sub-layers close to the embedding layer. Because the Transformer blocks close to the embedding layer capture shallower features than the later Transformer blocks, they tend to capture features common to different tasks; freezing part of the parameters of the intermediate parameter layer of the distilled student model therefore avoids antagonism and conflict among the multi-task training and improves training accuracy.
In some embodiments, the step of freezing the second parameter to be frozen comprises:
updating the parameter attribute of the second parameter to be frozen according to the preset freezing attribute;
adding a parameter filter into an optimizer of the distilled student model;
and during the second round of alternating training, having the corresponding parameter filter filter out the second parameters to be frozen according to their updated parameter attribute, thereby completing the freezing. For example, the attribute requires_grad of the second parameters to be frozen is set to False, a parameter filter is added to the corresponding optimizer, and during model training the filter performs parameter filtering according to the requires_grad attribute of the second parameters to be frozen, so that they are frozen. This better guarantees that the second parameters to be frozen do not participate in back-propagation and updating during the second round of alternating training.
Referring to fig. 5, the step of performing model distillation using the teacher model comprises:
S501: acquire a training data set comprising a plurality of training samples and the labels corresponding to the training samples; for example, information is collected with artificial intelligence technology to obtain the corresponding training data set.
S502: input the training samples of the training data set into the teacher model and the obtained student model respectively for prediction, to obtain a teacher prediction result and a student prediction result.
S503: acquire a first loss of the teacher model according to the teacher prediction result and a preset first loss function.
S504: acquire a second loss of the student model according to the student prediction result and a preset second loss function.
S505: acquire a third loss according to the first loss, the second loss, and a preset weight, and train and optimize the student model with the third loss to obtain the distilled student model. The first loss and the second loss are weighted with the preset weight to obtain a smoothed third loss; back-propagation is performed with the third loss, realizing iterative training and optimization of the student model, and the better-performing student model is taken as the distilled student model.
For example, during model distillation the teacher model is used for prediction to obtain a corresponding teacher prediction result (soft target), while the student model is used for prediction to obtain a corresponding student prediction result (hard target). The first loss Loss_soft of the teacher model is obtained from the teacher prediction result (soft target) and a preset first loss function, the second loss Loss_hard of the student model is obtained from the student prediction result (hard target) and a preset second loss function, and the third loss is obtained from the first loss, the second loss, and a preset weight; the student model is trained and optimized with the third loss to obtain the distilled student model.
The mathematical expression of the first loss function is:
Loss_soft = CrossEntropyLoss(teacher_pred, true_label)
The mathematical expression of the second loss function is:
Loss_hard = CrossEntropyLoss(student_pred, true_label)
The mathematical expression of the third loss function is:
Loss_distillation = λ * Loss_soft + (1 - λ) * Loss_hard
where CrossEntropyLoss() is the cross-entropy loss, teacher_pred is the prediction value of the teacher model, true_label is the true value, student_pred is the prediction value of the student model, λ is the preset weight, Loss_soft is the first loss, Loss_hard is the second loss, and Loss_distillation is the third loss.
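The three formulas above translate directly into code; the following sketch (the function name and the PyTorch framing are assumptions, not taken from the patent) computes the smoothed third loss from the teacher and student logits and the true labels:

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher_logits, student_logits, true_label, lam):
    """Loss_distillation = lam * Loss_soft + (1 - lam) * Loss_hard, with both terms
    computed as cross-entropy against the true labels, as in the formulas above."""
    loss_soft = F.cross_entropy(teacher_logits, true_label)   # first loss (teacher side)
    loss_hard = F.cross_entropy(student_logits, true_label)   # second loss (student side)
    return lam * loss_soft + (1 - lam) * loss_hard             # third (smoothed) loss
```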
In some embodiments, the step of obtaining a student model comprises:
inputting the training samples into a neural network for prediction to obtain a training prediction result;
and training the neural network according to the training prediction result and the teacher prediction result corresponding to the training sample, to obtain the corresponding student model, wherein the student model has fewer parameters than the teacher model. It can be appreciated that teacher models are often more complex than student models and have more parameters; obtaining a student model and performing model distillation therefore facilitates model compression.
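Putting the pieces together, one student optimization step under teacher supervision might look like the sketch below. This is again an illustrative assumption: the teacher and student are taken to be models of the hypothetical MultiTaskModel kind, distillation_loss is the function sketched above, and the teacher's outputs are computed without gradients since only the student is updated.

```python
import torch

def distillation_step(teacher, student, optimizer, token_ids, task_id, labels, lam=0.5):
    """One training/optimization step of the student model under teacher supervision."""
    with torch.no_grad():                               # the teacher model is not updated
        teacher_logits = teacher(token_ids, task_id)    # teacher prediction result (soft target)
    student_logits = student(token_ids, task_id)        # student prediction result (hard target)
    loss = distillation_loss(teacher_logits, student_logits, labels, lam)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```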
As shown in fig. 6, the step of determining the final model according to the student models after the second round of alternating training comprises:
S601: combine the embedding layers, intermediate parameter layers, and classification layers of the plurality of student models that have undergone the second round of alternating training according to a preset combination rule to obtain the final model, wherein the plurality of embedding layers correspond to the plurality of intermediate parameter layers, and the plurality of intermediate parameter layers correspond to the plurality of classification layers. Combining the embedding layers and intermediate parameter layers of the twice-trained student models with the classification layers corresponding to the tasks yields a distilled final model with a smaller number of parameters, achieving accurate distillation of the multi-task model. This avoids the antagonism in model training caused by an imbalanced training data set, for example the model emphasizing the task with the larger amount of training data during training at the expense of the other tasks; it effectively reduces the impact of an imbalanced data set on multi-task distillation, realizes mutual reinforcement among the multi-task learning and training, has low cost, and is highly practicable.
In some embodiments, the combination can be performed in the order embedding layer, intermediate parameter layer, classification layer, resulting in a distilled multi-task model with a smaller number of parameters, as in the sketch below, the plurality of embedding layers corresponding to the plurality of intermediate parameter layers and the plurality of intermediate parameter layers corresponding to the plurality of classification layers.
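As a purely illustrative sketch of this combination rule, assuming each twice-trained student model exposes attributes named embedding, trunk (the intermediate parameter layer), and head (its single task-specific classification layer), none of which are names given in the patent, the final model can route each task through its own triple of layers in that order:

```python
import torch.nn as nn

class CombinedMultiTaskModel(nn.Module):
    """Final model assembled from the twice-trained student models: for each task,
    its embedding layer, intermediate parameter layer, and classification layer
    are combined in that order."""
    def __init__(self, student_models):
        super().__init__()
        self.embeddings = nn.ModuleList(m.embedding for m in student_models)
        self.trunks = nn.ModuleList(m.trunk for m in student_models)
        self.heads = nn.ModuleList(m.head for m in student_models)

    def forward(self, token_ids, task_id):
        x = self.embeddings[task_id](token_ids)
        x = self.trunks[task_id](x)
        return self.heads[task_id](x.mean(dim=1))
```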
Embodiment one:
For an application scenario of multi-task model distillation, the multi-task model undergoes a first round of alternating training over a plurality of tasks according to a preset task training order, e.g. task one, task two, task three, ..., task one. During this first round, all parameters of the intermediate parameter layer of the multi-task model, a Transformer-structured parameter layer, are frozen, and the embedding layer and classifier layers of the multi-task model are retained. The embedding layer that underwent the first round of alternating training, the frozen intermediate parameter layer, and any one alternately trained classifier layer corresponding to a task are taken as a teacher model, yielding a plurality of teacher models whose number corresponds to the number of tasks. Model distillation is performed with the teacher models to obtain a plurality of distilled student models, which correspond one to one with the tasks. A second round of alternating training is then performed on the student models according to the preset task training order, e.g. student model one, student model two, student model three, ..., student model one; during this round the intermediate parameter layers of the distilled student models are frozen and the embedding layers and corresponding classifier layers are retained. Finally, the embedding layers, intermediate parameter layers, and corresponding classifier layers of the students after the second round of alternating training are combined into a distilled multi-task model with fewer parameters, i.e. the final model. This effectively improves the accuracy of the distilled multi-task model, avoids the conflicts and antagonism among the tasks during distillation and training that an imbalanced training data set would cause, reduces the impact of the imbalanced data on multi-task distillation, and makes good use of the mutual reinforcement among the tasks.
Embodiment two:
For an application scenario with imbalanced multi-task learning training samples, the multi-task model undergoes a first round of alternating training over a plurality of tasks. During this round, part of the parameters of the intermediate parameter layer are frozen and the embedding layer and classifier layers of the multi-task model are retained, where freezing part of the parameters of the Transformer-structured intermediate parameter layer comprises: determining the parameters to be frozen in the intermediate parameter layer according to a preset number of layers to freeze, and freezing them during the round. The alternately trained embedding layer and intermediate parameter layer, together with the classifier layer of any one task, are then taken as a teacher model, and model distillation is performed to obtain the corresponding student models. Next, according to the tasks corresponding to the student models, a second round of alternating training is performed on the student models; during this round part of the parameters of the intermediate parameter layer of each student model are frozen, again determined by a preset number of layers to freeze, and the embedding layer and classifier layer of the student model are retained. Finally, the alternately trained student models are combined into a multi-task model. When the multi-task learning samples are imbalanced, for example when the training samples of one task far exceed those of the other tasks, this method avoids the conflicts and antagonism in model training caused by sample imbalance, realizes mutual reinforcement among the multi-task learning and training, and improves the distillation accuracy of the multi-task model.
As shown in fig. 7, the present embodiment also provides a multi-task model distillation system, comprising:
a first alternating-training module, configured to freeze the intermediate parameter layer of the multi-task model when a first round of alternating training over a plurality of tasks is performed on the multi-task model according to a preset task training order, and to retain the embedding layer and the plurality of classification layers corresponding to the tasks, wherein the intermediate parameter layer comprises a plurality of first sub-layers and freezing it comprises freezing all or part of its parameters, the part of the parameters comprising the parameters of several consecutive first sub-layers starting from the first sub-layer closest to the embedding layer;
a distillation module, configured to take the embedding layer that has undergone the multi-task alternating training, the frozen intermediate parameter layer, and any one classification layer corresponding to a task as a teacher model, and to perform model distillation with the teacher models to obtain a plurality of distilled student models;
a second alternating-training module, configured to perform a second round of alternating training on the plurality of distilled student models according to the task training order to obtain student models that have undergone the second round of alternating training;
and a processing module, configured to determine the final model according to the student models that have undergone the second round of alternating training; the first alternating-training module, the distillation module, and the second alternating-training module are connected to the processing module. In this system, a first round of alternating training over a plurality of tasks is performed on the multi-task model; during the round the intermediate parameter layer is frozen while the embedding layer and task-specific classification layers are retained; the alternately trained embedding layer, the frozen intermediate parameter layer, and any one task-specific classification layer serve as a teacher model; model distillation with the teacher models yields a plurality of distilled student models; a second round of alternating training is performed on them according to the task training order; and the final model is determined from the twice-trained student models. During model training, the intermediate parameter layer therefore cannot overfit to any single task because of imbalanced multi-task training data, antagonism among the tasks is avoided, the influence of sample imbalance is eliminated, the resource consumption of training is greatly reduced, the tasks reinforce one another during learning and training, the accuracy is high, and the cost is low.
In some embodiments, the step in which the second alternating-training module performs the second round of alternating training on the plurality of distilled student models comprises:
freezing the intermediate parameter layer of the distilled student model, i.e. freezing all or part of its parameters, the intermediate parameter layer comprising a plurality of second sub-layers and the part of the parameters comprising the parameters of several consecutive second sub-layers starting from the second sub-layer closest to the embedding layer of the distilled student model;
and retaining the embedding layer and the corresponding classification layer of the distilled student model, thereby obtaining the student model after the second round of alternating training.
In some embodiments, the step in which the first alternating-training module freezes all or part of the parameters in the intermediate parameter layer of the multi-task model comprises:
acquiring the first sub-layers to be frozen, starting from the first sub-layer closest to the embedding layer of the multi-task model, according to a preset first number of layers to freeze;
determining the first parameters to be frozen according to the first sub-layers to be frozen;
and freezing the first parameters to be frozen.
In some embodiments, the step in which the second alternating-training module freezes all or part of the parameters in the intermediate parameter layer of the distilled student model comprises:
acquiring the second sub-layers to be frozen, starting from the second sub-layer closest to the embedding layer of the distilled student model, according to a preset second number of layers to freeze;
determining the second parameters to be frozen according to the second sub-layers to be frozen;
and freezing the second parameters to be frozen.
In some embodiments, the step of freezing the first parameters to be frozen comprises:
updating the parameter attribute of the first parameters to be frozen according to a preset freezing attribute;
adding a parameter filter to the optimizer of the multi-task model;
during the first round of alternating training, having the parameter filter filter out the first parameters to be frozen according to their updated parameter attribute, thereby completing the freezing of the first parameters to be frozen. In some embodiments, the step in which the distillation module performs model distillation using the teacher model comprises:
acquiring a training data set comprising a plurality of training samples and the labels corresponding to the training samples;
inputting the training samples of the training data set into the teacher model and the obtained student model respectively for prediction, to obtain a teacher prediction result and a student prediction result;
acquiring a first loss of the teacher model according to the teacher prediction result and a preset first loss function;
acquiring a second loss of the student model according to the student prediction result and a preset second loss function;
and acquiring a third loss according to the first loss, the second loss, and a preset weight, and training and optimizing the student model with the third loss to obtain the distilled student model.
In some embodiments, the step in which the processing module determines the final model according to the student models after the second round of alternating training comprises:
combining the embedding layers, the intermediate parameter layers, and the classification layers of the plurality of student models that have undergone the second round of alternating training according to a preset combination rule to obtain the final model, wherein the plurality of embedding layers correspond to the plurality of intermediate parameter layers, and the plurality of intermediate parameter layers correspond to the plurality of classification layers.
The present embodiment also provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements any of the methods in the present embodiments.
The present embodiment further provides an electronic terminal, including: a processor and a memory;
the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the terminal to execute the method in the embodiment.
Fig. 8 is a schematic structural diagram of an electronic terminal according to an embodiment of the invention. This embodiment provides an electronic terminal comprising: a processor 81, a memory 82, a communicator 83, a communication interface 84, and a system bus 85. The memory 82 and the communication interface 84 are connected to the processor 81 and the communicator 83 through the system bus 85 for mutual communication; the memory 82 is used for storing a computer program, the communication interface 84 is used for communicating with other devices, and the processor 81 and the communicator 83 are used for running the computer program so that the electronic terminal can execute the steps of the above multitask model distillation method.
The system bus 85 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus. The communication interface is used to realize communication between the database access apparatus and other devices (such as a client, a read-write library, and a read-only library).
In this embodiment, the memory may include a Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The computer-readable storage medium in this embodiment can be understood by those skilled in the art as follows: all or part of the steps for implementing the above method embodiments may be completed by hardware related to a computer program. The aforementioned computer program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments, and the aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk or an optical disk. Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, or any suitable combination of the foregoing. Program code for carrying out operations of aspects of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the C programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. The server may be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device, for example through the Internet using an Internet service provider.
The embodiments of the present application may acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
In summary, in the multitask model distillation method, system, medium and electronic terminal of this embodiment, a first alternate training of a plurality of tasks is performed on the multitask model; during the first alternate training, the intermediate parameter layer of the multitask model is frozen while the embedding layer and the plurality of classification layers corresponding to the tasks are retained, wherein the intermediate parameter layer of the multitask model comprises a plurality of first sublayers, and the step of freezing the intermediate parameter layer of the multitask model comprises: freezing all or part of the parameters in the intermediate parameter layer of the multitask model, the part of the parameters comprising the parameters of a plurality of consecutive first sublayers starting from the first sublayer adjacent to the embedding layer. The embedding layer subjected to the multi-task alternate training, the frozen intermediate parameter layer and any one of the classification layers corresponding to the tasks are taken as a teacher model; model distillation is performed with the teacher model to obtain a plurality of distilled student models; a second alternate training is performed on the plurality of distilled student models according to the task training sequence to obtain the student models after the second alternate training; and a final model is determined according to the student models after the second alternate training. In this way, the advantage of sharing training data in multitask learning is fully exploited: a general, task-agnostic semantic learning module, namely the embedding layer, is fully trained, while the intermediate parameter layer of the transformer structure is prevented from over-fitting to any single task because of imbalanced multi-task training data. Conflicts between tasks are avoided, the influence of sample imbalance between tasks is eliminated, the resource consumption of model training is greatly reduced, and the tasks in multi-task learning promote one another, so that the method has high accuracy, strong practicability and low cost.
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed in the present invention shall still be covered by the claims of the present invention.

Claims (10)

1. A multitask model distillation method, comprising:
when a first alternate training of a plurality of tasks is performed on the multitask model according to a preset task training sequence, freezing an intermediate parameter layer of the multitask model and retaining an embedding layer and a plurality of classification layers corresponding to the tasks, wherein the intermediate parameter layer of the multitask model comprises a plurality of first sublayers, and the step of freezing the intermediate parameter layer of the multitask model comprises: freezing all or part of the parameters in the intermediate parameter layer of the multitask model, the part of the parameters comprising the parameters of a plurality of consecutive first sublayers starting from the first sublayer adjacent to the embedding layer;
taking the embedding layer subjected to the multi-task alternate training, the frozen intermediate parameter layer and any one of the classification layers corresponding to the tasks as a teacher model, and performing model distillation by using the teacher model to obtain a plurality of distilled student models;
performing a second alternate training on the plurality of distilled student models according to the task training sequence to obtain the student models after the second alternate training;
and determining a final model according to the student models after the second alternate training.
2. The multitask model distillation method according to claim 1, wherein the step of performing the second alternate training on the plurality of distilled student models comprises:
freezing the intermediate parameter layer of the distilled student model, wherein the step of freezing the intermediate parameter layer of the distilled student model comprises: freezing all or part of the parameters in the intermediate parameter layer of the distilled student model, the intermediate parameter layer of the distilled student model comprising a plurality of second sublayers, and the part of the parameters comprising the parameters of a plurality of consecutive second sublayers starting from the second sublayer adjacent to the embedding layer of the distilled student model;
and retaining the embedding layer and the corresponding classification layer of the distilled student model, thereby obtaining the student model after the second alternate training.
3. The multitask model distillation method of claim 1, wherein the step of freezing all or a portion of the parameters in the intermediate parameter layer of the multitask model comprises:
acquiring a first sublayer to be frozen according to a preset first number of layers to be frozen, starting from the first sublayer adjacent to the embedding layer of the multitask model;
determining a first parameter to be frozen according to the first sublayer to be frozen;
and freezing the first parameter to be frozen.
4. The multitask model distillation method according to claim 2, wherein the step of freezing all or part of the parameters in the intermediate parameter layer of the distilled student model comprises:
acquiring a second sublayer to be frozen according to a preset second number of layers to be frozen, starting from the second sublayer adjacent to the embedding layer of the distilled student model;
determining a second parameter to be frozen according to the second sublayer to be frozen;
and freezing the second parameter to be frozen.
5. The multitask model distillation method of claim 3, wherein the step of freezing the first parameter to be frozen comprises:
updating the parameter attribute of the first parameter to be frozen according to a preset freezing attribute;
adding a parameter filter to an optimizer of the multitask model;
and in the process of the first alternate training, filtering out, by the parameter filter, the first parameter to be frozen according to the updated parameter attribute of the first parameter to be frozen, thereby completing the freezing of the first parameter to be frozen.
6. The multitask model distillation method according to claim 1, wherein the step of performing model distillation by using the teacher model comprises:
acquiring a training data set, the training data set comprising: a plurality of training samples and predictive labels corresponding to the training samples;
respectively inputting the training samples in the training data set into the teacher model and the obtained student models for prediction to obtain a teacher prediction result and a student prediction result;
acquiring a first loss of a teacher model according to the teacher prediction result and a preset first loss function;
acquiring a second loss of the student model according to the student prediction result and a preset second loss function;
and acquiring a third loss according to the first loss, the second loss and a preset weight, and training and optimizing the student model by using the third loss to acquire the distilled student model.
7. The multitask model distillation method of claim 1, wherein the step of determining the final model according to the student models after the second alternate training comprises:
combining the embedding layers, the intermediate parameter layers and the classification layers of the plurality of student models after the second alternate training according to a preset combination rule to obtain the final model, wherein the plurality of embedding layers correspond to the plurality of intermediate parameter layers, and the plurality of intermediate parameter layers correspond to the plurality of classification layers.
8. A multitask model distillation system, comprising:
a first alternate training module, configured to freeze an intermediate parameter layer of the multitask model and retain an embedding layer and a plurality of classification layers corresponding to the tasks when a first alternate training of a plurality of tasks is performed on the multitask model according to a preset task training sequence, wherein the intermediate parameter layer of the multitask model comprises a plurality of first sublayers, and the step of freezing the intermediate parameter layer of the multitask model comprises: freezing all or part of the parameters in the intermediate parameter layer of the multitask model, the part of the parameters comprising the parameters of a plurality of consecutive first sublayers starting from the first sublayer adjacent to the embedding layer;
a distillation module, configured to take the embedding layer subjected to the multi-task alternate training, the frozen intermediate parameter layer and any one of the classification layers corresponding to the tasks as a teacher model, and perform model distillation by using the teacher model to obtain a plurality of distilled student models;
a second alternate training module, configured to perform a second alternate training on the plurality of distilled student models according to the task training sequence to obtain the student models after the second alternate training;
and a processing module, configured to determine a final model according to the student models after the second alternate training.
9. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.
10. An electronic terminal, comprising: a processor and a memory;
the memory is for storing a computer program and the processor is for executing the computer program stored by the memory to cause the terminal to perform the method of any of claims 1 to 7.
CN202111009408.1A 2021-08-31 2021-08-31 Multitask model distillation method, multitask model distillation system, multitask model distillation medium and electronic terminal Pending CN113706347A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111009408.1A CN113706347A (en) 2021-08-31 2021-08-31 Multitask model distillation method, multitask model distillation system, multitask model distillation medium and electronic terminal

Publications (1)

Publication Number Publication Date
CN113706347A true CN113706347A (en) 2021-11-26

Family

ID=78657601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111009408.1A Pending CN113706347A (en) 2021-08-31 2021-08-31 Multitask model distillation method, multitask model distillation system, multitask model distillation medium and electronic terminal

Country Status (1)

Country Link
CN (1) CN113706347A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115049108A (en) * 2022-05-20 2022-09-13 支付宝(杭州)信息技术有限公司 Multitask model training method, multitask prediction method, related device and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611377A (en) * 2020-04-22 2020-09-01 淮阴工学院 Knowledge distillation-based multi-layer neural network language model training method and device
CN111950302A (en) * 2020-08-20 2020-11-17 上海携旅信息技术有限公司 Knowledge distillation-based machine translation model training method, device, equipment and medium
CN112766463A (en) * 2021-01-25 2021-05-07 上海有个机器人有限公司 Method for optimizing neural network model based on knowledge distillation technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination