CN116432731A - Student model training method and text classification system - Google Patents

Student model training method and text classification system

Info

Publication number
CN116432731A
Authority
CN
China
Prior art keywords
model
training
teacher
student
student model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310240085.XA
Other languages
Chinese (zh)
Inventor
汪诚愚
陈小庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202310240085.XA priority Critical patent/CN116432731A/en
Publication of CN116432731A publication Critical patent/CN116432731A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to a student model training method and a text classification system. The method comprises the following steps: adding prompt information and a mask text placeholder to samples to obtain processed training samples; fine-tuning a pre-trained language model (PLM) with the processed training samples to obtain a prompt-tuned teacher model; fine-tuning the PLM with labeled out-of-domain training data to obtain an out-of-domain-data fine-tuned teacher model; and training a student model with the processed training samples, the student model simultaneously learning, during training, the classification probability vectors output by the two teacher models. By introducing the out-of-domain teacher model into knowledge distillation, the method improves the distillation accuracy of the student model. Further, the degree of influence of the out-of-domain teacher model can be controlled according to the in-domain expertise score. Overfitting caused by label scarcity in small-sample scenarios can also be further mitigated by an additional pseudo classification probability vector.

Description

Student model training method and text classification system
Technical Field
The disclosure relates to the field of deep learning, and in particular relates to a student model training method and a text classification system.
Background
To achieve high-accuracy predictions on a particular natural language processing task, it is usually necessary to train a pre-trained language model (PLM) with a large amount of labeled data, which makes training very costly. Few-shot (small-sample) learning techniques allow a pre-trained language model to be trained with only a small number of training samples, achieving relatively high prediction accuracy at a lower training cost.
However, in order to learn the knowledge contained in huge corpora, the parameter scale of a PLM is very large; for example, the existing GPT-3 model has as many as 175B parameters. This makes the PLM unusable in resource-constrained or latency-sensitive scenarios. For this reason, there is a need for an improved deep learning language model suitable for resource-constrained scenarios.
Disclosure of Invention
One technical problem to be solved by the present disclosure is to provide a student model training method and a text classification system. In this method, an out-of-domain teacher model is introduced during knowledge distillation, which improves the distillation accuracy of the student model. Further, the degree of influence of the out-of-domain teacher model can be controlled according to the in-domain expertise score. Overfitting caused by label scarcity in small-sample scenarios can also be further mitigated by an additional pseudo classification probability vector.
According to a first aspect of the present disclosure, there is provided a student model training method, comprising: adding prompt information and a mask text placeholder to samples to obtain processed training samples; fine-tuning a pre-trained language model PLM with the processed training samples to obtain a prompt-tuned teacher model; fine-tuning the PLM with labeled out-of-domain training data to obtain an out-of-domain-data fine-tuned teacher model; and training a student model with the processed training samples, the student model simultaneously learning, during training, the classification probability vectors output by the prompt-tuned teacher model and the out-of-domain-data fine-tuned teacher model.
Optionally, the method further comprises: training the student model using the labeled out-of-domain training data, wherein the student model simultaneously learning, during training, the classification probability vectors output by the prompt-tuned teacher model and the out-of-domain-data fine-tuned teacher model comprises: limiting, based on the difference between the prediction results respectively output by the prompt-tuned teacher model and the out-of-domain-data fine-tuned teacher model for the out-of-domain training data, the degree to which the student model learns, during training, the classification probability vector output by the out-of-domain-data fine-tuned teacher model.
Optionally, training the student model using the processed training samples comprises: obtaining the student model's prediction result corresponding to the mask text placeholder; and adjusting the network parameters of the student model with a first loss function, the first loss function computing the loss according to whether the prediction result corresponding to the mask text placeholder is the same as the label.
Optionally, the student model simultaneously learning, during training, the classification probability vectors of the prompt-tuned teacher model and the out-of-domain-data fine-tuned teacher model includes: adjusting the network parameters of the student model with a second loss function, the second loss function characterizing the similarity between the classification probability vector output by the student model and the classification probability vector output by the prompt-tuned teacher model; and adjusting the network parameters of the student model with a third loss function, the third loss function characterizing the similarity between the classification probability vector output by the student model and the classification probability vector output by the out-of-domain-data fine-tuned teacher model.
Optionally, the third loss function characterizes an adjusted similarity between the classification probability vector output by the student model and the classification probability vector output by the out-of-domain-data fine-tuned teacher model, the adjustment coefficient corresponding to the difference between the prediction results respectively output by the prompt-tuned teacher model and the out-of-domain-data fine-tuned teacher model for each item of out-of-domain training data.
Optionally, the student model simultaneously learning, during training, the classification probability vectors output by the prompt-tuned teacher model and the out-of-domain-data fine-tuned teacher model includes: the student model does not learn the intermediate-layer outputs of the prompt-tuned teacher model or of the out-of-domain-data fine-tuned teacher model during training.
Optionally, the method further comprises: the student model learns a pseudo-probability distribution constructed based on a label smoothing operation during training.
Optionally, the learning of the pseudo-probability distribution constructed based on the label smoothing operation by the student model during training includes: converting the pseudo-probability distribution into pseudo-classification probability vectors; and adjusting network parameters of the student model with a fourth loss function, the fourth loss function characterizing a difference between a real label of the processed training sample and the pseudo-classification probability vector.
Optionally, the student model simultaneously learning, during training, the classification probability vectors output by the prompt-tuned teacher model and the out-of-domain-data fine-tuned teacher model includes: training the student model using, as the total loss function, a weighted sum of the first loss function and of the loss functions characterizing the similarity between the classification probability vectors output respectively by the prompt-tuned teacher model and by the out-of-domain-data fine-tuned teacher model and the classification probability vector output by the student model.
According to a second aspect of the present disclosure, there is provided a text classification system comprising: an input acquisition unit configured to acquire text input from a user; a classification determination unit including a student model acquired by the method according to the first aspect, the student model being for classifying based on the input text; and an operation unit configured to perform an operation according to the classification result, the operation including at least one of: feeding back based on the intention classification result of the input text; counting based on the emotion tendency classification result of the input text; and reporting based on the attribute classification of the input text.
According to a third aspect of the present disclosure, there is provided a computing device comprising: a processor; and a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method as described in the first aspect above.
According to a fourth aspect of the present disclosure, there is provided a computer program product comprising executable code which, when executed by a processor of an electronic device, causes the processor to perform the method as described in the first aspect above.
According to a fifth aspect of the present disclosure there is provided a non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform the method as described in the first aspect above.
Therefore, by introducing the out-of-domain teacher model, the invention greatly enriches the supervision available in small-sample scenarios, thereby improving the distillation accuracy of the student model. The degree of influence of the out-of-domain teacher model can be controlled according to the in-domain expertise score, and overfitting caused by label scarcity in small-sample scenarios can be further mitigated by the additional pseudo classification probability vector.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout exemplary embodiments of the disclosure.
Fig. 1 shows an example of prompt and label selection for comment emotion analysis.
FIG. 2 shows a schematic flow chart of a student model training method according to one embodiment of the invention.
Fig. 3 shows an example of a soft and hard target and a temperature factor adjusting soft target.
Figure 4 shows an overall schematic of the training of a student model based on two teacher models according to the invention.
Fig. 5 shows a schematic composition diagram of a text classification system according to an embodiment of the invention.
FIG. 6 illustrates a schematic diagram of a computing device that may be used to implement the student model training method described above, according to one embodiment of the invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Large-scale pre-trained language models have achieved great success in various fields of NLP (natural language processing). One no longer trains a language model from scratch; instead, a general PLM is first obtained on a large general corpus through unsupervised proxy tasks, and then, on a downstream task, the general PLM fine-tunes its parameters on supervised data, realizing the target classification function by exploiting the language knowledge already acquired from the general corpus. This two-stage paradigm has been widely adopted in many practical language application scenarios.
Few-shot learning (small-sample learning) is a machine learning paradigm that aims to obtain a relatively accurate model with only a small amount of fine-tuning under extremely few training samples. Whether a system can learn and generalize from a small number of samples is an obvious demarcation between artificial intelligence and human intelligence: humans can easily build knowledge of new things from just one or a few examples, whereas machine learning algorithms typically require thousands of labeled samples to guarantee their generalization ability. In fields such as machine vision and natural language processing, data labeling is expensive, and in a new scenario the annotation data is quite scarce, which limits the application of deep learning algorithms. Few-shot learning is therefore significant and challenging in the field of machine learning. Inspired by the rapid learning ability of humans, it is expected that, after learning a large amount of data for certain categories, a machine learning model can learn new categories rapidly from only a small number of samples; this is the problem few-shot learning seeks to solve.
Given the special nature of the small-sample task, the downstream-task fine-tuning of stage two can be reformulated as a cloze (fill-in-the-blank) problem, i.e., using PET (Pattern-Exploiting Training).
Starting from BERT, prompt-based fine-tuning of pre-trained language models on downstream tasks has become common practice in the NLP field. The GPT-3 model, with 175B parameters, brought a new way of using an LM for downstream tasks: by using natural language prompts and task demonstrations as context, GPT-3 can handle many tasks with only a few samples, without updating the parameters of the underlying model. The huge model size of GPT-3 is an important factor in its success, and the concepts of prompts and demonstrations also give new insight into how to better use language models. The prompt information is a piece of text inserted into the input sample so that the original prediction task can be converted into an MLM (masked language model) problem. For example, suppose we want to classify the sentiment of the movie comment "No reason to watch"; we can append the prompt "It was" to the sentence, resulting in "No reason to watch. It was [MASK]". The "[MASK]" token corresponds to the predicted output of the pre-trained model's MLM head, which is then mapped to the actual class label. For the above example, if the probability of predicting "great" is high, it corresponds to the "positive" category, and if the probability of predicting "terrible" is high, it corresponds to the "negative" category. Given the massive language knowledge in the PLM, the PLM will assign a higher probability to "[MASK]" corresponding to "terrible" rather than "great".
Fig. 1 shows an example of prompt and label selection for comment sentiment analysis. As shown in fig. 1, to determine the sentiment category (e.g., positive appreciation or negative criticism) of the sentence "Wonderful movie in every aspect", the prompt "It is [MASK]" may be appended directly to the input text, and the [MASK] to be predicted by the model may be "good", corresponding to the positive label, or "terrible", corresponding to the negative label. In other words, the prompt template may be structured in the format "It is + sentiment word", and a verbalizer is used to select from the vocabulary two words corresponding to positive and negative sentiment as labels, in this case "good" and "terrible". Thus, the original training sample "Wonderful movie in every aspect" may be modified into a processed training sample with the prompt and mask added: "Wonderful movie in every aspect. It is [MASK]", which is then fed into the PLM for training, e.g., the PLM predicts whether [MASK] is "good" or "terrible" for loss calculation and back-propagation-based parameter adjustment.
In the example shown in FIG. 1, the positive and negative label words may be selected manually; for example, "good" and "terrible" may be selected as illustrated, or other words in the vocabulary (e.g., the full vocabulary) that express sentiment attributes, such as "great" and "bad", may be selected. In addition, in the example of FIG. 1, the prompt phrase "It is" may also be designed manually.
In addition, although examples of English text and prompts are shown in the figures, prompts, masks, and labels may likewise be used for sample construction and subsequent classification in Chinese.
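As an illustration of the sample construction described above, a minimal Python sketch follows; the template text, the verbalizer words, and the function name are illustrative assumptions and are not fixed by this disclosure.

```python
# Minimal sketch of prompt + mask construction with a verbalizer
# (template, label words and names are illustrative assumptions).
MASK_TOKEN = "[MASK]"
PROMPT_TEMPLATE = "{text} It is " + MASK_TOKEN + "."
VERBALIZER = {"positive": "good", "negative": "terrible"}  # label -> label word


def build_prompted_sample(text: str, label: str) -> dict:
    """Append the prompt and the mask text placeholder to a raw sample."""
    return {
        "input_text": PROMPT_TEMPLATE.format(text=text),
        "label": label,                   # true label kept for training
        "label_word": VERBALIZER[label],  # word the MLM head should predict
    }


sample = build_prompted_sample("Wonderful movie in every aspect.", "positive")
# sample["input_text"] -> "Wonderful movie in every aspect. It is [MASK]."
```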
Although an existing PLM can be prompt-tuned in small-sample learning and quickly acquire the classification capability for a target task, for example classifying comment sentiment as positive or negative, the resulting fine-tuned PLM still has a very large number of parameters, so it cannot be applied in resource-constrained or latency-sensitive scenarios.
Knowledge distillation (Knowledge Distillation, KD) can be used in machine learning to transfer knowledge from a large model to a small model. While large models (e.g., very deep neural networks or ensembles of many models) have a higher knowledge capacity than small models, this capacity may not be fully utilized. On the other hand, small models are more difficult to train than large models. Knowledge distillation transfers knowledge from a large model to a smaller model without losing validity. Because small models are cheaper to evaluate, they can be deployed on weaker hardware (e.g., mobile devices).
However, existing knowledge distillation techniques are difficult to apply to small sample learning scenarios because sparse annotation data can cause student model overfitting and existing knowledge distillation methods cannot perform training to adjust the target model based on hints.
To this end, the present invention proposes a small-sample knowledge distillation scheme based on prompt-tuning a PLM. In this scheme, the student model learns simultaneously from the prompt-tuned teacher model and from a teacher model fine-tuned on out-of-domain data, so that the over-fitting problem of the student model in small-sample scenarios is alleviated by adding a distillation pipeline over out-of-domain supervised data. Further, the present invention has found through experiments that distilling intermediate-layer representations can negatively impact distillation performance in small-sample scenarios. The present invention therefore abandons the popular practice of migrating the teacher PLM's intermediate-layer representations, and instead reduces the over-fitting of the student model in very-small-sample cases by adding a pseudo-probability-distribution distillation pipeline.
In one embodiment, the invention may be implemented as a student model training method. FIG. 2 shows a schematic flow diagram of a method of training a student model based on a pre-trained language model, according to one embodiment of the invention.
In step S210, a hint information and a mask text placeholder are added to the sample to obtain a processed training sample.
In training the student model of the present invention, a small-sample training dataset is required, for example an N-way K-shot training dataset X. Here, N is the number of classes the model can output, and K is the number of samples per class. The N-way K-shot training dataset X contains N×K samples, and in the small-sample training case the value of N×K is very small.
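Purely as an illustration of how such an N-way K-shot set might be drawn from a larger labeled pool, a sketch follows; the function and variable names are assumptions and not part of the disclosure.

```python
# Illustrative sketch: drawing an N-way K-shot training set from a labeled pool.
import random
from collections import defaultdict


def sample_n_way_k_shot(pool, k, seed=0):
    """pool: iterable of (text, label) pairs; returns K examples per class (N*K total)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in pool:
        by_label[label].append((text, label))
    few_shot = []
    for label, items in sorted(by_label.items()):
        few_shot.extend(rng.sample(items, k))  # N classes x K shots
    return few_shot
```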
The samples may be, for example, N×K sentences with emotional tendencies. Input samples may be constructed by adding the corresponding prompt text "It is" and the mask text placeholder "[MASK]" as shown in fig. 1, and a corresponding label (i.e., a true label) is generated for each sample based on the emotional tendency actually contained in the sentence.
In step S220, the pre-trained language model is fine-tuned using the processed training samples to obtain the prompt-tuned teacher model. In step S230, the pre-trained language model PLM is fine-tuned using labeled out-of-domain training data to obtain the out-of-domain-data fine-tuned teacher model. It should be understood here that when the PLM is fine-tuned with labeled out-of-domain training data, the same construction of prompts and MLM prediction may be applied; that is, the out-of-domain training data may likewise have prompt information and mask text placeholders added to the out-of-domain samples to obtain processed training samples.
In step S240, a student model is trained using the processed training samples, and during training the student model learns the classification probability vectors output by the prompt-tuned teacher model and the out-of-domain-data fine-tuned teacher model.
In the present invention, in addition to constructing the N-way K-shot training dataset X, a large-scale PLM (the teacher model) and a smaller PLM (the student model) are also required. In addition, in order to train the out-of-domain teacher model, an out-of-domain dataset X̃ that is much larger than the training dataset X (i.e., |X̃| >> |X|) is required as an auxiliary dataset for the knowledge distillation task. The student model may have a structure similar to, but smaller than, that of the teacher model. For example, the teacher model may have N_T + 1 Transformer structures and the student model may have N_S + 1 Transformer structures, where N_S + 1 < N_T + 1, and preferably N_S + 1 << N_T + 1. Those skilled in the art will appreciate that the Transformer is itself a deep learning model that uses a self-attention mechanism to speed up model training, and existing pre-trained language models comprise multiple Transformer structures. The training aims to compress, by way of knowledge distillation, the performance obtained by the teacher model through prompt-tuning on the small-sample data into the student model.
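As a hypothetical sketch of this size relationship (the specific checkpoints and layer counts below are illustrative assumptions; the disclosure does not mandate them), a large masked-LM teacher and a structurally similar but much shallower student could be instantiated as follows.

```python
# Hypothetical sketch: large MLM teacher vs. shallow student of the same family.
from transformers import AutoModelForMaskedLM, BertConfig, BertForMaskedLM

teacher = AutoModelForMaskedLM.from_pretrained("bert-large-uncased")   # 24 Transformer layers
student_config = BertConfig.from_pretrained("bert-base-uncased", num_hidden_layers=4)
student = BertForMaskedLM(student_config)                              # 4 Transformer layers, randomly initialized
```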
The above content and training objective can be described with mathematical symbols. Specifically, the training dataset is X = {(x_i, y_i)}, where y_i is the class label of the input text x_i, y_i ∈ Y, Y is the label set, and |Y| = N. Let Θ_T denote the parameters of the prompt-tuned (also called prompt fine-tuned) PLM; the model Θ_T is obtained by prompt-tuning from its pre-trained initialization Θ_T'. In other words, Θ_T' here denotes the parameters of the original teacher model. The aim of the invention is to obtain a much smaller PLM represented by the parameters Θ_S, while making the performance of Θ_S as close as possible to that of Θ_T.
To achieve this, after the small-sample training dataset X is constructed in step S210, the prompt-tuned PLM Θ_T must first be obtained from the large-scale original PLM Θ_T' in step S220. Here, a masked language model (MLM) task may be used to obtain the prompt-tuned PLM Θ_T. Specifically, a training sample fed into the original PLM may be, for example, "Wonderful movie in every aspect. It is [MASK]", and the model is required to output the label classification corresponding to [MASK]. For example, when the number of classes N = 2 (in this case the words corresponding to [MASK] include only one positive word, e.g., "good", and one negative word, e.g., "terrible"), the model outputs the probabilities that the word corresponding to [MASK] is "good" or "terrible"; if the model judges that "good" has the higher probability, the classification result is positive.
In the MLM, for the words in the vocabulary, the prediction target vector is a one-hot vector: in the set of predictable words, only the coefficient corresponding to the class label word (here, "good") is 1, and the coefficients of all other words are 0. Therefore, when the one-hot vector is used to construct the loss function, no loss is incurred only when the model predicts the label word itself, e.g., the model outputs "good", and the same loss is incurred when the model outputs any word other than "good".
Based on the computed loss, the parameters Θ_T' of the original PLM can be adjusted by the back-propagation algorithm, and after the N-way K-shot training dataset X has been input, the prompt-tuned PLM Θ_T is obtained.
The student model may be trained with the small-sample training dataset X in a manner similar to training the teacher model with X. To this end, training the student model using the processed training samples includes: obtaining the student model's prediction result corresponding to the mask text placeholder; and adjusting the network parameters of the student model with a first loss function, where the first loss function computes the loss according to whether the prediction result corresponding to the mask text placeholder is the same as the masked label word. In other words, the first loss function may likewise be constructed with one-hot vectors using the masked language model (MLM) task.
In one embodiment, the PET approach may be followed and applied to the PLM. Let l(y) be the label word of category y, and let s_{Θ_T}(l(y) | x_i) be the score predicted for l(y) at the masked language token given the input x_i and the PLM Θ_T. Based on Θ_T, the probability of assigning x_i to class y is defined as follows:

P_{Θ_T}(y | x_i) = exp(s_{Θ_T}(l(y) | x_i)) / Σ_{y'∈Y} exp(s_{Θ_T}(l(y') | x_i))

Further, P_{Θ_T}(· | x_i) denotes the probability vector over all N categories y ∈ Y (and analogously P_{Θ_S}(· | x_i) for the student model Θ_S), and ŷ_i denotes the N-dimensional one-hot true vector corresponding to x_i. The classification loss of the student model (corresponding to the first loss function, denoted here L_CLS) can then be obtained directly as:

L_CLS = (1/|X|) Σ_{(x_i, y_i)∈X} CE(P_{Θ_S}(· | x_i), ŷ_i)

where CE(·, ·) represents the cross-entropy loss between the two vectors.
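A sketch of this first loss follows, under the assumption of a HuggingFace-style masked LM that returns per-token vocabulary logits: the scores of the verbalizer's label words are gathered at the [MASK] position to form class logits, and the cross entropy against the true labels gives L_CLS. All names below are illustrative.

```python
# Sketch of the verbalizer-based classification loss (first loss function).
import torch
import torch.nn.functional as F


def class_logits_at_mask(mlm_logits, mask_positions, label_word_ids):
    """mlm_logits: (batch, seq_len, vocab); mask_positions: (batch,);
    label_word_ids: (num_classes,) vocabulary ids of the label words."""
    batch_idx = torch.arange(mlm_logits.size(0), device=mlm_logits.device)
    mask_token_logits = mlm_logits[batch_idx, mask_positions]   # (batch, vocab)
    return mask_token_logits[:, label_word_ids]                 # (batch, N)


def classification_loss(mlm_logits, mask_positions, label_word_ids, labels):
    """Cross entropy between the class distribution at [MASK] and the true labels."""
    logits = class_logits_at_mask(mlm_logits, mask_positions, label_word_ids)
    return F.cross_entropy(logits, labels)
```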
Since the present invention uses only a small labeled dataset, it faces the challenge of scarce training data and a rather limited supervision signal. Therefore, in the present invention, labeled out-of-domain data is leveraged as a supplement for knowledge distillation. Here, out-of-domain data is a concept opposed to in-domain data. If the PLM is fine-tuned using dataset A (e.g., text data collected from newspapers), while dataset B (e.g., text data collected from the XX encyclopedia) does not participate in the training of the current PLM, then dataset A corresponds to in-domain data for any downstream task based on the current PLM, and dataset B corresponds to out-of-domain data. Since different datasets usually have different forms and domains, those skilled in the art would not normally use out-of-domain data to assist in fine-tuning the current model.
The inventors of the present invention minimize the impact of the mismatch between the out-of-domain data and the current model by creatively introducing an in-domain expertise score (domain expertise score). In one embodiment, the present invention may utilize a non-small-sample out-of-domain dataset X̃ (where |X̃| >> |X|) for knowledge distillation. Because the out-of-domain dataset is not a small-sample dataset, it can relatively easily alleviate the over-fitting problem caused by the small in-domain sample size. However, considering that inter-domain differences between the out-of-domain dataset X̃ and the training dataset X may lead the student model to acquire knowledge from X̃ that should not be migrated (e.g., knowledge that is meaningless for the in-domain task), in a preferred embodiment the present invention can limit the student model's learning from X̃ through the in-domain expertise score, i.e., an index for evaluating the similarity between the probability distribution given by the prompt-tuned model obtained with the training dataset X and the probability distribution given by the model fine-tuned with the out-of-domain dataset X̃. In one embodiment, the in-domain expertise score may correspond to the difference between the prediction results respectively output by the prompt-tuned teacher model and the out-of-domain-data fine-tuned teacher model for the out-of-domain dataset X̃. In this case, the student model simultaneously learning, during training, the classification probability vectors output by the prompt-tuned teacher model and the out-of-domain-data fine-tuned teacher model may include: limiting, based on the difference between the prediction results respectively output by the two teacher models for the training samples of the out-of-domain data, the degree to which the student model learns, during training, the classification probability vector output by the out-of-domain-data fine-tuned teacher model. The in-domain expertise score is detailed below in connection with the loss function L_OKD.
In the present invention, the prompt-tuned PLM Θ_T and the PLM Θ_OT fine-tuned with out-of-domain data are both used as teacher models for knowledge distillation, i.e., in the process of training the student model with the small-sample training dataset X, the student model learns the classification probability vectors output by the prompt-tuned teacher model and by the out-of-domain-data fine-tuned teacher model. In other words, knowledge distillation can be realized by having the student model learn the classification probability vectors output by the teacher models.
A model for classification ends with a softmax layer whose output values correspond to the probabilities of the respective classes. In knowledge distillation, a teacher model with strong generalization capability is available, so the student model can directly learn the teacher's generalization capability. A very direct and effective way to migrate this generalization capability is to use the class probabilities output by the softmax layer (i.e., the classification probability vector) as "soft targets".
The conventional neural network training method defines a loss function whose aim is to make the predicted value as close as possible to the true value (corresponding to the "hard target"), i.e., to make the loss value of the neural network as small as possible. This training process amounts to maximum-likelihood estimation with respect to the ground truth. Knowledge distillation, in contrast, involves a training process in which the class probabilities of the teacher model are used as soft targets for training the student model.
Fig. 3 shows an example of hard and soft targets and of a temperature factor adjusting the soft target. Suppose fig. 3 corresponds to the output of a 10-class model. The left side of fig. 3 corresponds to the hard target, i.e., the one-hot label of the original dataset: the positive label of class 2 is 1 and the negative labels of the other 9 classes are all 0. The middle of fig. 3 shows the class probabilities corresponding to the soft target, such as the class probabilities output by the teacher model's softmax layer: each class is assigned a probability, with the positive label of class 2 having the highest probability (near 0.6), while the negative labels of the other 9 classes also have some probability, e.g., class 3 near 0.2, although these probabilities are lower than that of the positive label.
Because the output of the softmax layer carries, in addition to the positive example, a great deal of information from the teacher model's generalization and inference in the negative labels (for example, the probabilities of some negative labels are far greater than those of other negative labels, such as class 3 shown in the middle of fig. 3), the knowledge-distillation training mode lets each sample bring the student model a greater amount of information than the conventional training mode. The teacher model may consider, at inference time, that a sample has some similarity to certain negative classes; for example, an image of a dog is closer to a cat than to an airplane, so when the image is classified as a dog with the highest probability, the probability assigned to "cat" should be greater than that assigned to "airplane". In other words, when training with soft targets, the student model can quickly learn the reasoning process of the teacher model.
To this end, the student model learning, during training, the classification probability vectors output by the prompt-tuned teacher model and the out-of-domain-data fine-tuned teacher model includes: adjusting the network parameters of the student model with a second loss function, the second loss function characterizing the difference between the classification probability vector output by the student model and that output by the prompt-tuned teacher model; and adjusting the network parameters of the student model with a third loss function, the third loss function characterizing the difference between the classification probability vector output by the student model and that output by the out-of-domain-data fine-tuned teacher model (in one embodiment, the third loss function should also take the "in-domain expertise score" described above into account, as detailed below).
Here, the classification probability vectors referred to by the second and third loss functions may be the classification probability vectors output by the softmax layer of each model, i.e., computed from the output of the model's last layer (the logits, which in deep learning correspond to the output of the final fully connected layer and are not to be confused with the logistic function).
Further, the "soft targets" provided by the prompt-tuned teacher model are for the small-sample training dataset X as described above. Specifically, the same sample x_i from the small-sample training dataset X may be fed to both the prompt-tuned teacher model and the student model, the two models compute their respective classification probability vectors, and the second loss function is constructed based on the cross entropy between these two classification probability vectors, with reducing this cross entropy as the adjustment direction of the second loss function.
In one embodiment, the knowledge distillation loss for the labeled small-sample training dataset X (corresponding to the second loss function, denoted here L_KD) may be defined as the cross entropy between the temperature-softened classification probability vectors of the teacher and of the student, averaged over X:

L_KD = (1/|X|) Σ_{x_i∈X} CE( softmax(z_{Θ_T}(x_i)/α), softmax(z_{Θ_S}(x_i)/α) )

where z_{Θ_T}(x_i) and z_{Θ_S}(x_i) are the class logits produced by the prompt-tuned teacher model and by the student model for x_i.
Here, α >0 is a temperature factor. As described above, the class probability output by the teacher model softmax layer can be used as a Soft-target to help the student model learn the reasoning process of the teacher model quickly. However, since the Softmax function performs probability normalization on the Logits values between categories and amplifies the difference between Logits values, when the probability distribution entropy of the Softmax output is relatively small, the value of the negative label is very close to 0, and the contribution to the loss function is very small. At this time, the temperature factor α is required to amplify the information carried by the negative tag. Throughout the knowledge distillation process, the temperature factor may be raised first and then "cold" is restored during the test phase, which is also the source of the term "distillation".
Returning to fig. 3, the class probabilities output by the teacher model's softmax layer are shown in the middle of fig. 3, which corresponds to the temperature factor α = 1. In the distillation process, the temperature factor can be increased so that the probabilities corresponding to the other, negative labels increase. The right side of fig. 3 shows the soft targets when the temperature factor α increases (greater than 1). Obviously, the probability of the positive label is still the largest at this time, but the proportion taken by the negative labels increases.
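The temperature-softened soft-target loss can be sketched as follows; this is an assumed standard formulation, and the disclosure does not fix the value of α or whether any additional scaling is applied.

```python
# Sketch of the soft-target distillation loss with temperature factor alpha.
import torch.nn.functional as F


def soft_target_kd_loss(student_logits, teacher_logits, alpha=2.0):
    """Cross entropy between temperature-softened teacher and student distributions."""
    soft_targets = F.softmax(teacher_logits / alpha, dim=-1)          # teacher soft targets
    student_log_probs = F.log_softmax(student_logits / alpha, dim=-1)
    return -(soft_targets * student_log_probs).sum(dim=-1).mean()
```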
As previously described, the present invention leverages labeled out-of-domain data as a supplement for knowledge distillation and preferably minimizes the impact of the mismatch between the out-of-domain data and the current model by creatively introducing the in-domain expertise score. The in-domain expertise score is used to effectively measure whether an out-of-domain instance (x_i, y_i) ∈ X̃ is useful for KD, without manual labeling. To ensure homogeneity of the models, a teacher model trained on the out-of-domain dataset X̃, i.e., the PLM fine-tuned with out-of-domain data as described above, is used; its parameterization is denoted Θ_OT. The instance (x_i, y_i) is fed simultaneously to Θ_OT and Θ_T to obtain the respective prediction results P_{Θ_OT}(· | x_i) and P_{Θ_T}(· | x_i). The score s_i is the Jensen-Shannon divergence (JSD, which may also be referred to as JS divergence) between the two probability vectors predicted for the instance (x_i, y_i):

s_i = (1/2) KLD( P_{Θ_T}(· | x_i) || M_i ) + (1/2) KLD( P_{Θ_OT}(· | x_i) || M_i ), where M_i = ( P_{Θ_T}(· | x_i) + P_{Θ_OT}(· | x_i) ) / 2

and KLD(· || ·) denotes the Kullback-Leibler divergence (KLD, which may also be referred to as KL divergence) between two probability distributions. Based on the in-domain expertise score, the out-of-domain knowledge distillation loss serving as the third loss function (denoted here L_OKD) can be defined over X̃ as the temperature-softened cross entropy between the classification probability vectors of the out-of-domain-data fine-tuned teacher and of the student, with each instance's contribution weighted by an adjustment coefficient determined by s_i, so that instances on which the two teachers disagree strongly contribute less.
the KL divergence is used here to measure the difference between two distributions, which is equal to one cross entropy minus one information entropy. The KL divergence has non-negativity and asymmetry, and JS divergence is introduced on the basis of the KL divergence because the asymmetry of the KL divergence can have some problems in training. The JS divergence is symmetrical, and the value is between 0 and 1. It should be appreciated that the input uses the KL divergence and JS divergences based on the KL divergence to measure the difference between the two probability vectors, but in other embodiments other metrics may be utilized.
In addition to the MLM head, the prior art holds that intermediate-layer representations can also provide useful clues for knowledge distillation. However, the inventors of the present invention have found through a number of experiments that knowledge distillation using intermediate layers often adversely affects the performance of the student model in small-sample scenarios. Based on this finding, the present invention abandons the common practice of distilling the teacher model's intermediate-layer information, and step S240 may include: the student model does not learn the intermediate-layer outputs of the prompt-tuned teacher model or of the out-of-domain-data fine-tuned teacher model during training.
In a small-sample scenario, as much information as possible needs to be mined from the models. Because the invention does not learn intermediate-layer information, more distillation pipelines need to be constructed to meet the supervision requirements of small-sample scenarios. In one embodiment, the present invention may also provide a distillation pipeline by introducing fake logits (pseudo logits). The premise for using fake logits in knowledge distillation is that a label-smoothed distribution can be regarded as a special case of soft-target-based knowledge distillation. To this end, the student model training method of the present invention further includes: the student model learns, during training, a pseudo-probability distribution constructed based on a label smoothing operation. The learning of the pseudo-probability distribution constructed based on the label smoothing operation by the student model during training may specifically include: converting the pseudo-probability distribution into a pseudo classification probability vector; and adjusting the network parameters of the student model with a fourth loss function, the fourth loss function characterizing the difference between the real label of the processed training sample and the pseudo classification probability vector.
Specifically, the present invention mimics the behavior of a teacher model by generating fake logits for the student model to learn from. In particular, a pseudo-probability distribution q̃(· | x_i) may be derived based on a label smoothing operation:

q̃(y | x_i) = M if y = y_i, and q̃(y | x_i) = (1 − M)/(N − 1) otherwise,

where M is a constant with a value close to 1. By setting a higher temperature (i.e., a larger temperature factor), q̃(· | x_i) can be transformed into a pseudo-logits vector z̃(x_i). Thus, the pseudo KD loss (the fourth loss function, denoted here L_PKD) can be defined as:

L_PKD = (1/|X|) Σ_{x_i∈X} CE( softmax(z̃(x_i)/α), softmax(z_{Θ_S}(x_i)/α) )

where CE(·, ·) is the cross-entropy loss between the two vectors defined from the logits.
Thus, the student model learning, during training, the classification probability vectors output by the prompt-tuned teacher model and the out-of-domain-data fine-tuned teacher model may include: training the student model using, as the total loss function, a weighted sum of the first loss function and of the loss functions characterizing the similarity between the classification probability vectors output respectively by the prompt-tuned teacher model and by the out-of-domain-data fine-tuned teacher model and the classification probability vector output by the student model.
In a preferred embodiment of the present invention, the loss functions characterizing the similarity between the classification probability vectors output by the student model and by the teacher models may include: the first loss function, computed from the student model's MLM task on the training dataset X; the second loss function, based on the student model learning the soft targets of the prompt-tuned teacher; and the third loss function, based on the student model learning the soft targets of the out-of-domain-data fine-tuned teacher (in this case the out-of-domain dataset X̃ is used and the in-domain expertise score needs to be taken into account). Further, the fourth loss function based on the fake logits may also be utilized.
Combining the above knowledge distillation objectives by weighted summation gives the following final loss function:

L = L_CLS + λ_1 · (L_KD + L_OKD) + λ_2 · L_PKD

where λ_1 and λ_2 are balancing hyper-parameters; the student model is obtained after distillation with this loss. In addition, it should be understood that although L_KD and L_OKD are assigned the same hyper-parameter here, in other embodiments the weights of the two may be different.
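Putting the pipelines together, the overall objective can be sketched as below; the grouping of λ_1 over the two teacher-distillation terms follows the reconstruction above and is an assumption, and the default weights are illustrative.

```python
# Sketch of the combined training objective.
def total_loss(l_cls, l_kd, l_okd, l_pkd, lambda1=1.0, lambda2=0.5):
    """Weighted sum of the classification, two teacher-distillation, and pseudo-KD losses."""
    return l_cls + lambda1 * (l_kd + l_okd) + lambda2 * l_pkd
```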
Figure 4 shows an overall schematic of the training of a student model based on two teacher models according to the invention.
As shown, the student model has a structure similar to that of the teacher models but has fewer Transformer encoder layers (illustrated as Trm layers). The two teacher models have the same network structure; the difference is that the prompt-tuned teacher model (referred to in the figure as the "in-domain teacher model") has its parameters Θ_T fine-tuned, relative to the original teacher model Θ_T', after training and adjustment on the small-sample N-way K-shot training dataset X, whereas the teacher model adjusted with out-of-domain data (referred to in the figure as the "out-of-domain teacher model") has its parameters Θ_OT fine-tuned, relative to the original teacher model Θ_T', after training and adjustment on the non-small-sample out-of-domain dataset X̃ (where |X̃| >> |X|).
When training the student model, the task-specific MLM loss, i.e., the first loss function L_CLS described above, needs to be constructed; it is trained on the labeled dataset X.
Knowledge distillation of the student model by the two teacher models may first be based on, for example, the similarity of the classification probability vectors output by the softmax layers. For the similarity between the prompt-tuned teacher model and the student model, the above classification-vector similarity can be derived by inference over the labeled in-domain data shown in the upper right of FIG. 4, corresponding to the second loss function L_KD described above. For the similarity between the out-of-domain-data fine-tuned teacher model and the student model, the above classification-vector similarity can be derived by inference over the labeled out-of-domain dataset X̃, corresponding to the third loss function L_OKD described above.
Further, the fake logits may be utilized to provide an additional distillation pipeline: the pseudo-probability distribution q̃ can be derived based on the label smoothing operation, and the pseudo KD loss is thereby constructed as the fourth loss function L_PKD described above.
The student model training method based on a pre-trained language model of the present invention has been described above in connection with figs. 2 and 4. A student model obtained by this method has a smaller number of parameters yet, through multi-pipeline knowledge distillation, has learned the knowledge contained in the large-scale pre-trained language model, so it is suitable for deployment in practical application scenarios.
To this end, the invention may also be implemented as a text classification system. Fig. 5 shows a schematic composition diagram of a text classification system according to an embodiment of the invention.
As shown, the system 500 may include an input acquisition unit 510, a classification decision unit 520, and an operation unit 530.
The input acquisition unit 510 is used to acquire text input from a user. The text input acquired here may be text typed by the user, for example a movie comment posted by the user, or text converted from other user input, for example the recognition result of the user's voice input.
The classification decision unit 520 may include a student model acquired via the method as described above for classifying based on the text input. The operation unit 530 may be used for performing an operation according to the classification result.
The text classification system may be applied in a variety of scenarios. For example, in an intelligent-robot interaction scenario, the content input by the user may be obtained from the text box and the user intention it contains determined in real time, so that the operation unit can give appropriate text feedback or perform other operations according to the identified intention. For another example, massive comments on a certain work can be read and classified so as to give the overall emotional tendency of users toward the work, which can serve as a basis for recommendation to other users. In addition, the text itself can be classified as to whether it is soft-advertising or unhealthy content, and deleted or reported in a subsequent operation.
For this, the operation performed by the operation unit 530 according to the classification result may include at least one of: feeding back based on the intention classification result of the input text; counting based on the emotion tendency classification result of the input text; and reporting based on the attribute classification of the input text.
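As a hypothetical sketch of how the classification result could drive such operations (the unit is described above; the callable names and dispatch table below are illustrative, not taken from the disclosure):

```python
# Hypothetical sketch of the classify-then-act flow of the text classification system.
def classify_and_act(text, student_classify, handlers):
    """student_classify: callable mapping text -> class label;
    handlers: dict mapping class label -> callable performing the operation."""
    label = student_classify(text)
    action = handlers.get(label)
    result = action(text) if action is not None else None
    return label, result


# Example wiring (illustrative):
# handlers = {"unhealthy": report_content, "positive": count_positive_review}
```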
FIG. 6 illustrates a schematic diagram of a computing device that may be used to implement the pre-training language model based student model training method described above, according to one embodiment of the invention.
Referring to fig. 6, a computing device 600 includes a memory 610 and a processor 620.
Processor 620 may be a multi-core processor or may include multiple processors. In some embodiments, processor 620 may include a general-purpose host processor and one or more special coprocessors, such as a Graphics Processor (GPU), digital Signal Processor (DSP), etc. In some embodiments, the processor 620 may be implemented using custom circuitry, for example, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC) or a field programmable gate array (Field Programmable Gate Arrays, FPGA).
Memory 610 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 620 or other modules of the computer. The persistent storage may be a read-write storage device, and may be a non-volatile storage device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is employed as the persistent storage. In other embodiments, the persistent storage may be a removable storage device (e.g., a diskette or optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory, and may store instructions and data required by some or all of the processors at runtime. Furthermore, memory 610 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), and magnetic disks and/or optical disks may also be employed. In some implementations, memory 610 may include readable and/or writable removable storage devices, such as compact discs (CDs), digital versatile discs (e.g., DVD-ROMs, dual-layer DVD-ROMs), read-only Blu-ray discs, super-density discs, flash memory cards (e.g., SD cards, mini SD cards, micro-SD cards, etc.), magnetic floppy disks, and the like. The computer-readable storage medium does not include carrier waves or transient electronic signals transmitted wirelessly or by wire.
The memory 610 has stored thereon executable code that, when processed by the processor 620, causes the processor 620 to perform the pre-trained language model based student model training method described above.
The student model training and text classification system based on the pre-training language model according to the present invention has been described in detail above with reference to the accompanying drawings.
The invention uses prompt-based learning to improve the small-sample learning performance of large-scale PLMs. In order to enable online application deployment of a PLM in resource-limited environments, the invention employs knowledge distillation to compress the large-scale PLM. Specifically, the invention proposes a small-sample knowledge distillation implementation for prompt-tuning a PLM, enabling the student model to learn simultaneously from a prompt-tuned teacher model and from a teacher model fine-tuned on out-of-domain data. The invention departs from the prior-art practice of learning knowledge from intermediate layers (shown to be detrimental to student model performance under small-sample learning) and instead uses pseudo-distribution probabilities to provide additional supervision.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for performing the steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (13)

1. A student model training method, comprising:
adding prompt information and a mask text placeholder to training samples to obtain processed training samples;
fine-tuning a pre-trained language model (PLM) using the processed training samples to obtain a prompt-tuned teacher model;
fine-tuning the PLM using labeled out-of-domain training data to obtain an out-of-domain fine-tuned teacher model; and
training a student model using the processed training samples, wherein during training the student model simultaneously learns the classification probability vectors output by the prompt-tuned teacher model and by the out-of-domain fine-tuned teacher model.
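A non-limiting sketch of the last step of claim 1, assuming a PyTorch-style implementation in which the student learns both teachers' classification probability vectors through KL divergence (the use of KL divergence and all tensor names are assumptions for illustration only):

```python
import torch
import torch.nn.functional as F

# Illustrative sketch: one way a student could learn the classification
# probability vectors output by two teacher models simultaneously.
def two_teacher_distillation_losses(student_logits, prompt_teacher_probs, ood_teacher_probs):
    log_p_student = F.log_softmax(student_logits, dim=-1)
    loss_prompt = F.kl_div(log_p_student, prompt_teacher_probs, reduction="batchmean")
    loss_ood = F.kl_div(log_p_student, ood_teacher_probs, reduction="batchmean")
    return loss_prompt, loss_ood

# Toy usage: batch of 4 samples, 2 classes, random probability vectors.
student_logits = torch.randn(4, 2)
prompt_probs = F.softmax(torch.randn(4, 2), dim=-1)
ood_probs = F.softmax(torch.randn(4, 2), dim=-1)
print(two_teacher_distillation_losses(student_logits, prompt_probs, ood_probs))
```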
2. The method of claim 1, further comprising:
training the student model using the labeled out-of-domain training data,
wherein the student model simultaneously learning, during training, the classification probability vectors output by the prompt-tuned teacher model and by the out-of-domain fine-tuned teacher model comprises:
limiting, during training, the degree to which the student model learns the classification probability vector output by the out-of-domain fine-tuned teacher model, based on the difference between the prediction results respectively output by the prompt-tuned teacher model and the out-of-domain fine-tuned teacher model for the out-of-domain training data.
3. The method of claim 1, wherein training the student model using the processed training samples comprises:
obtaining the prediction result of the student model for the mask text placeholder; and
adjusting network parameters of the student model with a first loss function, wherein the first loss function computes the loss according to whether the prediction result for the mask text placeholder matches the label.
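An illustrative sketch of one possible form of the first loss function of claim 3: cross-entropy between the student's mask-position prediction, restricted to the label words, and the true label (the label-word ids and tensor shapes are hypothetical):

```python
import torch
import torch.nn.functional as F

# Sketch only: the student's logits at the mask text placeholder are restricted
# to the label-word vocabulary ids and compared against the true label.
def first_loss(mask_logits, label_word_ids, labels):
    # mask_logits: [batch, vocab_size] logits at the [MASK] position
    # label_word_ids: vocabulary ids of each class's label word (assumed known)
    # labels: [batch] true class indices
    class_logits = mask_logits[:, label_word_ids]
    return F.cross_entropy(class_logits, labels)

# Toy usage: vocabulary of 10 tokens, 2 label words, batch of 3.
print(first_loss(torch.randn(3, 10), torch.tensor([4, 7]), torch.tensor([0, 1, 1])))
```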
4. The method of claim 3, wherein the student model simultaneously learning, during training, the classification probability vectors of the prompt-tuned teacher model and of the out-of-domain fine-tuned teacher model comprises:
adjusting network parameters of the student model with a second loss function, wherein the second loss function characterizes the similarity between the classification probability vector output by the student model and the classification probability vector output by the prompt-tuned teacher model; and
adjusting network parameters of the student model with a third loss function, wherein the third loss function characterizes the similarity between the classification probability vector output by the student model and the classification probability vector output by the out-of-domain fine-tuned teacher model.
5. The method of claim 4, wherein the third loss function characterizes an adjusted similarity between the classification probability vector output by the student model and the classification probability vector output by the out-of-domain fine-tuned teacher model, the adjustment coefficient being based on the difference between the predictions respectively output by the prompt-tuned teacher model and the out-of-domain fine-tuned teacher model for the out-of-domain training data.
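As a non-limiting sketch of the adjustment coefficient of claim 5 (the specific form used below, one minus the mean disagreement rate between the two teachers on the out-of-domain data, is an assumption):

```python
import torch

# Sketch only: down-weight the third loss when the two teachers disagree more
# often on the out-of-domain training data.
def ood_adjustment_coefficient(prompt_teacher_preds, ood_teacher_preds):
    disagreement = (prompt_teacher_preds != ood_teacher_preds).float().mean()
    return 1.0 - disagreement

coef = ood_adjustment_coefficient(torch.tensor([1, 0, 1, 1]), torch.tensor([1, 1, 1, 0]))
print(coef)  # 0.5 here, since the teachers disagree on half of the samples
# The adjusted third loss could then be computed as: coef * loss_ood
```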
6. The method of claim 1, wherein the student model simultaneously learning, during training, the classification probability vectors output by the prompt-tuned teacher model and by the out-of-domain fine-tuned teacher model comprises:
the student model not learning intermediate-layer outputs of the prompt-tuned teacher model or of the out-of-domain fine-tuned teacher model during training.
7. The method of claim 1, further comprising:
the student model learning, during training, a pseudo-probability distribution constructed based on a label smoothing operation.
8. The method of claim 7, wherein the student model learning, during training, the pseudo-probability distribution constructed based on the label smoothing operation comprises:
converting the pseudo-probability distribution into pseudo-classification probability vectors; and
adjusting network parameters of the student model with a fourth loss function, wherein the fourth loss function characterizes the difference between the real labels of the processed training samples and the pseudo-classification probability vectors.
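A possible reading of claims 7-8, sketched under stated assumptions (the smoothing value 0.1 and the soft-target cross-entropy form of the fourth loss are illustrative choices, not taken from the claims):

```python
import torch
import torch.nn.functional as F

# Sketch only: build a label-smoothed pseudo-probability distribution and use it
# as a soft target for the student; all numerical choices are illustrative.
def smoothed_pseudo_distribution(labels, num_classes, epsilon=0.1):
    one_hot = F.one_hot(labels, num_classes).float()
    return one_hot * (1.0 - epsilon) + epsilon / num_classes

def fourth_loss(student_logits, labels, epsilon=0.1):
    target = smoothed_pseudo_distribution(labels, student_logits.size(-1), epsilon)
    # soft-target cross-entropy, written out explicitly
    return -(target * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()

print(fourth_loss(torch.randn(3, 2), torch.tensor([0, 1, 1])))
```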
9. The method of claim 3, wherein the student model simultaneously learning, during training, the classification probability vectors output by the prompt-tuned teacher model and by the out-of-domain fine-tuned teacher model comprises:
training the student model using, as a total loss function, a weighted sum of the first loss function and the loss functions that respectively characterize the similarity between the classification probability vector output by the student model and the classification probability vectors output by the prompt-tuned teacher model and by the out-of-domain fine-tuned teacher model.
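Putting the pieces together, the total loss of claim 9 could be sketched as a weighted sum of the terms sketched above (the weights and the optional pseudo-distribution term are assumptions):

```python
# Sketch only: total loss as a weighted sum of the individual loss terms;
# alpha, beta, gamma are illustrative weights, and loss_pseudo is optional.
def total_training_loss(loss_first, loss_prompt, loss_ood,
                        loss_pseudo=0.0, alpha=1.0, beta=1.0, gamma=1.0):
    return loss_first + alpha * loss_prompt + beta * loss_ood + gamma * loss_pseudo

print(total_training_loss(0.69, 0.20, 0.30))
```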
10. A text classification system, comprising:
an input acquisition unit configured to acquire text input from a user;
a classification decision unit comprising a student model trained by the method of any one of claims 1-9, configured to classify the input text; and
an operation unit configured to perform an operation according to the classification result, wherein the operation includes at least one of:
feeding back based on the intention classification result of the input text;
aggregating statistics based on the sentiment tendency classification result of the input text; and
reporting based on the attribute classification result of the input text.
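A minimal, non-limiting sketch of the claim-10 flow, with a dummy classifier standing in for the distilled student model (all names and the action mapping are hypothetical):

```python
# Sketch only: acquire text, classify it with the (distilled) student model,
# then perform an operation according to the classification result.
def handle_user_text(text, classify, actions):
    label = classify(text)                      # classification decision unit
    actions.get(label, lambda t: None)(text)    # operation unit
    return label

dummy_classify = lambda t: "intent:refund" if "refund" in t else "sentiment:positive"
actions = {
    "intent:refund": lambda t: print("feeding back on intent:", t),
    "sentiment:positive": lambda t: print("aggregating sentiment statistics for:", t),
}
print(handle_user_text("I want a refund for my order", dummy_classify, actions))
```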
11. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 1-9.
12. A computer program product comprising executable code which, when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1-9.
13. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1-9.
CN202310240085.XA 2023-03-07 2023-03-07 Student model training method and text classification system Pending CN116432731A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310240085.XA CN116432731A (en) 2023-03-07 2023-03-07 Student model training method and text classification system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310240085.XA CN116432731A (en) 2023-03-07 2023-03-07 Student model training method and text classification system

Publications (1)

Publication Number Publication Date
CN116432731A true CN116432731A (en) 2023-07-14

Family

ID=87088080

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310240085.XA Pending CN116432731A (en) 2023-03-07 2023-03-07 Student model training method and text classification system

Country Status (1)

Country Link
CN (1) CN116432731A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118296500A (en) * 2024-06-06 2024-07-05 贵州大学 Small sample intelligent fault diagnosis method oriented to variable working condition noise label scene

Similar Documents

Publication Publication Date Title
CN109753566B (en) Model training method for cross-domain emotion analysis based on convolutional neural network
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN110188358B (en) Training method and device for natural language processing model
Zhang et al. More is better: Precise and detailed image captioning using online positive recall and missing concepts mining
Kant et al. Practical text classification with large pre-trained language models
CN111767368B (en) Question-answer knowledge graph construction method based on entity link and storage medium
CN115526332A (en) Student model training method and text classification system based on pre-training language model
CN110597961B (en) Text category labeling method and device, electronic equipment and storage medium
CN109766557B (en) Emotion analysis method and device, storage medium and terminal equipment
Viji et al. A hybrid approach of Weighted Fine-Tuned BERT extraction with deep Siamese Bi–LSTM model for semantic text similarity identification
CN112749274A (en) Chinese text classification method based on attention mechanism and interference word deletion
US20210295112A1 (en) Image recognition learning device, image recognition device, method and program
CN116228383A (en) Risk prediction method and device, storage medium and electronic equipment
CN114781375A (en) Military equipment relation extraction method based on BERT and attention mechanism
Khan et al. Single-stream multi-level alignment for vision-language pretraining
CN117197569A (en) Image auditing method, image auditing model training method, device and equipment
CN116432731A (en) Student model training method and text classification system
Meenakshi et al. Novel Shared Input Based LSTM for Semantic Similarity Prediction
Chakravarthy et al. HYBRID ARCHITECTURE FOR SENTIMENT ANALYSIS USING DEEP LEARNING.
CN117370736A (en) Fine granularity emotion recognition method, electronic equipment and storage medium
Vernikos et al. Domain adversarial fine-tuning as an effective regularizer
Li et al. BERTtoCNN: Similarity-preserving enhanced knowledge distillation for stance detection
CN116257616A (en) Entity relation extraction method and system for music field
Ermatita et al. Sentiment Analysis of COVID-19 using Multimodal Fusion Neural Networks.
US20230368003A1 (en) Adaptive sparse attention pattern

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination