CN115526332A - Student model training method and text classification system based on pre-training language model - Google Patents

Student model training method and text classification system based on pre-training language model

Info

Publication number: CN115526332A
Application number: CN202210987063.5A
Authority: CN (China)
Prior art keywords: model, training, student, teacher, student model
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 汪诚愚, 侯博宇, 邱明辉, 黄�俊
Current Assignee: Alibaba China Co Ltd
Original Assignee: Alibaba China Co Ltd
Application filed by Alibaba China Co Ltd
Priority to: CN202210987063.5A
Publication of: CN115526332A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification


Abstract

A student model training method and a text classification system based on a pre-trained language model are disclosed. The method comprises the following steps: constructing prompt training samples; tuning the pre-trained language model with the prompt training samples to obtain a prompt-tuned teacher model; and training a student model using the processed training samples, where during training the student model simultaneously learns the classification probability vectors output by the prompt-tuned teacher model and by the original teacher model. The invention has the student model learn from two teacher models at once, which alleviates overfitting of the student model in small-sample scenarios by adding a distillation path that learns from the original PLM teacher on unsupervised data. Furthermore, the intermediate-layer representations of the PLM are transferred through knowledge probes, and contrastive learning stabilizes the distillation so that higher-order dependencies can be learned from the teacher model's intermediate layers, improving the accuracy and efficiency of knowledge distillation.

Description

Student model training method and text classification system based on pre-training language model
Technical Field
The present disclosure relates to the field of deep learning, and in particular to a student model training method based on a pre-trained language model and to a text classification system.
Background
To achieve highly accurate predictions on a particular natural language processing task, it is often necessary to train a pre-trained language model (PLM) with a large amount of labeled data, which can make training prohibitively expensive. Few-shot (small-sample) learning techniques developed for this purpose allow the pre-trained language model to be trained with only a small number of training samples, achieving high prediction accuracy at low training cost.
However, in order to capture the knowledge in massive corpora, PLMs have huge parameter scales; the existing GPT-3 model, for example, has up to 175B parameters. This makes PLMs impractical in resource-constrained or latency-sensitive scenarios. There is therefore a need for an improved deep-learning language model suitable for resource-constrained or latency-sensitive scenarios.
Disclosure of Invention
The technical problem addressed by the present disclosure is to provide a student model training method and a text classification system based on a pre-trained language model. In the method, both the PLM tuned with small-sample prompts and the original PLM serve as teacher models for knowledge distillation into the student model to be trained, which avoids overfitting of the student model caused by scarce annotated data. Furthermore, the invention can use knowledge probes to export intermediate-layer information of the teacher models as a supervision source for the student model's intermediate layers, with contrastive learning as a training aid, thereby obtaining a compact student model that fully learns the PLM's knowledge from small-sample training.
According to a first aspect of the present disclosure, there is provided a student model training method based on a pre-trained language model, including: adding prompt information and a masked-text placeholder to a sample to obtain a processed training sample; tuning a pre-trained language model with the processed training samples to obtain a prompt-tuned teacher model, wherein the pre-trained language model that has not been prompt-tuned is the original teacher model; and training a student model using the processed training samples, wherein during training the student model learns the classification probability vectors output by the prompt-tuned teacher model and by the original teacher model.
Optionally, training a student model using the processed training samples comprises: obtaining the student model's prediction result for the masked-text placeholder; and adjusting the network parameters of the student model with a first loss function, the first loss function computing the loss according to whether the prediction result for the masked-text placeholder is the same as the label.
Optionally, the student model learning, during training, the classification probability vectors output by the prompt-tuned teacher model and the original teacher model includes: adjusting the network parameters of the student model with a second loss function, the second loss function characterizing the similarity between the classification probability vector output by the student model and the classification probability vector output by the prompt-tuned teacher model; and adjusting the network parameters of the student model with a third loss function, the third loss function characterizing the similarity between the classification probability vector output by the student model and the classification probability vector output by the original teacher model.
Optionally, the method further comprises: adding prompt information and a masked-text placeholder to a second sample to obtain a processed second training sample, the second training sample being an unlabeled sample, wherein the third loss function characterizes the difference between the classification probability vector output by the student model for the second training sample and the classification probability vector output by the original teacher model for the second training sample.
Optionally, the student model learning, during training, the classification probability vectors output by the prompt-tuned teacher model and the original teacher model includes: the student model learning, during training, the classification probability vectors output by the intermediate layers of the prompt-tuned teacher model and of the original teacher model.
Optionally, the student model learning, during training, the classification probability vectors output by the intermediate layers of the prompt-tuned teacher model and the original teacher model includes: adjusting the network parameters of the student model with a fourth loss function, the fourth loss function characterizing the difference in similarity between the classification probability vectors output by the student model's intermediate layers and those output by the prompt-tuned teacher model's intermediate layers when inputs with different true labels are given; and adjusting the network parameters of the student model with a fifth loss function, the fifth loss function characterizing the difference in similarity between the classification probability vectors output by the student model's intermediate layers and those output by the original teacher model's intermediate layers when inputs with different true labels are given.
Optionally, the classification probability vector output by each intermediate layer of the student model is multiplied by the classification probability vector output by each intermediate layer of the teacher model, and the average of the products is taken to characterize the similarity between the intermediate-layer outputs of the student and teacher models.
Optionally, training a student model using the processed training samples, with the student model learning during training the classification probability vectors output by the prompt-tuned teacher model and the original teacher model, comprises: training the student model using, as a total loss function, a weighted sum of the first loss function and the loss functions characterizing the similarity between the classification probability vectors output by the prompt-tuned teacher model and the original teacher model, respectively, and the classification probability vector output by the student model.
According to a second aspect of the present disclosure, there is provided a text classification system comprising: an input acquisition unit for acquiring an input from a user; an intention determination unit comprising a student model obtained by the method of the first aspect, the student model being used for classifying the intention of the user based on the input; and an operation unit for performing a subsequent operation according to the classified user intention.
According to a third aspect of the present disclosure, there is provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as described in the first aspect above.
According to a fourth aspect of the present disclosure, there is provided a non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform the method as described in the first aspect above.
Thus, a small-sample knowledge distillation scheme based on a prompt-tuned PLM is proposed. The scheme has the student model learn simultaneously from two teacher models, the prompt-tuned PLM and the original PLM, thereby alleviating the overfitting problem of the student model in small-sample scenarios by adding a distillation path that learns from the original PLM teacher on unsupervised data. Furthermore, the intermediate-layer representations of the PLM are transferred through knowledge probes, and contrastive learning stabilizes the distillation so that higher-order dependencies can be learned from the teacher model's intermediate-layer representations, improving the accuracy and efficiency with which the student model learns the knowledge.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
Fig. 1 shows an example of prompt and tag selection for comment sentiment analysis.
FIG. 2 shows a schematic flow diagram of a method for student model training based on a pre-trained language model according to one embodiment of the present invention.
Fig. 3 shows an example of soft and hard targets and a temperature factor adjusting soft target.
FIG. 4 shows an overall schematic of the present invention for training a student model based on two teacher models.
FIG. 5 shows a schematic diagram of the composition of a text classification system according to one embodiment of the invention.
Fig. 6 is a schematic structural diagram of a computing device that can be used to implement the above-described student model training method based on a pre-trained language model according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Large-scale pre-trained language models have achieved great success in all fields of NLP (natural language processing). Instead of training a language model from scratch, a general PLM is first obtained on a large general corpus through unsupervised proxy tasks; then, for a downstream task, the general PLM fine-tunes its parameters on the supervised data, so that the language knowledge already acquired from the general corpus can be exploited to realize the target classification function. This two-stage paradigm has been widely adopted in many practical language application scenarios.
Few-shot learning is a machine learning paradigm whose aim is to obtain a reasonably accurate model with only a small amount of optimization on a tiny set of training samples. Whether a system can learn and generalize from a small number of samples is a clear dividing line between artificial intelligence and human intelligence: humans can easily build knowledge of new things from just one or a few examples, whereas machine learning algorithms typically require thousands of labeled samples to guarantee their generalization ability. In fields such as machine vision and natural language processing, data labeling is expensive, and in new scenarios annotated data is very scarce, which limits the application of deep learning algorithms. Few-shot learning is therefore of great significance, and difficulty, in machine learning: inspired by the rapid learning ability of humans, the hope is that after learning a large amount of data for certain categories, a machine learning model can quickly learn a new category from only a small number of samples. This is the problem few-shot learning seeks to solve.
Given the special nature of small-sample tasks, the stage-two downstream fine-tuning can be recast as a cloze (fill-in-the-blank) problem, i.e., by using PET (Pattern-Exploiting Training).
Starting from BERT, fine-tuning pre-trained language models on downstream tasks, and more recently prompt-based fine-tuning, has become common practice in the NLP field. The GPT-3 model with 175B parameters brought a new way of using an LM for downstream tasks: by using natural-language prompts and task demonstrations as context, GPT-3 can handle many tasks with only a few samples and without updating the parameters of the underlying model. The enormous model size of GPT-3 is an important factor in its success, and the notions of prompts and demonstrations also give new insight into how to better use language models. Prompt information is a piece of text inserted into the input sample, so that the original prediction task can be converted into an MLM (masked language model) problem. For example, to classify the sentiment of the movie review "No reason to watch", the prompt "It is [MASK]." can be appended to the sentence, giving "No reason to watch. It is [MASK].". The prediction that the pre-trained model's MLM head outputs for the "[MASK]" token is then mapped to an actual class label: in this example, the "positive" category is assigned when "great" is predicted with higher probability, and the "negative" category is assigned when "terrible" is predicted with higher probability. Given the huge amount of language knowledge held by the PLM, the PLM is likely to judge with higher probability that the "[MASK]" token corresponds to "terrible" rather than "great".
Fig. 1 shows an example of prompt and label selection for comment sentiment analysis. As shown in Fig. 1, to determine the sentiment category (e.g., positive appreciation or negative criticism) of the sentence "Wonderful movie in every aspect.", the prompt "It is [MASK]" can be appended directly to the input text, and the [MASK] to be predicted by the model is taken to be either "good", corresponding to the positive label, or "terrible", corresponding to the negative label. In other words, the prompt template is constructed in the format "It is + sentiment-attribute word", and a verbalizer is used to select from the vocabulary (e.g., the full vocabulary) two words corresponding to positive and negative sentiment as label words, in this case "good" and "terrible". The original training sample "Wonderful movie in every aspect." is thus adapted into the processed training sample with the mask and prompt added, "Wonderful movie in every aspect. It is [MASK].", which is then fed to the PLM for training, e.g., loss computation and back-propagation-based adjustment according to whether the PLM predicts [MASK] as "good" or "terrible".
In the example shown in Fig. 1, the positive and negative label words may be selected manually: the illustrated "good" and "terrible" may be chosen, or other words in the vocabulary (e.g., the full vocabulary) that express sentiment attributes, such as "great" and "bad". Likewise, the prompt "It is" in the example of Fig. 1 may be designed by hand.
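For illustration only (this sketch is not part of the patent text), the following Python code shows how such a verbalizer could map masked-language-model logits at the [MASK] position to class probabilities; the model name bert-base-uncased, the Hugging Face transformers API usage and the label words "good"/"terrible" are assumptions made for the example.

```python
# Hypothetical sketch: score verbalizer words at the [MASK] position of a prompted input.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed PLM
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

verbalizer = {"positive": "good", "negative": "terrible"}        # label -> label word

def classify(sentence: str) -> dict:
    # Append the prompt template "It is [MASK]." to the raw input.
    text = f"{sentence} It is {tokenizer.mask_token}."
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]              # scores over the vocabulary
    # Keep only the scores of the verbalizer words and renormalize with softmax.
    ids = [tokenizer.convert_tokens_to_ids(w) for w in verbalizer.values()]
    probs = torch.softmax(logits[ids], dim=-1)
    return dict(zip(verbalizer.keys(), probs.tolist()))

print(classify("Wonderful movie in every aspect."))
```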
In addition, although examples of English text and prompts are shown in the figures, prompt, mask and label may also be used for sample construction and subsequent classification for Chinese.
Although an existing PLM can be prompt-fine-tuned in small-sample learning and thereby gain the classification ability for the target task, for example classifying comment sentiment as positive or negative, the resulting fine-tuned PLM still has a huge number of parameters, so it cannot be applied in resource-constrained or latency-sensitive scenarios.
In machine learning, knowledge distillation (KD) can be used to transfer knowledge from a large model to a small model. Although large models (e.g., very deep neural networks or ensembles of many models) have a higher knowledge capacity than small models, this capacity may not be fully exploited, and even a model that uses little of its knowledge capacity is computationally expensive to evaluate. Small models, on the other hand, are harder to train well than large models. Knowledge distillation transfers knowledge from a large model to a smaller model without losing validity, and because small models are cheaper to evaluate, they can be deployed on less powerful hardware (e.g., mobile devices).
However, existing knowledge distillation techniques are difficult to apply in small-sample learning scenarios, because the sparse annotated data may cause the student model to overfit, and existing knowledge distillation methods cannot handle training in which the target model is tuned with prompts.
To this end, the present invention proposes a small-sample knowledge distillation scheme based on a prompt-tuned PLM. The scheme has the student model learn simultaneously from two teacher models, the prompt-tuned PLM and the original PLM, thereby alleviating the overfitting problem of the student model in small-sample scenarios by adding a distillation path that learns from the original PLM teacher on unsupervised data. Furthermore, the intermediate-layer representations of the PLM are transferred through knowledge probes, and contrastive learning stabilizes the distillation so that higher-order dependencies can be learned from the teacher model's intermediate-layer representations, improving the accuracy and efficiency with which the student model learns the knowledge.
In one embodiment, the invention can be implemented as a student model training method based on a pre-training language model. FIG. 2 shows a schematic flow diagram of a method for student model training based on a pre-trained language model according to one embodiment of the present invention.
In step S210, prompt information and masked-text placeholders are added to the samples to obtain processed training samples.
In the student model training of the present invention, a small-sample training data set is used, e.g., an N-way-K-shot training data set X. Here, N denotes the number of classes the model can output and K denotes the number of samples per class, so the N-way-K-shot training data set X contains N × K samples; in small-sample training the value of N × K is very small.
The samples may be, for example, N × K sentences with sentiment preferences. The input samples can be constructed as shown in Fig. 1 by adding the prompt content "It is" and the masked-text placeholder "[MASK]", and a corresponding label (i.e., the true label) is generated for each sample according to the sentiment actually expressed by the sentence.
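As a minimal sketch of this sample-construction step (not taken from the patent; the template string and data structures are assumptions), the prompt and mask placeholder could be appended to each raw sentence of the N-way-K-shot set while its true label is kept:

```python
# Hypothetical sketch: wrap N-way-K-shot raw samples with a prompt and a mask placeholder.
from dataclasses import dataclass

MASK = "[MASK]"
PROMPT_TEMPLATE = "{text} It is " + MASK + "."   # assumed template from the example above

@dataclass
class PromptedSample:
    prompted_text: str   # sentence with prompt and mask placeholder appended
    label: str           # true label derived from the sentence's sentiment

def build_prompted_dataset(raw_samples):
    """raw_samples: iterable of (sentence, label) pairs, N*K of them in total."""
    return [PromptedSample(PROMPT_TEMPLATE.format(text=s), y) for s, y in raw_samples]

few_shot = [("Wonderful movie in every aspect.", "positive"),
            ("No reason to watch.", "negative")]          # toy 2-way-1-shot set
for sample in build_prompted_dataset(few_shot):
    print(sample.prompted_text, "->", sample.label)
```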
In step S220, the pre-trained language model is tuned with the processed training samples to obtain a prompt-tuned teacher model; the pre-trained language model that has not been prompt-tuned is the original teacher model. Then, in step S230, a student model is trained using the processed training samples, and during training the student model learns the classification probability vectors output by the prompt-tuned teacher model and by the original teacher model.
In the present invention, in addition to constructing the N-way-K-shot training data set X, a large-scale PLM (the original teacher model) and another, smaller PLM (the original student model) need to be given. The student model may have a structure similar to that of the teacher model but with fewer sub-structures. For example, the teacher model may have N_T + 1 Transformer structures and the student model N_S + 1 Transformer structures, where N_S + 1 < N_T + 1 and preferably N_S + 1 << N_T + 1 (the Transformer itself is a deep learning model that uses the self-attention mechanism to speed up model training, and pre-trained language models in the prior art contain a plurality of Transformer structures). The training goal is to compress the performance that the teacher model attains on the small-sample data through prompt fine-tuning into the student model by means of knowledge distillation.
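Purely to illustrate the size relationship N_S + 1 << N_T + 1 described above, the following sketch instantiates a deep teacher and a shallow student with the Hugging Face BertConfig; the specific layer counts are assumptions, not values given by the patent.

```python
# Hypothetical sketch: a teacher with many Transformer layers and a student with far fewer.
from transformers import BertConfig, BertForMaskedLM

teacher_cfg = BertConfig(num_hidden_layers=12)   # assumed N_T + 1 = 12
student_cfg = BertConfig(num_hidden_layers=4)    # assumed N_S + 1 = 4, much smaller

teacher = BertForMaskedLM(teacher_cfg)   # in practice initialized from a pre-trained PLM
student = BertForMaskedLM(student_cfg)   # smaller model to be distilled

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"teacher params: {count(teacher):,}  student params: {count(student):,}")
```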
The content and training objectives given above may be described with mathematical symbols. Specifically, the training data set is X = {(x_i, y_i)}, where y_i is the classification label of the input text x_i and belongs to the label set Y; during parameter tuning, a further small-sample validation set of the same size as X is used for hyper-parameter selection. Θ_T denotes the prompt-tuned (also referred to as prompt-fine-tuned) PLM; the model Θ_T is obtained by prompt tuning from its pre-trained initialization Θ_T'. In other words, Θ_T' is used herein to denote the parameters of the original teacher model. The object of the present invention is to obtain a much smaller PLM, represented by parameters Θ_S, while making the behaviour of Θ_S as close as possible to that of Θ_T.
To achieve this goal, after the small-sample training data set X has been constructed in step S210, the prompt-tuned PLM Θ_T must first be obtained from the large-scale original PLM Θ_T' in step S220. A masked language model (MLM) task may be used here to obtain the prompt-tuned PLM Θ_T. Specifically, a training sample fed into the original PLM may be, for example, "Wonderful movie in every aspect. It is [MASK].", and the model is required to output the label classification corresponding to [MASK]. For example, when the number of classes is N = 2 (so that the words corresponding to [MASK] include only one positive word, e.g., "good", and one negative word, e.g., "terrible"), the model outputs the probabilities that the word corresponding to [MASK] is "good" or "terrible"; if the model judges the probability of "good" to be greater than that of "terrible", the classification result is "good".
In MLM, the prediction target vector is a one-hot vector over the words in the vocabulary: in the set of predictable words, only the coefficient corresponding to the class label word ("good" in this example) is 1, and the coefficients for all other words are 0. Thus, when the one-hot vector is used to construct the loss function, no loss is incurred only if the model predicts the label word itself, e.g., outputs "good", whereas the same loss is incurred whenever the model outputs any single word other than "good".
Based on the computed loss, the parameters Θ_T' of the original PLM can be adjusted with the back-propagation algorithm, and after the N-way-K-shot training data set X has been fed through, the prompt-tuned PLM Θ_T is obtained.
The student model may be trained with the small-sample training data set X in a manner similar to the training of the teacher model. To this end, training the student model using the processed training samples comprises: obtaining the student model's prediction result for the masked-text placeholder; and adjusting the network parameters of the student model with a first loss function, the first loss function computing the loss according to whether the prediction result for the masked-text placeholder is the same as the masked word. In other words, the first loss function may likewise be constructed from one-hot vectors using the masked language model (MLM) task.
In one embodiment, the PET method may be followed. Let l(y) be the label word of category y, and let s(l(y) | x_i; Θ_T) be the score predicted for l(y) at the masked language-model position given the input x_i and the PLM Θ_T. Based on Θ_T, the probability of assigning x_i to class y is defined as:

$$ p(y \mid x_i; \Theta_T) = \frac{\exp\big(s(l(y) \mid x_i; \Theta_T)\big)}{\sum_{y' \in Y} \exp\big(s(l(y') \mid x_i; \Theta_T)\big)} $$

Further, p(x_i; Θ_T) denotes the probability vector over all N classes y ∈ Y, and $\mathbf{y}_i$ denotes the N-dimensional one-hot ground-truth vector corresponding to x_i. The classification loss of the student model (corresponding to the first loss function), with the student parameters Θ_S in place of Θ_T, then follows directly as:

$$ \mathcal{L}_{MLM}(X) = \frac{1}{|X|} \sum_{(x_i, y_i) \in X} CE\big(p(x_i; \Theta_S),\, \mathbf{y}_i\big) $$
where CE(·, ·) denotes the cross-entropy loss between two vectors.
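The first loss function can be written compactly in code. The sketch below is an assumption-laden illustration rather than the patent's implementation: it turns label-word scores at the mask position into an N-class probability vector and takes the cross-entropy against the ground-truth class.

```python
# Hypothetical sketch: PET-style classification loss from mask-position logits.
import torch
import torch.nn.functional as F

def class_probs(mask_logits: torch.Tensor, label_word_ids: list) -> torch.Tensor:
    """mask_logits: (batch, vocab) scores at the [MASK] position.
    label_word_ids: one vocabulary id l(y) per class y.
    Returns (batch, N) probability vectors p(x; Theta)."""
    scores = mask_logits[:, label_word_ids]          # keep only label-word scores
    return F.softmax(scores, dim=-1)

def mlm_classification_loss(mask_logits, label_word_ids, true_labels):
    """Cross-entropy between the student's class probabilities and one-hot targets."""
    probs = class_probs(mask_logits, label_word_ids)
    return F.nll_loss(torch.log(probs + 1e-12), true_labels)

# toy check: 2 samples, vocab of 10, classes mapped to vocab ids 3 ("good") and 7 ("terrible")
logits = torch.randn(2, 10)
print(mlm_classification_loss(logits, [3, 7], torch.tensor([0, 1])))
```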
In the present invention, both the original PLM Θ_T' and the prompt-tuned PLM Θ_T are used as teacher models for knowledge distillation; that is, while the student model is trained with the small-sample training data set X, it learns the classification probability vectors output by the prompt-tuned teacher model and by the original teacher model. In other words, knowledge distillation is realized by having the student model learn the classification probability vectors output by the teacher models.
A classification model ends with a softmax layer whose output values correspond to the probabilities of the respective classes. In knowledge distillation, a teacher model with strong generalization ability is available, so the student model can directly learn the teacher's generalization behaviour. A very direct and effective way of transferring generalization ability is to use the class probabilities output by the softmax layer (i.e., the classification probability vector) as the "soft target".
The conventional way of training a neural network is to define a loss function whose goal is to bring the predicted value as close as possible to the true value (the hard target), i.e., to make the sum of the network's loss values as small as possible; this training process amounts to maximum-likelihood estimation with respect to the ground truth. In knowledge distillation, by contrast, the class probabilities of the teacher model are used as soft labels in the training process of the student model.
Fig. 3 shows an example of soft and hard targets and of a temperature factor adjusting the soft targets. Assume Fig. 3 corresponds to the output of a 10-class model. The left side of Fig. 3 corresponds to the hard target, namely the one-hot label annotated for the original data set: the positive label of class 2 is 1 and the negative labels of the other 9 classes are 0. The middle of Fig. 3 corresponds to the soft target, e.g., the class probabilities output by the teacher model's softmax layer: every class is assigned a probability, and the positive label of class 2 has the highest probability (close to 0.6), but the negative labels of the other 9 classes also carry some probability, lower than that of the positive label, e.g., the probability of class 3 is close to 0.2.
In the softmax output, the negative labels (in addition to the positive one) carry a great deal of information about how the teacher model generalizes during inference; for example, some negative labels receive much higher probability than others (e.g., class 3 shown in the middle of Fig. 3), which indicates that at inference time the teacher considers the sample to bear some similarity to that negative class. The knowledge distillation training mode therefore lets each sample bring the student model more information than conventional training does. In other words, when training with soft targets, the student model can quickly learn the teacher model's reasoning process.
To this end, the student model learning, during training, the classification probability vectors output by the prompt-tuned teacher model and the original teacher model includes: adjusting the network parameters of the student model with a second loss function, the second loss function characterizing the difference between the classification probability vector output by the student model and that output by the prompt-tuned teacher model; and adjusting the network parameters of the student model with a third loss function, the third loss function characterizing the difference between the classification probability vector output by the student model and that output by the original teacher model.
Here, the classification probability vectors referred to by the second and third loss functions may be those output by the softmax layer of each model, i.e., the classification probability vectors output by the last layer of each model.
Further, the "soft target" provided by the prompt-tuned teacher model is still computed on the small-sample training data set X. Specifically, the same sample x_i from the small-sample training data set X may be fed to both the prompt-tuned teacher model and the student model, the classification probability vectors of the two models are computed, and the second loss function is constructed from the cross entropy of the two classification probability vectors, with reducing the cross entropy between the two vectors as the adjustment direction.
In one embodiment, the labeled knowledge distillation loss (corresponding to the second loss function) can be defined as follows:

$$ \mathcal{L}_{KD}(X) = \frac{1}{|X|} \sum_{x_i \in X} CE\big(p_\alpha(x_i; \Theta_T),\, p_\alpha(x_i; \Theta_S)\big) $$

where p_α(·) denotes the classification probability vector computed with the label-word scores divided by the temperature α.
Here, α > 0 is a temperature factor. As mentioned above, the class probabilities output by the teacher model's softmax layer can be used as soft targets to help the student model quickly learn the teacher's reasoning process. However, because the softmax function normalizes the logits into probabilities across classes and amplifies the differences between the logit values, when the entropy of the distribution output by softmax is relatively small, the values for the negative labels are all close to 0 and contribute very little to the loss function. A temperature factor α is therefore needed to amplify the information carried by the negative labels. Throughout knowledge distillation the temperature factor may be raised, and in the test phase the "low temperature" is restored, which is also the origin of the term "distillation".
Returning to Fig. 3: the middle of Fig. 3 shows the class probabilities output by the teacher model's softmax layer, corresponding to temperature factor α = 1. During distillation the temperature factor can be increased, so that the probabilities of the other, negative labels rise. The right side of Fig. 3 shows the soft targets at an elevated temperature factor α (greater than 1): the positive label still has the highest probability, but the probabilities of the negative labels have increased.
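To make the role of the temperature factor α concrete, here is a small sketch (assumptions only: PyTorch, and class logits already gathered at the mask position over the label words) that softens teacher and student distributions with α and measures their cross-entropy, in the spirit of the second loss function.

```python
# Hypothetical sketch: temperature-softened soft targets for labeled knowledge distillation.
import torch
import torch.nn.functional as F

def soft_targets(class_logits: torch.Tensor, alpha: float) -> torch.Tensor:
    """Divide logits by the temperature alpha (>1 amplifies negative-label information)."""
    return F.softmax(class_logits / alpha, dim=-1)

def labeled_kd_loss(student_logits, teacher_logits, alpha: float = 2.0) -> torch.Tensor:
    """Cross-entropy between the prompt-tuned teacher's soft targets and the student's."""
    p_teacher = soft_targets(teacher_logits, alpha)              # soft target
    log_p_student = F.log_softmax(student_logits / alpha, dim=-1)
    return -(p_teacher * log_p_student).sum(dim=-1).mean()

student_logits = torch.randn(4, 2)   # 4 prompted samples, N = 2 classes
teacher_logits = torch.randn(4, 2)
print(labeled_kd_loss(student_logits, teacher_logits, alpha=2.0))
```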
Since the present invention uses only a small labeled data set, it faces the challenges of scarce training data and rather limited supervision signals. It is therefore contemplated to also learn directly from the pre-trained teacher model without any fine-tuning, and to use unlabeled data to alleviate the student-model overfitting that small-sample training would normally cause. When unlabeled data is used, the similarity between the predictions of the student PLM and the teacher PLM can serve as the supervision signal, so that the student PLM can learn the language knowledge contained in the teacher PLM.
The second loss function above corresponds to the labeled knowledge distillation loss: both the student model and the prompt-tuned teacher model operate on the small-sample data, i.e., the model input for the second loss function is the N-way-K-shot training data set X described above. The third loss function corresponds to an unlabeled knowledge distillation loss: the student model and the original teacher model operate on a larger body of unlabeled data (e.g., the unlabeled data set X̃ described below, with |X̃| >> |X|).
To this end, the training method of the present invention may include adding prompt information and a masked-text placeholder to a second sample to obtain a processed second training sample, the second training sample being an unlabeled sample. That is, since the size of X is very small (i.e., N × K), it is assumed that there is a larger unlabeled data set X̃, with |X̃| >> |X|, serving as an auxiliary data set for knowledge distillation (i.e., corresponding to the second training samples). The third loss function therefore characterizes the difference between the classification probability vector output by the student model for a second training sample and the classification probability vector output by the original teacher model for that second training sample.
Specifically, the parameters of the original teacher model are Θ_T' as described above. Thus, in one embodiment, the unlabeled knowledge distillation loss (corresponding to the third loss function), defined with Θ_T' and X̃, is:

$$ \tilde{\mathcal{L}}_{KD}(\tilde{X}) = \frac{1}{|\tilde{X}|} \sum_{\tilde{x}_j \in \tilde{X}} CE\big(p_\alpha(\tilde{x}_j; \Theta_{T'}),\, p_\alpha(\tilde{x}_j; \Theta_S)\big) $$
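The unlabeled branch reuses the same soft-target computation, only with the frozen original teacher and samples drawn from the auxiliary set X̃; the sketch below is a hypothetical illustration of that third-loss computation, operating directly on class logits.

```python
# Hypothetical sketch: third loss, distilling from the frozen original teacher on unlabeled data.
import torch
import torch.nn.functional as F

def unlabeled_kd_loss(student_logits: torch.Tensor,
                      original_teacher_logits: torch.Tensor,
                      alpha: float = 2.0) -> torch.Tensor:
    """Both arguments are (batch, N) class logits at the mask position for prompted
    samples drawn from the unlabeled auxiliary set; no ground-truth labels are used."""
    p_teacher = F.softmax(original_teacher_logits.detach() / alpha, dim=-1)  # frozen teacher
    log_p_student = F.log_softmax(student_logits / alpha, dim=-1)
    return -(p_teacher * log_p_student).sum(dim=-1).mean()

# toy unlabeled batch of 8 prompted samples, N = 2 classes
print(unlabeled_kd_loss(torch.randn(8, 2), torch.randn(8, 2)))
```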
in small sample scenarios, it is desirable to mine as much information in the model as possible. In addition to MLM heads, the mid-layer representation may also provide useful clues for knowledge distillation. To this end, learning, by the student model, the class probability vectors output by the cue-adjusted teacher model and the original teacher model during training may include: and in the training process, the student model learns the classification probability vectors output by the intermediate layers of the teacher model adjusted by the prompt and the original teacher model.
Because of the difference in the capacity of the models, directly approximating the differences in the middle level representation for teachers and students does not yield better results. To alleviate the model capacity gap problem, the present invention migrates the middle tier knowledge by finding the relevance of the middle tier representation. Further, contrast learning can be utilized to keep correlations under different labels away from each other. Contrast learning is one of unsupervised learning, emphasizes learning common characteristics among similar examples, and distinguishes differences among non-similar examples. The aim of the comparison learning is to learn an encoder which performs similar encoding on the same type of data and makes the encoding results of different types of data different as much as possible. In the invention, the comparative learning can be realized by selecting samples of different real labels.
At this time, in the training process, the learning, by the student model, of the classification probability vectors output by the intermediate layers of the teacher model adjusted by the prompt and the original teacher model includes: adjusting the network parameters of the student model by a fourth loss function, wherein the fourth loss function represents the difference of the correlation between the classification probability vector output by the student model middle layer and the classification probability vector output by the teacher model middle layer with the prompt adjustment when different real labels are input; and adjusting the network parameters of the student model by a fifth loss function, wherein the third loss function represents the difference of the correlation between the classification probability vector output by the student model intermediate layer and the classification probability vector output by the original teacher model intermediate layer when different real labels are input.
In one embodiment, the classification probability vectors output by each intermediate layer of the student model are respectively multiplied by the classification probability vectors output by each intermediate layer of the teacher model, and the average value of the products is obtained to represent the correlation of the outputs of the intermediate layers of the student model and the teacher model.
The present invention uses knowledge probes to migrate knowledge in the middle layer. Knowledge probes, i.e. a series of pseudo-MLM heads, relate the information of the middle layer to the actual label, thereby assisting the migration of the middle layer information. After the teacher model and the student model are subjected to hint refinement and distillation, parameters of the teacher model and the student model can be frozen to train each knowledge probe through a real label, and finally the information of the included middle layer is exported and used as a middle layer supervision source of the student model.
Specifically, upon freezing Θ T And Θ S Then, is theta T And Θ S The coding layer (except the last layer) of each transducer in the set trains an MLM-based probe classifier on the true tag words. In general, there is N T The probes are used for teacher model, N S One probe for student model (in this case, teacher model has N T+1 A Transformer structure, the student model having N S+1 One Transformer structure, the probe classifier can be trained for the coding layer of each Transformer except the last Transformer).
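A knowledge probe of this kind can be pictured as a small classifier over a frozen intermediate layer's representation at the mask position, trained only with the true labels. The sketch below is a guess at a minimal form (a single linear head per probed layer is an assumption), not the patent's actual probe architecture.

```python
# Hypothetical sketch: an MLM-style knowledge probe on one frozen intermediate layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeProbe(nn.Module):
    """Maps a frozen layer's hidden state at the [MASK] position to N label-word scores."""
    def __init__(self, hidden_size: int, num_classes: int):
        super().__init__()
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, mask_hidden: torch.Tensor) -> torch.Tensor:
        return self.head(mask_hidden)            # (batch, N) scores, one per label word

def train_probe(probe, frozen_mask_hidden, true_labels, epochs: int = 50, lr: float = 1e-3):
    """frozen_mask_hidden: (batch, hidden) detached activations of one Transformer layer."""
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):                      # model parameters stay frozen; only the probe learns
        loss = F.cross_entropy(probe(frozen_mask_hidden), true_labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe

probe = train_probe(KnowledgeProbe(hidden_size=16, num_classes=2),
                    torch.randn(8, 16), torch.randint(0, 2, (8,)))
```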
Formally, let p^(j)(x_i; Θ_T) denote the probability vector that assigns x_i to the N classes y ∈ Y based on Θ_T and the j-th probe (j = 1, ..., N_T). Similarly, the probe outputs of the student model are denoted p^(k)(x_i; Θ_S), where k = 1, ..., N_S. The exponential match between the two models can be defined as:

$$ g(x, x'; \Theta_T, \Theta_S) = \exp\Big( \frac{1}{N_T N_S} \sum_{j=1}^{N_T} \sum_{k=1}^{N_S} p^{(j)}(x; \Theta_T) \cdot p^{(k)}(x'; \Theta_S) \Big) $$

To stabilize performance on small-scale data, contrastive learning may be employed here as an auxiliary training objective. For a given instance, a set of instances whose labels differ from its own may be randomly selected as negative examples, and the outputs of all knowledge probes may be regarded as different augmentations of the data of these instances, which yields a batch of negatives N(x_i). The present invention thus proposes a labeled contrastive prompt distillation (CPD) loss (corresponding to the fourth loss function) to transfer intermediate-layer knowledge across the models:

$$ \mathcal{L}_{CPD}(X) = -\frac{1}{|X|} \sum_{x_i \in X} \log \frac{g(x_i, x_i; \Theta_T, \Theta_S)}{g(x_i, x_i; \Theta_T, \Theta_S) + \sum_{x_n \in \mathcal{N}(x_i)} g(x_n, x_i; \Theta_T, \Theta_S)} $$

Similarly, for the original teacher model Θ_T' and the unlabeled data set X̃, an unlabeled contrastive prompt distillation loss (corresponding to the fifth loss function) can be defined:

$$ \tilde{\mathcal{L}}_{CPD}(\tilde{X}) = -\frac{1}{|\tilde{X}|} \sum_{\tilde{x}_i \in \tilde{X}} \log \frac{g(\tilde{x}_i, \tilde{x}_i; \Theta_{T'}, \Theta_S)}{g(\tilde{x}_i, \tilde{x}_i; \Theta_{T'}, \Theta_S) + \sum_{\tilde{x}_n \in \mathcal{N}(\tilde{x}_i)} g(\tilde{x}_n, \tilde{x}_i; \Theta_{T'}, \Theta_S)} $$

The difference from the fourth loss function is that here the negative examples N(x̃_i) cannot be drawn directly on the basis of the true labels (since the true labels are not available). As a simple heuristic rule, the same prompt and label words can be used to infer labels for x̃_i directly on Θ_T', which can be regarded as zero-shot learning.
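Because the original equations are only partially recoverable here, the following is no more than a plausible sketch of a contrastive prompt distillation term built from the layer-match idea above: the probe probability vectors of a sample's own teacher/student pair act as the positive, and samples carrying a different true label supply the negatives. It should not be read as the patent's exact formula.

```python
# Hypothetical sketch: contrastive prompt distillation over probe probability vectors.
import torch

def layer_match(p_teacher: torch.Tensor, p_student: torch.Tensor) -> torch.Tensor:
    """p_teacher: (Nt, N) probe probability vectors of the teacher's intermediate layers,
    p_student: (Ns, N) those of the student; returns the mean of all pairwise dot products."""
    return (p_teacher @ p_student.T).mean()

def cpd_loss(teacher_probes, student_probes, labels) -> torch.Tensor:
    """InfoNCE-style objective: pull each sample's own teacher/student match up and push
    down matches against samples carrying a different true label (the negatives)."""
    batch = len(labels)
    losses = []
    for i in range(batch):
        pos = layer_match(teacher_probes[i], student_probes[i]).exp()
        negs = [layer_match(teacher_probes[j], student_probes[i]).exp()
                for j in range(batch) if labels[j] != labels[i]]
        denom = pos + torch.stack(negs).sum() if negs else pos
        losses.append(-(pos / denom).log())
    return torch.stack(losses).mean()

# toy batch: 4 samples, teacher has 3 probed layers, student 2, N = 2 classes
t = torch.softmax(torch.randn(4, 3, 2), dim=-1)
s = torch.softmax(torch.randn(4, 2, 2), dim=-1)
print(cpd_loss(t, s, torch.tensor([0, 1, 0, 1])))
```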
Thus, training the student model using the processed training samples, with the student model learning during training the classification probability vectors output by the prompt-tuned teacher model and the original teacher model, may include: training the student model using, as the total loss function, a weighted sum of the first loss function and the loss functions characterizing the similarity between the classification probability vectors output by the prompt-tuned teacher model and the original teacher model, respectively, and the classification probability vector output by the student model.
In a preferred embodiment of the present invention, the loss functions characterizing the similarity between the classification probability vectors output by the student model and the teacher models may include the second and third loss functions, which are based directly on the similarity of the final classification probability vectors output by softmax, and may also include the fourth and fifth loss functions, which describe the similarity of the classification probability vectors output by the intermediate layers. Combining the above knowledge distillation targets by weighted summation gives the following total loss function:

$$ \mathcal{L} = \mathcal{L}_{MLM}(X) + \lambda_1 \big( \mathcal{L}_{KD}(X) + \tilde{\mathcal{L}}_{KD}(\tilde{X}) \big) + \lambda_2 \big( \mathcal{L}_{CPD}(X) + \tilde{\mathcal{L}}_{CPD}(\tilde{X}) \big) $$

where λ_1 and λ_2 are balancing hyper-parameters; minimizing this loss yields the distilled student model.
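Tying the pieces together, a training step could weight the terms as in the total loss above; the sketch assumes the five loss values have already been computed (for instance with the hypothetical helpers from the earlier examples), and the grouping under λ_1 and λ_2 as well as their values are illustrative assumptions.

```python
# Hypothetical sketch: weighted total objective. The grouping of terms under
# lambda_1 and lambda_2 is an assumption made for illustration only.
import torch

def total_loss(l_mlm, l_kd, l_kd_unlabeled, l_cpd, l_cpd_unlabeled,
               lambda_1: float = 1.0, lambda_2: float = 0.5) -> torch.Tensor:
    """Each argument is one of the five loss terms described above, already computed
    on the labeled batch X or on the unlabeled auxiliary batch."""
    return (l_mlm
            + lambda_1 * (l_kd + l_kd_unlabeled)      # output-level distillation terms
            + lambda_2 * (l_cpd + l_cpd_unlabeled))   # intermediate-layer (CPD) terms

# toy values just to show the call
terms = [torch.tensor(v) for v in (0.9, 0.4, 0.5, 0.3, 0.35)]
print(total_loss(*terms))
```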
Fig. 4 shows an overall schematic of training a student model based on two teacher models according to the present invention. The left side of Fig. 4 shows the overall framework of the prompt distiller of the invention, and the right side shows an example of prompt-augmented data (i.e., data that can be augmented with an unlabeled data set).
As shown on the left side of the figure, the student model has a structure similar to that of the teacher models but with fewer Transformer encoder layers (shown as Trm Layer). The two teacher models have the same network structure; the parameters Θ_T of the prompt-tuned teacher model are obtained by fine-tuning the original teacher model Θ_T' on the small-sample N-way-K-shot training data set X.
In training the student model, a task-specific MLM loss must be constructed, i.e., the first loss function L_MLM(X); this is based on the labeled-data training inference in the upper-right part of Fig. 4.
Knowledge distillation from the two teacher models into the student model can be based, for example, on the similarity of the classification probability vectors output by the softmax layers. The similarity between the prompt-tuned teacher model and the student model is obtained from the labeled-data training inference in the upper-right part of Fig. 4, corresponding to the second loss function L_KD(X). The similarity between the original teacher model and the student model is obtained from the unlabeled-data training inference in the lower-right part of Fig. 4, corresponding to the unlabeled knowledge distillation loss on X̃ (the third loss function) described above.
Further, knowledge can be acquired from the intermediate layers. A knowledge probe (Probe) can be trained for each intermediate Trm Layer of the three models, and intermediate-layer knowledge distillation is realized on the basis of the overall similarity of each layer's classification probability vectors, with the values under positive and negative labels pushed apart. As before, the similarity between the prompt-tuned teacher model and the student model is computed on the labeled training data set X, giving the fourth loss function L_CPD(X), while the similarity between the original teacher model and the student model is again derived from the unlabeled-data training inference, corresponding to the unlabeled contrastive prompt distillation loss on X̃ (the fifth loss function) described above.
The student model training method based on a pre-trained language model according to the present invention has been described above with reference to Figs. 2 and 4. The student model obtained by this method has far fewer parameters, yet has learned, through multi-path knowledge distillation, the knowledge implicit in the large-scale pre-trained language model, making it suitable for deployment in practical application scenarios.
To this end, the present invention may also be embodied as a text classification system. Fig. 5 shows a schematic diagram of the composition of a text classification system according to one embodiment of the invention.
As shown, the system 500 may include an input acquisition unit 510, a classification determination unit 520, and an operation unit 530.
The input acquisition unit 510 is used to acquire text input from a user. The text input acquired here may be text input by the user himself, for example, a movie comment issued by the user, or text converted by the user input, for example, a recognition result of the user's voice input.
The classification determination unit 520 may include a student model obtained via the method described above, the student model being used for classification based on the text input. The operation unit 530 is configured to perform an operation according to the classification result.
The text classification system can be applied in a variety of scenarios. For example, in an intelligent-robot interaction scenario, the content entered by a user can be taken from a text box and the user intention contained in it judged in real time, after which the operation unit gives appropriate text feedback or performs other operations according to the identified intention. As another example, a large number of comments on a given work can be read and classified, so that the overall emotional tendency of users' evaluations of the work can be derived and used as a basis for recommendations to other users. In addition, the text itself can be classified as inappropriate or unhealthy content and deleted or reported in subsequent operations.
To this end, the operation performed by the operation unit 530 according to the classification result may include at least one of: feeding back based on the intention classification result of the input text; counting based on the emotional tendency classification result of the input text; and reporting based on the attribute classification of the input text.
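As a rough, assumed illustration of how the three units (input acquisition, classification determination, operation) could be wired around the distilled student model, with the classifier reduced to a plain callable:

```python
# Hypothetical sketch: wiring the three units of the text classification system.
from typing import Callable, Dict

class TextClassificationSystem:
    def __init__(self, classify: Callable[[str], str],
                 operations: Dict[str, Callable[[str], None]]):
        self.classify = classify          # distilled student model wrapped as text -> label
        self.operations = operations      # label -> follow-up operation

    def handle(self, user_text: str) -> None:
        label = self.classify(user_text)                         # classification determination unit
        self.operations.get(label, lambda t: None)(user_text)    # operation unit

# toy wiring: a dummy classifier and two follow-up operations
system = TextClassificationSystem(
    classify=lambda text: "negative" if "no reason" in text.lower() else "positive",
    operations={"positive": lambda t: print("count as favourable review:", t),
                "negative": lambda t: print("flag for follow-up:", t)})
system.handle("No reason to watch.")
```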
Fig. 6 is a schematic structural diagram of a computing device that can be used to implement the above-described student model training method based on a pre-trained language model according to an embodiment of the present invention.
Referring to fig. 6, computing device 600 includes memory 610 and processor 620.
The processor 620 may be a multi-core processor or may include multiple processors. In some embodiments, processor 620 may include a general-purpose main processor and one or more special coprocessors such as a Graphics Processor (GPU), a Digital Signal Processor (DSP), or the like. In some embodiments, processor 620 may be implemented using custom circuits, such as an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA).
The memory 610 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions required by the processor 620 or other modules of the computer. The persistent storage device may be a read-write storage device, and may be a non-volatile storage device that does not lose the stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the permanent storage; in other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory, and may store instructions and data that some or all of the processors require at runtime. Furthermore, the memory 610 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic and/or optical disks may also be employed. In some embodiments, the memory 610 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., an SD card, a mini SD card, a micro SD card, etc.), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 610 has stored thereon executable code that, when processed by the processor 620, causes the processor 620 to perform the above-described student model training method based on a pre-trained language model.
The student model training and text classification system based on the pre-training language model according to the present invention has been described in detail above with reference to the accompanying drawings.
The present invention improves the small-sample learning performance of large-scale PLMs using prompt-based learning. To enable online deployment of PLMs in resource-constrained environments, the invention compresses the large-scale PLM by knowledge distillation. In particular, the invention proposes a Prompt Distiller, the first small-sample knowledge distillation implementation for prompt-tuned PLMs, which lets the student model learn simultaneously from the pre-trained and the prompt-tuned teacher models. Considering the different knowledge capacities of the teacher and student models, the invention further designs a contrastive learning technique for learning higher-order dependencies from the teacher model's intermediate-layer representations.
For the problems existing in the related art, the invention mainly provides the following solutions and improvements:
1. The knowledge distillation loss function is improved to fit the prompt-fine-tuned model;
2. A distillation pipeline that learns from the original teacher PLM on unsupervised data is added, which alleviates the overfitting caused by the lack of labels in small-sample scenarios;
3. The intermediate-layer representations of the PLM are transferred via knowledge probes, and treating the outputs of different probes as different data augmentations stabilizes the distillation performance through contrastive learning.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
While embodiments of the present invention have been described above, the above description is illustrative, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. A student model training method based on a pre-training language model comprises the following steps:
adding prompt information and a masked text placeholder to a sample to obtain a processed training sample;
tuning a pre-training language model using the processed training samples to obtain a prompt-tuned teacher model, wherein the pre-training language model without prompt tuning serves as the original teacher model; and
training a student model using the processed training samples, wherein during training the student model simultaneously learns the classification probability vectors output by the prompt-tuned teacher model and the original teacher model.
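An illustrative sketch of the sample-processing step in claim 1: prompt text and a mask placeholder are wrapped around the raw input. The template string and the tokenizer checkpoint are assumptions made for this sketch, not values prescribed by the claim.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def build_prompted_sample(text, template="{text} It was {mask}."):
    """Insert the raw text and the tokenizer's mask token into a prompt template."""
    prompted = template.format(text=text, mask=tokenizer.mask_token)
    return tokenizer(prompted, truncation=True, return_tensors="pt")

# Example: build_prompted_sample("The delivery arrived two days late.")
# -> input ids containing "... it was [MASK] ." for the student and teacher models to fill in.
```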
2. The method of claim 1, wherein training a student model using the processed training samples comprises:
obtaining the prediction result of the student model for the masked text placeholder; and
adjusting the network parameters of the student model using a first loss function, wherein the first loss function computes the loss according to whether the prediction result for the masked text placeholder matches the label.
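A sketch, with assumed names, of the first loss in claim 2: the student's vocabulary logits at the [MASK] position are compared against the label word id of each sample (the verbalizer mapping from label to word is an assumption for illustration).

```python
import torch.nn.functional as F

def logits_at_mask(logits, input_ids, mask_token_id):
    """Pick the (B, vocab) logits at the single [MASK] position of each sample."""
    batch_idx, pos_idx = (input_ids == mask_token_id).nonzero(as_tuple=True)
    return logits[batch_idx, pos_idx]

def first_loss(logits, input_ids, label_word_ids, mask_token_id):
    """Cross entropy between the [MASK] prediction and each sample's label word."""
    mask_logits = logits_at_mask(logits, input_ids, mask_token_id)  # (B, vocab)
    return F.cross_entropy(mask_logits, label_word_ids)
```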
3. The method of claim 2, wherein the student model simultaneously learning, during training, the classification probability vectors output by the prompt-tuned teacher model and the original teacher model comprises:
adjusting the network parameters of the student model using a second loss function, wherein the second loss function characterizes the similarity between the classification probability vector output by the student model and the classification probability vector output by the prompt-tuned teacher model; and
adjusting the network parameters of the student model using a third loss function, wherein the third loss function characterizes the similarity between the classification probability vector output by the student model and the classification probability vector output by the original teacher model.
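A sketch, under assumed HuggingFace-style model interfaces, of how the second and third loss terms might be computed; the term tied to the original teacher can also be evaluated on unlabeled prompted samples, as the following claims elaborate.

```python
import torch
import torch.nn.functional as F

def soft_label_loss(student_logits, teacher_logits, tau=1.0):
    """Similarity term between student and teacher classification probability vectors."""
    return F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                    F.softmax(teacher_logits / tau, dim=-1),
                    reduction="batchmean") * tau ** 2

def second_and_third_losses(student, prompt_teacher, original_teacher,
                            labeled_batch, unlabeled_batch):
    with torch.no_grad():                                    # teacher models stay frozen
        t2 = prompt_teacher(**labeled_batch).logits          # prompt-tuned teacher
        t3 = original_teacher(**unlabeled_batch).logits      # original PLM teacher
    second = soft_label_loss(student(**labeled_batch).logits, t2)
    third = soft_label_loss(student(**unlabeled_batch).logits, t3)   # no labels required
    return second, third
```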
4. The method of claim 3, further comprising:
adding prompt information and a masked text placeholder to a second sample to obtain a processed second training sample, the second training sample being an unlabeled sample,
wherein the third loss function characterizes the difference between the classification probability vector output by the student model for the second training sample and the classification probability vector output by the original teacher model for the second training sample.
5. The method of claim 2, wherein the student model simultaneously learning, during training, the classification probability vectors output by the prompt-tuned teacher model and the original teacher model comprises:
the student model learning, during training, the classification probability vectors output by the intermediate layers of the prompt-tuned teacher model and of the original teacher model.
6. The method of claim 5, wherein the student model simultaneously learning, during training, the classification probability vectors output by the intermediate layers of the prompt-tuned teacher model and the original teacher model comprises:
adjusting the network parameters of the student model using a fourth loss function, wherein the fourth loss function characterizes the difference in similarity between the classification probability vector output by an intermediate layer of the student model and the classification probability vector output by an intermediate layer of the prompt-tuned teacher model when inputs with different ground-truth labels are given; and
adjusting the network parameters of the student model using a fifth loss function, wherein the fifth loss function characterizes the difference in similarity between the classification probability vector output by an intermediate layer of the student model and the classification probability vector output by an intermediate layer of the original teacher model when inputs with different ground-truth labels are given.
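A sketch, with assumed names, in the spirit of the fourth and fifth losses of claim 6, read as a supervised-contrastive term: intermediate-layer vectors of student and teacher should be similar for samples sharing a ground-truth label and dissimilar otherwise. The exact formulation is an assumption for illustration, not the claimed definition.

```python
import torch.nn.functional as F

def layer_label_contrastive_loss(student_repr, teacher_repr, labels, temperature=0.1):
    """student_repr, teacher_repr: (B, D) intermediate-layer vectors; labels: (B,)."""
    s = F.normalize(student_repr, dim=-1)
    t = F.normalize(teacher_repr, dim=-1)
    logits = s @ t.T / temperature                            # (B, B) cross-model similarities
    same_label = labels.unsqueeze(0) == labels.unsqueeze(1)   # (B, B) positive-pair mask
    log_prob = F.log_softmax(logits, dim=-1)
    # Average log-probability over each sample's positive teacher entries.
    pos_log_prob = (log_prob * same_label).sum(dim=-1) / same_label.sum(dim=-1).clamp(min=1)
    return -pos_log_prob.mean()
```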
7. The method of claim 6, wherein the classification probability vector output by each intermediate layer of the student model is multiplied by the classification probability vector output by the corresponding intermediate layer of the teacher model, and the products are averaged to characterize the similarity between the intermediate-layer outputs of the student model and the teacher model.
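A literal sketch of the computation in claim 7 under assumed shapes: per-layer vectors of student and teacher are multiplied, and the products are averaged into a single similarity score.

```python
import torch

def intermediate_similarity(student_layers, teacher_layers):
    """student_layers / teacher_layers: lists of (B, D) vectors, one per paired layer."""
    sims = [(s * t).sum(dim=-1) for s, t in zip(student_layers, teacher_layers)]  # per-layer dot products
    return torch.stack(sims, dim=0).mean()   # average over layers and the batch
```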
8. The method of claim 2, wherein training a student model using the processed training samples, with the student model simultaneously learning the classification probability vectors output by the prompt-tuned teacher model and the original teacher model during training, comprises:
training the student model using, as a total loss function, a weighted sum of the first loss function and the loss functions that respectively characterize the similarity between the classification probability vector output by the student model and the classification probability vectors output by the prompt-tuned teacher model and the original teacher model.
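A minimal sketch of the total objective in claim 8; the weights w1, w2 and w3 are illustrative placeholders rather than values prescribed by the claim.

```python
def total_loss(first, second, third, w1=1.0, w2=0.5, w3=0.5):
    """Weighted sum of the hard-label loss and the two teacher soft-label terms."""
    return w1 * first + w2 * second + w3 * third
```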
9. A text classification system comprising:
an input acquisition unit for acquiring a text input from a user;
a classification decision unit comprising a student model obtained by the method of any one of claims 1-8, the student model being configured to classify the input text; and
an operation unit, configured to perform an operation according to the classification result, where the operation includes at least one of:
providing feedback based on the intent classification result of the input text;
performing statistics based on the sentiment classification result of the input text; and
reporting based on the attribute classification result of the input text.
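A hypothetical sketch of how the three units of claim 9 could be wired together; the class name, label format, and dispatch scheme are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class TextClassificationSystem:
    classify: Callable[[str], str]                      # distilled student model, returns e.g. "intent:refund"
    operations: Dict[str, Callable[[str, str], None]]   # label prefix -> operation to perform

    def handle(self, user_text: str) -> None:           # input acquisition unit supplies user_text
        label = self.classify(user_text)                # classification decision unit
        prefix = label.split(":", 1)[0]
        self.operations.get(prefix, lambda text, lab: None)(user_text, label)  # operation unit
```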
10. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 1 to 8.
11. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1-8.
CN202210987063.5A 2022-08-17 2022-08-17 Student model training method and text classification system based on pre-training language model Pending CN115526332A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210987063.5A CN115526332A (en) 2022-08-17 2022-08-17 Student model training method and text classification system based on pre-training language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210987063.5A CN115526332A (en) 2022-08-17 2022-08-17 Student model training method and text classification system based on pre-training language model

Publications (1)

Publication Number Publication Date
CN115526332A true CN115526332A (en) 2022-12-27

Family

ID=84696251

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210987063.5A Pending CN115526332A (en) 2022-08-17 2022-08-17 Student model training method and text classification system based on pre-training language model

Country Status (1)

Country Link
CN (1) CN115526332A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116186200B (en) * 2023-01-19 2024-02-09 北京百度网讯科技有限公司 Model training method, device, electronic equipment and storage medium
CN116186200A (en) * 2023-01-19 2023-05-30 北京百度网讯科技有限公司 Model training method, device, electronic equipment and storage medium
CN116028891B (en) * 2023-02-16 2023-07-14 之江实验室 Industrial anomaly detection model training method and device based on multi-model fusion
CN116028891A (en) * 2023-02-16 2023-04-28 之江实验室 Industrial anomaly detection model training method and device based on multi-model fusion
CN116361658A (en) * 2023-04-07 2023-06-30 北京百度网讯科技有限公司 Model training method, task processing method, device, electronic equipment and medium
CN116310667A (en) * 2023-05-15 2023-06-23 鹏城实验室 Self-supervision visual characterization learning method combining contrast loss and reconstruction loss
CN116310667B (en) * 2023-05-15 2023-08-22 鹏城实验室 Self-supervision visual characterization learning method combining contrast loss and reconstruction loss
CN116340779A (en) * 2023-05-30 2023-06-27 北京智源人工智能研究院 Training method and device for next-generation universal basic model and electronic equipment
CN116502621A (en) * 2023-06-26 2023-07-28 北京航空航天大学 Network compression method and device based on self-adaptive comparison knowledge distillation
CN116502621B (en) * 2023-06-26 2023-10-17 北京航空航天大学 Network compression method and device based on self-adaptive comparison knowledge distillation
CN117057414B (en) * 2023-08-11 2024-06-07 佛山科学技术学院 Text generation-oriented multi-step collaborative prompt learning black box knowledge distillation method and system
CN116861302A (en) * 2023-09-05 2023-10-10 吉奥时空信息技术股份有限公司 Automatic case classifying and distributing method
CN116861302B (en) * 2023-09-05 2024-01-23 吉奥时空信息技术股份有限公司 Automatic case classifying and distributing method
CN117035052B (en) * 2023-10-10 2024-01-26 杭州海康威视数字技术股份有限公司 Method, device and storage medium for distillation without data knowledge
CN117035052A (en) * 2023-10-10 2023-11-10 杭州海康威视数字技术股份有限公司 Method, device and storage medium for distillation without data knowledge
CN118015431A (en) * 2024-04-03 2024-05-10 阿里巴巴(中国)有限公司 Image processing method, apparatus, storage medium, and program product

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination