CN114722805B - Few-sample emotion classification method based on large and small teacher knowledge distillation - Google Patents

Few-sample emotion classification method based on large and small teacher knowledge distillation

Info

Publication number
CN114722805B
Authority
CN
China
Prior art keywords
model
sample
teacher
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210653730.6A
Other languages
Chinese (zh)
Other versions
CN114722805A (en)
Inventor
李寿山 (Li Shoushan)
常晓琴 (Chang Xiaoqin)
周国栋 (Zhou Guodong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202210653730.6A priority Critical patent/CN114722805B/en
Publication of CN114722805A publication Critical patent/CN114722805A/en
Application granted granted Critical
Publication of CN114722805B publication Critical patent/CN114722805B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2155 Generating training patterns characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/048 Activation functions
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a few-sample emotion classification method based on large and small teacher knowledge distillation. A large number of unlabeled samples and a small number of labeled samples are collected for the emotion classification task, and the labeled samples are used to train a large teacher model and a small teacher model. All unlabeled samples are passed through the small teacher model to obtain the uncertainty of each sample's predicted probability; the samples whose probability is uncertain are then screened out according to a threshold and passed through the large teacher model. The probability outputs of the large teacher model and the small teacher model are combined into soft labels used to distill a student model, and the distilled student model performs the classification prediction. The invention reduces the frequency of accessing the large teacher model and the distillation time when training the student model, reducing resource consumption while improving the accuracy of classification and recognition.

Description

Few-sample emotion classification method based on large and small teacher knowledge distillation
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a few-sample emotion classification method based on large and small teacher knowledge distillation.
Background
The emotion classification task aims to automatically judge the emotion polarity (e.g., negative or positive) expressed by a text. The task is a research hotspot in the field of natural language processing, is widely applied in application systems such as intention mining, information retrieval and question-answering systems, and serves as a basic component of those systems. Few-sample emotion classification means that only a small number of labeled samples are available when training the classifier.
When performing few-sample emotion classification, machine learning and deep learning algorithms are generally used to extract the emotional meaning of a piece of text; the most common approach is to model the problem as taking a piece of text as input and outputting a label. The prior art generally comprises the following steps: (1) a professional labels a small amount of text with different polarity labels, each piece of text serving as one sample, yielding a small corpus of labeled samples with balanced polarity labels; (2) a prompt-based large-scale pre-trained language model (such as GPT-3) is trained with the small number of labeled samples to obtain a classification model; (3) the classification model is used to test a text with an unknown label and obtain its polarity label. During testing, a single text is fed into the classification model each time. The network structure of the prompt-based large-scale pre-trained language model in step (2) is shown in FIG. 1, where [CLS] x [SEP] is the input sentence, [CLS] marks the beginning of the sentence, [SEP] marks the separation between sentences, and x is the sentence whose class is to be predicted. The "MLM head" in FIG. 1 is the masked-language-model head of the prompt-based pre-trained language model. For the input sentence "[CLS] I will recommend them to everyone! It is [MASK]. [SEP]", the MLM head predicts the positive label word "good" at the [MASK] position, so the completed output reads "I will recommend them to everyone! It is good."
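As an illustration of this prompt-based prediction, the masked-language-model head can be queried for label words at the [MASK] position as in the following minimal sketch; the model name bert-base-uncased and the label words "terrible"/"great" are assumptions for illustration, not details taken from the prior-art system described above:

```python
# Minimal sketch of prompt-based sentiment prediction through an MLM head.
# Assumptions: bert-base-uncased and the label words "terrible"/"great".
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
label_words = {0: "terrible", 1: "great"}          # verbalizer: label -> label word

def prompt_predict(text: str) -> int:
    # Wrap the input in the template "[CLS] x It is [MASK]. [SEP]".
    prompt = f"{text} It is {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]           # scores over the vocabulary
    ids = [tokenizer.convert_tokens_to_ids(w) for w in label_words.values()]
    probs = torch.softmax(logits[ids], dim=-1)                 # probabilities of the label words
    return int(probs.argmax())

print(prompt_predict("I will recommend them to everyone!"))    # expected: 1 (positive)
```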
Because few-sample emotion classification offers few training samples, common shallow neural networks (such as CNN and LSTM) and deep pre-trained language models (such as BERT and RoBERTa) struggle to judge the semantics of some texts correctly, and the recognition rate of classification is not high enough. The GPT-3 large-scale model in the prior art has 175 billion parameters and can perform well on few-sample learning tasks by adding several examples of inputs and corresponding outputs as context. However, its parameter count is too large: calling the model consumes expensive computing resources, inference is slow, and practical application is hindered.
Disclosure of Invention
Therefore, the technical problem to be solved by the present invention is to overcome the defects of the prior art and provide a few-sample emotion classification method based on large and small teacher knowledge distillation, which can effectively reduce the frequency of accessing the large teacher model and the distillation time when training the student model, and improve the accuracy of classification and recognition while reducing resource consumption.
In order to solve the above technical problems, the invention provides a few-sample emotion classification method based on large and small teacher knowledge distillation, comprising the following steps:
S1: divide the samples into labeled samples x_u and unlabeled samples x_u'; collect a large number of unlabeled samples x_u' for the emotion classification task and build the labeled sample set D_l = {x_u} and the unlabeled sample set D_u = {x_u'};
S2: construct a large teacher model and a small teacher model; use the labeled sample set D_l to train the large teacher model, obtaining the trained large teacher model M_L, and use the labeled sample set D_l to train the small teacher model, obtaining the trained small teacher model M_B;
S3: use the trained small teacher model M_B to predict all unlabeled samples x_u', obtaining the probability of each sample, and calculate the uncertainty of each sample's probability;
S4: compare the uncertainty with a preset threshold, denoted threshold, and screen out the samples x_u'' whose probability is uncertain;
S5: feed the samples x_u' into the trained small teacher model M_B to obtain the small teacher model's soft label P, feed the samples x_u'' into the trained large teacher model M_L to obtain the large teacher model's soft label P', and combine the small teacher model's soft label P with the large teacher model's soft label P' to obtain the final soft label;
S6: construct a student model, and distill the student model using the unlabeled sample set D_u and the final soft label to obtain a distilled student model;
S7: use the distilled student model to perform classification prediction on the test set.
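Purely as an illustration of how steps S1-S7 fit together (a hypothetical sketch, not the patented implementation: the teacher and student models are passed in as callables, and the normalized-entropy uncertainty and the replace-on-uncertain combination rule are assumptions standing in for the formulas referred to above):

```python
# Hypothetical end-to-end sketch of steps S1-S7.
import numpy as np

def distill_with_two_teachers(small_teacher, large_teacher, train_student,
                              unlabeled, threshold):
    # S2 is assumed done: both teachers are already fine-tuned on the labeled set D_l.
    # S3: the small teacher M_B predicts every unlabeled sample.
    probs_small = np.stack([small_teacher(x) for x in unlabeled])      # shape (N, |L|)
    entropy = -(probs_small * np.log(probs_small + 1e-12)).sum(axis=1)
    uncertainty = entropy / np.log(probs_small.shape[1])               # assumed measure, in [0, 1]
    # S4: screen out the samples x_u'' whose prediction is uncertain.
    uncertain = uncertainty > threshold
    # S5: uncertain samples are re-predicted by the large teacher M_L.
    soft_labels = probs_small.copy()
    for i in np.where(uncertain)[0]:
        soft_labels[i] = large_teacher(unlabeled[i])
    # S6: distill the student on the unlabeled set D_u with the combined soft labels.
    return train_student(unlabeled, soft_labels)                       # S7: use it for prediction
```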
Preferably, the large teacher model and the small teacher model are both prompt-based pre-trained language models M, and the number of parameters of the large teacher model is greater than that of the small teacher model.
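A minimal sketch of how such a prompt-based teacher (standing in for M_L or M_B) could be fine-tuned on the labeled samples is given below; the model name, label words and optimizer settings are illustrative assumptions, and the formal training steps follow:

```python
# Sketch of one fine-tuning step of a prompt-based teacher (steps S21-S25 / S26-S210).
# Assumptions: bert-base-uncased and single-token label words "terrible"/"great".
import torch
from torch.nn.functional import cross_entropy
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
verbalizer = [tokenizer.convert_tokens_to_ids(w) for w in ("terrible", "great")]
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-3)

def class_logits(texts):
    # S21: wrap each text in the template "[CLS] x It is [MASK]. [SEP]".
    prompts = [f"{t} It is {tokenizer.mask_token}." for t in texts]
    enc = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
    out = model(**enc).logits                                  # (batch, seq_len, vocab)
    mask_pos = (enc.input_ids == tokenizer.mask_token_id).nonzero()
    mlm_scores = out[mask_pos[:, 0], mask_pos[:, 1]]           # (batch, vocab) at [MASK]
    return mlm_scores[:, verbalizer]                           # S22: scores of the label words

def train_step(texts, labels):
    # S23-S24: softmax over label-word scores and cross-entropy against the true labels.
    loss = cross_entropy(class_logits(texts), torch.tensor(labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```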
Preferably, using the labeled sample set D_l to train the large teacher model and obtain the trained large teacher model M_L specifically comprises the following steps:
S21: in the training set D_l = {x_u} = {(x, y)}, x denotes an input sample and y denotes its true label; add a prompt template to the input sample x and convert the task into a cloze (fill-in-the-blank) form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in, P(x) is the input to the language model, and "It is [MASK]." is the prompt template appended to the input text;
S22: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function v: L → V; with P(x) as the input, the prompt-based pre-trained language model M produces, for each label l, a score for [MASK] being the corresponding label word v(l), where v(l) denotes the label word corresponding to label l and k is the length of the label word;
S23: a softmax layer predicts the class probability of [MASK] over the different labels l, and the emotion class of the input sample x is obtained from the class probability;
S24: establish the loss function of the large teacher model's output layer;
S25: repeat S22-S24 until the large teacher model converges; training ends and the trained large teacher model M_L is obtained.
Using the labeled sample set D_l to train the small teacher model and obtain the trained small teacher model M_B specifically comprises the following steps:
S26: in the training set D_l = {x_u} = {(x, y)}, x denotes an input sample and y denotes its true label; add a prompt template to the input sample x and convert the task into a cloze form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in;
S27: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function v: L → V; through the prompt-based pre-trained language model M, obtain for each label l a score for [MASK] being the corresponding label word v(l), where v(l) denotes the label word corresponding to label l and k is the length of the label word;
S28: a softmax layer predicts the class probability of [MASK] over the different labels l, and the emotion class of the input sample x is obtained from the class probability;
S29: establish the loss function of the small teacher model's output layer;
S210: repeat S27-S29 until the small teacher model converges; training ends and the trained small teacher model M_B is obtained.
Preferably, using the trained small teacher model M_B to predict all unlabeled samples x_u', obtaining the probability of each sample, and calculating the uncertainty of each sample's probability specifically comprises the following steps:
S31: feed all unlabeled samples x_u' into the trained small teacher model M_B to obtain the predicted probability distribution of each sample;
S32: calculate the uncertainty of each sample's probability from its predicted probability distribution, where |L| is the number of label classes in the classification task.
Preferably, the preset threshold, threshold, takes a value in a predetermined range.
Preferably, comparing the uncertainty with the preset threshold threshold and screening out the samples x_u'' whose probability is uncertain specifically means: if the uncertainty of a sample's probability is greater than threshold, the sample is taken as a sample x_u'' with high probability uncertainty.
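The uncertainty computation and screening described above can be sketched as follows; the normalized-entropy measure is an assumption, since the patent states only that the uncertainty is computed from the predicted probability distribution and the number of label classes |L|:

```python
# Sketch of S31-S32 and S4: per-sample uncertainty from the small teacher's
# probabilities and threshold-based screening. Normalized entropy is assumed.
import torch

def screen_uncertain(probs_small: torch.Tensor, threshold: float) -> torch.Tensor:
    # probs_small: (N, |L|) probability distributions predicted by M_B.
    num_labels = probs_small.size(1)
    entropy = -(probs_small * probs_small.clamp_min(1e-12).log()).sum(dim=1)
    uncertainty = entropy / torch.log(torch.tensor(float(num_labels)))   # in [0, 1]
    return uncertainty > threshold       # True for the samples x_u'' sent to M_L
```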
Preferably, feeding the samples x_u' into the trained small teacher model M_B to obtain the small teacher model's soft label P, feeding the samples x_u'' into the trained large teacher model M_L to obtain the large teacher model's soft label P', and combining the small teacher model's soft label P with the large teacher model's soft label P' to obtain the final soft label specifically comprises:
S51: feed the samples x_u' into the trained small teacher model M_B to obtain the small teacher model's soft label P;
S52: feed the samples x_u'' into the trained large teacher model M_L to obtain the large teacher model's soft label P';
S53: combine P and P' to obtain the final soft label.
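One plausible reading of this combination, shown only as a sketch (the patent's exact combination expression is not reproduced here), is to replace the small teacher's soft labels with the large teacher's for the screened samples x_u'':

```python
# Sketch of S51-S53: combine the two teachers' soft labels into the final soft label.
# The replace-on-uncertain rule below is an assumed reading of the combination.
import torch

def combine_soft_labels(soft_small: torch.Tensor,       # P:  (N, |L|) from M_B
                        soft_large: torch.Tensor,       # P': (M, |L|) from M_L
                        uncertain_mask: torch.Tensor):  # (N,) bool with M True entries
    final = soft_small.clone()
    final[uncertain_mask] = soft_large                  # samples x_u'' take M_L's output
    return final                                        # final soft label over D_u
```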
preferably, the using the unlabeled sample setD u And said soft label
Figure 740531DEST_PATH_IMAGE031
Distilling the student model to obtain a distilled student model, wherein the specific process is as follows:
s61: collecting unlabelled samplesD u As a training set of distillation student models, vectors passing through the student models are represented as
Figure 305504DEST_PATH_IMAGE032
Wherein g () represents a network function of the student model,A u for unlabeled sample setD u Corresponding word vector matrix, superscriptsThe model of the student is represented by,
Figure 282687DEST_PATH_IMAGE033
a learnable parameter representing a student model;
s62: establishing loss function of student model output layer
Figure 865984DEST_PATH_IMAGE034
WhereinnWhich is indicative of the size of the batch,
Figure 208104DEST_PATH_IMAGE035
representing the passage through the student modeliThe probability of prediction of a sample is,
Figure 260374DEST_PATH_IMAGE036
representing the final sample probability
Figure 510089DEST_PATH_IMAGE037
To middleiThe probability of prediction of a sample is,Tis a temperature parameter of the distillation model, D KL Representing the KL divergence loss function;
S63:
Figure 183778DEST_PATH_IMAGE038
sequentially passing through the linear layer and the softmax activation layer to obtain an unlabeled sample setD u Is output according to the probability
Figure 962379DEST_PATH_IMAGE039
W s Representing a weight matrix to be learned on a linear layer of the student model;
s64: using a loss function L KD Updating the learnable parameters of the student model;
s65: repeating S61-S64 until the loss function L KD And (5) converging to obtain a distilled student model.
Preferably, in the word-vector matrix A_u, each row is the word vector of one character of the input sample x_u', and the word vector of each character is obtained by training a word2vec or GloVe model.
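Putting steps S61-S65 together, one distillation epoch over D_u might look like the following sketch; the student is assumed to return logits, the soft labels are assumed to be the teachers' (already temperature-softened) probabilities, and the T^2 scaling is a common convention rather than something stated here:

```python
# Sketch of S62-S65: KL-divergence distillation loss and parameter updates.
import torch
import torch.nn.functional as F

def distill_epoch(student, loader, optimizer, T=2.0):
    for word_vectors, soft_labels in loader:             # batches of A_u and final soft labels
        log_p_student = F.log_softmax(student(word_vectors) / T, dim=-1)   # S63 output
        loss_kd = F.kl_div(log_p_student, soft_labels,                     # S62: D_KL term
                           reduction="batchmean") * (T * T)
        optimizer.zero_grad()
        loss_kd.backward()                                # S64: update learnable parameters
        optimizer.step()
    return float(loss_kd)                                 # S65: repeat epochs until convergence
```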
Preferably, the KL divergence loss function between two probability distributions p and q over the label classes is D_KL(p || q) = Σ_{l=1..|L|} p_l log(p_l / q_l), where |L| is the number of label classes in the classification task.
Compared with the prior art, the technical scheme of the invention has the following advantages:
according to the invention, the student model is distilled by establishing a large teacher model and a small teacher model, so that samples are first screened by the small teacher model and only the screened samples pass through the large teacher model; this effectively reduces the distillation time of the student model and the resource consumption. Meanwhile, because the large and small teacher models reduce resource consumption, a large number of unlabeled samples for the emotion classification task can be used, which improves the accuracy of classification and recognition.
Drawings
In order that the present disclosure may be more readily and clearly understood, the invention is described in further detail below with reference to embodiments and the accompanying drawings, in which:
FIG. 1 is the network structure of a prompt-based large-scale pre-trained language model;
FIG. 2 is a schematic diagram of the conventional single-teacher, single-student knowledge distillation method;
FIG. 3 is a schematic diagram of the knowledge distillation method based on the large and small teacher mechanism;
FIG. 4 shows the experimental results of the YELP and IMDB datasets on the BERT model in an embodiment of the invention;
FIG. 5 shows the experimental results of the YELP and IMDB datasets on the RoBERTa model in an embodiment of the invention.
Detailed Description
The present invention is further described below with reference to the accompanying drawings and specific examples so that those skilled in the art can better understand and practice it, but the examples are not intended to limit the invention.
In the optimization of models, a large model is often a single complex network or an ensemble of several networks and has good performance and generalization ability, while a small model has limited expressive power because of its small network size. Therefore, the knowledge learned by the large model (teacher model) can be used to guide the training of the small model (student model), so that the small model achieves performance comparable to the large model with a greatly reduced number of parameters, thereby compressing and accelerating the model; this process is called knowledge distillation.
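As a generic illustration of this idea (not code from the patent), the teacher's knowledge is transferred through a temperature-softened output distribution that the student then imitates:

```python
# Generic soft-label construction used in knowledge distillation.
import torch
import torch.nn.functional as F

def soft_label(teacher_logits: torch.Tensor, T: float = 2.0) -> torch.Tensor:
    # A larger temperature T smooths the distribution, exposing the teacher's
    # relative preferences between classes instead of only its top prediction.
    return F.softmax(teacher_logits / T, dim=-1)
```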
Compared with the conventional single-teacher, single-student knowledge distillation method shown in FIG. 2, the method of the invention shown in FIG. 3 additionally uses a large number of unlabeled samples and introduces two prompt-based teacher models, a large teacher model and a small teacher model; the output probability of the student model is also marked in the figure.
The few-sample emotion classification method based on large and small teacher knowledge distillation of the invention comprises the following steps.
S1: divide the samples into labeled samples x_u and unlabeled samples x_u'; collect a large number of unlabeled samples x_u' for the emotion classification task and build the labeled sample set D_l = {x_u} and the unlabeled sample set D_u = {x_u'}.
The labeled samples x_u are samples that carry labels and the unlabeled samples x_u' are samples without labels; the data contain a small number of labeled samples x_u and a large number of unlabeled samples x_u'.
S2: construct a large teacher model and a small teacher model, both of which are prompt-based pre-trained language models M (i.e., they use the prompt method); the number of parameters of the large teacher model is greater than that of the small teacher model, and in this embodiment it is much greater. Use the labeled sample set D_l to train the large teacher model, obtaining the trained large teacher model M_L, and use the labeled sample set D_l to train the small teacher model, obtaining the trained small teacher model M_B.
Using the labeled sample set D_l, the large teacher model and the small teacher model are trained separately; the training processes of the two teacher models are similar, and the specific steps are as follows.
S21: in the training set D_l = {x_u} = {(x, y)}, x denotes an input sample and y denotes its true label; add a prompt template to the input sample x and convert the task into a cloze form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in. The purpose is to let the prompt-based pre-trained language model M decide which word fills [MASK], thereby converting the classification task into a cloze task. The prompt template "It is [MASK]." is appended to the input text, [MASK] corresponds to the different labels of the classification task, and the new input is passed through the language model, which decides the word that fills [MASK] and thereby classifies the text.
S22: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function v: L → V that maps each task label to one or more words in the vocabulary of the prompt-based pre-trained language model M. For example, in a binary emotion classification task, class 0 corresponds to the vocabulary word "terrible" and class 1 corresponds to the vocabulary word "great". With P(x) as the input, the prompt-based pre-trained language model M produces, for each label l, a score for [MASK] being the corresponding label word v(l), where v(l) denotes the label word corresponding to label l and k is the length of the label word.
S23: a softmax layer predicts the class probability of [MASK] over the different labels l, and the emotion class of the input sample x is obtained from the class probability.
S24: establish the loss function of the large teacher model's output layer; in this embodiment the loss function is the cross-entropy between the true label y of a training sample and the predicted probability.
S25: repeat S22-S24 until the large teacher model converges; training ends and the trained large teacher model M_L is obtained.
S26: in the training set D_l = {x_u} = {(x, y)}, x denotes an input sample and y denotes its true label; add a prompt template to the input sample x and convert the task into a cloze form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in.
S27: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function v: L → V; through the prompt-based pre-trained language model M, obtain for each label l a score for [MASK] being the corresponding label word v(l), where v(l) denotes the label word corresponding to label l and k is the length of the label word.
S28: a softmax layer predicts the class probability of [MASK] over the different labels l, and the emotion class of the input sample x is obtained from the class probability.
S29: establish the loss function of the small teacher model's output layer.
S210: repeat S27-S29 until the small teacher model converges; training ends and the trained small teacher model M_B is obtained.
S3: use the trained small teacher model M_B to predict all unlabeled samples x_u', obtaining the probability of each sample, and calculate the uncertainty of each sample's probability.
S31: feed all unlabeled samples x_u' into the trained small teacher model M_B to obtain the predicted probability distribution of each sample.
S32: calculate the uncertainty of each sample's probability from its predicted probability distribution, where |L| is the number of label classes in the classification task; the uncertainty measures the quality of a sample's prediction probability.
S4: compare the uncertainty with the preset threshold, threshold, and screen out the samples whose probability is uncertain; the preset threshold takes a value in a predetermined range. If the uncertainty of a sample's probability is greater than threshold, the sample is taken as a sample with high probability uncertainty; an uncertainty greater than threshold indicates that the small teacher model's classification probability for the sample x_u' is not confident enough, and a new probability distribution needs to be obtained from the large teacher model.
S5: feed the samples x_u' into the trained small teacher model M_B to obtain the small teacher model's soft label P, feed the samples x_u'' into the trained large teacher model M_L to obtain the large teacher model's soft label P', and combine the small teacher model's soft label P with the large teacher model's soft label P' to obtain the final soft label.
S51: feed the samples x_u' into the trained small teacher model M_B to obtain the small teacher model's soft label P.
S52: feed the samples x_u'' into the trained large teacher model M_L to obtain the large teacher model's soft label P'.
S53: combine P and P' to obtain the final soft label.
S6: construct a student model; in this embodiment the student model is a small shallow neural network. Use the unlabeled sample set D_u and the final soft label to distill the student model and obtain the distilled student model.
S61: take the unlabeled sample set D_u as the training set for distilling the student model; its vector representation through the student model is obtained by applying the network function g() of the student model, with its learnable parameters, to A_u, the word-vector matrix corresponding to the unlabeled sample set D_u (the superscript s denotes the student model); for an unlabeled sample x_u' of length k with word-vector dimension d, A_u has dimensions k × d.
In the word-vector matrix A_u, each row is the word vector of one character of the input sample x_u', and the word vector of each character is obtained by training a word2vec or GloVe model.
S62: establish the loss function L_KD of the student model's output layer, i.e., the loss used when the teacher models distill the student model, where n denotes the batch size, the prediction probability of the i-th sample by the student model is compared with the prediction probability of the i-th sample in the final soft label, T is the temperature parameter of the distillation model (the larger T is, the smoother the softmax probability distribution, the larger its entropy and the more information it carries), and D_KL denotes the KL-divergence loss function.
The KL divergence loss function between two probability distributions p and q over the label classes is D_KL(p || q) = Σ_{l=1..|L|} p_l log(p_l / q_l), where |L| is the number of label classes in the classification task.
S63: the student representation is passed through a linear layer and a softmax activation layer in turn to obtain the probability output of the unlabeled sample set D_u, where W_s denotes the weight matrix to be learned in the student model's linear layer.
S64: update the learnable parameters of the student model using the loss function L_KD.
S65: repeat S61-S64 until the loss function L_KD converges, obtaining the distilled student model.
S7: use the distilled student model to perform classification prediction on the test set.
The invention has the beneficial effects that:
according to the invention, the student model is distilled by establishing a large teacher model and a small teacher model, so that samples are first screened by the small teacher model and only the screened samples pass through the large teacher model; this effectively reduces the distillation time of the student model and the resource consumption. Meanwhile, because the large and small teacher models reduce resource consumption, a large number of unlabeled samples for the emotion classification task can be used, which improves the accuracy of classification and recognition.
To further illustrate the beneficial effects of the invention, in this embodiment the test set is input into the trained student model to obtain prediction probabilities. The effect of the invention is analyzed from three aspects: (1) the accuracy of the classification results obtained by the student model on the test set; (2) the prediction time of the teacher models over all unlabeled samples during distillation; and (3) the reduction in the access rate to the large teacher model.
In this embodiment, the sentence-level YELP dataset (see Zhang X, Zhao J, LeCun Y., "Character-level Convolutional Networks for Text Classification", Advances in Neural Information Processing Systems, 2015, 28: 649-657) and the IMDB dataset are used. In the experiments, 8 samples balanced between positive and negative are selected from each dataset as the training set and another 8 as the validation set, and 500 positive and 500 negative samples are used as the test set. Furthermore, the number of unlabeled samples is 100,000 for the YELP dataset and 98,000 for the IMDB dataset.
To simulate the knowledge distillation process of the large-and-small-teacher mechanism, the invention sets up a large teacher model and a small teacher model under both a BERT model and a RoBERTa model, denoted BERT-large (the large teacher under BERT), BERT-base (the small teacher under BERT), RoBERTa-large (the large teacher under RoBERTa) and RoBERTa-base (the small teacher under RoBERTa). When training the teacher models, the label words are "terrible" and "great" respectively; the batch size is set to 4 or 8; the optimizer is AdamW with a learning rate chosen from {1e-5, 2e-5, 5e-5} and weight decay 1e-3, the batch size and learning rate being determined by grid search over the hyper-parameters. The student model is a CNN using 3 convolution kernel sizes, (3, 50), (4, 50) and (5, 50), with 100 kernels of each size; each CNN uses glove.6B.50d word vectors; the batch size is set to 128; the optimizer is Adam with learning rate 1e-3 and weight decay 1e-5. To prevent overfitting during training of the neural network model, the dropout rate is set to 0.5.
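A CNN student with this configuration could be sketched as follows (a hypothetical reconstruction of the described architecture, not the patent's code; the number of classes and the GloVe loading are left as assumptions):

```python
# Hypothetical CNN student matching the described setup: kernel sizes (3,50),
# (4,50), (5,50) with 100 kernels each, 50-dimensional word vectors, dropout 0.5.
import torch
import torch.nn as nn

class CNNStudent(nn.Module):
    def __init__(self, num_classes: int = 2, embed_dim: int = 50):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(1, 100, kernel_size=(h, embed_dim)) for h in (3, 4, 5)
        )
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(3 * 100, num_classes)     # W_s: linear layer before softmax

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        # word_vectors: (batch, seq_len, 50), i.e. one word-vector matrix A_u per sample
        x = word_vectors.unsqueeze(1)                                    # (B, 1, k, d)
        pooled = [torch.relu(c(x)).squeeze(3).max(dim=2).values for c in self.convs]
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))          # logits over classes

# optimizer = torch.optim.Adam(CNNStudent().parameters(), lr=1e-3, weight_decay=1e-5)
```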
The experimental results of the YELP and IMDB datasets on the BERT model are shown in FIG. 4, where the uncertainty threshold is set to 0.85 on both the YELP and IMDB datasets; the experimental results on the RoBERTa model are shown in FIG. 5, where the uncertainty threshold is set to 0.6 on the YELP dataset and 0.9 on the IMDB dataset. In FIGS. 4 and 5, Fine-tuning denotes the standard fine-tuned pre-trained language model, LM-BFF denotes the prompt-based fine-tuned pre-trained language model, and "LM-BFF distills CNN" denotes distilling the CNN model with the prompt-based pre-trained language model. Because few-sample learning is sensitive to the data and results obtained from different data splits differ greatly, 5 different random seeds are used to sample different training and validation sets in order to alleviate this problem, and the classification accuracy is reported as "mean of 5 results (variance of 5 results)".
As can be seen from the classification accuracy in FIG. 4, compared with the distillation performance of the BERT-large model, the method of the invention improves by 91.13% - 90.64% = 0.49% on the YELP dataset and by 84.14% - 84.08% = 0.06% on the IMDB dataset. Moreover, compared with the distillation performance of the BERT-base model, the method achieves 91.13% > 87.18% on the YELP dataset and 84.14% > 84.08% on the IMDB dataset, so its results are far better than those of BERT-base. As can be seen from the prediction time in FIG. 4, compared with the BERT-large teacher model, the distillation time of the method is reduced by 91.93 s out of 163.27 s, i.e., 56.31%, on the YELP dataset and by 962.37 s out of 1598.34 s, i.e., 60.21%, on the IMDB dataset. Meanwhile, the simulation program counts the proportion by which the method reduces the access rate to the large teacher model (relative to passing every unlabeled sample through the large teacher model): compared with BERT-large, the access rate is reduced by 74.40% on the YELP dataset and by 72.42% on the IMDB dataset.
As can be seen from the classification accuracy in FIG. 5, compared with the distillation performance of the RoBERTa-large model, the method of the invention improves by 93.16% - 92.80% = 0.36% on the YELP dataset and by 87.84% - 87.64% = 0.2% on the IMDB dataset. Moreover, compared with the distillation performance of the RoBERTa-base model, the method achieves 93.16% > 91.82% on the YELP dataset and 87.84% > 87.64% on the IMDB dataset, so its results are better than those of RoBERTa-base. As can be seen from the prediction time in FIG. 5, compared with the RoBERTa-large teacher model, the distillation time of the method is reduced by 75.59 s out of 163.32 s, i.e., 46.28%, on the YELP dataset and by 912.65 s out of 1594.93 s, i.e., 57.22%, on the IMDB dataset. Meanwhile, the simulation program counts the proportion by which the method reduces the access rate to the large teacher model (relative to passing every unlabeled sample through the large teacher model): compared with RoBERTa-large, the access rate is reduced by 84.65% on the YELP dataset and by 75.56% on the IMDB dataset.
The access frequency is proportional to the resources occupied. All unlabeled samples need to pass through the small teacher model, and only the small portion of samples screened out by the threshold pass through the large teacher model. Compared with passing all unlabeled samples through the large teacher model, this greatly reduces resource consumption; since the small teacher model has relatively few parameters and occupies few computing resources, only the reduction in the access rate to the large teacher model is analyzed in the simulation.
The simulation results further show that the method can effectively reduce the frequency of accessing the large teacher model and the distillation time when training the student model, and improve the accuracy of classification and recognition while reducing resource consumption.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to those skilled in the art in light of the above description; it is neither necessary nor possible to exhaustively list all embodiments, and obvious variations or modifications may be made without departing from the spirit or scope of the invention.

Claims (10)

1. A few-sample emotion classification method based on large and small teacher knowledge distillation, characterized by comprising the following steps:
S1: dividing the samples into labeled samples x_u and unlabeled samples x_u', collecting a large number of unlabeled samples x_u' for the emotion classification task, and building the labeled sample set D_l = {x_u} and the unlabeled sample set D_u = {x_u'};
S2: constructing a large teacher model and a small teacher model, using the labeled sample set D_l to train the large teacher model to obtain the trained large teacher model M_L, and using the labeled sample set D_l to train the small teacher model to obtain the trained small teacher model M_B;
S3: using the trained small teacher model M_B to predict all unlabeled samples x_u', obtaining the probability of each sample, and calculating the uncertainty of each sample's probability;
S4: comparing the uncertainty with a preset threshold, denoted threshold, and screening out the samples x_u'' whose probability is uncertain;
S5: feeding the samples x_u' into the trained small teacher model M_B to obtain the small teacher model's soft label P, feeding the samples x_u'' into the trained large teacher model M_L to obtain the large teacher model's soft label P', and combining the small teacher model's soft label P with the large teacher model's soft label P' to obtain the final soft label;
S6: constructing a student model, and distilling the student model using the unlabeled sample set D_u and the final soft label to obtain a distilled student model;
S7: using the distilled student model to perform classification prediction on a test set.
2. The few-sample emotion classification method based on large and small teacher knowledge distillation according to claim 1, characterized in that: the large teacher model and the small teacher model are both prompt-based pre-trained language models M, and the number of parameters of the large teacher model is greater than that of the small teacher model.
3. The few-sample emotion classification method based on large and small teacher knowledge distillation according to claim 2, characterized in that using the labeled sample set D_l to train the large teacher model to obtain the trained large teacher model M_L specifically comprises the following steps:
S21: in the training set D_l = {x_u} = {(x, y)}, x denotes an input sample and y denotes its true label; adding a prompt template to the input sample x and converting the task into a cloze form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in, P(x) is the input to the language model, "It is [MASK]." is the prompt template appended to the input text, [CLS] marks the beginning of the sentence, x is the sentence whose class is predicted, and [SEP] marks the separation between sentences;
S22: taking L as the label set of the classification task and V as the label-word set of the classification task, constructing a label mapping function v: L → V, and obtaining through the prompt-based pre-trained language model M, for each label l, a score for [MASK] being the corresponding label word v(l), where v(l) denotes the label word corresponding to label l and k is the length of the label word;
S23: predicting through a softmax layer the class probability of [MASK] over the different labels l, and obtaining the emotion class of the input sample x from the class probability;
S24: establishing the loss function of the large teacher model's output layer;
S25: repeating S22-S24 until the large teacher model converges, ending the training and obtaining the trained large teacher model M_L;
and using the labeled sample set D_l to train the small teacher model to obtain the trained small teacher model M_B specifically comprises the following steps:
S26: in the training set D_l = {x_u} = {(x, y)}, x denotes an input sample and y denotes its true label; adding a prompt template to the input sample x and converting the task into a cloze form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in;
S27: taking L as the label set of the classification task and V as the label-word set of the classification task, constructing a label mapping function v: L → V, and obtaining through the prompt-based pre-trained language model M, for each label l, a score for [MASK] being the corresponding label word v(l), where v(l) denotes the label word corresponding to label l and k is the length of the label word;
S28: predicting through a softmax layer the class probability of [MASK] over the different labels l, and obtaining the emotion class of the input sample x from the class probability;
S29: establishing the loss function of the small teacher model's output layer;
S210: repeating S27-S29 until the small teacher model converges, ending the training and obtaining the trained small teacher model M_B.
4. The few-sample emotion classification method based on large and small teacher knowledge distillation according to claim 3, characterized in that using the trained small teacher model M_B to predict all unlabeled samples x_u', obtaining the probability of each sample, and calculating the uncertainty of each sample's probability specifically comprises the following steps:
S31: feeding all unlabeled samples x_u' into the trained small teacher model M_B to obtain the predicted probability distribution of each sample;
S32: calculating the uncertainty of each sample's probability from its predicted probability distribution, where |L| is the number of label classes in the classification task.
5. The few-sample emotion classification method based on large and small teacher knowledge distillation according to claim 1, characterized in that: the preset threshold, threshold, takes a value in a predetermined range.
6. The few-sample emotion classification method based on large and small teacher knowledge distillation according to claim 1, characterized in that comparing the uncertainty with the preset threshold threshold and screening out the samples x_u'' whose probability is uncertain specifically means: if the uncertainty of a sample's probability is greater than threshold, the sample is taken as a sample x_u'' with high probability uncertainty.
7. The few-sample emotion classification method based on large and small teacher knowledge distillation according to claim 3, characterized in that feeding the samples x_u' into the trained small teacher model M_B to obtain the small teacher model's soft label P, feeding the samples x_u'' into the trained large teacher model M_L to obtain the large teacher model's soft label P', and combining the small teacher model's soft label P with the large teacher model's soft label P' to obtain the final soft label specifically comprises:
S51: feeding the samples x_u' into the trained small teacher model M_B to obtain the small teacher model's soft label P, where T is the temperature parameter of the distillation model;
S52: feeding the samples x_u'' into the trained large teacher model M_L to obtain the large teacher model's soft label P';
S53: combining P and P' to obtain the final soft label.
8. The few-sample emotion classification method based on large and small teacher knowledge distillation according to any one of claims 1 to 7, characterized in that distilling the student model using the unlabeled sample set D_u and the final soft label to obtain the distilled student model specifically comprises:
S61: taking the unlabeled sample set D_u as the training set for distilling the student model, its vector representation through the student model being obtained by applying the network function g() of the student model, with its learnable parameters, to A_u, the word-vector matrix corresponding to the unlabeled sample set D_u, where the superscript s denotes the student model;
S62: establishing the loss function L_KD of the student model's output layer, where n denotes the batch size, the prediction probability of the i-th sample by the student model is compared with the prediction probability of the i-th sample in the final soft label, T is the temperature parameter of the distillation model, and D_KL denotes the KL-divergence loss function;
S63: passing the student representation through a linear layer and a softmax activation layer in turn to obtain the probability output of the unlabeled sample set D_u, where W_s denotes the weight matrix to be learned in the student model's linear layer;
S64: updating the learnable parameters of the student model using the loss function L_KD;
S65: repeating S61-S64 until the loss function L_KD converges, obtaining the distilled student model.
9. The few-sample emotion classification method based on large and small teacher knowledge distillation according to claim 8, characterized in that: in the word-vector matrix A_u, each row is the word vector of one character of the input sample x_u', and the word vector of each character is obtained by training a word2vec or GloVe model.
10. The few-sample emotion classification method based on large and small teacher knowledge distillation according to claim 8, characterized in that: the KL divergence loss function between two probability distributions p and q over the label classes is D_KL(p || q) = Σ_{l=1..|L|} p_l log(p_l / q_l), where |L| is the number of label classes in the classification task.
CN202210653730.6A 2022-06-10 2022-06-10 Little sample emotion classification method based on size instructor knowledge distillation Active CN114722805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210653730.6A CN114722805B (en) 2022-06-10 2022-06-10 Little sample emotion classification method based on size instructor knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210653730.6A CN114722805B (en) 2022-06-10 2022-06-10 Little sample emotion classification method based on size instructor knowledge distillation

Publications (2)

Publication Number Publication Date
CN114722805A CN114722805A (en) 2022-07-08
CN114722805B true CN114722805B (en) 2022-08-30

Family

ID=82232411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210653730.6A Active CN114722805B (en) 2022-06-10 2022-06-10 Little sample emotion classification method based on size instructor knowledge distillation

Country Status (1)

Country Link
CN (1) CN114722805B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116186200B (en) * 2023-01-19 2024-02-09 北京百度网讯科技有限公司 Model training method, device, electronic equipment and storage medium
CN116861302B (en) * 2023-09-05 2024-01-23 吉奥时空信息技术股份有限公司 Automatic case classifying and distributing method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762144B (en) * 2021-09-05 2024-02-23 东南大学 Deep learning-based black smoke vehicle detection method
CN113886562A (en) * 2021-10-02 2022-01-04 智联(无锡)信息技术有限公司 AI resume screening method, system, equipment and storage medium
CN114168844A (en) * 2021-11-11 2022-03-11 北京快乐茄信息技术有限公司 Online prediction method, device, equipment and storage medium
CN114283402B (en) * 2021-11-24 2024-03-05 西北工业大学 License plate detection method based on knowledge distillation training and space-time combined attention

Also Published As

Publication number Publication date
CN114722805A (en) 2022-07-08

Similar Documents

Publication Publication Date Title
CN108984526B (en) Document theme vector extraction method based on deep learning
CN111177374B (en) Question-answer corpus emotion classification method and system based on active learning
CN109376242B (en) Text classification method based on cyclic neural network variant and convolutional neural network
CN110188358B (en) Training method and device for natural language processing model
CN109766277B (en) Software fault diagnosis method based on transfer learning and DNN
CN114722805B (en) Little sample emotion classification method based on size instructor knowledge distillation
US11900250B2 (en) Deep learning model for learning program embeddings
CN111914885A (en) Multitask personality prediction method and system based on deep learning
CN113128233B (en) Construction method and system of mental disease knowledge map
CN111858896A (en) Knowledge base question-answering method based on deep learning
CN115309910B (en) Language-text element and element relation joint extraction method and knowledge graph construction method
CN113988079A (en) Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method
CN115270752A (en) Template sentence evaluation method based on multilevel comparison learning
CN116737581A (en) Test text generation method and device, storage medium and electronic equipment
CN114722198A (en) Method, system and related device for determining product classification code
EP3977392A1 (en) Method for training a discriminator
Arifin et al. Automatic essay scoring for Indonesian short answers using siamese Manhattan long short-term memory
CN113283605B (en) Cross focusing loss tracing reasoning method based on pre-training model
Lin et al. Robust educational dialogue act classifiers with low-resource and imbalanced datasets
CN114997175A (en) Emotion analysis method based on field confrontation training
CN113821571A (en) Food safety relation extraction method based on BERT and improved PCNN
LU504829B1 (en) Text classification method, computer readable storage medium and system
CN116205217B (en) Small sample relation extraction method, system, electronic equipment and storage medium
KR102535417B1 (en) Learning device, learning method, device and method for important document file discrimination
CN114218923B (en) Text abstract extraction method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant