CN114722805A - Few-shot emotion classification method based on large-and-small-teacher knowledge distillation - Google Patents
- Publication number
- CN114722805A (application number CN202210653730.6A)
- Authority
- CN
- China
- Prior art keywords
- model
- sample
- teacher
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2155—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to a few-shot emotion classification method based on knowledge distillation from a large teacher model and a small teacher model. A large number of unlabeled samples and a small number of labeled samples are collected for the emotion classification task, and the labeled samples are used to train a large teacher model and a small teacher model. All unlabeled samples are first passed through the small teacher model to obtain the uncertainty of each sample's predicted probability; the samples whose probability is uncertain are then screened out by a threshold and passed through the large teacher model. The probability outputs of the large and small teacher models are combined into soft labels used to distill a student model, and the distilled student model performs classification prediction. The invention reduces the number of visits to the large teacher model, shortens the distillation time when training the student model, lowers resource consumption, and at the same time improves classification accuracy.
Description
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a few-shot emotion classification method based on large-and-small-teacher knowledge distillation.
Background
The emotion classification task aims to automatically judge the emotion polarity (e.g., negative or positive) expressed by a text. The task is a research hotspot in natural language processing, is widely used in application systems such as intention mining, information retrieval, and question answering, and is a basic building block of those systems. Few-shot emotion classification means that only a small number of labeled samples are available when training the classifier.
For few-shot emotion classification, machine learning and deep learning algorithms are generally used to extract the emotion expressed by a piece of text; the most common approach models the problem as taking a piece of text as input and outputting a label. The prior art is generally divided into the following steps: (1) an expert annotates a small number of texts with different polarity labels, each text serving as one sample, yielding a small corpus of labeled samples with balanced polarity labels; (2) a prompt-based large-scale pre-trained language model (e.g., GPT-3) is trained with the small set of labeled samples to obtain a classification model; (3) a text with an unknown label is fed to the classification model to obtain its polarity label; during testing, one text is input to the classification model at a time. The network structure of the prompt-based large-scale pre-trained language model in step (2) is shown in FIG. 1, where "[CLS] x [SEP]" is the input sentence: [CLS] marks the beginning of the sentence, [SEP] marks the separation between sentences, and x is the sentence to be classified. The "MLM head" in FIG. 1 is the fixed masked-language-model head of the prompt-based large-scale pre-trained language model. For the input sentence "[CLS] I will recommend them to everyone! It is [MASK]. [SEP]", the MLM head predicts the positive label word "good" at the [MASK] position, giving the output "I will recommend them to everyone! It is good."
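The prompt construction in step (2) can be sketched as follows; this is a minimal illustration, with the template string taken from the example above (the function name is hypothetical, not from any disclosed implementation):

```python
def build_prompt(text: str) -> str:
    # Wrap an input sentence with the cloze template used by the
    # prompt-based model: "[CLS] x It is [MASK]. [SEP]".
    return f"[CLS] {text} It is [MASK]. [SEP]"

prompt = build_prompt("I will recommend them to everyone!")
```

The MLM head then scores vocabulary words at the [MASK] position; for a positive review, a word like "good" should score highest.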
Because few-shot emotion classification has few training samples, common shallow neural networks (e.g., CNN, LSTM) and deep pre-trained language models (e.g., BERT, RoBERTa) have difficulty judging the semantics of some texts correctly, so classification accuracy is not high enough. The GPT-3 large-scale model of the prior art has 175 billion parameters and performs excellently on few-shot learning tasks when several input-output examples are prepended as context. However, its parameter count is so large that calling the model consumes expensive computing resources and inference is slow, which hinders practical application.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the defects of the prior art and to provide a few-shot emotion classification method based on large-and-small-teacher knowledge distillation, which can effectively reduce the number of visits to the large teacher model and the distillation time when training the student model, and improve classification accuracy while reducing resource consumption.
To solve the above technical problem, the invention provides a few-shot emotion classification method based on large-and-small-teacher knowledge distillation, comprising the following steps:
S1: divide the samples into labeled samples x_l and unlabeled samples x_u', collect a large number of unlabeled samples x_u' on the emotion classification task, and build the labeled sample set D_l = {(x_l, y)} and the unlabeled sample set D_u = {x_u'};
S2: construct a large teacher model and a small teacher model; train the large teacher model with the labeled sample set D_l to obtain the trained large teacher model M_L, and train the small teacher model with D_l to obtain the trained small teacher model M_B;
S3: predict all unlabeled samples x_u' with the trained small teacher model M_B to obtain the sample probabilities, and compute the uncertainty u of each sample's probability;
S4: compare the uncertainty u with a preset threshold and screen out the samples x_u'' whose probability is uncertain;
S5: feed the samples x_u' to the trained small teacher model M_B to obtain the small teacher's soft labels P, feed the samples x_u'' to the trained large teacher model M_L to obtain the large teacher's soft labels P', and combine P and P' into the final soft labels P̃;
S6: construct a student model, and distill it with the unlabeled sample set D_u and the soft labels P̃ to obtain a distilled student model;
S7: perform classification prediction on the test set with the distilled student model.
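The screening logic of steps S3-S5 can be sketched as follows. This is an illustrative outline only: the function names are hypothetical, toy callables stand in for M_B and M_L, and a normalized-entropy uncertainty is assumed in place of the measure the patent leaves unspecified:

```python
import math

def uncertainty(probs):
    # Assumed measure: normalized entropy of the predicted distribution,
    # in [0, 1]; 1.0 means a uniform (maximally uncertain) prediction.
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def route_soft_labels(samples, small_teacher, big_teacher, threshold):
    # S3: the small teacher predicts every unlabeled sample.
    # S4: samples whose uncertainty exceeds the threshold are screened out.
    # S5: only those visit the big teacher; the rest keep the small
    #     teacher's output, forming the final soft labels.
    soft_labels, big_visits = [], 0
    for x in samples:
        p = small_teacher(x)
        if uncertainty(p) > threshold:
            p = big_teacher(x)
            big_visits += 1
        soft_labels.append(p)
    return soft_labels, big_visits
```

With a threshold such as 0.85, only near-uniform small-teacher predictions trigger a big-teacher call, which is how the method reduces visits to the large model.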
Preferably, the large teacher model and the small teacher model are both prompt-based pre-trained language models M, and the parameter count of the large teacher model is greater than that of the small teacher model.
Preferably, training the large teacher model with the labeled sample set D_l to obtain the trained large teacher model M_L specifically comprises:
S21: in the training set D_l = {(x_l, y)} = {(x, y)}, x denotes an input sample and y denotes its true label; add a prompt template to the input sample x and convert it into a cloze (fill-in-the-blank) form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in, P(x) is the input to the language model, and "It is [MASK]." is the prompt template appended to the input text;
S22: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function v: L → V;
with P(x) as input, obtain from the prompt-based pre-trained language model M the score of [MASK] on each label l, s_l = M([MASK] = v(l) | P(x)),
where v(l) denotes the label word corresponding to label l and k is the length of the label word;
S23: use a softmax layer to predict the class probability of [MASK] over the different labels l, and obtain the emotion classification of the input sample x from the class probabilities;
S24: establish the loss function of the large teacher model's output layer;
S25: repeat S22-S24 until the large teacher model converges; training is finished and the trained large teacher model M_L is obtained.
Training the small teacher model with the labeled sample set D_l to obtain the trained small teacher model M_B specifically comprises:
S26: in the training set D_l = {(x_l, y)} = {(x, y)}, x denotes an input sample and y denotes its true label; add a prompt template to the input sample x and convert it into a cloze form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in;
S27: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function v: L → V;
obtain from the prompt-based pre-trained language model M the score of [MASK] on each label l, s_l = M([MASK] = v(l) | P(x)),
where v(l) denotes the label word corresponding to label l and k is the length of the label word;
S28: use a softmax layer to predict the class probability of [MASK] over the different labels l, and obtain the emotion classification of the input sample x from the class probabilities;
S29: establish the loss function of the small teacher model's output layer;
S210: repeat S27-S29 until the small teacher model converges; training is finished and the trained small teacher model M_B is obtained.
Preferably, predicting all unlabeled samples x_u' with the trained small teacher model M_B to obtain the sample probabilities and computing the uncertainty u of each sample's probability specifically comprises:
S31: feed all unlabeled samples x_u' to the trained small teacher model M_B; the predicted probability distribution is P = M_B(x_u'), a distribution over the |L| labels,
where |L| is the number of label classes in the classification task.
Preferably, comparing the uncertainty u with the preset threshold and screening out the samples x_u'' whose probability is uncertain is specifically:
if the uncertainty u of a sample's probability is greater than the threshold, the sample is taken as a sample x_u'' with high probability uncertainty.
Preferably, feeding the samples x_u' to the trained small teacher model M_B to obtain the small teacher's soft labels P, feeding the samples x_u'' to the trained large teacher model M_L to obtain the large teacher's soft labels P', and combining P and P' into the final soft labels P̃ specifically comprises:
S51: feed the samples x_u' to the trained small teacher model M_B to obtain the small teacher's soft labels P;
S52: feed the samples x_u'' to the trained large teacher model M_L to obtain the large teacher's soft labels P'; for each screened-out sample x_u'' the final soft label P̃ takes the large teacher's output P', and for the remaining samples it takes the small teacher's output P.
Preferably, distilling the student model with the unlabeled sample set D_u and the soft labels P̃ to obtain a distilled student model proceeds as follows:
S61: take the unlabeled sample set D_u as the training set for distilling the student model; the vector representation produced by the student model is h^s = g(A_u; θ^s), where g() denotes the network function of the student model, A_u is the word-vector matrix corresponding to the unlabeled sample set D_u, the superscript s denotes the student model, and θ^s denotes the learnable parameters of the student model;
S62: establish the loss function of the student model's output layer, L_KD = (1/n) Σ_{i=1}^{n} D_KL(P̃_i || P_i^s), where n denotes the batch size, P_i^s denotes the student model's predicted probability for the i-th sample, P̃_i denotes the i-th sample's probability in the final soft labels P̃, T is the temperature parameter of the distillation model, and D_KL denotes the KL-divergence loss function;
S63: pass h^s through a linear layer and a softmax activation layer in turn to obtain the probability output P^s = softmax(W^s h^s) for the unlabeled sample set D_u, where W^s denotes the weight matrix to be learned in the student model's linear layer;
S64: update the learnable parameters θ^s of the student model with the loss function L_KD;
S65: repeat S61-S64 until the loss function L_KD converges, obtaining the distilled student model.
Preferably, each row of the word-vector matrix A_u is the word-vector representation of one character of the input sample x_u', and the word vector of each character is obtained by training a word2vec or GloVe model.
Preferably, the KL-divergence loss function is expressed as D_KL(P̃_i || P_i^s) = Σ_{l=1}^{|L|} P̃_i(l) · log( P̃_i(l) / P_i^s(l) ), where |L| is the number of label classes in the classification task.
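This divergence can be transcribed directly as a sketch in plain Python, assuming distributions are represented as lists of floats over the |L| labels:

```python
import math

def kl_divergence(p_soft, p_student):
    # D_KL(P~_i || P_i^s) = sum over the |L| labels of
    # P~_i(l) * log(P~_i(l) / P_i^s(l)); terms with P~_i(l) = 0 vanish.
    return sum(p * math.log(p / q)
               for p, q in zip(p_soft, p_student) if p > 0)
```

The divergence is 0 only when the student's distribution matches the soft label exactly, and grows as the two distributions diverge.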
Compared with the prior art, the technical scheme of the invention has the following advantages:
by building a large teacher model and a small teacher model to distill the student model, samples are screened by the small teacher model before passing through the large teacher model, which effectively reduces the distillation time of the student model and lowers resource consumption; meanwhile, because the two teachers reduce resource consumption, a large number of unlabeled samples for the emotion classification task can be collected, which improves classification accuracy.
Drawings
In order that the disclosure may be more readily and clearly understood, the following is a detailed description of specific embodiments of the invention, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a network structure of a prompt-based large-scale pre-trained language model;
FIG. 2 is a schematic diagram of the structure of a conventional single teacher and single student knowledge distillation method;
- FIG. 3 is a schematic diagram of the structure of the knowledge distillation method based on the large-and-small-teacher mechanism;
- FIG. 4 is a graph of the experimental results on the YELP and IMDB datasets under the BERT models in an embodiment of the invention;
- FIG. 5 is a graph of the experimental results on the YELP and IMDB datasets under the RoBERTa models in an embodiment of the invention.
Detailed Description
The invention is further described below with reference to the drawings and specific embodiments, so that those skilled in the art can better understand and practice it; the embodiments, however, do not limit the invention.
In model optimization, a large model is often a single complex network or an ensemble of several networks, with good performance and generalization ability, while a small model has limited expressive power because of its small network size. The knowledge learned by the large model (teacher model) can therefore be used to guide the training of the small model (student model), so that the small model attains performance close to that of the large model with a greatly reduced parameter count, achieving model compression and acceleration; this process is called distillation.
Compared with the conventional single-teacher, single-student knowledge distillation method shown in FIG. 2, the method of the invention shown in FIG. 3 additionally uses a large number of unlabeled samples and introduces two prompt-based teacher models, a large teacher model and a small teacher model; in the figure, P^s is the output probability of the student model.
The few-shot emotion classification method based on large-and-small-teacher knowledge distillation of the invention comprises the following steps:
S1: divide the samples into labeled samples x_l and unlabeled samples x_u', collect a large number of unlabeled samples x_u' on the emotion classification task, and build the labeled sample set D_l = {(x_l, y)} and the unlabeled sample set D_u = {x_u'}. A labeled sample x_l is a sample that carries a label and an unlabeled sample x_u' is one that does not; the samples comprise a small number of labeled samples x_l and a large number of unlabeled samples x_u'.
S2: construct a large teacher model and a small teacher model, both of which are prompt-based pre-trained language models M (i.e., they use the prompt method); the parameter count of the large teacher model is greater than that of the small teacher model, and in this embodiment much greater. Train the large teacher model with the labeled sample set D_l to obtain the trained large teacher model M_L, and train the small teacher model with D_l to obtain the trained small teacher model M_B.
The large teacher model and the small teacher model are each trained with the labeled sample set D_l; the two training procedures are similar, and proceed as follows:
S21: in the training set D_l = {(x_l, y)} = {(x, y)}, x denotes an input sample and y denotes its true label; add a prompt template to the input sample x and convert it into a cloze form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in. The purpose is to let the prompt-based pre-trained language model M decide the word to fill in at [MASK], converting the classification task into a cloze task. The prompt template "It is [MASK]." is appended to the input text, [MASK] corresponds to the different labels of the classification task, and the new input is passed through the language model, which decides the filler word at [MASK] and thereby classifies the text.
S22: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function v: L → V, which maps each task label to one or more words in the vocabulary of the prompt-based pre-trained language model M. For example, in a binary emotion classification task, class 0 corresponds to the vocabulary word "terrible" and class 1 to the vocabulary word "great". With P(x) as input, the prompt-based pre-trained language model M gives the score of [MASK] on each label l, s_l = M([MASK] = v(l) | P(x)),
where v(l) denotes the label word corresponding to label l and k is the length of the label word.
S23: use a softmax layer to predict the class probability of [MASK] over the different labels l, p(l | x) = exp(s_l) / Σ_{l'∈L} exp(s_{l'}), and obtain the emotion classification of the input sample x from the class probabilities.
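Steps S22-S23 can be sketched as follows; the logits are hypothetical stand-ins for the MLM head's scores at the [MASK] position, and single-token label words are assumed:

```python
import math

# Assumed label mapping v: L -> V for binary emotion classification,
# following the "terrible"/"great" example above.
LABEL_WORDS = {0: "terrible", 1: "great"}

def class_probs(mask_logits, label_words=LABEL_WORDS):
    # S22: read off the MLM score of each label's word at [MASK];
    # S23: softmax over the |L| labels to get class probabilities.
    labels = sorted(label_words)
    scores = [mask_logits[label_words[l]] for l in labels]
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return {l: e / z for l, e in zip(labels, exps)}
```

With toy logits where "great" outscores "terrible", class 1 receives the higher probability.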
S24: establish the loss function of the large teacher model's output layer; in this embodiment the loss function is the cross-entropy, which measures the difference between the true label y of a training sample and the predicted probability.
S25: repeat S22-S24 until the large teacher model converges; training is finished and the trained large teacher model M_L is obtained.
S26: in the training set D_l = {(x_l, y)} = {(x, y)}, x denotes an input sample and y denotes its true label; add a prompt template to the input sample x and convert it into a cloze form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in.
S27: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function v: L → V;
the prompt-based pre-trained language model M gives the score of [MASK] on each label l, s_l = M([MASK] = v(l) | P(x)),
where v(l) denotes the label word corresponding to label l and k is the length of the label word.
S28: use a softmax layer to predict the class probability of [MASK] over the different labels l, and obtain the emotion classification of the input sample x from the class probabilities.
S29: establish the loss function of the small teacher model's output layer.
S210: repeat S27-S29 until the small teacher model converges; training is finished and the trained small teacher model M_B is obtained.
S3: predict all unlabeled samples x_u' with the trained small teacher model M_B to obtain the sample probabilities, and compute the uncertainty u of each sample's probability.
S31: feed all unlabeled samples x_u' to the trained small teacher model M_B; the predicted probability distribution is P = M_B(x_u'), a distribution over the |L| labels,
where |L| is the number of label classes in the classification task. The uncertainty u measures the quality of the sample's predicted probability.
S4: compare the uncertainty u with a preset threshold and screen out the samples whose probability is uncertain; the preset threshold takes a value in the range (0, 1).
If the uncertainty u of a sample's probability is greater than the threshold, the sample is taken as a sample x_u'' with high probability uncertainty. An uncertainty above the threshold means that the small teacher's classification probability for the sample x_u' is not confident enough, and a new probability distribution must be obtained from the large teacher model.
S5: feed the samples x_u' to the trained small teacher model M_B to obtain the small teacher's soft labels P, feed the samples x_u'' to the trained large teacher model M_L to obtain the large teacher's soft labels P', and combine P and P' into the final soft labels P̃.
S51: feed the samples x_u' to the trained small teacher model M_B to obtain the small teacher's soft labels P;
S52: feed the samples x_u'' to the trained large teacher model M_L to obtain the large teacher's soft labels P'; for each screened-out sample x_u'' the final soft label P̃ takes the large teacher's output P', and for the remaining samples it takes the small teacher's output P.
S6: construct a student model; in this embodiment the student model is a small, shallow neural network. Distill the student model with the unlabeled sample set D_u and the soft labels P̃ to obtain a distilled student model.
S61: take the unlabeled sample set D_u as the training set for distilling the student model; the vector representation produced by the student model is h^s = g(A_u; θ^s), where g() denotes the network function of the student model and A_u is the word-vector matrix corresponding to the unlabeled sample set D_u; for an unlabeled sample x_u' of length k with word vectors of dimension d, A_u ∈ R^{k×d}. The superscript s denotes the student model and θ^s denotes the learnable parameters of the student model.
Each row of the word-vector matrix A_u is the word-vector representation of one character of the input sample x_u', obtained by training a word2vec or GloVe model.
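A toy construction of the per-sample matrix A_u is sketched below; the hypothetical embedding dictionary stands in for trained word2vec/GloVe vectors, and zero-padding to length k is an assumption, since the patent does not fix a padding scheme:

```python
def sample_matrix(tokens, embeddings, k, d):
    # Build the k x d word-vector matrix A_u for one sample: one row per
    # token, truncated or zero-padded to exactly k rows of dimension d.
    rows = [embeddings.get(t, [0.0] * d) for t in tokens[:k]]
    rows += [[0.0] * d for _ in range(k - len(rows))]
    return rows
```

Stacking these matrices over the batch gives the input to the student network g().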
S62: establish the loss function of the student model's output layer, i.e., the loss used when the teachers distill the student model: L_KD = (1/n) Σ_{i=1}^{n} D_KL(P̃_i || P_i^s), where n denotes the batch size, P_i^s denotes the student model's predicted probability for the i-th sample, P̃_i denotes the i-th sample's probability in the final soft labels P̃, and T is the temperature parameter carried by the distillation model: the larger T is, the smoother the softmax probability distribution, the larger the entropy of the distribution, and the more information it carries. D_KL denotes the KL-divergence loss function,
expressed as D_KL(P̃_i || P_i^s) = Σ_{l=1}^{|L|} P̃_i(l) · log( P̃_i(l) / P_i^s(l) ), where |L| is the number of label classes in the classification task.
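A sketch of the batch loss L_KD with temperature softening follows. The placement of T inside the student's softmax follows the common Hinton-style convention; the patent only names T as the distillation temperature, so treat this as one plausible reading rather than the disclosed formula:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-softened softmax: larger T gives a smoother,
    # higher-entropy distribution.
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, soft_labels, T=2.0):
    # L_KD = (1/n) * sum_i D_KL(P~_i || softmax(z_i^s / T)) over a batch
    # of n samples, with the student's logits z_i^s softened by T.
    n = len(student_logits)
    total = 0.0
    for z, p_soft in zip(student_logits, soft_labels):
        q = softmax(z, T)
        total += sum(p * math.log(p / qj)
                     for p, qj in zip(p_soft, q) if p > 0)
    return total / n
```

The loss vanishes when the student's softened distribution matches the soft labels and is positive otherwise, so minimizing it pulls the student toward the teachers' combined output.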
S63: pass h^s through a linear layer and a softmax activation layer in turn to obtain the probability output P^s = softmax(W^s h^s) for the unlabeled sample set D_u, where W^s denotes the weight matrix to be learned in the student model's linear layer.
S64: update the learnable parameters θ^s of the student model with the loss function L_KD.
S65: repeat S61-S64 until the loss function L_KD converges, obtaining the distilled student model.
S7: perform classification prediction on the test set with the distilled student model.
The invention has the beneficial effects that:
By building a large teacher model and a small teacher model to distill the student model, samples are screened by the small teacher model before passing through the large teacher model, which effectively reduces the distillation time of the student model and lowers resource consumption; meanwhile, because the two teachers reduce resource consumption, a large number of unlabeled samples for the emotion classification task can be collected, which improves classification accuracy.
To further illustrate the beneficial effects of the invention, in this embodiment the test set is fed to the trained student model to obtain prediction probabilities. The effect of the invention is analyzed in three respects: (1) the accuracy of the classification results obtained by the student model on the test set; (2) the teachers' prediction time over all unlabeled samples during distillation; and (3) the reduction in the access rate to the large teacher model.
In this embodiment, a sentence-level YELP dataset (see Zhang X, Zhao J, LeCun Y. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 2015, 28) and a document-level IMDB dataset are used. In the experiments, 8 polarity-balanced samples are selected from each dataset as the training set and as the validation set, respectively, and 500 positive and 500 negative samples as the test set. Further, the YELP dataset has 100,000 unlabeled samples and the IMDB dataset has 98,000.
To simulate the knowledge distillation process of the large-and-small-teacher mechanism, large and small teacher models are set up under both a BERT model and a RoBERTa model, denoted BERT-large (the large teacher under BERT), BERT-base (the small teacher under BERT), RoBERTa-large (the large teacher under RoBERTa), and RoBERTa-base (the small teacher under RoBERTa). When training the teacher models, the label words are "terrible" and "great"; the batch size is set to 4 or 8; the optimizer is AdamW, with the learning rate chosen from {1e-5, 2e-5, 5e-5} and the weight decay set to 1e-3; the batch size and learning rate are determined by a grid search over the hyper-parameters. The student model is a CNN using 3 convolution-kernel sizes, (3, 50), (4, 50), and (5, 50), with 100 kernels of each size; each CNN uses glove.6B.50d word vectors; the batch size is set to 128; the optimizer is Adam, with the learning rate set to 1e-3 and the weight decay to 1e-5. To prevent overfitting during training of the neural network model, the Dropout parameter is set to 0.5.
The experimental results for the BERT models on the YELP and IMDB datasets are shown in FIG. 4, where the uncertainty threshold is set to 0.85 on both datasets; the results for the RoBERTa models are shown in FIG. 5, where the threshold is set to 0.6 on YELP and 0.9 on IMDB. In FIGS. 4 and 5, "Fine-tuning" denotes standard fine-tuning of the pre-trained language model, "LM-BFF" denotes prompt-based fine-tuning of the pre-trained language model, and "LM-BFF distills CNN" denotes distilling the CNN model with the prompt-based pre-trained language model. Because few-shot learning is sensitive to the data and different training splits give quite different results, 5 different random seeds are used to sample different training and validation sets to mitigate this problem; the accuracy of the classification results is reported in the form "mean of the 5 results (variance of the 5 results)".
As can be seen from the classification accuracy in fig. 4, compared with distillation from the BERT-large model, the distillation performance of the method of the invention improves by 91.13% - 90.64% = 0.49% on the YELP dataset and by 84.14% - 84.08% = 0.06% on the IMDB dataset. Moreover, compared with the distillation performance of the BERT-base model, the method achieves 91.13% > 87.18% on YELP and 84.14% > 84.08% on IMDB, far exceeding the BERT-base result. As can be seen from the prediction time in fig. 4, compared with the BERT-large teacher model, the distillation time of the method is reduced by 91.93s/163.27s = 56.31% on YELP and by 962.37s/1598.34s = 60.21% on IMDB. Meanwhile, the simulation program counts the reduction in the large teacher model's access rate (i.e. one minus the ratio of the number of unlabeled samples passing through the large teacher model under the big-and-small teacher mechanism to the total number of unlabeled samples): compared with BERT-large, the access rate is reduced by 74.40% on YELP and by 72.42% on IMDB.
From the classification accuracy in fig. 5, it can be seen that, compared with distillation from the RoBERTa-large model, the distillation performance of the method of the invention improves by 93.16% - 92.80% = 0.36% on the YELP dataset and by 87.84% - 87.64% = 0.2% on the IMDB dataset. Moreover, compared with the distillation performance of the RoBERTa-base model, the method achieves 93.16% > 91.82% on YELP and 87.84% > 87.64% on IMDB, exceeding the RoBERTa-base result. As can be seen from the prediction time in fig. 5, compared with the RoBERTa-large teacher model, the distillation time of the method is reduced by 75.59s/163.32s = 46.28% on YELP and by 912.65s/1594.93s = 57.22% on IMDB. Meanwhile, the simulation program counts the reduction in the large teacher model's access rate (i.e. one minus the ratio of the number of unlabeled samples passing through the large teacher model under the big-and-small teacher mechanism to the total number of unlabeled samples): compared with RoBERTa-large, the access rate is reduced by 84.65% on YELP and by 75.56% on IMDB.
The access frequency is proportional to the resources consumed. All unlabeled samples pass through the small teacher model, while only the small number of samples screened out by the threshold pass through the large teacher model. Compared with passing all unlabeled samples through the large teacher model, this greatly reduces resource consumption; since the small teacher model has relatively few parameters and occupies little computing resource, only the reduction in the large teacher model's access rate is analyzed in the simulation.
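The routing and access-rate accounting above can be sketched as follows. The uncertainty values here are random stand-ins, not real small-teacher outputs, and the 0.85 threshold is the one quoted for the BERT experiments; with uniform stand-in uncertainties the measured reduction simply tracks the threshold:

```python
import random

random.seed(1)

threshold = 0.85          # uncertainty threshold used on YELP/IMDB under BERT
n_unlabeled = 10_000

# Stand-in uncertainties; in the method these come from the small teacher's
# predictive distribution on each unlabeled sample.
uncertainties = [random.random() for _ in range(n_unlabeled)]

# Every sample visits the small teacher; only high-uncertainty samples
# additionally visit the large teacher.
big_teacher_visits = sum(1 for u in uncertainties if u > threshold)
reduction = 1 - big_teacher_visits / n_unlabeled   # fraction of large-teacher calls avoided
```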
The simulation results further show that, during student-model training, the method effectively reduces both the number of accesses to the large teacher model and the distillation time, improving classification accuracy while reducing resource consumption.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaustively list all embodiments. Obvious variations or modifications derived therefrom remain within the scope of the invention.
Claims (10)
1. A few-sample emotion classification method based on big-and-small teacher knowledge distillation, characterized by comprising the following steps:
S1: divide the samples into labeled samples x_l and unlabeled samples x_u′, collecting a large number of unlabeled samples x_u′ of the emotion classification task; form the labeled sample set D_l = {x_l} and the unlabeled sample set D_u = {x_u′};
S2: construct a large teacher model and a small teacher model; train the large teacher model with the labeled sample set D_l to obtain the trained large teacher model M_L, and train the small teacher model with the labeled sample set D_l to obtain the trained small teacher model M_B;
S3: use the trained small teacher model M_B to predict all unlabeled samples x_u′, obtaining each sample's probability, and calculate the uncertainty of each sample's probability;
S4: compare the uncertainty with a preset threshold and screen out the samples x_u″ whose probabilities are uncertain;
S5: input the samples x_u′ into the trained small teacher model M_B to obtain the small teacher's soft labels P, and input the samples x_u″ into the trained large teacher model M_L to obtain the large teacher's soft labels P′; combine the small teacher's soft labels P and the large teacher's soft labels P′ to obtain the final soft labels;
S6: construct a student model and distill it using the unlabeled sample set D_u and said final soft labels, obtaining the distilled student model;
S7: perform classification prediction on the test set using the distilled student model.
2. The few-sample emotion classification method based on big-and-small teacher knowledge distillation of claim 1, wherein: the large teacher model and the small teacher model are both prompt-based pre-trained language models M, and the parameter count of the large teacher model is greater than that of the small teacher model.
3. The few-sample emotion classification method based on big-and-small teacher knowledge distillation of claim 2, wherein training the large teacher model with the labeled sample set D_l to obtain the trained large teacher model M_L specifically comprises:
S21: in the training set D_l = {x_l} = {x, y}, x denotes an input sample and y its true label; add a prompt template to the input sample x, converting it into a cloze (fill-in-the-blank) task:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the token to be filled, P(x) is the input to the language model, and "It is [MASK]" is the prompt template appended to the input text;
S22: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function; through the prompt-based pre-trained language model M, obtain the score of [MASK] on the label word corresponding to each label l, where the label word corresponding to label l has length k;
S23: predict the class probability of [MASK] over the different labels l through a softmax layer, obtaining the emotion classification of the input sample x;
S24: establish the loss function of the large teacher model's output layer;
S25: repeat S22-S24 until the large teacher model converges; training ends, yielding the trained large teacher model M_L;
Training the small teacher model with the labeled sample set D_l to obtain the trained small teacher model M_B specifically comprises:
S26: in the training set D_l = {x_l} = {x, y}, x denotes an input sample and y its true label; add a prompt template to the input sample x, converting it into a cloze task:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the token to be filled;
S27: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function; through the prompt-based pre-trained language model M, obtain the score of [MASK] on the label word corresponding to each label l, where the label word corresponding to label l has length k;
S28: predict the class probability of [MASK] over the different labels l through a softmax layer, obtaining the emotion classification of the input sample x;
S29: establish the loss function of the small teacher model's output layer;
S210: repeat S27-S29 until the small teacher model converges; training ends, yielding the trained small teacher model M_B.
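Steps S21-S23 (and their mirrors S26-S28) can be sketched as follows. The template and the label words "terrible"/"great" come from the description; the [MASK] scores below are made-up stand-ins for what the pre-trained language model would actually output:

```python
import math

def build_prompt(x: str) -> str:
    # Cloze template from S21/S26: P(x) = [CLS] x It is [MASK]. [SEP]
    return f"[CLS] {x} It is [MASK]. [SEP]"

def label_probabilities(mask_scores: dict) -> dict:
    # S23/S28: softmax over the [MASK] scores of the label words gives the
    # class probabilities of the input sample.
    m = max(mask_scores.values())
    exps = {w: math.exp(s - m) for w, s in mask_scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

prompt = build_prompt("A wonderful, moving film")
# Hypothetical scores assigned to the label words at the [MASK] position.
probs = label_probabilities({"terrible": 1.3, "great": 3.1})
```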
4. The few-sample emotion classification method based on big-and-small teacher knowledge distillation of claim 3, wherein using the trained small teacher model M_B to predict all unlabeled samples x_u′, obtaining each sample's probability and calculating the uncertainty of each sample's probability, specifically comprises:
S31: input all unlabeled samples x_u′ into the trained small teacher model M_B to obtain the predicted probability distribution;
wherein |L| is the number of label classes in the classification task.
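The uncertainty formula itself is an image not reproduced in this text; a common definition consistent with the use of |L| is the entropy of the small teacher's predictive distribution normalized by log|L|, so that the uncertainty lies in [0, 1] and can be compared against thresholds such as 0.85:

```python
import math

def uncertainty(p, eps=1e-12):
    # Entropy of the predictive distribution p, normalized by log|L| where
    # |L| = len(p) is the number of label classes. (An assumed definition;
    # the claim's exact formula is not shown in this text.)
    H = -sum(pi * math.log(pi + eps) for pi in p)
    return H / math.log(len(p))

u_unsure = uncertainty([0.5, 0.5])       # maximally uncertain -> ~1.0
u_confident = uncertainty([0.99, 0.01])  # confident -> near 0
```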
6. The few-sample emotion classification method based on big-and-small teacher knowledge distillation of claim 1, wherein comparing the uncertainty with the preset threshold and screening out the samples x_u″ whose probabilities are uncertain, specifically:
7. The few-sample emotion classification method based on big-and-small teacher knowledge distillation of claim 3, wherein inputting the samples x_u′ into the trained small teacher model M_B to obtain the small teacher's soft labels P, inputting the samples x_u″ into the trained large teacher model M_L to obtain the large teacher's soft labels P′, and combining P and P′ into the final soft labels, specifically comprises:
S51: input the samples x_u′ into the trained small teacher model M_B to obtain the small teacher's soft labels;
S52: input the samples x_u″ into the trained large teacher model M_L to obtain the large teacher's soft labels;
8. The few-sample emotion classification method based on big-and-small teacher knowledge distillation of any one of claims 1 to 7, wherein distilling the student model using the unlabeled sample set D_u and said soft labels to obtain the distilled student model specifically comprises:
S61: take the unlabeled sample set D_u as the training set for distilling the student model; the vector representation through the student model is denoted g(A_u; θ_s), where g() denotes the network function of the student model, A_u is the word-vector matrix corresponding to the unlabeled sample set D_u, the superscript s denotes the student model, and θ_s denotes the learnable parameters of the student model;
S62: establish the loss function L_KD of the student model's output layer from the KL-divergence loss D_KL between the student model's prediction probability for the i-th sample and the i-th entry of the final soft labels, where n denotes the batch size and T is the temperature parameter of the distillation model;
S63: pass the representation through the linear layer and the softmax activation layer in turn to obtain the probability output of the unlabeled sample set D_u, where W_s denotes the weight matrix to be learned in the student model's linear layer;
S64: update the learnable parameters of the student model using the loss function L_KD;
S65: repeat S61-S64 until the loss function L_KD converges, obtaining the distilled student model.
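The distillation loss of S62 can be sketched as a batch-mean, temperature-softened KL divergence between the final soft labels and the student's predictions. The T² scaling is the usual convention in knowledge distillation; the claim's exact formula is an image not reproduced in this text:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-T softmax over a list of logits.
    m = max(logits)
    exps = [math.exp((l - m) / T) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_distill_loss(student_logits, teacher_probs, T=2.0):
    # L_KD = (1/n) * sum_i T^2 * D_KL(p_i_teacher || p_i_student), where
    # p_i_student is the student's temperature-softened prediction for the
    # i-th sample and p_i_teacher is the i-th final soft label. (Assumed
    # form; the claim's exact expression is not shown.)
    n = len(student_logits)
    total = 0.0
    for logits, q in zip(student_logits, teacher_probs):
        p_s = softmax(logits, T)
        total += T * T * sum(qi * math.log(qi / psi)
                             for qi, psi in zip(q, p_s) if qi > 0)
    return total / n
```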
9. The few-sample emotion classification method based on big-and-small teacher knowledge distillation of claim 8, wherein: in the word-vector matrix A_u, each row is the word-vector representation of one character of an input sample x_u′, and each character's word vector is obtained by training a word2vec or GloVe model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210653730.6A CN114722805B (en) | 2022-06-10 | 2022-06-10 | Little sample emotion classification method based on size instructor knowledge distillation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114722805A true CN114722805A (en) | 2022-07-08 |
CN114722805B CN114722805B (en) | 2022-08-30 |
Family
ID=82232411
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210653730.6A Active CN114722805B (en) | 2022-06-10 | 2022-06-10 | Little sample emotion classification method based on size instructor knowledge distillation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114722805B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113762144A (en) * | 2021-09-05 | 2021-12-07 | 东南大学 | Deep learning-based black smoke vehicle detection method |
CN113886562A (en) * | 2021-10-02 | 2022-01-04 | 智联(无锡)信息技术有限公司 | AI resume screening method, system, equipment and storage medium |
CN114168844A (en) * | 2021-11-11 | 2022-03-11 | 北京快乐茄信息技术有限公司 | Online prediction method, device, equipment and storage medium |
CN114283402A (en) * | 2021-11-24 | 2022-04-05 | 西北工业大学 | License plate detection method based on knowledge distillation training and space-time combined attention |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115186083A (en) * | 2022-07-26 | 2022-10-14 | 腾讯科技(深圳)有限公司 | Data processing method, device, server, storage medium and product |
CN116186200A (en) * | 2023-01-19 | 2023-05-30 | 北京百度网讯科技有限公司 | Model training method, device, electronic equipment and storage medium |
CN116186200B (en) * | 2023-01-19 | 2024-02-09 | 北京百度网讯科技有限公司 | Model training method, device, electronic equipment and storage medium |
CN116861302A (en) * | 2023-09-05 | 2023-10-10 | 吉奥时空信息技术股份有限公司 | Automatic case classifying and distributing method |
CN116861302B (en) * | 2023-09-05 | 2024-01-23 | 吉奥时空信息技术股份有限公司 | Automatic case classifying and distributing method |
Also Published As
Publication number | Publication date |
---|---|
CN114722805B (en) | 2022-08-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114722805B (en) | Little sample emotion classification method based on size instructor knowledge distillation | |
CN111177374B (en) | Question-answer corpus emotion classification method and system based on active learning | |
CN110188358B (en) | Training method and device for natural language processing model | |
CN110188272B (en) | Community question-answering website label recommendation method based on user background | |
CN110619044B (en) | Emotion analysis method, system, storage medium and equipment | |
CN111914885A (en) | Multitask personality prediction method and system based on deep learning | |
CN109062958B (en) | Primary school composition automatic classification method based on TextRank and convolutional neural network | |
CN112364743A (en) | Video classification method based on semi-supervised learning and bullet screen analysis | |
CN115270752A (en) | Template sentence evaluation method based on multilevel comparison learning | |
Jishan et al. | Natural language description of images using hybrid recurrent neural network | |
Cai | Automatic essay scoring with recurrent neural network | |
Nassiri et al. | Arabic L2 readability assessment: Dimensionality reduction study | |
CN116186250A (en) | Multi-mode learning level mining method, system and medium under small sample condition | |
Aksonov et al. | Question-Answering Systems Development Based on Big Data Analysis | |
WO2020240572A1 (en) | Method for training a discriminator | |
Arifin et al. | Automatic essay scoring for Indonesian short answers using siamese Manhattan long short-term memory | |
US20220253694A1 (en) | Training neural networks with reinitialization | |
CN115391520A (en) | Text emotion classification method, system, device and computer medium | |
CN114997175A (en) | Emotion analysis method based on field confrontation training | |
Ma et al. | Enhanced hierarchical structure features for automated essay scoring | |
CN113821571A (en) | Food safety relation extraction method based on BERT and improved PCNN | |
CN112200268A (en) | Image description method based on encoder-decoder framework | |
Ratna et al. | Hybrid deep learning cnn-bidirectional lstm and manhattan distance for japanese automated short answer grading: Use case in japanese language studies | |
LU504829B1 (en) | Text classification method, computer readable storage medium and system | |
Chen et al. | An effective relation-first detection model for relational triple extraction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||