CN114722805A - Few-shot emotion classification method based on large-and-small-teacher knowledge distillation - Google Patents
- Publication number
- CN114722805A (application number CN202210653730.6A)
- Authority
- CN
- China
- Prior art keywords
- model
- sample
- teacher
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2155—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to a few-shot emotion classification method based on knowledge distillation from a large teacher model and a small teacher model. A large number of unlabeled samples and a small number of labeled samples are collected for the emotion classification task, and the labeled samples are used to train a large teacher model and a small teacher model. All unlabeled samples are first passed through the small teacher model to obtain the uncertainty of each sample's predicted probability; the samples whose probability is uncertain are then screened out by a threshold and passed through the large teacher model. The probability outputs of the large and small teacher models are combined into soft labels used to distill a student model, and the distilled student model performs classification prediction. The invention reduces the number of visits to the large teacher model, shortens the distillation time when training the student model, lowers resource consumption, and at the same time improves classification accuracy.
Description
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a few-shot emotion classification method based on large-and-small-teacher knowledge distillation.
Background
The emotion classification task aims to automatically judge the emotion polarity (e.g., negative or positive) expressed by a text. The task is a research hotspot in natural language processing, is widely used in application systems such as intention mining, information retrieval, and question answering, and is a basic building block of those systems. Few-shot emotion classification means that only a small number of labeled samples are available when training the classifier.
For few-shot emotion classification, machine learning and deep learning algorithms are generally used to extract the emotion expressed by a piece of text; the most common approach models the problem as taking a piece of text as input and outputting a label. The prior art is generally divided into the following steps: (1) an expert annotates a small number of texts with different polarity labels, each text serving as one sample, yielding a small corpus of labeled samples with balanced polarity labels; (2) a prompt-based large-scale pre-trained language model (e.g., GPT-3) is trained with the small set of labeled samples to obtain a classification model; (3) a text with an unknown label is fed to the classification model to obtain its polarity label; during testing, one text is input to the classification model at a time. The network structure of the prompt-based large-scale pre-trained language model in step (2) is shown in FIG. 1, where "[CLS] x [SEP]" is the input sentence: [CLS] marks the beginning of the sentence, [SEP] marks the separation between sentences, and x is the sentence to be classified. The "MLM head" in FIG. 1 is the fixed masked-language-model head of the prompt-based large-scale pre-trained language model. For the input sentence "[CLS] I will recommend them to everyone! It is [MASK]. [SEP]", the MLM head predicts the positive label word "good" at the [MASK] position, giving the output "I will recommend them to everyone! It is good."
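The prompt construction in step (2) can be sketched as follows; this is a minimal illustration, with the template string taken from the example above (the function name is hypothetical, not from any disclosed implementation):

```python
def build_prompt(text: str) -> str:
    # Wrap an input sentence with the cloze template used by the
    # prompt-based model: "[CLS] x It is [MASK]. [SEP]".
    return f"[CLS] {text} It is [MASK]. [SEP]"

prompt = build_prompt("I will recommend them to everyone!")
```

The MLM head then scores vocabulary words at the [MASK] position; for a positive review, a word like "good" should score highest.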
Because few-shot emotion classification has few training samples, common shallow neural networks (e.g., CNN, LSTM) and deep pre-trained language models (e.g., BERT, RoBERTa) have difficulty judging the semantics of some texts correctly, so classification accuracy is not high enough. The GPT-3 large-scale model of the prior art has 175 billion parameters and performs excellently on few-shot learning tasks when several input-output examples are prepended as context. However, its parameter count is so large that calling the model consumes expensive computing resources and inference is slow, which hinders practical application.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the defects of the prior art and to provide a few-shot emotion classification method based on large-and-small-teacher knowledge distillation, which can effectively reduce the number of visits to the large teacher model and the distillation time when training the student model, and improve classification accuracy while reducing resource consumption.
To solve the above technical problem, the invention provides a few-shot emotion classification method based on large-and-small-teacher knowledge distillation, comprising the following steps:
S1: divide the samples into labeled samples x_l and unlabeled samples x_u', collect a large number of unlabeled samples x_u' on the emotion classification task, and build the labeled sample set D_l = {(x_l, y)} and the unlabeled sample set D_u = {x_u'};
S2: construct a large teacher model and a small teacher model; train the large teacher model with the labeled sample set D_l to obtain the trained large teacher model M_L, and train the small teacher model with D_l to obtain the trained small teacher model M_B;
S3: predict all unlabeled samples x_u' with the trained small teacher model M_B to obtain the sample probabilities, and compute the uncertainty u of each sample's probability;
S4: compare the uncertainty u with a preset threshold and screen out the samples x_u'' whose probability is uncertain;
S5: feed the samples x_u' to the trained small teacher model M_B to obtain the small teacher's soft labels P, feed the samples x_u'' to the trained large teacher model M_L to obtain the large teacher's soft labels P', and combine P and P' into the final soft labels P̃;
S6: construct a student model, and distill it with the unlabeled sample set D_u and the soft labels P̃ to obtain a distilled student model;
S7: perform classification prediction on the test set with the distilled student model.
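The screening logic of steps S3-S5 can be sketched as follows. This is an illustrative outline only: the function names are hypothetical, toy callables stand in for M_B and M_L, and a normalized-entropy uncertainty is assumed in place of the measure the patent leaves unspecified:

```python
import math

def uncertainty(probs):
    # Assumed measure: normalized entropy of the predicted distribution,
    # in [0, 1]; 1.0 means a uniform (maximally uncertain) prediction.
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def route_soft_labels(samples, small_teacher, big_teacher, threshold):
    # S3: the small teacher predicts every unlabeled sample.
    # S4: samples whose uncertainty exceeds the threshold are screened out.
    # S5: only those visit the big teacher; the rest keep the small
    #     teacher's output, forming the final soft labels.
    soft_labels, big_visits = [], 0
    for x in samples:
        p = small_teacher(x)
        if uncertainty(p) > threshold:
            p = big_teacher(x)
            big_visits += 1
        soft_labels.append(p)
    return soft_labels, big_visits
```

With a threshold such as 0.85, only near-uniform small-teacher predictions trigger a big-teacher call, which is how the method reduces visits to the large model.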
Preferably, the large teacher model and the small teacher model are both prompt-based pre-trained language models M, and the parameter count of the large teacher model is greater than that of the small teacher model.
Preferably, training the large teacher model with the labeled sample set D_l to obtain the trained large teacher model M_L specifically comprises:
S21: in the training set D_l = {(x_l, y)} = {(x, y)}, x denotes an input sample and y denotes its true label; add a prompt template to the input sample x and convert it into a cloze (fill-in-the-blank) form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in, P(x) is the input to the language model, and "It is [MASK]." is the prompt template appended to the input text;
S22: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function v: L → V;
with P(x) as input, obtain from the prompt-based pre-trained language model M the score of [MASK] on each label l, s_l = M([MASK] = v(l) | P(x)),
where v(l) denotes the label word corresponding to label l and k is the length of the label word;
S23: use a softmax layer to predict the class probability of [MASK] over the different labels l, and obtain the emotion classification of the input sample x from the class probabilities;
S24: establish the loss function of the large teacher model's output layer;
S25: repeat S22-S24 until the large teacher model converges; training is finished and the trained large teacher model M_L is obtained.
Training the small teacher model with the labeled sample set D_l to obtain the trained small teacher model M_B specifically comprises:
S26: in the training set D_l = {(x_l, y)} = {(x, y)}, x denotes an input sample and y denotes its true label; add a prompt template to the input sample x and convert it into a cloze form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in;
S27: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function v: L → V;
obtain from the prompt-based pre-trained language model M the score of [MASK] on each label l, s_l = M([MASK] = v(l) | P(x)),
where v(l) denotes the label word corresponding to label l and k is the length of the label word;
S28: use a softmax layer to predict the class probability of [MASK] over the different labels l, and obtain the emotion classification of the input sample x from the class probabilities;
S29: establish the loss function of the small teacher model's output layer;
S210: repeat S27-S29 until the small teacher model converges; training is finished and the trained small teacher model M_B is obtained.
Preferably, predicting all unlabeled samples x_u' with the trained small teacher model M_B to obtain the sample probabilities and computing the uncertainty u of each sample's probability specifically comprises:
S31: feed all unlabeled samples x_u' to the trained small teacher model M_B; the predicted probability distribution is P = M_B(x_u'), a distribution over the |L| labels,
where |L| is the number of label classes in the classification task.
Preferably, comparing the uncertainty u with the preset threshold and screening out the samples x_u'' whose probability is uncertain is specifically:
if the uncertainty u of a sample's probability is greater than the threshold, the sample is taken as a sample x_u'' with high probability uncertainty.
Preferably, feeding the samples x_u' to the trained small teacher model M_B to obtain the small teacher's soft labels P, feeding the samples x_u'' to the trained large teacher model M_L to obtain the large teacher's soft labels P', and combining P and P' into the final soft labels P̃ specifically comprises:
S51: feed the samples x_u' to the trained small teacher model M_B to obtain the small teacher's soft labels P;
S52: feed the samples x_u'' to the trained large teacher model M_L to obtain the large teacher's soft labels P'; for each screened-out sample x_u'' the final soft label P̃ takes the large teacher's output P', and for the remaining samples it takes the small teacher's output P.
Preferably, distilling the student model with the unlabeled sample set D_u and the soft labels P̃ to obtain a distilled student model proceeds as follows:
S61: take the unlabeled sample set D_u as the training set for distilling the student model; the vector representation produced by the student model is h^s = g(A_u; θ^s), where g() denotes the network function of the student model, A_u is the word-vector matrix corresponding to the unlabeled sample set D_u, the superscript s denotes the student model, and θ^s denotes the learnable parameters of the student model;
S62: establish the loss function of the student model's output layer, L_KD = (1/n) Σ_{i=1}^{n} D_KL(P̃_i || P_i^s), where n denotes the batch size, P_i^s denotes the student model's predicted probability for the i-th sample, P̃_i denotes the i-th sample's probability in the final soft labels P̃, T is the temperature parameter of the distillation model, and D_KL denotes the KL-divergence loss function;
S63: pass h^s through a linear layer and a softmax activation layer in turn to obtain the probability output P^s = softmax(W^s h^s) for the unlabeled sample set D_u, where W^s denotes the weight matrix to be learned in the student model's linear layer;
S64: update the learnable parameters θ^s of the student model with the loss function L_KD;
S65: repeat S61-S64 until the loss function L_KD converges, obtaining the distilled student model.
Preferably, each row of the word-vector matrix A_u is the word-vector representation of one character of the input sample x_u', and the word vector of each character is obtained by training a word2vec or GloVe model.
Preferably, the KL-divergence loss function is expressed as D_KL(P̃_i || P_i^s) = Σ_{l=1}^{|L|} P̃_i(l) · log( P̃_i(l) / P_i^s(l) ), where |L| is the number of label classes in the classification task.
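This divergence can be transcribed directly as a sketch in plain Python, assuming distributions are represented as lists of floats over the |L| labels:

```python
import math

def kl_divergence(p_soft, p_student):
    # D_KL(P~_i || P_i^s) = sum over the |L| labels of
    # P~_i(l) * log(P~_i(l) / P_i^s(l)); terms with P~_i(l) = 0 vanish.
    return sum(p * math.log(p / q)
               for p, q in zip(p_soft, p_student) if p > 0)
```

The divergence is 0 only when the student's distribution matches the soft label exactly, and grows as the two distributions diverge.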
Compared with the prior art, the technical scheme of the invention has the following advantages:
by building a large teacher model and a small teacher model to distill the student model, samples are screened by the small teacher model before passing through the large teacher model, which effectively reduces the distillation time of the student model and lowers resource consumption; meanwhile, because the two teachers reduce resource consumption, a large number of unlabeled samples for the emotion classification task can be collected, which improves classification accuracy.
Drawings
In order that the disclosure may be more readily and clearly understood, the following is a detailed description of specific embodiments of the invention, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a network structure of a prompt-based large-scale pre-trained language model;
FIG. 2 is a schematic diagram of the structure of a conventional single teacher and single student knowledge distillation method;
- FIG. 3 is a schematic diagram of the structure of the knowledge distillation method based on the large-and-small-teacher mechanism;
- FIG. 4 is a graph of the experimental results on the YELP and IMDB datasets under the BERT models in an embodiment of the invention;
- FIG. 5 is a graph of the experimental results on the YELP and IMDB datasets under the RoBERTa models in an embodiment of the invention.
Detailed Description
The invention is further described below with reference to the drawings and specific embodiments, so that those skilled in the art can better understand and practice it; the embodiments, however, do not limit the invention.
In model optimization, a large model is often a single complex network or an ensemble of several networks, with good performance and generalization ability, while a small model has limited expressive power because of its small network size. The knowledge learned by the large model (teacher model) can therefore be used to guide the training of the small model (student model), so that the small model attains performance close to that of the large model with a greatly reduced parameter count, achieving model compression and acceleration; this process is called distillation.
Compared with the conventional single-teacher, single-student knowledge distillation method shown in FIG. 2, the method of the invention shown in FIG. 3 additionally uses a large number of unlabeled samples and introduces two prompt-based teacher models, a large teacher model and a small teacher model; in the figure, P^s is the output probability of the student model.
The few-shot emotion classification method based on large-and-small-teacher knowledge distillation of the invention comprises the following steps:
S1: divide the samples into labeled samples x_l and unlabeled samples x_u', collect a large number of unlabeled samples x_u' on the emotion classification task, and build the labeled sample set D_l = {(x_l, y)} and the unlabeled sample set D_u = {x_u'}. A labeled sample x_l is a sample that carries a label and an unlabeled sample x_u' is one that does not; the samples comprise a small number of labeled samples x_l and a large number of unlabeled samples x_u'.
S2: construct a large teacher model and a small teacher model, both of which are prompt-based pre-trained language models M (i.e., they use the prompt method); the parameter count of the large teacher model is greater than that of the small teacher model, and in this embodiment much greater. Train the large teacher model with the labeled sample set D_l to obtain the trained large teacher model M_L, and train the small teacher model with D_l to obtain the trained small teacher model M_B.
The large teacher model and the small teacher model are each trained with the labeled sample set D_l; the two training procedures are similar, and proceed as follows:
S21: in the training set D_l = {(x_l, y)} = {(x, y)}, x denotes an input sample and y denotes its true label; add a prompt template to the input sample x and convert it into a cloze form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in. The purpose is to let the prompt-based pre-trained language model M decide the word to fill in at [MASK], converting the classification task into a cloze task. The prompt template "It is [MASK]." is appended to the input text, [MASK] corresponds to the different labels of the classification task, and the new input is passed through the language model, which decides the filler word at [MASK] and thereby classifies the text.
S22: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function v: L → V, which maps each task label to one or more words in the vocabulary of the prompt-based pre-trained language model M. For example, in a binary emotion classification task, class 0 corresponds to the vocabulary word "terrible" and class 1 to the vocabulary word "great". With P(x) as input, the prompt-based pre-trained language model M gives the score of [MASK] on each label l, s_l = M([MASK] = v(l) | P(x)),
where v(l) denotes the label word corresponding to label l and k is the length of the label word.
S23: use a softmax layer to predict the class probability of [MASK] over the different labels l, p(l | x) = exp(s_l) / Σ_{l'∈L} exp(s_{l'}), and obtain the emotion classification of the input sample x from the class probabilities.
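Steps S22-S23 can be sketched as follows; the logits are hypothetical stand-ins for the MLM head's scores at the [MASK] position, and single-token label words are assumed:

```python
import math

# Assumed label mapping v: L -> V for binary emotion classification,
# following the "terrible"/"great" example above.
LABEL_WORDS = {0: "terrible", 1: "great"}

def class_probs(mask_logits, label_words=LABEL_WORDS):
    # S22: read off the MLM score of each label's word at [MASK];
    # S23: softmax over the |L| labels to get class probabilities.
    labels = sorted(label_words)
    scores = [mask_logits[label_words[l]] for l in labels]
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return {l: e / z for l, e in zip(labels, exps)}
```

With toy logits where "great" outscores "terrible", class 1 receives the higher probability.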
S24: establish the loss function of the large teacher model's output layer; in this embodiment the loss function is the cross-entropy, which measures the difference between the true label y of a training sample and the predicted probability.
S25: repeat S22-S24 until the large teacher model converges; training is finished and the trained large teacher model M_L is obtained.
S26: in the training set D_l = {(x_l, y)} = {(x, y)}, x denotes an input sample and y denotes its true label; add a prompt template to the input sample x and convert it into a cloze form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in.
S27: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function v: L → V;
the prompt-based pre-trained language model M gives the score of [MASK] on each label l, s_l = M([MASK] = v(l) | P(x)),
where v(l) denotes the label word corresponding to label l and k is the length of the label word.
S28: use a softmax layer to predict the class probability of [MASK] over the different labels l, and obtain the emotion classification of the input sample x from the class probabilities.
S29: establish the loss function of the small teacher model's output layer.
S210: repeat S27-S29 until the small teacher model converges; training is finished and the trained small teacher model M_B is obtained.
S3: predict all unlabeled samples x_u' with the trained small teacher model M_B to obtain the sample probabilities, and compute the uncertainty u of each sample's probability.
S31: feed all unlabeled samples x_u' to the trained small teacher model M_B; the predicted probability distribution is P = M_B(x_u'), a distribution over the |L| labels,
where |L| is the number of label classes in the classification task. The uncertainty u measures the quality of the sample's predicted probability.
S4: compare the uncertainty u with a preset threshold and screen out the samples whose probability is uncertain; the preset threshold takes a value in the range (0, 1).
If the uncertainty u of a sample's probability is greater than the threshold, the sample is taken as a sample x_u'' with high probability uncertainty. An uncertainty above the threshold means that the small teacher's classification probability for the sample x_u' is not confident enough, and a new probability distribution must be obtained from the large teacher model.
S5: feed the samples x_u' to the trained small teacher model M_B to obtain the small teacher's soft labels P, feed the samples x_u'' to the trained large teacher model M_L to obtain the large teacher's soft labels P', and combine P and P' into the final soft labels P̃.
S51: feed the samples x_u' to the trained small teacher model M_B to obtain the small teacher's soft labels P;
S52: feed the samples x_u'' to the trained large teacher model M_L to obtain the large teacher's soft labels P'; for each screened-out sample x_u'' the final soft label P̃ takes the large teacher's output P', and for the remaining samples it takes the small teacher's output P.
S6: construct a student model; in this embodiment the student model is a small, shallow neural network. Distill the student model with the unlabeled sample set D_u and the soft labels P̃ to obtain a distilled student model.
S61: take the unlabeled sample set D_u as the training set for distilling the student model; the vector representation produced by the student model is h^s = g(A_u; θ^s), where g() denotes the network function of the student model and A_u is the word-vector matrix corresponding to the unlabeled sample set D_u; for an unlabeled sample x_u' of length k with word vectors of dimension d, A_u ∈ R^{k×d}. The superscript s denotes the student model and θ^s denotes the learnable parameters of the student model.
Each row of the word-vector matrix A_u is the word-vector representation of one character of the input sample x_u', obtained by training a word2vec or GloVe model.
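A toy construction of the per-sample matrix A_u is sketched below; the hypothetical embedding dictionary stands in for trained word2vec/GloVe vectors, and zero-padding to length k is an assumption, since the patent does not fix a padding scheme:

```python
def sample_matrix(tokens, embeddings, k, d):
    # Build the k x d word-vector matrix A_u for one sample: one row per
    # token, truncated or zero-padded to exactly k rows of dimension d.
    rows = [embeddings.get(t, [0.0] * d) for t in tokens[:k]]
    rows += [[0.0] * d for _ in range(k - len(rows))]
    return rows
```

Stacking these matrices over the batch gives the input to the student network g().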
S62: establish the loss function of the student model's output layer, i.e., the loss used when the teachers distill the student model: L_KD = (1/n) Σ_{i=1}^{n} D_KL(P̃_i || P_i^s), where n denotes the batch size, P_i^s denotes the student model's predicted probability for the i-th sample, P̃_i denotes the i-th sample's probability in the final soft labels P̃, and T is the temperature parameter carried by the distillation model: the larger T is, the smoother the softmax probability distribution, the larger the entropy of the distribution, and the more information it carries. D_KL denotes the KL-divergence loss function,
expressed as D_KL(P̃_i || P_i^s) = Σ_{l=1}^{|L|} P̃_i(l) · log( P̃_i(l) / P_i^s(l) ), where |L| is the number of label classes in the classification task.
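A sketch of the batch loss L_KD with temperature softening follows. The placement of T inside the student's softmax follows the common Hinton-style convention; the patent only names T as the distillation temperature, so treat this as one plausible reading rather than the disclosed formula:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-softened softmax: larger T gives a smoother,
    # higher-entropy distribution.
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, soft_labels, T=2.0):
    # L_KD = (1/n) * sum_i D_KL(P~_i || softmax(z_i^s / T)) over a batch
    # of n samples, with the student's logits z_i^s softened by T.
    n = len(student_logits)
    total = 0.0
    for z, p_soft in zip(student_logits, soft_labels):
        q = softmax(z, T)
        total += sum(p * math.log(p / qj)
                     for p, qj in zip(p_soft, q) if p > 0)
    return total / n
```

The loss vanishes when the student's softened distribution matches the soft labels and is positive otherwise, so minimizing it pulls the student toward the teachers' combined output.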
S63: pass h^s through a linear layer and a softmax activation layer in turn to obtain the probability output P^s = softmax(W^s h^s) for the unlabeled sample set D_u, where W^s denotes the weight matrix to be learned in the student model's linear layer.
S64: update the learnable parameters θ^s of the student model with the loss function L_KD.
S65: repeat S61-S64 until the loss function L_KD converges, obtaining the distilled student model.
S7: perform classification prediction on the test set with the distilled student model.
The invention has the beneficial effects that:
By building a large teacher model and a small teacher model to distill the student model, samples are screened by the small teacher model before passing through the large teacher model, which effectively reduces the distillation time of the student model and lowers resource consumption; meanwhile, because the two teachers reduce resource consumption, a large number of unlabeled samples for the emotion classification task can be collected, which improves classification accuracy.
To further illustrate the beneficial effects of the invention, in this embodiment the test set is fed to the trained student model to obtain prediction probabilities. The effect of the invention is analyzed in three respects: (1) the accuracy of the classification results obtained by the student model on the test set; (2) the teachers' prediction time over all unlabeled samples during distillation; and (3) the reduction in the access rate to the large teacher model.
In this embodiment, a sentence-level YELP dataset (see Zhang X, Zhao J, LeCun Y. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 2015, 28) and a document-level IMDB dataset are used. In the experiments, 8 polarity-balanced samples are selected from each dataset as the training set and as the validation set, respectively, and 500 positive and 500 negative samples as the test set. Further, the YELP dataset has 100,000 unlabeled samples and the IMDB dataset has 98,000.
To simulate the knowledge distillation process of the large-and-small-teacher mechanism, large and small teacher models are set up under both a BERT model and a RoBERTa model, denoted BERT-large (the large teacher under BERT), BERT-base (the small teacher under BERT), RoBERTa-large (the large teacher under RoBERTa), and RoBERTa-base (the small teacher under RoBERTa). When training the teacher models, the label words are "terrible" and "great"; the batch size is set to 4 or 8; the optimizer is AdamW, with the learning rate chosen from {1e-5, 2e-5, 5e-5} and the weight decay set to 1e-3; the batch size and learning rate are determined by a grid search over the hyper-parameters. The student model is a CNN using 3 convolution-kernel sizes, (3, 50), (4, 50), and (5, 50), with 100 kernels of each size; each CNN uses glove.6B.50d word vectors; the batch size is set to 128; the optimizer is Adam, with the learning rate set to 1e-3 and the weight decay to 1e-5. To prevent overfitting during training of the neural network model, the Dropout parameter is set to 0.5.
The experimental results for the BERT models on the YELP and IMDB datasets are shown in FIG. 4, where the uncertainty threshold is set to 0.85 on both datasets; the results for the RoBERTa models are shown in FIG. 5, where the threshold is set to 0.6 on YELP and 0.9 on IMDB. In FIGS. 4 and 5, "Fine-tuning" denotes standard fine-tuning of the pre-trained language model, "LM-BFF" denotes prompt-based fine-tuning of the pre-trained language model, and "LM-BFF distills CNN" denotes distilling the CNN model with the prompt-based pre-trained language model. Because few-shot learning is sensitive to the data and different training splits give quite different results, 5 different random seeds are used to sample different training and validation sets to mitigate this problem; the accuracy of the classification results is reported in the form "mean of the 5 results (variance of the 5 results)".
As can be seen from the classification accuracy in fig. 4, compared with distillation from the BERT-large model, the distillation performance of the method of the invention improves by 91.13% - 90.64% = 0.49% on the YELP dataset and by 84.14% - 84.08% = 0.06% on the IMDB dataset. Moreover, compared with the distillation performance of the BERT-base model, the method achieves 91.13% > 87.18% on YELP and 84.14% > 84.08% on IMDB, far exceeding the BERT-base result. As can be seen from the prediction time in fig. 4, compared with the BERT-large teacher model, the distillation time of the method is reduced by 91.93s/163.27s = 56.31% on YELP and by 962.37s/1598.34s = 60.21% on IMDB. Meanwhile, the simulation program counts the reduction in the large teacher model's access rate (i.e. one minus the ratio of the number of unlabeled samples passing through the large teacher model under the big-and-small teacher mechanism to the total number of unlabeled samples): compared with BERT-large, the access rate is reduced by 74.40% on YELP and by 72.42% on IMDB.
From the classification accuracy in fig. 5, it can be seen that, compared with distillation from the RoBERTa-large model, the distillation performance of the method of the invention improves by 93.16% - 92.80% = 0.36% on the YELP dataset and by 87.84% - 87.64% = 0.2% on the IMDB dataset. Moreover, compared with the distillation performance of the RoBERTa-base model, the method achieves 93.16% > 91.82% on YELP and 87.84% > 87.64% on IMDB, exceeding the RoBERTa-base result. As can be seen from the prediction time in fig. 5, compared with the RoBERTa-large teacher model, the distillation time of the method is reduced by 75.59s/163.32s = 46.28% on YELP and by 912.65s/1594.93s = 57.22% on IMDB. Meanwhile, the simulation program counts the reduction in the large teacher model's access rate (i.e. one minus the ratio of the number of unlabeled samples passing through the large teacher model under the big-and-small teacher mechanism to the total number of unlabeled samples): compared with RoBERTa-large, the access rate is reduced by 84.65% on YELP and by 75.56% on IMDB.
The access frequency is proportional to the resources consumed. All unlabeled samples pass through the small teacher model, while only the small number of samples screened out by the threshold pass through the large teacher model. Compared with passing all unlabeled samples through the large teacher model, this greatly reduces resource consumption; since the small teacher model has relatively few parameters and occupies little computing resource, only the reduction in the large teacher model's access rate is analyzed in the simulation.
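The routing and access-rate accounting above can be sketched as follows. The uncertainty values here are random stand-ins, not real small-teacher outputs, and the 0.85 threshold is the one quoted for the BERT experiments; with uniform stand-in uncertainties the measured reduction simply tracks the threshold:

```python
import random

random.seed(1)

threshold = 0.85          # uncertainty threshold used on YELP/IMDB under BERT
n_unlabeled = 10_000

# Stand-in uncertainties; in the method these come from the small teacher's
# predictive distribution on each unlabeled sample.
uncertainties = [random.random() for _ in range(n_unlabeled)]

# Every sample visits the small teacher; only high-uncertainty samples
# additionally visit the large teacher.
big_teacher_visits = sum(1 for u in uncertainties if u > threshold)
reduction = 1 - big_teacher_visits / n_unlabeled   # fraction of large-teacher calls avoided
```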
The simulation results further show that, during student-model training, the method effectively reduces both the number of accesses to the large teacher model and the distillation time, improving classification accuracy while reducing resource consumption.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaustively list all embodiments. Obvious variations or modifications derived therefrom remain within the scope of the invention.
Claims (10)
1. A few-sample emotion classification method based on big-and-small teacher knowledge distillation, characterized by comprising the following steps:
S1: divide the samples into labeled samples x_l and unlabeled samples x_u′, collecting a large number of unlabeled samples x_u′ of the emotion classification task; form the labeled sample set D_l = {x_l} and the unlabeled sample set D_u = {x_u′};
S2: construct a large teacher model and a small teacher model; train the large teacher model with the labeled sample set D_l to obtain the trained large teacher model M_L, and train the small teacher model with the labeled sample set D_l to obtain the trained small teacher model M_B;
S3: use the trained small teacher model M_B to predict all unlabeled samples x_u′, obtaining each sample's probability, and calculate the uncertainty of each sample's probability;
S4: compare the uncertainty with a preset threshold and screen out the samples x_u″ whose probabilities are uncertain;
S5: input the samples x_u′ into the trained small teacher model M_B to obtain the small teacher's soft labels P, and input the samples x_u″ into the trained large teacher model M_L to obtain the large teacher's soft labels P′; combine the small teacher's soft labels P and the large teacher's soft labels P′ to obtain the final soft labels;
S6: construct a student model and distill it using the unlabeled sample set D_u and said final soft labels, obtaining the distilled student model;
S7: perform classification prediction on the test set using the distilled student model.
2. The few-sample emotion classification method based on big-and-small teacher knowledge distillation of claim 1, wherein: the large teacher model and the small teacher model are both prompt-based pre-trained language models M, and the parameter count of the large teacher model is greater than that of the small teacher model.
3. The few-sample emotion classification method based on big-and-small teacher knowledge distillation of claim 2, wherein training the large teacher model with the labeled sample set D_l to obtain the trained large teacher model M_L specifically comprises:
S21: in the training set D_l = {x_l} = {x, y}, x denotes an input sample and y its true label; add a prompt template to the input sample x, converting it into a cloze (fill-in-the-blank) task:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the token to be filled, P(x) is the input to the language model, and "It is [MASK]" is the prompt template appended to the input text;
S22: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function; through the prompt-based pre-trained language model M, obtain the score of [MASK] on the label word corresponding to each label l, where the label word corresponding to label l has length k;
S23: predict the class probability of [MASK] over the different labels l through a softmax layer, obtaining the emotion classification of the input sample x;
S24: establish the loss function of the large teacher model's output layer;
S25: repeat S22-S24 until the large teacher model converges; training ends, yielding the trained large teacher model M_L;
Training the small teacher model with the labeled sample set D_l to obtain the trained small teacher model M_B specifically comprises:
S26: in the training set D_l = {x_l} = {x, y}, x denotes an input sample and y its true label; add a prompt template to the input sample x, converting it into a cloze task:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the token to be filled;
S27: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function; through the prompt-based pre-trained language model M, obtain the score of [MASK] on the label word corresponding to each label l, where the label word corresponding to label l has length k;
S28: predict the class probability of [MASK] over the different labels l through a softmax layer, obtaining the emotion classification of the input sample x;
S29: establish the loss function of the small teacher model's output layer;
S210: repeat S27-S29 until the small teacher model converges; training ends, yielding the trained small teacher model M_B.
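Steps S21-S23 (and their mirrors S26-S28) can be sketched as follows. The template and the label words "terrible"/"great" come from the description; the [MASK] scores below are made-up stand-ins for what the pre-trained language model would actually output:

```python
import math

def build_prompt(x: str) -> str:
    # Cloze template from S21/S26: P(x) = [CLS] x It is [MASK]. [SEP]
    return f"[CLS] {x} It is [MASK]. [SEP]"

def label_probabilities(mask_scores: dict) -> dict:
    # S23/S28: softmax over the [MASK] scores of the label words gives the
    # class probabilities of the input sample.
    m = max(mask_scores.values())
    exps = {w: math.exp(s - m) for w, s in mask_scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

prompt = build_prompt("A wonderful, moving film")
# Hypothetical scores assigned to the label words at the [MASK] position.
probs = label_probabilities({"terrible": 1.3, "great": 3.1})
```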
4. The few-sample emotion classification method based on big-and-small teacher knowledge distillation of claim 3, wherein using the trained small teacher model M_B to predict all unlabeled samples x_u′, obtaining each sample's probability and calculating the uncertainty of each sample's probability, specifically comprises:
S31: input all unlabeled samples x_u′ into the trained small teacher model M_B to obtain the predicted probability distribution;
wherein |L| is the number of label classes in the classification task.
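The uncertainty formula itself is an image not reproduced in this text; a common definition consistent with the use of |L| is the entropy of the small teacher's predictive distribution normalized by log|L|, so that the uncertainty lies in [0, 1] and can be compared against thresholds such as 0.85:

```python
import math

def uncertainty(p, eps=1e-12):
    # Entropy of the predictive distribution p, normalized by log|L| where
    # |L| = len(p) is the number of label classes. (An assumed definition;
    # the claim's exact formula is not shown in this text.)
    H = -sum(pi * math.log(pi + eps) for pi in p)
    return H / math.log(len(p))

u_unsure = uncertainty([0.5, 0.5])       # maximally uncertain -> ~1.0
u_confident = uncertainty([0.99, 0.01])  # confident -> near 0
```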
6. The few-sample emotion classification method based on big-and-small teacher knowledge distillation of claim 1, wherein comparing the uncertainty with the preset threshold and screening out the samples x_u″ whose probabilities are uncertain, specifically:
7. The few-sample emotion classification method based on big-and-small teacher knowledge distillation of claim 3, wherein inputting the samples x_u′ into the trained small teacher model M_B to obtain the small teacher's soft labels P, inputting the samples x_u″ into the trained large teacher model M_L to obtain the large teacher's soft labels P′, and combining P and P′ into the final soft labels, specifically comprises:
S51: input the samples x_u′ into the trained small teacher model M_B to obtain the small teacher's soft labels;
S52: input the samples x_u″ into the trained large teacher model M_L to obtain the large teacher's soft labels;
8. The few-sample emotion classification method based on big-and-small teacher knowledge distillation of any one of claims 1 to 7, wherein distilling the student model using the unlabeled sample set D_u and said soft labels to obtain the distilled student model specifically comprises:
S61: take the unlabeled sample set D_u as the training set for distilling the student model; the vector representation through the student model is denoted g(A_u; θ_s), where g() denotes the network function of the student model, A_u is the word-vector matrix corresponding to the unlabeled sample set D_u, the superscript s denotes the student model, and θ_s denotes the learnable parameters of the student model;
S62: establish the loss function L_KD of the student model's output layer from the KL-divergence loss D_KL between the student model's prediction probability for the i-th sample and the i-th entry of the final soft labels, where n denotes the batch size and T is the temperature parameter of the distillation model;
S63: pass the representation through the linear layer and the softmax activation layer in turn to obtain the probability output of the unlabeled sample set D_u, where W_s denotes the weight matrix to be learned in the student model's linear layer;
S64: update the learnable parameters of the student model using the loss function L_KD;
S65: repeat S61-S64 until the loss function L_KD converges, obtaining the distilled student model.
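The distillation loss of S62 can be sketched as a batch-mean, temperature-softened KL divergence between the final soft labels and the student's predictions. The T² scaling is the usual convention in knowledge distillation; the claim's exact formula is an image not reproduced in this text:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-T softmax over a list of logits.
    m = max(logits)
    exps = [math.exp((l - m) / T) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_distill_loss(student_logits, teacher_probs, T=2.0):
    # L_KD = (1/n) * sum_i T^2 * D_KL(p_i_teacher || p_i_student), where
    # p_i_student is the student's temperature-softened prediction for the
    # i-th sample and p_i_teacher is the i-th final soft label. (Assumed
    # form; the claim's exact expression is not shown.)
    n = len(student_logits)
    total = 0.0
    for logits, q in zip(student_logits, teacher_probs):
        p_s = softmax(logits, T)
        total += T * T * sum(qi * math.log(qi / psi)
                             for qi, psi in zip(q, p_s) if qi > 0)
    return total / n
```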
9. The few-sample emotion classification method based on big-and-small teacher knowledge distillation of claim 8, wherein: in the word-vector matrix A_u, each row is the word-vector representation of one character of an input sample x_u′, and each character's word vector is obtained by training a word2vec or GloVe model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210653730.6A CN114722805B (en) | 2022-06-10 | 2022-06-10 | Little sample emotion classification method based on size instructor knowledge distillation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114722805A true CN114722805A (en) | 2022-07-08 |
CN114722805B CN114722805B (en) | 2022-08-30 |
Family
ID=82232411
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210653730.6A Active CN114722805B (en) | 2022-06-10 | 2022-06-10 | Little sample emotion classification method based on size instructor knowledge distillation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114722805B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113762144A (en) * | 2021-09-05 | 2021-12-07 | 东南大学 | Deep learning-based black smoke vehicle detection method |
CN113886562A (en) * | 2021-10-02 | 2022-01-04 | 智联(无锡)信息技术有限公司 | AI resume screening method, system, equipment and storage medium |
CN114168844A (en) * | 2021-11-11 | 2022-03-11 | 北京快乐茄信息技术有限公司 | Online prediction method, device, equipment and storage medium |
CN114283402A (en) * | 2021-11-24 | 2022-04-05 | 西北工业大学 | License plate detection method based on knowledge distillation training and space-time combined attention |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115186083A (en) * | 2022-07-26 | 2022-10-14 | 腾讯科技(深圳)有限公司 | Data processing method, device, server, storage medium and product |
CN116186200A (en) * | 2023-01-19 | 2023-05-30 | 北京百度网讯科技有限公司 | Model training method, device, electronic equipment and storage medium |
CN116186200B (en) * | 2023-01-19 | 2024-02-09 | 北京百度网讯科技有限公司 | Model training method, device, electronic equipment and storage medium |
CN116861302A (en) * | 2023-09-05 | 2023-10-10 | 吉奥时空信息技术股份有限公司 | Automatic case classifying and distributing method |
CN116861302B (en) * | 2023-09-05 | 2024-01-23 | 吉奥时空信息技术股份有限公司 | Automatic case classifying and distributing method |
Also Published As
Publication number | Publication date |
---|---|
CN114722805B (en) | 2022-08-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114722805B (en) | Little sample emotion classification method based on size instructor knowledge distillation | |
CN111177374B (en) | Question-answer corpus emotion classification method and system based on active learning | |
CN110188358B (en) | Training method and device for natural language processing model | |
CN110188272B (en) | Community question-answering website label recommendation method based on user background | |
CN110619044B (en) | Emotion analysis method, system, storage medium and equipment | |
CN111914885A (en) | Multitask personality prediction method and system based on deep learning | |
CN109062958B (en) | Primary school composition automatic classification method based on TextRank and convolutional neural network | |
CN112364743A (en) | Video classification method based on semi-supervised learning and bullet screen analysis | |
CN115270752A (en) | Template sentence evaluation method based on multilevel comparison learning | |
Jishan et al. | Natural language description of images using hybrid recurrent neural network | |
Cai | Automatic essay scoring with recurrent neural network | |
Nassiri et al. | Arabic L2 readability assessment: Dimensionality reduction study | |
CN116186250A (en) | Multi-mode learning level mining method, system and medium under small sample condition | |
Aksonov et al. | Question-Answering Systems Development Based on Big Data Analysis | |
WO2020240572A1 (en) | Method for training a discriminator | |
Arifin et al. | Automatic essay scoring for Indonesian short answers using siamese Manhattan long short-term memory | |
US20220253694A1 (en) | Training neural networks with reinitialization | |
CN115391520A (en) | Text emotion classification method, system, device and computer medium | |
CN114997175A (en) | Emotion analysis method based on field confrontation training | |
Ma et al. | Enhanced hierarchical structure features for automated essay scoring | |
CN113821571A (en) | Food safety relation extraction method based on BERT and improved PCNN | |
CN112200268A (en) | Image description method based on encoder-decoder framework | |
Ratna et al. | Hybrid deep learning cnn-bidirectional lstm and manhattan distance for japanese automated short answer grading: Use case in japanese language studies | |
LU504829B1 (en) | Text classification method, computer readable storage medium and system | |
Chen et al. | An effective relation-first detection model for relational triple extraction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||