CN114722805A - Few-sample emotion classification method based on large-and-small-teacher knowledge distillation - Google Patents

Few-sample emotion classification method based on large-and-small-teacher knowledge distillation

Info

Publication number
CN114722805A
Authority
CN
China
Prior art keywords
model
sample
instructor
training
tutor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210653730.6A
Other languages
Chinese (zh)
Other versions
CN114722805B (en)
Inventor
李寿山
常晓琴
周国栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202210653730.6A priority Critical patent/CN114722805B/en
Publication of CN114722805A publication Critical patent/CN114722805A/en
Application granted granted Critical
Publication of CN114722805B publication Critical patent/CN114722805B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a few-sample emotion classification method based on large-and-small-teacher knowledge distillation. A large number of unlabeled samples and a small number of labeled samples are collected for the emotion classification task, and a large teacher model and a small teacher model are trained with the labeled samples. All unlabeled samples are passed through the small teacher model to obtain the uncertainty of each sample's predicted probability; the samples whose probability is uncertain are then screened out by a threshold and passed through the large teacher model. The probability outputs of the large and small teacher models are combined into soft labels that are used to distill the student model, and the distilled student model performs classification prediction. The invention reduces the frequency of visiting the large teacher model and the distillation time of training the student model, lowering resource consumption while improving classification accuracy.

Description

Few-sample emotion classification method based on large-and-small-teacher knowledge distillation
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a few-sample emotion classification method based on large-and-small-teacher knowledge distillation.
Background
The emotion classification task aims to automatically judge the emotion polarity (e.g., negative or positive) expressed by a text. The task is a research hotspot in natural language processing and a basic building block of many application systems, such as intention mining, information retrieval and question answering. Few-sample emotion classification means that only a small number of labeled samples are available for training the classifier.
For few-sample emotion classification, machine learning and deep learning algorithms are generally used to extract the emotional meaning of a piece of text, and the most common formulation is to model the problem as mapping an input text to an output label. The prior art generally comprises the following steps: (1) experts annotate a small amount of text with polarity labels, each piece of text serving as one sample, yielding a small corpus of labeled samples with balanced polarity labels; (2) a prompt-based large-scale pre-trained language model (such as GPT-3) is trained on the small number of labeled samples to obtain a classification model; (3) the classification model is used to predict the polarity label of a text with an unknown label. During testing, a single text is fed into the classification model at a time. The network structure of the prompt-based large-scale pre-trained language model in step (2) is shown in FIG. 1, where "[CLS] x It is [MASK]. [SEP]" is the input, [CLS] marks the beginning of the sentence, [SEP] separates sentences, and x is the sentence whose class the pre-trained model predicts. The "MLM head" in FIG. 1 is the masked-language-model head of the prompt-based pre-trained language model, which predicts the word to fill in at the [MASK] position. For example, for the input "[CLS] I will recommend them to everyone! It is [MASK]. [SEP]", the MLM head predicts the positive label word "good", giving the output "I will recommend them to everyone! It is good."
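For illustration only, a minimal sketch of this kind of prompt-based masked-language-model classification is given below; the model name ("bert-base-uncased"), the label words ("terrible"/"great"), and the helper function are assumptions of the sketch and are not prescribed by the prior art described here.

```python
# Hedged sketch of prompt-based MLM classification; the model and label words
# are illustrative assumptions only.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
label_words = {0: "terrible", 1: "great"}  # verbalizer: label -> vocabulary word

def classify_with_prompt(text):
    # Build the cloze input "[CLS] x It is [MASK]. [SEP]"
    prompt = f"{text} It is {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]       # MLM scores over the vocabulary
    word_ids = [tokenizer.convert_tokens_to_ids(w) for w in label_words.values()]
    probs = torch.softmax(logits[word_ids], dim=-1)        # restrict to the label words
    return {label: p.item() for label, p in zip(label_words, probs)}

print(classify_with_prompt("I will recommend them to everyone!"))
```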
In few-sample emotion classification, because training samples are scarce, common shallow neural networks (such as CNN and LSTM) and deep pre-trained language models (such as BERT and RoBERTa) struggle to judge the semantics of some texts correctly, so the classification accuracy is not high enough. The large-scale GPT-3 model in the prior art has 175 billion parameters and performs excellently on few-sample learning tasks when several input-output examples are added as context. However, its parameter count is so large that calling the model consumes expensive computing resources and inference is slow, which hinders practical application.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the defects of the prior art and provide a few-sample emotion classification method based on knowledge distillation with a large teacher and a small teacher, which can effectively reduce the frequency of visiting the large teacher model and the distillation time of training the student model, and improve classification accuracy while reducing resource consumption.
In order to solve the above technical problems, the invention provides a few-sample emotion classification method based on large-and-small-teacher knowledge distillation, which comprises the following steps:
S1: dividing the samples into labeled samples x_u and unlabeled samples x_u′, collecting a large number of unlabeled samples x_u′ for the emotion classification task, and building a labeled sample set D_l = {x_u} and an unlabeled sample set D_u = {x_u′};
S2: constructing a large teacher model and a small teacher model, training the large teacher model with the labeled sample set D_l to obtain a trained large teacher model M_L, and training the small teacher model with the labeled sample set D_l to obtain a trained small teacher model M_B;
S3: using the trained small teacher model M_B to predict all unlabeled samples x_u′, obtaining the sample probabilities, and calculating the uncertainty of each sample's probability;
S4: comparing the uncertainty with a preset threshold value threshold and screening out samples x_u″ whose sample probability is uncertain;
S5: feeding the samples x_u′ into the trained small teacher model M_B to obtain the small teacher's soft labels P, feeding the samples x_u″ into the trained large teacher model M_L to obtain the large teacher's soft labels P′, and combining the small teacher's soft labels P with the large teacher's soft labels P′ to obtain final soft labels P̂;
S6: constructing a student model, and distilling the student model with the unlabeled sample set D_u and the soft labels P̂ to obtain a distilled student model;
S7: using the distilled student model to perform classification prediction on the test set.
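For illustration only, a schematic outline of steps S1 to S7 is sketched below; every function and object name is a placeholder standing in for the components described above, not an API defined by the invention, and the routing of high-uncertainty samples to the large teacher follows the description of S3 to S5.

```python
# Schematic outline of steps S1-S7; all helpers are placeholders supplied by
# the caller, not APIs defined by the patent.
def few_shot_sentiment_pipeline(D_l, D_u, test_set, threshold,
                                train_teacher, uncertainty,
                                combine_soft_labels, distill_student):
    M_L = train_teacher(D_l, size="large")                         # S2: large teacher
    M_B = train_teacher(D_l, size="small")                         # S2: small teacher
    P = {x: M_B.predict_proba(x) for x in D_u}                     # S3: small teacher on all of D_u
    uncertain = [x for x in D_u if uncertainty(P[x]) > threshold]  # S4: screen uncertain samples
    soft_labels = combine_soft_labels(P, uncertain, M_L)           # S5: merge large/small teacher outputs
    student = distill_student(D_u, soft_labels)                    # S6: distill the student model
    return [student.predict(x) for x in test_set]                  # S7: classify the test set
```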
Preferably, the large teacher model and the small teacher model are both prompt-based pre-trained language models M, and the parameter count of the large teacher model is greater than that of the small teacher model.
Preferably, training the large teacher model with the labeled sample set D_l to obtain the trained large teacher model M_L specifically comprises the following steps:
S21: in the training set D_l = {x_u} = {x, y}, x denotes an input sample and y denotes its true label; a prompt template is added to the input sample x to convert it into a cloze (blank-filling) task:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in, P(x) is the input of the language model, and "It is [MASK]." is the prompt template added to the input text;
S22: L is taken as the label set of the classification task, V as the label-word set of the classification task, and a label mapping function from L to V is constructed; with P(x) as the input of the language model, the prompt-based pre-trained language model M gives the score of [MASK] on the label word corresponding to each label l in L, where the label word corresponding to label l has length k;
S23: the class probability of [MASK] over the different labels l is predicted through a softmax layer, and the emotion classification of the input sample x is obtained from the class probability;
S24: a loss function of the output layer of the large teacher model is established;
S25: S22 to S24 are repeated until the large teacher model converges; training ends and the trained large teacher model M_L is obtained.
Training the small teacher model with the labeled sample set D_l to obtain the trained small teacher model M_B specifically comprises the following steps:
S26: in the training set D_l = {x_u} = {x, y}, x denotes an input sample and y denotes its true label; a prompt template is added to the input sample x to convert it into a cloze task:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in;
S27: L is taken as the label set of the classification task, V as the label-word set of the classification task, and a label mapping function from L to V is constructed; through the prompt-based pre-trained language model M, the score of [MASK] on the label word corresponding to each label l in L is obtained, where the label word corresponding to label l has length k;
S28: the class probability of [MASK] over the different labels l is predicted through a softmax layer, and the emotion classification of the input sample x is obtained from the class probability;
S29: a loss function of the output layer of the small teacher model is established;
S210: S27 to S29 are repeated until the small teacher model converges; training ends and the trained small teacher model M_B is obtained.
Preferably, using the trained small teacher model M_B to predict all unlabeled samples x_u′, obtaining the sample probabilities and calculating the uncertainty of each sample's probability specifically comprises the following steps:
S31: all unlabeled samples x_u′ are input into the trained small teacher model M_B to obtain the predicted probability distribution of each sample;
S32: the uncertainty of each sample's probability is calculated from the predicted probability distribution, where |L| in the calculation formula is the number of label classes in the classification task (the specific formula is not reproduced in this text).
Preferably, the preset threshold value threshold takes a value within a specified range (the range expression is not reproduced in this text).
Preferably, comparing the uncertainty with the preset threshold value threshold and screening out samples x_u″ whose sample probability is uncertain specifically comprises: if the uncertainty of a sample's probability is greater than threshold, the sample is taken as a sample x_u″ with high sample-probability uncertainty.
Preferably, feeding the samples x_u′ into the trained small teacher model M_B to obtain the small teacher's soft labels P, feeding the samples x_u″ into the trained large teacher model M_L to obtain the large teacher's soft labels P′, and combining the small teacher's soft labels P with the large teacher's soft labels P′ to obtain the final soft labels P̂ specifically comprises the following steps:
S51: the samples x_u′ are input into the trained small teacher model M_B to obtain the small teacher's soft labels P;
S52: the samples x_u″ are input into the trained large teacher model M_L to obtain the large teacher's soft labels P′;
S53: the final soft labels P̂ are obtained by combining P and P′ (the combination expression is not reproduced in this text).
Preferably, distilling the student model with the unlabeled sample set D_u and the soft labels P̂ to obtain the distilled student model specifically comprises the following steps:
S61: the unlabeled sample set D_u is taken as the training set for distilling the student model, and the vector representation obtained through the student model is r^s = g(A_u; θ^s), where g(·) denotes the network function of the student model, A_u is the word vector matrix corresponding to the unlabeled sample set D_u, the superscript s denotes the student model, and θ^s denotes the learnable parameters of the student model;
S62: the loss function L_KD of the output layer of the student model is established, where n denotes the batch size, p_i^s denotes the predicted probability of the i-th sample through the student model, P̂_i denotes the predicted probability of the i-th sample in the final soft labels P̂, T is the temperature parameter of the distillation model, and D_KL denotes the KL divergence loss function (the full expression of L_KD is not reproduced in this text);
S63: r^s is passed sequentially through the linear layer and the softmax activation layer to obtain the probability output p^s of the unlabeled sample set D_u, where W^s denotes the weight matrix to be learned on the linear layer of the student model;
S64: the learnable parameters of the student model are updated with the loss function L_KD;
S65: S61 to S64 are repeated until the loss function L_KD converges, and the distilled student model is obtained.
Preferably, in the word vector matrix A_u, each row is the word vector representation of one character of an input sample x_u′, and the word vector of each character is obtained by word2vec or GloVe model training.
Preferably, the KL divergence between two probability distributions p and q over the label classes is D_KL(p ‖ q) = Σ_{l=1}^{|L|} p_l · log(p_l / q_l), where |L| is the number of label classes in the classification task.
Compared with the prior art, the technical scheme of the invention has the following advantages:
According to the invention, the student model is distilled by establishing a large teacher model and a small teacher model: samples are first screened by the small teacher model and only the screened samples pass through the large teacher model, which effectively reduces the distillation time of the student model and the resource consumption; meanwhile, because the large-and-small-teacher mechanism reduces resource consumption, a large number of unlabeled samples of the emotion classification task can be used, which improves classification accuracy.
Drawings
In order that the present disclosure may be more readily and clearly understood, reference is now made to the following detailed description of the embodiments of the present disclosure taken in conjunction with the accompanying drawings, in which
FIG. 1 is a network structure of a prompt-based large-scale pre-trained language model;
FIG. 2 is a schematic diagram of the structure of a conventional single teacher and single student knowledge distillation method;
FIG. 3 is a schematic diagram of the structure of the knowledge distillation method based on the large-and-small-teacher mechanism;
FIG. 4 is a graph of the results of experiments performed on the BERT model with YELP and IMDB datasets in an embodiment of the invention;
FIG. 5 is a graph of the results of experiments performed on the RoBERTa model with the YELP and IMDB datasets in an embodiment of the invention.
Detailed Description
The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.
During model optimization, a large model is often a single complex network or an ensemble of several networks and has good performance and generalization ability, whereas a small model has limited expressive capacity because of its small network size. Therefore, the knowledge learned by the large model (teacher model) can be used to guide the training of the small model (student model), so that the small model attains performance comparable to the large model with a greatly reduced number of parameters, realizing model compression and acceleration; this process is called knowledge distillation.
Compared with the conventional single-teacher, single-student knowledge distillation method shown in FIG. 2, the method of the invention shown in FIG. 3 additionally uses a large number of unlabeled samples and introduces two prompt-based teacher models, namely a large teacher model and a small teacher model; in the drawing, p^s denotes the output probability of the student model.
The few-sample emotion classification method based on large-and-small-teacher knowledge distillation of the invention comprises the following steps:
S1: the samples are divided into labeled samples x_u and unlabeled samples x_u′; a large number of unlabeled samples x_u′ are collected for the emotion classification task, and the labeled sample set D_l = {x_u} and the unlabeled sample set D_u = {x_u′} are built. The labeled samples x_u carry labels and the unlabeled samples x_u′ do not; there is a small number of labeled samples x_u and a large number of unlabeled samples x_u′.
S2: a large teacher model and a small teacher model are constructed, both of which are prompt-based pre-trained language models M (i.e., the prompt method); the parameter count of the large teacher model is greater than that of the small teacher model, and in this embodiment it is much greater. The large teacher model is trained with the labeled sample set D_l to obtain the trained large teacher model M_L, and the small teacher model is trained with the labeled sample set D_l to obtain the trained small teacher model M_B.
The labeled sample set D_l is used to train the large teacher model and the small teacher model respectively; their training processes are similar, and the specific process is as follows:
S21: in the training set D_l = {x_u} = {x, y}, x denotes an input sample and y denotes its true label; a prompt template is added to the input sample x to convert it into a cloze (blank-filling) task:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in. The purpose is to let the prompt-based pre-trained language model M decide which word fills [MASK], thereby converting the classification task into a cloze task. The prompt template "It is [MASK]." is appended to the input text, [MASK] corresponds to the different labels of the classification task, and by passing the new input through the language model and letting it decide the word at [MASK], the text is classified.
S22: L is taken as the label set of the classification task, V as the label-word set of the classification task, and a label mapping function from L to V is constructed, i.e., each task label is mapped to one or more words in the vocabulary of the prompt-based pre-trained language model M. For example, in a binary emotion classification task, category 0 corresponds to the vocabulary word "terrible" and category 1 corresponds to the vocabulary word "great". With P(x) as the input of the language model, the prompt-based pre-trained language model M gives the score of [MASK] on the label word corresponding to each label l in L, where the label word corresponding to label l has length k.
S23: the class probability of [MASK] over the different labels l is predicted through a softmax layer, and the emotion classification of the input sample x is obtained from the class probability.
S24: a loss function of the output layer of the large teacher model is established; in this embodiment, the loss function is the cross-entropy function, which measures the difference between the true label y of a training sample and the predicted probability.
S25: S22 to S24 are repeated until the large teacher model converges; training ends and the trained large teacher model M_L is obtained.
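For illustration only, a minimal fine-tuning loop corresponding to S21 to S25 (and, analogously, S26 to S210 below) might look as follows; the masked-language-model and tokenizer objects are assumed to come from a library such as transformers, and the epoch count, single-token label words and one-sample-per-update scheme are assumptions of this sketch rather than requirements of the method.

```python
# Hedged sketch of S21-S25: fine-tune a prompt-based teacher on the few labeled
# samples with cross-entropy over the label-word scores at [MASK].
import torch
import torch.nn.functional as F
from torch.optim import AdamW

def train_teacher(model, tokenizer, labeled_set, label_words, epochs=20, lr=1e-5):
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=1e-3)
    word_ids = [tokenizer.convert_tokens_to_ids(w) for w in label_words]  # verbalizer L -> V
    model.train()
    for _ in range(epochs):                                     # S25: repeat until convergence
        for text, y in labeled_set:                             # D_l = {(x, y)}
            prompt = f"{text} It is {tokenizer.mask_token}."    # S21: P(x) = [CLS] x It is [MASK]. [SEP]
            enc = tokenizer(prompt, return_tensors="pt", truncation=True)
            mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
            logits = model(**enc).logits[0, mask_pos.item()]    # MLM scores at the [MASK] position
            scores = logits[word_ids]                           # S22: score of each label word
            loss = F.cross_entropy(scores.unsqueeze(0),         # S23-S24: softmax + cross-entropy vs y
                                   torch.tensor([y]))
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```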
S26: training set D l ={x u }={x,yIn (c) } the reaction solution is,xa sample of the input is shown and,yrepresenting a genuine label; for input samplexAdding a prompt template and converting into a complete form filling task form:
p (x) = [ CLS ] x It is [ MASK ] [ SEP ], where [ MASK ] is a filler word.
S27: will be provided withLAs a set of labels for the classification task,Vconstructing a label mapping function as a label word set of a classification task:
Figure 200724DEST_PATH_IMAGE049
by pre-training language models based on promptsMObtaining [ MASK ]]Corresponding to different labels
Figure 117733DEST_PATH_IMAGE013
Score on
Figure 255454DEST_PATH_IMAGE050
Wherein
Figure 717659DEST_PATH_IMAGE015
Figure 940830DEST_PATH_IMAGE016
Presentation labellThe corresponding label words are then displayed on the display screen,kis the length of the tag word.
S28: establishing predictions [ MASK ] by softmax layer]In different labelslThe input sample is obtained through the class probabilityxEmotion classification of
Figure 897416DEST_PATH_IMAGE051
S29: and establishing a loss function of an output layer of the tutor model.
S210: repeating S27-S29 until the tutor model converges, finishing the training and obtaining the trained tutorTeacher modelM B
S3: the trained small teacher model M_B is used to predict all unlabeled samples x_u′, the sample probabilities are obtained, and the uncertainty of each sample's probability is calculated.
S31: all unlabeled samples x_u′ are input into the trained small teacher model M_B to obtain the predicted probability distribution of each sample.
S32: the uncertainty of each sample's probability is calculated from the predicted probability distribution; |L| in the calculation formula is the number of label classes in the classification task, and the uncertainty measures how trustworthy the sample's predicted probability is (the specific formula is not reproduced in this text).
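Because the uncertainty expression is not reproduced here, the sketch below assumes a normalized-entropy measure over the |L| class probabilities; this choice is consistent with the stated dependence on |L| and with the threshold values between 0 and 1 used in the embodiments, but it is an assumption, not the patent's exact formula.

```python
# Assumed normalized-entropy uncertainty over the |L| class probabilities;
# dividing by log|L| keeps the value in [0, 1] so it can be compared with
# thresholds such as 0.6-0.9.  This is an assumption, not the exact expression.
import math

def uncertainty(probs):
    num_labels = len(probs)                                   # |L|
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(num_labels)

# A confident prediction has low uncertainty, an ambiguous one is high.
print(uncertainty([0.95, 0.05]))   # ~0.29
print(uncertainty([0.5, 0.5]))     # 1.0
```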
S4: the uncertainty is compared with a preset threshold value threshold, and the samples whose sample probability is uncertain are screened out; the preset threshold value threshold takes a value within a specified range. If the uncertainty of a sample's probability is greater than threshold, the sample is taken as a sample x_u″ whose sample probability is highly uncertain. An uncertainty greater than threshold indicates that the small teacher's classification probability for the sample x_u′ is not confident enough, and a new probability distribution needs to be obtained through the large teacher model.
S5: the samples x_u′ are input into the trained small teacher model M_B to obtain the small teacher's soft labels P; the samples x_u″ are input into the trained large teacher model M_L to obtain the large teacher's soft labels P′; the small teacher's soft labels P are combined with the large teacher's soft labels P′ to obtain the final soft labels P̂.
S51: the samples x_u′ are input into the trained small teacher model M_B to obtain the small teacher's soft labels P.
S52: the samples x_u″ are input into the trained large teacher model M_L to obtain the large teacher's soft labels P′.
S53: the final soft labels P̂ are obtained by combining P and P′ (the combination expression is not reproduced in this text).
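For illustration only, the sketch below realizes one possible combination for S51 to S53, under the assumption that the large teacher's soft label P′ simply replaces the small teacher's soft label P on the screened high-uncertainty samples x_u″, in line with the remark in S4 that those samples need a new probability distribution from the large teacher; the exact combination expression is not reproduced in this text, so this is an assumption. The function name matches the placeholder used in the pipeline sketch after S7.

```python
# Hedged sketch of S51-S53 under a replacement-style combination assumption.
def combine_soft_labels(P, uncertain_samples, large_teacher):
    final = dict(P)                                  # P: small teacher's soft labels on all of D_u
    for x in uncertain_samples:                      # x_u'': high-uncertainty samples from S4
        final[x] = large_teacher.predict_proba(x)    # P': large teacher re-predicts these samples
    return final                                     # final soft labels used to distill the student
```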
S6: a student model is constructed; in this embodiment, the student model is a small, shallow neural network. The student model is distilled with the unlabeled sample set D_u and the soft labels P̂ to obtain a distilled student model.
S61: the unlabeled sample set D_u is taken as the training set for distilling the student model; the vector representation obtained through the student model is r^s = g(A_u; θ^s), where g(·) denotes the network function of the student model and A_u is the word vector matrix corresponding to the unlabeled sample set D_u. For an unlabeled sample of length k with word vectors of dimension d, A_u is a k × d matrix. The superscript s denotes the student model, and θ^s denotes the learnable parameters of the student model.
In the word vector matrix A_u, each row is the word vector representation of one character of the input sample x_u′, obtained by word2vec or GloVe model training.
S62: the loss function of the output layer of the student model, i.e., the loss function L_KD used when the teacher models distill the student model, is established. In L_KD, n denotes the batch size, p_i^s denotes the predicted probability of the i-th sample through the student model, P̂_i denotes the predicted probability of the i-th sample in the final soft labels P̂, T is the temperature parameter of the distillation model, and D_KL denotes the KL divergence loss function (the full expression of L_KD is not reproduced in this text). T is a parameter of the distillation model: the larger T is, the smoother the softmax probability distribution becomes, the larger its entropy is, and the more information it carries.
The KL divergence between two probability distributions p and q over the label classes is D_KL(p ‖ q) = Σ_{l=1}^{|L|} p_l · log(p_l / q_l), where |L| is the number of label classes in the classification task.
S63: r^s is passed sequentially through the linear layer and the softmax activation layer to obtain the probability output p^s of the unlabeled sample set D_u, where W^s denotes the weight matrix to be learned on the linear layer of the student model.
S64: the learnable parameters of the student model are updated with the loss function L_KD.
S65: S61 to S64 are repeated until the loss function L_KD converges, and the distilled student model is obtained.
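For illustration only, one distillation update corresponding to S62 to S64 is sketched below; the T² scaling, the argument order of the KL term, and the way the soft labels are re-softened with the temperature are common distillation conventions assumed for the sketch, not details confirmed by the text.

```python
# Hedged sketch of one student-distillation step with a temperature-softened
# KL-divergence loss (S62-S64).
import torch
import torch.nn.functional as F

def distill_step(student, optimizer, word_vectors, soft_labels, T=2.0):
    # word_vectors: (batch, seq_len, d) rows of A_u; soft_labels: (batch, |L|) final soft labels
    logits = student(word_vectors)                                     # r^s -> linear layer W^s (S61, S63)
    log_p_student = F.log_softmax(logits / T, dim=-1)                  # temperature-softened student output
    p_teacher = F.softmax(torch.log(soft_labels + 1e-12) / T, dim=-1)  # re-soften the teacher soft labels (assumption)
    loss = F.kl_div(log_p_student, p_teacher,                          # L_KD over the batch (S62)
                    reduction="batchmean") * (T ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                                   # S64: update student parameters
    return loss.item()
```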
S7: the test set was subjected to classification prediction using a distillation-completed student model.
The invention has the beneficial effects that:
According to the invention, the student model is distilled by establishing a large teacher model and a small teacher model: samples are first screened by the small teacher model and only the screened samples pass through the large teacher model, which effectively reduces the distillation time of the student model and the resource consumption; meanwhile, because the large-and-small-teacher mechanism reduces resource consumption, a large number of unlabeled samples of the emotion classification task can be used, which improves classification accuracy.
To further illustrate the beneficial effects of the invention, in this embodiment the test set is input into the trained student model to obtain prediction probabilities. The effect of the invention is analyzed from three aspects: (1) the accuracy of the classification results obtained by the student model on the test set, (2) the prediction time spent by the teacher models on all unlabeled samples during distillation, and (3) the reduction rate of the access rate to the large teacher model.
In the present embodiment, the sentence-level YELP dataset (see Zhang X., Zhao J., LeCun Y., "Character-level convolutional networks for text classification", Advances in Neural Information Processing Systems, 2015, 28: 649–) and the IMDB dataset are used. In the experiments, 8 class-balanced positive and negative samples are selected from each dataset as the training set and the validation set, respectively, and 500 positive and 500 negative samples are used as the test set. The YELP dataset has 100,000 unlabeled samples and the IMDB dataset has 98,000 unlabeled samples.
To simulate the knowledge distillation process of the large-and-small-teacher mechanism, large and small teacher models are set up under both the BERT model and the RoBERTa model, denoted BERT-large (the large teacher under BERT), BERT-base (the small teacher under BERT), RoBERTa-large (the large teacher under RoBERTa) and RoBERTa-base (the small teacher under RoBERTa). When the teacher models are trained, the label words are "terrible" and "great" respectively; the batch size is set to 4 or 8; the optimizer is AdamW with the learning rate chosen from {1e-5, 2e-5, 5e-5} and the weight decay set to 1e-3, and the batch size and learning rate are determined by a grid search over the hyper-parameters. The student model is a CNN using 3 different sizes of convolution kernels, namely (3, 50), (4, 50) and (5, 50), with 100 kernels of each size; the CNN uses glove.6B.50d word vectors; the batch size is set to 128; the optimizer is Adam with the learning rate set to 1e-3 and the weight decay to 1e-5. To prevent overfitting during training of the neural network model, the dropout parameter is set to 0.5.
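For illustration only, a student CNN matching this description (three kernel sizes over 50-dimensional GloVe word vectors, 100 kernels of each size, dropout 0.5) could be sketched as follows; the exact layer arrangement is an assumption of the sketch.

```python
# Hedged sketch of the student model: a TextCNN over 50-dimensional GloVe
# word vectors with kernel sizes 3/4/5, 100 kernels each and dropout 0.5.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentCNN(nn.Module):
    def __init__(self, num_labels=2, embed_dim=50, kernel_sizes=(3, 4, 5), num_kernels=100):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, num_kernels, (k, embed_dim)) for k in kernel_sizes])
        self.dropout = nn.Dropout(0.5)
        self.linear = nn.Linear(num_kernels * len(kernel_sizes), num_labels)  # W^s

    def forward(self, word_vectors):              # (batch, seq_len, embed_dim) GloVe rows of A_u
        x = word_vectors.unsqueeze(1)             # add a channel dimension for Conv2d
        feats = [F.relu(conv(x)).squeeze(3) for conv in self.convs]
        pooled = [F.max_pool1d(f, f.size(2)).squeeze(2) for f in feats]
        return self.linear(self.dropout(torch.cat(pooled, dim=1)))  # class scores before softmax
```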
The results of the experiments with the YELP and IMDB datasets on the BERT model are shown in FIG. 4, where the uncertainty threshold on both datasets is set to 0.85; the results on the RoBERTa model are shown in FIG. 5, where the uncertainty threshold is set to 0.6 on the YELP dataset and 0.9 on the IMDB dataset. In FIGS. 4 and 5, "Fine-tuning" denotes standard fine-tuning of the pre-trained language model, "LM-BFF" denotes prompt-based fine-tuning of the pre-trained language model, and "LM-BFF distills CNN" denotes distilling the CNN model with the prompt-based pre-trained language model. Because few-sample learning is sensitive to the data and results obtained from different data splits differ greatly, 5 different random seeds are used to sample different training and validation sets to mitigate this problem, and the accuracy of the classification results is reported as "mean of the 5 results (variance of the 5 results)".
As can be seen from the classification accuracy in FIG. 4, compared with distilling from the BERT-large model alone, the distillation performance of the method of the invention improves by 91.13% − 90.64% = 0.49% on the YELP dataset and 84.14% − 84.08% = 0.06% on the IMDB dataset. Compared with distilling from the BERT-base model, the method achieves 91.13% > 87.18% on the YELP dataset and 84.14% > 84.08% on the IMDB dataset, far better than the BERT-base result. As can be seen from the prediction time in FIG. 4, compared with the BERT-large teacher model, the distillation time of the method is reduced by 91.93 s out of 163.27 s = 56.31% on the YELP dataset and 962.37 s out of 1598.34 s = 60.21% on the IMDB dataset. Meanwhile, the simulation program counts the reduction in the access rate to the large teacher model (that is, how much the proportion of unlabeled samples that pass through the large teacher model under the large-and-small-teacher mechanism is reduced, relative to passing all unlabeled samples through it): compared with BERT-large, the access rate is reduced by 74.40% on the YELP dataset and 72.42% on the IMDB dataset.
As can be seen from the classification accuracy in FIG. 5, compared with distilling from the RoBERTa-large model alone, the distillation performance of the method improves by 93.16% − 92.80% = 0.36% on the YELP dataset and 87.84% − 87.64% = 0.2% on the IMDB dataset. Compared with distilling from the RoBERTa-base model, the method achieves 93.16% > 91.82% on the YELP dataset and 87.84% > 87.64% on the IMDB dataset, better than the RoBERTa-base result. As can be seen from the prediction time in FIG. 5, compared with the RoBERTa-large teacher model, the distillation time of the method is reduced by 75.59 s out of 163.32 s = 46.28% on the YELP dataset and 912.65 s out of 1594.93 s = 57.22% on the IMDB dataset. Meanwhile, the simulation program counts that, compared with RoBERTa-large, the access rate to the large teacher model is reduced by 84.65% on the YELP dataset and 75.56% on the IMDB dataset.
The access frequency is proportional to the resources occupied. All unlabeled samples pass through the small teacher model, and only the small number of samples screened by the threshold pass through the large teacher model. Compared with passing all unlabeled samples through the large teacher model, this greatly reduces resource consumption; since the small teacher model has relatively few parameters and occupies few computing resources, only the reduction in the access rate to the large teacher model is analyzed in the simulation.
The simulation results further show that the method can effectively reduce the frequency of visiting the large teacher model and the distillation time of training the student model, and improve classification accuracy while reducing resource consumption.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. And obvious variations or modifications of the invention may be made without departing from the scope of the invention.

Claims (10)

1. A few-sample emotion classification method based on large-and-small-teacher knowledge distillation, characterized by comprising the following steps:
S1: dividing the samples into labeled samples x_u and unlabeled samples x_u′, collecting a large number of unlabeled samples x_u′ for the emotion classification task, and building a labeled sample set D_l = {x_u} and an unlabeled sample set D_u = {x_u′};
S2: constructing a large teacher model and a small teacher model, training the large teacher model with the labeled sample set D_l to obtain a trained large teacher model M_L, and training the small teacher model with the labeled sample set D_l to obtain a trained small teacher model M_B;
S3: using the trained small teacher model M_B to predict all unlabeled samples x_u′, obtaining the sample probabilities, and calculating the uncertainty of each sample's probability;
S4: comparing the uncertainty with a preset threshold value threshold and screening out samples x_u″ whose sample probability is uncertain;
S5: feeding the samples x_u′ into the trained small teacher model M_B to obtain the small teacher's soft labels P, feeding the samples x_u″ into the trained large teacher model M_L to obtain the large teacher's soft labels P′, and combining the small teacher's soft labels P with the large teacher's soft labels P′ to obtain final soft labels P̂;
S6: constructing a student model, and distilling the student model with the unlabeled sample set D_u and the soft labels P̂ to obtain a distilled student model;
S7: using the distilled student model to perform classification prediction on a test set.
2. The few-sample emotion classification method based on large-and-small-teacher knowledge distillation as claimed in claim 1, wherein: the large teacher model and the small teacher model are both prompt-based pre-trained language models M, and the parameter count of the large teacher model is greater than that of the small teacher model.
3. The few-sample emotion classification method based on large-and-small-teacher knowledge distillation as claimed in claim 2, wherein training the large teacher model with the labeled sample set D_l to obtain the trained large teacher model M_L specifically comprises:
S21: in the training set D_l = {x_u} = {x, y}, x denotes an input sample and y denotes its true label; adding a prompt template to the input sample x and converting it into a cloze task:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in, P(x) is the input of the language model, and "It is [MASK]." is the prompt template added to the input text;
S22: taking L as the label set of the classification task and V as the label-word set of the classification task, constructing a label mapping function from L to V, and obtaining, through the prompt-based pre-trained language model M, the score of [MASK] on the label word corresponding to each label l in L, where the label word corresponding to label l has length k;
S23: predicting the class probability of [MASK] over the different labels l through a softmax layer, and obtaining the emotion classification of the input sample x from the class probability;
S24: establishing a loss function of the output layer of the large teacher model;
S25: repeating S22 to S24 until the large teacher model converges, ending the training and obtaining the trained large teacher model M_L;
and training the small teacher model with the labeled sample set D_l to obtain the trained small teacher model M_B specifically comprises:
S26: in the training set D_l = {x_u} = {x, y}, x denotes an input sample and y denotes its true label; adding a prompt template to the input sample x and converting it into a cloze task:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in;
S27: taking L as the label set of the classification task and V as the label-word set of the classification task, constructing a label mapping function from L to V, and obtaining, through the prompt-based pre-trained language model M, the score of [MASK] on the label word corresponding to each label l in L, where the label word corresponding to label l has length k;
S28: predicting the class probability of [MASK] over the different labels l through a softmax layer, and obtaining the emotion classification of the input sample x from the class probability;
S29: establishing a loss function of the output layer of the small teacher model;
S210: repeating S27 to S29 until the small teacher model converges, ending the training and obtaining the trained small teacher model M_B.
4. The few-sample emotion classification method based on large-and-small-teacher knowledge distillation as claimed in claim 3, wherein using the trained small teacher model M_B to predict all unlabeled samples x_u′, obtaining the sample probabilities and calculating the uncertainty of each sample's probability specifically comprises:
S31: inputting all unlabeled samples x_u′ into the trained small teacher model M_B to obtain the predicted probability distribution of each sample;
S32: calculating the uncertainty of each sample's probability from the predicted probability distribution, where |L| in the calculation formula is the number of label classes in the classification task.
5. The few-sample emotion classification method based on large-and-small-teacher knowledge distillation as claimed in claim 1, wherein: the preset threshold value threshold takes a value within a specified range.
6. The few-sample emotion classification method based on large-and-small-teacher knowledge distillation as claimed in claim 1, wherein comparing the uncertainty with the preset threshold value threshold and screening out samples x_u″ whose sample probability is uncertain specifically comprises: if the uncertainty of a sample's probability is greater than threshold, taking the sample as a sample x_u″ with high sample-probability uncertainty.
7. The few-sample emotion classification method based on large-and-small-teacher knowledge distillation as claimed in claim 3, wherein feeding the samples x_u′ into the trained small teacher model M_B to obtain the small teacher's soft labels P, feeding the samples x_u″ into the trained large teacher model M_L to obtain the large teacher's soft labels P′, and combining the small teacher's soft labels P with the large teacher's soft labels P′ to obtain the final soft labels P̂ specifically comprises:
S51: feeding the samples x_u′ into the trained small teacher model M_B to obtain the small teacher's soft labels P;
S52: feeding the samples x_u″ into the trained large teacher model M_L to obtain the large teacher's soft labels P′;
S53: combining P and P′ according to the combination expression to obtain the final soft labels P̂.
8. The few-sample emotion classification method based on large-and-small-teacher knowledge distillation as claimed in any one of claims 1 to 7, wherein distilling the student model with the unlabeled sample set D_u and the soft labels P̂ to obtain the distilled student model specifically comprises:
S61: taking the unlabeled sample set D_u as the training set for distilling the student model, the vector representation obtained through the student model being r^s = g(A_u; θ^s), where g(·) denotes the network function of the student model, A_u is the word vector matrix corresponding to the unlabeled sample set D_u, the superscript s denotes the student model, and θ^s denotes the learnable parameters of the student model;
S62: establishing the loss function L_KD of the output layer of the student model, where n denotes the batch size, p_i^s denotes the predicted probability of the i-th sample through the student model, P̂_i denotes the predicted probability of the i-th sample in the final soft labels P̂, T is the temperature parameter of the distillation model, and D_KL denotes the KL divergence loss function;
S63: passing r^s sequentially through the linear layer and the softmax activation layer to obtain the probability output p^s of the unlabeled sample set D_u, where W^s denotes the weight matrix to be learned on the linear layer of the student model;
S64: updating the learnable parameters of the student model with the loss function L_KD;
S65: repeating S61 to S64 until the loss function L_KD converges, and obtaining the distilled student model.
9. The few-sample emotion classification method based on large-and-small-teacher knowledge distillation as claimed in claim 8, wherein: in the word vector matrix A_u, each row is the word vector representation of one character of an input sample x_u′, and the word vector of each character is obtained by word2vec or GloVe model training.
10. The few-sample emotion classification method based on large-and-small-teacher knowledge distillation as claimed in claim 8, wherein: the KL divergence loss function between two probability distributions p and q over the label classes is expressed as D_KL(p ‖ q) = Σ_{l=1}^{|L|} p_l · log(p_l / q_l), where |L| is the number of label classes in the classification task.
CN202210653730.6A 2022-06-10 2022-06-10 Few-sample emotion classification method based on large-and-small-teacher knowledge distillation Active CN114722805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210653730.6A CN114722805B (en) Few-sample emotion classification method based on large-and-small-teacher knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210653730.6A CN114722805B (en) Few-sample emotion classification method based on large-and-small-teacher knowledge distillation

Publications (2)

Publication Number Publication Date
CN114722805A true CN114722805A (en) 2022-07-08
CN114722805B CN114722805B (en) 2022-08-30

Family

ID=82232411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210653730.6A Active CN114722805B (en) Few-sample emotion classification method based on large-and-small-teacher knowledge distillation

Country Status (1)

Country Link
CN (1) CN114722805B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115186083A (en) * 2022-07-26 2022-10-14 腾讯科技(深圳)有限公司 Data processing method, device, server, storage medium and product
CN116186200A (en) * 2023-01-19 2023-05-30 北京百度网讯科技有限公司 Model training method, device, electronic equipment and storage medium
CN116861302A (en) * 2023-09-05 2023-10-10 吉奥时空信息技术股份有限公司 Automatic case classifying and distributing method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762144A (en) * 2021-09-05 2021-12-07 东南大学 Deep learning-based black smoke vehicle detection method
CN113886562A (en) * 2021-10-02 2022-01-04 智联(无锡)信息技术有限公司 AI resume screening method, system, equipment and storage medium
CN114168844A (en) * 2021-11-11 2022-03-11 北京快乐茄信息技术有限公司 Online prediction method, device, equipment and storage medium
CN114283402A (en) * 2021-11-24 2022-04-05 西北工业大学 License plate detection method based on knowledge distillation training and space-time combined attention

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762144A (en) * 2021-09-05 2021-12-07 东南大学 Deep learning-based black smoke vehicle detection method
CN113886562A (en) * 2021-10-02 2022-01-04 智联(无锡)信息技术有限公司 AI resume screening method, system, equipment and storage medium
CN114168844A (en) * 2021-11-11 2022-03-11 北京快乐茄信息技术有限公司 Online prediction method, device, equipment and storage medium
CN114283402A (en) * 2021-11-24 2022-04-05 西北工业大学 License plate detection method based on knowledge distillation training and space-time combined attention

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115186083A (en) * 2022-07-26 2022-10-14 腾讯科技(深圳)有限公司 Data processing method, device, server, storage medium and product
CN116186200A (en) * 2023-01-19 2023-05-30 北京百度网讯科技有限公司 Model training method, device, electronic equipment and storage medium
CN116186200B (en) * 2023-01-19 2024-02-09 北京百度网讯科技有限公司 Model training method, device, electronic equipment and storage medium
CN116861302A (en) * 2023-09-05 2023-10-10 吉奥时空信息技术股份有限公司 Automatic case classifying and distributing method
CN116861302B (en) * 2023-09-05 2024-01-23 吉奥时空信息技术股份有限公司 Automatic case classifying and distributing method

Also Published As

Publication number Publication date
CN114722805B (en) 2022-08-30

Similar Documents

Publication Publication Date Title
CN114722805B (en) Few-sample emotion classification method based on large-and-small-teacher knowledge distillation
CN111177374B (en) Question-answer corpus emotion classification method and system based on active learning
CN110188358B (en) Training method and device for natural language processing model
CN110188272B (en) Community question-answering website label recommendation method based on user background
CN110619044B (en) Emotion analysis method, system, storage medium and equipment
CN111914885A (en) Multitask personality prediction method and system based on deep learning
CN109062958B (en) Primary school composition automatic classification method based on TextRank and convolutional neural network
CN112364743A (en) Video classification method based on semi-supervised learning and bullet screen analysis
CN115270752A (en) Template sentence evaluation method based on multilevel comparison learning
Jishan et al. Natural language description of images using hybrid recurrent neural network
Cai Automatic essay scoring with recurrent neural network
Nassiri et al. Arabic L2 readability assessment: Dimensionality reduction study
CN116186250A (en) Multi-mode learning level mining method, system and medium under small sample condition
Aksonov et al. Question-Answering Systems Development Based on Big Data Analysis
WO2020240572A1 (en) Method for training a discriminator
Arifin et al. Automatic essay scoring for Indonesian short answers using siamese Manhattan long short-term memory
US20220253694A1 (en) Training neural networks with reinitialization
CN115391520A (en) Text emotion classification method, system, device and computer medium
CN114997175A (en) Emotion analysis method based on field confrontation training
Ma et al. Enhanced hierarchical structure features for automated essay scoring
CN113821571A (en) Food safety relation extraction method based on BERT and improved PCNN
CN112200268A (en) Image description method based on encoder-decoder framework
Ratna et al. Hybrid deep learning cnn-bidirectional lstm and manhattan distance for japanese automated short answer grading: Use case in japanese language studies
LU504829B1 (en) Text classification method, computer readable storage medium and system
Chen et al. An effective relation-first detection model for relational triple extraction

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant