CN114722805B - Few-sample emotion classification method based on large and small teacher knowledge distillation - Google Patents

Few-sample emotion classification method based on large and small teacher knowledge distillation

Info

Publication number
CN114722805B
Authority
CN
China
Prior art keywords
model
sample
teacher
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210653730.6A
Other languages
Chinese (zh)
Other versions
CN114722805A (en)
Inventor
李寿山 (Li Shoushan)
常晓琴 (Chang Xiaoqin)
周国栋 (Zhou Guodong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202210653730.6A priority Critical patent/CN114722805B/en
Publication of CN114722805A publication Critical patent/CN114722805A/en
Application granted granted Critical
Publication of CN114722805B publication Critical patent/CN114722805B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2155 Generating training patterns characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/048 Activation functions
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a few-sample emotion classification method based on large and small teacher knowledge distillation. A large number of unlabeled samples and a small number of labeled samples are collected for the emotion classification task, and the labeled samples are used to train a large teacher model and a small teacher model. All unlabeled samples are passed through the small teacher model to obtain the uncertainty of each sample's predicted probability; the samples whose probability is uncertain are then screened out according to a threshold and passed through the large teacher model. The probability outputs of the large teacher model and the small teacher model are combined into soft labels used to distill a student model, and the distilled student model performs the classification prediction. The invention reduces the frequency of accessing the large teacher model and the distillation time when training the student model, reducing resource consumption while improving the accuracy of classification and recognition.

Description

Few-sample emotion classification method based on large and small teacher knowledge distillation
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a few-sample emotion classification method based on large and small teacher knowledge distillation.
Background
The emotion classification task aims to automatically judge the emotion polarity (e.g., negative or positive) expressed by a text. The task is a research hotspot in the field of natural language processing, is widely applied in application systems such as intention mining, information retrieval and question-answering systems, and serves as a basic component of those systems. Few-sample emotion classification means that only a small number of labeled samples are available when training the classifier.
When performing few-sample emotion classification, machine learning and deep learning algorithms are generally used to extract the emotional meaning of a piece of text; the most common approach is to model the problem as taking a piece of text as input and outputting a label. The prior art generally comprises the following steps: (1) a professional labels a small amount of text with different polarity labels, each piece of text serving as one sample, yielding a small corpus of labeled samples with balanced polarity labels; (2) a prompt-based large-scale pre-trained language model (such as GPT-3) is trained with the small number of labeled samples to obtain a classification model; (3) the classification model is used to test a text with an unknown label and obtain its polarity label. During testing, a single text is fed into the classification model each time. The network structure of the prompt-based large-scale pre-trained language model in step (2) is shown in FIG. 1, where [CLS] x [SEP] is the input sentence, [CLS] marks the beginning of the sentence, [SEP] marks the separation between sentences, and x is the sentence whose class is to be predicted. The "MLM head" in FIG. 1 is the masked-language-model head of the prompt-based pre-trained language model. For the input sentence "[CLS] I will recommend them to everyone! It is [MASK]. [SEP]", the MLM head predicts the positive label word "good" at the [MASK] position, so the completed output reads "I will recommend them to everyone! It is good."
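As an illustration of this prompt-based prediction, the masked-language-model head can be queried for label words at the [MASK] position as in the following minimal sketch; the model name bert-base-uncased and the label words "terrible"/"great" are assumptions for illustration, not details taken from the prior-art system described above:

```python
# Minimal sketch of prompt-based sentiment prediction through an MLM head.
# Assumptions: bert-base-uncased and the label words "terrible"/"great".
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
label_words = {0: "terrible", 1: "great"}          # verbalizer: label -> label word

def prompt_predict(text: str) -> int:
    # Wrap the input in the template "[CLS] x It is [MASK]. [SEP]".
    prompt = f"{text} It is {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]           # scores over the vocabulary
    ids = [tokenizer.convert_tokens_to_ids(w) for w in label_words.values()]
    probs = torch.softmax(logits[ids], dim=-1)                 # probabilities of the label words
    return int(probs.argmax())

print(prompt_predict("I will recommend them to everyone!"))    # expected: 1 (positive)
```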
Because few-sample emotion classification offers few training samples, common shallow neural networks (such as CNN and LSTM) and deep pre-trained language models (such as BERT and RoBERTa) struggle to judge the semantics of some texts correctly, and the recognition rate of classification is not high enough. The GPT-3 large-scale model in the prior art has 175 billion parameters and can perform well on few-sample learning tasks by adding several examples of inputs and corresponding outputs as context. However, its parameter count is too large: calling the model consumes expensive computing resources, inference is slow, and practical application is hindered.
Disclosure of Invention
Therefore, the technical problem to be solved by the present invention is to overcome the defects of the prior art and provide a few-sample emotion classification method based on large and small teacher knowledge distillation, which can effectively reduce the frequency of accessing the large teacher model and the distillation time when training the student model, and improve the accuracy of classification and recognition while reducing resource consumption.
In order to solve the above technical problems, the invention provides a few-sample emotion classification method based on large and small teacher knowledge distillation, comprising the following steps:
S1: divide the samples into labeled samples x_u and unlabeled samples x_u'; collect a large number of unlabeled samples x_u' for the emotion classification task and build the labeled sample set D_l = {x_u} and the unlabeled sample set D_u = {x_u'};
S2: construct a large teacher model and a small teacher model; use the labeled sample set D_l to train the large teacher model, obtaining the trained large teacher model M_L, and use the labeled sample set D_l to train the small teacher model, obtaining the trained small teacher model M_B;
S3: use the trained small teacher model M_B to predict all unlabeled samples x_u', obtaining the probability of each sample, and calculate the uncertainty of each sample's probability;
S4: compare the uncertainty with a preset threshold, denoted threshold, and screen out the samples x_u'' whose probability is uncertain;
S5: feed the samples x_u' into the trained small teacher model M_B to obtain the small teacher model's soft label P, feed the samples x_u'' into the trained large teacher model M_L to obtain the large teacher model's soft label P', and combine the small teacher model's soft label P with the large teacher model's soft label P' to obtain the final soft label;
S6: construct a student model, and distill the student model using the unlabeled sample set D_u and the final soft label to obtain a distilled student model;
S7: use the distilled student model to perform classification prediction on the test set.
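Purely as an illustration of how steps S1-S7 fit together (a hypothetical sketch, not the patented implementation: the teacher and student models are passed in as callables, and the normalized-entropy uncertainty and the replace-on-uncertain combination rule are assumptions standing in for the formulas referred to above):

```python
# Hypothetical end-to-end sketch of steps S1-S7.
import numpy as np

def distill_with_two_teachers(small_teacher, large_teacher, train_student,
                              unlabeled, threshold):
    # S2 is assumed done: both teachers are already fine-tuned on the labeled set D_l.
    # S3: the small teacher M_B predicts every unlabeled sample.
    probs_small = np.stack([small_teacher(x) for x in unlabeled])      # shape (N, |L|)
    entropy = -(probs_small * np.log(probs_small + 1e-12)).sum(axis=1)
    uncertainty = entropy / np.log(probs_small.shape[1])               # assumed measure, in [0, 1]
    # S4: screen out the samples x_u'' whose prediction is uncertain.
    uncertain = uncertainty > threshold
    # S5: uncertain samples are re-predicted by the large teacher M_L.
    soft_labels = probs_small.copy()
    for i in np.where(uncertain)[0]:
        soft_labels[i] = large_teacher(unlabeled[i])
    # S6: distill the student on the unlabeled set D_u with the combined soft labels.
    return train_student(unlabeled, soft_labels)                       # S7: use it for prediction
```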
Preferably, the large teacher model and the small teacher model are both prompt-based pre-trained language models M, and the number of parameters of the large teacher model is greater than that of the small teacher model.
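A minimal sketch of how such a prompt-based teacher (standing in for M_L or M_B) could be fine-tuned on the labeled samples is given below; the model name, label words and optimizer settings are illustrative assumptions, and the formal training steps follow:

```python
# Sketch of one fine-tuning step of a prompt-based teacher (steps S21-S25 / S26-S210).
# Assumptions: bert-base-uncased and single-token label words "terrible"/"great".
import torch
from torch.nn.functional import cross_entropy
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
verbalizer = [tokenizer.convert_tokens_to_ids(w) for w in ("terrible", "great")]
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-3)

def class_logits(texts):
    # S21: wrap each text in the template "[CLS] x It is [MASK]. [SEP]".
    prompts = [f"{t} It is {tokenizer.mask_token}." for t in texts]
    enc = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
    out = model(**enc).logits                                  # (batch, seq_len, vocab)
    mask_pos = (enc.input_ids == tokenizer.mask_token_id).nonzero()
    mlm_scores = out[mask_pos[:, 0], mask_pos[:, 1]]           # (batch, vocab) at [MASK]
    return mlm_scores[:, verbalizer]                           # S22: scores of the label words

def train_step(texts, labels):
    # S23-S24: softmax over label-word scores and cross-entropy against the true labels.
    loss = cross_entropy(class_logits(texts), torch.tensor(labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```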
Preferably, using the labeled sample set D_l to train the large teacher model and obtain the trained large teacher model M_L specifically comprises the following steps:
S21: in the training set D_l = {x_u} = {(x, y)}, x denotes an input sample and y denotes its true label; add a prompt template to the input sample x and convert the task into a cloze (fill-in-the-blank) form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in, P(x) is the input to the language model, and "It is [MASK]." is the prompt template appended to the input text;
S22: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function v: L → V; with P(x) as the input, the prompt-based pre-trained language model M produces, for each label l, a score for [MASK] being the corresponding label word v(l), where v(l) denotes the label word corresponding to label l and k is the length of the label word;
S23: a softmax layer predicts the class probability of [MASK] over the different labels l, and the emotion class of the input sample x is obtained from the class probability;
S24: establish the loss function of the large teacher model's output layer;
S25: repeat S22-S24 until the large teacher model converges; training ends and the trained large teacher model M_L is obtained.
Using the labeled sample set D_l to train the small teacher model and obtain the trained small teacher model M_B specifically comprises the following steps:
S26: in the training set D_l = {x_u} = {(x, y)}, x denotes an input sample and y denotes its true label; add a prompt template to the input sample x and convert the task into a cloze form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in;
S27: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function v: L → V; through the prompt-based pre-trained language model M, obtain for each label l a score for [MASK] being the corresponding label word v(l), where v(l) denotes the label word corresponding to label l and k is the length of the label word;
S28: a softmax layer predicts the class probability of [MASK] over the different labels l, and the emotion class of the input sample x is obtained from the class probability;
S29: establish the loss function of the small teacher model's output layer;
S210: repeat S27-S29 until the small teacher model converges; training ends and the trained small teacher model M_B is obtained.
Preferably, using the trained small teacher model M_B to predict all unlabeled samples x_u', obtaining the probability of each sample, and calculating the uncertainty of each sample's probability specifically comprises the following steps:
S31: feed all unlabeled samples x_u' into the trained small teacher model M_B to obtain the predicted probability distribution of each sample;
S32: calculate the uncertainty of each sample's probability from its predicted probability distribution, where |L| is the number of label classes in the classification task.
Preferably, the preset threshold, threshold, takes a value in a predetermined range.
Preferably, comparing the uncertainty with the preset threshold threshold and screening out the samples x_u'' whose probability is uncertain specifically means: if the uncertainty of a sample's probability is greater than threshold, the sample is taken as a sample x_u'' with high probability uncertainty.
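The uncertainty computation and screening described above can be sketched as follows; the normalized-entropy measure is an assumption, since the patent states only that the uncertainty is computed from the predicted probability distribution and the number of label classes |L|:

```python
# Sketch of S31-S32 and S4: per-sample uncertainty from the small teacher's
# probabilities and threshold-based screening. Normalized entropy is assumed.
import torch

def screen_uncertain(probs_small: torch.Tensor, threshold: float) -> torch.Tensor:
    # probs_small: (N, |L|) probability distributions predicted by M_B.
    num_labels = probs_small.size(1)
    entropy = -(probs_small * probs_small.clamp_min(1e-12).log()).sum(dim=1)
    uncertainty = entropy / torch.log(torch.tensor(float(num_labels)))   # in [0, 1]
    return uncertainty > threshold       # True for the samples x_u'' sent to M_L
```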
Preferably, feeding the samples x_u' into the trained small teacher model M_B to obtain the small teacher model's soft label P, feeding the samples x_u'' into the trained large teacher model M_L to obtain the large teacher model's soft label P', and combining the small teacher model's soft label P with the large teacher model's soft label P' to obtain the final soft label specifically comprises:
S51: feed the samples x_u' into the trained small teacher model M_B to obtain the small teacher model's soft label P;
S52: feed the samples x_u'' into the trained large teacher model M_L to obtain the large teacher model's soft label P';
S53: combine P and P' to obtain the final soft label.
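One plausible reading of this combination, shown only as a sketch (the patent's exact combination expression is not reproduced here), is to replace the small teacher's soft labels with the large teacher's for the screened samples x_u'':

```python
# Sketch of S51-S53: combine the two teachers' soft labels into the final soft label.
# The replace-on-uncertain rule below is an assumed reading of the combination.
import torch

def combine_soft_labels(soft_small: torch.Tensor,       # P:  (N, |L|) from M_B
                        soft_large: torch.Tensor,       # P': (M, |L|) from M_L
                        uncertain_mask: torch.Tensor):  # (N,) bool with M True entries
    final = soft_small.clone()
    final[uncertain_mask] = soft_large                  # samples x_u'' take M_L's output
    return final                                        # final soft label over D_u
```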
preferably, the using the unlabeled sample setD u And said soft label
Figure 740531DEST_PATH_IMAGE031
Distilling the student model to obtain a distilled student model, wherein the specific process is as follows:
s61: collecting unlabelled samplesD u As a training set of distillation student models, vectors passing through the student models are represented as
Figure 305504DEST_PATH_IMAGE032
Wherein g () represents a network function of the student model,A u for unlabeled sample setD u Corresponding word vector matrix, superscriptsThe model of the student is represented by,
Figure 282687DEST_PATH_IMAGE033
a learnable parameter representing a student model;
s62: establishing loss function of student model output layer
Figure 865984DEST_PATH_IMAGE034
WhereinnWhich is indicative of the size of the batch,
Figure 208104DEST_PATH_IMAGE035
representing the passage through the student modeliThe probability of prediction of a sample is,
Figure 260374DEST_PATH_IMAGE036
representing the final sample probability
Figure 510089DEST_PATH_IMAGE037
To middleiThe probability of prediction of a sample is,Tis a temperature parameter of the distillation model, D KL Representing the KL divergence loss function;
S63:
Figure 183778DEST_PATH_IMAGE038
sequentially passing through the linear layer and the softmax activation layer to obtain an unlabeled sample setD u Is output according to the probability
Figure 962379DEST_PATH_IMAGE039
W s Representing a weight matrix to be learned on a linear layer of the student model;
s64: using a loss function L KD Updating the learnable parameters of the student model;
s65: repeating S61-S64 until the loss function L KD And (5) converging to obtain a distilled student model.
Preferably, in the word-vector matrix A_u, each row is the word vector of one character of the input sample x_u', and the word vector of each character is obtained by training a word2vec or GloVe model.
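Putting steps S61-S65 together, one distillation epoch over D_u might look like the following sketch; the student is assumed to return logits, the soft labels are assumed to be the teachers' (already temperature-softened) probabilities, and the T^2 scaling is a common convention rather than something stated here:

```python
# Sketch of S62-S65: KL-divergence distillation loss and parameter updates.
import torch
import torch.nn.functional as F

def distill_epoch(student, loader, optimizer, T=2.0):
    for word_vectors, soft_labels in loader:             # batches of A_u and final soft labels
        log_p_student = F.log_softmax(student(word_vectors) / T, dim=-1)   # S63 output
        loss_kd = F.kl_div(log_p_student, soft_labels,                     # S62: D_KL term
                           reduction="batchmean") * (T * T)
        optimizer.zero_grad()
        loss_kd.backward()                                # S64: update learnable parameters
        optimizer.step()
    return float(loss_kd)                                 # S65: repeat epochs until convergence
```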
Preferably, the KL divergence loss function between two probability distributions p and q over the label classes is D_KL(p || q) = Σ_{l=1..|L|} p_l log(p_l / q_l), where |L| is the number of label classes in the classification task.
Compared with the prior art, the technical scheme of the invention has the following advantages:
according to the invention, the student model is distilled by establishing a large teacher model and a small teacher model, so that samples are first screened by the small teacher model and only the screened samples pass through the large teacher model; this effectively reduces the distillation time of the student model and the resource consumption. Meanwhile, because the large and small teacher models reduce resource consumption, a large number of unlabeled samples for the emotion classification task can be used, which improves the accuracy of classification and recognition.
Drawings
In order that the present disclosure may be more readily and clearly understood, the invention is described in further detail below with reference to embodiments and the accompanying drawings, in which:
FIG. 1 is the network structure of a prompt-based large-scale pre-trained language model;
FIG. 2 is a schematic diagram of the conventional single-teacher, single-student knowledge distillation method;
FIG. 3 is a schematic diagram of the knowledge distillation method based on the large and small teacher mechanism;
FIG. 4 shows the experimental results of the YELP and IMDB datasets on the BERT model in an embodiment of the invention;
FIG. 5 shows the experimental results of the YELP and IMDB datasets on the RoBERTa model in an embodiment of the invention.
Detailed Description
The present invention is further described below with reference to the accompanying drawings and specific examples so that those skilled in the art can better understand and practice it, but the examples are not intended to limit the invention.
In the optimization of models, a large model is often a single complex network or an ensemble of several networks and has good performance and generalization ability, while a small model has limited expressive power because of its small network size. Therefore, the knowledge learned by the large model (teacher model) can be used to guide the training of the small model (student model), so that the small model achieves performance comparable to the large model with a greatly reduced number of parameters, thereby compressing and accelerating the model; this process is called knowledge distillation.
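As a generic illustration of this idea (not code from the patent), the teacher's knowledge is transferred through a temperature-softened output distribution that the student then imitates:

```python
# Generic soft-label construction used in knowledge distillation.
import torch
import torch.nn.functional as F

def soft_label(teacher_logits: torch.Tensor, T: float = 2.0) -> torch.Tensor:
    # A larger temperature T smooths the distribution, exposing the teacher's
    # relative preferences between classes instead of only its top prediction.
    return F.softmax(teacher_logits / T, dim=-1)
```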
Compared with the conventional single-teacher, single-student knowledge distillation method shown in FIG. 2, the method of the invention shown in FIG. 3 additionally uses a large number of unlabeled samples and introduces two prompt-based teacher models, a large teacher model and a small teacher model; the output probability of the student model is also marked in the figure.
The few-sample emotion classification method based on large and small teacher knowledge distillation of the invention comprises the following steps.
S1: divide the samples into labeled samples x_u and unlabeled samples x_u'; collect a large number of unlabeled samples x_u' for the emotion classification task and build the labeled sample set D_l = {x_u} and the unlabeled sample set D_u = {x_u'}.
The labeled samples x_u are samples that carry labels and the unlabeled samples x_u' are samples without labels; the data contain a small number of labeled samples x_u and a large number of unlabeled samples x_u'.
S2: construct a large teacher model and a small teacher model, both of which are prompt-based pre-trained language models M (i.e., they use the prompt method); the number of parameters of the large teacher model is greater than that of the small teacher model, and in this embodiment it is much greater. Use the labeled sample set D_l to train the large teacher model, obtaining the trained large teacher model M_L, and use the labeled sample set D_l to train the small teacher model, obtaining the trained small teacher model M_B.
Using the labeled sample set D_l, the large teacher model and the small teacher model are trained separately; the training processes of the two teacher models are similar, and the specific steps are as follows.
S21: in the training set D_l = {x_u} = {(x, y)}, x denotes an input sample and y denotes its true label; add a prompt template to the input sample x and convert the task into a cloze form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in. The purpose is to let the prompt-based pre-trained language model M decide which word fills [MASK], thereby converting the classification task into a cloze task. The prompt template "It is [MASK]." is appended to the input text, [MASK] corresponds to the different labels of the classification task, and the new input is passed through the language model, which decides the word that fills [MASK] and thereby classifies the text.
S22: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function v: L → V that maps each task label to one or more words in the vocabulary of the prompt-based pre-trained language model M. For example, in a binary emotion classification task, class 0 corresponds to the vocabulary word "terrible" and class 1 corresponds to the vocabulary word "great". With P(x) as the input, the prompt-based pre-trained language model M produces, for each label l, a score for [MASK] being the corresponding label word v(l), where v(l) denotes the label word corresponding to label l and k is the length of the label word.
S23: a softmax layer predicts the class probability of [MASK] over the different labels l, and the emotion class of the input sample x is obtained from the class probability.
S24: establish the loss function of the large teacher model's output layer; in this embodiment the loss function is the cross-entropy between the true label y of a training sample and the predicted probability.
S25: repeat S22-S24 until the large teacher model converges; training ends and the trained large teacher model M_L is obtained.
S26: in the training set D_l = {x_u} = {(x, y)}, x denotes an input sample and y denotes its true label; add a prompt template to the input sample x and convert the task into a cloze form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in.
S27: take L as the label set of the classification task and V as the label-word set of the classification task, and construct a label mapping function v: L → V; through the prompt-based pre-trained language model M, obtain for each label l a score for [MASK] being the corresponding label word v(l), where v(l) denotes the label word corresponding to label l and k is the length of the label word.
S28: a softmax layer predicts the class probability of [MASK] over the different labels l, and the emotion class of the input sample x is obtained from the class probability.
S29: establish the loss function of the small teacher model's output layer.
S210: repeat S27-S29 until the small teacher model converges; training ends and the trained small teacher model M_B is obtained.
S3: use the trained small teacher model M_B to predict all unlabeled samples x_u', obtaining the probability of each sample, and calculate the uncertainty of each sample's probability.
S31: feed all unlabeled samples x_u' into the trained small teacher model M_B to obtain the predicted probability distribution of each sample.
S32: calculate the uncertainty of each sample's probability from its predicted probability distribution, where |L| is the number of label classes in the classification task; the uncertainty measures the quality of a sample's prediction probability.
S4: compare the uncertainty with the preset threshold, threshold, and screen out the samples whose probability is uncertain; the preset threshold takes a value in a predetermined range. If the uncertainty of a sample's probability is greater than threshold, the sample is taken as a sample with high probability uncertainty; an uncertainty greater than threshold indicates that the small teacher model's classification probability for the sample x_u' is not confident enough, and a new probability distribution needs to be obtained from the large teacher model.
S5: feed the samples x_u' into the trained small teacher model M_B to obtain the small teacher model's soft label P, feed the samples x_u'' into the trained large teacher model M_L to obtain the large teacher model's soft label P', and combine the small teacher model's soft label P with the large teacher model's soft label P' to obtain the final soft label.
S51: feed the samples x_u' into the trained small teacher model M_B to obtain the small teacher model's soft label P.
S52: feed the samples x_u'' into the trained large teacher model M_L to obtain the large teacher model's soft label P'.
S53: combine P and P' to obtain the final soft label.
S6: construct a student model; in this embodiment the student model is a small shallow neural network. Use the unlabeled sample set D_u and the final soft label to distill the student model and obtain the distilled student model.
S61: take the unlabeled sample set D_u as the training set for distilling the student model; its vector representation through the student model is obtained by applying the network function g() of the student model, with its learnable parameters, to A_u, the word-vector matrix corresponding to the unlabeled sample set D_u (the superscript s denotes the student model); for an unlabeled sample x_u' of length k with word-vector dimension d, A_u has dimensions k × d.
In the word-vector matrix A_u, each row is the word vector of one character of the input sample x_u', and the word vector of each character is obtained by training a word2vec or GloVe model.
S62: establish the loss function L_KD of the student model's output layer, i.e., the loss used when the teacher models distill the student model, where n denotes the batch size, the prediction probability of the i-th sample by the student model is compared with the prediction probability of the i-th sample in the final soft label, T is the temperature parameter of the distillation model (the larger T is, the smoother the softmax probability distribution, the larger its entropy and the more information it carries), and D_KL denotes the KL-divergence loss function.
The KL divergence loss function between two probability distributions p and q over the label classes is D_KL(p || q) = Σ_{l=1..|L|} p_l log(p_l / q_l), where |L| is the number of label classes in the classification task.
S63: the student representation is passed through a linear layer and a softmax activation layer in turn to obtain the probability output of the unlabeled sample set D_u, where W_s denotes the weight matrix to be learned in the student model's linear layer.
S64: update the learnable parameters of the student model using the loss function L_KD.
S65: repeat S61-S64 until the loss function L_KD converges, obtaining the distilled student model.
S7: use the distilled student model to perform classification prediction on the test set.
The invention has the beneficial effects that:
according to the invention, the student model is distilled by establishing a large teacher model and a small teacher model, so that samples are first screened by the small teacher model and only the screened samples pass through the large teacher model; this effectively reduces the distillation time of the student model and the resource consumption. Meanwhile, because the large and small teacher models reduce resource consumption, a large number of unlabeled samples for the emotion classification task can be used, which improves the accuracy of classification and recognition.
To further illustrate the beneficial effects of the invention, in this embodiment the test set is input into the trained student model to obtain prediction probabilities. The effect of the invention is analyzed from three aspects: (1) the accuracy of the classification results obtained by the student model on the test set; (2) the prediction time of the teacher models over all unlabeled samples during distillation; and (3) the reduction in the access rate to the large teacher model.
In this embodiment, the sentence-level YELP dataset (see Zhang X, Zhao J, LeCun Y., "Character-level Convolutional Networks for Text Classification", Advances in Neural Information Processing Systems, 2015, 28: 649-657) and the IMDB dataset are used. In the experiments, 8 samples balanced between positive and negative are selected from each dataset as the training set and another 8 as the validation set, and 500 positive and 500 negative samples are used as the test set. Furthermore, the number of unlabeled samples is 100,000 for the YELP dataset and 98,000 for the IMDB dataset.
To simulate the knowledge distillation process of the large-and-small-teacher mechanism, the invention sets up a large teacher model and a small teacher model under both a BERT model and a RoBERTa model, denoted BERT-large (the large teacher under BERT), BERT-base (the small teacher under BERT), RoBERTa-large (the large teacher under RoBERTa) and RoBERTa-base (the small teacher under RoBERTa). When training the teacher models, the label words are "terrible" and "great" respectively; the batch size is set to 4 or 8; the optimizer is AdamW with a learning rate chosen from {1e-5, 2e-5, 5e-5} and weight decay 1e-3, the batch size and learning rate being determined by grid search over the hyper-parameters. The student model is a CNN using 3 convolution kernel sizes, (3, 50), (4, 50) and (5, 50), with 100 kernels of each size; each CNN uses glove.6B.50d word vectors; the batch size is set to 128; the optimizer is Adam with learning rate 1e-3 and weight decay 1e-5. To prevent overfitting during training of the neural network model, the dropout rate is set to 0.5.
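A CNN student with this configuration could be sketched as follows (a hypothetical reconstruction of the described architecture, not the patent's code; the number of classes and the GloVe loading are left as assumptions):

```python
# Hypothetical CNN student matching the described setup: kernel sizes (3,50),
# (4,50), (5,50) with 100 kernels each, 50-dimensional word vectors, dropout 0.5.
import torch
import torch.nn as nn

class CNNStudent(nn.Module):
    def __init__(self, num_classes: int = 2, embed_dim: int = 50):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(1, 100, kernel_size=(h, embed_dim)) for h in (3, 4, 5)
        )
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(3 * 100, num_classes)     # W_s: linear layer before softmax

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        # word_vectors: (batch, seq_len, 50), i.e. one word-vector matrix A_u per sample
        x = word_vectors.unsqueeze(1)                                    # (B, 1, k, d)
        pooled = [torch.relu(c(x)).squeeze(3).max(dim=2).values for c in self.convs]
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))          # logits over classes

# optimizer = torch.optim.Adam(CNNStudent().parameters(), lr=1e-3, weight_decay=1e-5)
```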
The experimental results of the YELP and IMDB datasets on the BERT model are shown in FIG. 4, where the uncertainty threshold is set to 0.85 on both the YELP and IMDB datasets; the experimental results on the RoBERTa model are shown in FIG. 5, where the uncertainty threshold is set to 0.6 on the YELP dataset and 0.9 on the IMDB dataset. In FIGS. 4 and 5, Fine-tuning denotes the standard fine-tuned pre-trained language model, LM-BFF denotes the prompt-based fine-tuned pre-trained language model, and "LM-BFF distills CNN" denotes distilling the CNN model with the prompt-based pre-trained language model. Because few-sample learning is sensitive to the data and results obtained from different data splits differ greatly, 5 different random seeds are used to sample different training and validation sets in order to alleviate this problem, and the classification accuracy is reported as "mean of 5 results (variance of 5 results)".
As can be seen from the classification accuracy in FIG. 4, compared with the distillation performance of the BERT-large model, the method of the invention improves by 91.13% - 90.64% = 0.49% on the YELP dataset and by 84.14% - 84.08% = 0.06% on the IMDB dataset. Moreover, compared with the distillation performance of the BERT-base model, the method achieves 91.13% > 87.18% on the YELP dataset and 84.14% > 84.08% on the IMDB dataset, so its results are far better than those of BERT-base. As can be seen from the prediction time in FIG. 4, compared with the BERT-large teacher model, the distillation time of the method is reduced by 91.93 s out of 163.27 s, i.e., 56.31%, on the YELP dataset and by 962.37 s out of 1598.34 s, i.e., 60.21%, on the IMDB dataset. Meanwhile, the simulation program counts the proportion by which the method reduces the access rate to the large teacher model (relative to passing every unlabeled sample through the large teacher model): compared with BERT-large, the access rate is reduced by 74.40% on the YELP dataset and by 72.42% on the IMDB dataset.
As can be seen from the classification accuracy in FIG. 5, compared with the distillation performance of the RoBERTa-large model, the method of the invention improves by 93.16% - 92.80% = 0.36% on the YELP dataset and by 87.84% - 87.64% = 0.2% on the IMDB dataset. Moreover, compared with the distillation performance of the RoBERTa-base model, the method achieves 93.16% > 91.82% on the YELP dataset and 87.84% > 87.64% on the IMDB dataset, so its results are better than those of RoBERTa-base. As can be seen from the prediction time in FIG. 5, compared with the RoBERTa-large teacher model, the distillation time of the method is reduced by 75.59 s out of 163.32 s, i.e., 46.28%, on the YELP dataset and by 912.65 s out of 1594.93 s, i.e., 57.22%, on the IMDB dataset. Meanwhile, the simulation program counts the proportion by which the method reduces the access rate to the large teacher model (relative to passing every unlabeled sample through the large teacher model): compared with RoBERTa-large, the access rate is reduced by 84.65% on the YELP dataset and by 75.56% on the IMDB dataset.
The access frequency is proportional to the resources occupied. All unlabeled samples need to pass through the small teacher model, and only the small portion of samples screened out by the threshold pass through the large teacher model. Compared with passing all unlabeled samples through the large teacher model, this greatly reduces resource consumption; since the small teacher model has relatively few parameters and occupies few computing resources, only the reduction in the access rate to the large teacher model is analyzed in the simulation.
The simulation results further show that the method can effectively reduce the frequency of accessing the large teacher model and the distillation time when training the student model, and improve the accuracy of classification and recognition while reducing resource consumption.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to those skilled in the art in light of the above description; it is neither necessary nor possible to exhaustively list all embodiments, and obvious variations or modifications may be made without departing from the spirit or scope of the invention.

Claims (10)

1. A few-sample emotion classification method based on large and small teacher knowledge distillation, characterized by comprising the following steps:
S1: dividing the samples into labeled samples x_u and unlabeled samples x_u', collecting a large number of unlabeled samples x_u' for the emotion classification task, and building the labeled sample set D_l = {x_u} and the unlabeled sample set D_u = {x_u'};
S2: constructing a large teacher model and a small teacher model, using the labeled sample set D_l to train the large teacher model to obtain the trained large teacher model M_L, and using the labeled sample set D_l to train the small teacher model to obtain the trained small teacher model M_B;
S3: using the trained small teacher model M_B to predict all unlabeled samples x_u', obtaining the probability of each sample, and calculating the uncertainty of each sample's probability;
S4: comparing the uncertainty with a preset threshold, denoted threshold, and screening out the samples x_u'' whose probability is uncertain;
S5: feeding the samples x_u' into the trained small teacher model M_B to obtain the small teacher model's soft label P, feeding the samples x_u'' into the trained large teacher model M_L to obtain the large teacher model's soft label P', and combining the small teacher model's soft label P with the large teacher model's soft label P' to obtain the final soft label;
S6: constructing a student model, and distilling the student model using the unlabeled sample set D_u and the final soft label to obtain a distilled student model;
S7: using the distilled student model to perform classification prediction on a test set.
2. The few-sample emotion classification method based on large and small teacher knowledge distillation according to claim 1, characterized in that: the large teacher model and the small teacher model are both prompt-based pre-trained language models M, and the number of parameters of the large teacher model is greater than that of the small teacher model.
3. The few-sample emotion classification method based on large and small teacher knowledge distillation according to claim 2, characterized in that using the labeled sample set D_l to train the large teacher model to obtain the trained large teacher model M_L specifically comprises the following steps:
S21: in the training set D_l = {x_u} = {(x, y)}, x denotes an input sample and y denotes its true label; adding a prompt template to the input sample x and converting the task into a cloze form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in, P(x) is the input to the language model, "It is [MASK]." is the prompt template appended to the input text, [CLS] marks the beginning of the sentence, x is the sentence whose class is predicted, and [SEP] marks the separation between sentences;
S22: taking L as the label set of the classification task and V as the label-word set of the classification task, constructing a label mapping function v: L → V, and obtaining through the prompt-based pre-trained language model M, for each label l, a score for [MASK] being the corresponding label word v(l), where v(l) denotes the label word corresponding to label l and k is the length of the label word;
S23: predicting through a softmax layer the class probability of [MASK] over the different labels l, and obtaining the emotion class of the input sample x from the class probability;
S24: establishing the loss function of the large teacher model's output layer;
S25: repeating S22-S24 until the large teacher model converges, ending the training and obtaining the trained large teacher model M_L;
and using the labeled sample set D_l to train the small teacher model to obtain the trained small teacher model M_B specifically comprises the following steps:
S26: in the training set D_l = {x_u} = {(x, y)}, x denotes an input sample and y denotes its true label; adding a prompt template to the input sample x and converting the task into a cloze form:
P(x) = [CLS] x It is [MASK]. [SEP], where [MASK] is the word to be filled in;
S27: taking L as the label set of the classification task and V as the label-word set of the classification task, constructing a label mapping function v: L → V, and obtaining through the prompt-based pre-trained language model M, for each label l, a score for [MASK] being the corresponding label word v(l), where v(l) denotes the label word corresponding to label l and k is the length of the label word;
S28: predicting through a softmax layer the class probability of [MASK] over the different labels l, and obtaining the emotion class of the input sample x from the class probability;
S29: establishing the loss function of the small teacher model's output layer;
S210: repeating S27-S29 until the small teacher model converges, ending the training and obtaining the trained small teacher model M_B.
4. The few-sample emotion classification method based on large and small teacher knowledge distillation according to claim 3, characterized in that using the trained small teacher model M_B to predict all unlabeled samples x_u', obtaining the probability of each sample, and calculating the uncertainty of each sample's probability specifically comprises the following steps:
S31: feeding all unlabeled samples x_u' into the trained small teacher model M_B to obtain the predicted probability distribution of each sample;
S32: calculating the uncertainty of each sample's probability from its predicted probability distribution, where |L| is the number of label classes in the classification task.
5. The few-sample emotion classification method based on large and small teacher knowledge distillation according to claim 1, characterized in that: the preset threshold, threshold, takes a value in a predetermined range.
6. The few-sample emotion classification method based on large and small teacher knowledge distillation according to claim 1, characterized in that comparing the uncertainty with the preset threshold threshold and screening out the samples x_u'' whose probability is uncertain specifically means: if the uncertainty of a sample's probability is greater than threshold, the sample is taken as a sample x_u'' with high probability uncertainty.
7. The few-sample emotion classification method based on large and small teacher knowledge distillation according to claim 3, characterized in that feeding the samples x_u' into the trained small teacher model M_B to obtain the small teacher model's soft label P, feeding the samples x_u'' into the trained large teacher model M_L to obtain the large teacher model's soft label P', and combining the small teacher model's soft label P with the large teacher model's soft label P' to obtain the final soft label specifically comprises:
S51: feeding the samples x_u' into the trained small teacher model M_B to obtain the small teacher model's soft label P, where T is the temperature parameter of the distillation model;
S52: feeding the samples x_u'' into the trained large teacher model M_L to obtain the large teacher model's soft label P';
S53: combining P and P' to obtain the final soft label.
8. The few-sample emotion classification method based on large and small teacher knowledge distillation according to any one of claims 1 to 7, characterized in that distilling the student model using the unlabeled sample set D_u and the final soft label to obtain the distilled student model specifically comprises:
S61: taking the unlabeled sample set D_u as the training set for distilling the student model, its vector representation through the student model being obtained by applying the network function g() of the student model, with its learnable parameters, to A_u, the word-vector matrix corresponding to the unlabeled sample set D_u, where the superscript s denotes the student model;
S62: establishing the loss function L_KD of the student model's output layer, where n denotes the batch size, the prediction probability of the i-th sample by the student model is compared with the prediction probability of the i-th sample in the final soft label, T is the temperature parameter of the distillation model, and D_KL denotes the KL-divergence loss function;
S63: passing the student representation through a linear layer and a softmax activation layer in turn to obtain the probability output of the unlabeled sample set D_u, where W_s denotes the weight matrix to be learned in the student model's linear layer;
S64: updating the learnable parameters of the student model using the loss function L_KD;
S65: repeating S61-S64 until the loss function L_KD converges, obtaining the distilled student model.
9. The few-sample emotion classification method based on large and small teacher knowledge distillation according to claim 8, characterized in that: in the word-vector matrix A_u, each row is the word vector of one character of the input sample x_u', and the word vector of each character is obtained by training a word2vec or GloVe model.
10. The few-sample emotion classification method based on large and small teacher knowledge distillation according to claim 8, characterized in that: the KL divergence loss function between two probability distributions p and q over the label classes is D_KL(p || q) = Σ_{l=1..|L|} p_l log(p_l / q_l), where |L| is the number of label classes in the classification task.
CN202210653730.6A 2022-06-10 2022-06-10 Little sample emotion classification method based on size instructor knowledge distillation Active CN114722805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210653730.6A CN114722805B (en) 2022-06-10 2022-06-10 Little sample emotion classification method based on size instructor knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210653730.6A CN114722805B (en) 2022-06-10 2022-06-10 Little sample emotion classification method based on size instructor knowledge distillation

Publications (2)

Publication Number Publication Date
CN114722805A CN114722805A (en) 2022-07-08
CN114722805B true CN114722805B (en) 2022-08-30

Family

ID=82232411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210653730.6A Active CN114722805B (en) 2022-06-10 2022-06-10 Little sample emotion classification method based on size instructor knowledge distillation

Country Status (1)

Country Link
CN (1) CN114722805B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116186200B (en) * 2023-01-19 2024-02-09 北京百度网讯科技有限公司 Model training method, device, electronic equipment and storage medium
CN116861302B (en) * 2023-09-05 2024-01-23 吉奥时空信息技术股份有限公司 Automatic case classifying and distributing method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762144B (en) * 2021-09-05 2024-02-23 东南大学 Deep learning-based black smoke vehicle detection method
CN113886562A (en) * 2021-10-02 2022-01-04 智联(无锡)信息技术有限公司 AI resume screening method, system, equipment and storage medium
CN114168844A (en) * 2021-11-11 2022-03-11 北京快乐茄信息技术有限公司 Online prediction method, device, equipment and storage medium
CN114283402B (en) * 2021-11-24 2024-03-05 西北工业大学 License plate detection method based on knowledge distillation training and space-time combined attention

Also Published As

Publication number Publication date
CN114722805A (en) 2022-07-08

Similar Documents

Publication Publication Date Title
CN108984526B (en) Document theme vector extraction method based on deep learning
CN111177374B (en) Question-answer corpus emotion classification method and system based on active learning
CN109376242B (en) Text classification method based on cyclic neural network variant and convolutional neural network
CN110188358B (en) Training method and device for natural language processing model
CN109766277B (en) Software fault diagnosis method based on transfer learning and DNN
CN114722805B (en) Little sample emotion classification method based on size instructor knowledge distillation
US11900250B2 (en) Deep learning model for learning program embeddings
CN111914885A (en) Multitask personality prediction method and system based on deep learning
CN113128233B (en) Construction method and system of mental disease knowledge map
CN111858896A (en) Knowledge base question-answering method based on deep learning
CN115309910B (en) Language-text element and element relation joint extraction method and knowledge graph construction method
CN113988079A (en) Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method
CN115270752A (en) Template sentence evaluation method based on multilevel comparison learning
CN116737581A (en) Test text generation method and device, storage medium and electronic equipment
CN114722198A (en) Method, system and related device for determining product classification code
EP3977392A1 (en) Method for training a discriminator
Arifin et al. Automatic essay scoring for Indonesian short answers using siamese Manhattan long short-term memory
CN113283605B (en) Cross focusing loss tracing reasoning method based on pre-training model
Lin et al. Robust educational dialogue act classifiers with low-resource and imbalanced datasets
CN114997175A (en) Emotion analysis method based on field confrontation training
CN113821571A (en) Food safety relation extraction method based on BERT and improved PCNN
LU504829B1 (en) Text classification method, computer readable storage medium and system
CN116205217B (en) Small sample relation extraction method, system, electronic equipment and storage medium
KR102535417B1 (en) Learning device, learning method, device and method for important document file discrimination
CN114218923B (en) Text abstract extraction method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant