CN115840820A - Small sample text classification method based on domain template pre-training - Google Patents

Small sample text classification method based on domain template pre-training Download PDF

Info

Publication number
CN115840820A
CN115840820A (application number CN202211598846.0A)
Authority
CN
China
Prior art keywords
training
template
data
target task
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211598846.0A
Other languages
Chinese (zh)
Inventor
Wang Ting (王廷)
Jia Chenyang (贾晨阳)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202211598846.0A priority Critical patent/CN115840820A/en
Publication of CN115840820A publication Critical patent/CN115840820A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a small sample text classification method based on domain template pre-training. A data set related to the domain of the target task is used to construct templates, and the constructed data are then used to further pre-train a pre-trained language model. A mixed template is built for the target task data set, the target data are preprocessed, and the further pre-trained model is trained and verified on the target task to obtain a predicted word, which a label word mapper maps to the final target label. Compared with the prior art, the method trains faster, places lower demands on hardware performance, makes better use of the pre-trained language model, achieves better classification with less data, improves the classification accuracy of the target task, and provides technical support for technical development in related fields.

Description

Small sample text classification method based on domain template pre-training
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a small sample text classification strategy based on domain template pre-training and improved prompt learning.
Background
With the continuous development of natural language processing, model algorithms for text classification have diversified, from probability-based machine learning models to deep learning models built from deep neural networks. Although the classification accuracy of these methods has gradually improved, the models are generally trained on a task data set from scratch, which requires a large amount of labeled data, high-performance processors, and long training times; in addition, trained models adapt poorly to new tasks, for which data labeling and model training usually have to be repeated. Small sample learning based on pre-trained models has developed rapidly in recent years, can alleviate these problems well, and is therefore of research value when applied to text classification. A pre-trained model acquires general language representations and initialization parameters from massive unlabeled corpora, after which good performance on a target task can be achieved with very little training data.
At present, text classification methods in natural language processing fall into two broad categories: methods based on deep learning models and methods based on pre-trained language models. Classical deep learning models include recurrent neural networks (RNN) (Mikolov T, Karafiát M, Burget L, et al. Recurrent neural network based language model [C]. Interspeech, 2010), long short-term memory networks (LSTM) (Zhang Yanbo. Research on Text Classification Method Based on LSTM Neural Network Model. 2021 IEEE Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC), 2021: 1019-1022), and the text convolutional neural network (TextCNN) (Kim Y. Convolutional Neural Networks for Sentence Classification [C]. Empirical Methods in Natural Language Processing, 2014). Given a large labeled training corpus, these models can repeatedly adjust their parameters during training and achieve good classification results. However, deep learning methods all train the model from scratch and need a large training set to establish the mathematical mapping between the input X and the output Y. In practical application scenarios, privacy and security concerns or the cost of collection and labeling often make a large amount of data unavailable, or limited computer hardware makes training a deep model difficult, so the performance of such methods is unsatisfactory in resource-constrained situations. The Transformer (Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need [J]. Neural Information Processing Systems, 2017, (30): 6000-6010) and BERT (Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [C]. NAACL, 2019) introduced pre-trained language models and accelerated a new round of development in natural language processing. Methods based on pre-trained language models divide into fine-tuning strategies, which adapt the model to the task, and prompt-learning strategies, which adapt the task to the model. In the pre-train-and-fine-tune paradigm, a training objective is designed for the downstream task on top of the pre-trained language model, and fine-tuning transfers the semantic knowledge of the corpus and the initialization parameters of the pre-trained model so that it suits various downstream tasks. Prompt learning based on a pre-trained model can fully exploit the potential of the pre-trained language model: a prompt description is added when the data are reconstructed, the task is converted into the cloze (fill-in-the-blank) task that the pre-trained language model already knows, and no new classifier needs to be designed; only different prompts need to be designed to adapt the target task to the pre-trained language model and obtain a good classification effect.
Although natural language processing is developing rapidly and many excellent algorithms exist for text classification, several problems remain unsolved. For example, the designs of pre-trained language models and fine-tuning methods are growing ever more complex; when the domain of the target task differs too much from the domain of the pre-training corpus, the pre-trained language model has difficulty learning domain-specific knowledge; there is a structural mismatch between the input/output of the pre-trained language model and that of the target task; and data preprocessing may lose semantic information or retain redundant information. How to acquire domain information reasonably and fully improve the effect of a pre-trained model on the target task therefore remains one of the key problems to be studied in small sample text classification.
Disclosure of Invention
The invention aims to provide a small sample text classification method based on domain template pre-training that addresses the defects of the prior art. The method combines domain template pre-training with prompt learning for the small sample text classification task: templates are constructed from a data set related to the target domain, the pre-trained language model is trained on an MLM task with the constructed data to acquire domain information, and mixed template construction and multi-label mapping are then performed so that better classification is achieved with less data, shortening the training time of the target task and lowering the demands on computer hardware. The method is simple and convenient, trains faster, requires less hardware performance, makes better use of the pre-trained language model, greatly improves the classification accuracy of the target task, and provides technical support for technical development in related fields.
The specific technical scheme for realizing the purpose of the invention is as follows: a small sample text classification method based on domain template pre-training, characterized in that domain templates are constructed to further pre-train a pre-trained language model and the target task is handled with improved prompt learning; the method mainly comprises the following steps:
Step 1: construct a prompt template from a domain data set related to the target task. Given input data x, the prompt function fprompt adds the prompt information and constructs x' as defined by the following formula (a):
x' = fprompt(x)    (a)
wherein x is the text data of the domain data set; fprompt is the template construction function; and x' is the domain data after template construction.
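For illustration, a minimal Python sketch of the template construction in step 1 is given below; the function name f_prompt, the mask token string, and the example sentences are assumptions made for illustration rather than values fixed by the invention.

```python
# Minimal sketch of step 1: wrap each sentence of the domain data set with a
# cloze-style prompt so that it can be used for further MLM pre-training.
MASK_TOKEN = "[MASK]"  # assumed to match the mask token of the chosen pre-trained model

def f_prompt(x: str) -> str:
    """Template construction function fprompt: x -> x' (formula (a))."""
    # Mirrors the example in the detailed description: "It is [Mask]. I like this toy."
    return f"It is {MASK_TOKEN}. {x}"

domain_texts = ["I like this toy.", "The battery drains far too quickly."]
prompted = [f_prompt(x) for x in domain_texts]
# prompted[0] == "It is [MASK]. I like this toy."
```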
Step 2: use the data constructed by the template in step 1 to further pre-train the selected pre-trained language model on the MLM task, so that the model acquires domain information related to the target task.
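A hedged sketch of the further pre-training in step 2, using the Hugging Face transformers and datasets libraries, follows; the checkpoint name, output directory, and hyperparameters are placeholders chosen for illustration, not values prescribed by the invention.

```python
# Sketch of step 2: further pre-train an MLM-capable checkpoint on the
# template-constructed domain texts so it absorbs domain information.
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bert-base-uncased"                      # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

prompted = ["It is [MASK]. I like this toy.", "It is [MASK]. The battery drains quickly."]
dataset = Dataset.from_dict({"text": prompted})
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
                      batched=True, remove_columns=["text"])

# Standard random masking is used here as a simplification of the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="further_pretrained", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=5e-5)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()

model.save_pretrained("further_pretrained")           # reused for the target task in step 5
tokenizer.save_pretrained("further_pretrained")
```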
Step 3: take the same number of data samples for each category of the target task; truncate long texts in the data set at the head and tail so that the summarizing semantic information at both ends is kept, and pad short texts dynamically to reduce useless padding.
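A minimal sketch of the preprocessing in step 3 follows; the 128-token budget and the head/tail split are example values assumed for illustration, not figures given in the invention.

```python
# Sketch of step 3: class-balanced sampling is assumed to have been done already;
# only head-and-tail truncation and dynamic (batch-local) padding are shown here.
def head_tail_truncate(token_ids, max_len=128, head_len=64):
    """Keep the first `head_len` and the last `max_len - head_len` tokens of a long text."""
    if len(token_ids) <= max_len:
        return list(token_ids)
    tail_len = max_len - head_len
    return list(token_ids[:head_len]) + list(token_ids[-tail_len:])

def dynamic_pad(batch, pad_id=0):
    """Pad each sequence only up to the longest sequence in this batch."""
    longest = max(len(ids) for ids in batch)
    return [list(ids) + [pad_id] * (longest - len(ids)) for ids in batch]

batch = [head_tail_truncate(list(range(300))), list(range(20))]
padded = dynamic_pad(batch)      # both rows now share the batch-local length of 128
```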
Step 4: construct a prompt mixed template for the target data set by combining a natural language template that humans can understand with an encoded (soft) template that the machine can understand. Using the training data Xtarget of the target task, the template is designed as {soft: this} topic {soft: is} {mask} {Xtarget}, where soft is a tunable machine-understood template slot initialized according to the task, topic is a natural language template converted into the corresponding Embedding form, mask is the value to be predicted, and Xtarget is the original input sequence.
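The mixed template of step 4 can be sketched in PyTorch as below: the two {soft: ...} slots are trainable vectors, while "topic", the mask token, and the input Xtarget pass through the model's ordinary word embeddings. The class name, dimensions, and initialization are assumptions for illustration, not the exact implementation of the invention.

```python
# Sketch of step 4: a mixed template {soft: this} topic {soft: is} {mask} {Xtarget}
# built from trainable soft slots plus embedded natural-language tokens.
import torch
import torch.nn as nn

class MixedTemplate(nn.Module):
    def __init__(self, word_embeddings: nn.Embedding, tokenizer, n_soft: int = 2):
        super().__init__()
        self.word_embeddings = word_embeddings          # embeddings of the pre-trained model
        self.tokenizer = tokenizer
        hidden = word_embeddings.embedding_dim
        # The soft slots ("this", "is") are free parameters tuned on the target task.
        self.soft = nn.Parameter(torch.randn(n_soft, hidden) * 0.02)

    def _embed(self, text: str) -> torch.Tensor:
        ids = torch.tensor(self.tokenizer.encode(text, add_special_tokens=False))
        return self.word_embeddings(ids)

    def forward(self, x_target: str, topic: str = "topic") -> torch.Tensor:
        mask = self._embed(self.tokenizer.mask_token)
        # Concatenation mirrors {soft: this} topic {soft: is} {mask} {Xtarget}.
        return torch.cat([self.soft[:1], self._embed(topic),
                          self.soft[1:], mask, self._embed(x_target)], dim=0)

# Example wiring (names assumed): template = MixedTemplate(model.get_input_embeddings(), tokenizer)
```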
Step 5: use the mixed template constructed from the target task data set in step 4 to train and predict with the further pre-trained language model generated in step 2, adjusting parameters such as the learning rate. The probability of the token at the Mask position predicted by the pre-trained language model is given by the following formula (b):
Yf = exp(P(Y | X')) / ( exp(P(Y | X')) + Σ_{Y'∈Z(X), Y'≠Y} exp(P(Y' | X')) )    (b)
wherein X' is the input data constructed by the template; Yf is the final output probability; Y is the current predicted word; Z(X) is the answer space; and Y' ranges over the answer-space words other than the current predicted word.
Step 6: obtain the prediction answer with the maximum probability through the argmax function, and then, according to the answer space Z, use the label word mapper to obtain the output label required by the final target task with the following formula (c):
Ylabel = Z( argmax_{Y∈Z(X)} P(Yf | X') )    (c)
wherein Ylabel is the final output label; P(Yf | X') is the probability of the predicted word; and Z is the answer label mapper, which maps the predicted word to the final output label.
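Steps 5 and 6 can be sketched as follows with the Hugging Face transformers library: the MLM logits at the Mask position are restricted to the answer space Z, softmax-normalized as in formula (b), and the argmax word is mapped to the task label as in formula (c). The checkpoint path, the example answer space, and the example sentence are placeholders, not values fixed by the invention.

```python
# Sketch of steps 5-6: predict the Mask-position word over the answer space
# and map it to the final label with the label word mapper Z.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint = "further_pretrained"              # assumed output of the step-2 pre-training
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

answer_space = {"wonderful": "positive", "great": "positive",
                "bad": "negative", "terrible": "negative"}     # label word mapper Z

text = "It is [MASK]. I like this toy."
inputs = tokenizer(text, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]               # scores over the whole vocabulary

answer_ids = [tokenizer.convert_tokens_to_ids(w) for w in answer_space]
probs = torch.softmax(logits[answer_ids], dim=-1)              # formula (b), restricted to Z(X)
predicted_word = list(answer_space)[int(probs.argmax())]       # argmax in formula (c)
label = answer_space[predicted_word]                           # final output label Ylabel
```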
Compared with the prior art, the invention has the following remarkable technical progress and beneficial technical effects:
1) The method constructs prompt information from domain data and further pre-trains the pre-trained language model, so that semantic information and domain knowledge related to the target field are fully acquired.
2) The method modifies the target task to adapt it to the pre-trained language model; when the target task is templated, a mixed-template mode is used, which exploits the different advantages of prompt templates and reduces the cost of template construction.
3) In small sample classification, the invention achieves higher accuracy and shorter training time than deep learning methods that use larger data sets and than pre-training-plus-fine-tuning methods.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention will be described and illustrated in further detail with reference to specific embodiments.
Referring to FIG. 1, the present invention performs small sample text classification through the following steps:
Step 1: construct a prompt template from a domain data set related to the target task. Given input data x, the prompt function fprompt adds the prompt information and constructs x'. For example, if the input sentence is "I like this toy", the template-constructed form is "It is [Mask]. I like this toy".
Step 2: use the data constructed by the template in step 1 to further pre-train the selected pre-trained language model on the MLM task, so that the model acquires domain information related to the target task.
Step 3: take the same number of data samples for each category of the target task; truncate long texts in the data set at the head and tail so that the summarizing semantic information at both ends is kept, and pad short texts dynamically to reduce useless padding.
Step 4: construct a prompt mixed template for the target data set by combining a natural language template that humans can understand with an encoded (soft) template that the machine can understand. Using the training data Xtarget of the target task, the template is designed as {soft: this} topic {soft: is} {mask} {Xtarget}, where soft is a tunable machine-understood template slot initialized according to the task, topic is a natural language template converted into the corresponding Embedding form, mask is the value to be predicted, and Xtarget is the original input sequence.
Step 5: use the template constructed from the target task data set in step 4 to train and predict with the further pre-trained language model generated in step 2, adjusting parameters such as the learning rate: the warm-up steps are set to 0.1 of the total number of steps, the model update rate (learning rate) to 0.00002, and the weight decay to 0.01. The probability of the token at the Mask position predicted by the pre-trained language model is calculated by the following formula (b):
Yf = exp(P(Y | X')) / ( exp(P(Y | X')) + Σ_{Y'∈Z(X), Y'≠Y} exp(P(Y' | X')) )    (b)
wherein X' is the input data constructed by the template; Yf is the final output probability; Y is the current predicted word; Z(X) is the answer space; and Y' ranges over the answer-space words other than the current predicted word.
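The training settings named above (warm-up = 0.1 of the total steps, update rate 0.00002, weight decay 0.01) correspond to the following sketch using AdamW with a linear warm-up schedule; the total step count and the checkpoint path are assumptions for illustration.

```python
# Sketch of the step-5 optimization settings: AdamW with linear warm-up.
import torch
from transformers import AutoModelForMaskedLM, get_linear_schedule_with_warmup

model = AutoModelForMaskedLM.from_pretrained("further_pretrained")  # assumed step-2 output
total_steps = 1000                                                  # assumed: epochs * batches per epoch

optimizer = torch.optim.AdamW(model.parameters(), lr=0.00002, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),   # warm-up steps = 0.1 of the total step count
    num_training_steps=total_steps,
)
# Inside the training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```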
Step 6: obtain the prediction answer with the maximum probability through the argmax function, and then, according to the answer space Z, use the label word mapper to obtain the output label required by the final target task with the following formula (c):
Ylabel = Z( argmax_{Y∈Z(X)} P(Y | X') )    (c)
wherein Ylabel is the final output label; P(Y | X') is the probability of the predicted word; and Z is the answer label mapper.
The predicted words are mapped to the final output labels; for example, in the binary case the answer space and the label words may be (positive: wonderful, great, interesting, ...; negative: bad, boring, terrible, ...).
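Because each class is associated with several answer words, the mapping can aggregate their probabilities. A minimal sketch is given below; the word lists, the aggregation by summation, and the placeholder checkpoint are assumptions for illustration.

```python
# Sketch of the label word mapper with several answer words per class:
# a class score is the sum of the Mask-position probabilities of its words.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # placeholder checkpoint
label_words = {"positive": ["wonderful", "great", "interesting"],
               "negative": ["bad", "boring", "terrible"]}

def map_to_label(probs_over_vocab: torch.Tensor, tokenizer, label_words) -> str:
    scores = {}
    for label, words in label_words.items():
        ids = [tokenizer.convert_tokens_to_ids(w) for w in words]
        scores[label] = float(probs_over_vocab[ids].sum())
    return max(scores, key=scores.get)        # class whose answer words are most probable

# Dummy usage: in the pipeline these probabilities come from the MLM's Mask position.
uniform = torch.full((tokenizer.vocab_size,), 1.0 / tokenizer.vocab_size)
print(map_to_label(uniform, tokenizer, label_words))
```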
According to the invention, the domain data set related to the target task is used to construct templates and to further pre-train the pre-trained language model, so that the model learns domain information related to the target task and the gap between the pre-trained language model and the target task is greatly reduced. The template-constructed Prompt method builds an input/output structure similar to the one used when the language model was pre-trained, so the target task adapts to the pre-trained language model and its potential is fully exploited; mixed templates built from human-understandable natural language and machine-understood encoded slots enrich the optimizable embedded words and improve the representational capability of the template.
The above description is only exemplary of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements and the like that are within the spirit and principle of the present invention should be included in the present invention.

Claims (5)

1. A small sample text classification method based on domain template pre-training, characterized in that a domain data set related to the target task is used to further train the pre-trained language model, and small sample text classification is carried out through data preprocessing, parameter processing, a mixed template and multi-label mapping, the method specifically comprising the following steps:
1) Constructing a prompt template with a data set related to the target task domain to obtain domain data;
2) Further pre-training the selected pre-trained language model on the domain data with MLM as the training task to generate a further pre-trained language model;
3) Performing class-balanced sampling on the training data set, truncating long texts at the head and tail to the same length, and dynamically padding short texts;
4) Constructing a prompt mixed template for the target data set with a mixed-template construction method that combines a discrete template and a continuous template;
5) Training and predicting the target task with the generated further pre-trained language model, adjusting the learning rate parameters to obtain the predicted answer;
6) According to the predicted answer, using the multi-label mapper to map the word predicted by the model, via the answer space, to the actual label of the target task, obtaining the final output label and thereby realizing small sample text classification.
2. The domain template pre-trained small sample classification method according to claim 1, wherein step 1) constructs a prompt template using a data set related to the target task domain; if the input data is x, f is used as the prompt function for adding prompt information, and x' is constructed as defined by the following formula (a):
x' = f(x)    (a);
wherein x is the text data of the domain data set; f is the template construction function; and x' is the domain data after template construction.
3. The domain template pre-trained small sample text classification method according to claim 1, wherein step 4) constructs a prompt mixed template for the target data set using a natural language template understood by humans and an encoded template understood by the machine; using the training data Xtarget of the target task, the template is designed as {soft: this} topic {soft: is} {mask} {Xtarget}, wherein soft is a tunable machine-understood template slot, initialized according to the task; topic is a natural language template converted into the corresponding Embedding form; mask is the value to be predicted; and Xtarget is the original input sequence.
4. The method for classifying small samples based on domain template pre-training as claimed in claim 1, wherein step 5) trains and predicts the target task with the generated further pre-trained language model, the warm-up steps of the training being set to 0.1 of the total number of steps, the model update rate to 0.00002, and the weight decay to 0.01; the probability of the token at the Mask position predicted by the pre-trained language model is calculated by the following formula (b):
Yf = exp(P(Y | X')) / ( exp(P(Y | X')) + Σ_{Y'∈Z(X), Y'≠Y} exp(P(Y' | X')) )    (b)
wherein X' is the input data constructed by the template; Yf is the final output probability; Y is the current predicted word; Z(X) is the answer space; and Y' ranges over the answer-space words other than the current predicted word.
5. The domain template pre-trained small sample classification method according to claim 1, wherein step 6) obtains the prediction answer with the maximum probability through the argmax function, and then, according to the answer space Z, the label word mapper obtains the output label required by the final target task with the following formula (c):
Ylabel = Z( argmax_{Y∈Z(X)} P(Y | X') )    (c);
wherein Ylabel is the final output label; P(Y | X') is the probability of the predicted word; and Z is the answer label mapper, which maps the predicted word to the final output label.
CN202211598846.0A 2022-12-14 2022-12-14 Small sample text classification method based on domain template pre-training Pending CN115840820A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211598846.0A CN115840820A (en) 2022-12-14 2022-12-14 Small sample text classification method based on domain template pre-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211598846.0A CN115840820A (en) 2022-12-14 2022-12-14 Small sample text classification method based on domain template pre-training

Publications (1)

Publication Number Publication Date
CN115840820A true CN115840820A (en) 2023-03-24

Family

ID=85578520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211598846.0A Pending CN115840820A (en) 2022-12-14 2022-12-14 Small sample text classification method based on domain template pre-training

Country Status (1)

Country Link
CN (1) CN115840820A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994099A (en) * 2023-09-28 2023-11-03 北京科技大学 Feature decoupling small amount of sample pre-training model robustness fine adjustment method and device
CN116994099B (en) * 2023-09-28 2023-12-22 北京科技大学 Feature decoupling small amount of sample pre-training model robustness fine adjustment method and device

Similar Documents

Publication Publication Date Title
WO2021047286A1 (en) Text processing model training method, and text processing method and apparatus
WO2022037256A1 (en) Text sentence processing method and device, computer device and storage medium
CN109190120B (en) Neural network training method and device and named entity identification method and device
WO2022057776A1 (en) Model compression method and apparatus
CN109885824B (en) Hierarchical Chinese named entity recognition method, hierarchical Chinese named entity recognition device and readable storage medium
EP3913542A2 (en) Method and apparatus of training model, device, medium, and program product
JP2021152963A (en) Word meaning feature generating method, model training method, apparatus, device, medium, and program
US11636272B2 (en) Hybrid natural language understanding
US20220343139A1 (en) Methods and systems for training a neural network model for mixed domain and multi-domain tasks
CN112905795A (en) Text intention classification method, device and readable medium
WO2022095354A1 (en) Bert-based text classification method and apparatus, computer device, and storage medium
WO2023137911A1 (en) Intention classification method and apparatus based on small-sample corpus, and computer device
CN113987147A (en) Sample processing method and device
CN111581970B (en) Text recognition method, device and storage medium for network context
CN112883193A (en) Training method, device and equipment of text classification model and readable medium
CN110874411A (en) Cross-domain emotion classification system based on attention mechanism fusion
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN112417092A (en) Intelligent text automatic generation system based on deep learning and implementation method thereof
CN111597807B (en) Word segmentation data set generation method, device, equipment and storage medium thereof
CN115840820A (en) Small sample text classification method based on domain template pre-training
CN116821307B (en) Content interaction method, device, electronic equipment and storage medium
CN110717316B (en) Topic segmentation method and device for subtitle dialog flow
WO2023159759A1 (en) Model training method and apparatus, emotion message generation method and apparatus, device and medium
CN115906854A (en) Multi-level confrontation-based cross-language named entity recognition model training method
Siddique Unsupervised and Zero-Shot Learning for Open-Domain Natural Language Processing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination