CN115840820A - Small sample text classification method based on domain template pre-training - Google Patents

Small sample text classification method based on domain template pre-training Download PDF

Info

Publication number
CN115840820A
CN115840820A (application number CN202211598846.0A)
Authority
CN
China
Prior art keywords
training
template
data
target task
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211598846.0A
Other languages
Chinese (zh)
Inventor
Wang Ting (王廷)
Jia Chenyang (贾晨阳)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202211598846.0A priority Critical patent/CN115840820A/en
Publication of CN115840820A publication Critical patent/CN115840820A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a small sample text classification method based on domain template pre-training. A data set related to the domain of the target task is used to construct templates, and the constructed data are then used to further pre-train a pre-trained language model. A mixed template is built for the target task data set, the target data are preprocessed, and the further pre-trained model is trained and verified on the target task to obtain a predicted word, which a label word mapper maps to the final target label. Compared with the prior art, the method trains faster, places lower demands on hardware performance, makes better use of the pre-trained language model, achieves better classification with less data, improves the classification accuracy of the target task, and provides technical support for technical development in related fields.

Description

Small sample text classification method based on domain template pre-training
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a small sample text classification strategy based on domain template pre-training and improved prompt learning.
Background
With the continuous development of natural language processing, model algorithms for text classification have diversified, from probability-based machine learning models to deep learning models built from deep neural networks. Although the classification accuracy of these methods has gradually improved, the models are generally trained on a task data set from scratch, which requires a large amount of labeled data, high-performance processors, and long training times; in addition, trained models adapt poorly to new tasks, for which data labeling and model training usually have to be repeated. Small sample learning based on pre-trained models has developed rapidly in recent years, can alleviate these problems well, and is therefore of research value when applied to text classification. A pre-trained model acquires general language representations and initialization parameters from massive unlabeled corpora, after which good performance on a target task can be achieved with very little training data.
At present, text classification methods in natural language processing fall into two broad categories: methods based on deep learning models and methods based on pre-trained language models. Classical deep learning models include recurrent neural networks (RNN) (Mikolov T, Karafiát M, Burget L, et al. Recurrent neural network based language model [C]. Interspeech, 2010), long short-term memory networks (LSTM) (Zhang Yanbo. Research on Text Classification Method Based on LSTM Neural Network Model. 2021 IEEE Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC), 2021: 1019-1022), and the text convolutional neural network (TextCNN) (Kim Y. Convolutional Neural Networks for Sentence Classification [C]. Empirical Methods in Natural Language Processing, 2014). Given a large labeled training corpus, these models can repeatedly adjust their parameters during training and achieve good classification results. However, deep learning methods all train the model from scratch and need a large training set to establish the mathematical mapping between the input X and the output Y. In practical application scenarios, privacy and security concerns or the cost of collection and labeling often make a large amount of data unavailable, or limited computer hardware makes training a deep model difficult, so the performance of such methods is unsatisfactory in resource-constrained situations. The Transformer (Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need [J]. Neural Information Processing Systems, 2017, (30): 6000-6010) and BERT (Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [C]. NAACL, 2019) introduced pre-trained language models and accelerated a new round of development in natural language processing. Methods based on pre-trained language models divide into fine-tuning strategies, which adapt the model to the task, and prompt-learning strategies, which adapt the task to the model. In the pre-train-and-fine-tune paradigm, a training objective is designed for the downstream task on top of the pre-trained language model, and fine-tuning transfers the semantic knowledge of the corpus and the initialization parameters of the pre-trained model so that it suits various downstream tasks. Prompt learning based on a pre-trained model can fully exploit the potential of the pre-trained language model: a prompt description is added when the data are reconstructed, the task is converted into the cloze (fill-in-the-blank) task that the pre-trained language model already knows, and no new classifier needs to be designed; only different prompts need to be designed to adapt the target task to the pre-trained language model and obtain a good classification effect.
Although natural language processing is developing rapidly and many excellent algorithms exist for text classification, several problems remain unsolved. For example, the designs of pre-trained language models and fine-tuning methods are growing ever more complex; when the domain of the target task differs too much from the domain of the pre-training corpus, the pre-trained language model has difficulty learning domain-specific knowledge; there is a structural mismatch between the input/output of the pre-trained language model and that of the target task; and data preprocessing may lose semantic information or retain redundant information. How to acquire domain information reasonably and fully improve the effect of a pre-trained model on the target task therefore remains one of the key problems to be studied in small sample text classification.
Disclosure of Invention
The invention aims to provide a small sample text classification method based on domain template pre-training that addresses the defects of the prior art. The method combines domain template pre-training with prompt learning for the small sample text classification task: templates are constructed from a data set related to the target domain, the pre-trained language model is trained on an MLM task with the constructed data to acquire domain information, and mixed template construction and multi-label mapping are then performed so that better classification is achieved with less data, shortening the training time of the target task and lowering the demands on computer hardware. The method is simple and convenient, trains faster, requires less hardware performance, makes better use of the pre-trained language model, greatly improves the classification accuracy of the target task, and provides technical support for technical development in related fields.
The specific technical scheme for realizing the purpose of the invention is as follows: a small sample text classification method based on domain template pre-training, characterized in that domain templates are constructed to further pre-train a pre-trained language model and the target task is handled with improved prompt learning; the method mainly comprises the following steps:
Step 1: construct a prompt template from a domain data set related to the target task. Given input data x, the prompt function fprompt adds the prompt information and constructs x' as defined by the following formula (a):
x' = fprompt(x)    (a)
wherein x is the text data of the domain data set; fprompt is the template construction function; and x' is the domain data after template construction.
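For illustration, a minimal Python sketch of the template construction in step 1 is given below; the function name f_prompt, the mask token string, and the example sentences are assumptions made for illustration rather than values fixed by the invention.

```python
# Minimal sketch of step 1: wrap each sentence of the domain data set with a
# cloze-style prompt so that it can be used for further MLM pre-training.
MASK_TOKEN = "[MASK]"  # assumed to match the mask token of the chosen pre-trained model

def f_prompt(x: str) -> str:
    """Template construction function fprompt: x -> x' (formula (a))."""
    # Mirrors the example in the detailed description: "It is [Mask]. I like this toy."
    return f"It is {MASK_TOKEN}. {x}"

domain_texts = ["I like this toy.", "The battery drains far too quickly."]
prompted = [f_prompt(x) for x in domain_texts]
# prompted[0] == "It is [MASK]. I like this toy."
```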
Step 2: use the data constructed by the template in step 1 to further pre-train the selected pre-trained language model on the MLM task, so that the model acquires domain information related to the target task.
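A hedged sketch of the further pre-training in step 2, using the Hugging Face transformers and datasets libraries, follows; the checkpoint name, output directory, and hyperparameters are placeholders chosen for illustration, not values prescribed by the invention.

```python
# Sketch of step 2: further pre-train an MLM-capable checkpoint on the
# template-constructed domain texts so it absorbs domain information.
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bert-base-uncased"                      # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

prompted = ["It is [MASK]. I like this toy.", "It is [MASK]. The battery drains quickly."]
dataset = Dataset.from_dict({"text": prompted})
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
                      batched=True, remove_columns=["text"])

# Standard random masking is used here as a simplification of the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="further_pretrained", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=5e-5)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()

model.save_pretrained("further_pretrained")           # reused for the target task in step 5
tokenizer.save_pretrained("further_pretrained")
```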
Step 3: take the same number of data samples for each category of the target task; truncate long texts in the data set at the head and tail so that the summarizing semantic information at both ends is kept, and pad short texts dynamically to reduce useless padding.
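A minimal sketch of the preprocessing in step 3 follows; the 128-token budget and the head/tail split are example values assumed for illustration, not figures given in the invention.

```python
# Sketch of step 3: class-balanced sampling is assumed to have been done already;
# only head-and-tail truncation and dynamic (batch-local) padding are shown here.
def head_tail_truncate(token_ids, max_len=128, head_len=64):
    """Keep the first `head_len` and the last `max_len - head_len` tokens of a long text."""
    if len(token_ids) <= max_len:
        return list(token_ids)
    tail_len = max_len - head_len
    return list(token_ids[:head_len]) + list(token_ids[-tail_len:])

def dynamic_pad(batch, pad_id=0):
    """Pad each sequence only up to the longest sequence in this batch."""
    longest = max(len(ids) for ids in batch)
    return [list(ids) + [pad_id] * (longest - len(ids)) for ids in batch]

batch = [head_tail_truncate(list(range(300))), list(range(20))]
padded = dynamic_pad(batch)      # both rows now share the batch-local length of 128
```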
Step 4: construct a prompt mixed template for the target data set by combining a natural language template that humans can understand with an encoded (soft) template that the machine can understand. Using the training data Xtarget of the target task, the template is designed as {soft: this} topic {soft: is} {mask} {Xtarget}, where soft is a tunable machine-understood template slot initialized according to the task, topic is a natural language template converted into the corresponding Embedding form, mask is the value to be predicted, and Xtarget is the original input sequence.
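The mixed template of step 4 can be sketched in PyTorch as below: the two {soft: ...} slots are trainable vectors, while "topic", the mask token, and the input Xtarget pass through the model's ordinary word embeddings. The class name, dimensions, and initialization are assumptions for illustration, not the exact implementation of the invention.

```python
# Sketch of step 4: a mixed template {soft: this} topic {soft: is} {mask} {Xtarget}
# built from trainable soft slots plus embedded natural-language tokens.
import torch
import torch.nn as nn

class MixedTemplate(nn.Module):
    def __init__(self, word_embeddings: nn.Embedding, tokenizer, n_soft: int = 2):
        super().__init__()
        self.word_embeddings = word_embeddings          # embeddings of the pre-trained model
        self.tokenizer = tokenizer
        hidden = word_embeddings.embedding_dim
        # The soft slots ("this", "is") are free parameters tuned on the target task.
        self.soft = nn.Parameter(torch.randn(n_soft, hidden) * 0.02)

    def _embed(self, text: str) -> torch.Tensor:
        ids = torch.tensor(self.tokenizer.encode(text, add_special_tokens=False))
        return self.word_embeddings(ids)

    def forward(self, x_target: str, topic: str = "topic") -> torch.Tensor:
        mask = self._embed(self.tokenizer.mask_token)
        # Concatenation mirrors {soft: this} topic {soft: is} {mask} {Xtarget}.
        return torch.cat([self.soft[:1], self._embed(topic),
                          self.soft[1:], mask, self._embed(x_target)], dim=0)

# Example wiring (names assumed): template = MixedTemplate(model.get_input_embeddings(), tokenizer)
```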
Step 5: use the mixed template constructed from the target task data set in step 4 to train and predict with the further pre-trained language model generated in step 2, adjusting parameters such as the learning rate. The probability of the token at the Mask position predicted by the pre-trained language model is given by the following formula (b):
Yf = exp(P(Y | X')) / ( exp(P(Y | X')) + Σ_{Y'∈Z(X), Y'≠Y} exp(P(Y' | X')) )    (b)
wherein X' is the input data constructed by the template; Yf is the final output probability; Y is the current predicted word; Z(X) is the answer space; and Y' ranges over the answer-space words other than the current predicted word.
Step 6: obtain the prediction answer with the maximum probability through the argmax function, and then, according to the answer space Z, use the label word mapper to obtain the output label required by the final target task with the following formula (c):
Ylabel = Z( argmax_{Y∈Z(X)} P(Yf | X') )    (c)
wherein Ylabel is the final output label; P(Yf | X') is the probability of the predicted word; and Z is the answer label mapper, which maps the predicted word to the final output label.
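Steps 5 and 6 can be sketched as follows with the Hugging Face transformers library: the MLM logits at the Mask position are restricted to the answer space Z, softmax-normalized as in formula (b), and the argmax word is mapped to the task label as in formula (c). The checkpoint path, the example answer space, and the example sentence are placeholders, not values fixed by the invention.

```python
# Sketch of steps 5-6: predict the Mask-position word over the answer space
# and map it to the final label with the label word mapper Z.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint = "further_pretrained"              # assumed output of the step-2 pre-training
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

answer_space = {"wonderful": "positive", "great": "positive",
                "bad": "negative", "terrible": "negative"}     # label word mapper Z

text = "It is [MASK]. I like this toy."
inputs = tokenizer(text, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]               # scores over the whole vocabulary

answer_ids = [tokenizer.convert_tokens_to_ids(w) for w in answer_space]
probs = torch.softmax(logits[answer_ids], dim=-1)              # formula (b), restricted to Z(X)
predicted_word = list(answer_space)[int(probs.argmax())]       # argmax in formula (c)
label = answer_space[predicted_word]                           # final output label Ylabel
```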
Compared with the prior art, the invention has the following remarkable technical progress and beneficial technical effects:
1) The method constructs prompt information from domain data and further pre-trains the pre-trained language model, so that semantic information and domain knowledge related to the target field are fully acquired.
2) The method modifies the target task to adapt it to the pre-trained language model; when the target task is templated, a mixed-template mode is used, which exploits the different advantages of prompt templates and reduces the cost of template construction.
3) In small sample classification, the invention achieves higher accuracy and shorter training time than deep learning methods that use larger data sets and than pre-training-plus-fine-tuning methods.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention will be described and illustrated in further detail with reference to specific embodiments.
Referring to FIG. 1, the present invention performs small sample text classification through the following steps:
Step 1: construct a prompt template from a domain data set related to the target task. Given input data x, the prompt function fprompt adds the prompt information and constructs x'. For example, if the input sentence is "I like this toy", the template-constructed form is "It is [Mask]. I like this toy".
Step 2: use the data constructed by the template in step 1 to further pre-train the selected pre-trained language model on the MLM task, so that the model acquires domain information related to the target task.
Step 3: take the same number of data samples for each category of the target task; truncate long texts in the data set at the head and tail so that the summarizing semantic information at both ends is kept, and pad short texts dynamically to reduce useless padding.
Step 4: construct a prompt mixed template for the target data set by combining a natural language template that humans can understand with an encoded (soft) template that the machine can understand. Using the training data Xtarget of the target task, the template is designed as {soft: this} topic {soft: is} {mask} {Xtarget}, where soft is a tunable machine-understood template slot initialized according to the task, topic is a natural language template converted into the corresponding Embedding form, mask is the value to be predicted, and Xtarget is the original input sequence.
Step 5: use the template constructed from the target task data set in step 4 to train and predict with the further pre-trained language model generated in step 2, adjusting parameters such as the learning rate: the warm-up steps are set to 0.1 of the total number of steps, the model update rate (learning rate) to 0.00002, and the weight decay to 0.01. The probability of the token at the Mask position predicted by the pre-trained language model is calculated by the following formula (b):
Yf = exp(P(Y | X')) / ( exp(P(Y | X')) + Σ_{Y'∈Z(X), Y'≠Y} exp(P(Y' | X')) )    (b)
wherein X' is the input data constructed by the template; Yf is the final output probability; Y is the current predicted word; Z(X) is the answer space; and Y' ranges over the answer-space words other than the current predicted word.
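The training settings named above (warm-up = 0.1 of the total steps, update rate 0.00002, weight decay 0.01) correspond to the following sketch using AdamW with a linear warm-up schedule; the total step count and the checkpoint path are assumptions for illustration.

```python
# Sketch of the step-5 optimization settings: AdamW with linear warm-up.
import torch
from transformers import AutoModelForMaskedLM, get_linear_schedule_with_warmup

model = AutoModelForMaskedLM.from_pretrained("further_pretrained")  # assumed step-2 output
total_steps = 1000                                                  # assumed: epochs * batches per epoch

optimizer = torch.optim.AdamW(model.parameters(), lr=0.00002, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),   # warm-up steps = 0.1 of the total step count
    num_training_steps=total_steps,
)
# Inside the training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```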
Step 6: obtain the prediction answer with the maximum probability through the argmax function, and then, according to the answer space Z, use the label word mapper to obtain the output label required by the final target task with the following formula (c):
Ylabel = Z( argmax_{Y∈Z(X)} P(Y | X') )    (c)
wherein Ylabel is the final output label; P(Y | X') is the probability of the predicted word; and Z is the answer label mapper.
The predicted words are mapped to the final output labels; for example, in the binary case the answer space and the label words may be (positive: wonderful, great, interesting, ...; negative: bad, boring, terrible, ...).
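Because each class is associated with several answer words, the mapping can aggregate their probabilities. A minimal sketch is given below; the word lists, the aggregation by summation, and the placeholder checkpoint are assumptions for illustration.

```python
# Sketch of the label word mapper with several answer words per class:
# a class score is the sum of the Mask-position probabilities of its words.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # placeholder checkpoint
label_words = {"positive": ["wonderful", "great", "interesting"],
               "negative": ["bad", "boring", "terrible"]}

def map_to_label(probs_over_vocab: torch.Tensor, tokenizer, label_words) -> str:
    scores = {}
    for label, words in label_words.items():
        ids = [tokenizer.convert_tokens_to_ids(w) for w in words]
        scores[label] = float(probs_over_vocab[ids].sum())
    return max(scores, key=scores.get)        # class whose answer words are most probable

# Dummy usage: in the pipeline these probabilities come from the MLM's Mask position.
uniform = torch.full((tokenizer.vocab_size,), 1.0 / tokenizer.vocab_size)
print(map_to_label(uniform, tokenizer, label_words))
```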
According to the invention, the domain data set related to the target task is used to construct templates and to further pre-train the pre-trained language model, so that the model learns domain information related to the target task and the gap between the pre-trained language model and the target task is greatly reduced. The template-constructed Prompt method builds an input/output structure similar to the one used when the language model was pre-trained, so the target task adapts to the pre-trained language model and its potential is fully exploited; mixed templates built from human-understandable natural language and machine-understood encoded slots enrich the optimizable embedded words and improve the representational capability of the template.
The above description is only exemplary of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements and the like that are within the spirit and principle of the present invention should be included in the present invention.

Claims (5)

1. A small sample text classification method based on domain template pre-training, characterized in that a domain data set related to the target task is used to further train the pre-trained language model, and small sample text classification is carried out through data preprocessing, parameter processing, a mixed template and multi-label mapping, the method specifically comprising the following steps:
1) Constructing a prompt template with a data set related to the target task domain to obtain domain data;
2) Further pre-training the selected pre-trained language model on the domain data with MLM as the training task to generate a further pre-trained language model;
3) Performing class-balanced sampling on the training data set, truncating long texts at the head and tail to the same length, and dynamically padding short texts;
4) Constructing a prompt mixed template for the target data set with a mixed-template construction method that combines a discrete template and a continuous template;
5) Training and predicting the target task with the generated further pre-trained language model, adjusting the learning rate parameters to obtain the predicted answer;
6) According to the predicted answer, using the multi-label mapper to map the word predicted by the model, via the answer space, to the actual label of the target task, obtaining the final output label and thereby realizing small sample text classification.
2. The domain template pre-trained small sample classification method according to claim 1, wherein step 1) constructs a prompt template using a data set related to the target task domain; if the input data is x, f is used as the prompt function for adding prompt information, and x' is constructed as defined by the following formula (a):
x' = f(x)    (a);
wherein x is the text data of the domain data set; f is the template construction function; and x' is the domain data after template construction.
3. The domain template pre-trained small sample text classification method according to claim 1, wherein step 4) constructs a prompt mixed template for the target data set using a natural language template understood by humans and an encoded template understood by the machine; using the training data Xtarget of the target task, the template is designed as {soft: this} topic {soft: is} {mask} {Xtarget}, wherein soft is a tunable machine-understood template slot, initialized according to the task; topic is a natural language template converted into the corresponding Embedding form; mask is the value to be predicted; and Xtarget is the original input sequence.
4. The method for classifying small samples based on domain template pre-training as claimed in claim 1, wherein step 5) trains and predicts the target task with the generated further pre-trained language model, the warm-up steps of the training being set to 0.1 of the total number of steps, the model update rate to 0.00002, and the weight decay to 0.01; the probability of the token at the Mask position predicted by the pre-trained language model is calculated by the following formula (b):
Yf = exp(P(Y | X')) / ( exp(P(Y | X')) + Σ_{Y'∈Z(X), Y'≠Y} exp(P(Y' | X')) )    (b)
wherein X' is the input data constructed by the template; Yf is the final output probability; Y is the current predicted word; Z(X) is the answer space; and Y' ranges over the answer-space words other than the current predicted word.
5. The domain template pre-trained small sample classification method according to claim 1, wherein step 6) obtains the prediction answer with the maximum probability through the argmax function, and then, according to the answer space Z, the label word mapper obtains the output label required by the final target task with the following formula (c):
Ylabel = Z( argmax_{Y∈Z(X)} P(Y | X') )    (c);
wherein Ylabel is the final output label; P(Y | X') is the probability of the predicted word; and Z is the answer label mapper, which maps the predicted word to the final output label.
CN202211598846.0A 2022-12-14 2022-12-14 Small sample text classification method based on domain template pre-training Pending CN115840820A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211598846.0A CN115840820A (en) 2022-12-14 2022-12-14 Small sample text classification method based on domain template pre-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211598846.0A CN115840820A (en) 2022-12-14 2022-12-14 Small sample text classification method based on domain template pre-training

Publications (1)

Publication Number Publication Date
CN115840820A true CN115840820A (en) 2023-03-24

Family

ID=85578520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211598846.0A Pending CN115840820A (en) 2022-12-14 2022-12-14 Small sample text classification method based on domain template pre-training

Country Status (1)

Country Link
CN (1) CN115840820A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994099A (en) * 2023-09-28 2023-11-03 北京科技大学 Feature decoupling small amount of sample pre-training model robustness fine adjustment method and device
CN116994099B (en) * 2023-09-28 2023-12-22 北京科技大学 Feature decoupling small amount of sample pre-training model robustness fine adjustment method and device

Similar Documents

Publication Publication Date Title
WO2021047286A1 (en) Text processing model training method, and text processing method and apparatus
WO2022037256A1 (en) Text sentence processing method and device, computer device and storage medium
CN109190120B (en) Neural network training method and device and named entity identification method and device
WO2022057776A1 (en) Model compression method and apparatus
CN109885824B (en) Hierarchical Chinese named entity recognition method, hierarchical Chinese named entity recognition device and readable storage medium
EP3913542A2 (en) Method and apparatus of training model, device, medium, and program product
JP2021152963A (en) Word meaning feature generating method, model training method, apparatus, device, medium, and program
US11636272B2 (en) Hybrid natural language understanding
US20220343139A1 (en) Methods and systems for training a neural network model for mixed domain and multi-domain tasks
CN112905795A (en) Text intention classification method, device and readable medium
WO2022095354A1 (en) Bert-based text classification method and apparatus, computer device, and storage medium
WO2023137911A1 (en) Intention classification method and apparatus based on small-sample corpus, and computer device
CN113987147A (en) Sample processing method and device
CN111581970B (en) Text recognition method, device and storage medium for network context
CN112883193A (en) Training method, device and equipment of text classification model and readable medium
CN110874411A (en) Cross-domain emotion classification system based on attention mechanism fusion
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN112417092A (en) Intelligent text automatic generation system based on deep learning and implementation method thereof
CN111597807B (en) Word segmentation data set generation method, device, equipment and storage medium thereof
CN115840820A (en) Small sample text classification method based on domain template pre-training
CN116821307B (en) Content interaction method, device, electronic equipment and storage medium
CN110717316B (en) Topic segmentation method and device for subtitle dialog flow
WO2023159759A1 (en) Model training method and apparatus, emotion message generation method and apparatus, device and medium
CN115906854A (en) Multi-level confrontation-based cross-language named entity recognition model training method
Siddique Unsupervised and Zero-Shot Learning for Open-Domain Natural Language Processing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination