CN113836895A - Unsupervised machine reading comprehension method based on large-scale question self-learning - Google Patents

Unsupervised machine reading comprehension method based on large-scale question self-learning

Info

Publication number
CN113836895A
CN113836895A (application CN202111151305.9A)
Authority
CN
China
Prior art keywords
model
training
data
domain
intra
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111151305.9A
Other languages
Chinese (zh)
Inventor
赵天成 (Zhao Tiancheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Linker Technology Co ltd
Honglong Technology Hangzhou Co ltd
Original Assignee
Honglong Technology Hangzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honglong Technology Hangzhou Co ltd filed Critical Honglong Technology Hangzhou Co ltd
Publication of CN113836895A publication Critical patent/CN113836895A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an unsupervised machine reading comprehension method based on large-scale question self-learning. Data are first divided into four types, and the method then comprises the following steps: S1, training on the unlabeled general data with a standard pre-training model to obtain a pre-trained language model; S2, training on the labeled general data with the pre-trained language model to obtain a question generator, and building a task-specific general-domain model; S3, generating synthetic in-domain data from the unlabeled in-domain data with the question generator, filtering it with the task-specific general-domain model, and training on the high-quality synthetic in-domain data set obtained by filtering to obtain a new pre-trained model; S4, mixing the labeled in-domain data with the low-quality synthetic data set obtained by filtering, annotating answers, and then training with the new pre-trained model to obtain the final model. Based on the final model, input data yield the machine reading comprehension result.

Description

Unsupervised machine reading comprehension method based on large-scale question self-learning
Technical Field
The invention relates to the field of machine reading comprehension, and in particular to an unsupervised machine reading comprehension method based on large-scale question self-learning.
Background
Many state-of-the-art algorithms for Natural Language Processing (NLP) tasks require manually labeled data. In the early stages of a project we usually have no domain-specific labeled data set, and annotating a sufficient amount of such data is expensive and laborious. As a result, for many NLP applications even resource-rich languages (such as English) have labeled data in only a few domains.
In many NLP applications it is very difficult to obtain a large amount of labeled data, so the model is often trained from a small amount of data. Such a model usually overfits and generalizes poorly to unseen data. Researchers have therefore exploited large unlabeled data sets through pre-trained language models, which generally alleviate the problem of randomly initialized network weights, find better local optima and improve robustness on unseen data.
Recently, significant advances in machine reading comprehension (MRC) have been achieved by pre-training Transformer language models on large amounts of unlabeled text data and fine-tuning the pre-trained models on manually labeled QA datasets. In the context of pre-trained language models, Gururangan et al. showed the importance of additional pre-training on in-domain data for improving performance on downstream tasks.
Disclosure of Invention
The invention provides an unsupervised machine reading comprehension method based on large-scale question self-learning, so that a cold start can be achieved in a brand-new domain.
The invention solves the above technical problem mainly through the following technical scheme. Data are first divided into four types: unlabeled general data, labeled general data, unlabeled in-domain data and labeled in-domain data. The method then comprises the following steps:
S1, for the unlabeled general data, training with a standard pre-training model to obtain a Transformer-based pre-trained language model serving as the lowest layer of the architecture;
S2, for the labeled general data, training with the pre-trained language model obtained in step S1 to obtain a question generator, and using the labeled general data to build a task-specific general-domain model;
S3, for the unlabeled in-domain data, generating synthetic in-domain data with the question generator constructed in step S2, filtering it with the task-specific general-domain model to obtain a high-quality synthetic in-domain data set and a low-quality synthetic data set, and then training on the high-quality synthetic in-domain data set to obtain a new pre-trained model;
S4, for the labeled in-domain data, mixing in the low-quality synthetic data set obtained by filtering, annotating answers, and then training with the new pre-trained model to obtain the final machine reading comprehension model;
based on the final machine reading comprehension model, input data yield the machine reading comprehension result.
Preferably, in step S1, a GPT-2 model or a T5 model is used for model learning.
Preferably, question generation based on the trained T5 model is specifically as follows: an answer is extracted; a question is generated from the extracted answer; the question is received and an answer is generated; the extracted answer and the generated answer are compared to judge whether the generated question is correct;
question generation based on the trained GPT-2 model is specifically as follows: given the natural ordering of language, the joint probability of a sequence s = (s_1, \ldots, s_n) is decomposed into a product of conditional probabilities:
p(s) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})
after GPT-2 training is finished, for each new word the model computes the probability of the next word from all preceding tokens; the K highest-probability words are then selected according to these probabilities and one of the K candidates is sampled at random; this process is repeated until a special symbol or an end-of-sentence symbol appears;
for the question generation scenario, the position of the potential answer in the source text is marked with special symbols; for a paragraph C = [c_1, \ldots, c_n] and one of its potential answers A = [a_1, \ldots, a_m], the input is expressed as:
X = ([CLS], C, [SEP], A)
given the above X, it is fed into the trained GPT-2 or trained T5 model to obtain the hidden vectors:
H = Model(X), \quad H \in \mathbb{R}^{x \times h}
where x is the input length and h is the size of the hidden vector; finally, H is fed into a fully connected layer to obtain the final result:
P(w \mid H) = \mathrm{softmax}(HW + b), \quad \hat{w} = \arg\max_{w} P(w \mid H)
where w is a word, W is a matrix and b is a coefficient, both obtained by learning, and the final result is the best word output by argmax.
Preferably, in step S3, active learning is applied to the generated data together with round-trip consistency, so that weak links in the training data distribution are actively screened out according to the strengths and weaknesses of the existing model along different dimensions, and the next batch of data to be labeled is suggested.
Preferably, in step S3, data filtering is performed by round-trip consistency, and learning efficiency is improved by active learning.
The substantial effect of the method is that it is suitable for cases with no labeled data or very little labeled data, and the accuracy of the model is significantly improved.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The technical scheme of the invention is further specifically described by the following embodiments and the accompanying drawings.
Embodiment: we use a variety of pre-trained language models (e.g., GPT-2 and T5) to generate a large amount of potential question-and-answer data from unlabeled paragraphs of in-domain text; this approach allows us to achieve a cold start in a completely new domain. We then pre-train the model on these generated samples and finally fine-tune it on the specific labeled data set.
Although a model trained domain-specifically on the SQuAD1.1 training dataset achieves state-of-the-art performance on the SQuAD1.1 Dev dataset (EM score of 85%), it completely fails to reason at the same level in a brand-new domain, i.e., NewsQA (EM score of 32%). We found that when a synthetic dataset is used to pre-train a model, it is important to prevent overfitting to the synthetic dataset, since it typically contains many noisy samples. Nevertheless, these synthetic datasets are very useful in the early stage when there is no or very little training data in the domain, because this method lets us automatically generate "machine"-labeled training data in a completely new domain.
With this method, 80% of the final performance is obtained without any labeled data. Moreover, when we inject a small amount of labeled data (10% of the original data), the pre-trained model quickly reaches a level equivalent to 94% of the final performance. Finally, we evaluate Data Dream with the NLP Checklist framework used to rigorously test NLP models. Our method reduces errors by 18% on the general language capability test items (e.g., synonyms, question spelling, time variation, etc.) in NLP Checklist.
Question generation is a long-standing research topic, and using generated question-answer pairs to improve QA systems shows large gains in low-resource settings with only a small number of samples. However, verifying and improving the accuracy of these generated QA pairs remains relatively unexplored.
In machine translation, modeling consistency in both translation directions through dual learning or back-translation can improve the quality of the translation model. Back-translation, which adds synthetically generated parallel data as training examples, is an inspiration for this work and achieves the best performance in both supervised and unsupervised settings. The joint distribution of questions and answers can also be modeled given the context and used directly, whereas our work uses generative models to produce synthetic data for pre-training. Combining these two approaches may be a fruitful direction for future work.
QG has been used to augment training data for question answering, focusing on text-based QA tasks that aim at selecting one or more answer sentences from the text for a given input question; the weight of each data point during training is set by comparing the generated question with the original question when ranking sentences.
Translation-based data augmentation mechanisms can also be introduced for question answering. However, these methods depend heavily on the availability and quality of the translation system. Although MT lets us add more data during training, it has not brought significant improvement because it is difficult to find domain-specific data in other languages.
Using synthetic QA corpora with round-trip consistency can improve the overall MRC task, but to enforce round-trip consistency a trained model must already exist. The main difference from our work is that we assume our dataset is small and an initial model is hard to build, whereas they assume a model is already available and only needs to be further refined; it is therefore difficult for such methods to show improvements on new-domain datasets and to consistently improve cross-domain performance.
The main contributions of our proposed Data Dream are four-fold:
1. We propose a four-step procedure for building NLP systems in small-sample settings.
2. We construct synthetic QA corpora using a number of different heterogeneous pre-trained language models and show performance improvements on new domains.
3. We evaluate on NLP Checklist, an evaluation method for rigorously testing NLP models; our proposed method exceeds the accuracy of the baseline, and we find that the error rate on general language capabilities is greatly reduced.
4. When the predicted answers differ, we further improve performance by active learning on the generated questions.
Our overall process is divided into four phases based on different data sets. First, for any NLP domain or task, we can classify datasets into four types:
1. unlabeled general data (e.g., BookCorpus, Wikipedia, etc.).
2. Annotated general data (or out-of-domain data) (e.g., SquAD, TriviaQA, HotpotQA, etc.).
3. Unlabeled intra-domain data (e.g., judicial cases, insurance clauses, technical specifications, etc.).
4. Annotated intradomain data (e.g., manually annotated legal portfolio).
Depending on the size of each of these four data sets, we take different processing modes:
First step (unlabeled general data): research on unlabeled general-domain datasets has been actively conducted for the past three years. Large amounts of text data are used to build Transformer-based pre-trained language models such as BERT, GPT-2 and T5, which have become standard NLP practice. We use a Transformer-based language model as the lowest layer of our architecture.
Second step (labeled general data): our goal is to build a machine reading comprehension model, for which many publicly available data sets exist. We use these data sets to build a synthetic data generator, which is then used to produce a large-scale in-domain data set. In addition, we use the labeled general-domain dataset to build a task-specific (the MRC task in this work) general-domain model.
Third step (unlabeled in-domain data): we generate a large amount of synthetic in-domain data using the question generator constructed in step 2. After this data is generated, we filter it with the general-domain model, borrowing the idea of round-trip-consistent question generation. High-quality samples are used to build the pre-trained model, and we use further filtering methods to improve performance. When these data are later labeled manually, the pre-trained model can also serve as an annotation assistant.
Fourth step (labeled in-domain data): in the last step we apply active learning; synthetic data rejected by the general-domain model are sent to human annotators to label the answers. If a generated question is grammatically incorrect and difficult to understand, we ask the annotator to modify the generated question as far as possible and to annotate the answer. Finally, we train the final model on the labeled in-domain data set. A minimal end-to-end sketch of these four steps is given below.
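To make the four steps concrete, the following is a minimal end-to-end sketch in Python-style pseudocode; every helper (train_lm, train_question_generator, train_mrc, pretrain_mrc, finetune_mrc, human_annotate) is a hypothetical placeholder rather than part of any specific library or of the claimed implementation.

```python
# Minimal sketch of the four-step pipeline described above.
# All helper names are illustrative placeholders, not real library functions.

def data_dream_pipeline(unlabeled_general, labeled_general,
                        unlabeled_in_domain, labeled_in_domain):
    # Step 1: self-supervised pre-training on unlabeled general data
    # (e.g., a GPT-2- or T5-style language model).
    base_lm = train_lm(unlabeled_general)

    # Step 2: build a question generator and a task-specific
    # general-domain MRC model from labeled general data (e.g., SQuAD).
    question_generator = train_question_generator(base_lm, labeled_general)
    general_mrc = train_mrc(base_lm, labeled_general)

    # Step 3: synthesize in-domain QA pairs, then split them by
    # round-trip consistency against the general-domain MRC model.
    synthetic = [question_generator.generate(passage)
                 for passage in unlabeled_in_domain]
    high_quality = [ex for ex in synthetic if general_mrc.answers_correctly(ex)]
    low_quality = [ex for ex in synthetic if not general_mrc.answers_correctly(ex)]
    domain_pretrained = pretrain_mrc(base_lm, high_quality)

    # Step 4: have annotators label answers for the rejected samples,
    # mix them with the labeled in-domain data, and fine-tune.
    relabeled = human_annotate(low_quality)
    final_model = finetune_mrc(domain_pretrained, labeled_in_domain + relabeled)
    return final_model
```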
In the following sections, we will explain in detail the implementation of each step.
The first step: self-learning on unlabeled general data
In this step we use two different strategies for model learning on unlabeled general data. The first method is GPT-2, a Transformer-based large-scale language model released by OpenAI in February 2019; it contains 1.5 billion parameters and is trained on a dataset of 8 million web pages. It directly scales up the GPT model, training on more than 10 times the data with 10 times as many parameters. In terms of performance, the model produces coherent text passages and achieves SOTA performance on many language modeling benchmarks, and it can perform preliminary reading comprehension, machine translation, question answering and automatic summarization without task-specific training.
The second strategy is T5. The training data for T5 includes the Colossal Clean Crawled Corpus (the C4 corpus), hundreds of gigabytes of clean English text crawled from Common Crawl. T5 itself is a standard Transformer-based encoder-decoder model, with up to 11 billion parameters.
The second step: training the generative model on labeled general data.
Question generation is the task of automatically generating questions from text passages. The simplest setting is answer-aware question generation: the model is given an answer and a passage and asked to generate a question for that answer while considering the passage context. Most earlier work used complex models and processing pipelines without pre-trained models, so machine-generated questions were often ungrammatical and difficult to understand, which made it hard to use the generated data in practical applications. However, recent advances in text generation supported by pre-trained Transformer models enable us to generate reasonable synthetic data. We use the most robust generation methods available: T5-based generation and GPT-2-based generation.
T5-based question generation: T5 is a very large neural network model trained on a mixture of unlabeled text and labeled data from popular natural language processing tasks, and then fine-tuned separately for each downstream task. T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, each converted to a text-to-text format. To generate answer-aware questions we usually need three models: the first extracts an answer span, the second generates a question conditioned on that answer, and the third is a QA model that takes the question and produces an answer; we can then compare the two answers to check whether the generated question is correct. Maintaining three models for a single task is very complex, so the goal is to create a multi-task model that performs all three tasks simultaneously, as sketched below.
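As an illustration of how the three sub-tasks (answer extraction, question generation, question answering) can be served by one text-to-text model, the sketch below uses the Hugging Face transformers API; the checkpoint name and the prompt prefixes ("extract answer:", "generate question:", "answer question:") are assumptions for illustration only, not formats prescribed by the method.

```python
# Sketch of the three sub-tasks served by a single multi-task T5 model.
# The checkpoint name and prompt prefixes are illustrative assumptions.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

MODEL_NAME = "your-multitask-t5-checkpoint"  # hypothetical fine-tuned model
tokenizer = T5TokenizerFast.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def run(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_length=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def generate_qa_pair(passage: str):
    answer = run("extract answer: " + passage)                           # task 1
    question = run(f"generate question: {answer} context: {passage}")    # task 2
    predicted = run(f"answer question: {question} context: {passage}")   # task 3
    # Round-trip check: keep the pair only if the QA head recovers the answer.
    return (question, answer) if predicted.strip() == answer.strip() else None
```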
GPT-2-based question generation: for question generation with GPT-2 we follow the original standard text generation strategy. Given the natural ordering of the language model, the joint probability of a sequence s = (s_1, \ldots, s_n) can be decomposed into a product of conditional probabilities:
p(s) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})
After this probabilistic model is trained, question generation can be implemented with various stochastic sampling strategies, including top-k sampling. For each new word, the model computes the probability of the next word given all preceding tokens. The K highest-probability words are then selected according to these probabilities, and one of the K candidates is sampled at random. This process is repeated until a special symbol (such as "?") or an end-of-sentence symbol is produced; a minimal sketch of this sampling loop is given below.
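The following is a minimal sketch of the top-k sampling loop, assuming a generic autoregressive language model `lm` that maps a token-id prefix to next-token logits; the interface is an illustrative assumption rather than a specific library API.

```python
# Minimal top-k sampling loop as described above.
# `lm` is assumed to return logits of shape [1, seq_len, vocab_size].
import torch

def sample_top_k(lm, prefix_ids, k=40, max_len=64, stop_ids=()):
    ids = list(prefix_ids)
    for _ in range(max_len):
        logits = lm(torch.tensor([ids]))[0, -1]       # next-token logits
        top = torch.topk(logits, k)                   # keep the K best tokens
        probs = torch.softmax(top.values, dim=-1)     # renormalize over them
        next_id = top.indices[torch.multinomial(probs, 1)].item()
        ids.append(next_id)
        if next_id in stop_ids:                       # "?" or end-of-sentence
            break
    return ids
```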
In addition, for the question generation scenario we mark the position of the potential answer in the source text with special symbols. For example, for a paragraph C = [c_1, \ldots, c_n] and one of its potential answers A = [a_1, \ldots, a_m], the input is expressed as:
X = ([CLS], C, [SEP], A)
given the above X, we can input it into GPT-2 or T5 to get the hidden vector:
H=Model(x)
Figure BDA0003287219750000082
x is the input length and h is the magnitude of the hidden vector. And finally, H inputs a layer of full link network to obtain a final result:
Figure BDA0003287219750000083
Figure BDA0003287219750000084
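The following is a minimal sketch, under assumed dimensions and token ids, of how the answer-marked input X and the fully connected output head could be wired up in PyTorch; it is illustrative only and not tied to a specific checkpoint.

```python
# Sketch of the answer-conditioned input X = ([CLS], C, [SEP], A) and of the
# output layer P(w | H) = softmax(HW + b). Sizes are illustrative assumptions.
import torch
import torch.nn as nn

hidden_size, vocab_size = 768, 50257

class GenerationHead(nn.Module):
    def __init__(self):
        super().__init__()
        # W and b of the formula above, both learned.
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, H):                      # H: [batch, x, h]
        logits = self.linear(H)                # HW + b
        probs = torch.softmax(logits, dim=-1)  # P(w | H)
        return probs.argmax(dim=-1)            # best word per position (argmax)

def build_input(passage_ids, answer_ids, cls_id, sep_id):
    # X = ([CLS], C, [SEP], A): special symbols mark the potential answer.
    return torch.tensor([[cls_id] + passage_ids + [sep_id] + answer_ids])
```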
the third step: and marking the industry data through the model in the second step.
Training against AI models requires a large number of manual annotations. The process of manual labeling is costly. In addition, it is difficult for an annotator to decide what to ask in machine learning understanding, and a human annotator has many duplicate items. If the active learner produces the most divergence in the prediction, it is decided to query oracle to mark the data sample. This can be measured by entropy and KL-divergence. A high variance in the output prediction represents the most informative data sample. In this context, we actively learn generated data with round-trip consistency, so as to actively screen out weak links in training data distribution according to advantages and disadvantages of the existing model at different latitudes, and suggest the next batch of data to be labeled, thereby reducing the cost of data labeling and increasing the value of each manually labeled data point.
After obtaining the question model, we can label any unlabeled industry data with a question model generation model based on T5 or GPT-2, automatically generating potential relevant questions to train the question-answer model of the fourth part. However, it is not ideal to use all generated problems directly to achieve the training effect, because the generated problems cover a lot of noise. Therefore, we invented a round-trip consistency method to realize the control of data quality.
Data filtering by round-trip consistency: round-trip consistency can be used to filter data: if the model cannot answer a generated question, the example is filtered out. We also use this method to filter data, but there are some differences between existing work and ours:
We assume that no labeled in-domain data exists, so our MRC model is trained directly on top of the generated data.
Their approach assumes that training data exists and the goal is to improve performance with that training data.
During training we use an indicator function I(q):
I(q) = \mathbb{1}\left[\mathrm{MRC}(\hat{c}, \hat{q}) = \hat{a}\right]
where \hat{q} is the generated question, \hat{a} is the given answer, \hat{c} is the context, and \mathrm{MRC}(\hat{c}, \hat{q}) is the answer predicted by the general-domain model; I(q) decides whether a data point is used. A sketch of this filter is given below.
Improving learning efficiency through active learning: first, we generate a question for each named entity or noun phrase and then run the MRC model trained on the general domain. If the model cannot predict the answer, we keep the sample for active learning. We implement active learning by selecting the data on which the model is least confident, using the following strategy.
q^{*} = \arg\min_{\hat{q}} \; \max_{a} P(a \mid \hat{c}, \hat{q})
where \hat{q} is the generated question and \hat{a}, \hat{c} are the given answer and context; the samples on which the model's best answer has the lowest probability (least confidence) are selected first, and I(q) is again used to decide whether a data point is used. A sketch of this selection step follows.
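The following is a minimal sketch of this least-confidence selection; `answer_confidence`, which returns the probability the MRC model assigns to its best answer span for a (context, question) pair, is an assumed callable rather than a specific library function.

```python
# Sketch of least-confidence sample selection for active learning:
# the samples the MRC model is least sure about are sent to annotators first.
def select_for_annotation(answer_confidence, candidates, budget):
    scored = [(answer_confidence(ex["context"], ex["question"]), ex)
              for ex in candidates]
    scored.sort(key=lambda pair: pair[0])   # lowest confidence first
    return [ex for _, ex in scored[:budget]]
```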
The fourth step: fine tuning with labeled industry data
Training details: since in most real-world cases we have a large amount of unlabeled in-domain data, our approach can perform task-specific pre-training on those large unlabeled data sets. The whole training process follows these steps:
1. On publicly available QA datasets (e.g., SQuAD, NQ and MARCO), build multiple question generators from multiple pre-trained language models (e.g., GPT-2 and T5).
2. Use the question generators to generate a large number of questions.
3. Pre-training is performed using the generated data set.
4. Fine-tune the model from the previous step on the labeled data set.
We use the large number of generated QA data sets for pre-training, and we use the SpanBERT framework for both pre-training and fine-tuning. The objective of the fine-tuning stage is to reduce training error using only labeled data; the main purpose of the fine-tuning step is to re-adjust weights that may have been trained incorrectly because of generation errors.
The final model is evaluated on SQuAD and NewsQA. SQuAD is used to explore the effect of in-domain QG pre-training, meaning that the same dataset is used for question generation and for the span prediction model. To validate a new domain that is completely different from the source of the QG model, we treat the NewsQA dataset as a new-domain dataset and use none of its training data, neither for question generation nor for pre-training. The evaluation indices are the standard MRC metrics: EM and F1 scores.
Exact Match (EM): the span of the top-1 answer matches the correct answer exactly.
F1 score: we compute the word-level overlap between the returned span and the ground-truth answer. A simplified sketch of both metrics follows.
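The two metrics can be sketched as follows, using simplified SQuAD-style normalization (lower-casing and whitespace tokenization only) as an assumption; the official evaluation additionally strips punctuation and articles.

```python
# Sketch of the two MRC metrics described above: exact match and word-level F1.
from collections import Counter

def exact_match(prediction: str, truth: str) -> int:
    return int(prediction.strip().lower() == truth.strip().lower())

def f1_score(prediction: str, truth: str) -> float:
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    common = Counter(pred_tokens) & Counter(true_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)
```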
In-domain vs. out-of-domain: recent natural language processing models achieve impressive performance when training and test examples come from the same dataset, but tend to perform poorly on out-of-domain (OOD) examples, because many unseen events can occur at test time.
We use the SpanBERT architecture, which focuses on pre-training span representations and achieves current state-of-the-art results, to show the performance gap between in-domain and out-of-domain datasets. We treat the SQuAD1.1 training dataset as the in-domain dataset used for question generation and pre-training, and use the NewsQA dataset as the out-of-domain corpus, from which no training samples are taken. We find that the EM score drops by 78.5% relative (from 80.40% to 17.26%). However, with the help of the question generator we can generate in-domain QA data on unlabeled samples: without any labeled data we reach 75% of the final performance on SQuAD1.1 and 60% of the final performance on NewsQA; the gap arises because the SQuAD1.1 training passages were included when building the question generator.
Checklist evaluation: while held-out accuracy has been the primary way to evaluate generalization, it often overestimates the performance of NLP models, and alternative evaluations focus on individual tasks or specific behaviors. Inspired by the principles of behavioral testing in software engineering, CheckList, a task-agnostic method for testing NLP models, can be introduced. CheckList includes a matrix of general language capabilities and test types that facilitates comprehensive test design; its authors demonstrated its utility by testing three tasks and identifying key failures in commercial and state-of-the-art models. The proposed method, based on question-generation pre-training, achieves an 18% reduction in failure rate, in particular on animal vs vehicle v2 (39% reduction), fairness (44% reduction) and temporal tests (93% reduction).
Influence of annotated data size: to explore the effect of data size on pre-training, we test QG pre-training with 10% and 100% of the data set. The results show that with enough data the model converges faster, but the final scores do not differ much. This indicates that QG pre-training is more useful in the early stages than in the later stages.
Influence of generated data size: we find that the pre-trained model using T5-based generation performs better than the one using GPT-2-based generation. However, when both kinds of generated data are added at the same time, performance improves greatly. The generated questions are typically longer than human-written ones, and by using both GPT-2 and T5 generation we can add more diverse questions and answers for training. For the same answer "Moninder Singh Pandher", T5, GPT-2 and human annotators raise completely different questions. Thus, the diversity of the generators improves the generalization of the subsequent MRC model.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments, or adopt alternatives, without departing from the spirit or scope of the invention as defined in the appended claims.
Although terms such as label, domain, etc. are used frequently herein, the possibility of using other terms is not excluded. These terms are used merely to describe and explain the nature of the invention more conveniently; construing them as imposing any additional limitation would be contrary to the spirit of the invention.

Claims (5)

1. An unsupervised machine reading comprehension method based on large-scale question self-learning, characterized in that data are first divided into four types: unlabeled general data, labeled general data, unlabeled in-domain data and labeled in-domain data, and the method then comprises the following steps:
S1, for the unlabeled general data, training with a standard pre-training model to obtain a Transformer-based pre-trained language model serving as the lowest layer of the architecture;
S2, for the labeled general data, training with the pre-trained language model obtained in step S1 to obtain a question generator, and using the labeled general data to build a task-specific general-domain model;
S3, for the unlabeled in-domain data, generating synthetic in-domain data with the question generator constructed in step S2, filtering it with the task-specific general-domain model to obtain a high-quality synthetic in-domain data set and a low-quality synthetic data set, and then training on the high-quality synthetic in-domain data set to obtain a new pre-trained model;
S4, for the labeled in-domain data, mixing in the low-quality synthetic data set obtained by filtering, annotating answers, and then training with the new pre-trained model to obtain the final machine reading comprehension model;
based on the final machine reading comprehension model, input data yield the machine reading comprehension result.
2. The unsupervised machine reading comprehension method based on large-scale question self-learning of claim 1, wherein in step S1 the standard pre-training model is a GPT-2 model or a T5 model.
3. The unsupervised machine reading comprehension method based on large-scale question self-learning as claimed in claim 2, wherein question generation based on the trained T5 model is specifically as follows: an answer is extracted; a question is generated from the extracted answer; the question is received and an answer is generated; the extracted answer and the generated answer are compared to judge whether the generated question is correct;
question generation based on the trained GPT-2 model is specifically as follows: given the natural ordering of language, the joint probability of a sequence s = (s_1, \ldots, s_n) is decomposed into a product of conditional probabilities:
p(s) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})
after GPT-2 training is finished, for each new word the model computes the probability of the next word from all preceding tokens; the K highest-probability words are then selected according to these probabilities and one of the K candidates is sampled at random; this process is repeated until a special symbol or an end-of-sentence symbol appears;
for the question generation scenario, the position of the potential answer in the source text is marked with special symbols; for a paragraph C = [c_1, \ldots, c_n] and one of its potential answers A = [a_1, \ldots, a_m], the input is expressed as:
X = ([CLS], C, [SEP], A)
given the above X, it is fed into the trained GPT-2 or trained T5 model to obtain the hidden vectors:
H = Model(X), \quad H \in \mathbb{R}^{x \times h}
where x is the input length and h is the size of the hidden vector; finally, H is fed into a fully connected layer to obtain the final result:
P(w \mid H) = \mathrm{softmax}(HW + b), \quad \hat{w} = \arg\max_{w} P(w \mid H)
where w is a word, W is a matrix and b is a coefficient, and the final result is the best word output by argmax.
4. The unsupervised machine reading comprehension method based on large-scale question self-learning as claimed in claim 1 or 3, wherein in step S3 active learning is applied to the generated data together with round-trip consistency, so that weak links in the training data distribution are actively screened out according to the strengths and weaknesses of the existing model along different dimensions, and the next batch of data to be labeled is suggested.
5. The unsupervised machine reading comprehension method based on large-scale question self-learning of claim 4, wherein in step S3 data filtering is performed through round-trip consistency and learning efficiency is improved through active learning.
CN202111151305.9A 2021-02-08 2021-09-29 Unsupervised machine reading comprehension method based on large-scale question self-learning Pending CN113836895A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110173107 2021-02-08
CN2021101731076 2021-02-08

Publications (1)

Publication Number Publication Date
CN113836895A true CN113836895A (en) 2021-12-24

Family

ID=78967302

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111151305.9A Pending CN113836895A (en) 2021-02-08 2021-09-29 Unsupervised machine reading comprehension method based on large-scale question self-learning

Country Status (1)

Country Link
CN (1) CN113836895A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114996424A (en) * 2022-06-01 2022-09-02 吴艳 Weak supervision cross-domain question-answer pair generation method based on deep learning
CN114996424B (en) * 2022-06-01 2023-05-09 吴艳 Weak supervision cross-domain question-answer pair generation method based on deep learning
CN116501859A (en) * 2023-06-26 2023-07-28 中国海洋大学 Paragraph retrieval method, equipment and medium based on refrigerator field
CN116501859B (en) * 2023-06-26 2023-09-01 中国海洋大学 Paragraph retrieval method, equipment and medium based on refrigerator field
CN116663679A (en) * 2023-07-25 2023-08-29 南栖仙策(南京)高新技术有限公司 Language model training method, device, equipment and storage medium
CN117540021A (en) * 2023-11-28 2024-02-09 中关村科学城城市大脑股份有限公司 Large language model training method, device, electronic equipment and computer readable medium

Similar Documents

Publication Publication Date Title
US10482115B2 (en) Providing question and answers with deferred type evaluation using text with limited structure
US11093835B2 (en) Natural language question expansion and extraction
US8332394B2 (en) System and method for providing question and answers with deferred type evaluation
CN113836895A (en) Unsupervised machine reading comprehension method based on large-scale question self-learning
US20110125734A1 (en) Questions and answers generation
Hassan et al. Automatic short answer scoring based on paragraph embeddings
Rahman et al. NLP-based automatic answer script evaluation
Mao et al. Fact-driven abstractive summarization by utilizing multi-granular multi-relational knowledge
Ahmed et al. On the application of sentence transformers to automatic short answer grading in blended assessment
Wang et al. Research and implementation of English grammar check and error correction based on Deep Learning
Xiao et al. Open-domain question answering with pre-constructed question spaces
Bai et al. Gated character-aware convolutional neural network for effective automated essay scoring
Thakkar Finetuning Transformer Models to Build ASAG System
Seneviratne et al. Inductive logic programming in an agent system for ontological relation extraction
Guerram et al. A domain independent approach for ontology semantic enrichment
Ramesh et al. Coherence Based Automatic Essay Scoring Using Sentence Embedding and Recurrent Neural Networks
Anttila Automatic Text Summarization
Bo et al. Bug question answering with pretrained encoders
US20240086768A1 (en) Learning device, inference device, non-transitory computer-readable medium, learning method, and inference method
Sendra et al. Enhanced Latent Semantic Analysis by considering mistyped words in automated essay scoring
Xiong et al. An Automated Feedback System to Support Student Learning in Writing-to-Learn Activities
Tarabet et al. Automatic Semantic Annotation: Towards Resolution of WFIO Incompatibilities
Zeng et al. UTILISING DEEP LEARNING IN SINGAPORE PRIMARY SCHOOL MATHEMATICAL WORD PROBLEMS
Ledi Automatic Quiz Generation System-Using Natural Language Processing
Hodzic Automated Extraction of Data from Insurance Websites

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20221026

Address after: 310000 Room 303, building 3, No. 399, Qiuyi Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: Honglong Technology (Hangzhou) Co.,Ltd.

Applicant after: HANGZHOU LINKER TECHNOLOGY CO.,LTD.

Address before: 310000 room 31191, 3 / F, building 1, No. 88, Puyan Road, Puyan street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant before: Honglong Technology (Hangzhou) Co.,Ltd.