CN113988079A - Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method - Google Patents

Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method Download PDF

Info

Publication number
CN113988079A
CN113988079A (application CN202111144082.3A)
Authority
CN
China
Prior art keywords
model
answer
sentence
data
context
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111144082.3A
Other languages
Chinese (zh)
Inventor
伍赛
任雪峰
陈刚
陈珂
寿黎但
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University (ZJU)
Priority to CN202111144082.3A
Publication of CN113988079A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a low-data-oriented dynamically enhanced multi-hop text reading recognition processing method. A document data set is first corrected and preprocessed; a dynamically enhanced answer prediction model is constructed; the model is trained on a training set and used as a teacher model; a part of the unlabeled data is randomly selected and fed to the teacher model, whose predicted labels serve as pseudo labels, and the pseudo-labeled data are added to the training set to form a new training set; the teacher model is retrained on the new training set to obtain a student model; these steps are repeated iteratively until the model accuracy on the verification set meets a preset threshold; finally, the student model predicts the reading document to be tested and outputs its answer. The invention expands the data with a dynamic enhancement method, reduces the input length, solves the multi-hop reading comprehension problem when little labeled data is available, and enhances the generalization capability of the model.

Description

Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method
Technical Field
The invention belongs to the field of text data processing methods within natural language processing, and particularly relates to a low-data-oriented dynamically enhanced multi-hop text reading recognition processing method.
Background
The machine reading comprehension task requires a machine to answer questions from a given context. It can be used in search engines, intelligent assistants and similar applications to provide users with high-quality information services. Early reading comprehension systems were small in scale, limited to specific domains, and not well suited for practical use. With the excellent performance of deep learning at capturing contextual information and the release of many large benchmark data sets, some machine reading comprehension models have greatly improved performance on single-hop data sets, but these models still lack the ability to reason across multiple sentences. In recent years, multi-hop reading comprehension data sets have been proposed that require a model to reason across several non-contiguous sentences or even documents; the current practice is to fine-tune a large pre-trained model, used as a feature extractor, on a specific reading comprehension task. This approach requires a large amount of data to drive the training of the model. However, labeling data in the real world is very time consuming and laborious, and in some fields there are not enough samples to label.
At present, most multi-hop reading comprehension work relies on large amounts of data, and the low-data setting has rarely been studied. Yet when the annotation cost is excessive, directly using a conventional reading comprehension model does not achieve good results. Data enhancement is a good choice in low-data situations. Current data enhancement methods in the text field focus on text classification tasks, and none has been shown to have a significant effect on reading comprehension tasks. A sliding window, as a data enhancement means for reading comprehension, is mostly applied to single-hop tasks; however, a sliding window cannot guarantee that all supporting sentences fall inside the window, so it is not suitable for the multi-hop case.
Disclosure of Invention
Training a neural network requires large-scale data sets for support; however, labeling data sets is time consuming and labor intensive. To address the problems of the multi-hop reading recognition task and of the background art in low-data scenarios with insufficient data, the invention focuses on the low-data setting, provides a low-data-oriented dynamic context enhanced multi-hop text reading recognition processing method, introduces external knowledge through a self-training method, and adds an auxiliary data set to improve model performance.
As shown in fig. 1, the object of the present invention is achieved by the following technical solution:
Step 1: carrying out correction preprocessing on the data set of a document to eliminate the semantic ambiguity of samples caused by desensitization;
Step 2: constructing a dynamically enhanced answer prediction model;
Step 3: taking the data set with known labels processed in step 1 as a training set, training the dynamically enhanced answer prediction model with the training set, and taking the trained answer prediction model as a teacher model;
Step 4: randomly selecting a part of the unlabeled data set, inputting it into the teacher model obtained in step 3 to predict label results, using the predicted labels as pseudo labels attached to the unlabeled data to form a pseudo-labeled data set, and adding the pseudo-labeled data set to the training set to form a new training set;
Step 5: retraining the teacher model obtained in step 3 with the new training set obtained in step 4, and taking the retrained model as a student model;
Step 6: adopting another data set with known labels as a verification set and inputting it into the student model obtained in step 5 to test the accuracy of the model:
if the model accuracy meets the preset threshold requirement, proceeding to the next step;
if the model accuracy does not meet the preset threshold requirement, returning to step 3, iterating with the student model as the new teacher model, and repeating steps 3-5 until the accuracy on the verification set meets the preset threshold requirement;
Step 7: predicting the reading document to be tested with the student model finally obtained in step 6, and outputting the label of the reading document to be tested and the answer in that label. A sketch of this self-training loop is given below.
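A minimal Python sketch of the self-training loop of steps 3-6 follows. The helper functions train_fn, predict_fn and eval_fn, the sampling ratio and the accuracy threshold are illustrative placeholders, not part of the claimed method.

```python
import random

def self_training_loop(labeled_set, unlabeled_set, val_set,
                       train_fn, predict_fn, eval_fn,
                       threshold=0.80, sample_ratio=0.2):
    """Iterate teacher -> pseudo labels -> student until the verification
    metric reaches a preset threshold (steps 3-6 of the method)."""
    teacher = train_fn(labeled_set)                                      # step 3
    while True:
        sampled = random.sample(unlabeled_set,
                                int(len(unlabeled_set) * sample_ratio))
        pseudo_labeled = [(x, predict_fn(teacher, x)) for x in sampled]  # step 4
        new_train_set = labeled_set + pseudo_labeled
        student = train_fn(new_train_set)                                # step 5
        if eval_fn(student, val_set) >= threshold:                       # step 6
            return student
        teacher = student   # iterate: the student becomes the next teacher
```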
The data set comprises documents, questions and labels; a label comprises an answer, supporting sentences and an answer category, and the answer is defined by an answer start position and an answer end position.
The question is the question posed over the text data in reading comprehension, and the answer is the result corresponding to the question, specifically characters appearing in the document.
A supporting sentence is a sentence that supports answer reasoning in reading comprehension and appears at some position in the document.
Answer categories are generally divided into two, three or four classes, but are not limited thereto.
In step 2, the dynamically enhanced answer prediction model specifically includes:
2.1, splitting the documents of the data set into sentences, extracting some or all of the sentences, and concatenating them with the question corresponding to the document to form a context text;
2.2, inputting the context text into a pre-trained answer prediction model (which may be a BERT model) to obtain the overall feature vector CLS_ALL of the context text and the decoded feature vectors of each word of the question and of each sentence; the decoded feature vectors of the words of the question form the question vector Q, and the decoded feature vectors of the words of all sentences form the context vector C;
2.3.
A. the overall feature vector CLS_ALL is passed through a linear layer to obtain the predicted answer type, expressed as:
type = softmax(Linear(CLS_ALL)/τ) ∈ R^{1×4}
wherein softmax() denotes the softmax activation function and τ denotes a hyper-parameter;
B. the question vector Q and the context vector C are processed as follows to obtain the context feature vector C', and from C' a multilayer perceptron extracts two results serving respectively as the predicted answer start position start and answer end position end:
Q′ = Attention_pooling(Q) ∈ R^{1×d}
C′ = Norm(w1·CLS_ALL + w2·Q′ + w3·C) ∈ R^{l×d}
ŷ^{start}, ŷ^{end} = MLP(C′) ∈ R^{l×2}
in the formulas, w1, w2 and w3 respectively denote the first, second and third weights, d denotes the dimension of the hidden vector, l denotes the length of the context, Q′ denotes the dimension-reduced question vector, Norm() denotes a normalization function, MLP() denotes a multilayer perceptron, and R^{1×4} denotes four dimensions corresponding to the four answer types; ŷ^{start} and ŷ^{end} denote the probability of each position in the context being the answer start position or the answer end position, corresponding respectively to the answer start position start and the answer end position end;
C. the supporting sentences of the predicted output are obtained jointly from the overall feature vector CLS_ALL, the question vector Q and the context vector C as follows:
SFeature = W·AttentionPooling(C, CLS_ALL, Q)
ŷ^{sf} = sigmoid(SFeature)
wherein W denotes a weight matrix, sigmoid() denotes the sigmoid activation function, AttentionPooling() denotes an attention pooling operation, and ŷ^{sf} denotes the supporting sentences of the predicted output.
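The following PyTorch sketch illustrates how the three prediction heads (A: answer type, B: answer span, C: supporting sentences) could sit on top of the BERT decodings. The hidden size, the attention-pooling layer and the per-sentence feature input are assumptions for illustration only, not the exact architecture of the patent.

```python
import torch
import torch.nn as nn

class AnswerPredictionHeads(nn.Module):
    """Sketch of the three output heads on top of BERT decodings.
    cls_all: (1, d) overall feature; Q: (len_q, d) question words;
    C: (len_c, d) context words; sent_repr: (n, d) per-sentence features."""
    def __init__(self, d=768, num_types=4, tau=1.0):
        super().__init__()
        self.tau = tau
        self.type_linear = nn.Linear(d, num_types)          # A: answer type
        self.q_pool = nn.Linear(d, 1)                       # attention pooling over Q
        self.w = nn.Parameter(torch.ones(3))                # w1, w2, w3
        self.norm = nn.LayerNorm(d)
        self.span_mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, 2))
        self.sent_mlp = nn.Linear(d, 1)                     # C: supporting-sentence score

    def forward(self, cls_all, Q, C, sent_repr):
        # A. answer-type prediction from the overall feature vector CLS_ALL
        type_prob = torch.softmax(self.type_linear(cls_all) / self.tau, dim=-1)

        # B. fuse CLS_ALL, pooled question Q' and context C, then predict start/end
        attn = torch.softmax(self.q_pool(Q), dim=0)                   # (len_q, 1)
        q_prime = (attn * Q).sum(dim=0, keepdim=True)                 # (1, d)
        c_prime = self.norm(self.w[0] * cls_all + self.w[1] * q_prime + self.w[2] * C)
        start_end = self.span_mlp(c_prime)                            # (len_c, 2)
        start_prob = torch.softmax(start_end[:, 0], dim=0)
        end_prob = torch.softmax(start_end[:, 1], dim=0)

        # C. supporting-sentence prediction from per-sentence features
        sf_prob = torch.sigmoid(self.sent_mlp(sent_repr)).squeeze(-1) # (n,)
        return type_prob, start_prob, end_prob, sf_prob
```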
The step 7 specifically includes:
step 7.1, splitting the reading document to be tested into sentences, screening each obtained sentence together with the question through a screening model to obtain the relevance between each sentence and the question, and arranging the K sentences with the highest relevance in the order in which they appear in the reading document to be tested to form a new document;
step 7.2, inputting the new document into the student model finally obtained in step 6, outputting the predicted label of the new document, and extracting the answer in the label as the answer of the reading document to be tested.
The screening model specifically uses a Chinese BERT pre-trained model (a RoBERTa variant) to encode the question-sentence pairs; a sketch of this screening stage is given below.
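A hedged sketch of the screening stage follows. The model name bert-base-chinese, the single-logit scoring head and K=15 are assumptions, not the exact configuration of the patent.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

def select_top_k(question, sentences, k=15, model_name="bert-base-chinese"):
    """Score each (question, sentence) pair and keep the K most relevant
    sentences in their original document order (step 7.1)."""
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertForSequenceClassification.from_pretrained(model_name, num_labels=1)
    model.eval()
    scores = []
    with torch.no_grad():
        for sent in sentences:
            enc = tokenizer(question, sent, truncation=True,
                            max_length=512, return_tensors="pt")
            scores.append(torch.sigmoid(model(**enc).logits).item())
    # take the K highest-scoring indices, then restore document order
    top_idx = sorted(sorted(range(len(sentences)),
                            key=lambda i: scores[i], reverse=True)[:k])
    return [sentences[i] for i in top_idx]
```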
The method of the invention first trains a two-stage reading model as the teacher network. The first stage is a sentence screening model: since sentences other than the real supporting sentences do not participate in the reasoning process, the screening model selects the Top-K sentences that are strongly related to the question, which reduces the interference of irrelevant sentences and overcomes the problem that the full text is too long to be fed directly into a pre-trained model. The second stage, the answer prediction model, reasons over the answer and the supporting sentences and adds an answer-type prediction task; the answer extraction step is only performed when the answer lies in the text. Because the training data set is small, a dynamically enhanced answer prediction model is proposed: during training, each batch of training data is composed of the real supporting sentences plus several sentences randomly extracted from the context, and dynamically updating the input in this way increases the generalization capability of the model; during inference, the sentences selected by the screening model are used directly as the input of the answer prediction model. In addition, the invention uses self-training to label unlabeled data and thereby expand the data set.
Compared with the prior art, the invention has the following beneficial effects:
the method establishes a low-data-oriented dynamically enhanced multi-hop reading recognition processing model and, for the problem of scarce labeled data, expands the data set with a dynamic enhancement method during the training of the answer prediction model;
for the problem that all documents cannot be fed into the model under low-resource conditions, non-supporting sentences are sampled randomly in the training stage, and in the prediction stage a screening model selects the sentences relevant to the question's answer to reduce the input length; the supporting-sentence prediction no longer uses a mainstream graph network for encoding, but re-encodes on the basis of an improved Transformer to learn the relations between sentences; and an external data set is introduced, with a self-training pseudo-labeling method used to enhance the generalization capability of the model.
Experiments show that the invention greatly improves reading comprehension on the CAIL 2020 data set in the Chinese legal field, which demonstrates the effectiveness of the solution; it therefore has application value and practical significance for text data on the network.
Drawings
FIG. 1 is a block diagram of the self-training learning method employed in the present invention;
FIG. 2 is a sentence screening model architecture diagram according to the present invention;
FIG. 3 is a diagram of an answer prediction model architecture according to the present invention;
FIG. 4 is a diagram of a contextual sentence feature extraction architecture in accordance with the present invention;
fig. 5 is a sample of data used by the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.
As shown in fig. 1, the embodiment of the present invention and its implementation process are as follows:
the embodiment of the invention is implemented and tested on a CAIL 2020 reading comprehension data set in the field of Chinese law.
The CAIL 2020 reading comprehension data set is a reading comprehension data set in the Chinese judicial field. Part of its training data comes from the CJRC training set and part consists of 5100 newly annotated question-answer pairs; the CJRC portion covers the civil, criminal and administrative fields. The verification set and test set contain 1900 and 2600 question-answer pairs respectively, so the amount of data is small. The external data set CJRC used in the experiments is provided by the CAIL 2019 reading comprehension competition; its data mainly come from published judgment documents and cover the criminal and civil fields, with 40000 questions in the training set and 5000 questions each in the verification set and test set.
Step 1: carrying out correction preprocessing on a data set of a document to eliminate the semantic ambiguity of a sample caused by desensitization;
in the step 1, for each document in the data set, traversing each sentence of the document, matching the triples by adopting a regular matching expression, adding the extracted triples into a name list, then traversing the whole document once to split the name and the digital part, and matching from short to long during matching. Therefore, disambiguation processing aiming at the document is realized, and the aim of semantic ambiguity when word segmentation is brought by data desensitization is fulfilled.
Step 2: constructing a dynamically enhanced answer prediction model;
As shown in fig. 3, the dynamically enhanced answer prediction model specifically includes:
2.1, splitting the documents of the data set into sentences, extracting some or all of the sentences, and concatenating them with the question corresponding to the document to form a context text; as shown in fig. 3, SEP in the figure denotes a separator.
2.2, inputting the context text into a pre-trained BERT model to obtain the overall feature vector CLS_ALL of the context text and the decoded feature vectors of each word of the question and of each sentence; the decoded feature vectors of the words of the question form the question vector Q, and the decoded feature vectors of the words of all sentences form the context vector C;
2.3.
A. the overall feature vector CLS_ALL is passed through a linear layer to obtain the predicted answer type, expressed as:
type = softmax(Linear(CLS_ALL)/τ) ∈ R^{1×4}
wherein softmax() denotes the softmax activation function and τ denotes a hyper-parameter;
B. the question vector Q and the context vector C are processed as follows to obtain the context feature vector C', and from C' a multilayer perceptron extracts two results serving respectively as the predicted answer start position start and answer end position end:
Q′ = Attention_pooling(Q) ∈ R^{1×d}
C′ = Norm(w1·CLS_ALL + w2·Q′ + w3·C) ∈ R^{l×d}
ŷ^{start}, ŷ^{end} = MLP(C′) ∈ R^{l×2}
in the formulas, w1, w2 and w3 respectively denote the first, second and third weights, d denotes the dimension of the hidden vector, l denotes the length of the context, Q′ denotes the dimension-reduced question vector, Norm() denotes a normalization function, MLP() denotes a multilayer perceptron, and R^{1×4} denotes four dimensions corresponding to the four answer types; ŷ^{start} and ŷ^{end} denote the probability of each position in the context being the answer start position or the answer end position, corresponding respectively to the answer start position start and the answer end position end;
C. as shown in fig. 4, the supporting sentences of the predicted output are obtained jointly from the overall feature vector CLS_ALL, the question vector Q and the context vector C as follows:
SFeature = W·AttentionPooling(C, CLS_ALL, Q)
ŷ^{sf} = sigmoid(SFeature)
wherein W denotes a weight matrix, initialized to a constant; sigmoid() denotes the sigmoid activation function, AttentionPooling() denotes an attention pooling operation, and ŷ^{sf} denotes the supporting sentences of the predicted output.
As can be seen from the above, the answer-category prediction of the invention is a four-class classification task; the hyper-parameter τ scales the output values, and the class with the highest probability is taken as the answer category.
If the answer lies in the context, the answer extraction task is used to extract it from the document. The answer extraction task performs, in every dimension of C, a weighted summation of the re-encoded, dimension-reduced Q and of CLS_ALL, then uses a multilayer perceptron to map the output to 2 dimensions, predicting the answer start position and end position respectively.
In a specific implementation, the dimension of q and k used in the attention computation is enlarged to re-encode the sentence vectors, a parameter matrix is used to superpose the sentence information obtained from each re-encoding for sentence feature fusion, a multilayer perceptron then converts the sentence information into an n×1 output, and every sentence whose value exceeds the threshold is judged to be a supporting sentence, as shown in fig. 4.
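A sketch of this sentence re-encoding follows. The enlarged q/k dimension, the number of re-encoding layers and the fusion weights are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SentenceReencoder(nn.Module):
    """Re-encode sentence vectors with enlarged query/key projections,
    fuse the per-layer sentence features with a parameter vector, and
    score each sentence with an MLP (values above 0.5 -> supporting sentence).
    All sizes below are illustrative."""
    def __init__(self, d=768, d_qk=1024, num_layers=2):
        super().__init__()
        self.q_proj = nn.ModuleList(nn.Linear(d, d_qk) for _ in range(num_layers))
        self.k_proj = nn.ModuleList(nn.Linear(d, d_qk) for _ in range(num_layers))
        self.v_proj = nn.ModuleList(nn.Linear(d, d) for _ in range(num_layers))
        self.fuse = nn.Parameter(torch.ones(num_layers) / num_layers)  # fusion weights
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, 1))

    def forward(self, sent_vecs):                       # (n, d) sentence vectors
        layer_feats, x = [], sent_vecs
        for q_l, k_l, v_l in zip(self.q_proj, self.k_proj, self.v_proj):
            q, k, v = q_l(x), k_l(x), v_l(x)
            attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
            x = attn @ v                                # re-encoded sentence vectors
            layer_feats.append(x)
        fused = sum(w * f for w, f in zip(self.fuse, layer_feats))
        scores = torch.sigmoid(self.mlp(fused)).squeeze(-1)   # (n,) probabilities
        return scores > 0.5, scores                     # supporting-sentence mask, scores
```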
Step 3: taking the data set with known labels processed in step 1 as a training set, training the dynamically enhanced answer prediction model with the training set, and taking the trained answer prediction model as a teacher model;
Step 4: randomly selecting a part of the unlabeled data set, inputting it into the teacher model obtained in step 3 to predict label results, using the predicted labels as pseudo labels attached to the unlabeled data to form a pseudo-labeled data set, and adding the pseudo-labeled data set to the training set to form a new training set;
In this way, the self-training method attaches pseudo labels to the unlabeled data and adds them to the labeled data, which increases the size of the training set, improves the training effect, and yields a more accurate model and predictions.
Step 5: retraining the teacher model obtained in step 3 with the new training set obtained in step 4, and taking the retrained model as a student model;
Step 6: adopting another data set with known labels as a verification set and inputting it into the student model obtained in step 5 to test the accuracy of the model:
if the model accuracy meets the preset threshold requirement, proceeding to the next step;
if the model accuracy does not meet the preset threshold requirement, returning to step 3, iterating with the student model as the new teacher model, and repeating steps 3-5 until the accuracy on the verification set meets the preset threshold requirement;
in the concrete implementation of the invention, a teacher model and a student model use models with the same topological structure, dropout noise is added in the stage of the student model to increase the learning difficulty, and noise is not added when a pseudo label is generated in the stage of the teacher model.
Step 7: predicting the reading document to be tested with the student model finally obtained in step 6, and outputting the predicted label of the reading document to be tested and the answer contained in that label.
The reading document to be tested is also subjected to the correction preprocessing of step 1 before being input into the student model.
Step 7.1, splitting the reading document to be tested into sentences, screening each obtained sentence together with the question through a screening model to obtain the relevance between each sentence and the question, and arranging the K sentences with the highest relevance in the order in which they appear in the reading document to be tested to form a new document, as shown in fig. 2;
Step 7.2, inputting the new document into the student model finally obtained in step 6, outputting the predicted label of the new document, and extracting the answer in the label as the answer of the reading document to be tested.
The screening model as implemented uses a Chinese BERT pre-trained model that encodes the question-sentence pairs. The Chinese BERT pre-trained model is loaded with pre-trained weights, and its training always proceeds by fine-tuning. The loss is a binary cross-entropy loss function, optimized with an Adam adaptive-learning-rate gradient descent optimizer with a warmup mechanism.
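A sketch of this fine-tuning setup follows. The model name, learning rate and warmup ratio are illustrative assumptions.

```python
import torch
from torch.optim import AdamW
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

def build_screening_trainer(num_training_steps,
                            model_name="bert-base-chinese",
                            lr=2e-5, warmup_ratio=0.1):
    """Fine-tuning setup for the screening model: binary cross-entropy loss
    and an Adam-style optimizer with a warmup learning-rate schedule."""
    model = BertForSequenceClassification.from_pretrained(model_name, num_labels=1)
    criterion = torch.nn.BCEWithLogitsLoss()
    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_ratio * num_training_steps),
        num_training_steps=num_training_steps)
    return model, criterion, optimizer, scheduler
```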
Thus, by processing the reading document to be tested through the screening model, the method better reduces the interference of irrelevant noise on the case to be tested, exploiting the useful information in long texts while suppressing irrelevant information.
Based on the idea of multi-task learning, the answer prediction model is trained with the dynamic enhancement method over all context sentences, covering three subtasks: answer-category prediction, answer extraction and supporting-sentence prediction.
During the training of the answer prediction model, a part of the sentences other than the supporting sentences is randomly selected from the documents of the data set and combined with the supporting sentences to form the context text, serving as the dynamically enhanced context sentences (a sketch is given below);
in the testing process, the answer prediction model takes the new document obtained by the screening model in step 7.1 as the context text.
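A sketch of the dynamic context construction used during training follows. The number of extra sentences and the separator token are illustrative assumptions.

```python
import random

def dynamic_context(sentences, support_idx, question, num_extra=5, sep="[SEP]"):
    """Build a dynamically enhanced training context: keep every real
    supporting sentence, add a random subset of the remaining sentences
    in document order, and concatenate with the question."""
    others = [i for i in range(len(sentences)) if i not in support_idx]
    sampled = random.sample(others, min(num_extra, len(others)))
    keep = sorted(set(support_idx) | set(sampled))       # preserve document order
    context = f" {sep} ".join(sentences[i] for i in keep)
    return f"{question} {sep} {context}"
```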
All random seeds are fixed during training, and hyper-parameters such as the batch size are kept the same. The loss function L is composed of three parts: the answer category, the span prediction and the supporting sentences:
L = L_ans + α·CE(ŷ^{type}, y^{type}) + β·BCE(ŷ^{sf}, y^{sf})
L_ans = (CE(ŷ^{start}, y^{start}) + CE(ŷ^{end}, y^{end})) / 2
wherein L denotes the overall loss function, α denotes the weight of the answer-category loss, β denotes the weight of the supporting-sentence prediction loss, ŷ^{type} denotes the predicted probability of the answer belonging to each category, y^{type} denotes the true answer category, ŷ^{sf} denotes the predicted probability of each sentence being a supporting sentence, y^{sf} denotes the label of the real supporting sentences, ŷ^{start} denotes the predicted probability of each position in the context being the answer start position, ŷ^{end} denotes the predicted probability of each position in the context being the answer end position, y^{start} denotes the position of the real answer start in the context, y^{end} denotes the position of the real answer end in the context, CE() denotes the multi-class cross-entropy loss function, BCE() denotes the binary cross-entropy loss function, and L_ans denotes the answer loss function.
In the loss function, the start position and the end position each compute a cross-entropy loss, which are averaged and then added into the overall loss. Because each task has a different learning difficulty, the two weights α and β are added to control the differences; in the experiments their values are 0.5 and 0.8 respectively.
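A sketch of the combined loss follows, using the α = 0.5 and β = 0.8 values given above; the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(type_logits, y_type, sf_logits, y_sf,
               start_logits, y_start, end_logits, y_end,
               alpha=0.5, beta=0.8):
    """L = L_ans + alpha * CE(answer type) + beta * BCE(supporting sentences),
    with L_ans the mean of the start- and end-position cross entropies."""
    l_ans = 0.5 * (F.cross_entropy(start_logits, y_start)
                   + F.cross_entropy(end_logits, y_end))
    l_type = F.cross_entropy(type_logits, y_type)
    l_sf = F.binary_cross_entropy_with_logits(sf_logits, y_sf.float())
    return l_ans + alpha * l_type + beta * l_sf
```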
The example conditions were as follows:
fig. 5 is a sample data in the CAIL 2020 reading understanding data set, and the sample is taken as an example to describe the dynamic enhanced multi-hop reading understanding method facing low data according to the present invention.
(1) The document is preprocessed and semantically disambiguated.
(2) The labeled data in the training set are used to train a sentence screening model and an answer prediction model, which serve as the teacher model; the teacher model labels a randomly selected part of the unlabeled data with pseudo labels by the self-training method, and these are added to the labeled data; the student model is trained with the resulting new training set and then iterated into the teacher model; this process is repeated until the metrics no longer rise on the verification set; finally the test samples are tested.
(3) For the example shown in fig. 5, after the document is split into sentences by punctuation, the desensitized names wu x0, wu x2, chen x3, wu x6, wu x9 and chang x17 are obtained; disambiguation is then performed, according to the obtained names, on "wu x22.1 mu, chen x31.95 mu, wu x61.98 mu, wu x90.99 mu and chang x171.47 mu" in sentence [5], yielding the sentence "in 2001 the plaintiff wu x0 subcontracted its land to wu x2 2.1 mu, chen x3 1.95 mu, wu x6 1.98 mu, wu x9 0.99 mu and chang x17 1.47 mu";
(4) Each sentence of the context and the question are screened by the screening model to obtain a relevance score between each clause and the question, and the K sentences most relevant to the question's answer are kept in their original sentence order. K is set to 15 for this data set, and the selected sentences in the example are [2], [3], [4], [5], [9], [10], [11], [12], [13], [14], [15], [17], [18], [19] and [21].
(5) The question and the sentences screened in step (4) are concatenated and input into the answer prediction model to obtain the overall feature vector CLS_ALL, the question vector Q and the context vector C; the CLS_ALL vector is used for a four-class classification task to predict the answer type, and the answer type here is the extraction type, so the answer needs to be extracted from the document. In every dimension of C, a weighted summation of the re-encoded, dimension-reduced question vector Q and of CLS_ALL is performed, and a multilayer perceptron then maps the output to 2 dimensions to predict the answer start and end positions respectively.
type = softmax(Linear(CLS_ALL)/τ) ∈ R^{1×4}
Q′ = Attention_pooling(Q) ∈ R^{1×d}
C′ = Norm(w1·CLS_ALL + w2·Q′ + w3·C) ∈ R^{l×d}
ŷ^{start}, ŷ^{end} = MLP(C′) ∈ R^{l×2}
in the formulas, w denotes a weight, d denotes the dimension of the hidden vector, and l denotes the length of the context.
(6) The 10 positions with the highest scores are taken for the start position and for the end position respectively, the positions are paired, pairs whose span does not lie in the context part of the input or whose end position precedes the start position are removed, and the start/end pair with the highest score is taken as the final answer span, giving the answer "wu x6" to the question.
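A sketch of this span decoding follows. Combining the start and end scores by addition and capping the answer length are assumptions made for illustration.

```python
import torch

def decode_answer_span(start_prob, end_prob, context_start, top_n=10, max_len=64):
    """Pick the best (start, end) pair from the top-N start and end positions,
    discarding pairs outside the context portion or with end before start."""
    starts = torch.topk(start_prob, top_n).indices.tolist()
    ends = torch.topk(end_prob, top_n).indices.tolist()
    best, best_score = None, float("-inf")
    for s in starts:
        for e in ends:
            if s < context_start or e < s or e - s + 1 > max_len:
                continue
            score = start_prob[s].item() + end_prob[e].item()
            if score > best_score:
                best, best_score = (s, e), score
    return best
```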
(7) On the basis of the Transformer structure, the dimension of q and k used in the attention computation is enlarged to re-encode the sentence vectors, a parameter matrix superposes the sentence information obtained from each re-encoding for sentence feature fusion, an MLP then converts the sentence information into an n×1 output, and every dimension above the threshold 0.5 is judged to be a supporting sentence:
SFeature = W·AttentionPooling(C, CLS_ALL, Question)
ŷ^{sf} = sigmoid(SFeature)
in the formula, W denotes a weight matrix, initialized to a constant.
(8) Supporting sentences [3], [8], [9] and [11] are obtained from step (7) and mapped back to [5], [13], [14] and [17], giving the final supporting sentences.
The foregoing description of the embodiments is provided to enable a person of ordinary skill in the art to make and use the invention. It will be readily apparent to those skilled in the art that other modifications of these embodiments, and the generic principles defined herein, may be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the above embodiments; improvements and modifications made by those skilled in the art based on this disclosure fall within the protection scope of the present invention.

Claims (5)

1. A low-data-oriented dynamically enhanced multi-hop text reading recognition processing method, characterized by comprising the following steps:
step 1: carrying out correction preprocessing on the data set of a document;
step 2: constructing a dynamically enhanced answer prediction model;
step 3: taking the data set with known labels processed in step 1 as a training set, training the dynamically enhanced answer prediction model with the training set, and taking the trained answer prediction model as a teacher model;
step 4: randomly selecting a part of the unlabeled data set, inputting it into the teacher model obtained in step 3 to predict label results, using the predicted labels as pseudo labels attached to the unlabeled data to form a pseudo-labeled data set, and adding the pseudo-labeled data set to the training set to form a new training set;
step 5: retraining the teacher model obtained in step 3 with the new training set obtained in step 4, and taking the retrained model as a student model;
step 6: adopting another data set with known labels as a verification set and inputting it into the student model obtained in step 5 to test the accuracy of the model:
if the model accuracy meets the preset threshold requirement, proceeding to the next step;
if the model accuracy does not meet the preset threshold requirement, returning to step 3, iterating with the student model as the new teacher model, and repeating steps 3-5 until the accuracy on the verification set meets the preset threshold requirement;
step 7: predicting the reading document to be tested with the student model finally obtained in step 6, and outputting the label of the reading document to be tested and the answer in that label.
2. The low-data-oriented dynamically enhanced multi-hop text reading recognition processing method as claimed in claim 1, wherein: the data set includes documents, questions and labels, the labels including answers, supporting sentences and answer categories.
3. The low-data-oriented dynamically enhanced multi-hop text reading recognition processing method as claimed in claim 1, wherein in step 2 the dynamically enhanced answer prediction model specifically includes:
2.1, splitting the documents of the data set into sentences, extracting some or all of the sentences, and concatenating them with the question corresponding to the document to form a context text;
2.2, inputting the context text into a pre-trained answer prediction model to obtain the overall feature vector CLS_ALL of the context text and the decoded feature vectors of each word of the question and of each sentence; the decoded feature vectors of the words of the question form the question vector Q, and the decoded feature vectors of the words of all sentences form the context vector C;
2.3.
A. the overall feature vector CLS_ALL is passed through a linear layer to obtain the predicted answer type, expressed as:
type = softmax(Linear(CLS_ALL)/τ) ∈ R^{1×4}
wherein softmax() denotes the softmax activation function and τ denotes a hyper-parameter;
B. the question vector Q and the context vector C are processed as follows to obtain the context feature vector C', and from C' a multilayer perceptron extracts two results serving respectively as the predicted answer start position start and answer end position end:
Q′ = Attention_pooling(Q) ∈ R^{1×d}
C′ = Norm(w1·CLS_ALL + w2·Q′ + w3·C) ∈ R^{l×d}
ŷ^{start}, ŷ^{end} = MLP(C′) ∈ R^{l×2}
in the formulas, w1, w2 and w3 respectively denote the first, second and third weights, d denotes the dimension of the hidden vector, l denotes the length of the context, Q′ denotes the dimension-reduced question vector, Norm() denotes a normalization function, MLP() denotes a multilayer perceptron, and R^{1×4} denotes four dimensions corresponding to the four answer types; ŷ^{start} and ŷ^{end} denote the probability of each position in the context being the answer start position or the answer end position, corresponding respectively to the answer start position start and the answer end position end;
C. the supporting sentences of the predicted output are obtained jointly from the overall feature vector CLS_ALL, the question vector Q and the context vector C as follows:
SFeature = W·AttentionPooling(C, CLS_ALL, Q)
ŷ^{sf} = sigmoid(SFeature)
wherein W denotes a weight matrix, sigmoid() denotes the sigmoid activation function, AttentionPooling() denotes an attention pooling operation, and ŷ^{sf} denotes the supporting sentences of the predicted output.
4. The low-data-oriented dynamically enhanced multi-hop text reading recognition processing method as claimed in claim 1, wherein step 7 specifically includes:
step 7.1, splitting the reading document to be tested into sentences, screening each obtained sentence together with the question through a screening model to obtain the relevance between each sentence and the question, and arranging the K sentences with the highest relevance in the order in which they appear in the reading document to be tested to form a new document;
step 7.2, inputting the new document into the student model finally obtained in step 6, outputting the predicted label of the new document, and extracting the answer in the label as the answer of the reading document to be tested.
5. The low-data-oriented dynamically enhanced multi-hop text reading recognition processing method as claimed in claim 4, wherein: the screening model specifically uses a Chinese BERT pre-trained model, which encodes the question-sentence pairs.
CN202111144082.3A 2021-09-28 2021-09-28 Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method Pending CN113988079A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111144082.3A CN113988079A (en) 2021-09-28 2021-09-28 Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111144082.3A CN113988079A (en) 2021-09-28 2021-09-28 Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method

Publications (1)

Publication Number Publication Date
CN113988079A true CN113988079A (en) 2022-01-28

Family

ID=79737069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111144082.3A Pending CN113988079A (en) 2021-09-28 2021-09-28 Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method

Country Status (1)

Country Link
CN (1) CN113988079A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114691827A (en) * 2022-03-17 2022-07-01 南京大学 Machine reading understanding method based on iterative screening and pre-training enhancement
CN114969343A (en) * 2022-06-07 2022-08-30 重庆邮电大学 Weak supervision text classification method combining relative position information
CN114969343B (en) * 2022-06-07 2024-04-19 重庆邮电大学 Weak supervision text classification method combined with relative position information
CN114936296A (en) * 2022-07-25 2022-08-23 达而观数据(成都)有限公司 Indexing method, system and computer equipment for super-large-scale knowledge map storage
CN117313732A (en) * 2023-11-29 2023-12-29 南京邮电大学 Medical named entity identification method, device and storage medium
CN117313732B (en) * 2023-11-29 2024-03-26 南京邮电大学 Medical named entity identification method, device and storage medium

Similar Documents

Publication Publication Date Title
CN111291185B (en) Information extraction method, device, electronic equipment and storage medium
CN110188358B (en) Training method and device for natural language processing model
CN108984526B (en) Document theme vector extraction method based on deep learning
CN110765775B (en) Self-adaptive method for named entity recognition field fusing semantics and label differences
CN110609891A (en) Visual dialog generation method based on context awareness graph neural network
CN111414461B (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN113988079A (en) Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method
CN111985239A (en) Entity identification method and device, electronic equipment and storage medium
CN113626589B (en) Multi-label text classification method based on mixed attention mechanism
CN109977199B (en) Reading understanding method based on attention pooling mechanism
CN111858878B (en) Method, system and storage medium for automatically extracting answer from natural language text
CN111966812A (en) Automatic question answering method based on dynamic word vector and storage medium
CN113536801A (en) Reading understanding model training method and device and reading understanding method and device
CN110851594A (en) Text classification method and device based on multi-channel deep learning model
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
Dai et al. Hybrid deep model for human behavior understanding on industrial internet of video things
CN115270797A (en) Text entity extraction method and system based on self-training semi-supervised learning
CN115203507A (en) Event extraction method based on pre-training model and oriented to document field
CN115391520A (en) Text emotion classification method, system, device and computer medium
US20230121404A1 (en) Searching for normalization-activation layer architectures
CN117708324A (en) Text topic classification method, device, chip and terminal
CN111813938A (en) Record question-answer classification method based on ERNIE and DPCNN
CN116680407A (en) Knowledge graph construction method and device
CN115934944A (en) Entity relation extraction method based on Graph-MLP and adjacent contrast loss
CN113342964B (en) Recommendation type determination method and system based on mobile service

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination