CN114912419A - Unified machine reading comprehension method based on recombination adversarial learning - Google Patents

Unified machine reading comprehension method based on recombination adversarial learning

Info

Publication number
CN114912419A
Authority
CN
China
Prior art keywords: candidate, representing, text, questions, candidate item
Prior art date
Legal status: Pending
Application number
CN202210407939.4A
Other languages
Chinese (zh)
Inventor
廖劲智
唐九阳
赵翔
陈子阳
谭真
黄宏斌
吴继冰
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202210407939.4A
Publication of CN114912419A

Classifications

    • G06F40/126 — Handling natural language data; Text processing; Use of codes for handling textual entities; Character encoding
    • G06F40/284 — Natural language analysis; Recognition of textual entities; Lexical analysis, e.g. tokenisation or collocates
    • G06F40/30 — Handling natural language data; Semantic analysis
    • G06N3/044 — Neural networks; Architecture; Recurrent networks, e.g. Hopfield networks
    • G06N3/047 — Neural networks; Architecture; Probabilistic or stochastic networks
    • G06N3/048 — Neural networks; Architecture; Activation functions


Abstract

The invention belongs to the field of natural language processing and discloses a unified machine reading comprehension method based on recombination and adversarial learning, which addresses reading comprehension with diverse question types and an unfixed number of candidates. Questions and candidates are recombined to form combined candidates, each of which is judged against the reference text, so that the true/false questions, multiple-choice questions and matching questions of real reading comprehension scenes are uniformly converted into single-choice judgment questions. The recombined candidates are encoded jointly with the reference text to obtain, for each recombined candidate, a vector representation that has interacted with the reference text and the other recombined candidates; reference-text information related to each recombined candidate is then fused into its vector representation. Finally, each candidate vector is treated as an individual sample, the probability that it holds is predicted, and the result is output. The invention converts the multiple-choice problem into one-by-one true/false judgments of candidates; two mechanisms, a fusion layer and adversarial learning, improve the ability to distinguish different candidates.

Description

Unified machine reading comprehension method based on recombination adversarial learning
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a unified machine reading comprehension method based on recombination and adversarial learning.
Background
A machine's reading comprehension ability is an important criterion for measuring its level of intelligence. Multiple-choice machine reading comprehension places especially high demands on the machine's understanding of text semantics, because most of its candidates are synonymous paraphrases of the reference text. Current state-of-the-art approaches to multiple-choice machine reading comprehension mainly customize the model structure for a single question-type scenario. In a real reading comprehension scene, however, the same article is often examined from different angles, so multiple question types coexist and the number of candidates is not fixed. It is therefore not hard to see that existing customized models, having over-fitted to the characteristics of a single question scenario, cannot be extended to cover the requirements of real scenes.
Machine reading comprehension (MRC) is an important fundamental task in natural language processing and helps to further raise the intelligence level of machines. In this task, the machine must answer a specific question on the basis of a sufficient understanding of the provided reference text (a sentence, paragraph or article). According to whether the answer appears verbatim in the related document, MRC can be further divided into extractive reading comprehension (EMRC) and multiple-choice reading comprehension (MCMRC). EMRC extracts text spans that can answer the question from the original text as predicted answers, and some tasks additionally provide candidates. In contrast, the candidates in MCMRC are mostly synonymous paraphrases or summaries of the original content, which imposes more stringent requirements on the machine's comprehension ability.
Existing research on MCMRC can be roughly divided into three categories. The first designs attention mechanisms to realize interaction among text, question and candidates, promoting the discovery and fusion of information so that the model can predict results more accurately. The second is based on pre-trained language models: the first stage pre-trains on a large corpus, and the second stage fine-tunes on the downstream MCMRC task. Beyond these two, techniques such as cognitive-science-inspired reading strategies, evidence sentence selection and graph neural networks have also been introduced into MCMRC to obtain better results.
Specifically, a representative method, ALBERT (non-patent document: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations), first pre-trains on a corpus with self-supervised tasks; it then designs classifiers according to the characteristics of the downstream task, such as a four-way classifier on the four-choice single-answer datasets RACE (non-patent document: RACE: Large-scale ReAding Comprehension Dataset From Examinations) and DREAM (non-patent document: DREAM: A Challenge Dataset and Models for Dialogue-Based Reading Comprehension), or a three-way classifier on the dataset ReClor (non-patent document: ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning); finally, it predicts results with the classifier while fine-tuning the pre-trained model. By customizing structures such as classifiers to specific types of datasets, ALBERT achieved SOTA results on various tasks.
It should be noted that in a real reading comprehension scene the same article is examined from different angles, so multiple question types may be mixed, such as multiple-choice questions, true/false questions and matching questions. Along with the diversification of question types comes an indeterminate number of candidates: there may be two candidates (true and false), four candidates as in the traditional one-out-of-four setting, an indefinite number of candidates in matching questions, and so on. For example, the reading comprehension dataset TQA, derived from a middle-school science textbook, contains 1,076 reference texts and 26,260 questions, an average of 24 questions per article. To fully examine students' mastery of the texts, the post-lesson questions cover multiple dimensions such as true/false, multiple-choice and matching questions, and the question types alternate without any obvious blocking.
How to make a model adaptively solve such a mixed reading comprehension task therefore remains a challenging problem. Existing methods must design the combined input of text, question and candidates according to the question type at encoding time; for example, multiple-choice questions require the text, question and candidates to be concatenated as input, whereas true/false questions only need the text to be matched against the question, without considering candidates. In addition, the output form of the classifier, i.e. whether it is two-way, four-way and so on, must be designed according to the number of candidates for the final probability prediction. Although these problems can be alleviated to some extent by adding a type discriminator that judges the question type and dispatches it to a separate classifier, such a framework still has to be customized and rebuilt whenever the question types or the number of candidates change, and cannot cope with dynamically changing real scenes. The real-scene-driven mixed reading comprehension task therefore urgently requires dedicated further research.
Existing models for MCMRC-related tasks can be classified into three types according to their design ideas: attention-based models, models built on pre-trained language models, and other models. Traditional models pick the best answer by collecting and summarizing evidence (text snippets) in the article that is relevant to the question, and then matching the evidence against the candidates. Zhu Haichao, Wei Furu, Qin Bing et al., in the non-patent document Hierarchical Attention Flow for Multiple-Choice Reading Comprehension, observed that this simple match-and-search pattern does not fully utilize the information in the text, and therefore designed a neural hierarchical attention flow framework that fully exploits the candidates to model word-level and sentence-level interactions among the reference text, the question and the candidates. Wang Shuohang, Yu Mo, Jiang Jing et al., in the non-patent document A Co-Matching Model for Multi-choice Reading Comprehension, treat the question and a candidate as two sequences, compute two attention-weighted vectors from the question and the candidate for each character in the reference text, then encode how the question and candidate answers match a particular context of the reference text and jointly match them with the given text, enabling information aggregation from word level to sentence level and then from sentence level to document level. Chen Zhipeng, Cui Yiming, Ma Wentao et al., in the non-patent document Convolutional Spatial Attention Model for Reading Comprehension with Multiple-Choice Questions, first encode the reference text, question and candidates as word representations enhanced with POS tags and matching features; they then enrich the candidate representations by combining reference text and question information, and use a convolutional neural network to dynamically extract the spatial correlation between adjacent regions under different window sizes, forming spatial attention. Duan Liguo, Gao Jianying, Li Aiping et al., in the non-patent document A Study on Solution Strategy of operation-schemes in Machine Reading Comprehension, obtain the combined semantics of articles and questions; on top of the four common attention forms of concatenation, bilinear, dot product and difference, they fuse attention in the two directions query2context and context2query, strengthening the key information of articles and questions and weakening irrelevant information.
With the advent of BERT, pre-trained language models have proven to have powerful performance in machine language understanding, so researchers use them as the model's encoding layer and design attention-based structures on top. Ran Qiu, Li Peng, Hu Weiwei et al., in the non-patent document Option Comparison Network for Multiple-choice Reading Comprehension, propose an option comparison network that explicitly compares candidates at the word level: BERT is first used to encode each candidate into a vector sequence as its features, and each candidate is then compared one-to-one with the other candidates at the word level in vector space using an attention mechanism to identify their correlations. Zhang Shuailiang, Zhao Hai, Wu Yuwei et al., in the non-patent document DCMN+: Dual Co-Matching Network for Multi-Choice Reading Comprehension, likewise use BERT as the encoder and propose a dual co-matching network that bidirectionally models all pairwise relations within the <reference text, question, candidate> triple and fuses the representations of the two directions with a gating mechanism. Two reading strategies commonly used by humans are also integrated: one is passage sentence selection, which extracts evidence from the reference text and then matches it against the candidates; the other is candidate interaction, which encodes comparison information into each candidate. Such methods focus on designing attention mechanisms to explore the associations among reference text, question and candidates. However, different types of questions have different characteristics and require targeted design; for true/false questions, for example, there is no need to model the interaction of the candidate ("true" or "false") with the reference text and question.
As mentioned above, the appearance of BERT marks the natural language processing field's step into the era of pre-trained language models. Besides serving as the encoding module of a larger model, a pre-trained language model can also be fine-tuned directly on its vector representations and applied to downstream MCMRC tasks. Devlin J, Chang Ming-Wei, Lee K et al., in the non-patent document BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, propose the autoencoding language model BERT, whose overall framework is a 12-layer Transformer encoder; it is pre-trained on a general corpus with two self-supervised tasks, randomly masking and predicting the characters at the corresponding positions and predicting sentence continuity. Finally, the probability of each candidate is predicted through a classifier whose output size equals the number of candidates. XLNet was proposed in the non-patent document XLNet: Generalized Autoregressive Pretraining for Language Understanding because its authors believed the mask mechanism in BERT makes the data inputs of the pre-training and fine-tuning stages inconsistent. It abandons the mask mechanism and proposes a two-stream self-attention mechanism: inside the Transformer, n-1 words are randomly selected from the context of the chosen word and placed before it, while the inputs of the other words are hidden through an attention mask, thus combining the characteristics of autoencoding and autoregressive models. RoBERTa was designed by Liu Yinhan, Ott M, Goyal N et al. in the non-patent document RoBERTa: A Robustly Optimized BERT Pretraining Approach, improving BERT mainly in three ways: 1) the sentence continuity prediction task is removed after experimental verification; 2) a dynamic masking scheme is proposed, with random masking applied after a sentence is fed into the model; 3) the batch size of each training round is enlarged. Lan Zhenzhong, Chen Mingda, Goodman S et al. propose ALBERT in the non-patent document ALBERT: A Lite BERT for Self-supervised Learning of Language Representations with the aim of making the model lighter, faster to train and more effective: matrix factorization maps the one-hot word vector into a low-dimensional space before projecting it to the hidden layer, all encoding layers share parameters, and a sentence-order prediction self-supervised task is designed to predict whether two sentences have been swapped.
Although these methods achieve good results on various MCMRC leaderboards, fine-tuning requires designing specific input and output forms for the specific downstream task, so a pre-trained language model cannot be used directly in scenes where the option types are not fixed.
Besides the two main research routes above, researchers have explored other directions for solving MCMRC. Sun Kai, Yu Dian, Yu Dong et al., in the non-patent document Improving Machine Reading Comprehension with General Reading Strategies, inspired by reading strategies that cognitive science research has shown to effectively improve human readers' comprehension, propose three relatively independent reading strategies for MCMRC: 1) back-and-forth reading, jointly considering the forward and backward order of the input sequence; 2) highlighting, adding a trainable word vector representation to the text representation of tokens related to the question and candidates; 3) self-assessment, evaluating the model by generating questions and options directly from the text in an unsupervised manner. Wang Hai, Yu Dian, Sun Kai et al., in the non-patent document Evidence Sentence Extraction for Machine Reading Comprehension, strive to find evidence sentences in the text. Since real evidence sentence labels are lacking in most cases, distant supervision is employed to generate noisy labels, which are then used to train an evidence sentence extractor; a deep probabilistic logic learning framework cleans the noisy labels to improve their quality, followed by semi-supervised training that combines sentence-level and cross-sentence-level linguistic indicators. Huang Yinya, Fang Meng, Cao Yu et al., in the non-patent document DAGN: Discourse-Aware Graph Network for Logical Reasoning, propose a discourse-aware graph network that reasons over the discourse structure of the text. The model first encodes the discourse information into a logic graph in which elementary discourse units are nodes and discourse relations are edges; it then learns high-level discourse features with a graph neural network to represent the text, and finally combines the discourse features with contextual token features from a pre-trained language model.
Although these models perform well on different MCMRC tasks, the limitations of their task settings mean they cannot be directly migrated, under a unified framework, to application scenes with multiple question types and an unfixed number of candidates.
Disclosure of Invention
The unified machine reading comprehension method based on recombination and adversarial learning provided by the invention converts all question types of MCMRC into one-by-one true/false judgments of options, and thereby solves the above problems better. The invention provides a unified machine reading comprehension framework based on recombination and adversarial learning, solving the mixed-type reading comprehension problem with a single unified framework. First, a recombination layer recombines the candidates so that all question types are unified into one form; second, the encoding layer and the fusion layer fully realize the interaction between the reference text and the recombined candidates, so that the vector representation of each recombined candidate contains more supporting text information; finally, adversarial learning adds random perturbation during training to avoid the convergence of candidate vectors and to obtain differentiated representations of the recombined candidates. Experimental results on a reading comprehension dataset derived from a middle-school science textbook show that the designed method better solves the problems of diverse question types and an unfixed number of candidates.
The invention therefore studies a unified model framework for mixed-type machine reading comprehension and proposes a recombination adversarial model (RAM), a model framework designed to handle reading comprehension scenes with multiple question types and no fixed number of candidates. Aiming at the pain point that existing methods cannot adapt to this new problem, the RAM constructs a new reading comprehension framework that uniformly converts all questions into a single-choice judgment form and solves them by making a true/false judgment on each option. Specifically, the RAM first splits the candidates and combines them one by one with the question to form statement-like sentences. The reference text is then combined with the newly generated candidates, so that different question types are converted into judgment questions that predict the probability that each new candidate holds under the support of the reference text. This transformation overcomes the difficulty that existing research cannot handle multiple question types uniformly. To further enrich the information carried by the candidates, after encoding, the RAM fuses the attention interaction results between the reference text and the candidates into the candidate vector representations, making the judgment of the holding probability more reliable. Furthermore, in this framework, since the new candidates under the same question share the same component (the question), the content of the original candidate may be smoothed out during vector encoding, so the vector representations of the new candidates tend to converge. Adversarial learning is therefore adopted in the training stage: by randomly perturbing the vector representations, the differences between the new candidate representations are enlarged, forcing the model to pay more attention to the differing parts of the new candidates during learning. Finally, the model implemented on the basis of the RAM achieves superior test performance on a public dataset.
Specifically, the invention discloses a unified machine reading comprehension method based on recombination and adversarial learning, comprising the following steps:
obtaining questions of diverse types with an unfixed number of candidates;
recombining the questions and candidates to form combined candidates, judging whether each new candidate holds according to the reference text, and thereby uniformly converting the true/false, multiple-choice and matching questions of real reading comprehension scenes into single-choice judgment questions;
jointly encoding the recombined candidates and the reference text to obtain, for each recombined candidate, a vector representation that has interacted with the reference text and the other recombined candidates;
fusing reference-text information related to each recombined candidate into its vector representation, so as to enrich the information it carries;
and treating each candidate vector as an individual sample, predicting the probability that it holds, and outputting the result.
Further, uniformly converting the true/false, multiple-choice and matching questions of real reading comprehension scenes into single-choice judgment questions comprises:
splitting the candidates and combining them one by one with the question to form statement-like sentences;
and combining the reference text with the newly generated candidates, thereby converting different question types into judgment questions that predict the probability that each new candidate holds under the support of the reference text.
Further, the mathematical form of the judgment question is a triple <P, Q, A>,
where P represents the reference text, formally a sentence, paragraph or whole article; Q denotes a question in natural language; A = {a_1, a_2, ..., a_n} denotes the set of candidates corresponding to a particular question, n denotes the number of candidates, and among them there is one and only one correct answer.
Further, for any input question, the candidates are split and each is concatenated to the question, forming the same number of new candidates as original candidates, specifically:
C = {c_1, c_2, ..., c_n} = {Q + a_1, Q + a_2, ..., Q + a_n}
where C represents the recombined candidate set;
the reference text P is taken as background knowledge and combined with the candidates so that the machine can judge whether the factual statement in each candidate is true, the specific combined form being:
P, c_1, c_2, ..., c_n
All candidates are concatenated so that, when encoding one candidate, the model considers the information contained in the reference text P and also attends to the key information in the other candidates, allowing it to discover the associations between different candidates.
Further, jointly encoding the recombined candidates and the reference text comprises:
using the pre-trained language model RoBERTa as the tokenizer to perform tokenization and vocabulary mapping of the text, obtaining the encoding of the text and candidates, with the specific input form:
[CLS] P [SEP] c_1 [SEP] c_2 [SEP] ... [SEP] c_n [SEP]
where [CLS] represents the special character in RoBERTa marking the start position of the input text, [SEP] represents the separator character between different texts in RoBERTa, and n is the number of candidates;
and obtaining the word vector representation of the reference text and candidates:
E = (e_1, e_2, ..., e_m) ∈ R^(m×d)
where E denotes the word vector encoding of the input text sequence, e_m the word vector of the m-th character in the sequence, R the real space, m the text length, and d the dimension size.
Further, for data whose length exceeds 512 after the text and candidates are concatenated, the input upper limit is extended to 4096 by adopting Longformer's sliding window mechanism, dilated sliding window mechanism and sliding window mechanism fused with global information;
or the complexity of BERT is reduced to linear by using BigBird's random attention, window attention and global attention, likewise extending the input upper limit to 4096.
Further, fusing the reference-text information related to each recombined candidate into its vector representation comprises:
designing an attention mechanism between the text and the candidates:
a = tanh(E_p · W_pc · E_cj^T + b_pc)
s = softmax(a)
Ē_cj = s^T · E_p
where tanh is a nonlinear activation function realizing a nonlinear mapping, W_pc is a learnable weight matrix, E_p is the reference text vector representation, E_cj is the vector representation of candidate c_j, b_pc is a bias term, j is the index of the candidate, softmax normalizes the values of the matrix into [0,1], and Ē_cj is the resulting attention vector representation of candidate c_j with respect to the reference text;
mean pooling is used to obtain the average vector representation of each candidate, namely:
C_j = sum(Ē_cj) / |c_j|
where |·| denotes the number of objects, sum denotes a linear summation, and the candidate mean vector representation C_j ∈ R^d is used for information fusion and the final probability prediction;
after obtaining the attention vector representation, information fusion is realized with a highway network mechanism, as follows:
r_j = relu(W_r · [C_j ; Ē_p])
g = sigmoid(W_g · [C_j ; Ē_p])
O_j = g ⊙ r_j + (1 − g) ⊙ C_j
where W_r and W_g are learnable weight matrices, [ ; ] denotes the concatenation of the input vectors, ⊙ denotes element-wise multiplication of corresponding positions of two matrices, relu is a nonlinear activation function, O_j is the linear interpolation of the input vector C_j and the intermediate vector r_j, g is a gating value that controls and adjusts the proportion of each part in the linear interpolation, and Ē_p is the average vector representation of the reference text;
the final candidate representation M_j ∈ R^(2d) after the fusion layer is obtained by linear concatenation:
M_j = [C_j ; O_j]
Further, treating each candidate vector as an individual sample and making a probability judgment on whether it holds comprises:
performing a probability prediction in [0,1] on the candidate vector with a binary classifier:
P = sigmoid(f_p(M))
where P represents the probability output for the candidate, sigmoid smooths the neural network output into [0,1], and f_p represents a binary fully-connected prediction network with output dimension 1;
during training, a probability prediction is made for each candidate and the loss values are computed one by one, with the mathematical form:
L_p = (1/|N|) · Σ_{k∈N} BCE(p_k, l_k)
where BCE is the binary cross-entropy loss function, L_p is the computed loss value, l is the true label of a sample, k indexes the k-th candidate, N is the set of all candidates in the current batch, |·| denotes the number of objects, and p is the prediction probability of the model.
Furthermore, an adversarial learning mechanism is introduced: random perturbation is added to the candidate vector representations, forcing the model to focus on the regular key semantic distribution and realizing differentiated representations of the candidate vectors.
Further, in adversarial training the input is denoted by E and the model parameters by θ, and the adversarial training adds the following loss to the loss function of the binary classifier:
L_adv = L(E + r_adv, θ),  with  r_adv = argmax_{r, ||r|| ≤ ε} L(E + r, θ)
where r_adv is the finally applied adversarial perturbation, r is a random perturbation, ||·|| is the two-norm, ε is a hyperparameter, and L is the loss function;
the above formula is realized in the neural network with the following linear approximation, the specific process being:
r_adv = −ε · g / ||g||
g = ∇_E L(f(E; θ), y)
where g denotes the gradient of the loss function L with respect to the input word vector representation E, ∇ denotes the gradient operation, f the model operation, and y the sample label.
The invention has the beneficial effects that:
the defects in the traditional MCMRC task setting are found, new reading and understanding tasks with multiple types of problems and unspecific number of candidate items are provided according to a real reading and understanding scene, and the multi-selection problem is converted into the correct and wrong judgment of the candidate items one by one.
The model for uniformly solving the problems comprises a candidate item recombination layer for uniformly solving different types of problems, a coding layer for realizing long text semantic representation, an attention fusion layer for enriching candidate item information, a confrontation training module for expanding semantic representation difference and a prediction layer;
in order to improve the distinguishing capability of different candidate items, two mechanisms of a fusion layer and counterstudy are provided, and the effectiveness of the RAM in solving the problems through a unified framework and the reasonability of the design of different components are proved.
Drawings
FIG. 1 is a diagram of a model framework of the present invention.
Detailed Description
The invention is further described with reference to the accompanying drawings, but the invention is not limited in any way, and any alterations or substitutions based on the teaching of the invention are within the scope of the invention.
Definition 1 (mixed-type multiple-choice reading comprehension): the MCMRC task can be expressed as providing a triple <P, Q, A>, where P represents the reference text, which may be a sentence, a paragraph or a whole article; Q represents a question in natural language form, the same reference text may have multiple questions, and the question types are diverse; A = {a_1, a_2, ..., a_n} denotes the set of candidates corresponding to a particular question, n denotes the number of candidates, and among them there is one and only one correct answer. The machine is asked to read the reference text, understand the question, and finally predict the correct result from the candidates.
The model of the invention comprises a recombination layer, an encoding layer, a fusion layer, an adversarial training module and a prediction layer; its framework is shown in FIG. 1.
To solve the problems of question diversity and an uncertain number of candidates, the invention recombines the question and the candidates in the recombination layer to form combined candidates and judges whether each new candidate holds according to the reference text, thereby uniformly converting the true/false, multiple-choice and matching questions of real reading comprehension scenes into a single-choice judgment problem.
So that the model has sufficient supporting information when judging the recombined candidates one by one, the invention designs an encoding layer and a fusion layer. The encoding layer jointly encodes the recombined candidates produced by the recombination layer together with the reference text, obtaining vector representations that have interacted with the reference text and the other recombined candidates; the fusion layer further enriches the information contained in each recombined candidate's vector representation by fusing related reference-text information into it.
To avoid the convergence of vector representations caused by the large overlapping parts of the recombined candidates, the invention introduces an adversarial learning mechanism: random perturbation is added to the candidate vector representations, forcing the model to pay more attention to the regular key semantic distribution and realizing differentiated candidate vector representations.
Recombination layer: in the prior art, the data input format and the probability prediction classifier of a model must be custom-designed according to the question type and the number of candidates. Specifically, for true/false questions only the text and the question need to be input, and a binary judgment is made on the generated text vector; for multiple-choice questions, all candidates must be concatenated after the combined text and question as input, and a multi-way classification judgment is made on the candidate vectors through the model's internal interaction.
So that the model is invariant to the specific type of input question and the actual number of candidates under that question, the RAM first converts all questions uniformly into a single-choice judgment form. For any input question, the candidates are split and each is concatenated to the question, forming the same number of new candidates as original candidates, specifically:
C = {c_1, c_2, ..., c_n} = {Q + a_1, Q + a_2, ..., Q + a_n}    (1)
where C represents the recombined candidate set; for brevity, these candidates are hereinafter referred to as recombined candidates.
Then, the reference text P is taken as background knowledge and combined with the candidates so that the machine can judge whether the factual statement in each candidate is true, in the specific combined form:
P, c_1, c_2, ..., c_n
All candidates are concatenated because the RAM is expected, when encoding one candidate, to consider the information contained in the reference text P and also to attend to the key information in the other candidates, so that the model can discover the associations between different candidates and make more reasonable predictions.
After the model's internal interaction is completed, the RAM, unlike existing work, makes a binary true/false judgment on each candidate one by one.
Neural networks cannot directly process discrete natural language symbols, so the natural language must be mapped to word vectors by an encoding layer. The invention first adopts the pre-trained language model RoBERTa as the tokenizer to perform tokenization and vocabulary mapping of the text, with the specific input form:
[CLS] P [SEP] c_1 [SEP] c_2 [SEP] ... [SEP] c_n [SEP]
where [CLS] represents the special character in RoBERTa marking the start of the input text, and [SEP] represents the separator character between different texts in RoBERTa. Through RoBERTa, the encoding of the text and candidates can be obtained.
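A minimal encoding sketch, assuming the Hugging Face Transformers package mentioned in the experiments; the checkpoint name "roberta-base" and the truncation settings are assumptions and not fixed by the invention:

import torch
from transformers import RobertaModel, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")

passage = "Water boils at 100 degrees Celsius at sea level."
candidates = ["Water boils at sea level at 100 degrees Celsius. true",
              "Water boils at sea level at 100 degrees Celsius. false"]

# RoBERTa uses <s> and </s> where BERT uses [CLS] and [SEP]; the recombined
# candidates are joined with the separator token by hand.
text = f" {tokenizer.sep_token} ".join([passage] + candidates)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    E = encoder(**inputs).last_hidden_state  # E = (e_1, ..., e_m), shape (1, m, d)
print(E.shape)  # d = 768 for roberta-base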
In real-scene reading comprehension the reference text is generally long. RoBERTa follows the BERT setting: each character is encoded by a self-attention operation over all characters of the input text, so the encoded information can capture contextual information across the whole sequence. This also causes the memory and computation required to grow quadratically with the sequence length, with computational complexity up to O(m^2), where m represents the number of characters in the input text. Word vector encoding therefore has an upper length limit of 512 characters, which means data whose length exceeds 512 after the text and candidates are concatenated cannot be processed.
To overcome this problem, the invention employs Longformer (Longformer: The Long-Document Transformer), which can process long texts, as the encoding mechanism. Longformer improves the traditional self-attention mechanism: each character performs local attention only over the other characters within a fixed-size window, and global attention is computed for specific character segments according to the specific task. Specifically, Longformer proposes three new attention patterns: a sliding window mechanism, a dilated sliding window mechanism, and a sliding window mechanism fused with global information.
In the sliding window mechanism, attention is computed only between each character and the w characters near it, which reduces the computational complexity to O(w × m). On top of the sliding window mechanism, Longformer uses the idea of dilated convolution to design a dilated sliding window: without increasing the computational load, a gap of size d is inserted between the adjacent characters covered by the sliding window, so the "field of view" of attention is expanded from a context of w to w × d. Furthermore, Longformer introduces global attention, which allows the same global attention computation as RoBERTa on a small number of specific characters or segments, depending on the task being served. With these mechanisms, Longformer raises the upper limit of the encoded text length to 4096, which basically satisfies the need for encoding and analyzing long texts.
In addition, BigBird (Big Bird: Transformers for Longer Sequences) also reduces the complexity of BERT to linear by designing and combining random attention, window attention and global attention, likewise extending the input upper limit to 4096.
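The long-text encoding with Longformer can be sketched as below, with global attention placed on the recombined candidate tokens; the checkpoint name and the way the global attention mask is built are illustrative assumptions:

import torch
from transformers import LongformerModel, LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

passage = "a very long reference text ..."
candidates = ["question + candidate 1", "question + candidate 2"]
text = f" {tokenizer.sep_token} ".join([passage] + candidates)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

# Sliding-window (local) attention is applied everywhere; global attention is
# switched on for every token after the first separator, i.e. the candidates.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
sep_positions = (inputs["input_ids"][0] == tokenizer.sep_token_id).nonzero()
first_sep = sep_positions[0].item()
global_attention_mask[0, first_sep + 1:] = 1

with torch.no_grad():
    out = model(**inputs, global_attention_mask=global_attention_mask)
print(out.last_hidden_state.shape)  # up to 4096 positions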
At this point, the word vector representations of the reference text and the candidates can be obtained:
E = (e_1, e_2, ..., e_m) ∈ R^(m×d)
where E denotes the word vector encoding of the input text sequence, e_m the word vector of the m-th character in the sequence, R the real space, m the text length, and d the dimension size.
Through the encoding layer, the reference text vector representation E_p ∈ R^(x×d) and the candidate vector representations E_cj ∈ R^(y_j×d) can be obtained, where x denotes the length of the reference text and y_j the length of the j-th candidate.
To further capture the association between a candidate and the reference text and enrich the information contained in the candidate representation, the invention designs an attention mechanism between the text and the candidates:
a = tanh(E_p · W_pc · E_cj^T + b_pc)    (2)
s = softmax(a)    (3)
Ē_cj = s^T · E_p    (4)
where tanh is a nonlinear activation function realizing a nonlinear mapping, W_pc is a learnable weight matrix, b_pc is a bias term, and softmax normalizes the values of the matrix into [0,1]. The result Ē_cj is the attention vector representation of candidate c_j with respect to the reference text.
Then, mean pooling is adopted to obtain the average vector representation of each candidate, namely:
C_j = sum(Ē_cj) / |c_j|    (5)
where |·| denotes the number of objects and sum denotes a linear summation. The candidate mean vector representation C_j ∈ R^d will be used for information fusion and the final probability prediction.
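A plausible PyTorch rendering of equations (2)-(5), assuming W_pc acts on the reference text side and softmax normalizes over the passage tokens; shapes and names are assumptions consistent with the definitions above:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PassageCandidateAttention(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.W_pc = nn.Linear(d, d, bias=True)  # learnable W_pc with bias b_pc

    def forward(self, E_p: torch.Tensor, E_cj: torch.Tensor) -> torch.Tensor:
        # E_p: (x, d) reference text vectors; E_cj: (y_j, d) candidate vectors.
        a = torch.tanh(self.W_pc(E_p) @ E_cj.T)       # (x, y_j)
        s = F.softmax(a, dim=0)                       # normalize over passage tokens
        attended = s.T @ E_p                          # (y_j, d) attention representation
        C_j = attended.sum(dim=0) / attended.size(0)  # mean pooling -> (d,)
        return C_j

attention = PassageCandidateAttention(d=768)
C_j = attention(torch.randn(300, 768), torch.randn(12, 768))
print(C_j.shape)  # torch.Size([768])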
After obtaining the attention vector representation, the invention uses a highway network mechanism to realize information fusion, as follows:
r_j = relu(W_r · [C_j ; Ē_p])    (6)
g = sigmoid(W_g · [C_j ; Ē_p])    (7)
O_j = g ⊙ r_j + (1 − g) ⊙ C_j    (8)
where W_r and W_g are learnable weight matrices, [ ; ] denotes the concatenation of the input vectors, ⊙ denotes element-wise multiplication of corresponding positions of two matrices, relu is a nonlinear activation function, O_j is the linear interpolation of the input vector C_j and the intermediate vector r_j, g is a gating value that controls and adjusts the proportion of each part in the linear interpolation, and Ē_p is the average vector representation of the reference text.
The final candidate representation M_j ∈ R^(2d) after the fusion layer is obtained by linear concatenation:
M_j = [C_j ; O_j]
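A highway-style fusion sketch consistent with equations (6)-(8); treating the gate as a sigmoid over the concatenated inputs is an assumption borrowed from standard highway networks:

import torch
import torch.nn as nn

class HighwayFusion(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.W_r = nn.Linear(2 * d, d)  # transform branch
        self.W_g = nn.Linear(2 * d, d)  # gate branch

    def forward(self, C_j: torch.Tensor, p_mean: torch.Tensor) -> torch.Tensor:
        x = torch.cat([C_j, p_mean], dim=-1)   # [C_j ; mean reference text vector]
        r_j = torch.relu(self.W_r(x))          # intermediate vector
        g = torch.sigmoid(self.W_g(x))         # gating value
        O_j = g * r_j + (1.0 - g) * C_j        # element-wise linear interpolation
        return torch.cat([C_j, O_j], dim=-1)   # M_j with dimension 2d

fusion = HighwayFusion(d=768)
M_j = fusion(torch.randn(768), torch.randn(768))
print(M_j.shape)  # torch.Size([1536])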
Although the attention fusion mechanism gives the RAM some ability to distinguish the key information in candidates, it can be seen from equation (1) that the question makes up a large part of each candidate. When the candidate vectors are represented, too many identical components among different candidates may cause the model to smooth out the original candidate information, so the learned representations become overly similar, which affects the prediction.
Adversarial learning aims to improve a model's ability to distinguish original samples from adversarial samples, where an adversarial sample is generated by adding a small random perturbation to an original input sample. Adversarial learning was first proposed in image classification, where its purpose was to add perturbations that are hard for the human eye to notice but very disruptive to the machine, in order to change the model's prediction of the image class. Unlike image classification, where perturbations are added directly to the original input samples, adversarial perturbations in text classification cannot act on the natural language text itself, because discrete symbolic inputs do not satisfy the continuity required for inserting perturbations; they therefore act on the continuous word vector representations.
Therefore, random perturbation is actively introduced during training to increase the data volatility of the model's training process and force the model to attend to the differences between different candidate representations. In adversarial training the input is denoted by E and the model parameters by θ, and the adversarial training adds the following loss to the loss function of the original classifier:
L_adv = L(E + r_adv, θ),  with  r_adv = argmax_{r, ||r|| ≤ ε} L(E + r, θ)    (9)
where r_adv is the finally applied adversarial perturbation, r is a random perturbation, ||·|| is the two-norm, ε is a hyperparameter, and L is the loss function.
The RAM adopts a linear approximation to realize the above formula in the neural network, with the specific process:
r_adv = −ε · g / ||g||    (10)
g = ∇_E L(f(E; θ), y)    (11)
where g denotes the gradient of the loss function L with respect to the input word vector representation E, ∇ denotes the gradient operation, f the model operation, and y the sample label.
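An FGM-style sketch of the linear approximation in equations (10)-(11): the gradient of the clean loss with respect to the word embeddings is normalized and scaled by ε, and a second forward pass is run on the perturbed embeddings. The helper name is hypothetical, and the perturbation is taken along the loss gradient as in common FGM implementations, whereas equation (10) as printed carries a minus sign:

import torch

def adversarial_perturbation(loss: torch.Tensor, E: torch.Tensor,
                             eps: float = 1.0) -> torch.Tensor:
    # loss: scalar loss of the clean forward pass; E: word embeddings that
    # require gradients. Returns r_adv with the same shape as E.
    g, = torch.autograd.grad(loss, E, retain_graph=True)
    r_adv = eps * g / (g.norm() + 1e-12)  # normalized gradient direction
    return r_adv.detach()

# schematic usage:
#   E.requires_grad_(True)
#   clean_loss = bce(model_from_embeddings(E), labels)          # L_p
#   r_adv = adversarial_perturbation(clean_loss, E, eps=1.0)
#   adv_loss = bce(model_from_embeddings(E + r_adv), labels)    # L_adv
#   (clean_loss + adv_loss).backward()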
After the above process, the model treats each candidate vector as an individual sample and makes a probability judgment on whether it holds. Through this operation, the RAM achieves unified prediction for different question types and different numbers of candidates. Specifically, the model finally makes a probability prediction in [0,1] on the candidate vector through a binary classifier:
P = sigmoid(f_p(M))    (12)
where P represents the probability output for the candidate, sigmoid smooths the neural network output into [0,1], and f_p represents a binary fully-connected prediction network with output dimension 1.
During training, unlike the multi-class cross-entropy loss computation of traditional models, the RAM makes a probability prediction for each candidate and computes the loss values one by one, following the binary cross-entropy loss function (BCE). The mathematical expression is:
L_p = (1/|N|) · Σ_{k∈N} BCE(p_k, l_k)    (13)
where BCE is the binary cross-entropy loss function, L_p is the computed loss value, l is the true label of a sample, k indexes the k-th candidate, N is the set of all candidates in the current batch, |·| denotes the number of objects, and p is the prediction probability of the model.
Similar to the conventional loss computation, the training process of adversarial learning only needs to add the random perturbation r_adv to the input vectors, with the specific process:
P_adv = sigmoid(f_p(C + r_adv))    (14)
L_adv = (1/|N|) · Σ_{k∈N} BCE(p_adv,k, l_k)    (15)
Combining the two, the final loss function for RAM training is:
L = L_p + L_adv    (16)
During testing, the prediction of equation (12) is used directly as the probability prediction of each candidate, and the candidate with the highest probability value under the same question is selected as the final predicted answer.
This section first introduces the experimental settings, including the datasets, models and parameters used; it then tests the overall effect of the RAM on TTQA, analyzes the results for different question types and different numbers of candidates, and finally presents ablation experiments and case analysis.
To reflect the characteristics of multiple question types and an unfixed number of candidates in real reading comprehension scenes, the experiments use the TQA text question-answering dataset. TQA is a multi-modal textbook question dataset comprising 13,693 text-only question-answer items and 12,567 text-plus-image items; to fully evaluate the effectiveness of the model, its text-only question-answering part is used as the experimental dataset, abbreviated TTQA (dataset link: https://allenai.org/data/TQA). Only the data related to the text question-answering part is extracted, and images and image-related noisy data are removed. In the end 1,073 documents and the 13,049 questions and candidates corresponding to them are obtained; detailed statistics of the data are given in Table 1, where "#" denotes a count, "-" denotes that the corresponding dataset has no samples of that category, and T/F denotes true/false questions. Following the original training/validation/test split of TQA, 663 documents of TTQA are used for training, 200 for validation and 210 for testing. Compared with traditional MCMRC datasets, TTQA contains both true/false and multiple-choice questions, has three possible numbers of candidates (2, 4 and 7), has a longer average text length, and has a higher ratio of average question length to average candidate length.
TABLE 1 MCMRC data set statistics
Because this type of task is proposed for the first time in the invention and the related problems in TTQA are difficult to solve with a unified model using existing methods, random selection is designed as the baseline, and the RAM under 3 different settings is used for experimental analysis.
Random selection (Random): a random function randomly generates result predictions for the different questions.
RAM (Bi-LSTM): Bi-LSTM replaces the encoding layer of the model; the rest operates as in the model.
RAM (BigBird): BigBird is used as the word vector encoder in the encoding layer; the rest operates as in the model.
RAM (Longformer): Longformer is used as the word vector encoder in the encoding layer; the rest operates as in the model.
The RoBERTa, BigBird and Longformer used in the invention all adopt the versions packaged in the officially open-sourced Transformers library of Hugging Face. In Bi-LSTM, BigBird and Longformer, the maximum text length is set to 2500. In Bi-LSTM, batch_size is 64, hidden_dim is 100, and GloVe (GloVe: Global Vectors for Word Representation) is used as the initialization word vectors. In BigBird and Longformer, batch_size is 9 and the word vector dimension d is 768. The attention block size block_size in BigBird is 64, and the number of random attention blocks num_random_blocks is 3. The attention window size in Longformer is attention_window = 256, and the candidate text sequences are selected as the global attention objects. The hyperparameter ε in adversarial learning is 1. The model test environment is a server with 3 Tesla V100 32G GPUs. Accuracy, i.e. the proportion of correctly answered questions among all questions, is adopted as the evaluation metric.
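For reference, the reported settings can be collected into a single illustrative configuration; the key names are assumptions, the values are those stated above:

config = {
    "max_text_length": 2500,
    "bilstm": {"batch_size": 64, "hidden_dim": 100, "word_vectors": "GloVe"},
    "bigbird": {"batch_size": 9, "hidden_dim": 768,
                "block_size": 64, "num_random_blocks": 3},
    "longformer": {"batch_size": 9, "hidden_dim": 768,
                   "attention_window": 256,
                   "global_attention": "candidate token sequences"},
    "adversarial_epsilon": 1.0,
    "metric": "accuracy",
}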
TABLE 2 Overall Utility results
Table 2 shows the experimental results, where All is the overall result over all questions in TTQA. Overall, the RAM results are not yet fully satisfactory, reaching 69.56% under the best setting (Longformer), which still leaves considerable room for improvement compared with human performance. It can also be seen that the choice of encoding structure has a large effect on the results: Bi-LSTM performs poorly on long texts because it has difficulty capturing contextual information in long sequences. The reason Longformer exceeds BigBird by 17.72% may be that its sliding window setting expands the boundary of attention, so the RAM learns more semantic associations during training; meanwhile, setting the candidates as global attention objects helps model the interaction between candidates and the reference text. To investigate the performance of the RAM on multiple question types and different numbers of candidates, the dataset is further split and experiments are run on the different settings; the detailed analysis is given below.
TTQA contains true/false questions (T/F), multiple-choice questions (MC) and matching questions; since a matching question also amounts to selecting the correct item from multiple choices, the invention merges matching into the multiple-choice questions for a unified analysis. The final results are shown in Table 3.
TABLE 3 Multi-type problem results
In general, the accuracy on true/false questions is higher than on multiple-choice questions. However, compared with random selection, where the accuracy on true/false questions is twice that on multiple-choice questions, the gap between the two is clearly reduced in the RAM, which shows that the proposed structure is effective for handling multiple question types. Besides the difference in the number of candidates, another possible reason why true/false questions are handled better than multiple-choice questions is that the RAM framework is essentially a structure that converts MCMRC into single-choice true/false judgment, and is therefore naturally suited to solving true/false questions.
Analysis for different numbers of candidates: since there are no two-option questions among the multiple-choice questions of TTQA, the result for 2 candidates is taken from the judgment-question column; the results are shown in Table 4.
TABLE 4 Results for different numbers of candidates
It can be seen from the table that as the number of candidates increases, the prediction accuracy of the model decreases correspondingly. This is consistent with human performance in reading-comprehension tests: more options usually mean more distractors, which further interferes with human judgment. For RAM, however, a larger number of candidates means greater difficulty in learning the dissimilarity between different candidate representations. Specifically, during learning the model tends to separate the positions of positive and negative samples in the vector space, and because the average candidate length in TTQA (2.6) is short compared with the average question length (9.8), the recombined candidates produced by Equation (1) overlap heavily. A larger number of candidates then means more negative training samples, and negative samples with the same label and a high degree of overlap lie close together in the vector space, so the representations of different candidates converge, further increasing the difficulty of distinguishing the key information in different candidates. The attention fusion and adversarial learning modules are designed in the present invention to address this problem.
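To make the overlap problem concrete, the following toy sketch (with made-up question and option strings, not taken from TTQA) shows how recombining a short option with the shared question produces candidates whose token sets largely coincide.

question = "Which of the following statements about seed plants is correct"
options = ["they all produce flowers", "they all produce seeds",
           "they are all woody plants", "they all lack vascular tissue"]

candidates = [f"{question} {opt}" for opt in options]   # c_i = Q + a_i, as in Equation (1)

def token_overlap(x: str, y: str) -> float:
    """Jaccard overlap between the token sets of two recombined candidates."""
    sx, sy = set(x.split()), set(y.split())
    return len(sx & sy) / len(sx | sy)

print(round(token_overlap(candidates[0], candidates[1]), 2))   # close to 1, since Q dominates both

Because the shared question dominates each recombined candidate, their encoded representations start out nearly identical, which is exactly the convergence problem the fusion and adversarial modules discussed below are designed to counteract.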
In order to further analyze the role of each part of RAM in the whole model and its influence on the final result, ablation experiments are performed in this section, using RAM (Longformer) as the analysis model and adjusting the different components of the framework one by one. The final results are shown in Table 5, where Enc denotes the coding layer, Fus denotes the fusion layer, and Adv denotes adversarial learning. Since Bi-LSTM can be regarded as RAM with the coding layer removed, the w/o Enc entry is tested using Bi-LSTM as the word-vector encoder.
TABLE 5 Ablation results
The large drop in performance (27.26%) after removing the coding layer is expected, because Bi-LSTM cannot overcome the long-distance dependency problem: during propagation it attends excessively to newly added text, so long-distance information is gradually lost. Although it fuses information by maintaining a global vector, important information is still lost in the fusion process. RoBERTa greatly alleviates this problem by encoding the sequence with the Transformer self-attention encoder, and the structural designs for long text in Longformer, such as sliding-window attention and global attention, are also highly effective.
Adding the fusion layer improves the result by 1.56%, and improvements are observed across the splits with different numbers of candidates, indicating that this module helps the model to overcome, to a certain extent, the representation-convergence problem caused by the structurally increased number of candidates. This also shows that although RAM already realizes interaction between the candidates and the reference-text vector representations through mechanisms such as sliding-window attention and global attention in Longformer, extracting information at different granularities for attention computation can still improve the model.
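The following is a minimal PyTorch sketch of this attention-fusion idea: the mean-pooled candidate attends over the reference-text tokens, and a highway-style gate mixes the transformed joint vector with a summary of the passage. Module names, dimensions, and the exact form of the scoring function are illustrative assumptions rather than the patented implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    def __init__(self, d: int = 768):
        super().__init__()
        self.w_pc = nn.Linear(d, d)        # attention weights between passage and candidate
        self.w_r = nn.Linear(2 * d, d)     # transform branch of the highway-style fusion
        self.w_g = nn.Linear(2 * d, d)     # gate branch of the highway-style fusion

    def forward(self, passage: torch.Tensor, cand: torch.Tensor) -> torch.Tensor:
        # passage: (Lp, d) token vectors of P; cand: (Lc, d) token vectors of one candidate c_j
        c_mean = cand.mean(dim=0)                                  # mean-pooled candidate vector
        scores = torch.tanh(self.w_pc(passage)) @ c_mean           # (Lp,) attention scores over P
        attn = F.softmax(scores, dim=0)
        c_hat = attn @ passage                                     # candidate-aware passage summary
        fused_in = torch.cat([c_mean, c_hat], dim=-1)              # concatenated input to the gate
        r = F.relu(self.w_r(fused_in))                             # transformed joint vector
        g = torch.sigmoid(self.w_g(fused_in))                      # highway gate
        o = g * r + (1 - g) * passage.mean(dim=0)                  # interpolate with mean of P
        return torch.cat([c_mean, o], dim=-1)                      # fused representation of size 2d

The output is one fused vector of size 2d per candidate, analogous to the final candidate representation consumed by the prediction layer.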
Adversarial learning contributes a 1.60% improvement, meaning that it is likewise effective against the convergence problem. This is probably because the random perturbations increase the dissimilarity between candidates, strengthening RAM's ability to discriminate different candidates. Specifically, candidates that share a large overlapping portion (the question) end up with clearly different positions in the vector space once random perturbations are added after they enter the model. During learning, although the processing of negative samples still follows the principle of pulling them closer together, the random perturbations contained in them force the model to pay more attention to the text features that occur regularly, so the key information in different candidates can be learned.
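A minimal sketch of such a perturbation step is given below, following the gradient-based formula r_adv = -ε·g/||g|| stated later in the claims; the function and variable names, and the use of a binary cross-entropy loss over sigmoid outputs, are assumptions for illustration.

import torch
import torch.nn.functional as F

def perturb_embeddings(model, embeddings, labels, epsilon: float = 1.0):
    """Return E + r_adv, where r_adv = -eps * g / ||g|| and g is the gradient of
    the candidate-classification loss with respect to the word-vector input E."""
    emb = embeddings.clone().detach().requires_grad_(True)
    loss = F.binary_cross_entropy(model(emb), labels)     # clean loss on this batch
    g, = torch.autograd.grad(loss, emb)                   # gradient w.r.t. the embeddings
    r_adv = -epsilon * g / (g.norm() + 1e-12)             # norm-bounded perturbation
    return (embeddings + r_adv).detach()

During training, the perturbed embeddings would then be passed through the model a second time and the resulting loss added to the clean loss.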
This section gives error examples for questions with different numbers of candidates, to visualize the data characteristics of TTQA and the misjudgments made by RAM; the results are shown in Table 6.
TABLE 6 Case analysis of mispredictions
As can be seen from the table, the judgment questions in TTQA cannot be answered by external knowledge alone without the reference text. For example, for "Is the angiosperm the most successful plant?", the object of comparison and the time scope of the statement must be taken into account, and the same question may have different answers in different textual contexts. Therefore, the invention aims to strengthen RAM's ability to capture textual information through the design of the coding layer, the fusion layer, and so on. In contrast, some of the multiple-choice questions can be answered with common sense alone.
The invention has the beneficial effects that:
the defects in the traditional MCMRC task setting are identified; a new reading-comprehension task with multiple question types and an unfixed number of candidates is proposed according to real reading-comprehension scenarios; and the multiple-choice problem is converted into one-by-one true/false judgment of the candidates;
a model for uniformly solving these questions is constructed, comprising a candidate recombination layer for uniformly handling different question types, a coding layer for long-text semantic representation, an attention fusion layer for enriching candidate information, an adversarial training module for enlarging the differences between semantic representations, and a prediction layer;
to improve the ability to distinguish different candidates, the two mechanisms of the fusion layer and adversarial learning are proposed, and the effectiveness of RAM in solving these questions within a unified framework, as well as the soundness of the design of its components, is demonstrated.
The word "preferred" is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "preferred" is not necessarily to be construed as advantageous over other aspects or designs. Rather, use of the word "preferred" is intended to present concepts in a concrete fashion. The term "or" as used in this application is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise or clear from context, "X employs A or B" is intended to include either of the permutations as a matter of course. That is, if X employs A; b is used as X; or X employs both A and B, then "X employs A or B" is satisfied in any of the foregoing examples.
Also, although the disclosure has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art based upon a reading and understanding of this specification and the annexed drawings. The present disclosure includes all such modifications and alterations and is limited only by the scope of the appended claims. In particular regard to the various functions performed by the above described components (e.g., elements, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., that is functionally equivalent), even though not structurally equivalent to the disclosed structure which performs the function in the herein illustrated exemplary implementations of the disclosure. In addition, while a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "includes", "has", "contains", or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising".
Each functional unit in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as a stand-alone product, it may also be stored in a computer-readable storage medium. The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like. Each apparatus or system described above may execute the storage method in the corresponding method embodiment.
In summary, the above-mentioned embodiment is an implementation manner of the present invention, but the implementation manner of the present invention is not limited by the above-mentioned embodiment, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be regarded as equivalent replacements within the protection scope of the present invention.

Claims (10)

1. The unified machine reading understanding method based on the reorganization confrontation is characterized by comprising the following steps:
obtaining questions of diversified types and with an unfixed number of candidates;
recombining the questions and the candidates to form combined candidates, and judging whether each new candidate holds according to the reference text, thereby uniformly converting the judgment questions, selection questions and matching questions in actual reading-comprehension scenarios into single-choice judgment questions;
jointly encoding the recombined combined candidates and the reference text to obtain vector representations of the recombined candidates after interaction with the reference text and with the other recombined candidates;
merging the reference-text information related to each recombined candidate vector representation into that vector representation, so as to enrich the information it contains;
and regarding each candidate vector as a single sample, predicting the probability that it holds, and outputting the result.
2. The reorganization countermeasure-based unified machine reading understanding method of claim 1, wherein the unified conversion of judgment questions, selection questions and matching questions in actual reading-comprehension scenarios into single-choice judgment questions comprises:
splitting the candidates and combining each candidate with the question one by one to form a premise statement;
and combining the reference text with the newly generated candidates, thereby converting the different types of questions into judgment questions that predict the probability of each new candidate holding under the support of the reference text.
3. The reorganization confrontation-based unified machine reading understanding method of claim 1, wherein the mathematical form of the judgment questions is a <P, Q, A> triplet,
wherein P represents the reference text, in the form of a sentence, a paragraph, or a whole article; Q denotes a question in natural language; A = {a_1, a_2, ..., a_n} denotes the set of candidates corresponding to a particular question, and n denotes the number of candidates, among which there is one and only one correct answer.
4. The reorganization confrontation-based unified machine reading understanding method of claim 3, wherein, for any input question, after splitting the candidates and concatenating each of them to the question, new candidates equal in number to the original candidates are formed, specifically represented as:
C = {c_1, c_2, ..., c_n} = {Q+a_1, Q+a_2, ..., Q+a_n}
wherein C represents the recombined candidate set;
the reference text P is used as background knowledge and is combined with the candidates for the machine to judge whether the factual statement in each candidate is true, the specific combined form being:
P, c_1, c_2, ..., c_n
the above form concatenates all the candidates, so that the model considers the information contained in the reference text P while attending to key information in other candidates, in order to discover the associations between different candidates.
5. The reorganization countermeasure-based unified machine reading understanding method of claim 1, wherein the joint encoding of the reorganization-derived combined candidates and the reference text comprises:
the method comprises the following steps of performing word segmentation and word list mapping on a text by using a pre-training language model RoBERTA as a word segmentation device to obtain a text and candidate item code, wherein the specific input form is as follows:
[CLS]P[SEP]c 1 [SEP]c 2 [SEP]...[SEP]c n [SEP]
wherein [ CLS ] represents a special character for marking the initial position of an input text in RoBERTA, [ SEP ] represents interval characters between different texts in RoBERTA, and n is the number of texts;
obtaining a word vector representation of the reference text and the candidate:
E=(e 1 ,e 2 ,...,e m )∈R m×d
where E denotes the word vector encoding of the input text sequence, E m A word vector representing the mth character in the sequence, R representing the real space, m representing the text length, and d representing the dimension size.
6. The reorganization confrontation-based unified machine reading understanding method according to claim 5, wherein, for data whose length exceeds 512 after the text and the candidates are concatenated, Longformer's sliding-window mechanism, dilated sliding-window mechanism, and sliding-window mechanism fused with global information are adopted to expand the input limit to 4096;
or BigBird's random attention, window attention and global attention are used to reduce the complexity of BERT to linear, likewise expanding the input limit to 4096.
7. The reorganization countermeasure-based unified machine reading understanding method of claim 1, wherein the merging of the reference-text information related to the recombined candidate vector representation into the recombined candidate vector representation comprises:
designing an attention mechanism between the reference text and the candidates:
a_j = tanh(E_p · W_pc · E_cj^T + b_pc)
s = softmax(a_j)
ĉ_j = s^T · E_p
wherein tanh represents a nonlinear activation function realizing a nonlinear mapping, W_pc represents a learnable weight matrix, E_p is the reference-text vector representation, E_cj is the vector representation of candidate c_j, b_pc represents a bias term, j is the index of a candidate, softmax normalizes the values in the matrix to [0,1], and ĉ_j is the resulting attention vector representation of candidate c_j over the reference text;
mean pooling is used to obtain the mean vector representation of each candidate, namely:
C_j = sum(E_cj) / |c_j|
where |·| represents the number of specific objects and sum represents a linear summation; the candidate mean vector representation C_j ∈ R^d is used for information fusion and the final probability prediction;
after obtaining the attention vector representation, information fusion is realized with a highway network mechanism, as follows:
r_j = relu(W_r[C_j; ĉ_j])
g = sigmoid(W_g[C_j; ĉ_j])
O_j = g ⊙ r_j + (1 − g) ⊙ P̄
wherein W_r and W_g represent learnable weight matrices, [;] represents concatenation of the input vectors, ⊙ represents element-wise multiplication of corresponding positions of two matrices, relu represents the nonlinear activation function, O_j represents the linear interpolation of the intermediate vector r_j and the reference-text average vector, g represents the gating threshold controlling the proportion of each part in the linear interpolation, and P̄ is the average vector representation of the reference text;
the final candidate representation M_j ∈ R^{2d} after the fusion layer is obtained by linear splicing:
M_j = [C_j; O_j]
8. The reorganization confrontation-based unified machine reading understanding method of claim 1, wherein regarding each candidate vector as a single sample and probabilistically judging whether it holds comprises:
performing probability prediction in [0,1] on the candidate vectors with a binary classifier:
P = sigmoid(f_p(M))
wherein P represents the probability output for a candidate, sigmoid smooths the neural-network output into [0,1], and f_p represents a binary-classification fully connected prediction network with output dimension 1;
during training, probability prediction is carried out on each candidate item, loss values are calculated one by one, and the mathematical formula is as follows:
Figure RE-FDA0003745467890000044
wherein BCE is a binary cross entropy loss function, L p Representing the calculated loss value, L representing the set of true labels, k representing the kth candidate, N representing the set of all candidates in the current batch _ size, | · | representing the number of particular objects, p representing the prediction probability of the model, L representing the true labels of the exemplars.
9. The reorganization confrontation-based unified machine reading understanding method of claim 8, wherein an adversarial learning mechanism is introduced: random perturbations are added to the candidate vector representations, forcing the model to focus on the regularly occurring key semantic distribution and thereby realizing differentiated representations of the candidate vectors.
10. The method of claim 9, wherein, in the adversarial training, the input is denoted by E and the model parameters by θ, and the adversarial training adds the following adversarial term to the loss function of the binary classifier:
L_adv = L(f_θ(E + r_adv), y), with r_adv = argmin_{r, ||r|| ≤ ε} L(f_θ(E + r), y)
wherein r_adv represents the finally injected perturbation, r represents a random perturbation, ||·|| represents the two-norm, ε represents a hyper-parameter, and L represents the loss function;
the above formula is realized in the neural network with the following approximation, the specific process being:
r_adv = −ε · g / ||g||
g = ∇_E L(f_θ(E), y)
wherein g represents the gradient of the loss function L with respect to the input word-vector representation E, ∇_E represents the gradient operation, f represents the model operation, and y represents the sample label.
CN202210407939.4A 2022-04-19 2022-04-19 Unified machine reading understanding method based on reorganization confrontation Pending CN114912419A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210407939.4A CN114912419A (en) 2022-04-19 2022-04-19 Unified machine reading understanding method based on reorganization confrontation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210407939.4A CN114912419A (en) 2022-04-19 2022-04-19 Unified machine reading understanding method based on reorganization confrontation

Publications (1)

Publication Number Publication Date
CN114912419A true CN114912419A (en) 2022-08-16

Family

ID=82765223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210407939.4A Pending CN114912419A (en) 2022-04-19 2022-04-19 Unified machine reading understanding method based on reorganization confrontation

Country Status (1)

Country Link
CN (1) CN114912419A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329088A (en) * 2022-10-12 2022-11-11 中国人民解放军国防科技大学 Robustness analysis method of graph neural network event detection model
CN117151084A (en) * 2023-10-31 2023-12-01 山东齐鲁壹点传媒有限公司 Chinese spelling and grammar error correction method, storage medium and equipment
CN117151084B (en) * 2023-10-31 2024-02-23 山东齐鲁壹点传媒有限公司 Chinese spelling and grammar error correction method, storage medium and equipment

Similar Documents

Publication Publication Date Title
CN109344391B (en) Multi-feature fusion Chinese news text abstract generation method based on neural network
CN111738004A (en) Training method of named entity recognition model and named entity recognition method
CN110390103A (en) Short text auto-abstracting method and system based on Dual-encoder
CN110633730A (en) Deep learning machine reading understanding training method based on course learning
CN111046679B (en) Quality information acquisition method and device of translation model and computer equipment
CN114912419A (en) Unified machine reading understanding method based on reorganization confrontation
CN111209384A (en) Question and answer data processing method and device based on artificial intelligence and electronic equipment
CN110457585B (en) Negative text pushing method, device and system and computer equipment
CN113626589B (en) Multi-label text classification method based on mixed attention mechanism
CN112270196A (en) Entity relationship identification method and device and electronic equipment
CN111581966A (en) Context feature fusion aspect level emotion classification method and device
CN110795944A (en) Recommended content processing method and device, and emotion attribute determining method and device
CN115510814B (en) Chapter-level complex problem generation method based on dual planning
Majumder et al. Knowledge-grounded self-rationalization via extractive and natural language explanations
CN113569001A (en) Text processing method and device, computer equipment and computer readable storage medium
Zhang et al. A Contrastive learning-based Task Adaptation model for few-shot intent recognition
CN108763211A (en) The automaticabstracting and system of knowledge are contained in fusion
CN114492460B (en) Event causal relationship extraction method based on derivative prompt learning
CN117494051A (en) Classification processing method, model training method and related device
Li et al. Knowledge-enriched attention network with group-wise semantic for visual storytelling
CN116910185B (en) Model training method, device, electronic equipment and readable storage medium
Zheng et al. Optimizing the online learners’ verbal intention classification efficiency based on the multi-head attention mechanism algorithm
CN113051904A (en) Link prediction method for small-scale knowledge graph
CN114238649A (en) Common sense concept enhanced language model pre-training method
Shaw et al. Investigations in psychological stress detection from social media text using deep architectures

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination