CN112579739A - Reading comprehension method based on ELMo embedding and gated self-attention mechanism - Google Patents
- Publication number
- CN112579739A (application CN202011542671.2A)
- Authority
- CN
- China
- Prior art keywords
- representation
- word
- attention
- layer
- question
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/3344—Query execution using natural language analysis
- G06F16/316—Indexing structures
- G06F16/3346—Query execution using probabilistic model
- G06F40/216—Parsing using statistical methods
- G06F40/253—Grammatical analysis; Style critique
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30—Semantic analysis
- G06N3/045—Combinations of networks
- G06N3/047—Probabilistic or stochastic networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Learning methods
Abstract
The invention discloses a reading comprehension method based on ELMo embeddings and a gated self-attention mechanism, built on a model that combines ELMo embeddings with a gated self-attention function. In addition, the method reuses the feature representations of all layers at the answer layer and predicts the final answer positions with a bilinear function, further improving overall system performance. In experiments on the SQuAD dataset, the model substantially outperforms several baseline models, improving performance by about 5 percentage points over the original baseline and approaching the average level of human performance, which fully demonstrates the effectiveness of the method.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a reading comprehension method based on ELMo embeddings and a gated self-attention mechanism.
Background
Machine reading comprehension has long been an important component of artificial intelligence and is a research hotspot in the field of natural language processing. A great deal of human knowledge is transmitted in the form of unstructured natural language text, so enabling machines to read and understand such text is of great significance and has direct application value for search engines, intelligent customer service and the like. Machine reading comprehension has received widespread attention in natural language processing in recent years, owing in part to the development and application of attention mechanisms, which allow a model to focus on the parts of the context most relevant to a given question. The Stanford SQuAD dataset requires answering questions about a given article, where the answer can be any span in the context. To answer such questions, the complex interactions between the question and the context must be encoded, and a segment is then extracted from the original text as the answer according to the fused interaction information; concretely, the model outputs the start index and end index of the predicted answer within the article.
With the continued development of neural networks in recent years, LSTMs have been widely applied to machine reading comprehension and, combined with attention mechanisms, have achieved good performance. However, several classical baseline models still leave room for improvement in accuracy: they do not address the long-range dependency problem of long texts, i.e. the associated information of a long context cannot be captured well, and they ignore the ambiguity of words in different contexts.
Disclosure of Invention
The invention aims to remedy the deficiencies of the prior art by providing a reading comprehension method based on ELMo embeddings and a gated self-attention mechanism. ELMo word embeddings are introduced to obtain more accurate contextual word representations, and a self-attention layer with a gating function is added to alleviate problems that require further reasoning over long contexts. In addition, the answer layer adopts a feature reuse method and uses a bilinear function to calculate the final index positions, further improving system performance. Experiments on the SQuAD dataset show that the model substantially outperforms most baseline models and approaches the average level of human performance, fully demonstrating its effectiveness.
The invention is realized by the following technical scheme:
A reading comprehension method based on an ELMo embedding and gated self-attention mechanism, comprising the following steps:
S1, performing word segmentation and preprocessing on the article and the question respectively, and building a GloVe word vocabulary and a character list from the words appearing in the segmented article and question;
S2, feeding each word into a pre-trained ELMo encoder to obtain an ELMo embedded representation containing context information;
S3, mapping each word to its corresponding word vector in the GloVe vocabulary to obtain the word-level representation of the word;
S4, looking up the representation of each letter of the word in the character table, feeding the character vectors into a convolutional neural network, and max-pooling the convolutional layer output to obtain a fixed-length character embedded representation for each word;
S5, directly concatenating the vector representations obtained in steps S2, S3 and S4, and passing the vectors through a Highway network to obtain preliminary vector representations of the article and the question;
S6, fusing context information into the question and article vector representations from step S5 with a parameter-sharing BiLSTM, so that the representation of each word is adjusted according to its context;
S7, matching the text and the question with a bidirectional attention layer applied to the representations from step S6, obtaining article word representations in which the article and the question are mutually aware;
S8, further fusing and reasoning over the representations obtained in step S7 through a bidirectional two-layer LSTM modeling layer, obtaining modeling representations of the article and the question respectively;
S9, performing long-context association matching on the text representation obtained in step S8 through a gated self-attention layer, obtaining a self-attention representation of each word;
S10, combining the representations obtained in steps S7, S8 and S9 at the output layer and using a bilinear function to infer the start index and end index of the final answer, the answer being the phrase between the two indices.
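The character-level encoding of step S4 (a 1-D convolution over a word's character vectors followed by max-pooling over positions) can be sketched as below. This is a shape-level illustration with random filters, not the trained model; the function name and padding choice are illustrative:

```python
import numpy as np

def char_cnn_embed(char_vecs, filters, width=5):
    """1-D convolution over one word's character vectors, then max-pool.

    char_vecs: (word_len, char_dim) character embeddings for one word.
    filters:   (num_filters, width, char_dim) convolution kernels.
    Returns a fixed-length (num_filters,) vector regardless of word length.
    """
    word_len, char_dim = char_vecs.shape
    num_filters = filters.shape[0]
    # pad short words so at least one convolution window exists
    if word_len < width:
        pad = np.zeros((width - word_len, char_dim))
        char_vecs = np.vstack([char_vecs, pad])
        word_len = width
    n_windows = word_len - width + 1
    conv = np.empty((n_windows, num_filters))
    for i in range(n_windows):
        window = char_vecs[i:i + width]               # (width, char_dim)
        conv[i] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return conv.max(axis=0)                           # max-pool over positions
```

The output length equals the number of filters, which is why every word receives a fixed-length character embedding regardless of its spelling length.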
The step S1 is described in detail as follows:
First, a word list and a character list are built from the words appearing in the segmented articles and questions; subsequent steps obtain the corresponding index of each word and character from these two lists and then retrieve the corresponding embedded representations from the indices. Second, each question-answer pair of the dataset is taken as one sample, and the samples are divided into batches of a specified size as model input.
The step S2 is described in detail as follows:
The ELMo embedding is derived from a pre-trained two-layer bidirectional LSTM that is trained on a large corpus with a bidirectional language-model objective and is easily integrated into existing models. ELMo uses a multi-layer LSTM: the upper-layer LSTM states capture contextual semantic information, while the lower-layer states capture syntactic information. The final ELMo representation is a linear combination of the LSTM states of all layers. The resulting ELMo embedding, character embedding and GloVe word embedding are concatenated as model input and fine-tuned with the model to improve performance, meaning that the ELMo embedding is updated during training. ELMo thus allows the vector representation of a word to reflect both context and syntax, addressing polysemy.
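The linear combination of biLM layer states described above can be sketched as follows; the softmax-normalized scalars and the global scale γ are the task-specific parameters that are fine-tuned with the model (the function and argument names here are illustrative, not from the patent):

```python
import numpy as np

def elmo_combine(layer_states, s_logits, gamma=1.0):
    """Task-specific linear combination of biLM layer states.

    layer_states: (L, seq_len, dim) hidden states of the L biLM layers.
    s_logits:     (L,) learned scalars, softmax-normalized into weights.
    gamma:        learned global scale of the combined representation.
    """
    s = np.exp(s_logits - s_logits.max())   # stable softmax over layers
    s = s / s.sum()
    # weighted sum over the layer axis -> (seq_len, dim)
    return gamma * np.tensordot(s, layer_states, axes=1)
```

With all logits equal, every layer contributes equally, which is the usual initialization before fine-tuning.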
The detailed process of step S5 is as follows:
The ELMo embedded representation, the word-level representation and the character embedded representation are concatenated as the input of a two-layer Highway network to obtain a d-dimensional vector for each word, where the Highway network formula is:
y = F(x, W_H) · G(x, W_G) + x · (1 − G(x, W_G))
where F denotes a feed-forward neural network and G denotes a gate over the input;
this yields a context vector matrix X ∈ R^(d×T) and a question vector matrix Q ∈ R^(d×J), where T is the number of article words, J is the number of question words, and d is the number of one-dimensional convolution filters. The matrices X and Q are then each fed into a BiLSTM with d-dimensional output to summarize the article and the question from both directions, yielding two matrices H ∈ R^(2d×T) and U ∈ R^(2d×J).
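The Highway formula y = F(x, W_H) · G(x, W_G) + x · (1 − G(x, W_G)) can be sketched as one NumPy layer; the ReLU choice for the transform branch F is an assumption, since the text only states that F is a feed-forward network:

```python
import numpy as np

def highway(x, W_H, b_H, W_G, b_G):
    """One highway layer: gate G mixes the transform F with the identity path.

    x: (batch, d) input; all weight matrices are (d, d), biases (d,).
    """
    F = np.maximum(0.0, x @ W_H + b_H)               # transform branch (ReLU assumed)
    G = 1.0 / (1.0 + np.exp(-(x @ W_G + b_G)))       # gate in (0, 1)
    return F * G + x * (1.0 - G)
```

When the gate saturates at 0 the layer passes its input through unchanged, which is what makes deep highway stacks easy to train.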
The step S7 bidirectional attention matching mechanism is as follows:
This layer uses an attention mechanism to match the article and question vectors in both directions and, from the inputs H and U, generates a context representation matrix G in which every article word is fused with question information. The computation is as follows:
The attention score between matrix H and matrix U is computed as
S_tj = w^T [H_:t ; U_:j ; H_:t ∘ U_:j]
where S ∈ R^(T×J), w ∈ R^(6d) is a trainable weight vector, [;] denotes vector concatenation, ∘ denotes element-wise multiplication, and H_:t and U_:j are the t-th and j-th columns of H and U.
From the resulting attention score matrix, attention matrices in both directions are obtained:
First, the article-to-question attention is computed as a_t = softmax(S_t:) ∈ R^J, where a_t represents the relevance of the t-th context word to each question word; the question representation corresponding to that word is then the weighted sum of all question word representations, Ũ_:t = Σ_j a_tj · U_:j.
The question-to-article attention is computed as b = softmax_t(max_j S_tj) ∈ R^T, giving a weighted sum vector of the article words most relevant to the question, h̃ = Σ_t b_t · H_:t; this vector is then tiled T times by columns to obtain H̃ ∈ R^(2d×T).
Finally, the context representation fused with question information is obtained by
G_:t = [H_:t ; Ũ_:t ; H_:t ∘ Ũ_:t ; H_:t ∘ H̃_:t] ∈ R^(8d).
The gated self-attention in the step S9 is described in detail as follows:
This layer is introduced because some questions involve longer contexts and require more complex reasoning. To alleviate these problems, the context representation of each word obtained from the modeling layer is matched directly against all other context word representations: from the text matrix representation obtained by the S8 modeling layer, the attention score between each word representation M_t and every other word representation M_j is first calculated; after the scores are normalized into weights α_tj, the final weighted sum representation of each word is computed as P_t = Σ_j α_tj · M_j.
In addition, a gate function is used to reduce attention to less relevant information, resulting in the final representation P*:
g=sigmoid(Wg[P;M])
P*=g⊙[P;M]
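The steps above (scoring each word against all others, Softmax normalization, then the gate g = sigmoid(W_g[P; M]) and P* = g ⊙ [P; M]) can be sketched as follows. The single bilinear scoring map W_a is a simplification assumed here; the patent specifies only a parameter matrix followed by an activation and Softmax:

```python
import numpy as np

def gated_self_attention(M, W_a, W_g):
    """Match every context word against all others, then gate the result.

    M:   (T, d) modeling-layer representations.
    W_a: (d, d) attention scoring weights (assumed bilinear form).
    W_g: (2d, 2d) gate weights.
    Returns P*: (T, 2d) gated concatenation of [P; M].
    """
    scores = M @ W_a @ M.T                           # (T, T) pairwise scores
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)        # Softmax per row
    P = alpha @ M                                    # weighted sum per word
    PM = np.concatenate([P, M], axis=1)              # [P; M]
    g = 1.0 / (1.0 + np.exp(-(PM @ W_g)))            # g = sigmoid(W_g [P; M])
    return g * PM                                    # P* = g (.) [P; M]
```

Because the gate is applied element-wise, dimensions of [P; M] carrying little relevant information are scaled toward zero rather than discarded outright.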
The answer layer in the step S10 is described in detail as follows:
A feature reuse method is adopted: the bidirectional attention layer representation G, the modeling layer representation M and the self-attention layer representation P* are used simultaneously to obtain the probability of each word being the start or end position of the answer, and the probability distribution of the answer start position s is calculated by a bilinear function over the reused features.
Then, a weighted sum representation of the words is calculated according to the start-position probabilities and fused through a BiLSTM to obtain a representation containing start-position information; finally, based on this representation, the probability distribution of the end position is obtained with the same bilinear form as the previous layer.
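A hedged sketch of a bilinear answer layer follows. The question summary vector q and the exact bilinear form R W q are assumptions, since the text names a bilinear function over the reused features without spelling it out, and the BiLSTM fusion between start and end prediction is omitted for brevity:

```python
import numpy as np

def answer_span(R, q, W_s, W_e):
    """Start/end distributions from fused per-word features via bilinear scoring.

    R: (T, k) per-word fused features ([G; M; P*] in the text).
    q: (k,) question summary vector (an assumption of this sketch).
    W_s, W_e: (k, k) bilinear weights for start and end scoring.
    """
    def softmax(z):
        z = np.exp(z - z.max())
        return z / z.sum()
    p_start = softmax(R @ W_s @ q)   # score_t = R_t^T W_s q
    p_end = softmax(R @ W_e @ q)
    return p_start, p_end
```

At inference time the predicted span is the (start, end) pair with start ≤ end maximizing p_start[start] * p_end[end].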
The invention has the following advantages: the introduced self-attention layer with a gating function can further match and fuse information over long texts and filter out unimportant information to a certain degree, thereby alleviating the problems involving longer contexts and improving the accuracy of the model;
by incorporating ELMo embeddings, the invention obtains more accurate word embedded representations through an encoder pre-trained on a large corpus; these representations contain richer context information, effectively handling polysemous words and other context-dependent cases and improving model performance.
Drawings
FIG. 1 is a basic flow diagram of the present invention.
FIG. 2 is a diagram of a neural network model according to the present invention.
Detailed Description
As shown in fig. 1, a reading comprehension method based on the ELMo embedding and gated self-attention mechanism includes the following steps:
S1, performing word segmentation and preprocessing on the article and the question respectively, and building a GloVe word vocabulary and a character list from the words appearing in the segmented article and question;
S2, feeding each word into a pre-trained ELMo encoder to obtain an ELMo embedded representation containing context information;
S3, mapping each word to its corresponding word vector in the GloVe vocabulary to obtain the word-level representation of the word;
S4, looking up the representation of each letter of the word in the character table, feeding the character vectors into a convolutional neural network, and max-pooling the convolutional layer output to obtain a fixed-length character embedded representation for each word;
S5, directly concatenating the vector representations obtained in steps S2, S3 and S4, and passing the vectors through a Highway network to obtain preliminary vector representations of the article and the question;
S6, fusing context information into the question and article vector representations from step S5 with a parameter-sharing BiLSTM, so that the representation of each word is adjusted according to its context;
S7, matching the text and the question with a bidirectional attention layer applied to the representations from step S6, obtaining article word representations in which the article and the question are mutually aware;
S8: further fusing and reasoning over the representations obtained in step S7 through a bidirectional two-layer LSTM modeling layer, obtaining modeling representations of the article and the question respectively;
S9: performing long-context association matching on the text representation obtained in step S8 through a gated self-attention layer, obtaining a self-attention representation of each word;
S10: combining the representations obtained in steps S7, S8 and S9 at the output layer and using a bilinear function to infer the start index and end index of the final answer, i.e. the answer is the phrase between the two indices.
The specific implementation process of the invention is as follows:
1. Select an appropriate dataset.
This section uses the Stanford Question Answering Dataset (SQuAD), created manually through crowdsourcing. SQuAD is a span-prediction reading comprehension dataset: given an article and a question, the machine must find the answer span in the article and predict its start and end positions. The length of the span is generally not limited. The dataset is constructed from 536 articles randomly selected from English Wikipedia and contains 107,785 question-answer pairs. Typically, articles vary from 50 to 250 words, and questions contain about 10 words. This dataset is one of the largest MRC datasets to date.
2. Select model performance evaluation indices.
Two indices are used to evaluate the model: the F1 score and the Exact Match (EM) score. Both are obtained by comparing the model's predicted answer with the candidate answers using the official script, i.e. each of the three candidate answers is compared with the predicted answer and the highest score is selected. EM is the proportion of predictions that match a candidate answer exactly, while the F1 score is defined as the average token overlap between the predicted answer and the candidate answer.
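The two metrics can be computed as below; this mirrors the official SQuAD script's token-overlap F1 and exact string match in simplified form (the full answer normalization, e.g. article and punctuation stripping, is omitted):

```python
def f1_score(pred, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    p, g = pred.lower().split(), gold.lower().split()
    gold_counts = {}
    for tok in g:
        gold_counts[tok] = gold_counts.get(tok, 0) + 1
    common = 0
    for tok in p:                        # count overlapping tokens with multiplicity
        if gold_counts.get(tok, 0) > 0:
            common += 1
            gold_counts[tok] -= 1
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def exact_match(pred, gold):
    """1.0 iff the (case-insensitive) answer strings are identical."""
    return float(pred.lower().strip() == gold.lower().strip())
```

In evaluation each prediction is scored against all three candidate answers and the maximum per-metric score is kept.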
3. Construct the model according to the prior art scheme and experience.
As shown in fig. 2, the core of the invention comprises the following hierarchical structure: (1) the ELMo embedding layer, which uses a pre-trained ELMo language model to obtain an embedded representation of each word containing context information; (2) the self-attention layer with a gating function, which matches each article word representation against all other words of the article and filters unimportant information through the gate; (3) the bilinear-function answer layer based on feature reuse, which predicts the start and end positions of the answer.
Items (1), (2) and (3) are further described below:
(1) The ELMo embedding is obtained from a pre-trained two-layer bidirectional LSTM; the model is trained on a large corpus with a bidirectional language-model objective and can be easily integrated into existing models. ELMo uses a multi-layer LSTM: the upper-layer LSTM states capture contextual semantic information, while the lower-layer states capture syntactic information. The final ELMo representation is a linear combination of the LSTM states of all layers. The resulting ELMo embedding, character embedding and GloVe word embedding are concatenated as model input.
(2) The specific steps are as follows: first, from the resulting modeled text matrix representation, the attention score between each word representation M_t and every other word representation M_j is calculated: the representations are multiplied by a parameter matrix and passed through an activation function, the scores are normalized with the Softmax function into weights α_tj, and the final weighted sum representation of each word is computed as P_t = Σ_j α_tj · M_j.
In addition, a gate function is used to reduce attention to less relevant information, yielding the final representation P*: P and M are concatenated and multiplied by a parameter matrix to obtain the gate value:
g=sigmoid(Wg[P;M])
P*=g⊙[P;M]
(3) Specifically, the detailed steps of the answer layer are as follows:
This layer uses a feature reuse method: the bidirectional attention layer representation G, the modeling layer representation M and the self-attention layer representation P* are used simultaneously to obtain the probability of each word being the start or end position of the answer, and the probability distribution of the answer start position s is calculated by a bilinear function over the reused features.
Because the start position is strongly correlated with the end position, the words are weighted by the start-position probabilities, the weighted sum is fused through a BiLSTM to obtain a new representation containing start-position information, and the end position e is inferred from it:
The loss function finally adopted by this part is the negative log maximum likelihood of the start and end positions, L = −(log p^s(y^s) + log p^e(y^e)), where y^s and y^e are the true start and end indices.
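The negative log-likelihood of the gold span can be sketched as:

```python
import numpy as np

def span_nll(p_start, p_end, y_s, y_e):
    """Negative log-likelihood of the true start/end indices (y_s, y_e).

    p_start, p_end: (T,) predicted probability distributions over positions.
    """
    return -(np.log(p_start[y_s]) + np.log(p_end[y_e]))
```

In training this quantity is averaged over the examples of a batch before the optimizer step.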
4. Select the experimental environment and set the parameters.
The experiments run on a GeForce GTX Titan 12G GPU in a software environment of Ubuntu 18.04 with Python 3.5, TensorFlow-GPU 1.1, CUDA 8.0, cuDNN 5, etc. The experimental settings are as follows: the character embedding layer employs 100 filters of width 5. Word embedding uses pre-trained 300-dimensional word vectors (the 840B version). Dropout with a drop rate of 0.2 is applied to all CNN and LSTM layers and all feed-forward layers. The hidden state size d is 100, and the number of parameters is about 4 million. Model parameters are optimized with the Adamax optimizer with a batch size of 8; training the model for 12 epochs on a graphics card with 12 GB of memory takes about 2 days. The ELMo vectors produced by the language model trained on the Benchmark corpus are set trainable, with the other parameters left at their default values.
Finally, it should be emphasized that the above implementation example merely illustrates a specific procedure of the present invention and is not to be considered limiting. Although the flow is illustrated in detail by way of example, those skilled in the art will understand that modifications and substitutions can be made without departing from the technical core of the present invention, and other embodiments obtained on the basis of the present invention without inventive effort shall fall within its scope.
Claims (6)
1. A reading comprehension method based on ELMo embedding and a gated self-attention mechanism, characterized in that the method specifically comprises the following steps:
S1: performing word segmentation and preprocessing on the article and the question respectively, and building a GloVe vocabulary and a character table from the words appearing in the segmented article and question;
S2: inputting each word into a pre-trained ELMo encoder to obtain its ELMo embedded representation containing context information;
S3: mapping each word to its corresponding word vector in the GloVe vocabulary to obtain the word-level representation of the word;
S4: looking up the corresponding representation in the character table for each letter of each word, feeding the character vectors into a convolutional neural network, and max-pooling the convolutional layer output to obtain a fixed-length character embedded representation of each word;
S5: directly concatenating the representations obtained in steps S2, S3 and S4, and preliminarily processing the vectors with a highway network to obtain preliminary vector representations of the article and the question;
S6: fusing context information into the question and article vector representations from step S5 using a BiLSTM with shared parameters, thereby adjusting the representation of each word according to its context;
S7: matching the text and the question with a bidirectional attention layer on the representations obtained in step S6, to obtain article word representations in which the article and the question are mutually aware;
S8: further fusing and reasoning over the representation obtained in step S7 through a bidirectional two-layer LSTM modeling layer, to obtain modeled representations of the article and the question respectively;
S9: performing long-range context matching on the text representation obtained in step S8 through a gated self-attention layer, to obtain the self-attention representation of each word;
S10: in the output layer, combining the representations obtained in steps S7, S8 and S9 and using bilinear functions to infer the start index and end index of the final answer, i.e. the answer is the phrase between the two indices.
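The character-level encoding of step S4 can be sketched as a 1-D convolution over a word's character vectors followed by max-pooling over positions; kernel initialization and all names below are illustrative assumptions:

```python
import numpy as np

def char_embedding(char_vecs, filters, width=5):
    """Fixed-length character embedding of one word (a sketch of step S4).

    char_vecs: (L, e) character vectors of a word of length L.
    filters:   (n_filters, width, e) convolution kernels.
    Returns a (n_filters,) vector independent of the word length L.
    """
    L, e = char_vecs.shape
    n_filters = filters.shape[0]
    # pad short words so at least one convolution window exists
    pad = max(0, width - L)
    x = np.pad(char_vecs, ((0, pad), (0, 0)))
    windows = x.shape[0] - width + 1
    conv = np.empty((windows, n_filters))
    for t in range(windows):
        # each filter responds to one window of `width` characters
        conv[t] = np.tensordot(filters, x[t:t + width], axes=([1, 2], [0, 1]))
    return conv.max(axis=0)  # max-pool over positions -> fixed length
```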
2. The reading comprehension method based on ELMo embedding and a gated self-attention mechanism of claim 1, wherein the ELMo embedding in step S2 is specifically as follows:
ELMo embedding is obtained by pre-training a two-layer bidirectional LSTM: with a bidirectional language model as the objective, the two-layer bidirectional LSTM is trained on a large corpus and integrated into the model. The ELMo encoder uses multiple LSTM layers; contextual semantic information is extracted from the higher-layer LSTM states, syntactic information is extracted from the lower-layer LSTM states, and the final ELMo embedded representation is a linear combination of the LSTM states of each layer.
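The layer-wise linear combination can be sketched as follows, assuming softmax-normalized layer weights and a learned scalar γ as in the original ELMo formulation (names are illustrative):

```python
import numpy as np

def elmo_combine(layer_states, s_weights, gamma=1.0):
    """ELMo embedding as a learned linear combination of biLM layer states.

    layer_states: (num_layers, T, d) states for T tokens
                  (layer 0 = token embedding, higher layers = LSTM outputs).
    s_weights:    (num_layers,) unnormalized layer weights (learned).
    gamma:        learned scalar applied to the whole combination.
    """
    s = np.exp(s_weights - s_weights.max())
    s = s / s.sum()  # softmax-normalized layer weights
    # weighted sum over the layer axis -> (T, d)
    return gamma * np.tensordot(s, layer_states, axes=1)
```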
3. The reading comprehension method based on ELMo embedding and a gated self-attention mechanism of claim 1, wherein the specific process of step S5 is as follows:
concatenating the ELMo embedded representation, the word-level representation and the character embedded representation as the input of a two-layer highway network to obtain a d-dimensional vector for each word, where the highway network formula is:
y=F(x,WH)·G(x,WG)+x·(1-G(x,WG))
where F represents a feed-forward neural network and G represents a gate on the input;
thereby obtaining a context vector matrix X ∈ R^{d×T} and a question vector matrix Q ∈ R^{d×J}, where T is the number of article words, J is the number of question words, and d is the dimension, equal to the number of convolution filters; the matrices X and Q are then input into an LSTM with d-dimensional output to summarize the article and question from both directions, resulting in two matrices H and U:
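A single highway layer of the form y = F(x, W_H)·G(x, W_G) + x·(1 − G(x, W_G)) can be sketched as below; the tanh transform and parameter shapes are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_G, b_G):
    """One highway layer: y = F(x)*G(x) + x*(1 - G(x)).

    x:        (d,) input vector for one word.
    W_H, b_H: parameters of the forward network F (tanh assumed).
    W_G, b_G: parameters of the gate G.
    """
    F = np.tanh(W_H @ x + b_H)   # transformed input
    G = sigmoid(W_G @ x + b_G)   # gate: blend transform vs. carry-through
    return F * G + x * (1.0 - G)
```

When the gate saturates toward 0, the layer carries the input through unchanged, which eases training of the stacked embedding layers.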
4. The reading comprehension method based on ELMo embedding and a gated self-attention mechanism of claim 3, wherein the matching of the text and the question using the bidirectional attention layer in step S7 is specifically as follows:
matching the article and question vectors in two directions using an attention mechanism: for each word in the article, a context representation matrix G fused with question information is generated from the input matrices H and U. The attention score matrix A between the matrices H and U is first calculated, and the attention matrices in the two directions are then obtained from A:
First, the attention matrix in the article-to-question direction is calculated, where the t-th row represents the relevance vector between the t-th context word and the question words; the question representation corresponding to each word is then obtained as the weighted sum of all question word representations.
The attention in the question-to-article direction is calculated by obtaining the weighted-sum vector representation of the article words most relevant to the question, and then tiling this vector T times column-wise.
Finally, the context representation G fused with the question information is obtained by combining H with the attended representations from the two directions.
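A sketch of the two-direction attention in the style of BiDAF, on which the cited work is based; the trilinear score function and all names are assumptions, since the patent's exact formulas are not reproduced here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidaf_attention(H, U, w):
    """Bidirectional attention (BiDAF-style sketch).

    H: (T, 2d) article representations.
    U: (J, 2d) question representations.
    w: (6d,) weight vector of an assumed trilinear similarity score.
    Returns G: (T, 8d) question-aware article representation.
    """
    T, J = H.shape[0], U.shape[0]
    # similarity matrix A[t, j] = w^T [H_t; U_j; H_t * U_j]
    A = np.empty((T, J))
    for t in range(T):
        for j in range(J):
            A[t, j] = w @ np.concatenate([H[t], U[j], H[t] * U[j]])
    # article-to-question: attended question vector per article word
    a = softmax(A, axis=1)          # (T, J)
    U_hat = a @ U                   # (T, 2d)
    # question-to-article: one attended article vector, tiled T times
    b = softmax(A.max(axis=1))      # (T,)
    H_hat = np.tile(b @ H, (T, 1))  # (T, 2d)
    return np.concatenate([H, U_hat, H * U_hat, H * H_hat], axis=1)
```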
5. The reading comprehension method based on ELMo embedding and a gated self-attention mechanism of claim 4, wherein the gated self-attention in step S9 is specifically as follows:
First, using the text matrix representation obtained from the modeling of step S8, a score between each word representation M_t and every other word representation M_j is calculated; the scores are normalized, and the final weighted-sum representation of each word is computed:
Using a gate function, the final representation P* is obtained:
g=sigmoid(Wg[P;M])
P*=g⊙[P;M].
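The gated self-attention of claim 5 can be sketched as follows; the bilinear score form M_t^T W_s M_j is an assumption (the claim only states that pairwise scores are computed and normalized), while the gate follows the stated formulas g = sigmoid(W_g[P; M]) and P* = g ⊙ [P; M]:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_self_attention(M, W_s, W_g):
    """Gated self-attention over the modeled article representation M.

    M:   (T, d) modeling-layer representation of the article.
    W_s: (d, d) parameters of the assumed bilinear score.
    W_g: (2d, 2d) gate parameters.
    Returns P_star: (T, 2d) gated fused representation.
    """
    scores = M @ W_s @ M.T           # (T, T) pairwise word scores
    alpha = softmax(scores, axis=1)  # normalized attention weights
    P = alpha @ M                    # weighted-sum representation per word
    PM = np.concatenate([P, M], axis=1)  # [P; M], shape (T, 2d)
    g = sigmoid(PM @ W_g.T)              # gate score g = sigmoid(W_g[P; M])
    return g * PM                        # P* = g (elementwise) [P; M]
```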
6. The reading comprehension method based on ELMo embedding and a gated self-attention mechanism of claim 5, wherein the specific process of step S10 is as follows:
using feature reuse, i.e. simultaneously using the bidirectional attention layer representation G, the modeling layer representation M and the self-attention layer representation P*, the probability of each word being the start or end position of the answer is obtained; the probability distribution of the answer start position s is calculated by a bilinear function:
then, the weighted-sum word representation is calculated according to the start-position probabilities and fused through a BiLSTM to obtain a representation containing start-position information; finally, based on this representation, the probability distribution of the end position is obtained with the same form of formula as the previous layer.
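A simplified sketch of the output layer's feature reuse; the bilinear scoring and BiLSTM fusion are replaced here by plain linear scoring and start-probability weighting, so this illustrates the data flow rather than the patent's exact computation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def answer_span(G, M, P_star, w_s, w_e):
    """Start/end probabilities from the reused features (simplified sketch).

    G:      (T, g) bidirectional-attention representation.
    M:      (T, m) modeling-layer representation.
    P_star: (T, p) gated self-attention representation.
    w_s, w_e: (g+m+p,) scoring weights for start and end positions.
    """
    feats = np.concatenate([G, M, P_star], axis=1)
    p_start = softmax(feats @ w_s)
    # weight each word by its start probability before scoring the end,
    # standing in for the BiLSTM fusion of start-position information
    p_end = softmax((p_start[:, None] * feats) @ w_e)
    s, e = int(p_start.argmax()), int(p_end.argmax())
    return p_start, p_end, (s, e)  # answer = phrase between indices s and e
```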
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011542671.2A CN112579739A (en) | 2020-12-23 | 2020-12-23 | Reading understanding method based on ELMo embedding and gating self-attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112579739A true CN112579739A (en) | 2021-03-30 |
Family
ID=75139229
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011542671.2A Pending CN112579739A (en) | 2020-12-23 | 2020-12-23 | Reading understanding method based on ELMo embedding and gating self-attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112579739A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200175015A1 (en) * | 2018-11-29 | 2020-06-04 | Koninklijke Philips N.V. | Crf-based span prediction for fine machine learning comprehension |
CN110929030A (en) * | 2019-11-07 | 2020-03-27 | 电子科技大学 | Text abstract and emotion classification combined training method |
Non-Patent Citations (1)
Title |
---|
WEIWEI ZHANG等: ""ELMo+Gated Self-attention Network Based on BiDAF for Machine Reading Comprehension"", 《 2020 IEEE 11TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING AND SERVICE SCIENCE (ICSESS)》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113240098A (en) * | 2021-06-16 | 2021-08-10 | 湖北工业大学 | Fault prediction method and device based on hybrid gated neural network and storage medium |
CN114218365A (en) * | 2021-11-26 | 2022-03-22 | 华南理工大学 | Machine reading understanding method, system, computer and storage medium |
CN114218365B (en) * | 2021-11-26 | 2024-04-05 | 华南理工大学 | Machine reading and understanding method, system, computer and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112487182B (en) | Training method of text processing model, text processing method and device | |
CN108733792B (en) | Entity relation extraction method | |
CN113239181B (en) | Scientific and technological literature citation recommendation method based on deep learning | |
CN107798140B (en) | Dialog system construction method, semantic controlled response method and device | |
CN111881291A (en) | Text emotion classification method and system | |
CN111930942B (en) | Text classification method, language model training method, device and equipment | |
CN110647612A (en) | Visual conversation generation method based on double-visual attention network | |
CN108628935A (en) | A kind of answering method based on end-to-end memory network | |
CN113297364B (en) | Natural language understanding method and device in dialogue-oriented system | |
CN113435203A (en) | Multi-modal named entity recognition method and device and electronic equipment | |
CN111191002A (en) | Neural code searching method and device based on hierarchical embedding | |
CN108536735B (en) | Multi-mode vocabulary representation method and system based on multi-channel self-encoder | |
CN112232053A (en) | Text similarity calculation system, method and storage medium based on multi-keyword pair matching | |
Chen et al. | Deep neural networks for multi-class sentiment classification | |
CN110597968A (en) | Reply selection method and device | |
CN111666752A (en) | Circuit teaching material entity relation extraction method based on keyword attention mechanism | |
CN111145914B (en) | Method and device for determining text entity of lung cancer clinical disease seed bank | |
CN112182373A (en) | Context expression learning-based personalized search method | |
CN112579739A (en) | Reading understanding method based on ELMo embedding and gating self-attention mechanism | |
CN110889505A (en) | Cross-media comprehensive reasoning method and system for matching image-text sequences | |
CN115171870A (en) | Diagnosis guiding and prompting method and system based on m-BERT pre-training model | |
CN111428518A (en) | Low-frequency word translation method and device | |
Dandwate et al. | Comparative study of Transformer and LSTM Network with attention mechanism on Image Captioning | |
CN116414988A (en) | Graph convolution aspect emotion classification method and system based on dependency relation enhancement | |
Sun et al. | Rumour detection technology based on the BiGRU_capsule network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20210330 |