CN112579739A - Reading comprehension method based on ELMo embedding and gated self-attention mechanism - Google Patents
- Publication number
- CN112579739A (application CN202011542671.2A)
- Authority
- CN
- China
- Prior art keywords
- representation
- word
- attention
- layer
- question
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/3344—Query execution using natural language analysis
- G06F16/316—Indexing structures
- G06F16/3346—Query execution using probabilistic model
- G06F40/216—Parsing using statistical methods
- G06F40/253—Grammatical analysis; Style critique
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30—Semantic analysis
- G06N3/045—Combinations of networks
- G06N3/047—Probabilistic or stochastic networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Learning methods
Abstract
The invention discloses a reading comprehension method based on ELMo embeddings and a gated self-attention mechanism, built on a model that combines ELMo embeddings with a gated self-attention function. In addition, the method reuses the feature representations of all layers at the answer layer and predicts the final answer positions with a bilinear function, further improving overall system performance. In experiments on the SQuAD dataset, the model substantially outperforms several baseline models, improving performance by about 5 percentage points over the original baseline and approaching the average level of human performance, which fully demonstrates the effectiveness of the method.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a reading comprehension method based on ELMo embeddings and a gated self-attention mechanism.
Background
Machine reading comprehension has long been an important component of artificial intelligence and is a research hotspot in the field of natural language processing. A great deal of human knowledge is transmitted in the form of unstructured natural language text, so enabling machines to read and understand such text is of great significance and has direct application value for search engines, intelligent customer service and the like. Machine reading comprehension has received widespread attention in natural language processing in recent years, owing in part to the development and application of attention mechanisms, which allow a model to focus on the parts of the context most relevant to a given question. The Stanford SQuAD dataset requires answering questions about a given article, where the answer can be any span in the context. To answer such questions, the complex interactions between the question and the context must be encoded, and a segment is then extracted from the original text as the answer according to the fused interaction information; concretely, the model outputs the start index and end index of the predicted answer within the article.
With the continued development of neural networks in recent years, LSTMs have been widely applied to machine reading comprehension and, combined with attention mechanisms, have achieved good performance. However, several classical baseline models still leave room for improvement in accuracy: they do not address the long-range dependency problem of long texts, i.e. the associated information of a long context cannot be captured well, and they ignore the ambiguity of words in different contexts.
Disclosure of Invention
The invention aims to remedy the deficiencies of the prior art by providing a reading comprehension method based on ELMo embeddings and a gated self-attention mechanism. ELMo word embeddings are introduced to obtain more accurate contextual word representations, and a self-attention layer with a gating function is added to alleviate problems that require further reasoning over long contexts. In addition, the answer layer adopts a feature reuse method and uses a bilinear function to calculate the final index positions, further improving system performance. Experiments on the SQuAD dataset show that the model substantially outperforms most baseline models and approaches the average level of human performance, fully demonstrating its effectiveness.
The invention is realized by the following technical scheme:
A reading comprehension method based on an ELMo embedding and gated self-attention mechanism, comprising the following steps:
S1, performing word segmentation and preprocessing on the article and the question respectively, and building a GloVe word vocabulary and a character list from the words appearing in the segmented article and question;
S2, feeding each word into a pre-trained ELMo encoder to obtain an ELMo embedded representation containing context information;
S3, mapping each word to its corresponding word vector in the GloVe vocabulary to obtain the word-level representation of the word;
S4, looking up the representation of each letter of the word in the character table, feeding the character vectors into a convolutional neural network, and max-pooling the convolutional layer output to obtain a fixed-length character embedded representation for each word;
S5, directly concatenating the vector representations obtained in steps S2, S3 and S4, and passing the vectors through a Highway network to obtain preliminary vector representations of the article and the question;
S6, fusing context information into the question and article vector representations from step S5 with a parameter-sharing BiLSTM, so that the representation of each word is adjusted according to its context;
S7, matching the text and the question with a bidirectional attention layer applied to the representations from step S6, obtaining article word representations in which the article and the question are mutually aware;
S8, further fusing and reasoning over the representations obtained in step S7 through a bidirectional two-layer LSTM modeling layer, obtaining modeling representations of the article and the question respectively;
S9, performing long-context association matching on the text representation obtained in step S8 through a gated self-attention layer, obtaining a self-attention representation of each word;
S10, combining the representations obtained in steps S7, S8 and S9 at the output layer and using a bilinear function to infer the start index and end index of the final answer, the answer being the phrase between the two indices.
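The character-level encoding of step S4 (a 1-D convolution over a word's character vectors followed by max-pooling over positions) can be sketched as below. This is a shape-level illustration with random filters, not the trained model; the function name and padding choice are illustrative:

```python
import numpy as np

def char_cnn_embed(char_vecs, filters, width=5):
    """1-D convolution over one word's character vectors, then max-pool.

    char_vecs: (word_len, char_dim) character embeddings for one word.
    filters:   (num_filters, width, char_dim) convolution kernels.
    Returns a fixed-length (num_filters,) vector regardless of word length.
    """
    word_len, char_dim = char_vecs.shape
    num_filters = filters.shape[0]
    # pad short words so at least one convolution window exists
    if word_len < width:
        pad = np.zeros((width - word_len, char_dim))
        char_vecs = np.vstack([char_vecs, pad])
        word_len = width
    n_windows = word_len - width + 1
    conv = np.empty((n_windows, num_filters))
    for i in range(n_windows):
        window = char_vecs[i:i + width]               # (width, char_dim)
        conv[i] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return conv.max(axis=0)                           # max-pool over positions
```

The output length equals the number of filters, which is why every word receives a fixed-length character embedding regardless of its spelling length.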
The step S1 is described in detail as follows:
First, a word list and a character list are built from the words appearing in the segmented articles and questions; subsequent steps obtain the corresponding index of each word and character from these two lists and then retrieve the corresponding embedded representations from the indices. Second, each question-answer pair of the dataset is taken as one sample, and the samples are divided into batches of a specified size as model input.
The step S2 is described in detail as follows:
The ELMo embedding is derived from a pre-trained two-layer bidirectional LSTM that is trained on a large corpus with a bidirectional language-model objective and is easily integrated into existing models. ELMo uses a multi-layer LSTM: the upper-layer LSTM states capture contextual semantic information, while the lower-layer states capture syntactic information. The final ELMo representation is a linear combination of the LSTM states of all layers. The resulting ELMo embedding, character embedding and GloVe word embedding are concatenated as model input and fine-tuned with the model to improve performance, meaning that the ELMo embedding is updated during training. ELMo thus allows the vector representation of a word to reflect both context and syntax, addressing polysemy.
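The linear combination of biLM layer states described above can be sketched as follows; the softmax-normalized scalars and the global scale γ are the task-specific parameters that are fine-tuned with the model (the function and argument names here are illustrative, not from the patent):

```python
import numpy as np

def elmo_combine(layer_states, s_logits, gamma=1.0):
    """Task-specific linear combination of biLM layer states.

    layer_states: (L, seq_len, dim) hidden states of the L biLM layers.
    s_logits:     (L,) learned scalars, softmax-normalized into weights.
    gamma:        learned global scale of the combined representation.
    """
    s = np.exp(s_logits - s_logits.max())   # stable softmax over layers
    s = s / s.sum()
    # weighted sum over the layer axis -> (seq_len, dim)
    return gamma * np.tensordot(s, layer_states, axes=1)
```

With all logits equal, every layer contributes equally, which is the usual initialization before fine-tuning.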
The detailed process of step S5 is as follows:
The ELMo embedded representation, the word-level representation and the character embedded representation are concatenated as the input of a two-layer Highway network to obtain a d-dimensional vector for each word, where the Highway network formula is:
y = F(x, W_H) · G(x, W_G) + x · (1 − G(x, W_G))
where F denotes a feed-forward neural network and G denotes a gate over the input;
this yields a context vector matrix X ∈ R^(d×T) and a question vector matrix Q ∈ R^(d×J), where T is the number of article words, J is the number of question words, and d is the number of one-dimensional convolution filters. The matrices X and Q are then each fed into a BiLSTM with d-dimensional output to summarize the article and the question from both directions, yielding two matrices H ∈ R^(2d×T) and U ∈ R^(2d×J).
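The Highway formula y = F(x, W_H) · G(x, W_G) + x · (1 − G(x, W_G)) can be sketched as one NumPy layer; the ReLU choice for the transform branch F is an assumption, since the text only states that F is a feed-forward network:

```python
import numpy as np

def highway(x, W_H, b_H, W_G, b_G):
    """One highway layer: gate G mixes the transform F with the identity path.

    x: (batch, d) input; all weight matrices are (d, d), biases (d,).
    """
    F = np.maximum(0.0, x @ W_H + b_H)               # transform branch (ReLU assumed)
    G = 1.0 / (1.0 + np.exp(-(x @ W_G + b_G)))       # gate in (0, 1)
    return F * G + x * (1.0 - G)
```

When the gate saturates at 0 the layer passes its input through unchanged, which is what makes deep highway stacks easy to train.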
The step S7 bidirectional attention matching mechanism is as follows:
This layer uses an attention mechanism to match the article and question vectors in both directions and, from the inputs H and U, generates a context representation matrix G in which every article word is fused with question information. The computation is as follows:
The attention score between matrix H and matrix U is computed as
S_tj = w^T [H_:t ; U_:j ; H_:t ∘ U_:j]
where S ∈ R^(T×J), w ∈ R^(6d) is a trainable weight vector, [;] denotes vector concatenation, ∘ denotes element-wise multiplication, and H_:t and U_:j are the t-th and j-th columns of H and U.
From the resulting attention score matrix, attention matrices in both directions are obtained:
First, the article-to-question attention is computed as a_t = softmax(S_t:) ∈ R^J, where a_t represents the relevance of the t-th context word to each question word; the question representation corresponding to that word is then the weighted sum of all question word representations, Ũ_:t = Σ_j a_tj · U_:j.
The question-to-article attention is computed as b = softmax_t(max_j S_tj) ∈ R^T, giving a weighted sum vector of the article words most relevant to the question, h̃ = Σ_t b_t · H_:t; this vector is then tiled T times by columns to obtain H̃ ∈ R^(2d×T).
Finally, the context representation fused with question information is obtained by
G_:t = [H_:t ; Ũ_:t ; H_:t ∘ Ũ_:t ; H_:t ∘ H̃_:t] ∈ R^(8d).
The gated self-attention in the step S9 is described in detail as follows:
This layer is introduced because some questions involve longer contexts and require more complex reasoning. To alleviate these problems, the context representation of each word obtained from the modeling layer is matched directly against all other context word representations: from the text matrix representation obtained by the S8 modeling layer, the attention score between each word representation M_t and every other word representation M_j is first calculated; after the scores are normalized into weights α_tj, the final weighted sum representation of each word is computed as P_t = Σ_j α_tj · M_j.
In addition, a gate function is used to reduce attention to less relevant information, resulting in the final representation P*:
g=sigmoid(Wg[P;M])
P*=g⊙[P;M]
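The steps above (scoring each word against all others, Softmax normalization, then the gate g = sigmoid(W_g[P; M]) and P* = g ⊙ [P; M]) can be sketched as follows. The single bilinear scoring map W_a is a simplification assumed here; the patent specifies only a parameter matrix followed by an activation and Softmax:

```python
import numpy as np

def gated_self_attention(M, W_a, W_g):
    """Match every context word against all others, then gate the result.

    M:   (T, d) modeling-layer representations.
    W_a: (d, d) attention scoring weights (assumed bilinear form).
    W_g: (2d, 2d) gate weights.
    Returns P*: (T, 2d) gated concatenation of [P; M].
    """
    scores = M @ W_a @ M.T                           # (T, T) pairwise scores
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)        # Softmax per row
    P = alpha @ M                                    # weighted sum per word
    PM = np.concatenate([P, M], axis=1)              # [P; M]
    g = 1.0 / (1.0 + np.exp(-(PM @ W_g)))            # g = sigmoid(W_g [P; M])
    return g * PM                                    # P* = g (.) [P; M]
```

Because the gate is applied element-wise, dimensions of [P; M] carrying little relevant information are scaled toward zero rather than discarded outright.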
The answer layer in the step S10 is described in detail as follows:
A feature reuse method is adopted: the bidirectional attention layer representation G, the modeling layer representation M and the self-attention layer representation P* are used simultaneously to obtain the probability of each word being the start or end position of the answer, and the probability distribution of the answer start position s is calculated by a bilinear function over the reused features.
Then, a weighted sum representation of the words is calculated according to the start-position probabilities and fused through a BiLSTM to obtain a representation containing start-position information; finally, based on this representation, the probability distribution of the end position is obtained with the same bilinear form as the previous layer.
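A hedged sketch of a bilinear answer layer follows. The question summary vector q and the exact bilinear form R W q are assumptions, since the text names a bilinear function over the reused features without spelling it out, and the BiLSTM fusion between start and end prediction is omitted for brevity:

```python
import numpy as np

def answer_span(R, q, W_s, W_e):
    """Start/end distributions from fused per-word features via bilinear scoring.

    R: (T, k) per-word fused features ([G; M; P*] in the text).
    q: (k,) question summary vector (an assumption of this sketch).
    W_s, W_e: (k, k) bilinear weights for start and end scoring.
    """
    def softmax(z):
        z = np.exp(z - z.max())
        return z / z.sum()
    p_start = softmax(R @ W_s @ q)   # score_t = R_t^T W_s q
    p_end = softmax(R @ W_e @ q)
    return p_start, p_end
```

At inference time the predicted span is the (start, end) pair with start ≤ end maximizing p_start[start] * p_end[end].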
The invention has the following advantages: the introduced self-attention layer with a gating function can further match and fuse information over long texts and filter out unimportant information to a certain degree, thereby alleviating the problems involving longer contexts and improving the accuracy of the model;
by incorporating ELMo embeddings, the invention obtains more accurate word embedded representations through an encoder pre-trained on a large corpus; these representations contain richer context information, effectively handling polysemous words and other context-dependent cases and improving model performance.
Drawings
FIG. 1 is a basic flow diagram of the present invention.
FIG. 2 is a diagram of a neural network model according to the present invention.
Detailed Description
As shown in fig. 1, a reading comprehension method based on the ELMo embedding and gated self-attention mechanism includes the following steps:
S1, performing word segmentation and preprocessing on the article and the question respectively, and building a GloVe word vocabulary and a character list from the words appearing in the segmented article and question;
S2, feeding each word into a pre-trained ELMo encoder to obtain an ELMo embedded representation containing context information;
S3, mapping each word to its corresponding word vector in the GloVe vocabulary to obtain the word-level representation of the word;
S4, looking up the representation of each letter of the word in the character table, feeding the character vectors into a convolutional neural network, and max-pooling the convolutional layer output to obtain a fixed-length character embedded representation for each word;
S5, directly concatenating the vector representations obtained in steps S2, S3 and S4, and passing the vectors through a Highway network to obtain preliminary vector representations of the article and the question;
S6, fusing context information into the question and article vector representations from step S5 with a parameter-sharing BiLSTM, so that the representation of each word is adjusted according to its context;
S7, matching the text and the question with a bidirectional attention layer applied to the representations from step S6, obtaining article word representations in which the article and the question are mutually aware;
S8: further fusing and reasoning over the representations obtained in step S7 through a bidirectional two-layer LSTM modeling layer, obtaining modeling representations of the article and the question respectively;
S9: performing long-context association matching on the text representation obtained in step S8 through a gated self-attention layer, obtaining a self-attention representation of each word;
S10: combining the representations obtained in steps S7, S8 and S9 at the output layer and using a bilinear function to infer the start index and end index of the final answer, i.e. the answer is the phrase between the two indices.
The specific implementation process of the invention is as follows:
1. Select an appropriate dataset.
This section uses the Stanford Question Answering Dataset (SQuAD), created manually through crowdsourcing. SQuAD is a span-prediction reading comprehension dataset: given an article and a question, the machine must find the answer span in the article and predict its start and end positions. The length of the span is generally not limited. The dataset is constructed from 536 articles randomly selected from English Wikipedia and contains 107,785 question-answer pairs. Typically, articles vary from 50 to 250 words, and questions contain about 10 words. This dataset is one of the largest MRC datasets to date.
2. Select model performance evaluation indices.
Two indices are used to evaluate the model: the F1 score and the Exact Match (EM) score. Both are obtained by comparing the model's predicted answer with the candidate answers using the official script, i.e. each of the three candidate answers is compared with the predicted answer and the highest score is selected. EM is the proportion of predictions that match a candidate answer exactly, while the F1 score is defined as the average token overlap between the predicted answer and the candidate answer.
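The two metrics can be computed as below; this mirrors the official SQuAD script's token-overlap F1 and exact string match in simplified form (the full answer normalization, e.g. article and punctuation stripping, is omitted):

```python
def f1_score(pred, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    p, g = pred.lower().split(), gold.lower().split()
    gold_counts = {}
    for tok in g:
        gold_counts[tok] = gold_counts.get(tok, 0) + 1
    common = 0
    for tok in p:                        # count overlapping tokens with multiplicity
        if gold_counts.get(tok, 0) > 0:
            common += 1
            gold_counts[tok] -= 1
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def exact_match(pred, gold):
    """1.0 iff the (case-insensitive) answer strings are identical."""
    return float(pred.lower().strip() == gold.lower().strip())
```

In evaluation each prediction is scored against all three candidate answers and the maximum per-metric score is kept.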
3. Construct the model according to the prior art scheme and experience.
As shown in fig. 2, the core of the invention comprises the following hierarchical structure: (1) the ELMo embedding layer, which uses a pre-trained ELMo language model to obtain an embedded representation of each word containing context information; (2) the self-attention layer with a gating function, which matches each article word representation against all other words of the article and filters unimportant information through the gate; (3) the bilinear-function answer layer based on feature reuse, which predicts the start and end positions of the answer.
Items (1), (2) and (3) are further described below:
(1) The ELMo embedding is obtained from a pre-trained two-layer bidirectional LSTM; the model is trained on a large corpus with a bidirectional language-model objective and can be easily integrated into existing models. ELMo uses a multi-layer LSTM: the upper-layer LSTM states capture contextual semantic information, while the lower-layer states capture syntactic information. The final ELMo representation is a linear combination of the LSTM states of all layers. The resulting ELMo embedding, character embedding and GloVe word embedding are concatenated as model input.
(2) The specific steps are as follows: first, from the resulting modeled text matrix representation, the attention score between each word representation M_t and every other word representation M_j is calculated: the representations are multiplied by a parameter matrix and passed through an activation function, the scores are normalized with the Softmax function into weights α_tj, and the final weighted sum representation of each word is computed as P_t = Σ_j α_tj · M_j.
In addition, a gate function is used to reduce attention to less relevant information, yielding the final representation P*: P and M are concatenated and multiplied by a parameter matrix to obtain the gate value:
g=sigmoid(Wg[P;M])
P*=g⊙[P;M]
(3) Specifically, the detailed steps of the answer layer are as follows:
This layer uses a feature reuse method: the bidirectional attention layer representation G, the modeling layer representation M and the self-attention layer representation P* are used simultaneously to obtain the probability of each word being the start or end position of the answer, and the probability distribution of the answer start position s is calculated by a bilinear function over the reused features.
Because the start position is strongly correlated with the end position, the words are weighted by the start-position probabilities, the weighted sum is fused through a BiLSTM to obtain a new representation containing start-position information, and the end position e is inferred from it:
The loss function finally adopted by this part is the negative log maximum likelihood of the start and end positions, L = −(log p^s(y^s) + log p^e(y^e)), where y^s and y^e are the true start and end indices.
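The negative log-likelihood of the gold span can be sketched as:

```python
import numpy as np

def span_nll(p_start, p_end, y_s, y_e):
    """Negative log-likelihood of the true start/end indices (y_s, y_e).

    p_start, p_end: (T,) predicted probability distributions over positions.
    """
    return -(np.log(p_start[y_s]) + np.log(p_end[y_e]))
```

In training this quantity is averaged over the examples of a batch before the optimizer step.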
4. Select the experimental environment and set the parameters.
The experiments run on a GeForce GTX Titan 12G GPU in a software environment of Ubuntu 18.04 with Python 3.5, TensorFlow-GPU 1.1, CUDA 8.0, cuDNN 5, etc. The experimental settings are as follows: the character embedding layer employs 100 filters of width 5. Word embedding uses pre-trained 300-dimensional word vectors (the 840B version). Dropout with a drop rate of 0.2 is applied to all CNN and LSTM layers and all feed-forward layers. The hidden state size d is 100, and the number of parameters is about 4 million. Model parameters are optimized with the Adamax optimizer with a batch size of 8; training the model for 12 epochs on a graphics card with 12 GB of memory takes about 2 days. The ELMo vectors produced by the language model trained on the Benchmark corpus are set trainable, with the other parameters left at their default values.
Finally, it should be emphasized that the above implementation example merely illustrates a specific procedure of the present invention and is not to be considered limiting. Although the flow is illustrated in detail by way of example, those skilled in the art will understand that modifications and substitutions can be made without departing from the technical core of the present invention, and other embodiments obtained on the basis of the present invention without inventive effort shall fall within its scope.
Claims (6)
1. A reading comprehension method based on ELMo embedding and a gated self-attention mechanism, characterized in that the method specifically comprises the following steps:
S1: performing word segmentation and preprocessing on the article and the question respectively, and building a GloVe vocabulary and a character table from the words appearing in the segmented article and question;
S2: inputting each word into a pre-trained ELMo encoder to obtain its ELMo embedded representation containing context information;
S3: mapping each word to its corresponding word vector in the GloVe vocabulary to obtain the word-level representation of the word;
S4: looking up the corresponding representation in the character table for each letter of each word, feeding the character vectors into a convolutional neural network, and max-pooling the convolutional layer output to obtain a fixed-length character embedded representation of each word;
S5: directly concatenating the representations obtained in steps S2, S3 and S4, and preliminarily processing the vectors with a highway network to obtain preliminary vector representations of the article and the question;
S6: fusing context information into the question and article vector representations from step S5 using a BiLSTM with shared parameters, thereby adjusting the representation of each word according to its context;
S7: matching the text and the question with a bidirectional attention layer on the representations obtained in step S6, to obtain article word representations in which the article and the question are mutually aware;
S8: further fusing and reasoning over the representation obtained in step S7 through a bidirectional two-layer LSTM modeling layer, to obtain modeled representations of the article and the question respectively;
S9: performing long-range context matching on the text representation obtained in step S8 through a gated self-attention layer, to obtain the self-attention representation of each word;
S10: in the output layer, combining the representations obtained in steps S7, S8 and S9 and using bilinear functions to infer the start index and end index of the final answer, i.e. the answer is the phrase between the two indices.
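The character-level encoding of step S4 can be sketched as a 1-D convolution over a word's character vectors followed by max-pooling over positions; kernel initialization and all names below are illustrative assumptions:

```python
import numpy as np

def char_embedding(char_vecs, filters, width=5):
    """Fixed-length character embedding of one word (a sketch of step S4).

    char_vecs: (L, e) character vectors of a word of length L.
    filters:   (n_filters, width, e) convolution kernels.
    Returns a (n_filters,) vector independent of the word length L.
    """
    L, e = char_vecs.shape
    n_filters = filters.shape[0]
    # pad short words so at least one convolution window exists
    pad = max(0, width - L)
    x = np.pad(char_vecs, ((0, pad), (0, 0)))
    windows = x.shape[0] - width + 1
    conv = np.empty((windows, n_filters))
    for t in range(windows):
        # each filter responds to one window of `width` characters
        conv[t] = np.tensordot(filters, x[t:t + width], axes=([1, 2], [0, 1]))
    return conv.max(axis=0)  # max-pool over positions -> fixed length
```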
2. The reading comprehension method based on ELMo embedding and a gated self-attention mechanism of claim 1, wherein the ELMo embedding in step S2 is specifically as follows:
ELMo embedding is obtained by pre-training a two-layer bidirectional LSTM: with a bidirectional language model as the objective, the two-layer bidirectional LSTM is trained on a large corpus and integrated into the model. The ELMo encoder uses multiple LSTM layers; contextual semantic information is extracted from the higher-layer LSTM states, syntactic information is extracted from the lower-layer LSTM states, and the final ELMo embedded representation is a linear combination of the LSTM states of each layer.
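The layer-wise linear combination can be sketched as follows, assuming softmax-normalized layer weights and a learned scalar γ as in the original ELMo formulation (names are illustrative):

```python
import numpy as np

def elmo_combine(layer_states, s_weights, gamma=1.0):
    """ELMo embedding as a learned linear combination of biLM layer states.

    layer_states: (num_layers, T, d) states for T tokens
                  (layer 0 = token embedding, higher layers = LSTM outputs).
    s_weights:    (num_layers,) unnormalized layer weights (learned).
    gamma:        learned scalar applied to the whole combination.
    """
    s = np.exp(s_weights - s_weights.max())
    s = s / s.sum()  # softmax-normalized layer weights
    # weighted sum over the layer axis -> (T, d)
    return gamma * np.tensordot(s, layer_states, axes=1)
```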
3. The reading comprehension method based on ELMo embedding and a gated self-attention mechanism of claim 1, wherein the specific process of step S5 is as follows:
concatenating the ELMo embedded representation, the word-level representation and the character embedded representation as the input of a two-layer highway network to obtain a d-dimensional vector for each word, where the highway network formula is:
y=F(x,WH)·G(x,WG)+x·(1-G(x,WG))
where F represents a feed-forward neural network and G represents a gate on the input;
thereby obtaining a context vector matrix X ∈ R^{d×T} and a question vector matrix Q ∈ R^{d×J}, where T is the number of article words, J is the number of question words, and d is the dimension, equal to the number of convolution filters; the matrices X and Q are then input into an LSTM with d-dimensional output to summarize the article and question from both directions, resulting in two matrices H and U:
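A single highway layer of the form y = F(x, W_H)·G(x, W_G) + x·(1 − G(x, W_G)) can be sketched as below; the tanh transform and parameter shapes are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_G, b_G):
    """One highway layer: y = F(x)*G(x) + x*(1 - G(x)).

    x:        (d,) input vector for one word.
    W_H, b_H: parameters of the forward network F (tanh assumed).
    W_G, b_G: parameters of the gate G.
    """
    F = np.tanh(W_H @ x + b_H)   # transformed input
    G = sigmoid(W_G @ x + b_G)   # gate: blend transform vs. carry-through
    return F * G + x * (1.0 - G)
```

When the gate saturates toward 0, the layer carries the input through unchanged, which eases training of the stacked embedding layers.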
4. The reading comprehension method based on ELMo embedding and a gated self-attention mechanism of claim 3, wherein the matching of the text and the question using the bidirectional attention layer in step S7 is specifically as follows:
matching the article and question vectors in two directions using an attention mechanism: for each word in the article, a context representation matrix G fused with question information is generated from the input matrices H and U. The attention score matrix A between the matrices H and U is first calculated, and the attention matrices in the two directions are then obtained from A:
First, the attention matrix in the article-to-question direction is calculated, where the t-th row represents the relevance vector between the t-th context word and the question words; the question representation corresponding to each word is then obtained as the weighted sum of all question word representations.
The attention in the question-to-article direction is calculated by obtaining the weighted-sum vector representation of the article words most relevant to the question, and then tiling this vector T times column-wise.
Finally, the context representation G fused with the question information is obtained by combining H with the attended representations from the two directions.
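A sketch of the two-direction attention in the style of BiDAF, on which the cited work is based; the trilinear score function and all names are assumptions, since the patent's exact formulas are not reproduced here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidaf_attention(H, U, w):
    """Bidirectional attention (BiDAF-style sketch).

    H: (T, 2d) article representations.
    U: (J, 2d) question representations.
    w: (6d,) weight vector of an assumed trilinear similarity score.
    Returns G: (T, 8d) question-aware article representation.
    """
    T, J = H.shape[0], U.shape[0]
    # similarity matrix A[t, j] = w^T [H_t; U_j; H_t * U_j]
    A = np.empty((T, J))
    for t in range(T):
        for j in range(J):
            A[t, j] = w @ np.concatenate([H[t], U[j], H[t] * U[j]])
    # article-to-question: attended question vector per article word
    a = softmax(A, axis=1)          # (T, J)
    U_hat = a @ U                   # (T, 2d)
    # question-to-article: one attended article vector, tiled T times
    b = softmax(A.max(axis=1))      # (T,)
    H_hat = np.tile(b @ H, (T, 1))  # (T, 2d)
    return np.concatenate([H, U_hat, H * U_hat, H * H_hat], axis=1)
```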
5. The reading comprehension method based on ELMo embedding and a gated self-attention mechanism of claim 4, wherein the gated self-attention in step S9 is specifically as follows:
First, using the text matrix representation obtained from the modeling of step S8, a score between each word representation M_t and every other word representation M_j is calculated; the scores are normalized, and the final weighted-sum representation of each word is computed:
Using a gate function, the final representation P* is obtained:
g=sigmoid(Wg[P;M])
P*=g⊙[P;M].
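The gated self-attention of claim 5 can be sketched as follows; the bilinear score form M_t^T W_s M_j is an assumption (the claim only states that pairwise scores are computed and normalized), while the gate follows the stated formulas g = sigmoid(W_g[P; M]) and P* = g ⊙ [P; M]:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_self_attention(M, W_s, W_g):
    """Gated self-attention over the modeled article representation M.

    M:   (T, d) modeling-layer representation of the article.
    W_s: (d, d) parameters of the assumed bilinear score.
    W_g: (2d, 2d) gate parameters.
    Returns P_star: (T, 2d) gated fused representation.
    """
    scores = M @ W_s @ M.T           # (T, T) pairwise word scores
    alpha = softmax(scores, axis=1)  # normalized attention weights
    P = alpha @ M                    # weighted-sum representation per word
    PM = np.concatenate([P, M], axis=1)  # [P; M], shape (T, 2d)
    g = sigmoid(PM @ W_g.T)              # gate score g = sigmoid(W_g[P; M])
    return g * PM                        # P* = g (elementwise) [P; M]
```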
6. The reading comprehension method based on ELMo embedding and a gated self-attention mechanism of claim 5, wherein the specific process of step S10 is as follows:
using feature reuse, i.e. simultaneously using the bidirectional attention layer representation G, the modeling layer representation M and the self-attention layer representation P*, the probability of each word being the start or end position of the answer is obtained; the probability distribution of the answer start position s is calculated by a bilinear function:
then, the weighted-sum word representation is calculated according to the start-position probabilities and fused through a BiLSTM to obtain a representation containing start-position information; finally, based on this representation, the probability distribution of the end position is obtained with the same form of formula as the previous layer.
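A simplified sketch of the output layer's feature reuse; the bilinear scoring and BiLSTM fusion are replaced here by plain linear scoring and start-probability weighting, so this illustrates the data flow rather than the patent's exact computation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def answer_span(G, M, P_star, w_s, w_e):
    """Start/end probabilities from the reused features (simplified sketch).

    G:      (T, g) bidirectional-attention representation.
    M:      (T, m) modeling-layer representation.
    P_star: (T, p) gated self-attention representation.
    w_s, w_e: (g+m+p,) scoring weights for start and end positions.
    """
    feats = np.concatenate([G, M, P_star], axis=1)
    p_start = softmax(feats @ w_s)
    # weight each word by its start probability before scoring the end,
    # standing in for the BiLSTM fusion of start-position information
    p_end = softmax((p_start[:, None] * feats) @ w_e)
    s, e = int(p_start.argmax()), int(p_end.argmax())
    return p_start, p_end, (s, e)  # answer = phrase between indices s and e
```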
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011542671.2A CN112579739A (en) | 2020-12-23 | 2020-12-23 | Reading understanding method based on ELMo embedding and gating self-attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112579739A true CN112579739A (en) | 2021-03-30 |
Family
ID=75139229
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011542671.2A Pending CN112579739A (en) | 2020-12-23 | 2020-12-23 | Reading understanding method based on ELMo embedding and gating self-attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112579739A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200175015A1 (en) * | 2018-11-29 | 2020-06-04 | Koninklijke Philips N.V. | Crf-based span prediction for fine machine learning comprehension |
CN110929030A (en) * | 2019-11-07 | 2020-03-27 | 电子科技大学 | Text abstract and emotion classification combined training method |
Non-Patent Citations (1)
Title |
---|
WEIWEI ZHANG等: ""ELMo+Gated Self-attention Network Based on BiDAF for Machine Reading Comprehension"", 《 2020 IEEE 11TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING AND SERVICE SCIENCE (ICSESS)》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113240098A (en) * | 2021-06-16 | 2021-08-10 | 湖北工业大学 | Fault prediction method and device based on hybrid gated neural network and storage medium |
CN114218365A (en) * | 2021-11-26 | 2022-03-22 | 华南理工大学 | Machine reading understanding method, system, computer and storage medium |
CN114218365B (en) * | 2021-11-26 | 2024-04-05 | 华南理工大学 | Machine reading and understanding method, system, computer and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112487182B (en) | Training method of text processing model, text processing method and device | |
CN108733792B (en) | Entity relation extraction method | |
CN113239181B (en) | Scientific and technological literature citation recommendation method based on deep learning | |
CN107798140B (en) | Dialog system construction method, semantic controlled response method and device | |
CN111881291A (en) | Text emotion classification method and system | |
CN111930942B (en) | Text classification method, language model training method, device and equipment | |
CN110647612A (en) | Visual conversation generation method based on double-visual attention network | |
CN108628935A (en) | A kind of answering method based on end-to-end memory network | |
CN113297364B (en) | Natural language understanding method and device in dialogue-oriented system | |
CN113435203A (en) | Multi-modal named entity recognition method and device and electronic equipment | |
CN111191002A (en) | Neural code searching method and device based on hierarchical embedding | |
CN108536735B (en) | Multi-mode vocabulary representation method and system based on multi-channel self-encoder | |
CN112232053A (en) | Text similarity calculation system, method and storage medium based on multi-keyword pair matching | |
Chen et al. | Deep neural networks for multi-class sentiment classification | |
CN110597968A (en) | Reply selection method and device | |
CN111666752A (en) | Circuit teaching material entity relation extraction method based on keyword attention mechanism | |
CN111145914B (en) | Method and device for determining text entity of lung cancer clinical disease seed bank | |
CN112182373A (en) | Context expression learning-based personalized search method | |
CN112579739A (en) | Reading understanding method based on ELMo embedding and gating self-attention mechanism | |
CN110889505A (en) | Cross-media comprehensive reasoning method and system for matching image-text sequences | |
CN115171870A (en) | Diagnosis guiding and prompting method and system based on m-BERT pre-training model | |
CN111428518A (en) | Low-frequency word translation method and device | |
Dandwate et al. | Comparative study of Transformer and LSTM Network with attention mechanism on Image Captioning | |
CN116414988A (en) | Graph convolution aspect emotion classification method and system based on dependency relation enhancement | |
Sun et al. | Rumour detection technology based on the BiGRU_capsule network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20210330 |