CN111813913A - Two-stage problem generation system with problem as guide - Google Patents

Two-stage problem generation system with problem as guide

Info

Publication number
CN111813913A
CN111813913A (application CN202010661187.5A)
Authority
CN
China
Prior art keywords
context
sequence
answer
encoder
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010661187.5A
Other languages
Chinese (zh)
Other versions
CN111813913B (en)
Inventor
沈耀
倪茂森
过敏意
姚斌
陈�全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202010661187.5A priority Critical patent/CN111813913B/en
Publication of CN111813913A publication Critical patent/CN111813913A/en
Application granted granted Critical
Publication of CN111813913B publication Critical patent/CN111813913B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/20Education

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Tourism & Hospitality (AREA)
  • Educational Administration (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Educational Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A question-guided two-stage question generation system, comprising a question-answer data preprocessing module, a context sequence labeling module and a question generation module, wherein: the question-answer data preprocessing module re-divides the data set, extracts features, builds a dictionary, and vectorizes the features and words to obtain a labeled training set and real labels; the context sequence labeling module trains a network model on the labeled data set and produces predicted labels for the context; the question generation module takes the real labels and the predicted labels as input to generate a predicted question sequence, and is trained by back-propagating the error between the real and predicted sequences to obtain the final maximum-probability question. The invention achieves clear improvements on the BLEU, METEOR and ROUGE-L metrics.

Description

Two-stage problem generation system with problem as guide
This application is a divisional of application No. 201911179784.8, filed on 2019/11/27, entitled "Two-stage question generation system with question as guide".
Technical Field
The invention relates to a technology in the field of natural language processing, and in particular to a question-guided two-stage question generation system.
Background
Question Generation (QG) aims to generate questions from various natural language texts and plays a crucial role in natural language generation. In recent years, question generation has attracted increasing attention because of its wide range of applications. The most direct application is to expand the data sets of question-answering tasks and thereby improve their performance. Generated questions can also help readers assess how well they have mastered a passage and point out what they may have missed while reading, which is of great significance in education for easing the burden on teachers and improving teaching quality. Furthermore, in conversational systems, smooth communication often relies on asking reasonable questions, and question generation has become an important component of existing dialogue systems (e.g. Siri, Alexa and Cortana).
The question generation task is the symmetric counterpart of the question answering task: given a context and an answer, it produces a valid question. Many existing end-to-end networks perform well on question generation, but they share two shortcomings: 1) they do not fully exploit the questions in the data set, which are used only as labels for computing the loss; 2) they do not use the answer effectively, merely fusing the answer into the context with 0/1 or BIO position tags.
Disclosure of Invention
To address the insufficient use of questions and the insufficient attention to answers in the prior art, the invention provides a question-guided two-stage question generation system.
the invention is realized by the following technical scheme:
the invention relates to a problem-oriented two-stage problem generation system, comprising: the system comprises a question-answer data preprocessing module, a context sequence labeling module and a question generating module, wherein: the question-answer data preprocessing module performs re-division, feature extraction and dictionary construction on the data set and vectorizes features and words to obtain a labeling training set and a real label; the context sequence marking module adopts the marked data set to train a network model and obtain a prediction label of a context; and the problem generation module generates a prediction problem sequence by taking the real label and the prediction label as input, and performs back propagation training to obtain a final maximum probability prediction problem through an error of the real label and the prediction label.
The invention also relates to a question-guided two-stage question generation method based on the above system, which comprises the following stages:
the first stage, based on the LSTM-CRF network, of which the inputs are a separate context encoder and answer encoder, notes in context the words that may be included in the question, where: the context encoder outputs the attribute to the output of the answer encoder to fuse answer information to obtain a fusion matrix H, and finally a sequence mark of a context is generated through a feedforward structure.
In the second stage, the sequence labels produced in the first stage are vectorized and concatenated with the fusion matrix H as the encoder input, and a gated self-attention mechanism is applied to the encoder output to promote information fusion over long contexts. During decoding, both the encoder output and the first-stage answer encoder output are attended to, and a copy mechanism copies words from the context, finally obtaining the question generated by the question-guided two-stage process.
Technical effects
Compared with the prior art, the generated questions are more relevant to their answers, and the word-overlap evaluation metrics (BLEU, METEOR, ROUGE-L) reach the best results among existing models; encoding the answer separately effectively decouples the answer information, and re-attending to this low-level semantic information during decoding reduces the loss of answer information; the two-stage design of context sequence labeling followed by question generation first marks the context words to be used and then generates out-of-context words and organizes the semantics, completing question generation more efficiently.
The encoder separates the context from the answer, which makes information fusion within the network more convenient and increases the attention the generated question pays to the answer. The invention adopts a two-stage scheme: the first stage marks whether each word of the context appears in the question to obtain an additional vectorized feature; in the second stage the encoder uses a gated self-attention structure to strengthen the fusion of long-text context information, and the decoder attends not only to the encoder output but also to the answer encoder output; the BLEU, METEOR and ROUGE-L scores of the generated questions reach the current state of the art.
Drawings
FIG. 1 is a block diagram of the system of the present invention;
FIG. 2 is a schematic diagram of the LSTM-CRF network of the present invention;
FIG. 3 is a schematic diagram of the question generation module of the present invention.
Detailed Description
As shown in fig. 1, the present embodiment relates to a two-stage question generation system based on an end-to-end network, which comprises a question-answer data preprocessing module, a context sequence labeling module and a question generation module, wherein: the question-answer data preprocessing module re-divides the data set, extracts features, builds a dictionary, and vectorizes the features and words to obtain a labeled training set and real labels; the context sequence labeling module trains a network model on the labeled data set and produces predicted labels for the context; the question generation module takes the real labels and the predicted labels as input to generate a predicted question sequence, and is trained by back-propagation on the error between the real and predicted sequences to obtain the final maximum-probability question.
The experimental data for this embodiment come from the Stanford question answering data set SQuAD, which collected more than 100,000 question-answer pairs in a crowd-sourced manner. The data set used Wikipedia's internal ranking system to obtain the top ten thousand high-quality articles, randomly drew 536 of them, extracted paragraphs from the drawn articles, deleted images and tables, discarded paragraphs of fewer than 500 characters, and thus obtained 23,215 paragraphs. Crowd workers then asked questions about the paragraphs, where each answer is a span of the context.
The test environment of this embodiment is a single NVIDIA Titan RTX; the deep learning framework is PyTorch 1.1.0 and the CUDA version is 10.0.130.
The re-division is as follows: the SQuAD training set contains about 87,000+ examples, the validation set about 10,000+, and the test set also about 10,000+ but is not publicly released, so the validation set is split in half, one half used for validation and the other for testing.
The feature extraction and dictionary construction comprise the following steps: since the pre-trained word vectors used are GloVe, all words in the divided training set are first counted, and the words whose frequency exceeds a frequency threshold and that are contained in GloVe form a set; the tokens <UNK>, <PAD>, <S>, </S> (unknown word, padding, start symbol and end symbol) are then added to form the dictionary of this embodiment. The context sequence, question sequence and answer sequence are converted into indices over this dictionary, denoted W_c, W_q and W_ans respectively. Meanwhile, the spaCy toolkit is used to perform named entity recognition (NER) on the context sequence, yielding a sequence denoted W_ner, and part-of-speech (POS) tagging, yielding a sequence denoted W_pos. Finally, the lemmas or word forms of non-stop words of the context that appear in the question are labeled W_emerge.
The word frequency threshold value in this embodiment is 4, which can be customized in different situations.
The feature and word vectorization is as follows: the dictionary indices W_c and W_ans of the context and answer sequences obtained after dictionary and feature construction are vectorized with the GloVe pre-trained word vectors; the named entity sequence W_ner and the part-of-speech sequence W_pos are vectorized with randomly initialized embeddings; and the labels W_emerge of non-stop context words that appear in the question, together with the question sequence indices W_q, serve as the real labels for the context sequence labeling module and the question generation module respectively.
There are 8 named entity types and 12 part-of-speech categories, so the corresponding embedding dimensions are 8 and 12, since each dimension equals the number of categories of the feature.
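A minimal sketch of the preprocessing described above (dictionary construction, NER/POS features and the "appears in question" labels) is given below. The function names, the spaCy model "en_core_web_sm" and the GloVe loading interface are assumptions for illustration, not the exact implementation of this embodiment.

```python
# Preprocessing sketch: vocabulary over frequent GloVe-covered words, spaCy NER/POS
# features, and labels marking which non-stop context words appear in the question.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")          # assumed spaCy English model
SPECIALS = ["<PAD>", "<UNK>", "<S>", "</S>"]

def build_vocab(train_contexts, glove_words, freq_threshold=4):
    """Keep words above the frequency threshold that also appear in GloVe."""
    counts = Counter(tok for sent in train_contexts for tok in sent)
    words = [w for w, c in counts.items() if c > freq_threshold and w in glove_words]
    itos = SPECIALS + sorted(words)
    return {w: i for i, w in enumerate(itos)}, itos

def to_indices(tokens, stoi):
    """Convert a token sequence into dictionary indices (W_c / W_q / W_ans)."""
    unk = stoi["<UNK>"]
    return [stoi.get(t, unk) for t in tokens]

def context_features(context_tokens, question_tokens):
    """NER tags (W_ner), POS tags (W_pos) and 'appears in question' labels (W_emerge)."""
    doc = nlp(" ".join(context_tokens))
    w_ner = [tok.ent_type_ or "O" for tok in doc]
    w_pos = [tok.pos_ for tok in doc]
    q_lemmas = {tok.lemma_ for tok in nlp(" ".join(question_tokens)) if not tok.is_stop}
    w_emerge = [1 if (not tok.is_stop and tok.lemma_ in q_lemmas) else 0 for tok in doc]
    return w_ner, w_pos, w_emerge
```

Note that spaCy re-tokenizes the joined string, so in practice the feature sequences would need to be aligned back to the original tokenization; the sketch omits this step.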
As shown in fig. 1 and fig. 2, the context sequence labeling module includes: a separate dual input encoder, a feed-forward network structure, a Conditional Random Field (CRF) structure.
The separate dual-input encoder comprises a context encoder and an answer encoder, which pass the context sequence and the answer sequence through two different two-layer bidirectional LSTM encoders to obtain two vectors S_c and S_a as the context state vector and the answer state vector:
S_c = [→LSTM_c([g_i; f_i]); ←LSTM_c([g_i; f_i])],  S_a = [→LSTM_a(g_i); ←LSTM_a(g_i)]
wherein: g_i is the GloVe pre-trained word vector; f_i is the additional feature information, containing the named entity and part-of-speech information in the context encoder, while the answer encoder considers only the pre-trained word vectors and f_i is empty; the arrows indicate the direction of the recurrent network; [;] denotes concatenation along the final dimension of the two vectors.
So that the context encoder can perceive the current answer information, this embodiment uses an attention mechanism to fuse the answer information into the context, obtaining the fusion matrix H. The general form of the attention mechanism is Attention(Q, K, V) = softmax((W_Q·Q)·(W_K·K)^T)·(W_V·V), and H = Attention(S_a, S_c, S_c), where W_Q, W_K and W_V are trainable parameter matrices.
The feedforward network structure is a fully connected network comprising two linear transformations, a ReLU activation, a residual connection and layer normalization, specifically: FFN(H) = LayerNorm(ReLU(H·W_1 + b_1)·W_2 + b_2 + H)·W_3, wherein the two linear transformations are implemented with one-dimensional convolutions whose input and output are 600-dimensional and whose intermediate dimension is 2400, i.e. W_1 ∈ R^(600×2400) and W_2 ∈ R^(2400×600), and W_3 projects onto the label space; the probability of each label of the context sequence is finally obtained.
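A minimal PyTorch sketch of this stage-1 structure follows: separate two-layer BiLSTM context and answer encoders, attention-based fusion into H, and the feedforward tagging head. Module names, hidden sizes other than 600/2400, the number of tags, and the use of plain linear layers in place of kernel-1 one-dimensional convolutions are assumptions; the attention is written with the context as query attending to the answer, which is one reading of the fusion described in the text.

```python
import torch
import torch.nn as nn

class DotAttention(nn.Module):
    """Attention(Q, K, V) = softmax((WQ·Q)(WK·K)^T)(WV·V), as in the text."""
    def __init__(self, dim):
        super().__init__()
        self.wq, self.wk, self.wv = (nn.Linear(dim, dim, bias=False) for _ in range(3))
    def forward(self, q, k, v):
        scores = self.wq(q) @ self.wk(k).transpose(1, 2)      # (B, Lq, Lk)
        return torch.softmax(scores, dim=-1) @ self.wv(v)     # (B, Lq, dim)

class Stage1Encoder(nn.Module):
    def __init__(self, emb_dim=300, feat_dim=20, hidden=300, n_tags=3):
        super().__init__()
        self.ctx_lstm = nn.LSTM(emb_dim + feat_dim, hidden, num_layers=2,
                                bidirectional=True, batch_first=True)
        self.ans_lstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                                bidirectional=True, batch_first=True)
        self.fuse = DotAttention(2 * hidden)                  # 600-dim states
        self.w1 = nn.Linear(2 * hidden, 2400)                 # stands in for the 1-D convolutions
        self.w2 = nn.Linear(2400, 2 * hidden)
        self.norm = nn.LayerNorm(2 * hidden)
        self.w3 = nn.Linear(2 * hidden, n_tags)               # per-token tag scores

    def forward(self, ctx_emb, ctx_feat, ans_emb):
        s_c, _ = self.ctx_lstm(torch.cat([ctx_emb, ctx_feat], dim=-1))  # (B, Lc, 600)
        s_a, _ = self.ans_lstm(ans_emb)                                  # (B, La, 600)
        h = self.fuse(s_c, s_a, s_a)        # fuse answer information into the context side
        ffn = self.norm(self.w2(torch.relu(self.w1(h))) + h)            # FFN with residual + LayerNorm
        return self.w3(ffn), s_a            # tag scores for the CRF, answer states for the decoder
```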
The CRF structure is used to obtain the transition probabilities between labels, and the loss function is the negative log-likelihood
Loss = -log P(y | x) = log Σ_{y*} exp(Score(x, y*)) − Score(x, y)
wherein: x denotes the input sequence, y denotes the true label sequence, y* ranges over all possible predicted label sequences, Score(x, y) is the score of the true annotated sequence, and the sum over the scores of all possible predicted sequences is computed step by step from per-step path scores (the forward algorithm), which greatly reduces the amount of computation.
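The sketch below is a generic linear-chain CRF negative log-likelihood of the form above, computed with the per-step forward algorithm; it is assumed to match the patent's CRF layer in spirit, and batching and padding masks are omitted for brevity.

```python
import torch
import torch.nn as nn

class CRFLoss(nn.Module):
    def __init__(self, n_tags):
        super().__init__()
        self.trans = nn.Parameter(torch.zeros(n_tags, n_tags))  # trans[i, j]: score of tag i -> tag j

    def gold_score(self, emissions, tags):                      # emissions: (L, n_tags), tags: (L,)
        score = emissions[0, tags[0]]
        for t in range(1, emissions.size(0)):
            score = score + self.trans[tags[t - 1], tags[t]] + emissions[t, tags[t]]
        return score

    def log_partition(self, emissions):
        """log of the sum over all tag paths, one forward-algorithm step per position."""
        alpha = emissions[0]                                     # (n_tags,)
        for t in range(1, emissions.size(0)):
            # alpha[j] = logsumexp_i(alpha[i] + trans[i, j]) + emissions[t, j]
            alpha = torch.logsumexp(alpha.unsqueeze(1) + self.trans, dim=0) + emissions[t]
        return torch.logsumexp(alpha, dim=0)

    def forward(self, emissions, tags):
        # negative log-likelihood: log Z(x) - Score(x, y)
        return self.log_partition(emissions) - self.gold_score(emissions, tags)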
As shown in fig. 3, the question generation module reuses the answer-fused context matrix H from stage 1 to reduce the computational complexity of the model. The label values output by stage 1 are vectorized as an additional feature E, and E concatenated with H serves as the encoder input; a gated self-attention mechanism abstracts the features at a higher level. During decoding, this embodiment attends not only to the encoder output but also to the output of the answer encoder in the context sequence labeling module, and uses a pointer network to address the OOV problem, finally producing high-quality questions strongly correlated with the answers. The question generation module comprises: a self-attention encoder, an answer-focused decoder, and a pointer network with a gate structure.
The self-attention encoder is a bidirectional LSTM network; the input is passed through the LSTM to obtain a state vector S. Because long-range dependence in the LSTM causes the vanishing-gradient problem, this embodiment uses a self-attention mechanism with a gate structure, specifically: an attention intermediate state N is obtained with the attention mechanism, the gate structure then filters the information of S and N, and a residual connection facilitates gradient flow and avoids information loss, giving the final state M:
N = Attention(S, S, S),  M = sigmoid(W_G·[S; N]) ⊙ tanh(W_E·[S; N]) + S
wherein W_G and W_E are trainable parameter matrices and ⊙ denotes element-wise multiplication.
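A self-contained sketch of this gated self-attention layer follows; the projection sizes are assumptions, and the attention uses the same projected dot-product form as in stage 1.

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """N = Attention(S, S, S); M = sigmoid(W_G·[S;N]) * tanh(W_E·[S;N]) + S."""
    def __init__(self, dim):
        super().__init__()
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.w_g = nn.Linear(2 * dim, dim, bias=False)   # gate
        self.w_e = nn.Linear(2 * dim, dim, bias=False)   # candidate

    def forward(self, s):                                # s: (B, L, dim) BiLSTM states
        scores = self.wq(s) @ self.wk(s).transpose(1, 2) # (B, L, L)
        n = torch.softmax(scores, dim=-1) @ self.wv(s)   # intermediate state N
        sn = torch.cat([s, n], dim=-1)                   # [S; N]
        return torch.sigmoid(self.w_g(sn)) * torch.tanh(self.w_e(sn)) + s   # M with residual
```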
The decoder comprises a two-layer unidirectional LSTM structure. During decoding it attends both to the final encoder state M and to the state vector S_a output by the answer encoder in the dual-input encoder of the context sequence labeling module, compensating for the information lost through high-level abstraction. Specifically:
context attention vector: c_t = Attention(h_{t-1}, M, M)
answer attention vector: a_t = Attention(h_{t-1}, S_a, S_a)
LSTM state transition: h_t = LSTM([g_{t-1}; c_t; a_t], h_{t-1})
wherein: g_{t-1} is the GloVe pre-trained word vector of the previous word of the current predicted sequence; c_t is the fusion vector obtained by attending the previous output h_{t-1} to the encoder output M; and a_t is the attention vector containing answer information obtained by attending h_{t-1} to the answer state vector S_a.
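The sketch below shows one decoder step under these equations. The simple unprojected dot-product attention, the exact input layout of the LSTM cell, and the assumption that the state dimensions of M and S_a match the decoder output are simplifications for illustration.

```python
import torch
import torch.nn as nn

def attend(query, memory):
    """Dot-product attention; returns the context vector and the attention weights."""
    scores = (memory @ query.unsqueeze(-1)).squeeze(-1)          # (B, L)
    weights = torch.softmax(scores, dim=-1)
    return (weights.unsqueeze(1) @ memory).squeeze(1), weights   # (B, dim), (B, L)

class DualAttnDecoderStep(nn.Module):
    """One step: h_t = LSTM([g_{t-1}; c_t; a_t], h_{t-1})."""
    def __init__(self, emb_dim, dim):
        super().__init__()
        self.cell = nn.LSTM(emb_dim + 2 * dim, dim, num_layers=2, batch_first=True)

    def forward(self, g_prev, h_prev, state, m, s_a):
        # g_prev: (B, emb) previous word embedding, h_prev: (B, dim) previous output,
        # m: (B, Lc, dim) encoder output, s_a: (B, La, dim) answer encoder states
        c_t, ctx_weights = attend(h_prev, m)       # context attention vector + weights over M
        a_t, _ = attend(h_prev, s_a)               # answer attention vector
        x = torch.cat([g_prev, c_t, a_t], dim=-1).unsqueeze(1)
        out, state = self.cell(x, state)           # state: (h, c) of the 2-layer LSTM
        return out.squeeze(1), c_t, ctx_weights, state
```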
The pointer network first takes h_t concatenated with the context attention vector as input to a linear layer, which outputs the probabilities P_gen of all words in the dictionary. The attention weights of h_t over M in the decoder are then used directly as the copy probability P_copy of the context words. Finally, a gate structure regulates the ratio of P_copy to P_gen to obtain the final output P_final = G_copy·P_copy + (1 − G_copy)·P_gen, wherein G_copy is the probability produced by the gate structure.
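A sketch of this copy/generate mixture follows; it scatters the attention weights over M (for example, the ctx_weights returned by the decoder sketch above) back onto the dictionary ids of the context words. The gate input [h_t; c_t] is an assumption.

```python
import torch
import torch.nn as nn

class CopyGenerator(nn.Module):
    """P_final = G_copy * P_copy + (1 - G_copy) * P_gen."""
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.gen = nn.Linear(2 * dim, vocab_size)     # P_gen from [h_t; c_t]
        self.gate = nn.Linear(2 * dim, 1)             # G_copy from [h_t; c_t]

    def forward(self, h_t, c_t, attn_weights, ctx_ids):
        # attn_weights: (B, L_ctx) attention of h_t over M; ctx_ids: (B, L_ctx) long word ids
        feats = torch.cat([h_t, c_t], dim=-1)
        p_gen = torch.softmax(self.gen(feats), dim=-1)             # (B, V)
        g_copy = torch.sigmoid(self.gate(feats))                   # (B, 1)
        p_copy = torch.zeros_like(p_gen).scatter_add_(1, ctx_ids, attn_weights)
        return g_copy * p_copy + (1 - g_copy) * p_gen              # P_final
```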
During training, this embodiment first initializes the linear layers and LSTMs of the model with orthogonal parameters; the first two thousand pre-trained word vectors are fine-tuned during training while the remaining word vectors are kept fixed; an SGD optimizer with momentum 0.8 is used, the initial learning rate is 0.1, and after 8 epochs the learning rate is halved every 4 epochs; the best training result is reached at the 40th epoch.
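A sketch of this training setup is given below. The schedule lambda reflects one reading of "halved every 4 epochs after 8 epochs" (first halving at epoch 8), and `build_optimizer` and its argument `model` are assumed names standing in for the full two-stage network.

```python
import torch
import torch.nn as nn

def init_weights(module):
    """Orthogonal initialization for the weight matrices of linear and LSTM layers."""
    if isinstance(module, (nn.Linear, nn.LSTM)):
        for name, param in module.named_parameters():
            if "weight" in name and param.dim() >= 2:
                nn.init.orthogonal_(param)

def lr_factor(epoch):
    # keep lr = 0.1 for the first 8 epochs, then halve every 4 epochs
    return 1.0 if epoch < 8 else 0.5 ** ((epoch - 8) // 4 + 1)

def build_optimizer(model):
    model.apply(init_weights)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.8)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)
    return optimizer, scheduler
```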
During inference, this embodiment uses beam search with beam size 10 and the best model obtained during training.
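The generic beam-search sketch below illustrates the size-10 search used at inference; `step_fn(prefix)` is an assumed interface that returns log-probabilities over the dictionary for the next word given a partial question.

```python
import heapq

def beam_search(step_fn, start_id, end_id, beam_size=10, max_len=30):
    beams = [(0.0, [start_id])]                       # (cumulative log-prob, token ids)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end_id:
                finished.append((score, seq))         # keep completed hypotheses
                continue
            log_probs = step_fn(seq)                  # sequence of log-probs, one per word id
            top = heapq.nlargest(beam_size, enumerate(log_probs), key=lambda x: x[1])
            for tok, lp in top:
                candidates.append((score + lp, seq + [tok]))
        if not candidates:
            break
        beams = heapq.nlargest(beam_size, candidates, key=lambda x: x[0])
    finished.extend(beams)
    return max(finished, key=lambda x: x[0])[1]       # highest-scoring sequence
```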
Using the above model and the training and inference procedures, experiments were performed on the SQuAD data set. Compared with several current advanced models, the model of this embodiment is clearly improved on the BLEU, METEOR and ROUGE-L metrics; the experimental results are as follows:
Table 1
Model                                BLEU_1  BLEU_2  BLEU_3  BLEU_4  METEOR  ROUGE-L
Du et al. (2017)                     -       -       -       12.28   16.62   39.75
Song et al. (2018)                   -       -       -       13.98   18.77   42.72
Zhao et al. (2018)                   45.69   30.25   22.16   16.85   20.62   44.99
Kim et al. (2019)                    -       -       -       16.20   19.92   43.96
Liu et al. (2019)                    46.58   30.90   22.82   17.55   21.24   44.53
This embodiment (binary identifier)  44.68   29.10   21.12   15.93   20.13   44.26
This embodiment (answer encoder)     45.45   29.98   21.91   16.66   20.46   44.94
This embodiment (two-stage)          46.96   31.68   23.67   18.36   21.43   45.99
The results show that, compared with previous models, the model of this embodiment achieves the current best performance, outperforming the other models on every metric.
The following components are original to the invention, have not been disclosed before, and do not operate in the same manner as any prior reference: a separate answer encoding structure with an attention mechanism over the answer state vector in the decoding stage; and a two-stage approach, namely generating additional features by first sequence-labeling the context and then performing question generation.
The improvements of the answer encoding structure are as follows. The traditional answer labeling approach uses a binary identifier to mark the position of the answer in the context, and integrating this position information effectively improves several aspects of the generated questions. However, the binary identifier itself carries limited information, which motivates this embodiment to find a better alternative that fuses answer information into the network more subtly. The answer encoder derives the fusion matrix H, which implicitly contains answer position information, by encoding the answer and attending to the context state vector. Unlike previous sequence-to-sequence structures that separate the answer, this embodiment adds a self-attention layer after the attention layer to extract high-level answer information. In addition, the answer encoder's state vector is attended to again during decoding to obtain low-level answer information. Clearly, the more fully the answer state vector is utilized, the higher the correlation between the generated question and the correct answer.
The improvements of the two-stage scheme are as follows: the traditional end-to-end architecture directly regenerates questions from the context encoding, which makes it difficult to add extra information. This embodiment uses a two-stage approach that first annotates the context words likely to appear in the question and encodes this information, then reuses the traditional end-to-end structure to generate new words and organize the grammatical structure.
In concrete experiments, under an environment of a single NVIDIA Titan RTX, PyTorch 1.1.0 and CUDA 10.0.130, the first two thousand pre-trained word vectors are fine-tuned during training while the remaining word vectors are fixed, an SGD optimizer with momentum 0.8 is used with an initial learning rate of 0.1, the learning rate is halved every 4 epochs after 8 epochs, and the best model is obtained at the 40th epoch; inference uses this best model with beam search of size 10. The resulting experimental data are shown in Table 1.
An ablation study is used to determine the contribution of each improvement to the overall model. Table 1 shows that, relative to the traditional binary identifier, the separate answer encoder improves BLEU-4, METEOR and ROUGE-L by 0.7, 0.3 and 0.8 points respectively; the context sequence labeling module adds a further 1.7, 1.0 and 1.0 points on top of that, and the overall model improves by 2.4, 1.3 and 1.8 points. The two-stage scheme improves all metrics most noticeably, and the context sequence labeling module contributes the most to the overall model.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (11)

1. A two-stage question generation system based on an end-to-end network, comprising a question-answer data preprocessing module, a context sequence labeling module and a question generation module, wherein: the question-answer data preprocessing module re-divides the data set, extracts features, builds a dictionary, and vectorizes the features and words to obtain a labeled training set and real labels; the context sequence labeling module trains a network model on the labeled data set and produces predicted labels for the context; the question generation module takes the real labels and the predicted labels as input to generate a predicted question sequence, and is trained by back-propagation on the error between the real and predicted sequences to obtain the final maximum-probability question;
the context sequence labeling module comprises: a separate dual input encoder, feedforward network structure, Conditional Random Field (CRF) structure;
the question generation module comprises: an encoder of a self-attention mechanism, a decoder focusing on the answer, and a pointer network having a gate structure.
2. The system of claim 1, wherein the re-division is: the question-answer data preprocessing module receives the SQuAD data set as input and uses one half of the original validation set as the validation set and the other half as the test set.
3. The system of claim 1, wherein the feature extraction and dictionary construction are: counting all words in the divided training set, taking the words whose frequency exceeds the frequency threshold and that are contained in the pre-trained GloVe word vectors as a set, and adding the tokens <UNK>, <PAD>, <S>, </S> (unknown word, padding, start symbol and end symbol) to form the dictionary; the context sequence, question sequence and answer sequence are then converted into indices over this dictionary, denoted W_c, W_q and W_ans respectively; meanwhile, the spaCy toolkit is used to perform named entity recognition on the context sequence, yielding a sequence denoted W_ner, and part-of-speech tagging, yielding a sequence denoted W_pos; finally, the lemmas or word forms of non-stop words of the context that appear in the question are labeled W_emerge.
4. The system of claim 1, wherein the vectorization is: the dictionary indices W_c and W_ans of the context and answer sequences obtained after dictionary and feature construction are vectorized with the GloVe pre-trained word vectors; the named entity sequence W_ner and the part-of-speech sequence W_pos are vectorized with randomly initialized embeddings; and the labels W_emerge of non-stop context words that appear in the question, together with the question sequence indices W_q, serve as the real labels for the context sequence labeling module and the question generation module respectively.
5. The system of claim 1, wherein the separate dual-input encoder comprises a context encoder and an answer encoder, which pass the context sequence and the answer sequence through two different two-layer bidirectional LSTM encoders to obtain two vectors S_c and S_a as the context state vector and the answer state vector:
S_c = [→LSTM_c([g_i; f_i]); ←LSTM_c([g_i; f_i])],  S_a = [→LSTM_a(g_i); ←LSTM_a(g_i)]
wherein: g_i is the GloVe pre-trained word vector; f_i is the additional feature information, containing the named entity and part-of-speech information in the context encoder, while the answer encoder considers only the pre-trained word vectors and f_i is empty; the arrows indicate the direction of the recurrent network; [;] denotes concatenation along the final dimension of the two vectors;
so that the context encoder can perceive the current answer information, an attention mechanism is used to fuse the answer information into the context, obtaining the fusion matrix H; the general form of the attention mechanism is Attention(Q, K, V) = softmax((W_Q·Q)·(W_K·K)^T)·(W_V·V), and H = Attention(S_a, S_c, S_c), where W_Q, W_K and W_V are trainable parameter matrices.
6. The system of claim 1, wherein the feedforward network structure is a fully connected network comprising two linear transformations, a ReLU activation, a residual connection and layer normalization, specifically: FFN(H) = LayerNorm(ReLU(H·W_1 + b_1)·W_2 + b_2 + H)·W_3, wherein the two linear transformations are implemented with one-dimensional convolutions whose input and output are 600-dimensional and whose intermediate dimension is 2400, i.e. W_1 ∈ R^(600×2400) and W_2 ∈ R^(2400×600), and W_3 projects onto the label space; the probability of each label of the context sequence is finally obtained.
7. The system of claim 1, wherein the CRF structure is configured to obtain transition probabilities between labels, and the loss function is the negative log-likelihood
Loss = -log P(y | x) = log Σ_{y*} exp(Score(x, y*)) − Score(x, y)
wherein: x denotes the input sequence, y denotes the true label sequence, y* ranges over all possible predicted label sequences, Score(x, y) is the score of the true annotated sequence, and the sum over the scores of all possible predicted sequences is computed step by step from per-step path scores, which greatly reduces the amount of computation.
8. The system of claim 1, wherein the encoder of the self-attention mechanism is a bidirectional LSTM network; the input is passed through the LSTM network to obtain a state vector S; because long-range dependence in the LSTM causes the vanishing-gradient problem, a self-attention mechanism with a gate structure is used, specifically: an attention intermediate state N is obtained with the attention mechanism, the gate structure then filters the information of S and N, and a residual connection facilitates gradient flow and avoids information loss, giving the final state M: N = Attention(S, S, S), M = sigmoid(W_G·[S; N]) ⊙ tanh(W_E·[S; N]) + S, wherein W_G and W_E are trainable parameter matrices and ⊙ denotes element-wise multiplication.
9. The system of claim 1, wherein the decoder comprises a two-layer unidirectional LSTM structure; during decoding it attends both to the final encoder state M and to the state vector S_a output by the answer encoder of the dual-input encoder in the context sequence labeling module, compensating for the information lost through high-level abstraction, specifically:
context attention vector: c_t = Attention(h_{t-1}, M, M)
answer attention vector: a_t = Attention(h_{t-1}, S_a, S_a)
LSTM state transition: h_t = LSTM([g_{t-1}; c_t; a_t], h_{t-1})
wherein: g_{t-1} is the GloVe pre-trained word vector of the previous word of the current predicted sequence; c_t is the fusion vector obtained by attending the previous output h_{t-1} to the encoder output M; and a_t is the attention vector containing answer information obtained by attending h_{t-1} to the answer state vector S_a.
10. The system of claim 1, wherein the pointer network first takes h_t concatenated with the context attention vector as input to a linear layer, which outputs the probabilities P_gen of all words in the dictionary; the attention weights of h_t over M in the decoder are then used directly as the copy probability P_copy of the context words; a gate structure then regulates the ratio of P_copy to P_gen to obtain the final output P_final = G_copy·P_copy + (1 − G_copy)·P_gen, wherein G_copy is the probability produced by the gate structure.
11. A question-guided two-stage question generation method based on the system of any one of the preceding claims, comprising:
a first stage, based on an LSTM-CRF network whose inputs are a separate context encoder and answer encoder, which marks the words in the context that are likely to appear in the question, where: the context encoder output attends to the answer encoder output to fuse answer information into a fusion matrix H, and a sequence label for the context is finally produced through a feedforward structure;
a second stage, in which the sequence labels produced in the first stage are vectorized and concatenated with the fusion matrix H as the encoder input, and a gated self-attention mechanism is applied to the encoder output to promote information fusion over long contexts; during decoding, both the encoder output and the first-stage answer encoder output are attended to, and a copy mechanism copies words from the context, finally obtaining the question generated by the question-guided two-stage process.
CN202010661187.5A 2019-11-27 2019-11-27 Two-stage problem generating system with problem as guide Active CN111813913B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010661187.5A CN111813913B (en) 2019-11-27 2019-11-27 Two-stage problem generating system with problem as guide

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010661187.5A CN111813913B (en) 2019-11-27 2019-11-27 Two-stage problem generating system with problem as guide
CN201911179784.8 2019-11-27

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201911179784.8 Division 2019-11-27 2019-11-27

Publications (2)

Publication Number Publication Date
CN111813913A true CN111813913A (en) 2020-10-23
CN111813913B CN111813913B (en) 2024-02-20

Family

ID=72846745

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010661187.5A Active CN111813913B (en) 2019-11-27 2019-11-27 Two-stage problem generating system with problem as guide

Country Status (1)

Country Link
CN (1) CN111813913B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380843A (en) * 2020-11-18 2021-02-19 神思电子技术股份有限公司 Random disturbance network-based open answer generation method
CN112668338A (en) * 2021-03-22 2021-04-16 中国人民解放军国防科技大学 Clarification problem generation method and device and electronic equipment
CN112819787A (en) * 2021-02-01 2021-05-18 清华大学深圳国际研究生院 Multi-light source prediction method
CN113128206A (en) * 2021-04-26 2021-07-16 中国科学技术大学 Question generation method based on word importance weighting
CN113268564A (en) * 2021-05-24 2021-08-17 平安科技(深圳)有限公司 Method, device and equipment for generating similar problems and storage medium
CN116681087A (en) * 2023-07-25 2023-09-01 云南师范大学 Automatic problem generation method based on multi-stage time sequence and semantic information enhancement

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180329883A1 (en) * 2017-05-15 2018-11-15 Thomson Reuters Global Resources Unlimited Company Neural paraphrase generator
US20180349359A1 (en) * 2017-05-19 2018-12-06 salesforce.com,inc. Natural language processing using a neural network
CN109657041A (en) * 2018-12-04 2019-04-19 南京理工大学 The problem of based on deep learning automatic generation method
CN109684452A (en) * 2018-12-25 2019-04-26 中科国力(镇江)智能技术有限公司 A kind of neural network problem generation method based on answer Yu answer location information

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180329883A1 (en) * 2017-05-15 2018-11-15 Thomson Reuters Global Resources Unlimited Company Neural paraphrase generator
US20180349359A1 (en) * 2017-05-19 2018-12-06 salesforce.com,inc. Natural language processing using a neural network
CN109657041A (en) * 2018-12-04 2019-04-19 南京理工大学 The problem of based on deep learning automatic generation method
CN109684452A (en) * 2018-12-25 2019-04-26 中科国力(镇江)智能技术有限公司 A kind of neural network problem generation method based on answer Yu answer location information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
任智慧; 徐浩煜; 封松林; 周晗; 施俊: "Sequence-labeling-based Chinese word segmentation using an LSTM network", 计算机应用研究 (Application Research of Computers), vol. 34, no. 5

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380843A (en) * 2020-11-18 2021-02-19 神思电子技术股份有限公司 Random disturbance network-based open answer generation method
CN112819787A (en) * 2021-02-01 2021-05-18 清华大学深圳国际研究生院 Multi-light source prediction method
CN112819787B (en) * 2021-02-01 2023-12-26 清华大学深圳国际研究生院 Multi-light source prediction method
CN112668338A (en) * 2021-03-22 2021-04-16 中国人民解放军国防科技大学 Clarification problem generation method and device and electronic equipment
US11475225B2 (en) 2021-03-22 2022-10-18 National University Of Defense Technology Method, system, electronic device and storage medium for clarification question generation
CN113128206A (en) * 2021-04-26 2021-07-16 中国科学技术大学 Question generation method based on word importance weighting
CN113268564A (en) * 2021-05-24 2021-08-17 平安科技(深圳)有限公司 Method, device and equipment for generating similar problems and storage medium
CN113268564B (en) * 2021-05-24 2023-07-21 平安科技(深圳)有限公司 Method, device, equipment and storage medium for generating similar problems
CN116681087A (en) * 2023-07-25 2023-09-01 云南师范大学 Automatic problem generation method based on multi-stage time sequence and semantic information enhancement
CN116681087B (en) * 2023-07-25 2023-10-10 云南师范大学 Automatic problem generation method based on multi-stage time sequence and semantic information enhancement

Also Published As

Publication number Publication date
CN111813913B (en) 2024-02-20

Similar Documents

Publication Publication Date Title
CN111813913B (en) Two-stage problem generating system with problem as guide
Logeswaran et al. Sentence ordering and coherence modeling using recurrent neural networks
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
Erdem et al. Neural natural language generation: A survey on multilinguality, multimodality, controllability and learning
CN111966812A (en) Automatic question answering method based on dynamic word vector and storage medium
Touati-Hamad et al. Arabic quran verses authentication using deep learning and word embeddings
Vitiugin et al. Emotion Detection for Spanish by Combining LASER Embeddings, Topic Information, and Offense Features.
Kondurkar et al. Modern Applications With a Focus on Training ChatGPT and GPT Models: Exploring Generative AI and NLP
Wang et al. Augmentation with projection: Towards an effective and efficient data augmentation paradigm for distillation
CN111813907A (en) Question and sentence intention identification method in natural language question-answering technology
Popattia et al. Guiding attention using partial-order relationships for image captioning
Khan et al. Pretrained natural language processing model for intent recognition (bert-ir)
Gormley Graphical models with structured factors, neural factors, and approximation-aware training
Aggarwal et al. GPTs at Factify 2022: Prompt aided fact-verification
Kreyssig Deep learning for user simulation in a dialogue system
Schick Few-shot learning with language models: Learning from instructions and contexts
Phade et al. Question Answering System for low resource language using Transfer Learning
Bensghaier et al. Investigating the Use of Different Recurrent Neural Networks for Natural Language Inference in Arabic
Xia Natural Language Understanding for Conversational Agents
Yolchuyeva Novel NLP Methods for Improved Text-To-Speech Synthesis
Shafiq et al. Enhancing Arabic Aspect-Based Sentiment Analysis Using End-to-End Model
Cao et al. Predict, pretrained, select and answer: Interpretable and scalable complex question answering over knowledge bases
Kulkarni et al. Deep Reinforcement-Based Conversational AI Agent in Healthcare System
Saeedi et al. Reusable Toolkit for Natural Language Processing in an Ambient Intelligence Environment
Sisodia Semantic Textual Similarity on Contracts: Exploring Multiple Negative Ranking Losses for Sentence Transformers

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant