CN111061851B

CN111061851B - Question generation method and system based on given facts

Info

Publication number: CN111061851B
Application number: CN201911276552.4A
Authority: CN
Inventors: 刘康; 何世柱; 赵军; 刘操
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2019-12-12
Filing date: 2019-12-12
Publication date: 2023-08-08
Anticipated expiration: 2039-12-12
Also published as: CN111061851A

Abstract

The invention relates to a question generation method and a question generation system based on given facts, wherein the question generation method comprises the following steps: acquiring historical reference data, wherein the historical reference data comprises historical input information of a plurality of different users; expanding each history input information to obtain a corresponding context representation; establishing a question generation model according to each piece of input information and the corresponding context representation; and determining a question sequence corresponding to the current input information according to the current input information of the current user based on the question generation model. According to the invention, a question generation model is established through historical reference data; the question generation model can be based on the question generation model, and the question sequence corresponding to the current input information can be accurately determined according to a small amount of current input information given by the current user.

Description

Question generation method and system based on given facts

Technical Field

The invention relates to the technical field of natural language processing, in particular to a question generation method and a question generation system based on given facts.

Background

With the vigorous development of the internet and the increasing popularity of network communication terminals, people are exposed to massive amounts of information related to various fields every day. Knowledge base questions and answers can help people to quickly acquire knowledge from massive information, so that the learning cost of human beings is reduced. However, knowledge base question-answering is severely dependent on manual annotation data, and the annotation data of question-answering pairs (pairs) becomes bottleneck resources for restricting the development of question-sentence technology and question-answering systems, so that question-sentence generation can effectively solve the problem.

The task of question generation mainly automatically generates questions from given answers and auxiliary information thereof. The given answer and ancillary information may be in plain text form or may be a structured knowledge base. The question generation has the following purposes: 1. automatically constructing data resources of questions and answers, or reducing the workload of manually marking question and answer pairs; 2. the method is used for data enhancement and improves the performance of a question-answering system; 3. as a typical text generation task, development and progress of text generation technology can be facilitated.

However, the conventional question generation method easily generates a question whose predicate is not matched, such as given input < statue of liberty, position, new york city > in table 1, may generate a question whose given predicate cannot be expressed like Q1 (who authored statue of liberty.

TABLE 1

Disclosure of Invention

In order to solve the above problems in the prior art, that is, to solve the problem of accurately determining based on a small number of given facts, the present invention provides a question generation method and system based on the given facts.

In order to solve the technical problems, the invention provides the following scheme:

a question generation method based on given facts, the question generation method comprising:

acquiring historical reference data, wherein the historical reference data comprises historical input information of a plurality of different users;

expanding each history input information to obtain a corresponding context representation;

establishing a question generation model according to each piece of input information and the corresponding context representation;

and determining a question sequence corresponding to the current input information according to the current input information of the current user based on the question generation model.

Optionally, the history reference data further includes a plurality of pieces of supervision information, and each piece of supervision information includes a manually noted question and a reference answer corresponding to the history input information;

the question generation method further comprises the following steps:

and correcting the question generation model according to the supervision information to obtain a corrected question generation model.

Optionally, the correcting the question generation model according to the supervision information to obtain a corrected question generation model specifically includes:

based on the question generation model, determining a corresponding historical question sequence according to each historical input information;

according to each history question sequence and the corresponding manual labeling question, calculating and generating question loss

Calculating auxiliary answer loss according to each history question sequence and the corresponding reference answer

Wherein each reference answer includes answer type words corresponding to the history input information, the history question sequence includes generated words corresponding to the history input information,is a set of answer type words, and A represents the answer type words in the set of answer type wordsQuantity of->Is a generated word y in a question sequence _t And the corresponding answer type word a _n Is a loss of (2);

according to the generated question lossAuxiliary answer loss->Determining supervisory information loss

Wherein λ represents a reference coefficient;

according to the supervision information lossAnd correcting the question generation model to obtain a corrected question generation model.

Optionally, the format of the history input information of the different users is a head entity-relation-tail entity;

the expanding of each history input information to obtain a corresponding context representation specifically comprises:

for the head entity and/or the tail entity, using the type information in the knowledge base as a context representation of the head entity and/or the tail entity;

for a relationship, at least one of a sentence corresponding to a domain, a value domain, a topic, and a distance supervision back label in a knowledge base is used as a contextual representation of the relationship.

Optionally, when the knowledge base has multiple types of information, the most frequently used and most differentiated type is selected as the context representation of the head entity and/or the tail entity.

Optionally, the establishing a question generation model according to each input information and the corresponding context representation specifically includes:

for each pair of input information and corresponding context representation,

training the input information to obtain training information;

based on a first sequence model, obtaining a representation sequence according to the context representation;

fusing the training information and the representation sequence to obtain fusion information;

coding the fusion information to obtain a hidden layer sequence;

and decoding each hidden layer sequence, and calculating to obtain a corresponding decoding sequence function, wherein the decoding sequence function is a question generation model.

Optionally, decoding each hidden layer sequence, and calculating to obtain a corresponding decoding sequence function, which specifically includes:

decoding each hidden layer state sequence based on a second sequence model to obtain decoding information;

according to the decoding information, respectively calculating the knowledge base copy mode probability of the names corresponding to the copy history input information in the knowledge base, the context copy mode probability of the copy context representation and the vocabulary generation mode probability of the words generated from the vocabulary;

according to the knowledge base copy mode probability p _cpkb Context copy mode probability p _cptx Vocabulary generation pattern probability p _genv Calculating the predictive probability P (y) _t |s _t ，y _t-1 ，F，C)：

Wherein genev, cpkb and cpctx represent vocabulary generation mode, knowledge base replication mode and context replication mode, respectively, p.represents three different speciesProbability of pattern, P (x) represents probability of generating target word in various patterns, F and C represent input information and context, s, respectively _t Representing the current decoding status, y _t Representing words generated at the current moment;

according to the predictive probability P (y) _t |s _t ，y _t-1 F, C), decoding word by word to obtain a decoded question sequence function.

In order to solve the technical problems, the invention also provides the following scheme:

a question generation system based on a given fact, the question generation system comprising:

the acquisition unit is used for acquiring history reference data, wherein the history reference data comprises a plurality of pieces of history input information of different users;

the expansion unit is used for expanding each history input information to obtain a corresponding context representation;

the modeling unit is used for establishing a question generation model according to the input information and the corresponding context representation;

and the determining unit is used for determining a question sequence corresponding to the current input information according to the current input information of the current user based on the question generation model.

a question generation system based on given facts, comprising:

a processor; and

a memory arranged to store computer executable instructions that, when executed, cause the processor to:

a computer-readable storage medium storing one or more programs that, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to:

According to the embodiment of the invention, the following technical effects are disclosed:

according to the invention, a question generation model is established through historical reference data; the question generation model can be based on the question generation model, and the question sequence corresponding to the current input information can be accurately determined according to a small amount of current input information given by the current user.

Drawings

FIG. 1 is a flow chart of a question generation method based on a given fact of the present invention;

FIG. 2 is a schematic diagram of answer-assisted supervision

Fig. 3 is a schematic block diagram of a question generation system based on given facts of the present invention.

Symbol description:

the system comprises an acquisition unit-1, an expansion unit-2, a modeling unit-3 and a determination unit-4.

Detailed Description

Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present invention, and are not intended to limit the scope of the present invention.

The invention aims to provide a question generation method based on given facts, which establishes a question generation model through historical reference data; the question generation model can be based on the question generation model, and the question sequence corresponding to the current input information can be accurately determined according to a small amount of current input information given by the current user.

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.

As shown in fig. 1, the question generation method based on the given facts of the present invention includes:

step 100: acquiring historical reference data, wherein the historical reference data comprises historical input information of a plurality of different users;

step 200: expanding each history input information to obtain a corresponding context representation;

step 300: establishing a question generation model according to each piece of input information and the corresponding context representation;

step 400: and determining a question sequence corresponding to the current input information according to the current input information of the current user based on the question generation model.

Further, the history reference data further comprises a plurality of pieces of supervision information, and each piece of supervision information comprises a manually marked question and a reference answer corresponding to the history input information.

In order to improve the determination accuracy, the question generation method based on the given facts further comprises the following steps:

The traditional method only has supervision information of question ends, namely, the generated question and the manually marked question are compared, the difference between the questions is taken as loss, and the training is carried out by using an optimizer, so that the ambiguous question, namely, one question corresponds to a plurality of correct answers, can be easily obtained. For this purpose, in addition to manually annotating question sentences as supervision information, answer information is also utilized as auxiliary supervision, with the goal that the generated question sentences contain any one answer type word.

Further, the step of correcting the question generation model according to the supervision information to obtain a corrected question generation model specifically includes:

step S1: based on the question generation model, corresponding historical question sequences are determined according to each historical input information.

Step S2: according to each history question sequence and the corresponding manual labeling question, calculating and generating question loss

Step S3: calculating auxiliary answer loss according to each history question sequence and the corresponding reference answer(as shown in fig. 2):

wherein each reference answer includes answer type words corresponding to the history input information, the history question sequence includes generated words corresponding to the history input information,is a set of answer type words (e.g. type words of new york city include cities, administrative areas, etc.), |a| represents the number of answer type words in the set of answer type words, ++>Is a generated word y in a question sequence _t And the corresponding answer type word a _n Is a loss of (2). Wherein, auxiliary answer is lost->Is the minimum of two calculations.

Step S4: according to the generated question lossAuxiliary answer loss->Determining supervision information loss->

Where λ represents a reference coefficient.

Step S5: according to the supervision information lossAnd correcting the question generation model to obtain a corrected question generation model.

In this embodiment, the format of the history input information of the different users is a head entity (subject) -relationship (predicate) -tail entity (object).

Further, in step 200, the expanding each history input information to obtain a corresponding context representation specifically includes:

Wherein when there are a plurality of types of information of the knowledge base, the most frequently used and most differentiated type is selected as the context representation of the head entity and/or the tail entity.

In step 300, a question generation model is built according to each input information and the corresponding context representation, which specifically includes:

for each pair of input information and corresponding context representation,

training the input information to obtain training information;

coding the fusion information to obtain a hidden layer sequence;

For the fact of input symbolization, a knowledge base such as TransE (transformation) can be used for representing a learning method to pretrain on a large-scale corpus, and the training can be carried out along with the first sequence model by random initialization. For context information, the modeling may be performed with a first sequence model, which may be a recurrent neural network (RNN, recurrent Neural Networks), a gated loop unit (GRU, gated Recurrent Unit), a long short term memory network (LSTM, long Short Term Memory), a transducer model, and the like. Thus, each element in the reference information has both a symbolic representation and a contextual representation, and the fusion can be performed by Gate, and finally, the fusion information can be encoded into a hidden layer sequence (e.g., H _f ＝[h ^s ；h ^p ；h ^o ])。

Further, decoding each hidden layer sequence, and calculating to obtain a corresponding decoding sequence function, which specifically includes:

according to the decoding information, respectively calculating the knowledge base copy mode probability of the name corresponding to the copy history input information in the knowledge base, the context copy mode probability of the copy context representation and the vocabulary generation mode probability of generating words from the vocabulary;

according to the knowledge base copy mode probability p _cpkb Context copy mode probability p _cpctx Vocabulary generation pattern probability p _genv Calculating the predictive probability P (y) _t |s _t ，y _t-1 ，F，C)：

Wherein genev, cpkb and cpctx represent vocabulary generation mode, knowledge base replication mode and context replication mode, respectively, p. represent probabilities of three different modes, P (x) represents probabilities of generating words in various modes, F and C represent input information and context, s, respectively _t Representing the current decoding status, y _t Representing words generated at the current moment;

according to the predictive probability P (y) _t |s _t ，y _t-1 F, C), decoding the words to obtain a question sequence function.

Likewise, the decoder may decode with a second sequence model, which may be a recurrent neural network (RNN, recurrent Neural Networks), a gated loop unit (GRU, gated Recurrent Unit), a long and short term memory network (LSTM, long Short Term Memory), a transducer model, and the like. In the decoding process, in order to better capture the input information, several copy forms can be selectively adopted: 1. considering that the head entity often appears in the generated question sentence, so that the name corresponding to the head entity symbol of the copy knowledge base; context of copy extension it is noted that the input context may have many repeated words, so the mechanism of using maxout pointers, i.e. when multiple identical token (words) appear, takes the token with the highest copy score as the score of the token in copy mode instead of adding. Finally, there are two different copy forms and three modes of token generation from the word list, the weighted sum of the three modes is used as the final token selection probability, then word-by-word decoding is carried out according to the prediction probability of the target word to obtain a question generation model, the question generation model is further corrected to obtain a corrected question generation model, and the question sequence information can be accurately determined based on a small number of given facts according to the corrected question generation model.

The effectiveness of the invention is verified by the following experiments:

test corpus

SimpleQuestions: the current rule models the largest knowledge base question-answer dataset.

The comparison method comprises the following steps:

template: template generation question

Serpan et al (2016): sequence-to-Sequence model generating questions

Elsahar et al (2018): generating questions by introducing a single context

The experimental results (as in Table 2)

TABLE 2

Overall performance comparison: the effectiveness of the present invention is illustrated by comparing the effects of the existing method and the present invention. The performance of the invention is obviously better than that of the standard method, and the performance is continuously improved by adding answer auxiliary supervision (the last line).

TABLE 3 Table 3

Predicate override comparison (as shown in table 3): the predicate of a given input is evaluated manually to determine whether the given predicate is correctly expressed in the generated question, and the best performance of the invention is found by calculating the ratio of correctly expressing the given predicate (predicate coverage Predicate Identification).

TABLE 4 Table 4

Answer coverage comparison: the invention further defines an evaluation index Ans of the proportion of answer type words contained in the generated question _cov For evaluating the degree of certainty of generating answers corresponding to questions. The weight lambda of the answer-assisted supervision information is adjusted. When answer auxiliary supervision is found, BLEU score is higher, ans _cov The lifting is more obvious.

In addition, the invention also provides a question generation system based on the given facts, which can accurately determine the problems based on a small number of given facts.

As shown in fig. 3, the question generation system based on the given facts of the present invention includes: an acquisition unit 1, an expansion unit 2, a modeling unit 3 and a determination unit 4.

Specifically, the acquiring unit 1 is configured to acquire history reference data, where the history reference data includes history input information of a plurality of different users;

the expansion unit 2 is used for expanding each history input information to obtain a corresponding context representation;

the modeling unit 3 is configured to establish a question generation model according to each piece of input information and the corresponding context representation;

the determining unit 4 is configured to determine, based on the question generation model, a question sequence corresponding to current input information of a current user according to the current input information.

Further, the invention also provides a question generation system based on given facts, which comprises the following steps:

a processor; and

The present invention also provides a computer-readable storage medium storing one or more programs that, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to:

Compared with the prior art, the question generation system and the computer readable storage medium based on the given facts have the same beneficial effects as the question generation method based on the given facts, and are not repeated here.

Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will fall within the scope of the present invention.

Claims

1. A question generation method based on given facts, characterized in that the question generation method comprises:

acquiring historical reference data, wherein the historical reference data comprises historical input information of a plurality of different users; the history reference data also comprises a plurality of pieces of supervision information, and each piece of supervision information comprises a manually marked question corresponding to the history input information and a reference answer;

the correction method of the question generation model further comprises the following steps:

correcting the question generation model according to the supervision information to obtain a corrected question generation model:

Wherein each reference answer includes answer type words corresponding to the history input information, the history question sequence includes generated words corresponding to the history input information,is a set of answer type words, |a| represents the number of answer type words in the set of answer type words, |a| is +.>Is a generated word y in a question sequence _t And corresponding answer classWord a _n Is a loss of (2);

according to the generated question lossAuxiliary answer loss->Determining supervision information loss->

Wherein λ represents a reference coefficient;

according to the supervision information lossCorrecting the question generation model to obtain a corrected question generation model;

2. The question generation method based on given facts according to claim 1, wherein the format of the history input information of the different users is a head entity-relation-tail entity;

3. A question generation method based on given facts according to claim 2, characterized in that when there are a plurality of types of information of said knowledge base, the most frequently used and most differentiated type is selected as a context representation of the head entity and/or the tail entity.

4. The question generation method based on given facts according to claim 1, wherein the creating a question generation model according to each of the input information and the corresponding context representation specifically comprises:

for each pair of input information and corresponding context representation,

training the input information to obtain training information;

coding the fusion information to obtain a hidden layer sequence;

5. The question generation method based on given facts according to claim 4, wherein said decoding each hidden layer sequence, calculating a corresponding decoding sequence function, specifically includes:

according to the knowledge base copy mode probability p _cpkb Context copy modeProbability p _cpctx Vocabulary generation pattern probability p _genυ Calculating the predictive probability P (y) _t |s _t ，y _t-1 ，F，C)：

P(y _t |s _t ，y _t-1 ，F，C)＝p _genυ P _genυ (y _t |s _t ，V)+p _cpkb P _cpkb (y _t |s _t ，F)+p _cpctx P _cpctx (y _t |s _t ，C)；

according to the predictive probability P (y) _t |s _t ，y _t-1 F, C), decoding word by word to obtain a question sequence function.

6. A question generation system based on a given fact, the question generation system comprising:

the acquisition unit is used for acquiring history reference data, wherein the history reference data comprises a plurality of pieces of history input information of different users; the history reference data also comprises a plurality of pieces of supervision information, and each piece of supervision information comprises a manually marked question corresponding to the history input information and a reference answer;

Wherein each reference answer includes answer type words corresponding to the history input information, the history question sequence includes generated words corresponding to the history input information,is a set of answer type words, |a| represents the number of answer type words in the set of answer type words, |a| is +.>Is a generated word y in a question sequence _t And the corresponding answer type word a _n Is a loss of (2);

Wherein λ represents a reference coefficient;

7. A question generation system based on given facts, comprising:

a processor; and

according to each history question sequence and the corresponding manual labeling question, calculating and generating a questionSentence loss

Wherein λ represents a reference coefficient;

8. A computer-readable storage medium storing one or more programs that, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to:

According to each history question sequence and correspondingTo calculate the auxiliary answer loss

Wherein λ represents a reference coefficient;