CN112668344A

CN112668344A - Complexity-controllable diversified problem generation method based on hybrid expert model

Info

Publication number: CN112668344A
Application number: CN202110099300.XA
Authority: CN
Inventors: 毕胜; 程茜雅; 漆桂林
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2021-01-25
Filing date: 2021-01-25
Publication date: 2021-04-16
Anticipated expiration: 2041-01-25
Also published as: CN112668344B

Abstract

The invention discloses a complexity-controllable, diversified question generation method based on a hybrid expert (mixture-of-experts) model, mainly used to generate natural language questions that are relevant to a given text and meet a specified complexity requirement. Starting from the large volume of question data in existing question-answering datasets, the invention proposes a new question complexity evaluation scheme that combines 6 complexity evaluation indexes. Existing datasets are labeled for difficulty with this scheme and then used as the training, validation, and test sets of the model. A bidirectional LSTM network encodes the given text and the answer into semantic representations, which are concatenated. An LSTM network decodes the encoded result to generate a question. During decoding, hidden vectors model question templates of different complexity, guiding the generation of questions that meet the given complexity. A hybrid expert model selects different text content, so that different questions are generated and the diversity of question generation is improved.

Description

Complexity-controllable diversified problem generation method based on hybrid expert model

Technical Field

The invention belongs to the field of natural language processing and relates to a complexity-controllable, diversified question generation method based on a hybrid expert model.

Background

In recent years, with the rapid development of artificial intelligence, natural language processing technology has been applied more and more widely. Natural language processing is divided into two major parts: natural language understanding and natural language generation. Question Generation (QG) is a typical natural language generation task. It refers to automatically generating natural language questions from a range of data sources (e.g., text, pictures, knowledge bases). The task has broad application prospects. In human-computer interaction, chatbots (Siri, Microsoft XiaoIce, etc.) hold conversations with users by asking questions. In education, questions generated from course materials test students' level and reveal how well they have mastered the knowledge. In addition, as the dual task of automatic question answering (QA), QG can supply large-scale datasets for QA model training by generating many high-quality questions, thereby improving QA model performance.

Current work on question generation focuses mainly on reading comprehension: given a factual text and an answer, a natural language question is generated. Traditional QG methods rely on manually constructed rule templates combined with manual labeling, which consumes manpower and material resources, and the questions generated from templates lack naturalness and diversity. With the development of deep learning, research on sequence-to-sequence (Seq2Seq) models for text generation tasks such as machine translation has attracted the attention of researchers. End-to-end deep neural network models can effectively improve the naturalness and diversity of the generated questions and achieve good generation quality. However, current deep-learning-based QG methods mainly study the generation of simple questions, and little work studies the generation of complex questions. Generating complex questions has practical value. In education, for example, students differ in their ability to absorb knowledge, so simple questions alone cannot test a student's true level, and students with strong abilities need complex questions to give true feedback. In addition, the performance of existing QA systems on simple questions has reached a bottleneck, and complex questions are more helpful for improving QA systems. Most existing work cannot control the complexity of the generated questions, so research on generating questions of controlled complexity has practical value and application prospects.

Based on this, the present work proposes a complexity-controllable question generation model based on a hybrid expert model. Given a text, an answer, and a complexity index, the method generates diversified natural language questions that are relevant to the text, can be answered by the given answer, and meet the complexity requirement.

Disclosure of Invention

Technical problem: research on complexity-controllable question generation is lacking, and both complexity evaluation and complexity modeling are difficult to perform. To solve these technical problems, the invention provides a complexity-controllable, diversified question generation method based on a hybrid expert model.

Technical scheme: to solve the above technical problems, the invention adopts a complexity-controllable, diversified question generation method based on a hybrid expert model. Starting from the large volume of question data in existing question-answering datasets, the method proposes a new question complexity evaluation scheme that combines 6 complexity evaluation indexes. Existing datasets are labeled for difficulty with this scheme and used as the training, validation, and test sets of the proposed model. A bidirectional LSTM network encodes the given text and the answer into semantic representations, which are concatenated. An LSTM network decodes the encoded result to generate a question. During decoding, hidden vectors model the complexity factors to guide the generation of questions that meet the given complexity. A hybrid expert model selects different text content, so that different questions are generated and the diversity of question generation is improved.

The complexity-controllable diversity problem generation method based on the hybrid expert model comprises the following steps:

1) the characteristics of question-answering datasets are mined, and an adaptive question complexity measure is provided;

2) the question complexity measure is used to label the complexity of the data in existing datasets, which are then divided into a training set, a validation set, and a test set;

3) the given text and answer are encoded using a bidirectional LSTM network;

4) the encoded result is decoded using an LSTM network to generate a question;

5) during decoding, question templates of different complexity are modeled using hidden vectors, guiding the generation of questions that meet the given complexity;

6) different text content is selected using the hybrid expert model, so that different questions are generated and the diversity of question generation is improved.

As a further improvement of the invention, in the step 1),

the difficulty of a question is related not only to the question itself but also to the given text and to the interaction between the two. Therefore, the invention proposes five complexity influence factors covering three aspects (the question, the text, and the interaction between them) and designs an adaptive question complexity measure. The five factors are:

1) number of clauses in question

Because one clause represents one event, a question that contains multiple clauses involves the association of multiple events and is therefore harder to understand. The more clauses a question has, the more complex it is.

2) Number of modifiers in question

Each modifier in the question corresponds to one hop in the reasoning path to the answer, so the more modifiers there are, the harder it is to find the answer and the harder the question is to answer.

3) Degree of association of sentences in text

The higher the degree of association between sentences in the text, the more focused the topic of the text and the more complete its description of that topic. The text is then easier to understand, reasoning toward the answer is relatively easy, and the question is easy to answer. The invention uses the similarity of the topic distributions of the sentences in the text to express the degree of sentence association. A topic model is trained first, the topic distribution of each sentence is calculated, and the similarity of these topic distributions is then measured using the Kullback-Leibler divergence. The calculation method is as follows:

where $t_i$ and $t_j$ denote the topic distributions of the $i$-th and $j$-th sentences in the text, respectively, and $N$ is the number of sentences in the text. The higher the similarity of the sentence topic distributions, the higher the sentence association and the simpler the question.
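The formula itself is not reproduced in this text, so the following Python sketch shows only one plausible reading of the description: the Kullback-Leibler divergence averaged over all ordered sentence pairs. The averaging scheme, the smoothing constant `eps`, and the function names are assumptions made for illustration, not the patent's definition.

```python
import math
from itertools import permutations


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete topic distributions of equal length.

    eps is an assumed smoothing constant that avoids division by zero.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def mean_pairwise_kl(topic_dists):
    """Average KL divergence over all ordered sentence pairs (i, j), i != j."""
    n = len(topic_dists)
    if n < 2:
        return 0.0
    total = sum(kl_divergence(topic_dists[i], topic_dists[j])
                for i, j in permutations(range(n), 2))
    return total / (n * (n - 1))


# Illustrative topic distributions (made-up values, three topics per sentence).
focused = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.75, 0.15, 0.1]]
scattered = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
print(mean_pairwise_kl(focused) < mean_pairwise_kl(scattered))
```

Under this reading, a smaller average divergence means more similar topic distributions, that is, higher sentence association and a simpler question.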

4) Frequency of occurrence of entities in question in text

The higher the frequency of occurrence, the more content in the text describes these entities, the easier it is to find the answer, and thus the simpler the question. The invention uses the spaCy tool to identify the entities in the question and uses the following formula to calculate the frequency with which those entities appear in the text. To ensure that the value of this factor is positively correlated with complexity, the invention takes the reciprocal.

5) Average distance of entity in question and answer span in text

When searching for an answer, an entity in the question that lies close to the answer span is likely to be connected to it, so the answer is easier to find. Therefore, the average distance in the text between the entities in the question and the answer span is used as a complexity influence factor: the larger the distance, the harder the question is to answer and the more complex the question.

Because the values of the different complexity influence factors differ greatly in scale, which would distort the final complexity evaluation, the invention normalizes the value of each influence factor to eliminate the effect of overly large values. The complexity score cpx of each question is calculated as follows:

where $\omega_i$ is the weight of the $i$-th influence factor.
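The score formula is not reproduced in this text. As a hedged sketch of the idea described, the Python below normalizes each factor across a dataset and then takes a weighted sum. Min-max normalization, the example weights, and the factor values are all assumptions chosen for illustration; each factor is assumed to be oriented so that a larger value means higher complexity.

```python
def min_max_normalize(values):
    """Scale a list of raw factor values to [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def complexity_scores(factor_table, weights):
    """Weighted sum of normalized factors.

    factor_table[k][i] is the raw value of factor i for question k.
    weights[i] plays the role of the weight of the i-th influence factor.
    """
    columns = list(zip(*factor_table))
    normalized_columns = [min_max_normalize(list(col)) for col in columns]
    rows = zip(*normalized_columns)
    return [sum(w * f for w, f in zip(weights, row)) for row in rows]


# Made-up raw factor values for three questions and three factors.
factor_table = [
    [0, 0, 10.0],
    [1, 4, 19.67],
    [0, 2, 12.0],
]
weights = [0.5, 0.3, 0.2]  # hypothetical weights
scores = complexity_scores(factor_table, weights)
print(scores)
```

Normalizing per factor before weighting keeps a factor measured in tens of words from dominating a factor that only takes values such as 0 or 1.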

As a further improvement of the invention, in the step 5),

because similar questions have similar template structures, these templates can be used to guide question generation. Constructing question templates directly is time-consuming and labor-intensive, so the invention uses a hidden vector $\pi \in \{1, \dots, n_\pi\}$ as a memory module to model the template structure of questions. Whenever a $\pi$ is selected, its corresponding template is used to guide question generation. To generate questions of controllable complexity, that is, both simple questions and complex questions, the invention uses two hidden vectors, $\pi^{simple}$ and $\pi^{complex}$, which select the corresponding templates for questions of different complexity.

As a further improvement of the present invention, in the step 6),

in order to control the diversity of the selected question templates, the invention uses a hybrid expert model to select different templates and thereby model different text content, so that different questions are ultimately generated and the diversity of question generation is improved. Specifically, a hidden vector $z \in \{1, \dots, n_z\}$ is defined to represent a set of experts, each of which attends to a different question template.

Beneficial effects:

compared with other question generation methods, the method takes into account the influence of the text on question complexity and designs an adaptive, reasonable, and accurate complexity evaluation method. In addition, templates for questions of different complexity are modeled with hidden vectors, and these templates effectively guide the model to generate questions of controllable complexity. Finally, a hybrid expert model selects different text content according to the complexity level, which ensures the diversity of the generated questions.

Experimental analysis shows that the proposed complexity evaluation method matches the data characteristics of existing datasets and can evaluate the complexity of questions. In addition, the complexity-controllable, diversified question generation method based on the hybrid expert model can generate high-quality questions of controllable complexity, and the generated complex questions help improve the performance of question answering systems.

Drawings

FIG. 1 is an example of an implementation of the present invention, given a text and corresponding question-answer pair;

FIG. 2 is a schematic diagram of the basic process of the present invention;

FIG. 3 is a model framework diagram of the present invention.

Detailed Description

The invention is further described with reference to the following examples and the accompanying drawings.

The complexity-controllable, diversified question generation method based on a hybrid expert model comprises the following steps:

1) The characteristics of question-answering datasets are mined, five complexity influence factors are proposed covering three aspects (the question, the text, and the interaction between them), and an adaptive question complexity measure is designed. The five factors are:

a) Number of clauses in question

Because one clause represents one event, a question that contains multiple clauses involves the association of multiple events and is therefore harder to understand. The more clauses a question has, the more complex it is. As shown in FIG. 1, questions Q₁ and Q₂ contain no clause, while question Q₃ contains one clause, so question Q₃ is more complex than questions Q₂ and Q₁.

b) Number of modifiers in question

Each modifier in the question corresponds to one hop in the reasoning path to the answer, so the more modifiers there are, the harder it is to find the answer and the harder the question is to answer. As shown in FIG. 1, questions Q₁ and Q₂ contain no modifiers, while question Q₃ contains four modifiers ("American", "black", "comedy", and "thriller"), so question Q₃ is more complex than questions Q₂ and Q₁.

c) Degree of association of sentences in text

The higher the degree of association between sentences in the text, the more focused the topic of the text and the more complete its description of that topic. The text is then easier to understand, reasoning toward the answer is relatively easy, and the question is easy to answer. The invention uses the similarity of the topic distributions of the sentences in the text to express the degree of sentence association. A topic model is trained first, the topic distribution of each sentence is calculated, and the similarity of these topic distributions is then measured using the Kullback-Leibler divergence. The calculation method is as follows:

d) Frequency of occurrence of entities in question in text

In the formula for this factor, the symbols denote, respectively, the set of entities in the question, the size of that set, the set of entities in the text, and $n_{e_i}$, the number of times entity $e_i$ appears in the text. As shown in FIG. 1, the text Passage₂ contains six entities: "Irma Pamela Hall", "A Family Thing", "Soul Food", "The Ladykillers", "Joel", and "Ethan Coen". Question Q₂ contains one entity, "The Ladykillers", which appears twice in the text. Question Q₃ contains three entities, "Irma Pamela Hall", "Joel", and "Ethan Coen", each of which appears once in the text. Thus $n_{\text{Irma Pamela Hall}} = 1$, $n_{\text{Soul Food}} = 1$, $n_{\text{A Family Thing}} = 1$, $n_{\text{The Ladykillers}} = 2$, $n_{\text{Joel}} = 1$, and $n_{\text{Ethan Coen}} = 1$. Because the entities of question Q₃ appear less frequently in the text than the entity of question Q₂, question Q₃ is more complex than question Q₂.
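The frequency formula is not reproduced in this text, so the following Python sketch is only an assumed reading of it: the mean count of the question's entities, divided by the total number of entity mentions in the text, and then inverted so that a larger value means higher complexity. The function name and the exact normalization are hypothetical; the counts are the ones given for Passage₂ above.

```python
def entity_frequency_factor(question_entities, text_entity_counts):
    """Reciprocal of the relative frequency of question entities in the text.

    Assumed form: frequency = mean count of question entities / total mentions.
    The reciprocal is returned directly as total mentions / mean count.
    """
    total_mentions = sum(text_entity_counts.values())
    mean_count = (sum(text_entity_counts.get(e, 0) for e in question_entities)
                  / len(question_entities))
    if mean_count == 0:
        return float("inf")  # no question entity appears in the text
    return total_mentions / mean_count


# Entity counts for Passage 2 as listed in the description of FIG. 1.
text_entity_counts = {
    "Irma Pamela Hall": 1,
    "Soul Food": 1,
    "A Family Thing": 1,
    "The Ladykillers": 2,
    "Joel": 1,
    "Ethan Coen": 1,
}
q2 = entity_frequency_factor(["The Ladykillers"], text_entity_counts)
q3 = entity_frequency_factor(["Irma Pamela Hall", "Joel", "Ethan Coen"],
                             text_entity_counts)
print(q2, q3)  # prints 3.5 7.0
```

Under this assumed form, Q₃ receives the larger value, which agrees with the ordering stated in the description.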

e) Average distance of entity in question and answer span in text

When searching for an answer, an entity in the question that lies close to the answer span is likely to be connected to it, so the answer is easier to find. Therefore, the average distance in the text between the entities in the question and the answer span is used as a complexity influence factor: the larger the distance, the harder the question is to answer and the more complex the question. As shown in FIG. 1, in the text Passage₂, the entity "The Ladykillers" in question Q₂ is 10 words from the answer "Joel and Ethan Coen", while the entities "Joel", "Ethan Coen", and "Irma Pamela Hall" in question Q₃ are 10, 12, and 37 words, respectively, from the answer "The Ladykillers". Therefore, question Q₃ is more complex than question Q₂.
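The average-distance factor can be sketched in Python as follows. How the patent counts distance is not stated, so measuring the gap in tokens between two spans is an assumption, and the token positions below are hypothetical values chosen only to reproduce the distances 10, 12, and 37 quoted for FIG. 1.

```python
def span_distance(span_a, span_b):
    """Number of tokens between two (start, end) spans, end exclusive.

    Returns 0 when the spans overlap. This gap-based definition is an assumption.
    """
    (a_start, a_end), (b_start, b_end) = span_a, span_b
    if a_end <= b_start:
        return b_start - a_end
    if b_end <= a_start:
        return a_start - b_end
    return 0


def average_entity_answer_distance(entity_spans, answer_span):
    """Mean distance between each question entity and the answer span."""
    distances = [span_distance(span, answer_span) for span in entity_spans]
    return sum(distances) / len(distances)


# Hypothetical token positions in Passage 2.
ladykillers = (50, 52)
joel_and_ethan_coen = (62, 66)
joel, ethan_coen, irma_pamela_hall = (62, 63), (64, 66), (10, 13)

q2_distance = average_entity_answer_distance([ladykillers], joel_and_ethan_coen)
q3_distance = average_entity_answer_distance(
    [joel, ethan_coen, irma_pamela_hall], ladykillers)
print(q2_distance, round(q3_distance, 2))  # prints 10.0 19.67
```

With the quoted distances, Q₂ averages 10 words and Q₃ averages (10 + 12 + 37) / 3, about 19.67 words, so Q₃ scores as more complex on this factor.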

2) The question complexity measure is used to label the complexity of the data in existing datasets, which are then divided into a training set, a validation set, and a test set. The invention uses HotpotQA and SQuAD as experimental datasets. HotpotQA labels its questions with three levels of complexity: simple, medium, and difficult. For ease of experimentation, the invention repartitions the questions in HotpotQA into two classes, simple and complex. Specifically, the complexity of each question in the dataset is calculated using the complexity measure of step 1), and the maximum complexity value among the questions originally labeled "simple" is taken as the threshold. Any other question whose complexity value exceeds the threshold is considered complex; otherwise it is considered simple. The questions in HotpotQA and SQuAD are divided into the two categories, simple and complex, according to this threshold. The training, validation, and test sets are then divided in the ratio 8:1:1. The specific information is shown in Table 1 below:

TABLE 1 data information in HotpotQA and SQuAD
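The relabeling and splitting procedure described above can be sketched in Python. The function names, the shuffling step, and the sample scores are assumptions for illustration; the patent states only the threshold rule and the 8:1:1 ratio.

```python
import random


def relabel_by_threshold(scored_questions):
    """Relabel questions as "simple" or "complex".

    scored_questions is a list of (complexity_score, original_label) pairs.
    The threshold is the maximum score among questions originally labeled
    "simple"; anything scoring above it is treated as complex.
    """
    threshold = max(score for score, label in scored_questions
                    if label == "simple")
    labels = ["complex" if score > threshold else "simple"
              for score, _ in scored_questions]
    return threshold, labels


def split_8_1_1(items, seed=0):
    """Shuffle (an assumed step) and split into train/validation/test at 8:1:1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])


# Made-up scores with HotpotQA-style original labels.
scored = [(0.2, "simple"), (0.4, "simple"), (0.3, "medium"), (0.9, "difficult")]
threshold, labels = relabel_by_threshold(scored)
train, valid, test = split_8_1_1(list(range(100)))
print(threshold, labels, len(train), len(valid), len(test))
```

Once the threshold has been computed from the labeled data, the same comparison can be applied to questions that carry no original complexity label.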

3) A bidirectional LSTM network is used to encode the given text $X = (x_1, x_2, \dots, x_n)$ and the answer $a$. The formula is as follows:

The semantic representation $e_a$ of the answer is obtained in the same way.

4) The encoded result is decoded using an LSTM network to generate a question. A model diagram of the invention is shown in FIG. 3. To generate an interrogative word that matches the answer type, the invention uses the semantic representation $e_a$ of the answer as the initial state of the decoder. At each time step of decoding, the word $y_{t-1}$ generated at the previous time step, the hidden state $s_{t-1}$, the semantic vector of the text at the current time step, the selected question template, and the expert vector $e_z$ are used to update the hidden state $s_t$ of the current time step. The calculation formula is as follows:

The semantic vector of the text at the current time step is computed by an attention mechanism. The calculation formula is as follows:

$$e_{t,i} = W_e \tanh(W_s s_{t-1} + W_h h_i)$$
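Only the attention score survives in this text. The Python sketch below computes that score and then, as an assumption based on standard additive attention, applies a softmax over the scores and takes the weighted sum of the encoder states $h_i$ to form the semantic (context) vector. The softmax and weighted-sum steps, the parameter shapes, and the toy values are not taken from the patent.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def attention_context(s_prev, encoder_states, W_s, W_h, w_e):
    """Additive attention: score, softmax (assumed), weighted sum (assumed)."""
    projected_state = matvec(W_s, s_prev)
    scores = []
    for h_i in encoder_states:
        projected_h = matvec(W_h, h_i)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        # e_{t,i} = W_e tanh(W_s s_{t-1} + W_h h_i), with W_e as a row vector
        scores.append(sum(w * x for w, x in zip(w_e, hidden)))
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context


# Toy parameters: 2-dimensional states, made-up values.
W_s = [[0.5, 0.0], [0.0, 0.5]]
W_h = [[1.0, 0.0], [0.0, 1.0]]
w_e = [1.0, -1.0]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = attention_context([0.2, 0.1], encoder_states, W_s, W_h, w_e)
print(weights, context)
```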

To address the out-of-vocabulary problem during generation, the invention also uses a copy mechanism that allows the decoder to choose between generating a new word from the vocabulary and copying a word from the input source text. The final generation probability is calculated as follows:
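The final probability formula is not reproduced in this text. The Python sketch below assumes a pointer-generator style mixture, in which a gate `p_gen` weights the vocabulary distribution and the remaining mass is distributed over source tokens according to the attention weights. The gate, the names, and the numbers are illustrative assumptions rather than the patent's exact formulation.

```python
def final_distribution(p_gen, vocab_probs, attention_weights, source_tokens):
    """Mix generating from the vocabulary with copying from the source text.

    Assumed form: P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of the
    attention weights at the source positions where token w occurs.
    """
    final = {word: p_gen * prob for word, prob in vocab_probs.items()}
    for weight, token in zip(attention_weights, source_tokens):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


# Made-up values: "Ladykillers" is outside the vocabulary but can be copied.
vocab_probs = {"who": 0.6, "directed": 0.3, "the": 0.1}
source_tokens = ["Ladykillers", "the"]
attention_weights = [0.7, 0.3]
dist = final_distribution(0.8, vocab_probs, attention_weights, source_tokens)
print(dist)
```

In this sketch the out-of-vocabulary token receives probability mass only through the copy term, which is what lets the decoder emit words such as rare entity names.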

5) During decoding, question templates of different complexity are modeled using hidden vectors to obtain a template representation, thereby guiding the generation of questions that meet the given complexity. The calculation formula is as follows:

where $d$ is the given complexity level and $p(\pi_i \mid X, a, d, z)$ is a learned prior distribution, which can be obtained by training a gating network.
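The template formula is not reproduced in this text. One way to picture the role of the learned prior is the Python sketch below, which assumes that a gating network emits one logit per template, that the logits are turned into probabilities by a softmax, and that the template representation is the probability-weighted sum of template embeddings belonging to the requested complexity level. The soft mixture, the `template_bank` structure, and all values are hypothetical.

```python
import math


def softmax(logits):
    """Convert gating logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def template_representation(gate_logits, template_bank, complexity):
    """Probability-weighted mixture of template embeddings (assumed form).

    template_bank maps a complexity level ("simple" or "complex") to a list
    of template embedding vectors; gate_logits has one entry per template.
    """
    embeddings = template_bank[complexity]
    probs = softmax(gate_logits)
    dim = len(embeddings[0])
    return [sum(p * emb[d] for p, emb in zip(probs, embeddings))
            for d in range(dim)]


# Hypothetical template embeddings, two templates per complexity level.
template_bank = {
    "simple": [[1.0, 0.0], [0.0, 1.0]],
    "complex": [[2.0, 2.0], [4.0, 0.0]],
}
rep_simple = template_representation([0.0, 0.0], template_bank, "simple")
rep_complex = template_representation([0.0, 0.0], template_bank, "complex")
print(rep_simple, rep_complex)  # prints [0.5, 0.5] [3.0, 1.0]
```

A hard selection of the most probable template would be an equally plausible reading; the soft mixture is used here only because it keeps the example short.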

6) Different text content is selected using the hybrid expert model, so that different questions are generated and the diversity of question generation is improved. This is embodied in step 4), where the expert vector $e_z$ guides the generation of the decoding states.

Compared with other question generation methods, the method provided by the invention takes into account the influence of the text on question complexity and designs an adaptive, reasonable, and accurate complexity evaluation method. In addition, templates for questions of different complexity are modeled with hidden vectors, and these templates effectively guide the model to generate questions of controllable complexity. Finally, a hybrid expert model selects different text content according to the complexity level, which ensures the diversity of the generated questions.

The above examples are only preferred embodiments of the present invention, it should be noted that: it will be apparent to those skilled in the art that various modifications and equivalents can be made without departing from the spirit of the invention, and it is intended that all such modifications and equivalents fall within the scope of the invention as defined in the claims.

Claims

1. A complexity-controllable diversified problem generation method based on a hybrid expert model is characterized by comprising the following steps:

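1) mining the characteristics of question-answering datasets and providing an adaptive question complexity measure;

2) labeling the complexity of the data in existing datasets with the question complexity measure and dividing the data into a training set, a validation set, and a test set;

3) encoding the given text and answer using a bidirectional LSTM network;

4) decoding the encoded result using an LSTM network to generate a question;

5) during decoding, modeling question templates of different complexity using hidden vectors, thereby guiding the generation of questions that meet the given complexity;

6) selecting different text content using the hybrid expert model, so that different questions are generated and the diversity of question generation is improved.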

2. The hybrid expert model-based complexity-controlled diversity problem generation method according to claim 1, wherein in the step 1),

five complexity influence factors are provided from three aspects of problems, texts and interaction between the problems and the texts, and an adaptive problem complexity measuring method is designed, wherein the five factors comprise:

1) number of clauses in question

The more clauses, the more complex the question;

2) number of modifiers in question

The more modifiers, the more difficult the question is to answer;

3) degree of association of sentences in text

The higher the association, the easier the question is to answer; the similarity of the topic distributions of the sentences in the text is used to express the degree of sentence association: a topic model is trained first, the topic distribution of each sentence is calculated, and the similarity of the topic distributions is then measured using the Kullback-Leibler divergence, calculated as follows:

where $t_i$ and $t_j$ denote the topic distributions of the $i$-th and $j$-th sentences in the text, respectively, and $N$ is the number of sentences in the text; the higher the similarity of the sentence topic distributions, the higher the sentence association and the simpler the question;

4) frequency of occurrence of entities in question in text

The higher the frequency of occurrence, the simpler the question; the spaCy tool is used to identify the entities in the question, and the following formula is used to calculate the frequency with which those entities appear in the text; to ensure that the value of this factor is positively correlated with complexity, a reciprocal operation is used;

5) average distance of entity in question and answer span in text

The smaller the distance, the easier it is to find the answer; the average distance in the text between the entities in the question and the answer span is calculated as a complexity influence factor, and the larger the distance, the more complex the question;

the values of the influence factors are normalized to eliminate the effect of overly large values, and the complexity score cpx of each question is finally calculated as follows:

where $\omega_i$ is the weight of the $i$-th influence factor.

3. The hybrid expert model-based complexity-controlled diversity problem generation method according to claim 2, wherein in the step 5),

a hidden vector $\pi \in \{1, \dots, n_\pi\}$ is used as a memory module to model the template structure of questions, and whenever a $\pi$ is selected, its corresponding template is used to guide question generation; two hidden vectors, $\pi^{simple}$ and $\pi^{complex}$, are used to select the corresponding templates for questions of different complexity.

4. The hybrid expert model-based complexity-controlled diversity problem generation method according to claim 2, wherein in the step 6),

different templates are selected using a hybrid expert model to model different text content, so that different questions are ultimately generated and the diversity of question generation is improved; specifically, a hidden vector $z \in \{1, \dots, n_z\}$ is defined to represent a set of experts, each of which attends to a different question template.