CN109726274B - Question generation method, device and storage medium - Google Patents
- Publication number: CN109726274B (application CN201811641895.1A)
- Authority
- CN
- China
- Prior art keywords
- question
- text
- document
- processed
- model
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Description
Technical Field
The present invention relates to the field of information technology, and in particular to a question generation method, a question generation apparatus, and a computer-readable storage medium.
Background
An FAQ (Frequently Asked Questions) system is currently the main means of providing online help on the web: a set of likely frequently asked question-answer pairs is organized in advance and published on a web page to provide users with a consulting service.
Prior-art FAQ implementations mainly include the following:
(1) General-purpose question answering systems, which provide retrieval-based or knowledge-based question answering services.
(2) Customized retrieval, which segments and tokenizes document content to build an index; alternatively, question-answer pairs are obtained through document structuring or manual screening.
(3) Question retrieval based on word matching or synonym matching.
The deficiencies of the prior art mainly include the following:
(1) Retrieval-based or knowledge-based general-purpose question answering systems cannot meet customization needs.
(2) For approaches that implement question answering by indexing document content: first, not all of the content is question-and-answer content, so storing the entire text wastes storage space; second, the accuracy of the questions generated this way is low, because a word hit does not mean that the current content is the answer; in addition, such approaches cannot determine answer boundaries and cannot produce a visualized FAQ document. Here, "visualized" means reading and deeply comprehending the text content so as to extract a number of question-answer pairs, making it convenient for users to look up a question and retrieve its answer. Existing techniques cannot deeply understand a passage or generate good questions for a text.
(3) Question retrieval based on synonym matching or word matching has poor generalization ability and a low recall rate.
Summary of the Invention
Embodiments of the present invention provide a question generation method, apparatus, and computer-readable storage medium, so as to solve at least one or more technical problems in the prior art.
In a first aspect, an embodiment of the present invention provides a question generation method, including:
identifying the text type of a document to be processed according to its text structure;
selecting a generation model corresponding to the text type, where the generation model includes at least one of an explicit question generation model, a structured and semi-structured question generation model, and a natural language question generation model; and
generating a question for the document to be processed by using the selected generation model.
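The three steps above can be sketched as a structure-based dispatcher. The detection heuristics below (a regex for explicit question markers, a check for heading or table markup) are illustrative assumptions for the sketch, not the patent's actual classifiers:

```python
import re

# Illustrative text-type detector and model dispatcher (a sketch, not
# the patent's actual implementation).
def detect_text_type(text: str) -> str:
    # Q/A structure: lines that look like explicit question/answer pairs.
    if re.search(r"^(Q[:：]|问[:：])", text, re.MULTILINE):
        return "explicit_faq"
    # Title/table structure: markdown-style headings or table rows.
    if re.search(r"^(#+\s|\|.+\|)", text, re.MULTILINE):
        return "structured"
    return "natural_language"

MODEL_FOR_TYPE = {
    "explicit_faq": "explicit question generation model",
    "structured": "structured and semi-structured question generation model",
    "natural_language": "natural language question generation model",
}

def select_model(text: str) -> str:
    return MODEL_FOR_TYPE[detect_text_type(text)]
```

Checking the question-and-answer structure before the title structure mirrors the order of the embodiments described below.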
In an embodiment, identifying the text type of the document to be processed according to the text structure includes: identifying whether the text structure of the document to be processed contains a question-and-answer structure;
selecting the generation model corresponding to the text type includes: if the text structure of the document to be processed contains a question-and-answer structure, taking the explicit question generation model as the generation model corresponding to the text type; and
generating a question for the document to be processed by using the selected generation model includes: generating a question for the document to be processed by using the explicit question generation model.
In an embodiment, generating a question for the document to be processed by using the explicit question generation model includes:
judging whether a question part in the question-and-answer structure matches its corresponding answer part, and screening out the portion of text corresponding to a successfully matched question-and-answer structure as a candidate text;
classifying the screened candidate texts by using a first recurrent neural network model, so as to identify explicit questions from the candidate texts; and
taking the explicit questions as the questions generated for the document to be processed.
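The explicit-question path above can be sketched as follows: pair question lines with their answer lines, then pass the candidates to a classifier. The classifier here is a pluggable callable standing in for the first recurrent neural network model; the `Q:`/`A:` line markers are an illustrative assumption:

```python
import re
from typing import Callable, List, Tuple

# Sketch of the explicit-question path (not the patent's implementation):
# pair Q/A lines, then keep only candidates the classifier accepts.
def extract_explicit_questions(
    lines: List[str],
    is_question: Callable[[str], bool],
) -> List[Tuple[str, str]]:
    pairs = []
    for i, line in enumerate(lines[:-1]):
        q = re.match(r"^Q[:：]\s*(.+)", line)
        a = re.match(r"^A[:：]\s*(.+)", lines[i + 1])
        if q and a:  # question part matched with its corresponding answer part
            pairs.append((q.group(1), a.group(1)))
    # stand-in for the RNN classifier: filter the candidate texts
    return [(q, a) for q, a in pairs if is_question(q)]

# trivial stand-in classifier: accepts texts that end with a question mark
naive_classifier = lambda text: text.rstrip().endswith(("?", "？"))
```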
In an embodiment, identifying the text type of the document to be processed according to the text structure includes: identifying whether the text structure of the document to be processed contains a title structure, where the title structure includes a title or a table;
selecting the generation model corresponding to the text type includes: if the text structure of the document to be processed contains a title structure, taking the structured and semi-structured question generation model as the generation model corresponding to the text type; and
generating a question for the document to be processed by using the selected generation model includes: generating a question for the document to be processed by using the structured and semi-structured question generation model.
In an embodiment, generating a question for the document to be processed by using the structured and semi-structured question generation model includes:
in a case where the text structure of the document to be processed contains a title, acquiring an attribute paraphrase related to the title; and
generating a question according to the attribute paraphrase.
In an embodiment, acquiring the attribute paraphrase related to the title includes:
acquiring search click-through logs related to the title;
performing data mining on the search click-through logs to obtain the attribute paraphrase related to the title; and
storing the attribute paraphrase in an attribute paraphrase table.
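A minimal sketch of this mining step, under the assumption that a log entry is a (query, clicked title) pair: for queries that led to clicks on a page about a title, the title is stripped from the query and the remaining attribute phrases are counted. The frequency threshold is an illustrative choice, not taken from the patent:

```python
from collections import Counter
from typing import Dict, List, Tuple

# Sketch of mining attribute paraphrases from search click logs and
# building the attribute paraphrase table (illustrative, not the
# patent's actual mining procedure).
def mine_attribute_paraphrases(
    click_log: List[Tuple[str, str]],  # (query, clicked title)
    min_count: int = 2,                # assumed noise threshold
) -> Dict[str, List[str]]:
    counts: Dict[str, Counter] = {}
    for query, title in click_log:
        if title in query:
            attribute = query.replace(title, "").strip()
            if attribute:
                counts.setdefault(title, Counter())[attribute] += 1
    # the returned dict plays the role of the attribute paraphrase table
    return {
        title: [a for a, c in ctr.most_common() if c >= min_count]
        for title, ctr in counts.items()
    }
```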
In an embodiment, generating the question according to the attribute paraphrase includes:
generating the question from the attribute paraphrase by using a first encoder-decoder model; or
querying the attribute paraphrase table for the attribute paraphrase related to the title, and generating the question according to the queried attribute paraphrase.
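The table-lookup branch can be sketched as below. A fixed question template stands in for the first encoder-decoder model; both the template wording and the table format are illustrative assumptions:

```python
from typing import Dict, List

# Sketch of the table-lookup branch: query the attribute paraphrase
# table for a title, then fill a question template (a stand-in for the
# encoder-decoder model described in the text).
def questions_from_paraphrase_table(
    title: str,
    paraphrase_table: Dict[str, List[str]],
) -> List[str]:
    attributes = paraphrase_table.get(title, [])
    return [f"What is the {attr} of {title}?" for attr in attributes]
```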
In an embodiment, identifying the text type of the document to be processed according to the text structure includes: identifying whether the text structure of the document to be processed contains a question-and-answer structure and a title structure, where the title structure includes a title or a table;
selecting the generation model corresponding to the text type includes: if the text structure of the document to be processed contains neither a question-and-answer structure nor a title structure, taking the natural language question generation model as the generation model corresponding to the text type; and
generating a question for the document to be processed by using the selected generation model includes: generating a question for the document to be processed by using the natural language question generation model.
In an embodiment, generating a question for the document to be processed by using the natural language question generation model includes:
screening out target sentences from the document to be processed by using a second recurrent neural network model, where the target sentences include semantically complete sentences;
selecting candidate answer segments from the target sentences by using a third recurrent neural network model; and
generating questions from the candidate answer segments by using a second encoder-decoder model.
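The three-stage natural-language path above can be sketched as a pipeline of pluggable callables. The crude defaults (word-count completeness check, numeric-span picker, template question) are illustrative stand-ins for the second recurrent neural network, the third recurrent neural network, and the second encoder-decoder model:

```python
import re
from typing import Callable, List

# Schematic natural-language pipeline: sentence filter -> answer-span
# selector -> question generator (stages are stand-ins, not trained models).
def nl_question_pipeline(
    document: str,
    is_complete: Callable[[str], bool],
    pick_answer: Callable[[str], str],
    make_question: Callable[[str, str], str],
) -> List[str]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    questions = []
    for sent in sentences:
        if not is_complete(sent):     # keep semantically complete sentences
            continue
        answer = pick_answer(sent)    # candidate answer segment
        if answer:
            questions.append(make_question(sent, answer))
    return questions

# crude illustrative defaults for the three stages
complete = lambda s: len(s.split()) >= 4 and s.endswith((".", "!", "?"))
pick_number = lambda s: (m.group(0) if (m := re.search(r"\d[\d.]*\b", s)) else "")
make_how_many = lambda sent, ans: sent.replace(ans, "how many", 1).rstrip(".") + "?"
```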
In an embodiment, the method further includes: locating an answer boundary for the generated question.
In an embodiment, locating the answer boundary for the generated question includes:
predicting the start and end positions of answer segments corresponding to the question by using a bi-directional attention flow network; and
ranking the answer segments by using a learning-to-rank model, and locating the answer boundary for the question according to the ranking result, where the features of the learning-to-rank model include the start and end positions of the answer segments.
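A minimal sketch of the boundary step: candidate spans (as a reading-comprehension model such as BiDAF would produce) are re-ranked, and the top-ranked span gives the answer boundary. The linear scorer with hand-set weights is an illustrative stand-in for a trained learning-to-rank model; as in the text, its features include the span's start and end positions:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Span:
    start: int        # predicted start position
    end: int          # predicted end position
    model_score: float  # confidence from the span-prediction model

# Sketch: rank candidate spans and return the best span's boundary.
def locate_answer_boundary(spans: List[Span]) -> Tuple[int, int]:
    def rank_score(s: Span) -> float:
        length = s.end - s.start
        # assumed weights: trust the model score, lightly penalize very
        # long spans and spans starting deep in the document
        return s.model_score - 0.01 * length - 0.001 * s.start
    best = max(spans, key=rank_score)
    return best.start, best.end
```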
In a second aspect, an embodiment of the present invention provides a question generation apparatus, including:
a text type identification unit, configured to identify the text type of a document to be processed according to its text structure;
a generation model selection unit, configured to select a generation model corresponding to the text type, where the generation model includes at least one of an explicit question generation model, a structured and semi-structured question generation model, and a natural language question generation model; and
a question generation unit, configured to generate a question for the document to be processed by using the selected generation model.
In an embodiment, the text type identification unit includes a first identification subunit configured to identify whether the text structure of the document to be processed contains a question-and-answer structure;
the generation model selection unit includes a first selection subunit configured to: if the text structure of the document to be processed contains a question-and-answer structure, take the explicit question generation model as the generation model corresponding to the text type; and
the question generation unit includes a first generation subunit configured to generate a question for the document to be processed by using the explicit question generation model.
In an embodiment, the first generation subunit is further configured to:
judge whether a question part in the question-and-answer structure matches its corresponding answer part, and screen out the portion of text corresponding to a successfully matched question-and-answer structure as a candidate text;
classify the screened candidate texts by using the first recurrent neural network model, so as to identify explicit questions from the candidate texts; and
take the explicit questions as the questions generated for the document to be processed.
In an embodiment, the text type identification unit includes a second identification subunit configured to identify whether the text structure of the document to be processed contains a title structure, where the title structure includes a title or a table;
the generation model selection unit includes a second selection subunit configured to: if the text structure of the document to be processed contains a title structure, take the structured and semi-structured question generation model as the generation model corresponding to the text type; and
the question generation unit includes a second generation subunit configured to generate a question for the document to be processed by using the structured and semi-structured question generation model.
In an embodiment, the second generation subunit includes:
a paraphrase acquisition subunit, configured to acquire an attribute paraphrase related to a title in a case where the text structure of the document to be processed contains the title; and
a paraphrase question generation subunit, configured to generate a question according to the attribute paraphrase.
In an embodiment, the paraphrase acquisition subunit is further configured to:
acquire search click-through logs related to the title;
perform data mining on the search click-through logs to obtain the attribute paraphrase related to the title; and
store the attribute paraphrase in an attribute paraphrase table.
In an embodiment, the paraphrase question generation subunit is further configured to:
generate the question from the attribute paraphrase by using the first encoder-decoder model; or
query the attribute paraphrase table for the attribute paraphrase related to the title, and generate the question according to the queried attribute paraphrase.
In an embodiment, the text type identification unit includes a third identification subunit configured to identify whether the text structure of the document to be processed contains a question-and-answer structure and a title structure, where the title structure includes a title or a table;
the generation model selection unit includes a third selection subunit configured to: if the text structure of the document to be processed contains neither a question-and-answer structure nor a title structure, take the natural language question generation model as the generation model corresponding to the text type; and
the question generation unit includes a third generation subunit configured to generate a question for the document to be processed by using the natural language question generation model.
In an embodiment, the third generation subunit is further configured to:
screen out target sentences from the document to be processed by using the second recurrent neural network model, where the target sentences include semantically complete sentences;
select candidate answer segments from the target sentences by using the third recurrent neural network model; and
generate questions from the candidate answer segments by using the second encoder-decoder model.
In an embodiment, the apparatus further includes an answer boundary locating unit, configured to locate an answer boundary for the generated question.
In an embodiment, the answer boundary locating unit is further configured to:
predict the start and end positions of answer segments corresponding to the question by using a bi-directional attention flow network; and
rank the answer segments by using a learning-to-rank model, and locate the answer boundary for the question according to the ranking result, where the features of the learning-to-rank model include the start and end positions of the answer segments.
In a third aspect, an embodiment of the present invention provides a question generation apparatus. The functions of the apparatus may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions.
In a possible design, the apparatus includes a processor and a memory, where the memory is configured to store a program that supports the apparatus in performing the above method, and the processor is configured to execute the program stored in the memory. The apparatus may further include a communication interface for communicating with other devices or a communication network.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any one of the methods described in the first aspect above.
One of the above technical solutions has the following advantage or beneficial effect: according to the characteristics of different text types, the most suitable generation model is selected, either for the entire document or for each part of its text, which improves the accuracy of the generated questions.
Another of the above technical solutions has the following advantage or beneficial effect: question answering technology can obtain the accurate boundary of the answer corresponding to a question, which further improves the accuracy of the generated FAQ document.
The above summary is provided for the purpose of the description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present invention will be readily apparent from the drawings and the following detailed description.
Brief Description of the Drawings
In the drawings, unless otherwise specified, the same reference numerals denote the same or similar parts or elements throughout the several figures. The drawings are not necessarily drawn to scale. It should be understood that these drawings depict only some embodiments disclosed in accordance with the present invention and should not be regarded as limiting the scope of the invention.
FIG. 1 is a flowchart of a question generation method provided by an embodiment of the present invention.
FIG. 2 is a schematic diagram of a sample target document for FAQ mining in the question generation method provided by an embodiment of the present invention.
FIG. 3 is a schematic diagram of text types in the question generation method provided by an embodiment of the present invention.
FIG. 4 is a flowchart of generating questions with the explicit question generation model in the question generation method provided by an embodiment of the present invention.
FIG. 5 is a flowchart of generating questions with the explicit question generation model in the question generation method provided by an embodiment of the present invention.
FIG. 6 is a flowchart of generating questions with the structured and semi-structured question generation model in the question generation method provided by an embodiment of the present invention.
FIG. 7 is a flowchart of generating questions with the structured and semi-structured question generation model in the question generation method provided by an embodiment of the present invention.
FIG. 8 is a flowchart of generating questions with the structured and semi-structured question generation model in the question generation method provided by an embodiment of the present invention.
FIG. 9 is a flowchart of generating questions with the natural language question generation model in the question generation method provided by an embodiment of the present invention.
FIG. 10 is a flowchart of generating questions with the natural language question generation model in the question generation method provided by an embodiment of the present invention.
FIG. 11 is a flowchart of a question generation method provided by an embodiment of the present invention.
FIG. 12 is a schematic diagram of the online part of the question generation method provided by an embodiment of the present invention.
FIG. 13 is a flowchart of answer boundary locating in the question generation method provided by an embodiment of the present invention.
FIG. 14 is a schematic diagram of the offline part of the question generation method provided by an embodiment of the present invention.
FIG. 15 is a structural block diagram of a question generation apparatus provided by an embodiment of the present invention.
FIG. 16 is a structural block diagram of a question generation apparatus provided by an embodiment of the present invention.
FIG. 17 is a structural block diagram of the second generation subunit of the question generation apparatus provided by an embodiment of the present invention.
FIG. 18 is a structural block diagram of a question generation apparatus provided by an embodiment of the present invention.
FIG. 19 is a structural block diagram of a question generation apparatus provided by an embodiment of the present invention.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will appreciate, the described embodiments may be modified in various different ways without departing from the spirit or scope of the present invention. Accordingly, the drawings and the description are to be regarded as illustrative in nature rather than restrictive.
FIG. 1 is a flowchart of a question generation method provided by an embodiment of the present invention. As shown in FIG. 1, the question generation method of this embodiment includes:
Step S110: identifying the text type of a document to be processed according to its text structure;
Step S120: selecting a generation model corresponding to the text type, where the generation model includes at least one of an explicit question generation model, a structured and semi-structured question generation model, and a natural language question generation model; and
Step S130: generating a question for the document to be processed by using the selected generation model.
FAQs are used on many websites to provide consulting services to users. An FAQ lists some questions users frequently ask and is a form of online help. When using the functions or services of some websites, users often encounter problems that look simple but may be hard to figure out without an explanation. Sometimes users are even lost because of such details. In many cases, these problems can be solved with a simple explanation, and that is the value of an FAQ.
The questions and answers in an FAQ must all be ones that users frequently ask and encounter. For example, in network marketing, FAQs are considered a common means of online customer service; a good FAQ system should be able to answer at least 80% of users' general and frequently asked questions. The use of FAQs is not only convenient for users, but also greatly reduces the pressure on website staff, saves substantial customer service costs, and increases user satisfaction. Therefore, in FAQ design it is crucial to use a scientifically sound method to generate questions with high accuracy.
FIG. 2 is a schematic diagram of a sample target document for FAQ mining in the question generation method provided by an embodiment of the present invention. Question-answer pairs can be automatically generated from the specified document shown in FIG. 2. Examples of the expected question-answer pairs are as follows:
Q: What are the harms of secondhand smoke to the body?
A: 1. It increases the chance of lung cancer...; 2. Harm to memory, as the nicotine in the smoke... 3. It causes childhood asthma, pneumonia, ear inflammation...
Q: How many times more likely are people exposed to secondhand smoke to develop lung cancer than normal people?
A: 2.6 to 6 times.
In the above, Q (question) denotes a question, and A (answer) denotes the answer corresponding to the question.
The embodiments of the present invention realize intelligent question answering through customized retrieval, and can process a specified document so as to achieve the purpose of customized retrieval, making the generated intelligent question answering more targeted. Through reading comprehension, question generation technology, and question answering technology, the document to be processed is analyzed and the questions users may ask are extracted, realizing a customized question answering service.
Specifically, the text of the specified document to be processed can be classified into three text types: an explicit FAQ text type, a structured and semi-structured text type, and a natural language text type. FIG. 3 is a schematic table of the text types in the question generation method provided by an embodiment of the present invention. The "Q" in FIG. 3 denotes the question generated for each text sample.
The first row of the table in FIG. 3 shows the explicit FAQ text type; the corresponding generated question is: "Can I order local and domestic data packages with the 98-yuan unlimited plan?"
The second row of the table in FIG. 3 shows the structured and semi-structured text type; the corresponding generated question is: "What are the ways to apply for the Baidu Goddess Card?" Here, "structured" may include the case where the document contains a table structure, and "semi-structured" may include the case where the document contains titles. The "Baidu Goddess Card" in the generated question can be obtained from the general title of the document or from entries and keywords related to the document.
The third row of the table in FIG. 3 shows the natural language text type; the corresponding generated question is: "How are new 4G Fuka users charged when they join the network?" Here, the "4G Fuka" in the generated question can be obtained from the general title of the document or from entries and keywords related to the document.
For documents of different text types, the embodiments of the present invention select different generation models corresponding to those text types, so that the generated questions are more targeted, more accurate, and better suited to the needs of a customized question answering service. The generated questions and their answers can support a customer service robot in realizing human-computer interaction.
In the above technical solution, an index can also be created for the questions to save storage space. For example, the index may be an inverted index or a key-value index. The inverted index arises from the practical need to look up records by an attribute's value: each entry in such an index table contains an attribute value and the addresses of the records having that attribute value. Because the position of a record is determined from the attribute value, rather than the attribute value from the record, it is called an inverted index. The recall accuracy of query answers based on a question index is higher. Further, a semantic index can be built over the generated question-answer pairs to support a customer service robot in realizing human-computer interaction.
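The inverted index described above can be sketched in a few lines: each term maps to the ids of the generated questions containing it, so a query term leads directly to candidate question-answer pairs. Whitespace tokenization is an illustrative simplification:

```python
from collections import defaultdict
from typing import Dict, Set

# Minimal inverted index over generated questions (a sketch): term ->
# set of question ids whose text contains that term.
def build_inverted_index(questions: Dict[int, str]) -> Dict[str, Set[int]]:
    index: Dict[str, Set[int]] = defaultdict(set)
    for qid, text in questions.items():
        for term in text.lower().rstrip("?").split():
            index[term].add(qid)
    return index

def lookup(index: Dict[str, Set[int]], term: str) -> Set[int]:
    return index.get(term.lower(), set())
```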
In one example, the entire document to be processed may belong to one of the three text types above, in which case the generation model corresponding to that text type is selected to generate questions for the whole document. In another example, the document to be processed may be divided into several parts, each of whose text belongs to one of the three text types. In that case, the generation model corresponding to each part's text type is selected, and questions are generated for each part separately.
In one embodiment, identifying the text type of the document to be processed according to the text structure includes: identifying, according to the text structure, the text type of each part of the text in the document to be processed;
selecting the generation model corresponding to the text type includes: selecting the generation model corresponding to the text type of each part of the text;
using the selected generation model to generate questions for the document to be processed includes: using the selected generation model to generate questions for each part of the text.
First, identify which text type the document to be processed, or each part of it, belongs to according to the text structure:
(1) Identify whether the text structure of the document to be processed contains a question-and-answer structure, i.e., questions and their answers. If so, use the explicit question generation model to generate questions for the parts of the document that have a question-and-answer structure.
(2) Identify whether the text structure of the document to be processed contains headings or tables. If so, use the structured and semi-structured question generation model to generate questions for the parts of the document that contain headings or tables.
(3) If a part of the document has neither a question-and-answer structure nor headings or tables, for example because it is in plain-text format, that part may be called the plain-text part. In this case, use the natural language question generation model to generate questions for that part.
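The three-way routing above can be sketched as a simple dispatcher. The structure-detection flags used here are hypothetical placeholders standing in for the document parsing described in the text:

```python
def detect_text_type(part):
    """Route a document part to one of the three text types by its structure."""
    if "?" in part.get("text", "") and part.get("has_answer"):  # question-and-answer structure
        return "explicit_faq"
    if part.get("has_title") or part.get("has_table"):          # heading or table present
        return "structured"
    return "natural_language"                                   # plain-text part

def select_model(text_type):
    # One generation model per text type, mirroring steps S110-S130.
    return {
        "explicit_faq": "explicit question generation model",
        "structured": "structured and semi-structured question generation model",
        "natural_language": "natural language question generation model",
    }[text_type]

part = {"text": "4G Fuka billing rules for new users.", "has_title": False, "has_table": False}
print(select_model(detect_text_type(part)))  # natural language question generation model
```

Each part of a multi-part document would be passed through the same dispatcher, so different parts of one document can be handled by different models.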
The above technical solution has the following advantage or beneficial effect: for each text type, whether for the whole document or for each part of it, the most suitable generation model is selected, which improves the accuracy of the generated questions.
FIG. 4 is a flowchart of generating questions with the explicit question generation model in the question generation method provided by an embodiment of the present invention. As shown in FIG. 4, in one embodiment, step S110 in FIG. 1, identifying the text type of the document to be processed according to the text structure, may specifically include step S210: identifying whether the text structure of the document to be processed contains a question-and-answer structure.
Step S120 in FIG. 1, selecting the generation model corresponding to the text type, may specifically include step S220: if the text structure of the document to be processed contains a question-and-answer structure, taking the explicit question generation model as the generation model corresponding to the text type.
Step S130 in FIG. 1, using the selected generation model to generate questions for the document to be processed, may specifically include step S230: using the explicit question generation model to generate questions for the document to be processed.
Referring to the example shown in the first row of the table in FIG. 3, if the text structure of the document to be processed contains a question-and-answer structure, the text type of that part belongs to the explicit FAQ text type. For the explicit FAQ text type, the corresponding explicit question generation model is selected to generate questions for that part of the text.
FIG. 5 is a flowchart of generating questions with the explicit question generation model in the question generation method provided by an embodiment of the present invention. As shown in FIG. 5, in one embodiment, step S230 in FIG. 4, using the explicit question generation model to generate questions for the document to be processed, may specifically include:
Step S310: judging whether the question part in the question-and-answer structure matches the corresponding answer part, and selecting the text corresponding to successfully matched question-and-answer structures as candidate text;
Step S320: using a first recurrent neural network model to classify the selected candidate text, so as to identify explicit questions from the candidate text;
Step S330: taking the explicit questions as the questions generated for the document to be processed.
In this embodiment, questions can be generated through document structure parsing and question identification. Document structure parsing may include using the text structure to pre-screen possible explicit questions. Question identification may include: using an RNN (Recurrent Neural Network) as a question classification model, taking words as features, and training the RNN model on manually annotated data to serve as the explicit question generation model.
Specifically, the processing of the explicit question generation model can be divided into the following steps:
(1) Document structure parsing: find the question parts and the corresponding answer parts in the text, and judge whether each question part matches its answer part. If they match, select the question part and answer part as a "possible explicit question".
For example, one way to "find the question parts and the corresponding answer parts in the text" is to locate a question mark, take the sentence before it as the question part, and take the passage after it as the answer part. Semantic understanding can be used to judge whether the question and answer match.
(2) Question identification: use the RNN model, i.e., the first recurrent neural network model above, to classify the pre-screened "possible explicit questions". The output of the first recurrent neural network model falls into two classes: items finally confirmed to be explicit questions, and items finally confirmed not to be.
Features used by the RNN model may include part-of-speech tags and word dependency relations. The text can be split into sentences and words, and different features can be extracted as needed for model training. For example, punctuation can be used for sentence segmentation, and NLP (Natural Language Processing) tools for word segmentation. One can try segmenting a sentence into patterns of, for example, 5 words, 3 words, or 1 word, and choose the pattern that works best.
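The two-step process above can be sketched as follows: question-mark pre-screening followed by classification. A trivial interrogative-word rule stands in for the trained RNN classifier, which is not reproduced here; the example text is hypothetical:

```python
import re

def prescreen_qa_pairs(text):
    """Step (1): the sentence ending in '?' is a possible question;
    the passage after it, up to the next question, is its answer."""
    pattern = re.compile(r"([^.?!]*\?)\s*([^?]*)")
    return [(q.strip(), a.strip()) for q, a in pattern.findall(text) if a.strip()]

def is_explicit_question(question):
    """Step (2): crude stand-in for the first RNN classification model."""
    return any(w in question.lower() for w in ("how", "what", "why", "when", "where", "who"))

doc = "How is billing handled? New users are billed monthly. Contact support for details."
for q, a in prescreen_qa_pairs(doc):
    if is_explicit_question(q):
        print((q, a))
```

In the patent's method, the final accept/reject decision comes from the trained RNN over word, part-of-speech, and dependency features rather than a keyword rule.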
FIG. 6 is a flowchart of generating questions with the structured and semi-structured question generation model in the question generation method provided by an embodiment of the present invention. As shown in FIG. 6, in one embodiment, step S110 in FIG. 1, identifying the text type of the document to be processed according to the text structure, may specifically include step S410: identifying whether the text structure of the document to be processed contains a heading structure, where the heading structure includes headings or tables.
Step S120 in FIG. 1, selecting the generation model corresponding to the text type, may specifically include step S420: if the text structure of the document to be processed contains a heading structure, taking the structured and semi-structured question generation model as the generation model corresponding to the text type.
Step S130 in FIG. 1, using the selected generation model to generate questions for the document to be processed, may specifically include step S430: using the structured and semi-structured question generation model to generate questions for the document to be processed.
Referring to the example shown in the second row of the table in FIG. 3, if the text structure of the document to be processed contains a heading structure, where the heading structure includes headings or tables, the text type of that part belongs to the structured and semi-structured text type. For this text type, the corresponding structured and semi-structured question generation model is selected to generate questions for that part of the text.
FIG. 7 is a flowchart of generating questions with the structured and semi-structured question generation model in the question generation method provided by an embodiment of the present invention. As shown in FIG. 7, in one embodiment, step S430 in FIG. 6, using the structured and semi-structured question generation model to generate questions for the document to be processed, may specifically include:
Step S510: when the text structure of the document to be processed contains a heading, obtaining attribute paraphrases related to the heading;
Step S520: generating questions according to the attribute paraphrases.
In this embodiment, search click-through logs from an FAQ retrieval system can be obtained, and attribute paraphrases can be mined from those logs. For example, "*billing method" and "*how is it billed" can be paraphrases of each other. Generation based on attribute paraphrases serves as the semi-structured and structured question generation model.
FIG. 8 is a flowchart of generating questions with the structured and semi-structured question generation model in the question generation method provided by an embodiment of the present invention. As shown in FIG. 8, in one embodiment, step S510 in FIG. 7, obtaining attribute paraphrases related to the heading, may specifically include:
Step S610: obtaining search click-through logs related to the heading;
Step S620: performing data mining on the search click-through logs to obtain attribute paraphrases related to the heading;
Step S630: storing the attribute paraphrases in an attribute paraphrase table.
In one embodiment, step S520 in FIG. 7, generating questions according to the attribute paraphrases, may specifically include:
generating questions with a first encoder-decoder model according to the attribute paraphrases; or,
querying the attribute paraphrase table for paraphrases related to the heading, and generating questions according to the paraphrases found.
Specifically, the structured and semi-structured generation model may include the following processing steps:
(1) Mine attribute paraphrases from the search click-through logs using data mining methods, and store the mined paraphrases in the attribute paraphrase table.
For example, if two users A and B search with different keywords but click the same URL (Uniform Resource Locator), the meanings expressed by the two keywords may be the same, and they can be treated as paraphrases of each other.
In one example, a user searches for "efficacy of bitter gourd". Here "bitter gourd" is an entity and "efficacy" is an attribute of that entity. Paraphrases of the attribute "efficacy" include "effect", "medicinal effect", and so on. That is, "efficacy of bitter gourd", "effect of bitter gourd", and "medicinal effect of bitter gourd" express the same meaning.
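The click-log mining idea in step (1) can be sketched by grouping queries by the URL they clicked; queries that led to the same URL become candidate paraphrases of each other. The log entries and URLs below are hypothetical:

```python
from collections import defaultdict

def mine_paraphrases(click_log):
    """Queries whose clicks landed on the same URL are candidate paraphrases."""
    by_url = defaultdict(set)
    for query, url in click_log:
        by_url[url].add(query)
    # Keep only URLs reached from at least two distinct queries.
    return {url: qs for url, qs in by_url.items() if len(qs) > 1}

log = [
    ("efficacy of bitter gourd", "https://example.com/bitter-gourd"),
    ("effect of bitter gourd", "https://example.com/bitter-gourd"),
    ("4g fuka billing", "https://example.com/fuka"),
]
table = mine_paraphrases(log)
print(sorted(table["https://example.com/bitter-gourd"]))
```

The resulting mapping plays the role of the attribute paraphrase table of step S630; a production system would add frequency thresholds and noise filtering that this sketch omits.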
(2) Generation based on attribute paraphrases, serving as the semi-structured and structured question generation model. Two implementations are possible:
Mode 1: use a Seq2Seq (Sequence to Sequence) model, i.e., the first encoder-decoder model above, to generate questions. The features used by the model include lexical and syntactic features, the answer start and end positions predicted by a sequence labeling model, and word features. The input to the model is a paragraph of the document to be processed, and the output is a question generated for that input. In one example, the text of the document to be processed is: "Beijing is the capital of China." The sequence labeling model can label "Beijing". Then "Beijing" and "Beijing is the capital of China" are fed as input to the seq2seq model to generate the question "Where is the capital of China?".
The Seq2Seq model, also called the Encoder-Decoder model, is an important variant of the RNN model. The Encoder-Decoder structure does not constrain the input and output sequence lengths, so it has a very wide range of applications, such as machine translation, text summarization, reading comprehension, and speech recognition.
Because the input and output sequences of a seq2seq model may have different lengths (hence "Sequence to Sequence"), it realizes a transformation from one sequence to another; for example, it can implement a chatbot dialogue model. The classic RNN model fixes the sizes of the input and output sequences, while the seq2seq model removes this limitation.
The encoder and decoder are two RNNs corresponding to the input sequence and the output sequence, respectively. The basic idea of the common encoder-decoder structure is to use two RNNs: one as the encoder and the other as the decoder. The encoder compresses the input sequence into a vector of a specified length, which can be regarded as the semantics of the sequence; this process is called encoding. The decoder generates the target sequence from the semantic vector; this process is called decoding.
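The encode-then-decode structure can be illustrated with a toy forward pass. The weights are random and untrained, so the output tokens are meaningless; the point is only to show the encoder folding a variable-length input into one fixed-length context vector that then conditions a decoder of a different output length:

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 8, 12                                   # hidden size, toy vocabulary size
Wxh = rng.normal(size=(H, V))                  # input-to-hidden weights
Whh = rng.normal(size=(H, H))                  # hidden-to-hidden weights
Why = rng.normal(size=(V, H))                  # hidden-to-output weights

def one_hot(i):
    v = np.zeros(V); v[i] = 1.0; return v

def encode(token_ids):
    """Encoder RNN: compress the whole input sequence into one context vector."""
    h = np.zeros(H)
    for t in token_ids:
        h = np.tanh(Wxh @ one_hot(t) + Whh @ h)
    return h

def decode(context, steps):
    """Decoder RNN: unroll from the context vector, emitting one token per step."""
    h, out = context, []
    for _ in range(steps):
        h = np.tanh(Whh @ h)
        out.append(int(np.argmax(Why @ h)))
    return out

context = encode([3, 1, 4, 1, 5])      # input length 5 ...
question = decode(context, steps=3)    # ... output length 3: lengths need not match
print(len(context), len(question))     # 8 3
```

A real seq2seq question generator would be trained end to end and would typically feed each decoded token back in as the next decoder input; this sketch omits both for brevity.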
Mode 2: query the attribute paraphrase table and select a relevant paraphrase to generate the question.
For example, among the retrieved paraphrases "efficacy of bitter gourd", "effect of bitter gourd", and "medicinal effect of bitter gourd", the most suitable one can be selected to generate the question. Methods such as semantic understanding and keyword matching can be used to compare and analyze each retrieved paraphrase against the expressions in the document to be processed, so as to determine the most suitable paraphrase for generating the question.
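As a minimal keyword-matching stand-in for "selecting the most suitable paraphrase", each candidate can be scored by word overlap with the document's own wording (the semantic-understanding comparison mentioned above is not reproduced):

```python
def best_paraphrase(candidates, document_text):
    """Pick the paraphrase sharing the most words with the document text."""
    doc_words = set(document_text.lower().split())
    return max(candidates, key=lambda c: len(set(c.lower().split()) & doc_words))

candidates = [
    "efficacy of bitter gourd",
    "effect of bitter gourd",
    "medicinal effect of bitter gourd",
]
doc = "The medicinal effect of bitter gourd has been studied widely."
print(best_paraphrase(candidates, doc))  # medicinal effect of bitter gourd
```

Overlap counting is the crudest form of keyword matching; a deployed system would also weigh word importance and semantic similarity.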
In another case, when the text structure of the document to be processed contains a table, methods such as semantic understanding and keyword matching can be used to identify the contents of the table header, the row records, and the column fields, and then to determine the generated questions and their corresponding answers. For example, one column of the table may supply the content of the generated questions while another column supplies the corresponding answers.
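For the table case, once the question-bearing and answer-bearing columns have been identified, question-answer pairs can be read off row by row. The column names, question template, and table contents below are all hypothetical:

```python
def qa_pairs_from_table(header, rows, q_col="item", a_col="billing"):
    """Treat one identified column as question subjects and another as their answers."""
    qi, ai = header.index(q_col), header.index(a_col)
    return [(f"How is {row[qi]} billed?", row[ai]) for row in rows]

header = ["item", "billing"]
rows = [["domestic calls", "0.1 yuan/min"], ["data", "5 yuan/GB"]]
for q, a in qa_pairs_from_table(header, rows):
    print(q, "->", a)
```

Identifying which column is which is the hard part in practice; this sketch assumes that identification has already been done.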
FIG. 9 is a flowchart of generating questions with the natural language question generation model in the question generation method provided by an embodiment of the present invention. As shown in FIG. 9, in one embodiment, step S110 in FIG. 1, identifying the text type of the document to be processed according to the text structure, may specifically include step S710: identifying whether the text structure of the document to be processed contains a question-and-answer structure and a heading structure, where the heading structure includes headings or tables;
Step S120 in FIG. 1, selecting the generation model corresponding to the text type, may specifically include step S720: if the text structure of the document to be processed contains neither a question-and-answer structure nor a heading structure, taking the natural language question generation model as the generation model corresponding to the text type;
Step S130 in FIG. 1, using the selected generation model to generate questions for the document to be processed, may specifically include step S730: using the natural language question generation model to generate questions for the document to be processed.
Referring to the example shown in the third row of the table in FIG. 3, if the text structure of the document to be processed contains neither a question-and-answer structure nor a heading structure, the text type of that part belongs to the natural language text type. For the natural language text type, the corresponding natural language question generation model is selected to generate questions for that part of the text.
FIG. 10 is a flowchart of generating questions with the natural language question generation model in the question generation method provided by an embodiment of the present invention. As shown in FIG. 10, in one embodiment, step S730 in FIG. 9, using the natural language question generation model to generate questions for the document to be processed, may specifically include:
Step S810: using a second recurrent neural network model to select target sentences from the document to be processed, where the target sentences are semantically complete sentences;
Step S820: using a third recurrent neural network model to select candidate answer segments from the target sentences;
Step S830: generating questions from the candidate answer segments with a second encoder-decoder model.
In this embodiment, an RNN model can first be used to classify and filter target sentences, RNN sequence labeling can then be applied to the filtered target sentences to select candidate answer segments, and a seq2seq model can finally be used to generate questions.
Specifically, the natural language generation model may include the following processing steps:
(1) First, use the second recurrent neural network model to classify and filter the target sentences, keeping the semantically complete sentences.
(2) For the filtered target sentences, use the third recurrent neural network model to select candidate answer segments, that is, to select the segments that may serve as answers to questions. The third recurrent neural network model can be trained with sequence labeling. Sequence labeling may include sentence-level annotation, i.e., marking the spans about which questions can be asked.
(3) Then use the seq2seq model, i.e., the second encoder-decoder model above, as the question generation model to generate questions.
In one example, the text of the document to be processed is: "Beijing is the capital of China." Sequence labeling can mark "Beijing". Then "Beijing" and "Beijing is the capital of China" are fed as input to the seq2seq model to generate the question "Where is the capital of China?".
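The three stages above can be sketched end to end with simple stand-ins: a length/punctuation rule for the second RNN's sentence filtering, a capitalized-token rule for the third RNN's sequence labeling, and a template for the seq2seq generator. None of the trained models are reproduced here:

```python
def filter_complete_sentences(sentences):
    """Stage (1) stand-in for the second RNN: keep sentences that look complete."""
    return [s for s in sentences if len(s.split()) >= 3 and s.endswith(".")]

def tag_answer_span(sentence):
    """Stage (2) stand-in for RNN sequence labeling: take the first capitalized token."""
    return next(w.rstrip(".") for w in sentence.split() if w[0].isupper())

def generate_question(answer, sentence):
    """Stage (3) stand-in for the second encoder-decoder model: a location template."""
    return "Where " + sentence.replace(answer, "", 1).strip().rstrip(".") + "?"

doc = ["Beijing is the capital of China.", "Thanks."]
for s in filter_complete_sentences(doc):
    a = tag_answer_span(s)
    print(generate_question(a, s), "->", a)  # Where is the capital of China? -> Beijing
```

The template only fits location-style answers, which is exactly the gap a trained seq2seq model closes by learning how to phrase a question for any tagged span.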
FIG. 11 is a flowchart of the question generation method provided by an embodiment of the present invention. As shown in FIG. 11, in one embodiment, the method further includes step S140: performing answer boundary localization for the generated questions.
The above technical solution has the following advantage or beneficial effect: through question-answering techniques such as answer boundary localization, the precise boundary of the answer corresponding to a question can be obtained, further improving the accuracy of the generated FAQ document.
In one example, the question generation method of the embodiment of the present invention includes two parts: an online part and an offline part. FIG. 12 is a schematic diagram of the online part of the question generation method provided by an embodiment of the present invention. The "target document" in FIG. 12 is the document to be processed. "General document parsing" in FIG. 12 includes identifying the document structure. Specifically, it may include identifying the document's topic (title), its subtitles, and the paragraphs and text under each subtitle; the identified document structure can be described with a tree structure. "General document parsing" also includes identifying which of the three text types above (explicit FAQ, structured and semi-structured, natural language) each part of the document to be processed belongs to. After the text type of each part is identified, the next step selects the corresponding question generation model to generate questions and locate the answer boundaries.
FIG. 13 is a flowchart of answer boundary localization in the question generation method provided by an embodiment of the present invention. As shown in FIG. 13, in one embodiment, step S140 in FIG. 11, performing answer boundary localization for the generated questions, may specifically include:
Step S910: using a bi-directional attention flow network to predict the start and end positions of the answer segments corresponding to the question;
Step S920: ranking the answer segments with a learning-to-rank model, and locating the answer boundary for the question according to the ranking result, where the features of the learning-to-rank model include the start and end positions of the answer segments.
Specifically, answer boundary localization may include the following processing steps:
(1) Use Bi-DAF (Bi-Directional Attention Flow network) as the reading comprehension model, which can accurately predict the start and end positions of the answer.
The bi-directional attention flow network is a hierarchical, multi-stage structure that can model context at different levels of granularity. It models context at the character level and the word level, and uses bi-directional attention flow to obtain a question-aware representation of the context.
An exemplary bi-directional attention flow network may include the following layers:
1. Character embedding layer
This layer maps each word to a fixed-size vector; it can be implemented with a character-level convolutional neural network (character-level CNN).
2. Word embedding layer
A pre-trained word embedding model can be used to map each word to a fixed-size vector.
3. Contextual embedding layer
This layer adds a contextual cue to each word; the first three layers are applied to both the question and the context.
4. Attention flow layer
This layer combines the question and context vectors to produce a set of question-aware feature vectors.
5. Modeling layer
A recurrent neural network can be used to scan the context.
6. Output layer
This layer provides the answer to the question.
(2) Use LTR (Learning to Rank) to rank the answer segments.
The start and end positions predicted in step (1) serve as one feature of the LTR in step (2). Step (2) uses LTR to find the answer corresponding to the question from the long passage of text that follows it, where the LTR model ranks the answer segments according to question-answer features.
Learning to rank is a supervised ranking method. With LTR, a relevance function can be constructed and results ranked by relevance. Traditional ranking methods find it hard to fuse multiple sources of information and are prone to overfitting. Learning to rank easily fuses many features, rests on a mature theoretical foundation, optimizes its parameters iteratively, and has well-established theory for handling problems such as sparsity and overfitting.
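A minimal linear scorer illustrates how the predicted span positions from step (1) become one feature among several in the step (2) ranker. The feature names and weights here are illustrative, not learned:

```python
def score(segment, weights):
    """Relevance score: weighted sum of question-answer features, including
    the start/end position confidences predicted by the reading-comprehension model."""
    return sum(weights[name] * value for name, value in segment.items())

weights = {"span_start_prob": 2.0, "span_end_prob": 2.0, "qp_match": 1.5, "content_quality": 0.5}
segments = [
    {"span_start_prob": 0.9, "span_end_prob": 0.8, "qp_match": 0.7, "content_quality": 0.6},
    {"span_start_prob": 0.2, "span_end_prob": 0.3, "qp_match": 0.9, "content_quality": 0.9},
]
ranked = sorted(segments, key=lambda s: score(s, weights), reverse=True)
print(score(ranked[0], weights) > score(ranked[1], weights))  # True
```

In a real LTR setup the weights (or a more expressive model such as a gradient-boosted ranker) would be fitted from labeled question-answer pairs rather than set by hand.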
在这种实施方式中,首先对目标文档分段,可识别自然段落或使用列表提取的方法进行分段。然后对段落提取特征并排序,如使用“领域特征”和/或“匹配特征”等相关工具提取特征。其中特征可包括以下几种:In this embodiment, the target document is first segmented, and natural paragraphs can be identified or segmented using a list extraction method. The paragraphs are then extracted and ranked, for example, using relevant tools such as "Domain Features" and/or "Match Features". Features may include the following:
问题答案匹配特征:对齐匹配技术、DNN(Deep Neural Networks,深度神经网络)QP匹配技术、结合知识图谱的QP匹配技术,其中Q表示问题(Query),P表示分段的段落(Paragrap),对齐匹配包括Q和P对齐;Question-answer matching features: alignment matching technology, DNN (Deep Neural Networks, deep neural network) QP matching technology, QP matching technology combined with knowledge graph, where Q represents question (Query), P represents segmented paragraph (Paragrap), alignment Matching includes Q and P alignment;
领域特征:实体问答特征、how why问答特征、是非问答特征、描述类问答特征;Domain features: entity question answering feature, how why question answering feature, yes/no question answering feature, descriptive question answering feature;
结构特征:列表结构特征;Structural features: list structural features;
文本特征:内容质量特征;Text features: content quality features;
交叉校验特征:文本聚合特征。Cross-validation features: Text aggregation features.
图14为本发明实施例提供的问题生成方法的离线部分的示意图。如图14所示,离线部分的主要生成模型和数据,可包括两个部分和五个模型。其中,两个部分包括文档标题数据和问答标注数据。五个模型包括显式问题生成模型、结构化和半结构化问题生成模型、自然语言问题生成模型、Bi-DAF模型(阅读理解模型)和LTR模型(答案片段排序模型)。FIG. 14 is a schematic diagram of an offline part of a problem generation method provided by an embodiment of the present invention. As shown in Figure 14, the main generation models and data of the offline part can include two parts and five models. Among them, two parts include document title data and question and answer annotation data. The five models include explicit question generation model, structured and semi-structured question generation model, natural language question generation model, Bi-DAF model (reading comprehension model) and LTR model (answer fragment ranking model).
FIG. 15 is a structural block diagram of a question generation apparatus provided by an embodiment of the present invention. As shown in FIG. 15, the question generation apparatus of the embodiment of the present invention includes:
a text type identification unit 100, configured to identify the text type of a document to be processed according to its text structure;
a generation model selection unit 200, configured to select a generation model corresponding to the text type, the generation model including at least one of an explicit question generation model, a structured and semi-structured question generation model, and a natural language question generation model; and
a question generation unit 300, configured to generate questions for the document to be processed using the selected generation model.
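The three-unit flow can be sketched as a simple dispatch: identify the text type (unit 100), map it to a generation model (unit 200), and invoke that model (unit 300). The structure checks and model names below are simplified assumptions for illustration only:

```python
# Sketch of the unit 100 -> 200 -> 300 flow. The structure markers
# ("Q:"/"A:", "#") and model names are hypothetical placeholders.

def identify_text_type(doc):
    """Unit 100: classify the document by its text structure."""
    if "Q:" in doc and "A:" in doc:
        return "qa_structure"
    if doc.lstrip().startswith("#"):   # assumed heading marker
        return "heading_structure"
    return "plain_text"

MODEL_BY_TYPE = {                      # Unit 200: type -> model
    "qa_structure": "explicit_question_model",
    "heading_structure": "structured_semi_structured_model",
    "plain_text": "natural_language_model",
}

def generate_questions(doc):
    """Unit 300: dispatch to the model chosen for the text type."""
    model = MODEL_BY_TYPE[identify_text_type(doc)]
    return model  # a real implementation would run the model here

chosen = generate_questions("Q: how to log in?\nA: use your email.")
```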
FIG. 16 is a structural block diagram of a question generation apparatus provided by an embodiment of the present invention. As shown in FIG. 16, in one embodiment, the text type identification unit 100 includes a first identification subunit 110, configured to identify whether the text structure of the document to be processed contains a question-answer structure;
the generation model selection unit 200 includes a first selection subunit 210, configured to select the explicit question generation model as the generation model corresponding to the text type if the text structure of the document to be processed contains a question-answer structure; and
the question generation unit 300 includes a first generation subunit 310, configured to generate questions for the document to be processed using the explicit question generation model.
In one embodiment, the first generation subunit 310 is further configured to:
determine whether the question part of the question-answer structure matches the corresponding answer part, and filter out the text corresponding to successfully matched question-answer structures as candidate text;
classify the filtered candidate text using a first recurrent neural network model to identify explicit questions in the candidate text; and
use the explicit questions as the questions generated for the document to be processed.
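The three steps above can be sketched as a small pipeline. The match check and the classifier below are toy rule-based stand-ins for the matching step and the first recurrent neural network model, which the patent does not specify in detail:

```python
# Sketch of the explicit-question pipeline in subunit 310:
# (1) keep Q/A pairs whose answer matches the question,
# (2) classify the question text, (3) emit explicit questions.
# Both rules are assumed stand-ins for the RNN-based components.

def qa_matches(question, answer):
    """Toy match check: the answer shares at least one question word."""
    q = set(question.lower().replace("?", "").split())
    return any(w in q for w in answer.lower().split())

def is_explicit_question(text):
    """Toy classifier standing in for the first RNN model."""
    return text.rstrip().endswith("?") and len(text.split()) >= 3

def extract_explicit_questions(qa_pairs):
    candidates = [q for q, a in qa_pairs if qa_matches(q, a)]
    return [q for q in candidates if is_explicit_question(q)]

pairs = [("How do I reset my password?", "Reset the password in settings."),
         ("Hello?", "This answer is unrelated text.")]
questions = extract_explicit_questions(pairs)
```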
In one embodiment, the text type identification unit 100 includes a second identification subunit 120, configured to identify whether the text structure of the document to be processed contains a heading structure, where the heading structure includes a heading or a table;
the generation model selection unit 200 includes a second selection subunit 220, configured to select the structured and semi-structured question generation model as the generation model corresponding to the text type if the text structure of the document to be processed contains a heading structure; and
the question generation unit 300 includes a second generation subunit 320, configured to generate questions for the document to be processed using the structured and semi-structured question generation model.
FIG. 17 is a structural block diagram of the second generation subunit of the question generation apparatus provided by an embodiment of the present invention. As shown in FIG. 17, in one embodiment, the second generation subunit 320 includes:
a paraphrase acquisition subunit 321, configured to acquire attribute paraphrases related to a heading when the text structure of the document to be processed contains the heading; and
a paraphrase question generation subunit 322, configured to generate questions according to the attribute paraphrases.
In one embodiment, the paraphrase acquisition subunit 321 is further configured to:
acquire search click-through logs related to the heading;
mine the search click-through logs to obtain attribute paraphrases related to the heading; and
store the attribute paraphrases in an attribute paraphrase table.
In one embodiment, the paraphrase question generation subunit 322 is further configured to:
generate questions from the attribute paraphrases using a first encoder-decoder model; or
query the attribute paraphrase table for attribute paraphrases related to the heading, and generate questions from the retrieved attribute paraphrases.
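The mining, table construction, and lookup steps can be sketched as follows. The log entries, frequency threshold, and question template are all illustrative assumptions; the patent's encoder-decoder generation is replaced here by a simple template:

```python
# Sketch of attribute-paraphrase mining and lookup (subunits 321/322):
# count attribute phrases that co-occur with a title in search click
# logs, store the frequent ones in a table, and turn a table lookup
# into questions. Data and the template are hypothetical.

from collections import Counter

def mine_paraphrases(click_logs, title, min_count=2):
    """Keep attribute phrases clicked with the title >= min_count times."""
    counts = Counter(attr for t, attr in click_logs if t == title)
    return [a for a, c in counts.items() if c >= min_count]

def build_table(click_logs, titles):
    return {t: mine_paraphrases(click_logs, t) for t in titles}

def questions_from_table(table, title):
    """Generate one question per stored paraphrase (template assumed)."""
    return [f"What is the {attr} of {title}?" for attr in table.get(title, [])]

logs = [("ProductX", "price"), ("ProductX", "price"),
        ("ProductX", "release date"), ("ProductX", "price")]
table = build_table(logs, ["ProductX"])
qs = questions_from_table(table, "ProductX")
```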
Referring to FIG. 16, in one embodiment, the text type identification unit 100 includes a third identification subunit 130, configured to identify whether the text structure of the document to be processed contains a question-answer structure or a heading structure, where the heading structure includes a heading or a table;
the generation model selection unit 200 includes a third selection subunit 230, configured to select the natural language question generation model as the generation model corresponding to the text type if the text structure of the document to be processed contains neither a question-answer structure nor a heading structure; and
the question generation unit 300 includes a third generation subunit 330, configured to generate questions for the document to be processed using the natural language question generation model.
In one embodiment, the third generation subunit 330 is further configured to:
filter target sentences from the document to be processed using a second recurrent neural network model, where the target sentences are semantically complete sentences;
select candidate answer segments from the target sentences using a third recurrent neural network model; and
generate questions from the candidate answer segments using a second encoder-decoder model.
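The three-stage pipeline above can be sketched with rule-based stand-ins. The filter, the span selector, and the template below replace, respectively, the second RNN, the third RNN, and the second encoder-decoder model, whose internals the patent does not specify:

```python
# Sketch of the natural-language pipeline in subunit 330:
# (1) filter semantically complete sentences, (2) pick a candidate
# answer segment, (3) generate a question from it. Each rule is an
# assumed stand-in for the corresponding neural model.

def filter_target_sentences(doc):
    """Stand-in for the second RNN: keep sentences that look complete."""
    sents = [s.strip() for s in doc.split(".") if s.strip()]
    return [s for s in sents if len(s.split()) >= 4]

def select_answer_segment(sentence):
    """Stand-in for the third RNN: take the clause after 'is', if any."""
    words = sentence.split()
    return " ".join(words[words.index("is") + 1:]) if "is" in words else sentence

def generate_question(segment):
    """Stand-in for the encoder-decoder: wrap the segment in a template."""
    return f"What is {segment}?"

doc = "Paris is the capital of France. Short note."
sentences = filter_target_sentences(doc)
question = generate_question(select_answer_segment(sentences[0]))
```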
FIG. 18 is a structural block diagram of a question generation apparatus provided by an embodiment of the present invention. As shown in FIG. 18, in one embodiment, the apparatus further includes an answer boundary localization unit 400, configured to localize answer boundaries for the generated questions.
In one embodiment, the answer boundary localization unit 400 is further configured to:
predict the start and end positions of the answer segments corresponding to the questions using a bidirectional attention flow network; and
rank the answer segments using a learning-to-rank model and localize the answer boundaries of the questions according to the ranking result, where the features of the learning-to-rank model include the start and end positions of the answer segments.
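This boundary-localization step can be sketched as ranking candidate spans whose features include the predicted start/end positions. The fixed candidate tuples below stand in for Bi-DAF outputs, and the linear weights are illustrative assumptions:

```python
# Sketch of answer-boundary localization (unit 400): each candidate
# span carries predicted start/end positions plus a span probability
# (here fixed numbers standing in for Bi-DAF outputs), and a simple
# linear ranker whose features include those positions picks the
# final boundary. Weights are hypothetical.

def rank_spans(spans, weights=(1.0, -0.01)):
    """Score = w_prob * span_prob + w_len * span_length; best first."""
    w_prob, w_len = weights
    def score(span):
        start, end, prob = span
        return w_prob * prob + w_len * (end - start)
    return sorted(spans, key=score, reverse=True)

# (start, end, probability) candidates for one question.
candidates = [(10, 40, 0.60), (12, 18, 0.58), (5, 90, 0.62)]
best = rank_spans(candidates)[0]
answer_boundary = (best[0], best[1])
```

With the length penalty, the short high-probability span wins over the long one despite its slightly lower raw probability.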
For the functions of the units of the question generation apparatus of the embodiment of the present invention, reference may be made to the corresponding description of the above method, which is not repeated here.
In one possible design, the question generation apparatus includes a processor and a memory, the memory storing a program that supports the apparatus in performing the above question generation method, and the processor being configured to execute the program stored in the memory. The question generation apparatus may further include a communication interface for communicating with other devices or a communication network.
FIG. 19 is a structural block diagram of a question generation apparatus provided by an embodiment of the present invention. As shown in FIG. 19, the apparatus includes a memory 101 and a processor 102, the memory 101 storing a computer program executable on the processor 102. The processor 102 implements the question generation method of the above embodiments when executing the computer program. There may be one or more memories 101 and processors 102.
The apparatus further includes:
a communication interface 103 for communicating with external devices and exchanging data.
The memory 101 may include high-speed RAM, and may also include non-volatile memory, such as at least one magnetic disk memory.
If the memory 101, the processor 102, and the communication interface 103 are implemented independently, they may be connected to and communicate with one another through a bus. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in FIG. 19, but this does not mean that there is only one bus or one type of bus.
Optionally, in a specific implementation, if the memory 101, the processor 102, and the communication interface 103 are integrated on a single chip, they may communicate with one another through internal interfaces.
In another aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements any one of the above question generation methods.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", "some examples", and the like means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. Moreover, the described specific features, structures, materials, or characteristics may be combined in a suitable manner in any one or more embodiments or examples. Furthermore, provided they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or as implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means two or more, unless otherwise expressly and specifically defined.
Any process or method description in the flowcharts, or otherwise described herein, may be understood to represent a module, segment, or portion of code including one or more executable instructions for implementing a specific logical function or step of the process. The scope of the preferred embodiments of the present invention includes additional implementations in which the functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in the flowcharts, or otherwise described herein, may for example be considered an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, apparatus, or device). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection with one or more wirings (an electronic device), a portable computer diskette (a magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber-optic device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, as the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that the various parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and so on.
Those of ordinary skill in the art will understand that all or part of the steps of the methods of the above embodiments may be completed by a program instructing the relevant hardware, the program being stored in a computer-readable storage medium and, when executed, including one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, or each unit may exist physically alone, or two or more units may be integrated in one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above are only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various changes or substitutions within the technical scope disclosed by the present invention, and these should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.
Claims (24)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811641895.1A CN109726274B (en) | 2018-12-29 | 2018-12-29 | Question generation method, device and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN109726274A CN109726274A (en) | 2019-05-07 |
| CN109726274B true CN109726274B (en) | 2021-04-30 |
Family
ID=66299312
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201811641895.1A Active CN109726274B (en) | 2018-12-29 | 2018-12-29 | Question generation method, device and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN109726274B (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2024233178A1 (en) * | 2023-05-05 | 2024-11-14 | Pryon Incorporated | Document processing for frequently-asked-questions detection in natural language content |
Families Citing this family (17)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111553159B (en) * | 2020-04-24 | 2021-08-06 | 中国科学院空天信息创新研究院 | A method and system for generating a question |
| CN113672708B (en) * | 2020-05-13 | 2024-10-08 | 武汉Tcl集团工业研究院有限公司 | Language model training method, question-answer pair generation method, device and equipment |
| CN111858883B (en) * | 2020-06-24 | 2025-01-24 | 北京百度网讯科技有限公司 | Method, device, electronic device and storage medium for generating triplet samples |
| CN111538825B (en) | 2020-07-03 | 2020-10-16 | 支付宝(杭州)信息技术有限公司 | Knowledge question and answer method, device, system, equipment and storage medium |
| CN114118072A (en) * | 2020-09-01 | 2022-03-01 | 上海智臻智能网络科技股份有限公司 | Document structuring method and device, electronic equipment and computer readable storage medium |
| CN112163076B (en) * | 2020-09-27 | 2022-09-13 | 北京字节跳动网络技术有限公司 | Knowledge question bank construction method, question and answer processing method, device, equipment and medium |
| CN112347229B (en) * | 2020-11-12 | 2021-07-20 | 润联软件系统(深圳)有限公司 | Answer extraction method and device, computer equipment and storage medium |
| CN112487139B (en) * | 2020-11-27 | 2023-07-14 | 平安科技(深圳)有限公司 | Text-based automatic question setting method and device and computer equipment |
| CN112800177B (en) * | 2020-12-31 | 2021-09-07 | 北京智源人工智能研究院 | Method and device for automatic generation of FAQ knowledge base based on complex data types |
| CN112800032B (en) * | 2021-02-24 | 2021-08-31 | 北京智源人工智能研究院 | Method and device for automatic construction of FAQ knowledge base based on tabular data |
| CN113268561B (en) * | 2021-04-25 | 2021-12-14 | 中国科学技术大学 | Problem generation method based on multi-task joint training |
| CN113919367A (en) * | 2021-09-09 | 2022-01-11 | 中国科学院自动化研究所 | Abstract acquisition method, device, equipment, medium and product |
| CN114065765A (en) * | 2021-10-29 | 2022-02-18 | 北京来也网络科技有限公司 | Weapon and equipment text processing method, device and electronic device combining AI and RPA |
| CN114491152B (en) * | 2021-12-02 | 2023-10-31 | 南京硅基智能科技有限公司 | A summary video generation method, storage medium, and electronic device |
| CN114611484B (en) * | 2022-02-17 | 2025-07-08 | 中国人民大学 | Text analysis method, system, equipment and medium based on text structure |
| CN116069910A (en) * | 2022-12-30 | 2023-05-05 | 阿里巴巴(中国)有限公司 | Dialog processing method, device and system |
| CN116992112B (en) * | 2023-06-30 | 2025-07-25 | 百度在线网络技术(北京)有限公司 | Data generation method and device, electronic equipment and medium |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108363743A (en) * | 2018-01-24 | 2018-08-03 | 清华大学深圳研究生院 | A kind of intelligence questions generation method, device and computer readable storage medium |
| CN108846130A (en) * | 2018-06-29 | 2018-11-20 | 北京百度网讯科技有限公司 | A kind of question text generation method, device, equipment and medium |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10503786B2 (en) * | 2015-06-16 | 2019-12-10 | International Business Machines Corporation | Defining dynamic topic structures for topic oriented question answer systems |
Non-Patent Citations (2)
| Title |
|---|
| Learning to Ask: Neural Question Generation for Reading Comprehension; Xinya Du et al.; ACL; 2017-12-31; pp. 1342-1352 * |
| Question Generation With Doubly Adversarial Nets; Junwei Bao et al.; IEEE; 2018-11-30; pp. 2230-2239 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN109726274A (en) | 2019-05-07 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||