CN111026834A

CN111026834A - Question and answer corpus generation method and system

Info

Publication number: CN111026834A
Application number: CN201911258482.XA
Authority: CN
Inventors: 许建伟
Original assignee: AI Speech Ltd
Current assignee: AI Speech Ltd
Priority date: 2019-12-10
Filing date: 2019-12-10
Publication date: 2020-04-17
Anticipated expiration: 2039-12-10
Also published as: CN111026834B

Abstract

An embodiment of the invention provides a question-answer corpus generating method. The method comprises: receiving a corpus text; detecting the text amount of the corpus text, and determining the entity and the attribute of the corpus text for a knowledge graph when the text amount is smaller than a preset threshold; querying a regular expression matched with the corpus text based on the entity and the attribute; determining fuzzy statements of the corpus text based on the regular expression, inputting the fuzzy statements into the knowledge graph, and determining a corresponding text of the corpus text according to an inverted index; and performing corpus generation on the corpus text and the corresponding text through the regular expression to construct a plurality of paired question-answer dialogue corpora. An embodiment of the invention also provides a question-answer corpus generating system. The embodiments use fuzzy search in the knowledge graph, which improves retrieval recall, and use an inverted index in knowledge graph retrieval, which improves retrieval efficiency. As a result, a plurality of paired question-answer dialogue corpora can be generated from both single-sentence texts and text passages.

Description

Question and answer corpus generation method and system

Technical Field

The invention relates to the field of knowledge graph question answering, in particular to a question answering corpus generating method and system.

Background

The answer quality of a reading-comprehension question-answering language model depends on a large amount of high-quality paired question-answer corpora. To obtain such high-quality dialogue corpora, a corpus generation method is generally used.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:

Existing corpus generating methods have difficulty producing paired corpora such as dialogue question-answer pairs, because the training text corpora that are easy to obtain do not appear in pairs but are independent sentences. It is difficult to generate conversational question-answer pair corpora from such single sentences:

1. If paired question-answer corpora are to be constructed from individual sentences, a knowledge graph is required to answer the questions (or to generate questions from the answers). However, querying answers from a knowledge graph is slow. In addition, sentences phrase their questions in different ways and fuzzy search is difficult to use in a knowledge graph, so for the same content a corresponding answer may not be obtained merely because the question is phrased differently.

2. When the text is a long passage, a reading comprehension model is used to extract the paired question-answer dialogues it contains. However, the reading comprehension model itself requires a large amount of labeled, high-quality paired question-answer text corpora for training.

Disclosure of Invention

The invention aims to solve at least the following problems in the prior art: in the process of generating paired question-answer corpora, knowledge graph queries are slow, fuzzy search cannot be used, and paired corpora in dialogue question-answer form are difficult to obtain.

In a first aspect, an embodiment of the present invention provides a question-answer corpus generating method, including:

receiving a corpus text;

detecting the text quantity of the corpus text, and determining the entity and the attribute of the corpus text for the knowledge graph when the text quantity is smaller than a preset threshold value;

querying a regular expression matched with the corpus text based on the entity and the attribute;

determining a fuzzy statement of the corpus text based on the regular expression, inputting the fuzzy statement into a knowledge graph, and determining a corresponding text of the corpus text according to an inverted index, wherein the corresponding text comprises: answer text and/or question text;

and performing corpus generation on the corpus text and the corresponding text through the regular expression to construct a plurality of paired question-answer dialogue corpuses.

In a second aspect, an embodiment of the present invention provides a question-answer corpus generating system, including:

the corpus receiving program module is used for receiving corpus texts;

the information determining program module is used for detecting the text quantity of the corpus text, and determining the entity and the attribute of the corpus text for the knowledge graph when the text quantity is smaller than a preset threshold value;

a regular expression query program module, configured to query a regular expression matched with the corpus text based on the entity and the attribute;

a corresponding text determination program module, configured to determine a fuzzy statement of the corpus text based on the regular expression, input the fuzzy statement to a knowledge graph, and determine a corresponding text of the corpus text according to an inverted index, where the corresponding text includes: answer text and/or question text;

and the question-answer corpus generating program module is used for performing corpus generation on the corpus text and the corresponding text through the regular expression so as to construct a plurality of paired question-answer dialogue corpuses.

In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the question-answer corpus generating method according to any embodiment of the invention.

In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the question-answer corpus generating method according to any embodiment of the present invention.

The embodiments of the invention have the following beneficial effects: fuzzy search is used in the knowledge graph, which improves retrieval recall; an inverted index is used in knowledge graph retrieval, which improves retrieval efficiency. Therefore, a plurality of paired question-answer dialogue corpora can be generated from both single-sentence texts and text passages to train the reading-comprehension question-answering language model.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

Fig. 1 is a flowchart of a method for generating a corpus of questions and answers according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a question-answer corpus generating system according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a flowchart of a method for generating a corpus of questions and answers according to an embodiment of the present invention, which includes the following steps:

S11: receiving a corpus text;

S12: detecting the text quantity of the corpus text, and determining the entity and the attribute of the corpus text for the knowledge graph when the text quantity is smaller than a preset threshold value;

S13: querying a regular expression matched with the corpus text based on the entity and the attribute;

S14: determining a fuzzy statement of the corpus text based on the regular expression, inputting the fuzzy statement into a knowledge graph, and determining a corresponding text of the corpus text according to an inverted index, wherein the corresponding text comprises: answer text and/or question text;

S15: and performing corpus generation on the corpus text and the corresponding text through the regular expression to construct a plurality of paired question-answer dialogue corpuses.

For step S11, a corpus text is received. The corpus text is usually text that is easy to collect, for example, a dialogue posted in a forum or a sentence displayed on a web page. Such sentences are not only easy to obtain but also often carry a latent "question-and-answer" pattern. For example, someone posts the question "who invented the theory of relativity" on the web. "Who invented the theory of relativity" is then collected as the corpus text of the method.

For step S12, the text amount of the corpus text, i.e. the number of words in the text, is detected. Because these corpus texts are collected from easily available sources, some may be a single sentence and some may be a whole passage. The two kinds of text are distinguished by word count. For a single sentence with a small number of words, its entity and attribute for the knowledge graph are determined. An entity is an abstraction of an objective individual; a person, a movie, or a sentence can each be regarded as an entity. An attribute is an abstraction of a property of an entity or of a relationship between entities. For example, in "who invented the theory of relativity", "theory of relativity" is the entity and "invention" is the corresponding attribute.
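The routing described for step S12 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the threshold value, the `KNOWN_ENTITIES` set, the `ATTRIBUTE_KEYWORDS` mapping and the keyword-matching strategy are all assumptions introduced for the example, since the text does not specify how entities and attributes are recognized.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values: the description does not give a concrete threshold
# or say how entities and attributes are recognized.
MAX_SINGLE_SENTENCE_WORDS = 12
KNOWN_ENTITIES = ["theory of relativity"]
ATTRIBUTE_KEYWORDS = {"invented": "invention", "proposed": "invention"}


@dataclass
class Analysis:
    is_short: bool            # True when the text amount is below the threshold
    entity: Optional[str]     # entity found for the knowledge graph, if any
    attribute: Optional[str]  # attribute found for the knowledge graph, if any


def analyze(corpus_text: str) -> Analysis:
    """Measure the text amount and, for short texts, look up entity and attribute."""
    words = corpus_text.split()
    if len(words) >= MAX_SINGLE_SENTENCE_WORDS:
        # Long passages take the segment-splitting path instead.
        return Analysis(False, None, None)
    lowered = corpus_text.lower()
    entity = next((e for e in KNOWN_ENTITIES if e in lowered), None)
    attribute = next(
        (attr for keyword, attr in ATTRIBUTE_KEYWORDS.items() if keyword in lowered),
        None,
    )
    return Analysis(True, entity, attribute)


result = analyze("Who invented the theory of relativity")
print(result.entity, "/", result.attribute)
```

A real system would presumably use an entity linker backed by the knowledge graph rather than substring matching; the sketch only shows where the threshold decision sits.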

For step S13, based on the determined entity "theory of relativity" and attribute "invention", the regular expressions matching "who invented the theory of relativity" are queried. For example, the following is matched: ${#basicconcept} is (who | which person) (invented | proposed), where ${#basicconcept} is the entity.
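One way to picture the regular expression query of step S13 is a template store keyed by attribute, with the entity substituted into a placeholder. The sketch below is an assumption-laden illustration: the `TEMPLATES` dictionary, the `{entity}` placeholder (standing in for `${#basicconcept}`) and the function name are hypothetical and do not come from the source.

```python
import re

# Hypothetical template store keyed by attribute. "{entity}" plays the role
# of the ${#basicconcept} placeholder in the description.
TEMPLATES = {
    "invention": [r"(who|which person) (invented|proposed) the {entity}\??"],
}


def find_matching_patterns(text: str, entity: str, attribute: str) -> list:
    """Return the instantiated patterns for this attribute that match the text."""
    matches = []
    for template in TEMPLATES.get(attribute, []):
        # Escape the entity so that any special characters in it are literal.
        pattern = template.replace("{entity}", re.escape(entity))
        if re.fullmatch(pattern, text, flags=re.IGNORECASE):
            matches.append(pattern)
    return matches


patterns = find_matching_patterns(
    "Who invented the theory of relativity", "theory of relativity", "invention"
)
print(len(patterns))
```

Keying the store by attribute keeps the lookup cheap: only templates that could plausibly describe the detected attribute are tried against the text.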

For step S14, the fuzzy statements of "who invented the theory of relativity" are determined based on the regular expression determined in step S13. For example: the theory of relativity is (who | whom | which person | which expert) (invented | proposed) (optional sentence-final particle | "empty" (meaning that this part may be absent)). The fuzzy statement expands into many different phrasings, which are input into the knowledge graph. This ensures the diversity of the sentences, and because the knowledge graph is searched through an inverted index, the corresponding text of the corpus text can be found while the search speed is improved.
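The expansion of one pattern into many fuzzy statements can be sketched as a Cartesian product over the alternation groups. The example below assumes a flat pattern whose only operators are non-nested `(a|b|...)` groups, where an empty alternative models the "empty" option; the function name `expand` and the English pattern are illustrative assumptions.

```python
import itertools
import re


def expand(pattern: str) -> list:
    """Expand a flat pattern whose only operators are (a|b|...) groups."""
    # re.split with one capture group alternates literal text (even indexes)
    # and group bodies (odd indexes).
    parts = re.split(r"\(([^()]*)\)", pattern)
    choices = [
        part.split("|") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    variants = ("".join(combo) for combo in itertools.product(*choices))
    # Collapse runs of whitespace left behind by empty alternatives.
    return [" ".join(variant.split()) for variant in variants]


variants = expand(
    "(who|which person) (invented|proposed) the theory of relativity(| exactly)"
)
for variant in variants:
    print(variant)
```

With two alternatives in each of the three groups, the pattern yields eight surface forms, each of which can be submitted to the knowledge graph as a separate query.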

For example, Q: "who invented the theory of relativity"

The answer is obtained through the knowledge graph, A: "Einstein".

For step S15, corpus generation is performed on the corpus text and the corresponding text through the regular expression, for example:

"Who proposed the theory of relativity" "Einstein"

"Which person invented the theory of relativity" "Einstein"

"By whom was the theory of relativity proposed" "Einstein"

"Which expert invented the theory of relativity" "Einstein"

A plurality of paired question-answer dialogue corpora are thus generated, providing sufficient paired corpora for training the reading-comprehension question-answering language model.
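As a rough illustration of step S15, the question variants can be paired with the retrieved answer and written out as training records. The record layout (`question` / `answer` keys) and the JSON Lines output are assumptions chosen for the example; the source does not prescribe a storage format.

```python
import json


def build_qa_pairs(question_variants, answer):
    """Pair every distinct question variant with the same answer text."""
    seen = set()
    pairs = []
    for question in question_variants:
        normalized = " ".join(question.split())
        if normalized in seen:
            continue  # skip duplicates produced by overlapping alternatives
        seen.add(normalized)
        pairs.append({"question": normalized, "answer": answer})
    return pairs


def to_jsonl(pairs):
    """Serialize the pairs as one JSON object per line."""
    return "\n".join(json.dumps(pair, ensure_ascii=False) for pair in pairs)


pairs = build_qa_pairs(
    [
        "Who proposed the theory of relativity",
        "Which person invented the theory of relativity",
        "Who proposed  the theory of relativity",
    ],
    "Einstein",
)
print(to_jsonl(pairs))
```

Deduplicating after whitespace normalization keeps the corpus from being dominated by trivially identical questions.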

According to this embodiment, fuzzy search is used in the knowledge graph, which improves retrieval recall. Without fuzzy search, a query that differs from the expected phrasing by even one extra word may not be handled by the knowledge graph, and no result is obtained. An inverted index is used in knowledge graph retrieval, which improves retrieval efficiency. Therefore, a plurality of paired question-answer dialogue corpora can be generated to train the reading-comprehension question-answering language model.
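The efficiency gain from an inverted index can be illustrated with a toy knowledge graph. In the sketch below the triples, the token-level indexing scheme and the function names are assumptions for demonstration; the source does not describe how its index is built.

```python
from collections import defaultdict

# Toy knowledge graph as (entity, attribute, value) triples; contents are illustrative.
TRIPLES = [
    ("theory of relativity", "invention", "Einstein"),
    ("telephone", "invention", "Bell"),
]


def build_inverted_index(triples):
    """Map each token of the entity and attribute to the ids of triples containing it."""
    index = defaultdict(set)
    for triple_id, (entity, attribute, _value) in enumerate(triples):
        for token in entity.split() + [attribute]:
            index[token].add(triple_id)
    return index


def lookup(index, triples, entity, attribute):
    """Intersect the posting sets of all query tokens instead of scanning every triple."""
    tokens = entity.split() + [attribute]
    candidate_ids = set.intersection(*(index.get(token, set()) for token in tokens))
    return [triples[i][2] for i in sorted(candidate_ids)]


index = build_inverted_index(TRIPLES)
print(lookup(index, TRIPLES, "theory of relativity", "invention"))
```

Because each fuzzy statement resolves to the same entity and attribute tokens, many phrasings can share one cheap set intersection rather than each triggering a full graph scan.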

As an implementation manner, in this embodiment, when the text amount is greater than a preset threshold, the corpus text is divided into a plurality of corpus text segments;

respectively extracting entities and attributes of the corpus text segments for the knowledge graph;

querying a plurality of regular expressions matched with each of the corpus text segments based on the entities and the attributes;

determining fuzzy descriptions of the corpus text segments based on the regular expressions, inputting the fuzzy descriptions into a knowledge graph, and determining corresponding texts of the corpus text segments according to inverted indexes, wherein the corresponding texts comprise: answer text or question text;

determining a corpus abbreviation of the corpus text, and generating a plurality of triples of [ question texts, corpus abbreviations and answer texts ] through a plurality of corpus text segments and corresponding texts;

and performing corpus generation on the [ question text, corpus abbreviation and answer text ] triples through the regular expressions to construct a plurality of paired question-answer dialogue corpuses.

In this embodiment, when a corpus text is a whole passage, the corpus text is divided into a plurality of corpus text segments;

for example, "Qingshui Township is located 30 kilometers northeast of the Fugu county seat. It is adjacent to Huangpu village to the east and Hai temple village to the south, is connected to Chai village and the village of the five families of Zhao to the west, and is connected to Hartown to the north. The total area is 1667 square kilometers, and the total cultivated land area is 38265 mu. The township has 15 administrative villages, 80 natural villages, 2489 households and 10102 people, of which the agricultural population is 9868 and the land population is 8408. Qingshui Township lies on the loess plateau; rainfall is scarce all year round, spring is windy and sandy, and natural resources are relatively poor, with abundant coal resources only in the southeast. The township's economy is dominated by crop farming, and the main cash crops include broomcorn millet, foxtail millet, corn, potatoes and mung beans. In addition, Malus micromalus cultivation and livestock breeding have developed to a certain extent. The folk customs of Qingshui Township are simple and honest and public security is good; people live and work in peace, and no criminal cases have occurred for many years. The township has been rated by Yulin City and Fugu County as a safe township, a civilized market town and an advanced unit for comprehensive management of public security for many years."

This passage is divided into segments such as: "Qingshui Township is located 30 kilometers northeast of the Fugu county seat", "the total area is 1667 square kilometers", "the total cultivated land area is 38265 mu", and "the township's economy is dominated by crop farming, and the main cash crops include broomcorn millet, foxtail millet, corn, potatoes and mung beans" (because the text is long, not every segment is shown; only some of the corpus text segments are listed).

The entity and attribute of each corpus text segment are then extracted. For the segment "Qingshui Township is located 30 kilometers northeast of the Fugu county seat", the entity is "Qingshui Township" and the attribute is "located 30 kilometers northeast of the Fugu county seat". The remaining corpus text segments are handled in the same way and are not described in detail.

Similarly, for "Qingshui Township is located 30 kilometers northeast of the Fugu county seat", the corresponding regular expression is matched: ${#basicconcept} (is located | is) 30 (kilometers | km) (northeast | southeast | southwest | northwest) of the Fugu county seat.

The fuzzy statements of "Qingshui Township is located 30 kilometers northeast of the Fugu county seat" are determined based on the regular expression, each fuzzy statement is input into the knowledge graph, and the corresponding text of each corpus text segment is determined according to the inverted index. Because "Qingshui Township is located 30 kilometers northeast of the Fugu county seat" is an answer-type text, the corresponding question text is looked up in the knowledge graph, and the question text "what is the address of Qingshui Township" is obtained.

The corpus abbreviation of the corpus text is determined. That is, a short name is extracted to represent the long passage above, and this abbreviation also serves as the basis for the question and answer. For example, the passage is abbreviated as "Qingshui Township profile". A triple [what is the address of Qingshui Township, Qingshui Township profile, Qingshui Township is located 30 kilometers northeast of the Fugu county seat] is generated from the corpus text segments and the corresponding texts;

a plurality of paired question-answer dialogue corpora are constructed based on the determined triples, for example:

"What is the address of Qingshui Township" "Qingshui Township is located 30 kilometers northeast of the Fugu county seat"

"Where is Qingshui Township" "Qingshui Township is located 30 kilometers northeast of the Fugu county seat"

According to this embodiment, high-quality question-answer dialogue corpora can be extracted from corpus texts with a large text amount and then used to train the reading-comprehension question-answering language model.
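The long-text path can be sketched as splitting the passage into segments and emitting [question text, corpus abbreviation, answer text] triples. In this illustration the `QUESTION_TEMPLATES` mapping stands in for the knowledge graph lookup that supplies the question text, and the sentence splitter, names and sample passage are assumptions rather than details from the source.

```python
import re
from typing import NamedTuple


class Triple(NamedTuple):
    question: str
    corpus_abbreviation: str
    answer: str


# Hypothetical stand-in for the knowledge graph lookup that returns a
# question text for an answer-type segment.
QUESTION_TEMPLATES = {
    "is located": "What is the address of {entity}?",
    "total area": "What is the total area of {entity}?",
}


def split_segments(passage: str) -> list:
    """Split a passage into sentence-level corpus text segments."""
    return [s.strip() for s in re.split(r"[.!?]\s*", passage) if s.strip()]


def build_triples(passage: str, entity: str, abbreviation: str) -> list:
    """Create one triple for each segment that has a known question template."""
    triples = []
    for segment in split_segments(passage):
        for keyword, template in QUESTION_TEMPLATES.items():
            if keyword in segment:
                triples.append(
                    Triple(template.format(entity=entity), abbreviation, segment)
                )
    return triples


PASSAGE = (
    "Qingshui Township is located 30 kilometers northeast of the Fugu county seat. "
    "The total area is 1667 square kilometers."
)
for triple in build_triples(PASSAGE, "Qingshui Township", "Qingshui Township profile"):
    print(triple)
```

Carrying the corpus abbreviation in every triple lets a downstream reader tie each question-answer pair back to the passage it was drawn from.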

As an implementation manner, in this embodiment, after querying the regular expression matched with the corpus text, the method includes preprocessing the regular expression, including:

and sequentially detecting each character of the regular expression, and when any character is a preset wildcard character, converting any character which is the preset wildcard character into a specified character.

In the present embodiment, the algorithm places constraints on the regular expression, so the regular expression cannot be used directly: wildcard characters such as ".", "+", "$", "{" and "}" must not appear. Because these characters have special meaning in regular expressions, they are replaced. For example:

{ # company } (corporation)? (what.

Will be replaced by

# company? (what is.

As an embodiment, the sequentially detecting each character of the regular expression includes:

and judging each character of the regular expression one by one through a recursive algorithm.

According to this embodiment, introducing regular expressions into the algorithm is subject to certain constraints; adjusting the regular expressions avoids these constraints and improves the stability of the method.
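The character-by-character preprocessing with a recursive algorithm might look like the sketch below. The replacement table is an assumption: the source says wildcards are converted to "a specified character" without naming it, so here they are simply mapped to the empty string, and the function name is hypothetical.

```python
# Hypothetical replacement table: each preset wildcard is mapped to the
# empty string, since the specified replacement character is not given.
WILDCARD_REPLACEMENTS = {".": "", "+": "", "$": "", "{": "", "}": ""}


def strip_wildcards(pattern: str, position: int = 0) -> str:
    """Recursively examine one character at a time and replace preset wildcards."""
    if position == len(pattern):
        return ""  # base case: every character has been examined
    character = pattern[position]
    replaced = WILDCARD_REPLACEMENTS.get(character, character)
    return replaced + strip_wildcards(pattern, position + 1)


print(strip_wildcards("${#company}(Inc.)?"))
```

Recursion mirrors the description, but Python's default recursion limit makes this suitable only for short patterns; an iterative loop or `str.translate` would behave the same for longer inputs.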

Fig. 2 is a schematic structural diagram of a question-answer corpus generating system according to an embodiment of the present invention, which can execute the question-answer corpus generating method according to any of the above embodiments and is configured in a terminal.

The system for generating a corpus of questions and answers provided by this embodiment includes: a corpus receiving program module 11, an information determining program module 12, a regular expression query program module 13, a corresponding text determining program module 14 and a question and answer corpus generating program module 15.

The corpus receiving program module 11 is configured to receive a corpus text; the information determination program module 12 is configured to detect a text amount of the corpus text, and determine an entity and an attribute of the corpus text for a knowledge graph when the text amount is smaller than a preset threshold; the regular expression query program module 13 is configured to query a regular expression matched with the corpus text based on the entity and the attribute; the corresponding text determination program module 14 is configured to determine a fuzzy statement of the corpus text based on the regular expression, input the fuzzy statement to a knowledge graph, and determine a corresponding text of the corpus text according to an inverted index, where the corresponding text includes: answer text and/or question text; the question-answer corpus generating program module 15 is configured to perform corpus generation on the corpus text and the corresponding text through the regular expression, so as to construct a plurality of paired question-answer dialog corpuses.

Further, the information determination program module is further configured to: when the text amount is larger than a preset threshold value, dividing the corpus text into a plurality of corpus text sections;

the information determination program module is used for respectively extracting the entity and the attribute of each corpus text segment for the knowledge graph;

a regular expression query program module, configured to query, based on the entities and the attributes, a plurality of regular expressions that are matched with each of the corpus text segments;

a corresponding text determination program module, configured to determine a fuzzy statement of each corpus text segment based on each regular expression, input each fuzzy statement to a knowledge graph, and determine a corresponding text of each corpus text segment according to an inverted index, where the corresponding text includes: answer text or question text;

and the question-answer corpus generating program module is used for performing corpus generation on the [ question text, corpus abbreviation and answer text ] triple through the regular expressions to construct a plurality of paired question-answer dialogue corpuses.

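Further, the system further comprises a regular expression preprocessing program module, configured to sequentially detect each character of the regular expression and, when any character is a preset wildcard character, convert that character into a specified character.

Further, the sequentially detecting each character of the regular expression comprises: judging each character of the regular expression one by one through a recursive algorithm.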

The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the question and answer corpus generating method in any method embodiment;

as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:

receiving a corpus text;

The memory, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the question-answer corpus generating method in any of the above method embodiments.

The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the question-answer corpus generating method according to any embodiment of the invention.

The client of the embodiment of the present application exists in various forms, including but not limited to:

(1) Mobile communication devices: these devices are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communications. Such terminals include smart phones, multimedia phones, feature phones, and low-end phones, among others.

(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.

(3) Portable entertainment devices: these devices can display and play multimedia content. They include audio and video players, handheld game consoles, e-book readers, smart toys and portable in-vehicle navigation devices.

(4) Other electronic devices with data processing capabilities.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A question-answer corpus generating method comprises the following steps:

receiving a corpus text;

2. The method according to claim 1, wherein when the text amount is larger than a preset threshold, the corpus text is divided into a plurality of corpus text segments;

3. The method of claim 1, wherein after querying the regular expression that matches the corpus text, the method includes pre-processing the regular expression, including:

4. The method of claim 3, wherein the detecting each character of the regular expression in turn comprises:

5. A corpus generating system, comprising:

the corpus receiving program module is used for receiving corpus texts;

6. The system of claim 5, wherein the information determination program module is further to: when the text amount is larger than a preset threshold value, dividing the corpus text into a plurality of corpus text sections;

7. The system of claim 5, wherein the system further comprises: regular expression pre-handler module for

8. The system of claim 7, wherein the detecting each character of the regular expression in turn comprises:

9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1-4.

10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4.