CN111026834A - Question and answer corpus generation method and system - Google Patents

Info

Publication number
CN111026834A
Authority
CN
China
Prior art keywords
text
corpus
question
answer
regular expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911258482.XA
Other languages
Chinese (zh)
Other versions
CN111026834B (en)
Inventor
许建伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AI Speech Ltd
Priority to CN201911258482.XA
Publication of CN111026834A
Application granted
Publication of CN111026834B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a question-answer corpus generating method. The method comprises: receiving a corpus text; detecting the text amount of the corpus text and, when the text amount is smaller than a preset threshold, determining the entity and the attribute of the corpus text for a knowledge graph; querying a regular expression matching the corpus text based on the entity and the attribute; determining fuzzy statements of the corpus text based on the regular expression, inputting the fuzzy statements into the knowledge graph, and determining a corresponding text of the corpus text according to an inverted index; and performing corpus generation on the corpus text and the corresponding text through the regular expression to construct a plurality of paired question-answer dialogue corpora. An embodiment of the invention also provides a question-answer corpus generating system. The embodiments use fuzzy search in the knowledge graph, which improves retrieval recall, and use an inverted index in knowledge graph retrieval, which improves retrieval efficiency. As a result, a plurality of paired question-answer dialogue corpora can be generated from both short texts and text passages.

Description

Question and answer corpus generation method and system
Technical Field
The invention relates to the field of knowledge graph question answering, in particular to a question answering corpus generating method and system.
Background
The answering performance of a reading-comprehension question-answering language model depends on a large amount of high-quality paired question-answer corpora. To obtain such high-quality dialogue corpora, a corpus generation method is generally used.
In the process of implementing the invention, the inventor found at least the following problems in the related art:
Existing corpus generation methods have difficulty generating paired corpora such as dialogue question-answer pairs, and the training text corpora that are easy to obtain consist of independent sentences rather than pairs. It is difficult to generate dialogue question-answer pairs from such single sentences:
1. To construct paired question-answer corpora from individual sentences, a knowledge graph is required to answer the questions (or to generate questions from the answers). However, querying a knowledge graph for answers is slow. Moreover, the same question can be phrased in many ways and fuzzy search is difficult to use with a knowledge graph, so the corresponding answer may not be found for the same content simply because the phrasing differs.
2. For a long text such as a paragraph, a reading comprehension model is used to extract paired question-answer dialogues from it. However, the reading comprehension model itself requires a large amount of labeled, high-quality paired question-answer text corpora for training.
Disclosure of Invention
The invention aims to at least solve the following problems in the prior art: in the process of generating paired question-answer corpora, knowledge graph queries are slow, fuzzy search cannot be used, and paired corpora in dialogue question-answer form are difficult to obtain.
In a first aspect, an embodiment of the present invention provides a question-answer corpus generating method, including:
receiving a corpus text;
detecting the text quantity of the corpus text, and determining the entity and the attribute of the corpus text for the knowledge graph when the text quantity is smaller than a preset threshold value;
querying a regular expression matched with the corpus text based on the entity and the attribute;
determining a fuzzy statement of the corpus text based on the regular expression, inputting the fuzzy statement into a knowledge graph, and determining a corresponding text of the corpus text according to an inverted index, wherein the corresponding text comprises: answer text and/or question text;
and performing corpus generation on the corpus text and the corresponding text through the regular expression to construct a plurality of paired question-answer dialogue corpuses.
In a second aspect, an embodiment of the present invention provides a question-answer corpus generating system, including:
the corpus receiving program module is used for receiving corpus texts;
the information determining program module is used for detecting the text quantity of the corpus text, and determining the entity and the attribute of the corpus text for the knowledge graph when the text quantity is smaller than a preset threshold value;
a regular expression query program module, configured to query a regular expression matched with the corpus text based on the entity and the attribute;
a corresponding text determination program module, configured to determine a fuzzy statement of the corpus text based on the regular expression, input the fuzzy statement to a knowledge graph, and determine a corresponding text of the corpus text according to an inverted index, where the corresponding text includes: answer text and/or question text;
and the question-answer corpus generating program module is used for performing corpus generation on the corpus text and the corresponding text through the regular expression so as to construct a plurality of paired question-answer dialogue corpuses.
In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the question-answer corpus generating method according to any embodiment of the invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the question-answer corpus generating method according to any embodiment of the present invention.
The beneficial effects of the embodiments of the invention are as follows: fuzzy search is used in the knowledge graph, which improves retrieval recall, and an inverted index is used in knowledge graph retrieval, which improves retrieval efficiency. Therefore, a plurality of paired question-answer dialogue corpora can be generated from both short texts and text passages to train a reading-comprehension question-answering language model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a flowchart of a method for generating a corpus of questions and answers according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a question-answer corpus generating system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a method for generating a corpus of questions and answers according to an embodiment of the present invention, which includes the following steps:
s11: receiving a corpus text;
s12: detecting the text quantity of the corpus text, and determining the entity and the attribute of the corpus text for the knowledge graph when the text quantity is smaller than a preset threshold value;
s13: querying a regular expression matched with the corpus text based on the entity and the attribute;
s14: determining a fuzzy statement of the corpus text based on the regular expression, inputting the fuzzy statement into a knowledge graph, and determining a corresponding text of the corpus text according to an inverted index, wherein the corresponding text comprises: answer text and/or question text;
s15: and performing corpus generation on the corpus text and the corresponding text through the regular expression to construct a plurality of paired question-answer dialogue corpuses.
For step S11, a corpus text is received. The corpus text is usually text that is easy to collect, for example, a dialogue posted in a forum or a sentence displayed on a web page. Such sentences are not only easy to obtain but also often follow a latent "question-and-answer" pattern. For example, someone posts the question "Who invented the theory of relativity" on the web, and "Who invented the theory of relativity" is collected as the corpus text for the method.
For step S12, the text amount of the corpus text, i.e. the number of words in the text, is detected. Because these corpus texts come from easily available sources, some are single sentences and some are whole passages; the two kinds are distinguished by word count. For a single sentence with a small number of words, its entity and attribute for the knowledge graph are determined. An entity (Entity) is an abstraction of an objective individual: a person, a movie, or a sentence can each be regarded as an entity. An attribute (Property) is an abstraction of a relationship between entities. For example, in "Who invented the theory of relativity", "theory of relativity" is the entity and "invention" is the corresponding attribute.
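The dispatch on text amount described for step S12 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the threshold value, the whitespace-based word count, the sentence splitter, and all function names are assumptions introduced for the example.

```python
import re
from typing import List

PRESET_THRESHOLD = 30  # hypothetical threshold, in words


def text_amount(corpus_text: str) -> int:
    """Count whitespace-separated words as the text amount."""
    return len(corpus_text.split())


def split_into_segments(corpus_text: str) -> List[str]:
    """Split a long passage into sentence-like corpus text segments."""
    parts = re.split(r"(?<=[.!?])\s+", corpus_text.strip())
    return [p for p in parts if p]


def units_to_process(corpus_text: str, threshold: int = PRESET_THRESHOLD) -> List[str]:
    """Short texts are processed whole; longer texts are segmented first.

    The source does not say what happens when the amount equals the
    threshold; this sketch treats that case as a long text.
    """
    if text_amount(corpus_text) < threshold:
        return [corpus_text]
    return split_into_segments(corpus_text)
```

A single question such as "Who invented the theory of relativity" comes back unchanged as a one-element list, while a multi-sentence passage at or above the threshold comes back as one segment per sentence.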
For step S13, based on the determined entity "theory of relativity" and attribute "invention", a regular expression matching "Who invented the theory of relativity" is queried. For example, the following expression matches: ${#basicconcept} is (who | what person) (invented | proposed), where ${#basicconcept} stands for the entity.
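One way to picture the lookup in step S13 is a table of templates keyed by entity type and attribute, with the `${#basicconcept}` placeholder filled in before matching. The sketch below rests on assumptions: the `TEMPLATES` table, the English word order of the template, and the function names are hypothetical and only stand in for whatever store the method actually queries.

```python
import re
from typing import Optional

PLACEHOLDER = "${#basicconcept}"

# Hypothetical template store keyed by (entity type, attribute).
TEMPLATES = {
    ("basicconcept", "invention"): "(who|what person) (invented|proposed) " + PLACEHOLDER,
}


def compile_template(template: str, entity: str) -> "re.Pattern[str]":
    """Substitute the escaped entity for the placeholder and compile the result."""
    return re.compile(template.replace(PLACEHOLDER, re.escape(entity)), re.IGNORECASE)


def query_regex(corpus_text: str, entity: str, entity_type: str, attribute: str) -> Optional[str]:
    """Return the stored template if it matches the corpus text, else None."""
    template = TEMPLATES.get((entity_type, attribute))
    if template is None:
        return None
    pattern = compile_template(template, entity)
    return template if pattern.fullmatch(corpus_text) else None
```

Escaping the entity with `re.escape` keeps any special characters in an entity name from being interpreted as regular expression syntax.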
For step S14, fuzzy statements of "Who invented the theory of relativity" are determined based on the regular expression obtained in step S13. For example: the theory of relativity is (who | what person | which person | which 'kernel brother') (invented | proposed | "empty") (sentence-final particle | "empty"), where "empty" indicates that the part may be absent. The fuzzy statement can be expanded into many different phrasings, which are input into the knowledge graph. This ensures sentence diversity, and because the knowledge graph can be searched through an inverted index, the corresponding text of the corpus text is found while search speed is improved.
For example, Q: "Who invented the theory of relativity"
The answer is obtained through the knowledge graph, A: "Einstein".
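The inverted-index retrieval mentioned for step S14 can be illustrated with a toy fact store. This is a sketch under stated assumptions: the `FACTS` list is a hypothetical stand-in for the knowledge graph, and the token-overlap scoring is one simple way to tolerate different phrasings, not necessarily the scoring the method uses.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

Fact = Tuple[str, str, str]  # (entity, attribute, value)

# Hypothetical stand-in for the knowledge graph.
FACTS: List[Fact] = [
    ("theory of relativity", "invented", "Einstein"),
    ("telephone", "invented", "Bell"),
]


def build_inverted_index(facts: List[Fact]) -> Dict[str, Set[int]]:
    """Map every token of a fact's entity and attribute to the fact ids containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for fact_id, (entity, attribute, _value) in enumerate(facts):
        for token in f"{entity} {attribute}".lower().split():
            index[token].add(fact_id)
    return index


def lookup(statement: str, index: Dict[str, Set[int]], facts: List[Fact]) -> Optional[str]:
    """Return the value of the fact sharing the most tokens with the statement."""
    scores: Dict[int, int] = defaultdict(int)
    for token in statement.lower().split():
        for fact_id in index.get(token, ()):
            scores[fact_id] += 1
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return facts[best][2]
```

Because each token is resolved through its posting list, only facts that share a token with the statement are ever scored, and a phrasing with extra or unknown words (for example "proposed" instead of "invented") still reaches the same fact.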
For step S15, corpus generation is performed on the corpus text and the corresponding text through the regular expression, for example:
Q: "Who proposed the theory of relativity" A: "Einstein"
Q: "What person invented the theory of relativity" A: "Einstein"
Q: "Who was the theory of relativity proposed by" A: "Einstein"
Q: "Which 'kernel brother' invented the theory of relativity" A: "Einstein"
In this way, a plurality of paired question-answer dialogue corpora are generated, providing sufficient paired corpora for training a reading-comprehension question-answering language model.
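The expansion in step S15 amounts to taking the Cartesian product of the alternatives in each group of the regular expression and pairing every resulting question with the same answer. A minimal sketch follows, assuming the alternation groups have already been parsed into lists; the slot contents and the function name are illustrative, and an empty string plays the role of the "empty" alternative.

```python
from itertools import product
from typing import List, Sequence, Tuple


def generate_pairs(slots: Sequence[Sequence[str]], answer: str) -> List[Tuple[str, str]]:
    """Expand every combination of slot alternatives into a (question, answer) pair."""
    pairs = []
    for combo in product(*slots):
        # Skip empty strings so the "empty" alternative leaves no stray space.
        question = " ".join(part for part in combo if part)
        pairs.append((question, answer))
    return pairs


# Illustrative slots in English word order; "" marks an optional part.
SLOTS = [
    ["who", "which person"],
    ["invented", "proposed"],
    ["the theory of relativity"],
    ["", "then"],
]
```

With these slots, `generate_pairs(SLOTS, "Einstein")` yields 2 × 2 × 1 × 2 = 8 pairs, all sharing the answer "Einstein".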
In this embodiment, fuzzy search is used in the knowledge graph, which improves retrieval recall; without it, a sentence containing even one extra word might not be handled by the knowledge graph, and no corresponding result would be obtained. An inverted index is used in knowledge graph retrieval, which improves retrieval efficiency. Therefore, a plurality of paired question-answer dialogue corpora can be generated to train a reading-comprehension question-answering language model.
As an implementation manner, in this embodiment, when the text amount is greater than a preset threshold, the corpus text is divided into a plurality of corpus text segments;
respectively extracting entities and attributes of the corpus text segments for the knowledge graph;
querying a plurality of regular expressions matched with each corpus text segment based on the entities and the attributes;
determining fuzzy statements of the corpus text segments based on the regular expressions, inputting the fuzzy statements into a knowledge graph, and determining corresponding texts of the corpus text segments according to inverted indexes, wherein the corresponding texts comprise: answer text or question text;
determining a corpus abbreviation of the corpus text, and generating a plurality of [question text, corpus abbreviation, answer text] triples from the corpus text segments and corresponding texts;
and performing corpus generation on the [question text, corpus abbreviation, answer text] triples through the regular expressions to construct a plurality of paired question-answer dialogue corpora.
In this embodiment, when the corpus text is a whole passage, the corpus text is divided into a plurality of corpus text segments;
for example, "Qingshui Township is located 30 kilometers northeast of Fugu County. It is adjacent to Huangpu village in the east and Hai temple village in the south, connected to chai village and five families of Zhao village in the west, and connected to Hartown in the north. The total area is 1667 square kilometers, and the total cultivated area is 38265 mu. The whole township has 15 administrative villages, 80 natural villages, 2489 households and 10102 people, of which the agricultural population is 9868 and the land population is 8408. Qingshui Township is located on the Loess Plateau; rainfall is scarce all year round, spring is windy and sandy, natural resources are relatively poor, and only the southeast has abundant coal resources. The economy of the whole township is dominated by crop farming, and the main cash crops include broomcorn millet, foxtail millet, corn, potatoes, mung beans and the like. In addition, Malus micromalus cultivation and livestock breeding have developed to a certain extent. Qingshui Township has simple and honest folk customs and good public security; people live and work in peace, no criminal case has occurred for many years, and the township has been rated by Yulin City and Fugu County as a safe township, a civilized market town, and an advanced unit in the comprehensive management of public security for many years."
This passage is divided into segments such as: "Qingshui Township is located 30 kilometers northeast of Fugu County", "the total area is 1667 square kilometers", "the total cultivated area is 38265 mu", and "the economy of the whole township is dominated by crop farming, and the main cash crops include broomcorn millet, foxtail millet, corn, potatoes, mung beans and the like" (because the text is long, only some of the corpus text segments are shown).
The entity and attribute of each corpus text segment are extracted respectively. For "Qingshui Township is located 30 kilometers northeast of Fugu County", the entity is "Qingshui Township" and the attribute is its location. The remaining corpus text segments are not described in detail.
Similarly, "Qingshui Township is located 30 kilometers northeast of Fugu County" is matched to the corresponding regular expression: ${#basicconcept} (located | in) 30 (km | kilometers) (northeast | southeast | southwest | northwest) of Fugu County.
Fuzzy statements of "Qingshui Township is located 30 kilometers northeast of Fugu County" are determined based on the regular expression, each fuzzy statement is input into the knowledge graph, and the corresponding text of each corpus text segment is determined according to the inverted index. Because "Qingshui Township is located 30 kilometers northeast of Fugu County" is an answer-type text, the knowledge graph is queried for the corresponding question text, which yields the question text "What is the address of Qingshui Township".
The corpus abbreviation of the corpus text is then determined; that is, a short name is extracted to represent the long passage above, and it also serves as the basis for question answering. For example, the passage is abbreviated as "Qingshui Township profile". A triple of [What is the address of Qingshui Township, Qingshui Township profile, Qingshui Township is located 30 kilometers northeast of Fugu County] is generated from the corpus text segments and corresponding texts;
constructing a plurality of paired question-answer dialogue corpora based on the determined triples, for example:
Q: "What is the address of Qingshui Township" A: "Qingshui Township is located 30 kilometers northeast of Fugu County"
Q: "Where is the address of Qingshui Township" A: "Qingshui Township is located 30 kilometers northeast of Fugu County"
In this embodiment, high-quality question-answer dialogue corpora can be extracted from corpus texts with a large text amount, and the reading-comprehension question-answering language model can be further trained.
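The [question text, corpus abbreviation, answer text] triples used for long passages can be represented with a small record type. The sketch below is an assumption about data layout only: the `Triple` type, its field names, and both helper functions are hypothetical, and the question texts are supplied directly rather than retrieved from a knowledge graph.

```python
from typing import List, NamedTuple, Tuple


class Triple(NamedTuple):
    question_text: str
    corpus_abbreviation: str
    answer_text: str


def build_triples(abbreviation: str, question_by_segment: List[Tuple[str, str]]) -> List[Triple]:
    """Pair each answer-type segment with the question text found for it."""
    return [Triple(question, abbreviation, segment) for segment, question in question_by_segment]


def expand_triple(triple: Triple, question_variants: List[str]) -> List[Tuple[str, str, str]]:
    """Emit one (question, abbreviation, answer) record per distinct question phrasing."""
    questions = [triple.question_text]
    questions += [q for q in question_variants if q not in questions]
    return [(q, triple.corpus_abbreviation, triple.answer_text) for q in questions]
```

Keeping the abbreviation in every record ties each generated pair back to the passage it came from, which is the role the source assigns to the corpus abbreviation as the basis for question answering.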
As an implementation, in this embodiment, after the regular expression matching the corpus text is queried, the method further comprises preprocessing the regular expression, including:
sequentially detecting each character of the regular expression, and when a character is a preset wildcard character, converting that character into a specified character.
In this embodiment, the regular expression cannot be used directly because of a constraint of the algorithm: wildcard characters such as ".", "+", "$", "{", and "}" must not appear. These characters have special meanings in regular expressions, so they are replaced. For example:
{ # company } (corporation)? (what.
Will be replaced by
# company? (what is.
As an embodiment, the sequentially detecting each character of the regular expression includes:
and judging each character of the regular expression one by one through a recursive algorithm.
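The character-by-character recursive check can be sketched as below. The source does not state which specified character each wildcard is converted to, so the `WILDCARD_MAP` table (which simply drops the listed characters) is a hypothetical choice made for illustration, as is the function name.

```python
# Hypothetical mapping from preset wildcard characters to specified characters.
# The source lists ".", "+", "$", "{" and "}" as wildcards but not their
# replacements; mapping them to the empty string is an assumption.
WILDCARD_MAP = {".": "", "+": "", "$": "", "{": "", "}": ""}


def preprocess(expression: str) -> str:
    """Recursively examine one character at a time, converting preset wildcards."""
    if not expression:
        return ""
    head, tail = expression[0], expression[1:]
    return WILDCARD_MAP.get(head, head) + preprocess(tail)
```

For instance, `preprocess("${#company}(Ltd)?")` leaves `#company(Ltd)?`. Recursing once per character is limited by Python's recursion depth (1000 frames by default), so an iterative loop would suit very long expressions better.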
In this embodiment, because the algorithm places certain constraints on the regular expressions it accepts, the regular expression is adjusted to satisfy those constraints, which improves the stability of the method.
Fig. 2 is a schematic structural diagram of a question-answer corpus generating system according to an embodiment of the present invention. The system can execute the question-answer corpus generating method of any of the above embodiments and is configured in a terminal.
The system for generating a corpus of questions and answers provided by this embodiment includes: a corpus receiving program module 11, an information determining program module 12, a regular expression query program module 13, a corresponding text determining program module 14 and a question and answer corpus generating program module 15.
The corpus receiving program module 11 is configured to receive a corpus text; the information determination program module 12 is configured to detect a text amount of the corpus text, and determine an entity and an attribute of the corpus text for a knowledge graph when the text amount is smaller than a preset threshold; the regular expression query program module 13 is configured to query a regular expression matched with the corpus text based on the entity and the attribute; the corresponding text determination program module 14 is configured to determine a fuzzy statement of the corpus text based on the regular expression, input the fuzzy statement to a knowledge graph, and determine a corresponding text of the corpus text according to an inverted index, where the corresponding text includes: answer text and/or question text; the question-answer corpus generating program module 15 is configured to perform corpus generation on the corpus text and the corresponding text through the regular expression, so as to construct a plurality of paired question-answer dialog corpuses.
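The data flow between the five program modules can be pictured as five callables wired together in order. The sketch below is an assumption about structure only: the class name, field names, and signatures are hypothetical, and each module is left as a pluggable function rather than implemented.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class QACorpusGenerator:
    receive: Callable[[str], str]                               # module 11
    determine_info: Callable[[str], Optional[Tuple[str, str]]]  # module 12
    query_regex: Callable[[str, str, str], str]                 # module 13
    find_corresponding: Callable[[str, str], str]               # module 14
    generate: Callable[[str, str, str], List[Tuple[str, str]]]  # module 15

    def run(self, raw_text: str) -> List[Tuple[str, str]]:
        """Pass a short corpus text through modules 11 to 15 in order."""
        text = self.receive(raw_text)
        info = self.determine_info(text)
        if info is None:
            # Text amount is not below the threshold; the segment-based
            # variant would take over here (not sketched).
            return []
        entity, attribute = info
        regex = self.query_regex(text, entity, attribute)
        corresponding = self.find_corresponding(text, regex)
        return self.generate(text, corresponding, regex)
```

Passing the modules in as functions keeps each stage replaceable, for example swapping in a different knowledge graph lookup for module 14 without touching the others.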
Further, the information determination program module is further configured to: when the text amount is larger than a preset threshold value, dividing the corpus text into a plurality of corpus text sections;
the information determination program module is used for respectively extracting the entity and the attribute of each text segment of the language material, which are used for the knowledge graph;
a regular expression query program module, configured to query, based on the entity and the attribute, a plurality of regular expressions matched with each corpus text segment;
a corresponding text determination program module, configured to determine a fuzzy statement of each corpus text segment based on each regular expression, input each fuzzy statement to a knowledge graph, and determine a corresponding text of each corpus text segment according to an inverted index, where the corresponding text includes: answer text or question text;
determining a corpus abbreviation of the corpus text, and generating a plurality of triples of [ question texts, corpus abbreviations and answer texts ] through a plurality of corpus text segments and corresponding texts;
and the question-answer corpus generating program module is used for performing corpus generation on the [ question text, corpus abbreviation and answer text ] triple through the regular expressions to construct a plurality of paired question-answer dialogue corpuses.
Further, the system further comprises a regular expression preprocessing program module, configured to sequentially detect each character of the regular expression and, when a character is a preset wildcard character, convert that character into a specified character.
Further, the sequentially detecting each character of the regular expression comprises:
and judging each character of the regular expression one by one through a recursive algorithm.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the question and answer corpus generating method in any method embodiment;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
receiving a corpus text;
detecting the text quantity of the corpus text, and determining the entity and the attribute of the corpus text for the knowledge graph when the text quantity is smaller than a preset threshold value;
querying a regular expression matched with the corpus text based on the entity and the attribute;
determining a fuzzy statement of the corpus text based on the regular expression, inputting the fuzzy statement into a knowledge graph, and determining a corresponding text of the corpus text according to an inverted index, wherein the corresponding text comprises: answer text and/or question text;
and performing corpus generation on the corpus text and the corresponding text through the regular expression to construct a plurality of paired question-answer dialogue corpuses.
The non-volatile computer-readable storage medium may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the question-answer corpus generating method in any of the above method embodiments.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the question-answer corpus generating method according to any embodiment of the invention.
The client of the embodiment of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication capabilities and whose primary goal is to provide voice and data communication. Such terminals include smart phones, multimedia phones, feature phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices: such devices can display and play multimedia content. They include audio and video players, handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.
(4) Other electronic devices with data processing capabilities.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A question-answer corpus generating method comprises the following steps:
receiving a corpus text;
detecting the text amount of the corpus text, and, when the text amount is smaller than a preset threshold, determining the entity and the attribute of the corpus text for use with a knowledge graph;
querying a regular expression matched with the corpus text based on the entity and the attribute;
determining a fuzzy statement of the corpus text based on the regular expression, inputting the fuzzy statement into the knowledge graph, and determining a corresponding text of the corpus text according to an inverted index, wherein the corresponding text comprises: answer text and/or question text;
and performing corpus generation on the corpus text and the corresponding text through the regular expression to construct a plurality of paired question-answer dialogue corpuses.
2. The method according to claim 1, wherein, when the text amount is larger than the preset threshold, the corpus text is divided into a plurality of corpus text segments;
respectively extracting the entities and attributes of the corpus text segments for use with the knowledge graph;
querying a plurality of regular expressions matched with each of the corpus text segments based on the entities and the attributes;
determining a fuzzy statement of each corpus text segment based on each regular expression, inputting each fuzzy statement into the knowledge graph, and determining a corresponding text of each corpus text segment according to the inverted index, wherein the corresponding text comprises: answer text or question text;
determining a corpus abbreviation of the corpus text, and generating a plurality of [question text, corpus abbreviation, answer text] triples from the plurality of corpus text segments and the corresponding texts;
and performing corpus generation on the [question text, corpus abbreviation, answer text] triples through the regular expressions to construct a plurality of paired question-answer dialogue corpuses.
3. The method of claim 1, wherein, after querying the regular expression that matches the corpus text, the method further comprises pre-processing the regular expression, including:
sequentially detecting each character of the regular expression, and, when a character is a preset wildcard character, converting that character into a specified character.
4. The method of claim 3, wherein the detecting each character of the regular expression in turn comprises:
and judging each character of the regular expression one by one through a recursive algorithm.
5. A corpus generating system, comprising:
the corpus receiving program module is used for receiving corpus texts;
the information determining program module is used for detecting the text amount of the corpus text, and, when the text amount is smaller than a preset threshold, determining the entity and the attribute of the corpus text for use with a knowledge graph;
a regular expression query program module, configured to query a regular expression matched with the corpus text based on the entity and the attribute;
a corresponding text determination program module, configured to determine a fuzzy statement of the corpus text based on the regular expression, input the fuzzy statement to a knowledge graph, and determine a corresponding text of the corpus text according to an inverted index, where the corresponding text includes: answer text and/or question text;
and the question-answer corpus generating program module is used for performing corpus generation on the corpus text and the corresponding text through the regular expression so as to construct a plurality of paired question-answer dialogue corpuses.
6. The system of claim 5, wherein the information determination program module is further configured to: when the text amount is larger than the preset threshold, divide the corpus text into a plurality of corpus text segments;
the information determination program module is configured to respectively extract the entity and the attribute of each corpus text segment for use with the knowledge graph;
the regular expression query program module is configured to query, based on the entities and the attributes, a plurality of regular expressions matched with each of the corpus text segments;
the corresponding text determination program module is configured to determine a fuzzy statement of each corpus text segment based on each regular expression, input each fuzzy statement to the knowledge graph, and determine a corresponding text of each corpus text segment according to the inverted index, where the corresponding text includes: answer text or question text;
determining a corpus abbreviation of the corpus text, and generating a plurality of [question text, corpus abbreviation, answer text] triples from the plurality of corpus text segments and the corresponding texts;
and the question-answer corpus generating program module is configured to perform corpus generation on the [question text, corpus abbreviation, answer text] triples through the regular expressions to construct a plurality of paired question-answer dialogue corpuses.
7. The system of claim 5, wherein the system further comprises a regular expression pre-processing program module, configured to:
sequentially detect each character of the regular expression and, when a character is a preset wildcard character, convert that character into a specified character.
8. The system of claim 7, wherein the detecting each character of the regular expression in turn comprises:
and judging each character of the regular expression one by one through a recursive algorithm.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1-4.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4.
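
As a non-limiting illustration of the pre-processing recited in claims 3 and 4, the following Python sketch examines each character of a regular expression recursively and converts preset wildcard characters into specified characters. The function name `preprocess_regex` and the mapping `WILDCARD_MAP`, including which characters count as wildcards and what they are converted to, are assumptions made for this example; the claims do not specify them.

```python
# Hypothetical mapping: preset wildcard character -> specified character.
# The claims do not name these characters; they are chosen for illustration.
WILDCARD_MAP = {"*": "%", "?": "_"}


def preprocess_regex(pattern: str, mapping: dict = WILDCARD_MAP) -> str:
    """Recursively examine each character and convert preset wildcards."""
    if not pattern:  # base case: no characters left to examine
        return ""
    head, tail = pattern[0], pattern[1:]
    # Replace the character if it is a preset wildcard, otherwise keep it.
    converted = mapping.get(head, head)
    # Recurse on the remaining characters, one character per call.
    return converted + preprocess_regex(tail, mapping)


if __name__ == "__main__":
    print(preprocess_regex("a*b?c"))
```

Because this sketch makes one recursive call per character, very long expressions could exceed Python's default recursion limit; an iterative loop over the characters would produce the same result in that case.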
CN201911258482.XA 2019-12-10 2019-12-10 Question and answer corpus generation method and system Active CN111026834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911258482.XA CN111026834B (en) 2019-12-10 2019-12-10 Question and answer corpus generation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911258482.XA CN111026834B (en) 2019-12-10 2019-12-10 Question and answer corpus generation method and system

Publications (2)

Publication Number Publication Date
CN111026834A (en) 2020-04-17
CN111026834B (en) 2022-07-08

Family

ID=70205262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911258482.XA Active CN111026834B (en) 2019-12-10 2019-12-10 Question and answer corpus generation method and system

Country Status (1)

Country Link
CN (1) CN111026834B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111966805A (en) * 2020-08-13 2020-11-20 贝壳技术有限公司 Method, device, medium and electronic equipment for assisting in realizing session

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150278264A1 (en) * 2014-03-31 2015-10-01 International Business Machines Corporation Dynamic update of corpus indices for question answering system
CN105868313A (en) * 2016-03-25 2016-08-17 浙江大学 Knowledge graph question answering system and method based on template matching
CN109408821A (en) * 2018-10-22 2019-03-01 腾讯科技(深圳)有限公司 Corpus generation method and device, computing device, and storage medium
CN110147436A (en) * 2019-03-18 2019-08-20 清华大学 Hybrid automatic question-answering method based on an educational knowledge graph and text

Also Published As

Publication number Publication date
CN111026834B (en) 2022-07-08

Similar Documents

Publication Publication Date Title
Menick et al. Teaching language models to support answers with verified quotes
CN106649818B (en) Application search intention identification method and device, application search method and server
CN107818105B (en) Recommendation method of application program and server
CN105868317B (en) Digital education resource recommendation method and system
Parker Environmentalism and education for sustainability in Indonesia
CN111400506B (en) Ancient poetry proposition method and system
CN105045857A (en) Social network rumor recognition method and system
CN104133817A (en) Online community interaction method and device and online community platform
CN103699626A (en) Method and system for analysing individual emotion tendency of microblog user
CN107122455A (en) Enhanced network user representation method based on microblogs
CN104636456A (en) Question routing method based on word vectors
Usher Defending and transcending local identity through environmental discourse
Reed ‘This loopy idea’: an analysis of UKIP’s social media discourse in relation to rurality and climate change
CN102651719A (en) Method and equipment for tracking message topics in message interaction environment
CN104679885A (en) User search string organization name recognition method based on semantic feature model
CN103473959A (en) System and method for special dictation training study related to numbers in foreign language
CN104915399A (en) Recommended data processing method based on news headline and recommended data processing method system based on news headline
CN110765270A (en) Training method and system of text classification model for spoken language interaction
CN113961692A (en) Machine reading understanding method and system
CN111026834B (en) Question and answer corpus generation method and system
CN109558591A (en) Chinese event detection method and device
CN110390104B (en) Irregular text transcription method and system for voice dialogue platform
CN103309851A (en) Method and system for spam identification of short text
CN111128122B (en) Method and system for optimizing rhythm prediction model
CN110188352B (en) Text theme determining method and device, computing equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

GR01 Patent grant