CN113157897A - Corpus generation method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN113157897A
CN113157897A
Authority
CN
China
Prior art keywords
question
words
corpus
frequency
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110575555.9A
Other languages
Chinese (zh)
Inventor
谢忠玉
Current Assignee
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202110575555.9A
Publication of CN113157897A
Legal status: Pending

Classifications

    • G06F16/3329: Natural language query formulation or dialogue systems
    • G06F16/3344: Query execution using natural language analysis
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles
    • G06F40/186: Templates
    • G06F40/194: Calculation of difference between files
    • G06F40/247: Thesauruses; Synonyms
    • G06F40/253: Grammatical analysis; Style critique
    • G06F40/279: Recognition of textual entities
    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a corpus generation method and device, computer equipment and a storage medium. The corpus generation method comprises: obtaining high-frequency question words corresponding to a target question-answer field and a text to be mined; extracting, from the text to be mined, target answer sentences corresponding to the high-frequency question words; performing text-similarity matching between the high-frequency question words and a plurality of historical question sentences in a historical question-answer library, and, based on the text similarity, selecting several historical question sentences as historical question templates; replacing the historical question words in the historical question templates with the high-frequency question words to obtain question corpora; and taking the question corpora together with the target answer sentences corresponding to the high-frequency question words as a target question-answer corpus. The method enables rapid construction of question-answer corpora for different vertical fields and can effectively reduce labor cost. The invention also relates to blockchain technology: the historical question sentences may be stored in a blockchain.

Description

Corpus generation method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a corpus generating method, a corpus generating device, computer equipment and a storage medium.
Background
In the current "Internet Plus" era, artificial intelligence technology is applied at large scale in fields such as traffic management, and the question-answering system is an important area of artificial intelligence concerning its practical deployment in vertical fields. For current question-answering systems, most user questions concentrate on a small set of high-frequency head questions, i.e., Frequently Asked Questions (FAQ). The quantity and quality of the FAQ corpus are the foundation of the whole system, but no universal, full-coverage FAQ corpus exists at present, so the FAQ corpus must be rebuilt for each vertical field. Rebuilding an FAQ corpus consumes a large amount of manpower and material resources, so the labor cost is high and the efficiency is low.
Disclosure of Invention
The embodiment of the invention provides a corpus generating method and device, computer equipment and a storage medium, and aims to solve the problems that a corpus set needs to be reconstructed aiming at different vertical fields, labor cost is high and efficiency is low.
A corpus generation method includes:
acquiring high-frequency questioning words and texts to be mined corresponding to the target question and answer field; the high-frequency questioning words are used for indicating subject words corresponding to high-frequency questions in the target question answering field;
extracting a target response sentence corresponding to the high-frequency question word from the text to be mined according to the high-frequency question word;
performing text similarity matching on the high-frequency question words and a plurality of historical question sentences in a historical question-answer library to obtain a plurality of historical question sentences as historical question templates;
replacing the historical question words in the historical question template with the high-frequency question words to obtain a question corpus; the historical question words are subject words corresponding to the historical question template;
and taking the question corpus and the target answer sentence corresponding to the high-frequency question words as a target question-answer corpus.
A corpus generation apparatus, comprising:
the data acquisition module is used for acquiring high-frequency questioning words and texts to be mined corresponding to the target question and answer field; the high-frequency questioning words are used for indicating subject words corresponding to high-frequency questions in the target question answering field;
the target question-answer sentence extraction module is used for extracting a target answer sentence corresponding to the high-frequency question word from the text to be mined according to the high-frequency question word;
a historical question template obtaining module, configured to perform text similarity matching on the high-frequency question words and multiple historical question sentences in a historical question-answer library, and obtain multiple historical question sentences as historical question templates;
the query corpus acquiring module is used for replacing the historical query words in the historical query template with the high-frequency query words to obtain query corpus; the historical question words are subject words corresponding to the historical question templates;
and the target question-answer corpus acquiring module is used for taking the question corpus and the target answer sentences corresponding to the high-frequency question words as target question-answer corpus.
A computer device, comprising a memory, a processor and a computer program stored in the memory and operable on the processor, wherein the processor implements the steps of the corpus generating method when executing the computer program.
A computer storage medium, the computer storage medium storing a computer program, the computer program, when executed by a processor, implementing the steps of the corpus generation method described above.
In the corpus generation method and device, computer device and storage medium, the high-frequency question words corresponding to the target question-answer field and the text to be mined are obtained, so that the target answer sentences corresponding to the high-frequency question words are extracted from the text to be mined according to the high-frequency question words; that is, corpus data is constructed around the high-frequency question words, enabling a quick cold start of question-answer corpus construction. Then, text-similarity matching is performed between the high-frequency question words and a plurality of historical question sentences in a historical question-answer library, and several historical question sentences are taken as historical question templates, so that question corpora are obtained by rewriting, with the high-frequency question words, templates extracted from real historical question sentences; this effectively ensures the authenticity of the question corpora. Finally, the question corpora and the target answer sentences corresponding to the high-frequency question words are used as the target question-answer corpus, so that question-answer corpora for different vertical fields can be constructed automatically, reducing labor cost and achieving rapid corpus construction.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a diagram of an application environment of a corpus generating method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a corpus generation method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a corpus generation method according to an embodiment of the present invention;
FIG. 4 is a flowchart of a corpus generation method according to an embodiment of the present invention;
FIG. 5 is a detailed flowchart of step S202 in FIG. 2;
FIG. 6 is a detailed flowchart of step S403 in FIG. 4;
FIG. 7 is a detailed flowchart of step S408 in FIG. 4;
FIG. 8 is a detailed flowchart of step S203 in FIG. 2;
FIG. 9 is a diagram illustrating a corpus generating device according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The corpus generation method may be applied in an application environment as in fig. 1, where a computer device communicates with a server over a network. The computer device may be, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server may be implemented as a stand-alone server.
In an embodiment, as shown in fig. 2, a corpus generating method is provided, which is described by taking the server in fig. 1 as an example, and includes the following steps:
S201: acquiring high-frequency question words and a text to be mined corresponding to the target question-answer field; the high-frequency question words indicate subject words corresponding to high-frequency questions in the target question-answer field.
The method can be applied to an automatic question and answer corpus generating tool, is used for automatically generating the question and answer corpus in the field in a text mining mode aiming at different vertical fields (such as insurance fields), achieves the purpose of quick cold start of question and answer corpus construction, can effectively reduce labor cost and improves the construction efficiency of the question and answer corpus.
The target question and answer field may include, but is not limited to, various vertical fields, such as insurance fields, among others. In the present embodiment, for convenience of understanding, the following technical solutions are described by taking applications in the insurance field as examples.
Specifically, in the insurance question-answering field, the definitions of insurance terms and clauses are high-frequency questions consulted by users, so the high-frequency question words in this embodiment may be subject words corresponding to questions about insurance terms and clauses; that is, the subject words may be the terms and clauses appearing in high-frequency questions. The text to be mined may be an insurance specification document, a policy specification, and the like. It is to be understood that there may be one or more high-frequency question words, which is not limited here.
S202: and extracting a target response sentence corresponding to the high-frequency question words from the text to be mined according to the high-frequency question words.
Specifically, to save storage space and improve the efficiency of extracting target answer sentences, the text to be mined may be preprocessed before the target answer sentences corresponding to the high-frequency question words are extracted, including but not limited to removal of English text and removal of punctuation marks. In this embodiment, punctuation removal includes, but is not limited to, processing with regular expressions; the regular expressions may be preset by developers, for example for sentence segmentation. Keyword matching may then be performed on the text to be mined based on the high-frequency question words to obtain the corresponding target answer sentences. For example, if the high-frequency question word is "XXX" and the text to be mined contains "the definition of XXX is ……", then "……" may be taken as the target answer sentence corresponding to "XXX" through keyword matching.
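As a minimal sketch, the preprocessing step described above might look as follows in Python; the patent does not fix the exact regular expressions, so the patterns here are assumptions for illustration:

```python
import re

def preprocess(text: str) -> str:
    """Preprocess the text to be mined before answer extraction.

    Two steps from the embodiment: English removal and punctuation
    removal. The concrete regex patterns are assumptions.
    """
    text = re.sub(r"[A-Za-z]+", "", text)  # English removal
    text = re.sub(r"[^\w\s]", "", text)    # punctuation-mark removal
    return text
```

In Python 3, `\w` matches Chinese characters by default, so only punctuation is stripped by the second substitution.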
It can be understood that, since the target answer sentence in this embodiment is mined from the text to be mined, that is, the specification text of the insurance specification document and the policy specification, compliance and accuracy of the target answer sentence can be ensured, and moreover, the cost for auditing the compliance of the corpus can be further saved.
S203: and performing text similarity matching on the high-frequency question words and a plurality of historical question sentences in the historical question-answer library to obtain a plurality of historical question sentences as historical question templates.
The text similarity represents how similar the high-frequency question words are to a historical question sentence. It may be measured by the Jaccard similarity, although this is not limiting; the Jaccard similarity is the ratio of the intersection to the union of the token sets of the high-frequency question words and the historical question sentence. It is understood that the high-frequency question words and the historical question sentences may each be represented as word vectors in order to compute the text similarity.
Specifically, this embodiment mainly retrieves historical question sentences that contain the high-frequency question words or similar words; whether a historical question sentence contains repeated high-frequency question words or similar words has little influence on this embodiment, which is why the Jaccard similarity is used to express the text similarity.
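The intersection-over-union computation above can be sketched as follows (treating each text as a set of characters is an assumption for brevity; word-level token sets work the same way):

```python
def jaccard_similarity(a_text: str, b_text: str) -> float:
    """Jaccard similarity: |intersection| / |union| of the two token sets."""
    a, b = set(a_text), set(b_text)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Because only set membership matters, repeated occurrences of a high-frequency question word in a historical question sentence do not change the score, matching the rationale given in the text.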
It is emphasized that, in order to further ensure the privacy and security of the history question sentence, the history question sentence may also be stored in a node of a block chain.
S204: replacing the historical questioning words in the historical questioning template with high-frequency questioning words to obtain questioning corpus; the historical question words are subject words corresponding to the historical question templates.
The historical question words are the subject words corresponding to the historical question templates. A subject word reflects the keyword of the topic in a historical question sentence; for example, in the sentence "What is the definition of XXX", the historical question word is "XXX". Specifically, the historical question words in the historical question templates are replaced with the high-frequency question words to obtain question corpora. For example, if the historical question template is "What is the definition of XXX", where "XXX" is the historical question word, and the current high-frequency question words include "A" and "B", then one high-frequency question word can be chosen in turn or at random to replace the historical question word, yielding "What is the definition of A" and "What is the definition of B".
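The substitution step reduces to a string replacement per high-frequency question word; a sketch:

```python
def fill_templates(template: str, history_word: str, high_freq_words):
    """Replace the historical question word in one template with every
    high-frequency question word, yielding one question corpus per word."""
    return [template.replace(history_word, word) for word in high_freq_words]
```

For the example in the text, `fill_templates("What is the definition of XXX", "XXX", ["A", "B"])` produces the two question corpora for "A" and "B".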
S205: and taking the target answer sentences corresponding to the question corpus and the high-frequency question words as target question and answer corpora.
Specifically, the target question-answer corpus can be obtained by associating the question corpus with the target answer sentences corresponding to the high-frequency question words in the question corpus, manual sample collection is not needed, and the authenticity and the effectiveness of the generated target question-answer corpus can be ensured due to the fact that the question templates are mined according to historical question sentences.
In this embodiment, the high-frequency question words and the text to be mined corresponding to the target question-answer field are obtained, so that the target answer sentences corresponding to the high-frequency question words are extracted from the text to be mined according to the high-frequency question words; that is, corpus data is constructed around the high-frequency question words, enabling a quick cold start of question-answer corpus construction. Then, text-similarity matching is performed between the high-frequency question words and a plurality of historical question sentences in a historical question-answer library, and several historical question sentences are taken as historical question templates, so that question corpora are obtained by rewriting, with the high-frequency question words, templates extracted from real historical question sentences; this effectively ensures the authenticity of the question corpora. Finally, the question corpora and the target answer sentences corresponding to the high-frequency question words are used as the target question-answer corpus, so that question-answer corpora for different vertical fields can be constructed automatically and rapidly, reducing labor cost.
In an embodiment, as shown in fig. 3, after step S204, the corpus generating method further includes the following steps:
S301: performing a grammar and semantic check on the question corpus with a pre-trained language model to obtain a score for the question corpus.
S302: when the score is not smaller than a preset score threshold, retaining the question corpus.
S303: when the score is smaller than the preset score threshold, removing the question corpus.
S304: taking the retained question corpus and the target answer sentences corresponding to the high-frequency question words as the target question-answer corpus.
Specifically, since a question corpus generated by replacing the historical question words in a historical question template may have grammar and semantic problems, this embodiment scores each replaced question corpus with a pre-trained language model to ensure validity, retaining only the corpora with higher scores. The language model may be a GPT (Generative Pre-Training) model, trained in advance on labeled corpus data, and is used to check the grammatical and semantic soundness of natural sentences. Specifically, a grammar and semantic check is performed on the question corpus through the language model to obtain a score; when the score is not smaller than a preset score threshold, the question corpus is considered usable and is retained, and when the score is smaller than the threshold, the question corpus is considered not to conform to natural-sentence expression and is removed, ensuring the validity and accuracy of the question corpora.
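The retain/remove decision of S302 and S303 reduces to a threshold filter. A sketch, with a stand-in scoring callable in place of the pre-trained GPT model (the patent names the model but does not specify its interface, so `score_fn` is an assumption):

```python
def filter_by_score(question_corpora, score_fn, threshold: float):
    """Keep question corpora whose language-model score is not below the
    threshold (S302); the rest are dropped (S303).

    `score_fn` stands in for the pre-trained language model; any callable
    returning a numeric score per sentence works here.
    """
    return [q for q in question_corpora if score_fn(q) >= threshold]
```

In practice `score_fn` would wrap the GPT model's sentence score (e.g. negative perplexity), but any monotone quality score fits this filter.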
In an embodiment, as shown in fig. 4, the corpus generating method further includes the following steps:
S401: acquiring high-frequency question words and a text to be mined corresponding to the target question-answer field.
Specifically, step S401 is consistent with step S201, and is not described herein again to avoid repetition.
S402: and extracting a target response sentence corresponding to the high-frequency question words from the text to be mined according to the high-frequency question words.
Specifically, step S402 is consistent with step S202, and is not described herein again to avoid repetition.
S403: and carrying out synonym expansion on the high-frequency question words to obtain a plurality of target synonyms corresponding to the high-frequency question words.
Specifically, synonym expansion is further performed on the high-frequency question words in the embodiment to obtain a plurality of target synonyms corresponding to the high-frequency question words, so that each target synonym is subsequently adopted to replace the high-frequency question words in the question corpus to obtain a plurality of target question corpora corresponding to the high-frequency question words, and the purpose of corpus expansion is achieved.
S404: and performing text similarity matching on the high-frequency question words and a plurality of historical question sentences in the historical question-answer library to obtain a plurality of historical question sentences as historical question templates.
Specifically, step S404 is consistent with step S203, and is not described herein again to avoid repetition.
S405: and replacing the historical questioning words in the historical questioning template with high-frequency questioning words to obtain questioning corpus.
Specifically, step S405 is consistent with step S204, and is not described herein again to avoid repetition.
S406: performing a grammar and semantic check on the question corpus with the pre-trained language model to obtain a score for the question corpus.
Specifically, step S406 is consistent with step S301, and is not described herein again to avoid repetition.
S407: and when the score value is not less than a preset score threshold value, reserving the query corpus.
Specifically, step S407 is consistent with step S302, and is not described herein again to avoid repetition.
S408: and replacing the high-frequency question words in the reserved question corpus according to each target synonym to obtain a plurality of target question corpora corresponding to the high-frequency question words.
Specifically, each target synonym is adopted to replace the high-frequency question words in the reserved question corpus, namely, the semantic grammar of the question corpus is determined to be in accordance with the natural language expression, and then the partial reserved question corpus is replaced, so that the replacement workload is reduced, and the effectiveness of word replacement can be ensured.
S409: and taking each target question corpus and the target answer sentences corresponding to the high-frequency question words as target question and answer corpora.
Specifically, the target question-answer corpus is expanded by taking each target question-answer corpus and the target answer sentences corresponding to the high-frequency question words as the target question-answer corpus, so that more question-answer corpuses are obtained.
In an embodiment, as shown in fig. 5, in step S202, extracting a target answer sentence corresponding to a high-frequency question word from a text to be mined according to the high-frequency question word, specifically the following steps are performed:
S501: obtaining a sentence extraction template corresponding to the target question-answer field.
S502: and extracting a target answer sentence corresponding to the high-frequency question word from the text to be mined according to the sentence extraction template.
Specifically, the sentence extraction template is a sentence-extraction expression created in advance according to the needs of the target question-answer field, for example "the definition of XXX is *", where "XXX" is the high-frequency question word and "*" is a wildcard. According to the sentence extraction template, the character strings matching the expression and corresponding to the high-frequency question word can be extracted from the text to be mined, and the portion matched by "*" in each extracted string is used as a target answer sentence corresponding to the high-frequency question word.
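A sentence extraction template can be realized as a capturing regular expression; the English wording "the definition of X is ..." below is an assumption (the embodiment only requires an expression pre-created for the target field):

```python
import re

def extract_answers(high_freq_word: str, text: str):
    """Apply a sentence-extraction template of the form
    'the definition of X is *' and return what the wildcard matched."""
    pattern = re.compile(
        r"the definition of {} is ([^.]+)".format(re.escape(high_freq_word))
    )
    return pattern.findall(text)  # one answer string per template match
```

`re.escape` keeps the template safe even when the high-frequency question word contains regex metacharacters.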
In an embodiment, as shown in fig. 6, in step S403, that is, performing synonym expansion on the high-frequency question words, and acquiring a plurality of target synonyms corresponding to the high-frequency question words, the method specifically includes the following steps:
S601: inputting the high-frequency question words into a similarity function for processing, and acquiring a plurality of candidate synonyms corresponding to the high-frequency question words and the first similarity corresponding to each candidate synonym.
The similarity function is a function which can return a first similarity between the high-frequency question words and the candidate synonyms in the word2vec tool. The first similarity is the word similarity between the high-frequency question words and the candidate synonyms.
Specifically, the server directly inputs the high-frequency question words into the similarity function for processing, so as to obtain a plurality of candidate synonyms corresponding to the high-frequency question words returned by the similarity function and a first similarity between the candidate synonyms and the high-frequency question words.
S602: selecting a plurality of target synonyms corresponding to the high-frequency question words from the candidate synonyms based on the first similarity.
Specifically, the similarity corresponding to each candidate synonym is sorted in descending order, and the candidate synonyms ranked at the top K are selected as target synonyms. Or setting a similarity threshold value to take the candidate synonym with the first similarity larger than the similarity threshold value as the target synonym. The value of K can be set according to actual needs, and is not limited here.
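Both selection strategies described above (top-K after a descending sort, or a similarity threshold) can be sketched over `(word, first_similarity)` pairs such as those returned by a word2vec similarity function:

```python
def select_target_synonyms(candidates, k=None, threshold=None):
    """Select target synonyms from (word, first_similarity) pairs,
    either by taking the top-K after a descending sort or by keeping
    words whose similarity exceeds a threshold (the two options of S602)."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    if threshold is not None:
        return [word for word, sim in ranked if sim > threshold]
    return [word for word, _ in ranked[:k]]
```

With a real word2vec model, `candidates` would come from its most-similar query for the high-frequency question word.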
In an embodiment, as shown in fig. 7, in step S408, the high-frequency query words in the reserved query corpus are replaced according to each target synonym, so as to obtain a plurality of target query corpora corresponding to the high-frequency query words, which specifically includes the following steps:
S701: splitting the target question corpus and removing the high-frequency question words from it to obtain intermediate sentences.
In this embodiment, the target query corpus is divided to remove the high-frequency query words in the target query corpus, and the synonym expansion processing is performed on the character strings, i.e., the intermediate sentences, except the high-frequency query words, so as to further expand the corpus.
S702: performing word segmentation on the intermediate sentences to obtain a plurality of words to be replaced.
Specifically, before word segmentation, developers may set up a Chinese lexicon in advance to support segmentation; the Chinese lexicon (hereinafter "lexicon") is the word list used to segment Chinese text. In this embodiment, segmenting an intermediate sentence with the maximum reverse matching algorithm proceeds as follows: first, set the maximum segmentation window length MAX; then split the historical text into sentences, which can be done on sentence-ending characters (such as "。", "!" and "?"); next, scan each sentence from right to left to obtain a candidate character string; compare the string with the lexicon, and if it is contained in the lexicon, record it as a word to be replaced; otherwise drop one character and compare again, stopping when only a single character remains.
Illustratively, let the maximum segment length MAX be 5 and the input sentence be "我一个人吃饭" ("I eat alone"). Matching starts from the right: the candidate "一个人吃饭" is not in the lexicon, so one leading character is dropped to give "个人吃饭"; this is not in the lexicon either, so the candidate shrinks to "人吃饭" and then to "吃饭", which is in the lexicon and is recorded as a word to be replaced. The sentence then becomes "我一个人"; the same shrinking yields "一个人", which is in the lexicon and is recorded as another word to be replaced. At this point only "我" is left; it is recorded and the algorithm terminates. The backward maximum matching segmentation of the intermediate sentence is therefore "我 / 一个人 / 吃饭".
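The backward maximum matching procedure above can be sketched as follows; the lexicon here is hypothetical and exists only to reproduce a small worked example.

```python
def backward_max_match(sentence, lexicon, max_len=5):
    """Backward (reverse) maximum matching segmentation: scan from the
    right end of the sentence, try the longest candidate first, drop one
    leading character on each lexicon miss, and fall back to a single
    character when nothing longer matches."""
    words = []
    end = len(sentence)
    while end > 0:
        start = max(0, end - max_len)
        # Shrink the candidate from the left until it is in the lexicon
        # or only one character remains.
        while start < end - 1 and sentence[start:end] not in lexicon:
            start += 1
        words.append(sentence[start:end])
        end = start
    words.reverse()  # tokens were collected right-to-left
    return words

lexicon = {"我", "一个人", "吃饭"}  # hypothetical word bank
backward_max_match("我一个人吃饭", lexicon)  # -> ['我', '一个人', '吃饭']
```

Because matching proceeds from the right, backward maximum matching tends to resolve Chinese segmentation ambiguities differently from forward matching, which is why it is the variant chosen here.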
S703: and inputting the word to be replaced into the similarity function for processing, and acquiring a plurality of original near-meaning words corresponding to the word to be replaced and second similarity corresponding to the original near-meaning words.
The approximation function is a function which can return the original near-meaning word corresponding to each word to be replaced in the word2vec tool. The second similarity is the word similarity between the word to be replaced and the original similar meaning word.
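In practice this role is typically filled by a trained word2vec model (e.g., gensim's `most_similar`); the self-contained sketch below substitutes a tiny hand-built embedding table for a trained model, so the vectors and vocabulary are purely illustrative.

```python
import math

# Stand-in for trained word2vec vectors; real values come from training.
embeddings = {
    "保费": [0.90, 0.10, 0.20],
    "费用": [0.85, 0.15, 0.25],
    "天气": [0.10, 0.90, 0.30],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, topn=2):
    """Return (word, similarity) pairs in descending order of cosine
    similarity: the role the similarity function plays in step S703,
    where the attached score is the second similarity."""
    query = embeddings[word]
    scored = [(w, cosine(query, v)) for w, v in embeddings.items() if w != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

most_similar("保费", topn=1)  # the nearest neighbour is '费用' in this toy table
```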
S704: and selecting a target similar meaning word corresponding to the word to be replaced from the plurality of original similar meaning words based on the second approximation degree.
Specifically, the similarity corresponding to each candidate synonym may be sorted in descending order, and the top M-ranked original synonyms may be obtained as the target synonyms. Or setting a similarity threshold value to take the original similar meaning words with the first similarity larger than the similarity threshold value as target similar meaning words. The value of M can be set according to actual needs, and is not limited here.
S705: and reconstructing the target question corpus based on the target near-meaning words and the high-frequency question words to update the target question corpus.
Specifically, the server may randomly select a target similar meaning word from the plurality of target similar meaning words to replace the corresponding word to be replaced in the target question-answering corpus, and combine the target similar meaning word with the high-frequency question words to restore the sentence structure of the target question-answering corpus, so as to further achieve the purpose of data expansion.
Further, in this embodiment, because each word to be replaced corresponds to a plurality of target near-synonyms, and the candidate set for a word includes the word itself, randomly selecting a candidate for replacement may produce a sentence identical to the original target question corpus. Therefore, after the expanded corpora are obtained, all expanded question corpora need to be deduplicated and updated to ensure the validity of the expansion data.
For ease of understanding, consider the following example. Suppose the words to be replaced are A and B. Because each word to be replaced keeps its position in the intermediate sentence, the sentence order is A-B. The target near-synonyms are A: (a1) and B: (b1, b2), so the candidate set for A is {A, a1} and the candidate set for B is {B, b1, b2}. Randomly selecting one candidate for each word to be replaced can produce the combinations (A, B), (A, b1), (A, b2), (a1, B), (a1, b1) and (a1, b2); substituting each combination into the intermediate sentence gives the replaced intermediate sentences (A-B), (A-b1), (A-b2), (a1-B), (a1-b1) and (a1-b2). Removing the repeated intermediate sentence leaves the updated intermediate sentences (A-b1), (A-b2), (a1-B), (a1-b1) and (a1-b2). Each updated intermediate sentence is then combined with the corresponding high-frequency question words, that is, the high-frequency question words are restored according to the sentence structure and word positions of the target question corpus before segmentation, to obtain the reconstructed target question corpus.
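The combination-and-deduplication step in the A/B example above can be sketched with a cartesian product over the replacement slots; the "-" joiner is only a stand-in for restoring the real sentence structure.

```python
from itertools import product

def expand_intermediate_sentence(slots, original):
    """Enumerate every combination of slot candidates, join them in
    sentence order, and drop any result identical to the original
    intermediate sentence (the deduplication described above)."""
    variants = {"-".join(combo) for combo in product(*slots)}
    variants.discard(original)
    return sorted(variants)

# Each slot: the word to be replaced plus its target near-synonyms.
slots = [["A", "a1"], ["B", "b1", "b2"]]
expand_intermediate_sentence(slots, "A-B")
# -> ['A-b1', 'A-b2', 'a1-B', 'a1-b1', 'a1-b2']
```

Building all combinations at once (rather than one random pick at a time) makes the deduplication trivial: the set comprehension removes repeats, and discarding the original sentence guarantees every expansion is new.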
In an embodiment, as shown in fig. 8, step S203, in which a plurality of historical question sentences are obtained as a historical question template based on text similarity, specifically includes the following steps:
S801: acquire the text similarity between the high-frequency question words and each historical question sentence.
The text similarity may be represented by the Jaccard similarity. Specifically, the high-frequency question words and each historical question sentence are vectorized, and the ratio of the intersection to the union of the two resulting vectors is calculated as the text similarity.
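A minimal sketch of this Jaccard computation over token sets (the example tokens are hypothetical):

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard similarity: |A intersect B| / |A union B| over the two
    token sets, used here as the text similarity between the
    high-frequency question words and a historical question sentence."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    if not union:
        return 0.0  # avoid division by zero on two empty inputs
    return len(a & b) / len(union)

jaccard_similarity(["how", "to", "pay", "premium"],
                   ["how", "to", "refund", "premium"])  # -> 0.6
```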
S802: the obtained multiple text similarities are arranged in a descending order, and the previous N-bit historical question sentences are obtained as a historical question template; alternatively, the first and second electrodes may be,
s803: and based on the obtained plurality of text similarity, taking the historical question sentence with the text similarity larger than a preset text similarity threshold as a historical question template.
Specifically, in this embodiment, the historical question template may be determined in, but not limited to, the following two ways. One is to sort the text similarities in descending order and take the top N historical question sentences as the historical question template; the value of N may be set according to actual needs and is not limited here. The other is to take the historical question sentences whose text similarity is greater than a preset text similarity threshold as the historical question template.
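After a historical question template is selected, the historical question words in it are replaced with the high-frequency question words to obtain the question corpus. A minimal sketch of this substitution step, with hypothetical template strings and subject words:

```python
def build_question_corpus(templates, history_word, high_freq_word):
    """Swap the historical subject word in each selected historical
    question template for the high-frequency question word, yielding
    one question corpus entry per template."""
    return [t.replace(history_word, high_freq_word) for t in templates]

# Hypothetical templates whose subject word is "car insurance".
templates = ["how do I pay my car insurance",
             "can I cancel car insurance online"]
build_question_corpus(templates, "car insurance", "life insurance")
# -> ['how do I pay my life insurance', 'can I cancel life insurance online']
```

A straight string replacement assumes the subject word appears verbatim in the template, which holds here because the templates were selected for their similarity to the question words.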
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not limit the implementation of the embodiments of the present invention.
In an embodiment, a corpus generating device is provided, and the corpus generating device corresponds to the corpus generating method in the above embodiment one to one. As shown in fig. 8, the corpus generating apparatus includes a data acquiring module 10, a target question-and-answer sentence extracting module 20, a historical question template acquiring module 30, a question corpus acquiring module 40, and a target question-and-answer corpus acquiring module 50. The functional modules are explained in detail as follows:
The data acquisition module 10 is used for acquiring the high-frequency question words and the text to be mined corresponding to the target question-answering field; the high-frequency question words are used for indicating the subject words corresponding to high-frequency questions in the target question-answering field.
The target question-answer sentence extracting module 20 is configured to extract, according to the high-frequency question words, a target answer sentence corresponding to the high-frequency question words from the text to be mined.
The historical question template acquiring module 30 is configured to perform text similarity matching between the high-frequency question words and a plurality of historical question sentences in the historical question-answer library, and obtain a plurality of historical question sentences as historical question templates.
The question corpus acquiring module 40 is configured to replace the historical question words in the historical question template with the high-frequency question words to obtain a question corpus; the historical question words are the subject words corresponding to the historical question template.
The target question-answer corpus acquiring module 50 is configured to use the question corpus and the target answer sentences corresponding to the high-frequency question words as the target question-answer corpus.
Specifically, the corpus generating device further comprises a semantic grammar checking module, a first processing module and a second processing module.
The semantic grammar checking module is configured to perform a grammar and semantic check on the question corpus by using a pre-trained language model, so as to obtain a score value of the question corpus.
The first processing module is configured to retain the question corpus when the score value is not less than a preset score threshold.
The second processing module is configured to remove the question corpus when the score value is less than the preset score threshold.
The target question-answer corpus acquiring module is specifically configured to use the retained question corpora and the target answer sentences corresponding to the high-frequency question words as the target question-answer corpus.
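The keep/remove logic of the first and second processing modules can be sketched as below. `score_fn` stands in for the pre-trained language model's grammar/semantic score, which this document does not specify; the toy scorer in the example is purely illustrative.

```python
def filter_question_corpora(corpora, score_fn, threshold):
    """Apply the grammar/semantic check: retain a question corpus when
    its score is not less than the threshold, remove it otherwise."""
    retained, removed = [], []
    for sentence in corpora:
        (retained if score_fn(sentence) >= threshold else removed).append(sentence)
    return retained, removed

# Toy stand-in scorer: longer sentences score higher. A real deployment
# would plug in a language-model score here.
retained, removed = filter_question_corpora(
    ["how to pay premium", "premium how"], score_fn=len, threshold=12)
# retained -> ['how to pay premium'], removed -> ['premium how']
```

Keeping the scorer injectable means the same filtering module works regardless of which pre-trained language model produces the score value.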
Specifically, the corpus generating device further comprises a synonym expansion module and a target question corpus acquiring module.
The synonym expansion module is configured to perform synonym expansion on the high-frequency question words and obtain a plurality of target synonyms corresponding to the high-frequency question words.
The target question corpus acquiring module is configured to replace the high-frequency question words in the retained question corpora according to each target synonym, so as to obtain a plurality of target question corpora corresponding to the high-frequency question words.
The target question-answer corpus acquiring module is specifically configured to use each target question corpus and the target answer sentences corresponding to the high-frequency question words as the target question-answer corpus.
Specifically, the target question-answer sentence extraction module includes a sentence extraction template acquiring unit and a target answer sentence extraction unit.
The sentence extraction template acquiring unit is configured to obtain a sentence extraction template corresponding to the target question-answering field.
The target answer sentence extraction unit is configured to extract, according to the sentence extraction template, a target answer sentence corresponding to the high-frequency question words from the text to be mined.
Specifically, the synonym expansion module comprises a candidate synonym acquiring module and a target synonym acquiring module.
The candidate synonym acquiring module is configured to input the high-frequency question words into the similarity function for processing, and obtain a plurality of candidate synonyms corresponding to the high-frequency question words and the first similarity corresponding to each candidate synonym.
The target synonym acquiring module is configured to select, based on the first similarity, a plurality of target synonyms corresponding to the high-frequency question words from the plurality of candidate synonyms.
Specifically, the corpus generating device further comprises a segmentation module, a word segmentation module, an original near-synonym acquiring module, a target near-synonym acquiring module and a target question corpus reconstructing module.
The segmentation module is configured to segment the target question corpus and remove the high-frequency question words in the target question corpus, so as to obtain an intermediate sentence.
The word segmentation module is configured to perform word segmentation on the intermediate sentence to obtain a plurality of words to be replaced.
The original near-synonym acquiring module is configured to input each word to be replaced into the similarity function for processing, and obtain a plurality of original near-synonyms corresponding to the word to be replaced and the second similarity corresponding to each original near-synonym.
The target near-synonym acquiring module is configured to select, based on the second similarity, the target near-synonyms corresponding to the word to be replaced from the plurality of original near-synonyms.
The target question corpus reconstructing module is configured to reconstruct the target question corpus based on the target near-synonyms and the high-frequency question words, so as to update the target question corpus.
Specifically, the historical question template acquiring module comprises a text similarity acquiring unit, a sorting unit and a historical question template acquiring unit.
The text similarity acquiring unit is configured to acquire the text similarity between the high-frequency question words and each historical question sentence.
The sorting unit is configured to sort the obtained text similarities in descending order and obtain the top N historical question sentences as the historical question template; alternatively,
the historical question template acquiring unit is configured to take, based on the obtained text similarities, the historical question sentences whose text similarity is greater than a preset text similarity threshold as the historical question template.
For specific limitations of the corpus generating device, reference may be made to the limitations of the corpus generating method above, which are not repeated here. All or part of the modules in the corpus generating device may be implemented by software, hardware, or a combination thereof. The modules may be embedded in or independent of a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server; its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a computer storage medium and an internal memory. The computer storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the computer storage medium. The database of the computer device is used for storing the data generated or obtained during execution of the corpus generating method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a corpus generating method.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the steps of the corpus generating method in the above embodiments are implemented, such as steps S201 to S207 shown in fig. 2 or the steps shown in figs. 3 to 7. Alternatively, when the processor executes the computer program, the functions of the modules/units in the embodiment of the corpus generating device, for example the functions of the modules/units shown in fig. 8, are implemented; these are not repeated here to avoid repetition.
In an embodiment, a computer storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the steps of the corpus generating method in the foregoing embodiments, such as steps S201 to S207 shown in fig. 2 or the steps shown in figs. 3 to 7. Alternatively, when executed by the processor, the computer program implements the functions of the modules/units in the embodiment of the corpus generating device, for example the functions of the modules/units shown in fig. 8; these are not repeated here to avoid repetition.
The blockchain referred to in this embodiment is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, each of which contains information on a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing related hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of the functional units and modules is illustrated. In practical applications, the above functions may be distributed among different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention is described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the protection scope of the present invention.

Claims (10)

1. A corpus generating method, comprising:
acquiring high-frequency question words and a text to be mined corresponding to the target question-answering field; the high-frequency question words are used for indicating subject words corresponding to high-frequency questions in the target question-answering field;
extracting, according to the high-frequency question words, a target answer sentence corresponding to the high-frequency question words from the text to be mined;
performing text similarity matching on the high-frequency question words and a plurality of historical question sentences in a historical question-answer library to obtain a plurality of historical question sentences as historical question templates;
replacing the historical question words in the historical question template with the high-frequency question words to obtain a question corpus; the historical question words are subject words corresponding to the historical question template;
and taking the question corpus and the target answer sentence corresponding to the high-frequency question words as a target question-answer corpus.
2. The corpus generating method according to claim 1, wherein after said replacing the history question words in the history question template with the high frequency question words to obtain the question corpus, the corpus generating method further comprises:
performing a grammar and semantic check on the question corpus by using a pre-trained language model, so as to obtain a score value of the question corpus;
retaining the question corpus when the score value is not less than a preset score threshold;
removing the question corpus when the score value is less than the preset score threshold;
the taking the question corpus and the target answer sentence corresponding to the high-frequency question words as a target question-answer corpus includes:
taking the retained question corpora and the target answer sentences corresponding to the high-frequency question words as the target question-answer corpus.
3. The corpus generating method according to claim 2, wherein after the obtaining of the high-frequency question words and the text to be mined corresponding to the target question-answering field, the corpus generating method further comprises:
carrying out synonym expansion on the high-frequency question words to obtain a plurality of target synonyms corresponding to the high-frequency question words;
the taking the retained question corpora and the target answer sentence corresponding to the high-frequency question words as the target question-answer corpus includes:
replacing the high-frequency question words in the retained question corpora according to each target synonym, so as to obtain a plurality of target question corpora corresponding to the high-frequency question words;
and taking each target question corpus and the target answer sentence corresponding to the high-frequency question words as the target question and answer corpus.
4. The corpus generating method according to claim 1, wherein said extracting, according to the high-frequency question words, the target answer sentences corresponding to the high-frequency question words from the text to be mined comprises:
obtaining a sentence extraction template corresponding to the target question and answer field;
and extracting a target answer sentence corresponding to the high-frequency question word from the text to be mined according to the sentence extraction template.
5. The corpus generating method according to claim 3, wherein said performing synonym expansion on the high-frequency question words to obtain a plurality of target synonyms corresponding to the high-frequency question words comprises:
inputting the high-frequency question words into a similarity function for processing, and acquiring a plurality of candidate synonyms corresponding to the high-frequency question words and first similarities corresponding to the candidate synonyms;
and selecting, based on the first similarities, a plurality of target synonyms corresponding to the high-frequency question words from the candidate synonyms.
6. The corpus generating method according to claim 3, wherein after the replacing the high-frequency question words in the retained question corpora according to each target synonym to obtain a plurality of target question corpora corresponding to the high-frequency question words, the corpus generating method further comprises:
segmenting the target question corpus, and removing the high-frequency question words in the target question corpus to obtain an intermediate sentence;
performing word segmentation on the intermediate sentence to obtain a plurality of words to be replaced;
inputting the words to be replaced into a similarity function for processing, and acquiring a plurality of original near-synonyms corresponding to the words to be replaced and second similarities corresponding to the original near-synonyms;
selecting, based on the second similarities, target near-synonyms corresponding to the words to be replaced from the original near-synonyms;
and reconstructing the target question corpus based on the target near-synonyms and the high-frequency question words to update the target question corpus.
7. The corpus generating method according to claim 1, wherein said performing text similarity matching between said high frequency question words and a plurality of historical question sentences in a historical question-and-answer library to obtain a plurality of historical question sentences as a historical question template comprises:
acquiring the text similarity between the high-frequency question words and each historical question sentence;
sorting the obtained text similarities in descending order, and obtaining the top N historical question sentences as the historical question template; alternatively,
and based on the plurality of acquired text similarity, taking the historical question sentence with the text similarity larger than a preset text similarity threshold as the historical question template.
8. A corpus generating device, comprising:
the data acquisition module is used for acquiring high-frequency question words and a text to be mined corresponding to the target question-answering field; the high-frequency question words are used for indicating subject words corresponding to high-frequency questions in the target question-answering field;
the target question-answer sentence extraction module is used for extracting, according to the high-frequency question words, a target answer sentence corresponding to the high-frequency question words from the text to be mined;
a historical question template obtaining module, configured to perform text similarity matching on the high-frequency question words and multiple historical question sentences in a historical question-answer library, and obtain multiple historical question sentences as historical question templates;
the question corpus acquiring module is used for replacing the historical question words in the historical question template with the high-frequency question words to obtain a question corpus; the historical question words are subject words corresponding to the historical question template;
and the target question-answer corpus acquiring module is used for taking the question corpus and the target answer sentences corresponding to the high-frequency question words as target question-answer corpus.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the corpus generation method according to any one of claims 1 to 7 when executing the computer program.
10. A computer storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the corpus generation method according to any one of claims 1 to 7.
CN202110575555.9A 2021-05-26 2021-05-26 Corpus generation method and device, computer equipment and storage medium Pending CN113157897A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110575555.9A CN113157897A (en) 2021-05-26 2021-05-26 Corpus generation method and device, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN113157897A true CN113157897A (en) 2021-07-23

Family

ID=76877476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110575555.9A Pending CN113157897A (en) 2021-05-26 2021-05-26 Corpus generation method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113157897A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150278264A1 (en) * 2014-03-31 2015-10-01 International Business Machines Corporation Dynamic update of corpus indices for question answering system
CN112328762A (en) * 2020-11-04 2021-02-05 平安科技(深圳)有限公司 Question and answer corpus generation method and device based on text generation model
CN112364660A (en) * 2020-10-27 2021-02-12 中国平安人寿保险股份有限公司 Corpus text processing method and device, computer equipment and storage medium
CN112395867A (en) * 2020-11-16 2021-02-23 中国平安人寿保险股份有限公司 Synonym mining method, synonym mining device, synonym mining storage medium and computer equipment
CN112417848A (en) * 2019-08-19 2021-02-26 阿里巴巴集团控股有限公司 Corpus generation method and device and computer equipment


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116414965A (en) * 2023-05-25 2023-07-11 Beijing Lingxin Intelligent Technology Co Ltd Initial dialogue content generation method, device, medium and computing equipment
CN116414965B (en) * 2023-05-25 2023-08-22 Beijing Lingxin Intelligent Technology Co Ltd Initial dialogue content generation method, device, medium and computing equipment
CN116662523A (en) * 2023-08-01 2023-08-29 Ningbo Yongheng Yaoyao Intelligent Technology Co Ltd Biochemical knowledge question-answering method, system and storage medium based on GPT model
CN116662523B (en) * 2023-08-01 2023-10-20 Ningbo Yongheng Yaoyao Intelligent Technology Co Ltd Biochemical knowledge question-answering method, system and storage medium based on GPT model
CN117725148A (en) * 2024-02-07 2024-03-19 Hunan Sanxiang Bank Co Ltd Question-answer word library updating method based on self-learning

Similar Documents

Publication Publication Date Title
CN110162627B (en) Data increment method and device, computer equipment and storage medium
US10963637B2 (en) Keyword extraction method, computer equipment and storage medium
CN113157897A (en) Corpus generation method and device, computer equipment and storage medium
Zhang et al. Entity linking with effective acronym expansion, instance selection and topic modeling
Campos et al. Biomedical named entity recognition: a survey of machine-learning tools
CN110955761A (en) Method and device for acquiring question and answer data in document, computer equipment and storage medium
CN110674319A (en) Label determination method and device, computer equipment and storage medium
CN108986910B (en) On-line question and answer method, device, computer equipment and storage medium
CN112364660B (en) Corpus text processing method, corpus text processing device, computer equipment and storage medium
CN111984851B (en) Medical data searching method, device, electronic device and storage medium
CN112215008B (en) Entity identification method, device, computer equipment and medium based on semantic understanding
CN108664595B (en) Domain knowledge base construction method and device, computer equipment and storage medium
CN113076431B (en) Question and answer method and device for machine reading understanding, computer equipment and storage medium
CN111753531A (en) Text error correction method and device based on artificial intelligence, computer equipment and storage medium
CN112287080B (en) Method and device for rewriting problem statement, computer device and storage medium
CN112380837B (en) Similar sentence matching method, device, equipment and medium based on translation model
US20190317986A1 (en) Annotated text data expanding method, annotated text data expanding computer-readable storage medium, annotated text data expanding device, and text classification model training method
CN111291177A (en) Information processing method and device and computer storage medium
Ji et al. Data selection in semi-supervised learning for name tagging
Habib et al. An exploratory approach to find a novel metric based optimum language model for automatic Bangla word prediction
CN110991181A (en) Method and apparatus for enhancing labeled samples
CN111931492A (en) Data expansion mixing strategy generation method and device and computer equipment
CN110929518A (en) Text sequence labeling algorithm using overlapping splitting rule
CN112990290A (en) Sample data generation method, device, equipment and storage medium
CN110188339B (en) Scenic spot evaluation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination