CN111143531A

CN111143531A - Question-answer pair construction method, system, device and computer readable storage medium

Info

Publication number: CN111143531A
Application number: CN201911349116.5A
Authority: CN
Inventors: 蒋芳清; 熊友军
Original assignee: Ubtech Robotics Corp
Current assignee: Ubtech Robotics Corp
Priority date: 2019-12-24
Filing date: 2019-12-24
Publication date: 2020-05-12

Abstract

The invention discloses a question-answer pair construction method, system, device and computer readable storage medium, wherein the method comprises the following steps: extracting sentences of potential question-answer pairs from text paragraphs; sorting the sentences of the potential question-answer pairs to generate candidate question-answer pairs; and scoring and screening the candidate question-answer pairs to obtain the question-answer pairs with scores higher than a set threshold. In this way, potential question-answer pairs in documents are automatically extracted to construct high-quality question-answer pairs, which improves the speed and accuracy of question-answer pair construction and the quality of the question-answer knowledge base.

Description

Question-answer pair construction method, system, device and computer readable storage medium

Technical Field

The invention relates to the technical field of natural language processing and knowledge base storage, in particular to a question-answer pair construction method, a question-answer pair construction system, a question-answer pair construction device and a computer readable storage medium.

Background

An existing question-answer knowledge base consists of scenes, questions and corresponding answers. Its knowledge sources are mainly documents such as rules and terms and user manuals, which contain simple statements of fact, such as 'passengers may not bring pets when taking a high-speed train' and 'goods may be returned without reason within 7 days after inspection and acceptance'.

In a question-answering system based on question-answer pairs, the question-answer knowledge base formed by the question-answer pairs is the knowledge source of the system, and the accuracy and richness of that knowledge determine the quality of the system, so the knowledge base formed by the question-answer pairs is an important link in the question-answering system.

The construction of the existing knowledge base depends on the traditional manual editing mode, question-answer pairs are extracted from text documents such as rule terms and user manuals, and the question-answer pairs are scored in a manual screening mode. This construction requires manual intervention, which not only requires high operating and maintenance costs, but also makes it difficult to control the quality of the knowledge base.

Disclosure of Invention

Aiming at the defects in the prior art, the invention mainly solves the technical problem of providing a question-answer pair construction method, system, device and computer readable storage medium, in which the question-answer knowledge base is constructed by automatically extracting question-answer pairs from a document based on natural language processing and deep learning technology, so that the automatic construction of the question-answer knowledge base is realized, the labor cost is reduced, and the quality of the question-answer knowledge base is improved.

In order to solve the technical problems, one technical scheme adopted by the invention is to provide a question-answer pair construction method, which comprises the following steps: extracting sentences of potential question-answer pairs in the text paragraphs; sorting the sentences of the potential question-answer pairs to generate candidate question-answer pairs; and scoring the candidate question-answer pairs and screening to obtain the question-answer pairs with scores higher than a set threshold value.

Before the step of extracting sentences of potential question-answer pairs in the text paragraphs, the method comprises the following steps: extracting text paragraphs from an input text document, and performing segmentation processing on the text paragraphs by adopting a segmentation method; and performing text preprocessing on the text paragraphs after the segmentation processing.

The step of extracting the sentences of the potential question-answer pairs in the text paragraphs specifically comprises the following steps: performing syntactic dependency analysis on sentences in the text paragraphs and outputting dependency relationships among words in the sentences; extracting a main stem of the sentence according to the dependency relationship; judging whether the backbone of the sentence has potential question-answer knowledge or not; and when the judgment result is yes, extracting sentences which are potential question-answer pairs.

The step of sorting the sentences of the potential question-answer pairs to generate candidate question-answer pairs specifically comprises the following steps: simplifying the sentences of the potential question-answer pairs; performing entity recognition on the simplified sentences; extracting entities of the sentences to construct question-answer pairs; and rewriting the question-answer pairs using a rewriting method based on a deep generative model to obtain candidate question-answer pairs.

The step of scoring and screening the candidate question-answer pairs to obtain the question-answer pairs specifically comprises the following steps: scoring the candidate question-answer pairs using a scoring method based on a fast text classification model; and, according to the scoring results of the candidate question-answer pairs, obtaining the question-answer pairs with scores higher than the set threshold using a screening method based on sorting and filtering.

In order to solve the above technical problem, another technical solution adopted by the present invention is to provide a question-answer pair construction system, including: the extraction module is used for extracting sentences of potential question-answer pairs in the text paragraphs; the candidate question-answer pair generating module is used for sorting sentences of the potential question-answer pairs and generating candidate question-answer pairs; the scoring module is used for training a rapid text classification model to score the candidate question answers for classification; and the screening module is used for screening out question-answer pairs with the scores higher than a set threshold value through the sorting filter.

Wherein, the question-answer pair construction system further comprises: the input module is used for inputting a text document; and the preprocessing module is used for preprocessing the text paragraphs.

Wherein, the question-answer pair construction system further comprises: the judging module is used for judging whether the sentences in the text paragraphs have potential question-answer knowledge or not; and the output module is used for outputting the question-answer pairs with the scores higher than the set threshold value.

In order to solve the above technical problem, another technical solution adopted by the present invention is to provide a question-answer pair constructing apparatus, including: a memory for storing program data which, when executed, implements the steps of the question-answer pair construction method described in any one of the above; a processor for executing the program instructions stored in the memory to implement the steps of the question-answer pair construction method described in any one of the above.

In order to solve the above technical problem, another technical solution adopted by the present invention is to provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps in the question-answer pair construction method described in any one of the above.

The invention has the beneficial effects that: different from the situation of the prior art, the candidate question-answer pairs are generated by automatically extracting sentences of potential question-answer pairs in the document from the document, the candidate question-answer pairs are rewritten by adopting a method based on combination of a question template and a deep learning model, the accuracy and the diversity of generated questions are ensured, and finally, the candidate question-answer pairs are scored and screened based on a scoring model and a rapid text classification model, the correlation between the generated questions and answers is ensured, and high-quality question-answer pairs are obtained. By the method, the system and the device, the automatic construction of the question-answer pairs is realized, the degree of dependence on the traditional manual editing is reduced, the labor cost is reduced, the question-answer pairs are constructed more quickly and accurately, and the quality of the question-answer knowledge base is improved.

Drawings

FIG. 1 is a schematic flow chart diagram of an embodiment of a question-answer pair construction method according to the present invention;

FIG. 2 is a schematic flow chart of one embodiment of step 11 of FIG. 1;

FIG. 3 is a schematic flow chart of one embodiment of step 12 of FIG. 1;

FIG. 4 is a schematic diagram of a question template and a corresponding question and answer pair according to an embodiment of the present invention;

FIG. 5 is a schematic flow chart diagram of one embodiment of step 13 of FIG. 1;

FIG. 6 is a block diagram of an embodiment of a question-answer pair construction system of the present invention;

FIG. 7 is a schematic structural diagram of an embodiment of a question-answer pair constructing apparatus according to the present invention;

FIG. 8 is a schematic structural diagram of an embodiment of a computer-readable storage medium of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the embodiments of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well unless the context clearly dictates otherwise, and "a plurality of" generally includes at least two but does not exclude the case of at least one.

It should be understood that the term "and/or" as used herein merely describes an association between associated objects, meaning that three relationships may exist, e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.

It should be understood that the terms "comprises," "comprising," or any other variation thereof, as used herein, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

A question-answer pair construction method, system, apparatus, and computer-readable storage medium according to embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that these examples are not intended to limit the scope of the present disclosure.

Referring to fig. 1, fig. 1 is a schematic flow chart of an embodiment of a question-answer pair construction method according to the present invention, in which the method includes:

S11: Extract sentences of potential question-answer pairs from the text paragraphs.

In an embodiment of the invention, text paragraphs are extracted from the input original text document. After the text paragraphs are extracted from the text document, the text paragraphs are segmented by adopting a segmentation method.

Specifically, the line feed character is used as a mark for paragraph distinction, and the number of characters of each paragraph is controlled within a preset interval range.
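The segmentation rule above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_paragraphs`, the bounds `min_chars` and `max_chars`, and the choice to skip too-short paragraphs and cut over-long ones into fixed-size chunks are all assumptions, since the text only states that paragraph length is kept within a preset interval.

```python
def split_paragraphs(text, min_chars=10, max_chars=500):
    """Split text on line feeds and bound the length of each paragraph.

    Assumptions (not stated in the source): paragraphs shorter than
    min_chars are skipped, and paragraphs longer than max_chars are cut
    into chunks of at most max_chars characters.
    """
    paragraphs = []
    for raw in text.split("\n"):
        para = raw.strip()
        if len(para) < min_chars:
            continue  # too short to carry a usable statement of fact
        for start in range(0, len(para), max_chars):
            paragraphs.append(para[start:start + max_chars])
    return paragraphs
```

A smarter variant would cut over-long paragraphs at sentence boundaries rather than at a fixed character offset.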

Further, the segmented text paragraphs are preprocessed.

Specifically, the text paragraphs are processed by at least one of abnormal character removal, upper/lower case conversion, simplified/traditional Chinese conversion, sentence segmentation and word segmentation, part-of-speech tagging, and sentence simplification. Reducing character noise in this way lowers the difficulty of subsequent processing.
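A few of these preprocessing modes can be sketched with the Python standard library alone. The sketch below covers abnormal character removal, case conversion, and sentence segmentation; simplified/traditional conversion, word segmentation, and part-of-speech tagging need external tools and are left out. The function name `preprocess` and the set of sentence-ending punctuation marks are assumptions.

```python
import re
import unicodedata

# Assumption: these marks end a sentence. A "." inside a number such as
# "0.8" would also trigger a split in this simplified sketch.
_SENTENCE_END = re.compile(r"(?<=[。！？!?.])\s*")

def preprocess(paragraph):
    """Normalize a paragraph and split it into lower-cased sentences."""
    # NFKC folds full-width letters, digits and punctuation to ASCII forms.
    text = unicodedata.normalize("NFKC", paragraph)
    # Drop control and format characters (Unicode categories starting with "C").
    text = "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")
    text = text.lower()
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
```

For example, `preprocess("Pets are NOT allowed. Tickets are required!")` yields two lower-cased sentences.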

In the embodiment of the invention, syntactic dependency analysis is carried out on the preprocessed sentence, the dependency relationship among the words in the sentence is output through the syntactic dependency analysis, the stem of the sentence is extracted according to the dependency relationship, and finally whether potential question-answering knowledge exists in the sentence is judged through the stem.

Specifically, referring to fig. 2, fig. 2 is a schematic flow chart of an embodiment of step 11 in fig. 1, in which the method includes:

S211: Perform syntactic dependency analysis on the preprocessed sentences and output the dependency relationships among the words in the sentences.

Syntactic dependency analysis, called dependency analysis for short, is used for identifying interdependencies among words in a sentence.

Specifically, dependency syntax expresses the structure of the whole sentence through the dependency relationships among its words, which reflect the semantic dependencies among the components of the sentence. The dependency relationships among all the words form a syntax tree whose root node is the central word of the sentence and expresses the core content of the whole sentence. That is, each sentence has only one central word, and every other word in the sentence is linked to the central word through the tree.

In an embodiment of the invention, syntactic dependency analysis is to label dependencies between words in a sentence.

Specifically, the dependency relationship among words is at least one of a subject-predicate relationship, a verb-object relationship, an indirect-object relationship, a fronted object, a pivotal (double) construction, an attribute-head relationship, an adverbial-head structure, a verb-complement structure, a coordinate relationship, a preposition-object relationship, a left adjunct relationship, a right adjunct relationship, an independent structure, and a core (head) relationship.

In the embodiment of the present invention, the syntax dependency analysis method or tool is not specifically limited, but the format of the syntax analysis result is limited.

Specifically, the output format of the syntactic dependency analysis result is defined as the CoNLL annotation format.
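As an illustration of consuming such output, the sketch below reads one sentence in the 10-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The source does not say which CoNLL variant is meant, so the column layout is an assumption, as are the relation labels `SBV`, `HED` and `VOB` in the sample and the names `Token` and `parse_conll`.

```python
from typing import List, NamedTuple

class Token(NamedTuple):
    idx: int     # 1-based position of the word in the sentence
    form: str    # surface form of the word
    pos: str     # coarse part-of-speech tag
    head: int    # index of the governing word, 0 for the root
    deprel: str  # dependency relation label

# Hypothetical parser output for "passengers carry pets".
SAMPLE = (
    "1\tpassengers\t_\tn\tn\t_\t2\tSBV\t_\t_\n"
    "2\tcarry\t_\tv\tv\t_\t0\tHED\t_\t_\n"
    "3\tpets\t_\tn\tn\t_\t2\tVOB\t_\t_"
)

def parse_conll(block: str) -> List[Token]:
    """Parse one sentence of tab-separated CoNLL-X lines into tokens."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # skip blank or malformed lines
        tokens.append(Token(int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7]))
    return tokens
```

Fixing the output format in this way lets the later steps work with any parser that can emit it.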

S212: Extract the stem of the sentence according to the dependency relationships.

In an embodiment of the present invention, the stems of sentences are extracted from the result of parsing the syntactic dependencies.

Specifically, the word whose dependency relationship is the core relationship, that is, the central word, is extracted as the predicate.

The central word is the center of the sentence: it governs the other components in the sentence and is not governed by any other component.

Further, all the subject-predicate relationships in the sentence are traversed, and the word whose dependency relationship with the predicate is a subject-predicate relationship is extracted as the subject.

Further, all the verb-object relationships in the sentence are traversed, and the word whose dependency relationship with the predicate is a verb-object relationship is extracted as the object.

Further, the subject, predicate, and object extracted in the above steps are combined to form the stem of the sentence.
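The stem extraction just described can be sketched as below. Tokens are given as `(index, word, head_index, relation)` tuples; the relation labels `HED` (core), `SBV` (subject-predicate) and `VOB` (verb-object) are assumed label names, not taken from the source, and `extract_stem` is a hypothetical function name.

```python
def extract_stem(tokens):
    """Return the subject, predicate and object of a parsed sentence.

    tokens: list of (idx, form, head, deprel) tuples for one sentence.
    """
    # The predicate is the word carrying the core relation (assumed "HED").
    predicate = next((t for t in tokens if t[3] == "HED"), None)
    if predicate is None:
        return {"subject": None, "predicate": None, "object": None}

    def dependent(label):
        # First word attached to the predicate with the given relation.
        for idx, form, head, deprel in tokens:
            if deprel == label and head == predicate[0]:
                return form
        return None

    return {
        "subject": dependent("SBV"),
        "predicate": predicate[1],
        "object": dependent("VOB"),
    }
```

With the tokens for "passengers carry pets", the result maps subject to "passengers", predicate to "carry", and object to "pets".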

S213: Judge whether the stem of the sentence contains potential question-answer knowledge.

In the embodiment of the invention, it is judged whether the extracted stem satisfies the condition of containing a subject, a predicate and an object. If so, the sentence is judged to contain potential question-answer knowledge; otherwise, it is judged that no potential question-answer knowledge exists in the sentence.

S214: When the judgment result is yes, extract the sentence as a sentence of a potential question-answer pair.

In this step, the sentences with potential question-answer knowledge are extracted as the sentences of potential question-answer pairs.
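Steps S213 and S214 amount to a filter over parsed sentences. The sketch below assumes the stem is a dictionary with `subject`, `predicate` and `object` keys and that all three must be present; the exact condition in the source is ambiguous, so this is one possible reading. Both function names are hypothetical.

```python
def has_potential_qa(stem):
    """Assumed condition: subject, predicate and object are all present."""
    return all(stem.get(key) for key in ("subject", "predicate", "object"))

def select_candidate_sentences(sentences_with_stems):
    """Keep the sentences whose stem passes the check.

    sentences_with_stems: iterable of (sentence, stem_dict) pairs.
    """
    return [sentence for sentence, stem in sentences_with_stems
            if has_potential_qa(stem)]
```

A looser reading, for example accepting stems with only a subject and a predicate, would only change the tuple of required keys.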

S12: Sort the sentences of the potential question-answer pairs to generate candidate question-answer pairs.

In an embodiment of the invention, the sentences identified as potentially containing question-answer knowledge are collated to generate question-answer pairs, and the question-answer pairs generated in this step are the candidate question-answer pairs.

Specifically, please refer to fig. 3, where fig. 3 is a schematic flowchart of an embodiment of step 12 in fig. 1, and in this embodiment, the method includes:

S311: Simplify the sentences of the potential question-answer pairs.

In embodiments of the present invention, simplifying a sentence of a complex potential question-and-answer pair refers to deleting meaningless clauses or components in the sentence.

The complex sentence is a sentence composed of a plurality of clauses or having a complex structure.

In the embodiment of the invention, whether the sentences of the potential question-answer pairs are complex sentences is judged firstly.

Wherein, the complex sentence is a sentence with the number of clauses larger than 1 or the syntactic dependency coefficient larger than 3.

Further, when the judgment result is yes, simplifying the sentence; if the determination result is negative, the process proceeds directly to S312.

In the embodiment of the invention, the sentence simplification comprises three steps of nonsense clause deletion, main stem extraction and main stem supplement.

In the step of deleting the nonsense clauses, firstly, a nonsense sentence set is defined, then clauses in the complex sentences are matched with the nonsense sentences in the defined set one by one, and the clauses are deleted if the matching is successful.
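The complexity check and the clause deletion step can be sketched as follows. The contents of `MEANINGLESS_CLAUSES`, the clause separators, and exact (case-insensitive) matching are assumptions made for illustration. The source does not define how the syntactic dependency coefficient is computed, so it is passed in as a plain number here.

```python
import re

# Hypothetical set of meaningless clauses; a real set would be curated.
MEANINGLESS_CLAUSES = {"in other words", "as everyone knows"}

# Assumption: commas and semicolons (ASCII or full-width) separate clauses.
_CLAUSE_SEP = re.compile(r"[,;，；]")

def split_clauses(sentence):
    return [c.strip() for c in _CLAUSE_SEP.split(sentence) if c.strip()]

def is_complex(sentence, dependency_coefficient=0):
    """Complex if there is more than one clause or the coefficient exceeds 3."""
    return len(split_clauses(sentence)) > 1 or dependency_coefficient > 3

def delete_meaningless_clauses(sentence):
    """Drop every clause that matches an entry of the predefined set."""
    kept = [c for c in split_clauses(sentence)
            if c.lower().rstrip(".。") not in MEANINGLESS_CLAUSES]
    return ", ".join(kept)
```

Exact matching keeps the sketch short; fuzzy or pattern-based matching against the set would be an equally valid reading of "matched one by one".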

The stem extraction step is the same as S212, and is not described herein again.

Wherein, in the stem supplement step, the modifier components of the stem are combined with the extracted stem to form a new stem.

The embodiment of the present invention does not limit the manner of the stem supplement.

Optionally, in other embodiments of the present invention, the stem supplement step may not be performed.

S312: Perform entity recognition on the simplified sentence.

In an embodiment of the invention, the identified entity type includes at least one of a person name, a place name, an organization name, and a time.

Specifically, the embodiment of the present invention does not limit the recognition method, and the entity may be recognized by at least one of dictionary matching, training of an entity recognition model, and direct utilization of an open source tool.

S313: Extract entities of the sentence to construct question-answer pairs.

In the embodiment of the invention, question-answer pairs are constructed by adopting a question-answer pair construction method based on a question template.

In embodiments of the present invention, the defined question templates are divided into two broad categories: entity replacement and template filling.

In a specific implementation scenario, with a question template based on entity replacement, the corresponding query word is used directly in place of an entity of a type such as a person name, a place name, an organization name or a time to generate a question, and the replaced entity is used as the answer.

For example, when the identified entity is "Rome", since Rome is a place name, the corresponding query word is "where", and "Rome" is directly used as the answer; when the identified entity is "at eight o'clock", since eight o'clock is a time, the corresponding query word is "when", and "at eight o'clock" is directly used as the answer.

In another specific implementation scenario, with a question template based on template filling, the slots are first determined, and then phrases are extracted from the document to fill the slots, complete the sentence and adjust the sentence structure.
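Entity replacement can be sketched as a lookup from entity type to query word followed by a string substitution. The type names and the query words in `QUERY_WORDS` are assumptions except for "where" for place names, which the example above gives; `entity_replacement_qa` is a hypothetical function name, and no word-order adjustment is attempted.

```python
# Assumed mapping from entity type to query word.
QUERY_WORDS = {
    "PERSON": "who",
    "LOCATION": "where",
    "ORGANIZATION": "which organization",
    "TIME": "when",
}

def entity_replacement_qa(sentence, entity, entity_type):
    """Replace the entity with its query word; the entity is the answer.

    Returns (question, answer), or None if the entity type is unknown or
    the entity does not occur in the sentence.
    """
    query = QUERY_WORDS.get(entity_type)
    if query is None or entity not in sentence:
        return None
    question = sentence.replace(entity, query, 1).rstrip(".。") + "?"
    return question, entity
```

The raw output, such as "The train arrives when?", is stilted, which is consistent with the later rewriting step being applied to template-generated questions.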

Specifically, please refer to fig. 4, fig. 4 is a schematic structural diagram of a question template and a corresponding question and answer pair according to an embodiment of the present invention.

In practical applications, when "X is Y" is input in the template slot 41, the following questions can be constructed in the question slot 42: "Is X Y?", "What is Y?", "Where is Y?", "Why is Y?", "Who is Y?", and the corresponding answer slot 43 generates the following answers: "Yes.", "X.".

Alternatively, when "The X verbs Y" is input in the template slot 41, the following questions can be constructed in the question slot 42: "Does X verb Y?", "What does the X verb?", and the corresponding answer slot 43 generates the following answers: "X.", "Y.".
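Template filling for the "X is Y" pattern can be sketched with a regular expression that captures the two slots. Only two of the listed question forms are generated here, and the regular expression and the function name `fill_is_template` are assumptions made for illustration.

```python
import re

# Capture the X and Y slots of a sentence of the form "X is Y."
_X_IS_Y = re.compile(r"^(?P<x>.+?) is (?P<y>.+?)\.?$")

def fill_is_template(sentence):
    """Build (question, answer) pairs from a sentence matching "X is Y"."""
    match = _X_IS_Y.match(sentence.strip())
    if not match:
        return []
    x, y = match.group("x"), match.group("y")
    return [
        (f"Is {x} {y}?", "Yes."),
        (f"What is {y}?", f"{x}."),
    ]
```

For "Beijing is the capital of China." this produces "Is Beijing the capital of China?" with the answer "Yes." and "What is the capital of China?" with the answer "Beijing.".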

S314: Rewrite the question-answer pairs using a rewriting method based on a deep generative model to obtain the candidate question-answer pairs.

As can be seen from the question-answer pairs generated in the above steps, questions constructed from predefined question templates suffer from a uniform structure and poor diversity of expression.

In the embodiment of the invention, a question-answer pair rewriting method based on a trained deep generative model is provided: the deep generative model is trained in advance by supervised learning, the trained model is used to rewrite the questions generated in S313, and the rewritten question-answer pairs are output.

Supervised learning is a machine learning mode, is often used in a scene with sufficient data, can learn a function (model parameters) from a given training data set, and can predict a result according to the function when new data comes.

The training requirement of supervised learning comprises input and output, targets in a training set are labeled by people, and a training sample set consists of samples with labels.

Specifically, an optimal model is obtained by training on existing training samples, that is, known data and their corresponding outputs. The optimal model then maps each input to a corresponding output, and a simple decision on that output achieves the purpose of classification.

In the embodiment of the invention, question-answer pairs whose questions are labeled as structurally varied and highly diverse in expression are used as the training samples for obtaining the optimal model.

Deep learning technology performs well in various tasks such as text classification, sequence labeling and machine translation. In embodiments of the present invention, the answers in candidate question-answer pairs may be generated from a user question by a deep generative model.

Specifically, the deep generative model used in the embodiment of the present invention is a sequence-to-sequence (Seq2Seq) deep generative model.

Wherein, the sequence-to-sequence model can translate a sequence in one language into a sequence in another language; the whole process uses a deep neural network to map an input sequence to an output sequence.

Specifically, the deep neural network is LSTM (long short term memory network) or RNN (recurrent neural network).

S13: Score and screen the candidate question-answer pairs to obtain the question-answer pairs with scores higher than a set threshold value.

Specifically, referring to fig. 5, fig. 5 is a schematic flow chart of an embodiment of step 13 in fig. 1, in which the method includes:

S511: Score and classify the candidate question-answer pairs using a scoring method based on a fast text classification model.

In the embodiment of the invention, the fast text classification model is trained in advance in a supervised learning mode, and then the trained classification model is applied to the classification scoring of the candidate question-answer pairs.

Specifically, training data are obtained through the question-answer pair generation method and then reviewed manually: question-answer pairs with the highest matching degree are labeled with output class 1, and question-answer pairs with the lowest matching degree are labeled with output class 0. The question-answer pairs in the training data are used as the model input, the matching-degree score of each pair is used as the model output, and a fastText classification model is trained.

Further, the generated candidate question-answer pairs are classified and scored using the trained fastText classification model, and the scoring results of the candidate question-answer pairs are output.
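Preparing the reviewed pairs for fastText can be sketched as below. fastText's supervised mode reads one example per line, with the label prefixed by `__label__`. Joining the question and the answer into a single text is an assumption, as are the function names. Training itself is not shown; with the `fasttext` Python package it would typically be `fasttext.train_supervised(input=path)`, and the probability that `model.predict` returns for label 1 could serve as the matching score.

```python
def to_fasttext_line(question, answer, label):
    """Format one reviewed pair as a fastText supervised training line.

    label is 1 for a well-matched pair and 0 for a poorly matched one.
    Assumption: the question and answer are concatenated into one text.
    """
    text = f"{question} {answer}".replace("\n", " ").strip()
    return f"__label__{label} {text}"

def write_training_file(pairs, path):
    """Write (question, answer, label) triples, one example per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for question, answer, label in pairs:
            handle.write(to_fasttext_line(question, answer, label) + "\n")
```

For Chinese text, the question and answer would need to be word-segmented and joined with spaces first, since fastText tokenizes on whitespace.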

S512: According to the scoring results, obtain the question-answer pairs with scores higher than the set threshold value using a screening method based on sorting and filtering.

In the embodiment of the invention, the candidate question-answer pairs are sorted from high to low according to the score value, a score threshold value is preset, the candidate question-answer pairs with the score value lower than the set threshold value are filtered, and the final question-answer pairs are screened out to construct the high-quality question-answer pairs.

Optionally, the score threshold is set to 0.8-0.9.

In a specific implementation scenario, if the score threshold is set to 0.9, the candidate question-answer pairs with output scores higher than 0.9 are retained, and the candidate question-answer pairs with output scores lower than 0.9 are deleted.

In another specific implementation scenario, if the score threshold is set to 0.8, the candidate question-answer pairs with output scores higher than 0.8 are retained, and the candidate question-answer pairs with output scores lower than 0.8 are deleted.
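The sorting and filtering step can be sketched in a few lines. Candidates are assumed to be `(question, answer, score)` tuples, and a pair whose score equals the threshold is dropped; the text only covers scores higher or lower than the threshold, so the treatment of equality is an assumption, as is the function name `filter_by_score`.

```python
def filter_by_score(scored_pairs, threshold=0.9):
    """Sort candidates by score, high to low, and keep those above threshold.

    scored_pairs: list of (question, answer, score) tuples.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[2], reverse=True)
    return [item for item in ranked if item[2] > threshold]
```

With scores 0.85, 0.95 and 0.5, a threshold of 0.8 keeps the first two pairs in descending order, while a threshold of 0.9 keeps only the pair scored 0.95.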

Referring to fig. 6, fig. 6 is a schematic diagram of a framework of an embodiment of a question-answer pair construction system according to the present invention, where the question-answer pair construction system includes an input module 61, a preprocessing module 62, a judging module 63, an extracting module 64, a candidate question-answer pair generating module 65, a scoring module 66, a screening module 67, and an output module 68.

An input module 61 for inputting a text document.

After the input module 61 obtains the original text document, the text paragraphs are further segmented.

And the preprocessing module 62 is connected with the input module 61 and is used for preprocessing the segmented text paragraphs.

The preprocessing comprises at least one of abnormal character removal, upper/lower case conversion, simplified/traditional Chinese conversion, sentence segmentation and word segmentation, part-of-speech tagging, and sentence simplification.

And the judging module 63 is connected with the preprocessing module 62 and is used for judging whether the sentences in the preprocessed text paragraphs have potential question-answer knowledge.

In an embodiment of the present invention, the step of determining, by the determining module 63, whether there is potential question-answering knowledge in the sentences in the preprocessed text paragraphs includes: and performing syntactic dependency analysis on the preprocessed sentence, outputting the dependency relationship among words in the sentence through the syntactic dependency analysis, extracting a main stem of the sentence according to the dependency relationship, and finally judging whether potential question-answer knowledge exists in the sentence through the main stem.

And the extracting module 64 is connected to the judging module 63 and is configured to extract the sentences of the identified potential question-answer pairs in the text paragraphs.

The candidate question-answer pair generating module 65 is connected to the extracting module 64, and is configured to sort the sentences of the potential question-answer pairs and generate candidate question-answer pairs.

In the embodiment of the present invention, the step of generating the candidate question-answer pair by the candidate question-answer pair generating module 65 specifically includes: simplifying the sentences of the potential question-answer pairs, carrying out entity recognition on the simplified sentences, extracting the entities of the sentences, constructing question-answer pairs based on the question templates, and finally training a deep generation model through a supervised learning mode to rewrite the question-answer pairs.

And the scoring module 66 is connected with the candidate question-answer pair generating module 65 and is used for training the rapid text classification model to score the candidate question-answer pairs.

And the screening module 67 is connected with the scoring module 66 and is used for screening out question-answer pairs with the scores higher than the set threshold value through the sorting filter.

And the output module 68 is connected with the screening module 67 and is used for outputting question-answer pairs with scores higher than the set threshold value.

Referring to fig. 7, fig. 7 is a schematic structural diagram of an embodiment of a question-answer pair constructing device according to the present invention, which includes a processor 71 and a memory 72.

Processor 71 is configured to execute program instructions stored in memory 72 to implement the steps of the question-answer pair construction method described in any of the above-described method embodiments.

Specifically, the processor 71 is configured to control itself and the memory 72 to implement the specific steps in the question-answer pair construction method described in any one of the above-mentioned method embodiments. The processor 71 may also be referred to as a CPU (Central processing unit). The processor 71 may be an integrated circuit chip having signal processing capabilities. The Processor 71 may also be a general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. In addition, the processor 71 may be commonly implemented by a plurality of integrated circuit chips.

Referring to fig. 8, fig. 8 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.

The computer-readable storage medium 80 includes a computer program 801 stored on the computer-readable storage medium 80, and when executed by the processor, the computer program 801 implements the specific steps in the question-answer pair construction method described in any one of the above method embodiments.

In particular, the integrated unit, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the computer-readable storage medium 80. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in the computer-readable storage medium 80 and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned computer-readable storage medium 80 includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

In the several embodiments provided in the present application, it should be understood that the disclosed method, system, and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division into modules or units is merely a logical division, and an actual implementation may use another division; a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.

Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

The above description is only an embodiment of the present invention and is not intended to limit the scope of the present invention. All equivalent structures or equivalent process transformations made using the contents of the present specification and drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of protection of the present invention.

Claims

1. A question-answer pair construction method is characterized by comprising the following steps:

extracting sentences of potential question-answer pairs in the text paragraphs;

sorting the sentences of the potential question-answer pairs to generate candidate question-answer pairs;

and scoring the candidate question-answer pairs and screening to obtain the question-answer pairs with the scores higher than a set threshold value.

2. The question-answer pair construction method according to claim 1, characterized by comprising, before the step of extracting sentences of potential question-answer pairs in a text passage, the steps of:

extracting text paragraphs from an input original text document, and performing segmentation processing on the text paragraphs by adopting a segmentation method;

and performing text preprocessing on the text paragraphs after the segmentation processing.

3. The question-answer pair construction method according to claim 1, wherein in the step of extracting sentences of potential question-answer pairs in a text passage, the method specifically comprises:

performing syntactic dependency analysis on sentences in the text paragraphs and outputting dependency relationships among words in the sentences;

extracting a main stem of the sentence according to the dependency relationship;

judging whether potential question-answer knowledge exists in the main stem of the sentence;

and when the judgment result is yes, extracting the sentence as a sentence of the potential question-answer pair.

4. The question-answer pair construction method according to claim 1, wherein in the step of sorting the sentences of the potential question-answer pairs to generate candidate question-answer pairs, the method specifically comprises:

simplifying the sentences of the potential question-answer pairs;

performing entity recognition on the simplified sentence;

extracting the entities of the sentence and constructing question-answer pairs;

and rewriting the question-answer pairs by adopting a question-answer pair rewriting method based on a deep generation model to obtain the candidate question-answer pairs.

5. The question-answer pair construction method according to claim 1, wherein in the step of scoring and screening the candidate question-answer pairs to obtain question-answer pairs with scores higher than a set threshold, the method specifically comprises:

scoring the candidate question-answer pairs by adopting a scoring method based on a rapid text classification model;

and obtaining the question-answer pairs with the scores higher than a set threshold value by adopting a screening method based on sorting filtering according to the scoring results of the candidate question-answer pairs.

6. A question-answer pair construction system, comprising:

the extraction module is used for extracting sentences of potential question-answer pairs in the text paragraphs;

the candidate question-answer pair generating module is used for sorting the sentences of the potential question-answer pairs and generating candidate question-answer pairs;

the scoring module is used for training a rapid text classification model to score the candidate question-answer pairs;

and the screening module is used for screening out, through sorting and filtering, question-answer pairs with the scores higher than a set threshold value.

7. The question-answer pair construction system according to claim 6, characterized by further comprising:

the input module is used for inputting a text document;

and the preprocessing module is used for preprocessing the text paragraphs.

8. The question-answer pair construction system according to claim 7, characterized in that the question-answer pair construction system further comprises:

the judging module is used for judging whether the sentences in the text paragraphs have potential question-answer knowledge or not;

and the output module is used for outputting the question-answer pairs with the scores higher than the set threshold value.

9. A question-answer pair construction apparatus comprising:

a memory for storing program data which, when executed, implements the steps in the question-answer pair construction method according to any one of claims 1 to 5;

a processor for executing the program instructions stored by the memory to implement the steps in the question-answer pair construction method according to any one of claims 1 to 5.

10. A computer-readable storage medium, having a computer program stored thereon, which, when being executed by a processor, implements the steps in the question-answer pair construction method according to any one of claims 1 to 5.