CN116467412A - Knowledge graph-based question and answer method, system and storage medium

Info

Publication number: CN116467412A
Application number: CN202310269794.0A
Authority: CN
Inventors: 崇庆魏; 刘沁园; 蔡炎松; 窦辰晓
Original assignee: Nanhu Research Institute Of Electronic Technology Of China
Current assignee: Nanhu Research Institute Of Electronic Technology Of China
Priority date: 2023-03-16
Filing date: 2023-03-16
Publication date: 2023-07-21

Abstract

The invention discloses a knowledge graph-based question answering method, system and storage medium. The method comprises the following steps: performing entity recognition on the input question to form an entity set; performing a first ranking and screening based on the entity set and the knowledge base to obtain a candidate set containing a plurality of triples; extracting question features, and performing a second ranking and screening based on the question, the candidate set and the question features to obtain a triplet serving as the initial answer to the question; and, based on the question and the triplet serving as the initial answer, outputting a declarative answer as the final answer, which ends the question answering. The invention improves generality, reduces labor cost and improves the user experience.

Description

Knowledge graph-based question and answer method, system and storage medium

Technical Field

The invention belongs to the technical field of knowledge-based question answering, and particularly relates to a knowledge graph-based question answering method, system and storage medium.

Background

A question answering system is a system into which a questioner inputs a question and which returns an answer after internal processing. According to the corpus from which answers are drawn, question answering systems fall into two main types: knowledge base-based and text-based. Knowledge base-based systems answer questions by using prior knowledge such as a knowledge graph; text-based systems find the answer within a given batch of documents or paragraphs, as in Facebook's DrQA.

In recent years, with the development of knowledge graph technology, a number of large-scale general knowledge graphs, such as OpenKG, and open-source domain knowledge graphs in fields such as finance and medicine have appeared. For the question answering task, when a knowledge graph exists for the domain, structured question answering based on the knowledge graph offers high accuracy, precise language and low resource consumption. With the advent of Encoder-Decoder architectures such as the Transformer, text generation has developed rapidly as a branch of natural language processing, and a variety of strong text generation models such as T5 and GPT have emerged, making it possible to generate unstructured natural language from structured text.

Currently, the main mode of knowledge-based question answering is "search-match". For example, Chinese patent application No. CN202111387891.7 discloses a method for constructing a citrus control question-answering module based on a knowledge graph, together with a question-answering system, in which the data set of the question-answering module has three columns: the first column is the question, the second is a database query statement, and the third is the corresponding answer in the knowledge base. Because that data set relies on knowledge graph query statements, migrating it to another field requires the query statements to be rebuilt; the process cannot be turned directly into a pipeline, and a large amount of manual annotation is needed.

It can be seen that the main idea of knowledge graph-based question answering in the prior art is to traverse all paths and use a ranking algorithm to find a query that matches the query statement, thereby obtaining the corresponding answer; however, exhaustively enumerating the combinations of entities and predicates is costly. Moreover, the answer returned is usually the raw result from the knowledge graph, mostly in the [subject, relation, object] triple format, which is structured text and lacks the fluency of natural language.

Disclosure of Invention

The invention aims to provide a knowledge graph-based question answering method, system and storage medium that improve generality, reduce labor cost and improve the user experience.

In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:

A knowledge graph-based question answering method, the knowledge graph being mapped to a knowledge base composed of triples, the method comprising:

performing entity recognition on the input question to form an entity set;

performing a first ranking and screening based on the entity set and the knowledge base to obtain a candidate set containing a plurality of triples;

extracting question features, and performing a second ranking and screening based on the question, the candidate set and the question features to obtain a triplet serving as the initial answer to the question;

based on the question and the triplet serving as the initial answer, outputting a declarative answer as the final answer, and ending the question answering.

The following provides several alternatives. They are not additional limitations on the overall scheme above, only further additions or preferences; provided there is no technical or logical contradiction, each alternative may be combined individually with the overall scheme, or several alternatives may be combined with one another.

Preferably, the performing a first ranking and screening based on the entity set and the knowledge base to obtain a candidate set containing a plurality of triples includes:

performing query matching of the entity set against the knowledge base to obtain a query result;

inputting the question and the query result into an entity linking model, outputting a relevance score between the question and each triple in the query result, and taking the several triples with the highest relevance scores as the candidate set.

Preferably, the extracting question features includes: performing equal-frequency discretization on the relevance scores of the candidate set to obtain a candidate feature set.

Preferably, the performing a second ranking and screening based on the question, the candidate set and the question features to obtain a triplet serving as the initial answer to the question includes:

forming a plurality of "question-triplet" data pairs from the question and the triples in the candidate set;

extracting the features of each "question-triplet" data pair as a combined feature;

splicing each combined feature with the question features to obtain a spliced feature;

ranking the triples in the candidate set by matching degree based on the spliced features, and taking the triplet with the highest matching degree with the question as the initial answer to the question.

Preferably, the extracting the features of each "question-triplet" data pair as a combined feature includes: taking the "question-triplet" data pair as the input of a BERT model, and taking the final [CLS] feature of the BERT model as the combined feature.

Preferably, the ranking the triples in the candidate set by matching degree based on the spliced features includes:

inputting the spliced feature into a feedforward neural network to obtain the output of the feedforward neural network;

inputting the output of the feedforward neural network into a classifier, and taking the probability score output by the classifier as the matching degree between the current triplet and the question;

ranking the triples in the candidate set according to their matching degrees.

Preferably, the outputting a declarative answer based on the question and the triplet serving as the initial answer includes:

selecting a predefined conversion template according to the question;

outputting the triplet as a declarative sentence based on the conversion template, as the declarative answer.

Preferably, the outputting a declarative answer based on the question and the triplet serving as the initial answer includes:

combining the question and the triplet serving as the initial answer to form a "question-answer" data pair;

inputting the "question-answer" data pair into a text generation model, and taking the output of the text generation model as the declarative answer.

A knowledge graph-based question answering system, the knowledge graph being mapped to a knowledge base composed of triples, the system comprising:

an entity analysis module, configured to perform entity recognition on the input question and form an entity set;

a first-order ranking module, configured to perform a first ranking and screening based on the entity set and the knowledge base to obtain a candidate set containing a plurality of triples;

a second-order ranking module, configured to extract question features and perform a second ranking and screening based on the question, the candidate set and the question features to obtain a triplet serving as the initial answer to the question;

a text generation module, configured to output a declarative answer as the final answer based on the question and the triplet serving as the initial answer, and to end the question answering.

A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the knowledge graph-based question answering method.

The knowledge graph-based question answering method, system and storage medium provided by the invention complete question answering using only the information in the knowledge graph, without any additional query data, and are therefore general across fields. The whole question answering system is organized as a pipeline, and the original method of traversing combinations of entities and predicates is replaced by entity linking, which reduces the complexity of the process and saves annotation labor. The answer is post-processed so that the structured answer is replaced by a declarative sentence, giving a better user experience; ultimately, the process from the user posing a question to the generation of a fluent natural language answer is realized end to end.

Drawings

FIG. 1 is a flow chart of the knowledge graph-based question answering method of the present invention;

FIG. 2 is a schematic diagram of the data flow of the knowledge graph-based question answering method of the present invention;

FIG. 3 is a flow chart of the second ranking and screening of the present invention;

FIG. 4 is a flow chart of generating a declarative answer in the present invention;

FIG. 5 is a schematic diagram of some of the prediction results and labels in the experiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.

Example 1

This embodiment provides a knowledge graph-based question answering method that outputs a declarative sentence as the answer to an input question. A knowledge graph consists of entities and the relations between them and is expressed as a graph. The knowledge graph organizes data as triples: each node-edge-node can be seen as one record, with the first node as the subject, the edge as the predicate, and the second node as the object.

Based on the nodes and edges of the knowledge graph, subjects are extracted as subjects, predicates as relations, and objects as objects, so that the knowledge graph is mapped to a knowledge base composed of [subject, relation, object] triples. The method is therefore not tied to the database in which the knowledge graph is stored.
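The mapping described above can be sketched as follows. This is a minimal illustration only: the `Triple` type, the `graph_to_knowledge_base` function and the toy edge list are hypothetical names introduced here, not part of the disclosed method.

```python
from collections import namedtuple

# Hypothetical record type for one [subject, relation, object] triple.
Triple = namedtuple("Triple", ["subject", "relation", "object"])

def graph_to_knowledge_base(edges):
    """Map (head node, edge label, tail node) records to a list of triples."""
    return [Triple(head, label, tail) for head, label, tail in edges]

# Toy graph: each record is one node-edge-node path of length one.
edges = [
    ("Hangzhou", "located_in", "Zhejiang"),
    ("Zhejiang", "capital", "Hangzhou"),
    ("West Lake", "located_in", "Hangzhou"),
]
knowledge_base = graph_to_knowledge_base(edges)
```

Because the result is a plain list of triples, the later steps do not depend on which graph database originally stored the edges.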

The question answering method proposed in this embodiment may serve the general domain or a specific professional domain (e.g., finance or medicine); the domain of application is determined by the domain to which the knowledge graph belongs.

As shown in fig. 1, the knowledge-graph-based question-answering method of the present embodiment includes the following steps:

Step 1: perform entity recognition on the input question to form an entity set.

First, named entity recognition is performed on the question to obtain one or more entities, which together form the entity set.

This embodiment preferably uses an NER model for entity recognition; in other embodiments, other entity recognition models, such as LSTM-CRF, may be substituted.

Step 2: perform a first ranking and screening based on the entity set and the knowledge base to obtain a candidate set containing a plurality of triples.

As shown in FIG. 2, because the knowledge base contains subjects with the same name, the entity set extracted in step 1 must be queried and matched against the knowledge base. During query matching, each entity in the entity set is matched one by one against the subjects and objects in the knowledge base to obtain the query result.
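The query matching step can be sketched as below. The sketch assumes exact string matching between entities and the subject or object of each triple; the text does not specify the matching rule, and `query_by_entities` is a hypothetical helper name.

```python
def query_by_entities(entity_set, knowledge_base):
    """Return every triple whose subject or object equals a recognized entity."""
    results = []
    for triple in knowledge_base:
        subject, _relation, obj = triple
        # Assumption: exact match; a real system might normalize or use aliases.
        if subject in entity_set or obj in entity_set:
            results.append(triple)
    return results

knowledge_base = [
    ("Hangzhou", "located_in", "Zhejiang"),
    ("Zhejiang", "capital", "Hangzhou"),
    ("Ningbo", "located_in", "Zhejiang"),
    ("Li Bai", "occupation", "poet"),
]
query_result = query_by_entities({"Hangzhou"}, knowledge_base)
```

Here the entity "Hangzhou" retrieves both the triple in which it is the subject and the triple in which it is the object, while unrelated triples are left out.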

The question and the query result are then input into an entity linking model for ranking. The entity linking model outputs a relevance score between the question and each triple contained in the query result; the triples in the query result are sorted from high to low by relevance score, and the several triples with the highest relevance scores are taken as the candidate set, that is, all relations and objects attached to the top N subjects with the highest relevance scores form the candidate set.
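The selection of the highest-scoring triples can be sketched as follows. The scoring function here is a simple word-overlap stand-in and is not the entity linking model of this embodiment; `overlap_score`, `first_ranking` and `top_n` are hypothetical names, and the sketch ranks individual triples rather than grouping them by subject.

```python
def overlap_score(question, triple):
    """Stand-in relevance score: share of the triple's words found in the question."""
    question_words = set(question.lower().split())
    triple_words = " ".join(triple).lower().replace("_", " ").split()
    hits = sum(1 for word in triple_words if word in question_words)
    return hits / len(triple_words)

def first_ranking(question, query_result, score_fn, top_n):
    """Score every triple, sort from high to low, and keep the top N."""
    scored = [(score_fn(question, triple), triple) for triple in query_result]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

question = "what is the capital of zhejiang"
query_result = [
    ("Hangzhou", "located_in", "Zhejiang"),
    ("Zhejiang", "capital", "Hangzhou"),
    ("Ningbo", "located_in", "Zhejiang"),
]
candidate_set = first_ranking(question, query_result, overlap_score, top_n=2)
```

Passing the scorer in as `score_fn` keeps the selection logic unchanged when the stand-in is swapped for a trained model that returns one relevance score per triple.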

This embodiment uses only the question and the knowledge graph, not knowledge graph query statements, so the method is general across fields and the whole question answering system can be run as a pipeline. In addition, triples are obtained by introducing the entity linking model in place of "entity-predicate" queries, so the question can be linked to the knowledge graph, a large number of irrelevant subjects can be filtered out, and subjects strongly related to the question are returned.

Step 3: extract question features, and perform a second ranking and screening based on the question, the candidate set and the question features to obtain a triplet serving as the initial answer to the question.

When the question features are extracted, equal-frequency discretization is performed on the relevance scores of the candidate set, and the resulting candidate feature set is used as the question features. In this embodiment, equal-frequency discretization of the relevance scores of the top-N subjects yields a strengthened feature and adds a further channel to the ranking.
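Equal-frequency discretization assigns scores to bins so that each bin holds roughly the same number of scores, regardless of how the score values are spaced. A minimal sketch follows; the function name `equal_frequency_bins`, the number of bins and the sample scores are assumptions made for illustration.

```python
def equal_frequency_bins(scores, num_bins):
    """Assign each score a bin id so that bins hold (nearly) equal numbers of scores."""
    # Indices of the scores, ordered from lowest to highest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bin_ids = [0] * len(scores)
    for rank, index in enumerate(order):
        # The rank, not the score value, decides the bin.
        bin_ids[index] = rank * num_bins // len(scores)
    return bin_ids

scores = [0.91, 0.15, 0.42, 0.88, 0.07, 0.63]
bin_ids = equal_frequency_bins(scores, num_bins=3)
print(bin_ids)  # [2, 0, 1, 2, 0, 1]
```

With six scores and three bins, each bin receives exactly two scores. Note that this rank-based variant may place tied scores in different bins; a production implementation would need a rule for ties.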

In other embodiments, as questions become more complex, the question features are not limited to the candidate feature set; they may also include features of the question itself, such as its length, and features of the entities found by named entity recognition, such as entity attributes and parts of speech.

Introducing the question and the question features into the second ranking and screening makes the final answer more targeted. Since the candidate set is structured as a plurality of [subject, relation, object] triples, as shown in FIG. 3, this embodiment first pairs the question with each triple in the candidate set to form a plurality of "question-triplet" data pairs, and then extracts a combined feature from each "question-triplet" data pair.

The combined feature jointly expresses the question and the relation between the question and each triplet. This embodiment takes the "question-triplet" data pair as the input of the BERT model and takes the final [CLS] feature of the BERT model as the combined feature.
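The construction of the "question-triplet" data pairs can be sketched as below. The sketch stops at the text level and does not call a BERT model; how the triple is flattened into text (here, joined by spaces) is an assumption, and both function names are hypothetical. The `[CLS]` and `[SEP]` markers follow the usual two-segment input convention of BERT-style models, where a tokenizer would normally insert them.

```python
def build_question_triplet_pairs(question, candidate_set):
    """Pair the question with each candidate triple, flattened to one text segment."""
    return [(question, " ".join(triple)) for triple in candidate_set]

def as_bert_style_text(pair):
    """Show the conventional two-segment layout: [CLS] question [SEP] triple [SEP]."""
    question, triple_text = pair
    return f"[CLS] {question} [SEP] {triple_text} [SEP]"

question = "what is the capital of zhejiang"
candidate_set = [
    ("Zhejiang", "capital", "Hangzhou"),
    ("Hangzhou", "located_in", "Zhejiang"),
]
pairs = build_question_triplet_pairs(question, candidate_set)
first_input = as_bert_style_text(pairs[0])
```

The encoder output at the `[CLS]` position of each such input would then serve as the combined feature for that pair.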

This embodiment preferably uses the BERT model but is not limited to it; BERT variants such as ALBERT, DeBERTa and RoBERTa may be used instead.

Prediction accuracy often increases with feature dimension, so in this embodiment each combined feature is spliced with the question features to obtain a spliced feature; the triples in the candidate set are ranked by matching degree based on the spliced features, and the triplet with the highest matching degree with the question is taken as the initial answer to the question.

When the triples in the candidate set are ranked by matching degree, the spliced feature is input into a feedforward neural network to obtain the output of the feedforward neural network; that output is then input into a classifier, and the probability score output by the classifier is taken as the matching degree between the current triplet and the question; finally, the triples in the candidate set are ranked according to their matching degrees. The first triplet in the ranked list is therefore the one with the highest matching degree with the question.

The question features used in this embodiment include the candidate feature set obtained by discretizing the relevance scores. Each candidate triplet in the candidate set has a relevance score; the id of its discretized relevance score is mapped to a high-dimensional vector, which is spliced with the [CLS] feature vector of the BERT model. Several MLP layers followed by a sigmoid then output a score that serves as the basis for the matching degree ranking, yielding the triplet that best matches the question. In addition, the feature module is designed to be pluggable, which makes it easy to extend the method with other new question features such as attributes and parts of speech.
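The splicing and scoring described above can be sketched in plain Python as follows. All dimensions, the single hidden layer, the random untrained parameters and the function names are assumptions for illustration; in practice the [CLS] vectors would come from the BERT model and the parameters would be learned.

```python
import math
import random

def dense(vector, weights, bias):
    """One fully connected layer: weights has one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_params(cls_dim, embed_dim, hidden_dim, num_bins, seed=0):
    """Random, untrained parameters; a real model would learn these."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "bin_embeddings": matrix(num_bins, embed_dim),  # bin id -> vector
        "w1": matrix(hidden_dim, cls_dim + embed_dim),
        "b1": [0.0] * hidden_dim,
        "w2": matrix(1, hidden_dim),
        "b2": [0.0],
    }

def matching_degree(cls_feature, bin_id, params):
    """Splice the [CLS] vector with the bin embedding, then MLP + sigmoid."""
    spliced = list(cls_feature) + params["bin_embeddings"][bin_id]
    hidden = [max(0.0, h) for h in dense(spliced, params["w1"], params["b1"])]
    logit = dense(hidden, params["w2"], params["b2"])[0]
    return sigmoid(logit)

def rank_candidates(candidates, params):
    """candidates: list of (triple, cls_feature, bin_id); best match first."""
    scored = [(matching_degree(cls, bin_id, params), triple)
              for triple, cls, bin_id in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

params = init_params(cls_dim=4, embed_dim=2, hidden_dim=3, num_bins=3)
candidates = [
    (("Zhejiang", "capital", "Hangzhou"), [0.2, -0.1, 0.7, 0.4], 2),
    (("Hangzhou", "located_in", "Zhejiang"), [0.5, 0.3, -0.2, 0.1], 0),
]
ranking = rank_candidates(candidates, params)
initial_answer = ranking[0][1]
```

Because the parameters are random, the order produced here carries no meaning; the sketch only shows how the discretized score id and the [CLS] vector enter one scoring function whose sigmoid output is used for sorting.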

Step 4: based on the question and the triplet serving as the initial answer, output a declarative answer as the final answer, and end the question answering.

Because triples are structured data and lack the fluency of natural language, this embodiment converts the triplet into a declarative answer for output, which improves the user experience.

When converting structured data into unstructured data, the final declarative answer can be obtained in several ways from the question and the triplet result. Conversion templates are designed for questions according to prior experience; after the triplet with the highest matching degree has been obtained, a predefined conversion template is selected according to the question, and the triplet is output as a declarative sentence based on that template, serving as the declarative answer. Template-based conversion of structured data into unstructured data can be implemented with the prior art, for example the method disclosed in Chinese patent application No. CN201811244279.2, and is not described in detail in this embodiment.
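Template-based conversion can be sketched as follows. The templates and the function `render_with_template` are hypothetical, and the sketch selects a template by the relation of the triple as a simple proxy for the question type; the embodiment itself selects the template according to the question and does not specify the selection rule.

```python
# Hypothetical conversion templates, keyed by relation.
TEMPLATES = {
    "capital": "The capital of {subject} is {object}.",
    "located_in": "{subject} is located in {object}.",
}

def render_with_template(triple, templates):
    """Return a declarative sentence, or None when no template applies."""
    subject, relation, obj = triple
    template = templates.get(relation)
    if template is None:
        return None  # caller may fall back to a text generation model
    return template.format(subject=subject, object=obj)

answer = render_with_template(("Zhejiang", "capital", "Hangzhou"), TEMPLATES)
missing = render_with_template(("Li Bai", "occupation", "poet"), TEMPLATES)
print(answer)  # The capital of Zhejiang is Hangzhou.
```

Returning `None` when no template matches gives the caller an explicit signal that another conversion path is needed.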

In addition, as shown in FIG. 4, when a declarative answer cannot be constructed through a template, it can be produced by a text generation model: the question and the triplet serving as the initial answer are combined into a "question-answer" data pair, the "question-answer" data pair is input to the text generation model, and the declarative text output by the model is taken as the declarative answer. To avoid repeating information, only the object of the triplet may be extracted when constructing the "question-answer" data pair, so that the object and the question together form the pair.

The text generation model is preferably the mT5 model but is not limited to it; other pre-trained text generation models such as T5, GPT, BART and JointGT may be used instead.

When training the text generation model, a corpus of declarative answers is first constructed; the object of each triplet is extracted and combined with the question to form a "question-answer" data pair; the mT5 model is then trained with the "question-answer" data pairs as input and the declarative answer corpus as labels. At prediction time, the final declarative answer is obtained simply by feeding in a "question-answer" data pair, with no need to construct any further declarative answer corpus.
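The construction of "question-answer" data pairs and training examples can be sketched as below. The separator string, the dictionary keys `input` and `label`, and both function names are assumptions for illustration; the sketch prepares the text only and does not train or call mT5.

```python
def build_question_answer_pair(question, triple, separator=" [SEP] "):
    """Combine the question with only the object of the triple."""
    _subject, _relation, obj = triple
    # Assumption: question and object are joined into one input string.
    return f"{question}{separator}{obj}"

def build_training_examples(records):
    """records: iterable of (question, triple, declarative_answer)."""
    return [
        {"input": build_question_answer_pair(question, triple), "label": answer}
        for question, triple, answer in records
    ]

records = [
    ("What is the capital of Zhejiang?",
     ("Zhejiang", "capital", "Hangzhou"),
     "The capital of Zhejiang is Hangzhou."),
]
examples = build_training_examples(records)
```

At prediction time only `build_question_answer_pair` would be needed, since the label is what the model is expected to produce.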

By introducing a text generation model and combining it with custom templates, this embodiment processes the best triplet flexibly and generates a sentence-style declarative description, ensuring that the answer reads as fluent natural language.

After a question is received, named entity recognition, entity linking and the ranking model are run automatically, the best triplet is selected, and a sentence-style declarative answer is generated directly from the triplet. The result is end to end: a question is posed, and the declarative answer is then displayed on the screen.
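The end-to-end flow of steps 1 to 4 can be sketched as a chain of four interchangeable components. Every component below is a toy stand-in written for this illustration (substring matching instead of an NER model, a relation-keyword check instead of the ranking model, a fixed sentence pattern instead of templates or a text generation model); only the order of the calls reflects the method.

```python
KNOWLEDGE_BASE = [
    ("Zhejiang", "capital", "Hangzhou"),
    ("Hangzhou", "located_in", "Zhejiang"),
]
KNOWN_ENTITIES = ("Zhejiang", "Hangzhou")

def recognize_entities(question):
    """Toy stand-in for step 1: substring lookup of known entity names."""
    return {name for name in KNOWN_ENTITIES if name.lower() in question.lower()}

def first_ranking(question, entity_set):
    """Toy stand-in for step 2: keep triples that mention a recognized entity."""
    return [t for t in KNOWLEDGE_BASE if t[0] in entity_set or t[2] in entity_set]

def second_ranking(question, candidate_set):
    """Toy stand-in for step 3: prefer the triple whose relation appears in the question."""
    return max(candidate_set, key=lambda t: t[1] in question.lower())

def to_declarative(question, triple):
    """Toy stand-in for step 4: one fixed sentence pattern."""
    subject, relation, obj = triple
    return f"The {relation} of {subject} is {obj}."

def answer_question(question):
    entity_set = recognize_entities(question)               # step 1
    candidate_set = first_ranking(question, entity_set)     # step 2
    best_triple = second_ranking(question, candidate_set)   # step 3
    return to_declarative(question, best_triple)            # step 4

final_answer = answer_question("What is the capital of Zhejiang?")
print(final_answer)  # The capital of Zhejiang is Hangzhou.
```

Each stand-in can be replaced by the corresponding trained model without changing `answer_question`, which mirrors the modular structure of the method.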

The effectiveness of each step in the method of this example is described below through experiments.

1) Steps 1 to 3 as a whole are taken as the ranking model: the entity recognition model of step 1 is an NER model, the BERT model is used to extract the combined features in step 2, and a Rank model is used for the feedforward neural network in step 3. The OpenKG general knowledge graph is used as the knowledge graph in the experiment.

The self-built data set comprises manually paired question and triplet pairs; the triples in these pairs are used as labels to verify the triples output by the ranking model. The verification results are shown in Table 1.

Table 1 Ranking model verification results

F1	Recall	AUC
0.765	0.837	0.903

In Table 1, F1 jointly reflects the precision and recall of the ranking model; Recall is the recall rate of the ranking model; and AUC is the area under the ROC curve, which describes the model's ranking ability: the higher the score, the stronger the ranking ability.
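For reference, the three metrics can be computed as in the sketch below. The labels, scores and the 0.5 decision threshold are made-up illustration data and are unrelated to the experiment reported in Table 1; AUC is computed here by its pairwise definition, the fraction of positive-negative pairs in which the positive example receives the higher score.

```python
def precision_recall_f1(labels, predictions):
    """Binary precision, recall and F1 from 0/1 labels and 0/1 predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def roc_auc(labels, scores):
    """AUC as the share of positive-negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.7, 0.8, 0.2, 0.3]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5
precision, recall, f1 = precision_recall_f1(labels, predictions)
auc = roc_auc(labels, scores)
```

F1 and Recall depend on the chosen threshold, whereas AUC depends only on the order of the scores, which is why it is read as a measure of ranking ability.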

As Table 1 shows, the AUC of the ranking model on the self-built data set exceeds 0.9, indicating strong ranking ability and supporting the accuracy of answer screening.

2) Step 4 is taken as the generation model, and the conversion mode based on the text generation model is verified, with mT5 as the text generation model. The OpenKG general knowledge graph is used as the knowledge graph in the experiment.

The self-built data set comprises manually paired "question-answer" data pairs and a declarative answer corpus; the declarative answer corpus is used as labels to verify the text output by the generation model. The verification results are shown in Table 2.

Table 2 Generation model verification results

ROUGE-1	ROUGE-2	ROUGE-L	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR
100	100	100	0.96	0.95	0.94	0.93	0.963

In Table 2, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is an N-gram-based metric describing the recall of the generation model; BLEU (Bilingual Evaluation Understudy) is an N-gram-based metric describing its precision; and METEOR (Metric for Evaluation of Translation with Explicit ORdering) jointly describes precision and recall. For all three metrics, a higher score indicates better generated text.
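The precision and recall views of N-gram overlap can be illustrated for unigrams as below. This is a simplified sketch: `unigram_precision` is only the clipped unigram precision at the core of BLEU-1 (the brevity penalty is omitted), `rouge1_recall` is the recall form of ROUGE-1, whitespace tokenization is assumed, and the sentences are made-up examples unrelated to the experiment in Table 2.

```python
from collections import Counter

def clipped_overlap(candidate_tokens, reference_tokens):
    """Count shared unigrams, clipping each token at its count in the reference."""
    candidate_counts = Counter(candidate_tokens)
    reference_counts = Counter(reference_tokens)
    return sum(min(count, reference_counts[token])
               for token, count in candidate_counts.items())

def unigram_precision(candidate, reference):
    """Share of candidate tokens that also occur in the reference."""
    cand, ref = candidate.split(), reference.split()
    return clipped_overlap(cand, ref) / len(cand)

def rouge1_recall(candidate, reference):
    """Share of reference tokens that are recovered by the candidate."""
    cand, ref = candidate.split(), reference.split()
    return clipped_overlap(cand, ref) / len(ref)

reference = "the capital of zhejiang is hangzhou"
candidate = "the capital of zhejiang province is hangzhou"
precision = unigram_precision(candidate, reference)
recall = rouge1_recall(candidate, reference)
```

The extra word in the candidate lowers precision to 6/7 while recall stays at 1.0, which shows why the two families of metrics are reported side by side.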

As Table 2 shows, the key metric BLEU-1 of the generation model reaches 0.96 on the self-built data set, demonstrating that the generation model of this application has strong text generation ability. As shown in FIG. 5, the predictions output by the generation model overlap closely with the labels in the self-built data set.

Example 2

For the specific limitations of the knowledge graph-based question answering system, reference may be made to the limitations of the knowledge graph-based question answering method above, which are not repeated here. This embodiment uses the ranking modules to screen answers in the knowledge graph and combines them with the text generation module to form a complete end-to-end question answering system.

Example 3

A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of a knowledge-graph based question-answering method.

Those skilled in the art will appreciate that implementing all or part of the above described embodiment methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the above described embodiment methods.

Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.

The above examples merely represent a few embodiments of the present invention, which are described in more detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of the invention should be assessed as that of the appended claims.

Claims

1. A knowledge graph-based question answering method, wherein the knowledge graph is mapped to a knowledge base composed of triples, the knowledge graph-based question answering method comprising:

performing entity recognition on the input question to form an entity set;

2. The knowledge graph-based question answering method according to claim 1, wherein the performing a first ranking and screening based on the entity set and the knowledge base to obtain a candidate set containing a plurality of triples comprises:

3. The knowledge graph-based question answering method according to claim 2, wherein the extracting question features comprises: performing equal-frequency discretization on the relevance scores of the candidate set to obtain a candidate feature set.

4. The knowledge graph-based question answering method according to claim 1, wherein the performing a second ranking and screening based on the question, the candidate set and the question features to obtain a triplet serving as the initial answer to the question comprises:

5. The knowledge graph-based question answering method according to claim 4, wherein the extracting the features of each "question-triplet" data pair as a combined feature comprises: taking the "question-triplet" data pair as the input of a BERT model, and taking the final [CLS] feature of the BERT model as the combined feature.

6. The knowledge graph-based question answering method according to claim 4, wherein the ranking the triples in the candidate set by matching degree based on the spliced features comprises:

7. The knowledge graph-based question answering method according to claim 1, wherein the outputting a declarative answer based on the question and the triplet serving as the initial answer comprises:

selecting a predefined conversion template according to the question;

8. The knowledge graph-based question answering method according to claim 1, wherein the outputting a declarative answer based on the question and the triplet serving as the initial answer comprises:

9. A knowledge-graph-based question-answering system, wherein the knowledge graph is mapped to a knowledge base composed of triples, the knowledge-graph-based question-answering system comprising:

10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 8.