CN111581326A

CN111581326A - Method for extracting answer information based on heterogeneous external knowledge source graph structure

Info

Publication number: CN111581326A
Application number: CN202010238159.2A
Authority: CN
Inventors: 虎嵩林; 吕尚文; 朱福庆; 周薇; 韩冀中
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2020-03-30
Filing date: 2020-03-30
Publication date: 2020-08-25
Anticipated expiration: 2040-03-30
Also published as: CN111581326B

Abstract

The invention provides a method for extracting answer information based on a heterogeneous external knowledge source graph structure. It belongs to the field of natural language processing and is used to improve the quality of the answers returned by a question-answering system. The method re-ranks the answers by how well they match the question and shows the answers the user most likely expects at the top, which makes search results more relevant and lets the user obtain the desired answer in less query time.

Description

Method for extracting answer information based on heterogeneous external knowledge source graph structure

Technical Field

The invention belongs to the field of natural language processing, and provides a method for extracting answer information based on a heterogeneous external knowledge source graph structure.

Background

A question-answering system aims to understand the query entered by a user, retrieve answers to it from a massive background text database, and return them to the user. For example, when a specific query is entered into a conventional search engine such as Baidu, the engine retrieves the corresponding answers from its background text database, giving users an entry point for retrieving knowledge and exploring the Internet. Question-answering systems also have important application value in specific domains such as banking question answering and e-commerce question answering. In general, a question-answering system retrieves a batch of answers relevant to the received query from the background text data, ranks them, and returns them, so that the answers most relevant to the query appear at the front of the results and the user's expectations are met more quickly. For example, a user of the Baidu search engine expects the answers of interest to appear on the first pages and in the top positions.

In a question-answering system, results are mainly returned using either traditional hand-crafted feature methods such as TF-IDF and BM25 or deep learning matching models, with better-matching answers placed nearer the front. The returned results play a crucial role in the final ranking of the answers.

In deep learning, the text matching and retrieval techniques applied to question-answering systems are maturing steadily, and models that predict results from the probability distribution of words in a text have achieved important results on many tasks. In recent years, pre-trained language models such as BERT and XLNet have achieved strong results on a variety of natural language processing tasks, on some of which they even outperform humans. This is mainly due to their strong prior knowledge and representation learning ability.

However, most existing matching models learn text representations from the probability distribution of the text, and they struggle with problems such as sparsity and commonsense reasoning. The main reason is that a commonsense question requires not only the information in the given text but also valuable information selected from everyday experience to provide a basis for reasoning. This poses new challenges for existing text matching and retrieval models.

Disclosure of Invention

In view of the problems and shortcomings of the prior art, and in order to improve the quality of the answers returned by a question-answering system, the invention provides a method for extracting answer information based on a heterogeneous external knowledge source graph structure. The method combines two heterogeneous knowledge sources, structured knowledge and unstructured knowledge. After the corresponding knowledge is obtained, it is structured into graphs, and representations of the graphs are learned with a graph convolutional network. Finally, the textual information and the graph information are combined to return, from the background text data, the answer that correctly answers the user's question. The method re-ranks the answers by how well they match the question and shows the answers the user most likely expects at the top, which makes search results more relevant and lets the user obtain the desired answer in less query time.

To solve the above problems, the invention adopts the following technical solution:

A method for extracting answer information based on a heterogeneous external knowledge source graph structure, comprising the following steps:

(1) according to the question in the user's query, extracting related path knowledge leading from the question to the answer from a structured knowledge base (such as ConceptNet or WordNet), and extracting related sentences similar in wording to the question and the answer from an unstructured knowledge base (such as English Wikipedia);

(2) concatenating the extracted related path knowledge, the related sentences, and the user's query input, and feeding the result into a pre-trained language model to obtain the overall semantic representation <cls>; <cls> plays an important role in many pre-trained language models, such as BERT, XLNet, and RoBERTa, where it is a vector representation of the entire input; for example, in a classification task, the <cls> vector can serve as the sentence representation used for the classification output; the method passes <cls> through one linear layer to represent the degree of match between the query and a returned result;

(3) building one graph from the extracted related path knowledge (for example, from ConceptNet) and another from the extracted sentences (for example, from English Wikipedia), so as to make use of the structured knowledge;

(4) performing representation learning on the two graphs with a graph convolutional network to obtain a vector representation of each node; because the connections between neighboring nodes in the constructed graphs provide additional information for the semantic representation of each node, the method uses graph network representation learning to exploit the structural information in the graphs;

(5) matching <cls> against each node in the graphs by similarity to obtain an attention weight for each node, and computing the final representation (the final matching vector) as the sum of the node vectors weighted by these attention weights;

(6) scoring the final representation obtained in step (5) for relevance through one or more linear transformation layers to obtain a relevance score for every answer, and sorting the answers from the highest score to the lowest; the higher the score, the better the answer matches the user's query, the more relevant it is, and the nearer the front it is placed, so that the user obtains the desired answer more easily.
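Steps (5) and (6) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the source only says that similarity matching yields attention weights, so the dot-product similarity, the softmax normalization, the single linear layer, and all names (`attention_pool`, `linear_score`, `rank_answers`, `w`, `b`) are assumptions. In a real system the vectors and the layer parameters would be learned.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    # Subtract the maximum for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(cls_vec, node_vecs):
    """Weight each node by its similarity to cls_vec; return (weighted sum, weights)."""
    # Assumption: similarity is a plain dot product.
    weights = softmax([dot(cls_vec, n) for n in node_vecs])
    dim = len(node_vecs[0])
    pooled = [sum(w * n[i] for w, n in zip(weights, node_vecs)) for i in range(dim)]
    return pooled, weights

def linear_score(vec, w, b):
    # Assumption: a single linear layer maps the final representation to a score.
    return dot(vec, w) + b

def rank_answers(candidates, w, b):
    """candidates maps each answer to (cls_vec, node_vecs); returns answers sorted by score."""
    scored = []
    for answer, (cls_vec, node_vecs) in candidates.items():
        pooled, _ = attention_pool(cls_vec, node_vecs)
        scored.append((answer, linear_score(pooled, w, b)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_answers` with one entry per candidate answer returns the candidates ordered from the highest score to the lowest, which corresponds to the ordering described in step (6).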

Further, when extracting knowledge from the structured knowledge base, the entities (people, places, organizations, and the like) in the question and the answer are first identified, then the intermediate entities lying on paths from the question entities to the answer entities are found, and finally the question entities, intermediate entities, and answer entities together form the related path knowledge.
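A minimal sketch of this path search, assuming the knowledge base is available as a list of (head, relation, tail) triples and that entity recognition has already been done. The breadth-first search, the `max_hops` limit, and the function name `find_path` are illustrative assumptions; the source does not say how paths are searched or how long they may be.

```python
from collections import deque

def find_path(triples, question_entity, answer_entity, max_hops=3):
    """Return the shortest chain of triples from question_entity to answer_entity, or None."""
    # Index outgoing edges by head entity (edges are followed in their stored direction).
    adjacency = {}
    for head, relation, tail in triples:
        adjacency.setdefault(head, []).append((relation, tail))

    queue = deque([(question_entity, [])])
    visited = {question_entity}
    while queue:
        entity, path = queue.popleft()
        if entity == answer_entity:
            return path
        if len(path) >= max_hops:
            continue  # do not extend paths beyond the hop limit
        for relation, neighbor in adjacency.get(entity, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(entity, relation, neighbor)]))
    return None
```

With toy triples such as `("guitar", "UsedFor", "playing music")` and `("playing music", "HasSubevent", "singing")`, the search from "guitar" to "singing" returns both triples, with "playing music" as the intermediate entity.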

Further, when extracting knowledge from the unstructured knowledge base, a tool is first used to split the whole corpus into sentences and build an index; the question and the answer are then concatenated as input, and the top K sentences with the highest similarity are selected from the whole corpus; specifically, similarity is measured by term frequency-inverse document frequency (TF-IDF), so that documents with higher word coverage receive higher similarity.
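The retrieval step can be sketched in plain Python as below. The source names TF-IDF but not the exact weighting or similarity function, so the smoothed IDF formula, the cosine similarity, the whitespace tokenizer, and the names `build_index` and `top_k` are assumptions made for illustration. A production system would use a real search index over the full corpus.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_index(sentences):
    """Return per-sentence term counts and a smoothed IDF table."""
    docs = [Counter(tokenize(s)) for s in sentences]
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    return docs, idf

def tfidf_vector(counts, idf):
    # Terms unseen in the corpus are dropped from the query vector.
    return {t: c * idf[t] for t, c in counts.items() if t in idf}

def cosine(u, v):
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return num / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k(question, answer, sentences, k):
    """Concatenate question and answer, then return the k most similar sentences."""
    docs, idf = build_index(sentences)
    query = tfidf_vector(Counter(tokenize(question + " " + answer)), idf)
    scored = [(cosine(query, tfidf_vector(d, idf)), i) for i, d in enumerate(docs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties keep corpus order
    return [sentences[i] for _, i in scored[:k]]
```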

Further, the related path knowledge extracted from the structured knowledge base is expressed as natural language sentences: the original structured triple <e1, r1, e2> is converted into a sentence of the form e1 r1 e2; meanwhile, the sentences extracted from the unstructured knowledge base are concatenated one after another; the question and the answer are then concatenated with this knowledge and fed into a pre-trained language model to obtain the overall representation <cls>.
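As a rough sketch of this conversion and concatenation, the helpers below turn triples into plain sentences and join everything into one input string. The `" [SEP] "` separator, the ordering of the three parts, and the function names are assumptions; the source does not specify the input layout or how relation names are verbalized.

```python
def triple_to_sentence(triple):
    """Convert a structured <e1, r1, e2> triple into the plain sentence 'e1 r1 e2'."""
    head, relation, tail = triple
    return f"{head} {relation} {tail}"

def build_model_input(path_triples, evidence_sentences, question, answer, sep=" [SEP] "):
    """Join path knowledge, retrieved sentences, and question plus answer into one string."""
    knowledge = " ".join(triple_to_sentence(t) for t in path_triples)
    evidence = " ".join(evidence_sentences)  # retrieved sentences, one after another
    # Assumption: a textual separator marks the three segments of the model input.
    return sep.join([knowledge, evidence, question + " " + answer])
```

The resulting string would then be tokenized and passed to the pre-trained language model, whose `<cls>` output is used as the overall representation.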

Further, for the related path knowledge extracted from the structured knowledge base, each triple is a node, and an edge is added between two nodes if they share an entity, thereby building the graph.
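The graph construction for path knowledge can be sketched directly from this rule. The representation of the graph as a node list plus a set of index pairs, and the name `build_triple_graph`, are choices made for this illustration only.

```python
from itertools import combinations

def build_triple_graph(triples):
    """Each triple is a node; two nodes are linked if they share a head or tail entity."""
    nodes = list(triples)
    edges = set()
    for i, j in combinations(range(len(nodes)), 2):
        entities_i = {nodes[i][0], nodes[i][2]}
        entities_j = {nodes[j][0], nodes[j][2]}
        if entities_i & entities_j:
            edges.add((i, j))  # undirected edge stored once, with i < j
    return nodes, edges
```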

Further, for the sentences extracted from the unstructured knowledge base, a semantic role labeling (SRL) tool is first used to extract the predicates and their arguments from the sentences; each predicate and each argument becomes a node in the graph, and an edge is added between two nodes if their overlap reaches a certain degree.
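The sketch below builds such a graph from semantic role labeling output that is assumed to be already available as a list of frames. The frame format, the reading of the required coincidence between two nodes as Jaccard token overlap, the 0.5 threshold, and the merging of identical spans into one node are all assumptions; the source fixes none of them.

```python
def token_overlap(a, b):
    """Jaccard overlap between the token sets of two text spans."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def build_srl_graph(frames, threshold=0.5):
    """frames: list of {'predicate': str, 'arguments': [str, ...]} from an SRL tool."""
    nodes = []
    for frame in frames:
        for span in [frame["predicate"], *frame["arguments"]]:
            if span not in nodes:  # assumption: identical spans share one node
                nodes.append(span)
    edges = {
        (i, j)
        for i in range(len(nodes))
        for j in range(i + 1, len(nodes))
        if token_overlap(nodes[i], nodes[j]) >= threshold
    }
    return nodes, edges
```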

Further, a node in a graph may be a word or a phrase, and the representation of a node is the average of the vectors of its words.
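Averaging word vectors for a node can be sketched as below. The lookup table `embeddings`, the whitespace tokenization, and the zero-vector fallback for unknown words are assumptions for illustration; in practice the word vectors would likely come from the pre-trained language model.

```python
def node_vector(phrase, embeddings, dim):
    """Average the vectors of the words in a node; fall back to zeros if none are known."""
    vectors = [embeddings[w] for w in phrase.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```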

Further, when the graph convolutional network is used, the vector representations of the nodes and the connections between the nodes are taken as its inputs.

Further, the node vectors are represented using an undirected graph network.

Further, because a directed graph may lead to overfitting, the undirected version of the graph network is used to represent the node vectors.
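One graph convolution layer over an undirected graph can be sketched in plain Python as follows. The source does not give the propagation rule, so the symmetric normalization with self-loops and the ReLU activation (a commonly used formulation) are assumptions, as are the function names and the `weight` matrix, which would be learned in practice.

```python
import math

def normalized_adjacency(num_nodes, edges):
    """Symmetric adjacency with self-loops, scaled by 1 / sqrt(deg_i * deg_j)."""
    a = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        a[i][i] = 1.0  # self-loop keeps each node's own features
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0  # undirected: every edge is mirrored
    degree = [sum(row) for row in a]
    return [
        [a[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(num_nodes)]
        for i in range(num_nodes)
    ]

def matmul(x, y):
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]

def gcn_layer(features, edges, weight):
    """One layer: ReLU(normalized adjacency @ features @ weight)."""
    a_hat = normalized_adjacency(len(features), edges)
    hidden = matmul(matmul(a_hat, features), weight)
    return [[max(0.0, value) for value in row] for row in hidden]
```

Because each edge is mirrored before normalization, the output does not depend on the direction in which an edge was listed, which is the undirected behavior described above.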

Drawings

FIG. 1 is a flowchart of the overall method for extracting answer information based on a heterogeneous external knowledge source graph structure according to an embodiment;

FIG. 2 is a flowchart of representing the graphs with a graph convolutional network and obtaining the final result in an embodiment.

Detailed Description

The invention will be further described with reference to the accompanying drawings and specific embodiments.

External knowledge plays a crucial role in computing how well a query matches the candidate answers. For example, when searching Baidu for "the height of Xiao Ming's wife", a conventional search engine uses text matching to retrieve answers from documents containing keywords such as "Xiao Ming", "wife", and "height", and often fails to return an accurate answer. The invention uses knowledge from an external structured knowledge graph to recognize that Xiao Ming is an entity, then identifies that Xiao Ming's wife is Xiao Li, and finally returns Xiao Li's height attribute from the external knowledge graph as the answer. Structured knowledge thus provides high-quality knowledge but suffers from low coverage. Unstructured knowledge has broad coverage and complements structured knowledge well. For example, for the query "what do people usually do when playing guitar", the expected answer is "singing", but the prior art, which relies only on text matching, has difficulty producing the answer the user wants, whereas the relevant knowledge can be found in unstructured knowledge. Existing methods that use structured knowledge build graphs to represent the knowledge and then reason over them, while existing methods that use unstructured knowledge exploit it with text matching techniques. Neither kind makes effective use of structured and unstructured knowledge together. The invention improves on traditional question answering by combining the two: it provides a method for extracting answer information based on a heterogeneous external knowledge source graph structure that can reason with external knowledge and return answers. The following embodiment explains the method.

The embodiment provides a method for extracting answer information based on a heterogeneous external knowledge source graph structure. As shown in FIG. 1, the method consists of two parts: heterogeneous knowledge extraction and graph-network-based reasoning.

(1) Heterogeneous knowledge extraction covers extraction from structured knowledge, for which ConceptNet is used, and from unstructured knowledge, for which Wikipedia is used.

(1-1) When extracting knowledge from ConceptNet, the entities in the question and the answer are first identified, and then paths from the question entities to the answer entities are selected from the ConceptNet graph as the knowledge.

(1-2) When extracting knowledge from Wikipedia, a tool is first used to split the whole corpus into sentences and build an index, and then the question and the answer are concatenated as input to select the K sentences with the highest similarity from the whole corpus.

(1-3) The ConceptNet knowledge extracted in step (1-1) is structured into a graph: each triple is a node, and an edge is added between two nodes if they share an entity.

(1-4) For the Wikipedia knowledge extracted in step (1-2), a semantic role labeling (SRL) tool is used to extract the predicates and their arguments; each predicate and each argument becomes a node in the graph, and an edge is added between two nodes if their overlap reaches a certain degree.

(1-5) Based on the graph structures built in steps (1-3) and (1-4), a graph convolutional network is used to obtain representations, and the final answer is output.

(2) Referring to FIG. 2, the graphs are represented with a graph convolutional network and the final result is obtained as follows:

(2-1) Representation learning is performed on the two graphs with a graph convolutional network to obtain a vector representation of each node.

(2-2) A node in a graph may be a word or a phrase, and the representation of a node is the average of the vectors of its words.

(2-3) Because a directed graph may lead to overfitting, the undirected version of the graph network is used to represent the node vectors.

(2-4) Based on (2-2) and (2-3), <cls> is used to attend to each node in the graphs, giving a weight for each node, and the final representation is the sum of the node vectors weighted by these weights.

(2-5) The final representation obtained in step (2-4) is scored for relevance with a multi-layer linear network to obtain the relevance score of each candidate answer, and the answer with the highest score is finally selected.

The method of the invention is not limited to the examples described in the specific embodiments; other embodiments that those skilled in the art derive from the technical solution of the invention also fall within the scope of its technical innovation.

Claims

1. A method for extracting answer information based on a heterogeneous external knowledge source graph structure is characterized by comprising the following steps:

extracting, according to the question entered in the user's query, related path knowledge leading from the question to the answer from a structured knowledge base, and extracting the related sentences with the highest similarity to the question and the answer from an unstructured knowledge base;

concatenating the extracted related path knowledge, the related sentences, and the query input, and feeding the result into a pre-trained language model to obtain the overall semantic representation <cls>;

building a graph from the extracted related path knowledge and a graph from the related sentences, and performing representation learning on the two graphs with a graph convolutional network to obtain a vector representation of each node;

matching <cls> against each node in the graphs by similarity to obtain an attention weight for each node, and obtaining a final representation as the sum of the node vectors weighted by the attention weights;

scoring the final representation for relevance with a linear transformation network to obtain a relevance score for each answer, and sorting the answers from high to low by relevance score.

2. The method of claim 1, wherein the related path knowledge is extracted by: identifying the entities in the question and the answer, finding the intermediate entities on paths from the question entities to the answer entities, and combining the question entities, intermediate entities, and answer entities into the related path knowledge.

3. The method of claim 1, wherein the triple representation of the extracted related path knowledge is converted into natural language sentences.

4. The method of claim 1, wherein, for the extracted related path knowledge, each triple is a node, and an edge is added between two nodes if they share an entity.

5. The method of claim 1, wherein the related sentences are extracted by: splitting the whole corpus of the unstructured knowledge base into sentences, building an index, concatenating the question and the answer as input, and selecting from the whole corpus the top K sentences with the highest TF-IDF similarity as the related sentences.

6. The method of claim 1, wherein, for the extracted related sentences, a semantic role labeling tool is used to extract the predicates and their arguments, each predicate and each argument becomes a node in the graph, and an edge is added between two nodes if their overlap reaches a preset degree.

7. The method of claim 1, wherein each node in the graphs is a word or a phrase, and the representation of a node is the average of the vectors of its words.

8. The method of claim 1, wherein the vector representations of the nodes and the connections between the nodes are taken as inputs when the graph convolutional network is used.

9. The method of claim 1, wherein the node vectors are represented using an undirected graph network.

10. The method of claim 1, wherein the structured knowledge base comprises ConceptNet and WordNet, the unstructured knowledge base comprises Wikipedia, and the pre-trained language model comprises BERT, XLNet, and RoBERTa.