CN111581326A - Method for extracting answer information based on heterogeneous external knowledge source graph structure - Google Patents

Publication number
CN111581326A
Authority
CN
China
Prior art keywords
knowledge
graph
node
answers
answer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010238159.2A
Other languages
Chinese (zh)
Other versions
CN111581326B (en)
Inventor
虎嵩林
吕尚文
朱福庆
周薇
韩冀中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN202010238159.2A
Publication of CN111581326A
Application granted
Publication of CN111581326B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for extracting answer information based on a heterogeneous external knowledge source graph structure. The method belongs to the field of natural language processing and improves the quality of the answers returned by a question-answering system. It re-ranks the answers according to how well each answer matches the question and displays the answers the user cares about and expects at the top, which makes the search results more pertinent and lets the user obtain the desired answer in a shorter query time.

Description

Method for extracting answer information based on heterogeneous external knowledge source graph structure
Technical Field
The invention belongs to the field of natural language processing, and provides a method for extracting answer information based on a heterogeneous external knowledge source graph structure.
Background
A question-answering system aims to understand the query entered by a user, retrieve answers to the question from a massive background text database, and return them to the user. For example, when a specific query is entered into a conventional search engine such as Baidu, the engine retrieves answers corresponding to the query from its background text database, giving users an entry point for retrieving related knowledge and learning about the internet. Question-answering systems also have important application value in specific fields such as banking and e-commerce. Generally, a question-answering system retrieves a batch of answers relevant to the received user query from massive background text data, then sorts and returns them, so that the answers most relevant to the query appear at the front of the results and the user's expectation is met more quickly. For example, a user of the Baidu search engine expects the answers of interest to appear on the first pages and in the top positions.
In a question-answering system, results are returned mainly by traditional hand-crafted feature methods such as TF-IDF and BM25, or by a deep learning matching model, and answers with a higher matching degree are placed nearer the front. The matching results therefore play a crucial role in the final ranking of the answers.
In the field of deep learning, technologies such as text matching and retrieval applied to question-answering systems are maturing, and important results have been obtained on many tasks by predicting outcomes from the probability distribution of the words in a text. In recent years, pre-trained language models such as BERT and XLNet have achieved strong results on a variety of natural language processing tasks, on some of which they even outperform humans. This is mainly due to their strong prior knowledge and representation learning ability.
However, most existing matching models learn text representations from the probability distribution of the text, and they struggle with problems such as data sparsity and common sense. The main reason is that answering a commonsense question requires not only the information in the given text, but also valuable information selected on the basis of everyday experience to support the reasoning. This presents new challenges for existing text matching and retrieval models.
Disclosure of Invention
In view of the problems and defects in the prior art, and in order to improve the quality of the answers returned by a question-answering system, the invention provides a method for extracting answer information based on a heterogeneous external knowledge source graph structure. The method combines two heterogeneous knowledge sources, structured knowledge and unstructured knowledge. After the corresponding knowledge is obtained, it is structured into graphs, the representations of the graphs are learned with a graph convolution network, and finally the information from the text and the graphs is combined to return, from the background text data, the answer that correctly answers the user's question. The method re-ranks the answers according to how well each answer matches the question and displays the answers the user cares about and expects at the top, which makes the search results more pertinent and lets the user obtain the desired answer in a shorter query time.
In order to solve the problems, the invention adopts the following technical scheme:
A method for extracting answer information based on a heterogeneous external knowledge source graph structure comprises the following steps:
(1) according to the question entered by the user as a query, extracting related path knowledge leading from the question to the answers from a structured knowledge base (such as ConceptNet, WordNet and the like), and extracting related sentences whose wording is similar to the question and the answers from an unstructured knowledge base (such as English Wikipedia and the like);
(2) splicing the extracted related path knowledge, the related sentences and the user's query input together, and feeding the result into a pre-trained language model to obtain the overall semantic representation <cls>; <cls> plays an important role in many pre-trained language models, such as BERT, XLNet and RoBERTa, where it is a vector representation of the entire input; for example, in a classification task the <cls> vector can be used as the sentence representation for the classification output; the method passes <cls> through a single linear layer to represent the degree of matching between the query and the returned result;
(3) building a graph for the extracted related path knowledge (for example, from ConceptNet) and a graph for the extracted sentences (for example, from English Wikipedia), so that the structure of the knowledge can be exploited;
(4) performing representation learning on the two graphs with a graph convolution network to obtain a vector representation of each node; because the connections between neighboring nodes in the constructed graphs provide additional information for the semantic representation of each node, the method uses graph network representation learning to exploit the structural information in the graphs;
(5) matching <cls> against each node in the graphs by similarity to obtain an attention weight for each node, and taking the weighted sum of the attention weights and the node vectors as the final representation (the final matching vector);
(6) scoring the relevance of the final representation obtained in step (5) with one or more layers of linear transformation networks to obtain the relevance score of every answer, and sorting the answers from the highest score to the lowest; the higher the score, the better the answer matches the user's query input, the more relevant it is, and the nearer the front it is placed, so that the user obtains the desired answer sooner.
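Steps (5) and (6) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes dot-product similarity followed by a softmax for the attention weights and a single linear layer for scoring, neither of which is fixed by the text above. The function names, the weights `w` and `b`, and the toy vectors are all hypothetical.

```python
import math

def attention_pool(cls_vec, node_vecs):
    """Weight each node by softmax(dot(cls, node)) and return (weights, weighted sum)."""
    scores = [sum(c * n for c, n in zip(cls_vec, node)) for node in node_vecs]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(node_vecs[0])
    pooled = [sum(w * node[i] for w, node in zip(weights, node_vecs)) for i in range(dim)]
    return weights, pooled

def linear_score(vec, w, b):
    """Single linear layer mapping the final representation to a relevance score."""
    return sum(x * wi for x, wi in zip(vec, w)) + b

def rank_answers(candidates, w, b):
    """candidates maps answer -> (cls_vec, node_vecs); returns (answer, score), best first."""
    scored = []
    for answer, (cls_vec, node_vecs) in candidates.items():
        _, pooled = attention_pool(cls_vec, node_vecs)
        scored.append((answer, linear_score(pooled, w, b)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored

# Toy inputs (made up): one <cls> vector and two node vectors per candidate answer.
candidates = {
    "singing": ([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]),
    "sleeping": ([0.0, 1.0], [[0.2, 0.1], [0.1, 0.0]]),
}
ranked = rank_answers(candidates, w=[1.0, 0.5], b=0.0)
```

In this toy setup `ranked` lists "singing" before "sleeping"; in a real system the vectors would come from the language model and the graph network, and `w` and `b` would be learned.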
Further, when extracting knowledge from the structured knowledge base, the entities (people, places, organizations and the like) in the question and the answers are first identified, the intermediate entities lying on paths from the question entities to the answer entities are then found, and finally the question entities, intermediate entities and answer entities together form the related path knowledge.
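The path extraction described above can be illustrated with a breadth-first search over a small triple store. This is a sketch under assumptions: the triples are toy data rather than real ConceptNet content, the search follows edges only from head to tail, and `find_path` and `max_hops` are hypothetical names that do not come from the source.

```python
from collections import deque

# Toy (head, relation, tail) triples; illustrative only.
TRIPLES = [
    ("guitar", "UsedFor", "playing_music"),
    ("playing_music", "HasSubevent", "singing"),
    ("guitar", "IsA", "instrument"),
]

def find_path(triples, question_entity, answer_entity, max_hops=3):
    """Return the shortest chain of triples from question entity to answer entity, or None."""
    outgoing = {}
    for head, rel, tail in triples:
        outgoing.setdefault(head, []).append((head, rel, tail))
    queue = deque([(question_entity, [])])
    visited = {question_entity}
    while queue:
        entity, path = queue.popleft()
        if entity == answer_entity:
            return path
        if len(path) >= max_hops:
            continue  # do not expand beyond the hop limit
        for triple in outgoing.get(entity, []):
            nxt = triple[2]
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [triple]))
    return None

path = find_path(TRIPLES, "guitar", "singing")
```

Here the intermediate entity is `playing_music`, so the returned path contains two triples linking the question entity to the answer entity.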
Further, when extracting knowledge from the unstructured knowledge base, a tool is first used to split the whole corpus into sentences and build an index; the question and the answer are then spliced together as input, and the K sentences with the highest similarity are selected from the whole corpus. Specifically, similarity is measured with term frequency-inverse document frequency (TF-IDF), so that documents covering more of the query words receive a higher similarity.
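A small top-K retrieval sketch follows. The source names TF-IDF as the similarity measure but gives no formula, so the whitespace tokenizer, the smoothed IDF and the cosine similarity used here are assumptions, and all function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_index(sentences):
    """Return per-sentence term counts and a smoothed IDF table."""
    docs = [Counter(tokenize(s)) for s in sentences]
    doc_freq = Counter()
    for counts in docs:
        doc_freq.update(counts.keys())
    n = len(sentences)
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return docs, idf

def tfidf_vector(counts, idf):
    # Terms that never occur in the corpus are ignored.
    return {t: c * idf[t] for t, c in counts.items() if t in idf}

def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k_sentences(question, answer, sentences, k=2):
    """Splice question and answer into one query and return the k most similar sentences."""
    docs, idf = build_index(sentences)
    query = tfidf_vector(Counter(tokenize(question + " " + answer)), idf)
    scored = [(cosine(query, tfidf_vector(d, idf)), s) for d, s in zip(docs, sentences)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]

corpus = [
    "people often sing while playing guitar",
    "the guitar is a string instrument",
    "rain falls from clouds",
]
hits = top_k_sentences("what do people do while playing guitar", "sing", corpus, k=2)
```

On a corpus the size of Wikipedia the index would be built once and stored rather than recomputed per query as it is in this sketch.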
Further, the related path knowledge extracted from the structured knowledge base is expressed as natural language sentences: the original related path knowledge, which is structured as <e1, r1, e2> triples, is converted into sentences of the form "e1 r1 e2". Meanwhile, the sentences extracted from the unstructured knowledge base are spliced together one by one. The result is then spliced with the question and the answer and fed into a pre-trained language model to obtain the overall representation <cls>.
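The conversion and splicing can be sketched as below. The relation-to-phrase templates, the segment order and the literal `<cls>` and `<sep>` placeholder strings are assumptions made for illustration; a real pre-trained model defines its own special tokens through its tokenizer.

```python
# Hypothetical templates that turn relation labels into readable phrases.
RELATION_TEMPLATES = {
    "UsedFor": "is used for",
    "HasSubevent": "has subevent",
    "IsA": "is a",
}

def triple_to_sentence(triple):
    """Render an (e1, r1, e2) triple as a plain sentence 'e1 r1 e2.'"""
    e1, rel, e2 = triple
    phrase = RELATION_TEMPLATES.get(rel, rel)  # fall back to the raw relation label
    return f"{e1.replace('_', ' ')} {phrase} {e2.replace('_', ' ')}."

def build_model_input(question, answer, path, sentences, cls="<cls>", sep="<sep>"):
    """Splice path knowledge, retrieved sentences, question and answer into one string."""
    knowledge = " ".join(triple_to_sentence(t) for t in path)
    evidence = " ".join(sentences)
    return f"{cls} {knowledge} {sep} {evidence} {sep} {question} {sep} {answer}"

text = build_model_input(
    "what do people do while playing guitar",
    "singing",
    [("guitar", "UsedFor", "playing_music")],
    ["people often sing while playing guitar"],
)
```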
Further, for the related path knowledge extracted from the structured knowledge base, each triple is a node, and an edge is added between two nodes if they share a common entity, thereby building a graph.
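The graph construction rule for triples is simple enough to show directly. The sketch below uses toy triples, and the function name and the index-pair edge encoding are hypothetical.

```python
from itertools import combinations

def build_triple_graph(triples):
    """Each triple is a node; two nodes are linked if they share a head or tail entity."""
    nodes = list(triples)
    edges = set()
    for i, j in combinations(range(len(nodes)), 2):
        entities_i = {nodes[i][0], nodes[i][2]}
        entities_j = {nodes[j][0], nodes[j][2]}
        if entities_i & entities_j:
            edges.add((i, j))  # undirected edge stored as (smaller index, larger index)
    return nodes, edges

triples = [
    ("guitar", "UsedFor", "playing_music"),
    ("playing_music", "HasSubevent", "singing"),
    ("guitar", "IsA", "instrument"),
]
nodes, edges = build_triple_graph(triples)
```

Nodes 0 and 1 share `playing_music` and nodes 0 and 2 share `guitar`, while nodes 1 and 2 share nothing and stay unconnected.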
Further, for the sentences extracted from the unstructured knowledge base, a semantic role labeling tool is first used to extract the predicates and arguments of each sentence; each predicate and each argument becomes a node in the graph, and an edge is added between two nodes if their degree of overlap meets a certain threshold.
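The sketch below builds such a graph from semantic role labeling output that is assumed to be already available; it does not call any labeling tool. The frame format, the use of token-level Jaccard similarity as the overlap measure and the 0.5 threshold are all assumptions, since the source does not define how overlap is measured.

```python
def jaccard(a, b):
    """Token-level Jaccard overlap between two text spans."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def build_srl_graph(frames, threshold=0.5):
    """Predicates and arguments become nodes; sufficiently overlapping nodes are linked."""
    nodes = []
    for frame in frames:
        for span in [frame["predicate"], *frame["arguments"]]:
            if span not in nodes:  # merge identical spans into one node
                nodes.append(span)
    edges = {
        (i, j)
        for i in range(len(nodes))
        for j in range(i + 1, len(nodes))
        if jaccard(nodes[i], nodes[j]) >= threshold
    }
    return nodes, edges

# Hypothetical labeling output for two sentences.
frames = [
    {"predicate": "play", "arguments": ["people", "the guitar"]},
    {"predicate": "sing", "arguments": ["people", "a song", "the acoustic guitar"]},
]
srl_nodes, srl_edges = build_srl_graph(frames)
```

With these frames, "the guitar" and "the acoustic guitar" share two of three distinct tokens, so they are the only pair that gets an edge.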
Further, a node in the graph may be a word or a phrase, and the representation of a node is the average of the vectors of its words.
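As a small worked example of the averaging, assuming a word-vector lookup table is available (the table, its two-dimensional vectors and the zero-vector fallback for unknown words are made up for illustration):

```python
def node_vector(text, embeddings, dim):
    """Average the vectors of the words in a node; unknown words are skipped."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * dim  # assumption: fall back to a zero vector
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

embeddings = {"the": [0.0, 2.0], "guitar": [4.0, 0.0]}
phrase_vec = node_vector("the guitar", embeddings, dim=2)  # [2.0, 1.0]
```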
Further, when the graph convolution network is used, both the vector representation of each node and the connection relations between nodes are taken as inputs.
Further, the vector of nodes is represented using an undirected graph network.
Further, because a directed graph may be prone to overfitting, the undirected version of the graph network is used to represent the node vectors.
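One layer of graph convolution over an undirected graph can be sketched in plain Python. The source does not state which propagation rule is used; this sketch assumes the common symmetric normalization with self-loops, H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W), and the function name and toy weights are hypothetical.

```python
import math

def gcn_layer(features, edges, weight):
    """Apply one graph convolution layer to an undirected graph.

    features: list of node vectors; edges: iterable of (i, j) pairs;
    weight: d_in x d_out matrix given as a list of rows.
    """
    n = len(features)
    # Adjacency with self-loops; each edge is mirrored so the graph is undirected.
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        adj[i][j] = 1.0
        adj[j][i] = 1.0
    degree = [sum(row) for row in adj]
    norm = [[adj[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)] for i in range(n)]
    d_in, d_out = len(weight), len(weight[0])
    # Aggregate neighbor features, then apply the linear map and ReLU.
    agg = [[sum(norm[i][k] * features[k][c] for k in range(n)) for c in range(d_in)] for i in range(n)]
    return [
        [max(0.0, sum(agg[i][c] * weight[c][o] for c in range(d_in))) for o in range(d_out)]
        for i in range(n)
    ]

identity = [[1.0, 0.0], [0.0, 1.0]]
out = gcn_layer([[1.0, 0.0], [0.0, 1.0]], {(0, 1)}, identity)
```

With two connected nodes and an identity weight matrix, each output vector is the average of both input vectors, which shows how a node's representation absorbs information from its neighbor.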
Drawings
FIG. 1 is a flowchart of the overall method for extracting answer information based on a heterogeneous external knowledge source graph structure according to an embodiment;
FIG. 2 is a flowchart of representing the graphs with a graph convolution network and obtaining the final result in an embodiment.
Detailed Description
The invention will be further described with reference to the accompanying drawings and specific embodiments.
External knowledge plays a crucial role in computing how well a query matches the retrieved answers. For example, when searching for "height of the wife of Xiao Ming" on Baidu, a conventional search engine uses text matching to retrieve answers from documents containing keywords such as "Xiao Ming", "wife" and "height", and often fails to return an accurate answer. The invention uses knowledge in an external structured knowledge graph to recognize that Xiao Ming is an entity, then to identify that Xiao Ming's wife is Xiao Li, and finally to find the height attribute of Xiao Li in the external knowledge graph and return it as the answer. Structured knowledge thus provides high-quality knowledge, but it suffers from low coverage. Unstructured knowledge has wide coverage and is a good supplement to structured knowledge. For example, when searching for "what do people usually do when playing guitar", the answer should be "singing", but the prior art, which relies only on text matching, can hardly return the answer the user wants. The relevant knowledge, and hence a good answer, can be found in unstructured knowledge. Existing methods that use structured knowledge build graphs to represent the knowledge and then reason over them, and existing methods that use unstructured knowledge exploit it with text matching techniques. However, neither approach alone makes effective use of both structured and unstructured knowledge. The invention improves the traditional question-answering method by using both, and provides a method for extracting answer information based on a heterogeneous external knowledge source graph structure that can reason with external knowledge and return answers. The following embodiment explains the method.
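The "height of the wife of Xiao Ming" example amounts to following two relations in a knowledge graph. A toy sketch, with a made-up graph and a placeholder in place of any real height value (the dictionary layout and function name are hypothetical):

```python
# Toy knowledge graph keyed by (entity, relation); "<height>" is a placeholder value.
KNOWLEDGE_GRAPH = {
    ("Xiao Ming", "wife"): "Xiao Li",
    ("Xiao Li", "height"): "<height>",
}

def follow_relations(graph, entity, relations):
    """Follow a chain of relations from an entity; return None if any hop is missing."""
    current = entity
    for relation in relations:
        current = graph.get((current, relation))
        if current is None:
            return None
    return current

answer = follow_relations(KNOWLEDGE_GRAPH, "Xiao Ming", ["wife", "height"])
```

The lookup succeeds only when every hop is present in the graph, which reflects the low-coverage limitation of structured knowledge noted above.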
The embodiment provides a method for extracting answer information based on a heterogeneous external knowledge source graph structure. As shown in FIG. 1, the method consists of two parts: heterogeneous knowledge extraction and graph-network-based reasoning.
(1) Heterogeneous knowledge extraction covers extraction from structured knowledge, for which ConceptNet is used, and from unstructured knowledge, for which Wikipedia is used.
(1-1) When knowledge is extracted from ConceptNet, the entities in the question and the answers are first identified, and the paths from the question entities to the answer entities are then selected from the ConceptNet graph as the knowledge.
(1-2) When knowledge is extracted from Wikipedia, a tool is first used to split the whole corpus into sentences and build an index; the question and the answer are then spliced together as input to select the K sentences with the highest similarity from the whole corpus.
(1-3) The ConceptNet knowledge extracted in step (1-1) is structured into a graph: each triple is a node, and an edge is added between two nodes if they share a common entity.
(1-4) For the Wikipedia knowledge extracted in step (1-2), a semantic role labeling tool is used to extract the predicates and arguments of each sentence; each predicate and each argument becomes a node in the graph, and an edge is added between two nodes if their degree of overlap meets a certain threshold.
(1-5) On the graph structures built in steps (1-3) and (1-4), a graph convolution network is used to obtain representations, and the final answer is output.
(2) Referring to FIG. 2, the method for representing the graphs with a graph convolution network and obtaining the final result specifically includes:
(2-1) Representation learning is performed on the two graphs with a graph convolution network to obtain a vector representation of each node.
(2-2) A node in the graph may be a word or a phrase, and the representation of a node is the average of the vectors of its words.
(2-3) Because a directed graph may be prone to overfitting, the undirected version of the graph network is used to represent the node vectors.
(2-4) Following (2-2) and (2-3), <cls> is matched against each node in the graph to obtain a weight for each node, and the weighted sum of the weights and the node vectors gives the final representation.
(2-5) The final representation from step (2-4) is scored for relevance by a multilayer linear network to obtain the relevance score of each answer, and the answer with the highest score is finally selected.
The method of the present invention is not limited to the examples described in the specific embodiments, and other embodiments derived from the technical solutions of the present invention by those skilled in the art also belong to the technical innovation scope of the present invention.

Claims (10)

1. A method for extracting answer information based on a heterogeneous external knowledge source graph structure is characterized by comprising the following steps:
according to the question entered by the user as a query, extracting related path knowledge leading from the question to the answers from a structured knowledge base, and extracting the related sentences with the highest similarity to the question and the answers from an unstructured knowledge base;
splicing the extracted related path knowledge, the related sentences and the query input together, and feeding the result into a pre-trained language model to obtain the overall semantic representation <cls>;
building a graph for the extracted related path knowledge and a graph for the related sentences, and performing representation learning on the two graphs with a graph convolution network to obtain a vector representation of each node;
matching <cls> against each node in the graphs by similarity to obtain an attention weight for each node, and obtaining a final representation as the weighted sum of the attention weights and the node vectors;
scoring the relevance of the final representation with a linear transformation network to obtain the relevance score of each answer, and sorting the answers from the highest relevance score to the lowest.
2. The method of claim 1, wherein the related path knowledge is extracted by: identifying the entities in the question and the answers, finding the intermediate entities on paths from the question entities to the answer entities, and combining the question entities, intermediate entities and answer entities into the related path knowledge.
3. The method of claim 1, wherein the triple representation of the extracted related path knowledge is converted into a natural language sentence.
4. The method of claim 1, wherein, for the extracted related path knowledge, each triple is a node, and an edge is added between two nodes if they share a common entity.
5. The method of claim 1, wherein the related sentences are extracted by: splitting the whole corpus of the unstructured knowledge base into sentences and building an index, splicing the question and the answer together as input, and selecting from the whole corpus the K sentences with the highest TF-IDF similarity as the related sentences.
6. The method of claim 1, wherein, for the extracted related sentences, a semantic role labeling tool is used to extract the predicates and arguments therein, each predicate and each argument becomes a node in the graph, and an edge is added between two nodes if their degree of overlap meets a preset threshold.
7. The method of claim 1, wherein each node in the graph is a word or phrase, and the representation of the node is an average of vectors for each word.
8. The method of claim 1, wherein, when the graph convolution network is used, the vector representation of each node and the connection relations between nodes are taken as inputs.
9. The method of claim 1, wherein the vectors of nodes are represented using an undirected graph network.
10. The method of claim 1, wherein the structured knowledge base comprises ConceptNet, WordNet, the unstructured knowledge base comprises Wikipedia, and the pre-trained language model comprises BERT, XLNet, RoBERTa.
CN202010238159.2A 2020-03-30 2020-03-30 Method for extracting answer information based on heterogeneous external knowledge source graph structure Active CN111581326B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010238159.2A CN111581326B (en) 2020-03-30 2020-03-30 Method for extracting answer information based on heterogeneous external knowledge source graph structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010238159.2A CN111581326B (en) 2020-03-30 2020-03-30 Method for extracting answer information based on heterogeneous external knowledge source graph structure

Publications (2)

Publication Number Publication Date
CN111581326A (en) 2020-08-25
CN111581326B (en) 2022-05-31

Family

ID=72113555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010238159.2A Active CN111581326B (en) 2020-03-30 2020-03-30 Method for extracting answer information based on heterogeneous external knowledge source graph structure

Country Status (1)

Country Link
CN (1) CN111581326B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113536795A (en) * 2021-07-05 2021-10-22 杭州远传新业科技有限公司 Method, system, electronic device and storage medium for entity relation extraction

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017077383A1 (en) * 2015-11-04 2017-05-11 EDUCATION4SIGHT GmbH Systems and methods for instrumentation of education processes
CN109902145A (en) * 2019-01-18 2019-06-18 中国科学院信息工程研究所 A kind of entity relationship joint abstracting method and system based on attention mechanism
CN110704640A (en) * 2019-09-30 2020-01-17 北京邮电大学 Representation learning method and device of knowledge graph
CN110717047A (en) * 2019-10-22 2020-01-21 湖南科技大学 Web service classification method based on graph convolution neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHANGWEN LV, WANHUI QIAN, LONGTAO HUANG, JIZHONG HAN, SONGLIN HU: "Integrating Event-Level and Chain-Level Attentions to Predict What Happens Next", 《AAAI 2019》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113536795A (en) * 2021-07-05 2021-10-22 杭州远传新业科技有限公司 Method, system, electronic device and storage medium for entity relation extraction
CN113536795B (en) * 2021-07-05 2022-02-15 杭州远传新业科技有限公司 Method, system, electronic device and storage medium for entity relation extraction

Also Published As

Publication number Publication date
CN111581326B (en) 2022-05-31

Similar Documents

Publication Publication Date Title
US10678816B2 (en) Single-entity-single-relation question answering systems, and methods
RU2487403C1 (en) Method of constructing semantic model of document
CN107180045B (en) Method for extracting geographic entity relation contained in internet text
CN105528437B (en) A kind of question answering system construction method extracted based on structured text knowledge
Zubrinic et al. The automatic creation of concept maps from documents written using morphologically rich languages
CN109271505A (en) A kind of question answering system implementation method based on problem answers pair
CN110674252A (en) High-precision semantic search system for judicial domain
KR20190015797A (en) The System and the method of offering the Optimized answers to legal experts utilizing a Deep learning training module and a Prioritization framework module based on Artificial intelligence and providing an Online legal dictionary utilizing a character Strings Dictionary Module that converts legal information into significant vector
Van de Camp et al. The socialist network
CN114065758A (en) Document keyword extraction method based on hypergraph random walk
CN110888991A (en) Sectional semantic annotation method in weak annotation environment
CN112507109A (en) Retrieval method and device based on semantic analysis and keyword recognition
CN112036178A (en) Distribution network entity related semantic search method
CN117312499A (en) Big data analysis system and method based on semantics
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN111581364A (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN112417170B (en) Relationship linking method for incomplete knowledge graph
CN111581326B (en) Method for extracting answer information based on heterogeneous external knowledge source graph structure
Abimbola et al. A Noun-Centric Keyphrase Extraction Model: Graph-Based Approach
CN112084312A (en) Intelligent customer service system constructed based on knowledge graph
CN116562280A (en) Literature analysis system and method based on general information extraction
Das et al. An improvement of Bengali factoid question answering system using unsupervised statistical methods
Lokman et al. A conceptual IR chatbot framework with automated keywords-based vector representation generation
Niranjan et al. Question answering system for agriculture domain using machine learning techniques: literature survey and challenges
Song et al. Research on intelligent question answering system based on college enrollment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant