CN116383357A - Knowledge graph-oriented query graph generation method and system - Google Patents
- Publication number
- CN116383357A (application CN202310363988.7A)
- Authority
- CN
- China
- Prior art keywords
- query graph
- path
- relation
- query
- graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F16/3322—Query formulation using system suggestions
- G06F16/367—Ontology
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a knowledge-graph-oriented query graph generation method and system. The method comprises the following steps: constructing a relation detection model and selecting, from the candidate relation set, the main relation path that best matches the question; serializing the query graph and uniformly encoding the question and the query graph sequence; and constructing a query graph ranking model that ranks the query graphs in the candidate set according to their semantic similarity scores with the question and selects the best query graph. The invention provides a new approach to the query graph generation task in knowledge graph question answering and improves overall question-answering performance by completing each subtask of the query graph generation process, including relation detection, query graph serialization and query graph ranking. Compared with the prior art, the proposed model is based on end-to-end matching, introduces no hand-designed features, and is simple to implement.
Description
Technical Field
The invention relates to a query graph generation technology, in particular to a knowledge graph-oriented query graph generation method and system.
Background
At present, in the Internet age, people are accustomed to acquiring information through the network: one only needs to enter keywords and a search engine returns various information related to them, which greatly facilitates daily work and life. However, faced with a natural language question posed by a user, a conventional search engine can only match and combine the keywords in the question, so it is difficult for it to accurately understand the complex logical relations in the question; what the search returns is a list of web pages related to the question's keywords rather than the final answer, which the user must screen further, reducing search efficiency.
In recent years, with the development of knowledge graphs, information retrieval, deep learning and other technologies, knowledge graph question answering has become a new technique for question-answering tasks. It uses the rich semantic association information in the knowledge graph to analyze the natural language question posed by the user, fully understand the user's intent, retrieve over the knowledge graph and return the answer to the user. The query graph generation method simplifies the semantic parsing of the question into a query graph generation process: by mapping the natural language question to a query graph, the logical form of the question can be expressed clearly and intuitively, which facilitates machine understanding and further improves the efficiency and performance of knowledge graph question answering.
In order for the query graph to express the semantic information of the question as accurately as possible, researchers decompose the mapping from question to query graph into different subtasks and complete each subtask based on either rule templates or neural networks. Neural-network-based approaches are currently dominant because predefined rule templates require too much human intervention. Bao et al. proposed a query graph generation method for multi-constraint questions (Bao J, Duan N, Yan Z, et al. Constraint-Based Question Answering with Knowledge Graph [C] // Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. 2016: 2503-2514.), which transforms a multi-constraint question into a multi-constraint query graph, encodes candidate query graphs with two CNN models and, in addition to the CNN-encoded question, entity and relation features, manually designs a series of constraint-related features such as the number of each constraint and the sum of entity-linking scores of constant vertices in entity constraints. Luo et al. proposed a semantic matching model for matching natural language questions to query graphs (Luo K, Lin F, Luo X, et al. Knowledge Base Question Answering via Encoding of Complex Query Graphs [C] // Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018: 2185-2194.), implemented on an "encode-compare" framework that encodes the natural language question and the query graph separately. First, Bi-GRU models encode the question in a global and a local manner, and the sum of the two is taken as the question encoding; then the query graph is represented by the predicate sequence on its relation path, and the predicate id sequence and predicate name sequence are used to obtain the query graph encoding; finally, the similarity between the question and the query graph is computed with the cosine distance.
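To make the "encode-compare" idea concrete, the following is a minimal PyTorch sketch of that generic framework, not Luo et al.'s exact model: both inputs are compressed into single vectors by an aggregation step and compared with cosine similarity. The GRU encoder, layer sizes and max-pooling aggregation are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncodeCompare(nn.Module):
    """Generic encode-compare sketch: encode, aggregate to one vector, compare."""
    def __init__(self, vocab_size: int, emb_dim: int = 100, hidden: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # One bidirectional GRU encoder shared by the question and the query graph sequence.
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)

    def encode(self, ids: torch.Tensor) -> torch.Tensor:
        out, _ = self.encoder(self.emb(ids))      # (B, L, 2*hidden)
        return out.max(dim=1).values              # aggregate the whole sequence into one vector

    def forward(self, question_ids: torch.Tensor, graph_ids: torch.Tensor) -> torch.Tensor:
        q = self.encode(question_ids)
        g = self.encode(graph_ids)
        return F.cosine_similarity(q, g, dim=-1)  # similarity score per pair in the batch

# Usage (hypothetical ids): scores = EncodeCompare(vocab_size=30000)(q_batch, g_batch)
```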
The method proposed by Bao et al. decomposes the query graph into several semantic components and then encodes these components jointly; this ignores the structural information of the query graph, introduces hand-designed features and increases encoding complexity.
The method proposed by Luo et al., based on the "encode-compare" framework, directly encodes the question and the query graph's predicate sequence into vector sequences, compresses each vector sequence into a single vector through an aggregation operation, and finally compares the similarity of the question vector and the query graph vector. This abstracts the question and the query graph too strongly, does not consider how their internal information interacts, and the vector aggregation easily loses key information needed for matching.
Because of the structural difference between the query graph and the question, these methods encode the query graph independently and cannot realize internal information interaction between the query graph and the question.
Disclosure of Invention
The invention aims to provide a knowledge-graph-oriented query graph generation method and system that use a bidirectional attention mechanism to realize information interaction between the question and the relation, solving the problem that the "encode-compare" framework ignores such interaction; use local information extracted through local interaction as a supplement to the global information obtained by aggregation, alleviating the loss of semantic information in the aggregation operation; and convert the graph structure into a linear structure through query graph serialization so that the query graph and the question can be encoded uniformly, overcoming the structural difference between them. By solving these key problems, the accuracy of question answering is improved.
The technical scheme for realizing the purpose of the invention is as follows. In a first aspect, the invention provides a knowledge-graph-oriented query graph generation method, comprising the following steps:
constructing a relation detection model and selecting, from the candidate relation set, the main relation path that best matches the question;
serializing the query graph and uniformly encoding the question and the query graph sequence; and
constructing a query graph ranking model, ranking the query graphs in the candidate set according to their semantic similarity scores with the question, and selecting the best query graph.
In a second aspect, the invention provides a knowledge-graph-oriented query graph generation system, comprising:
a relation detection model construction module for selecting, from the candidate relation set, the main relation path that best matches the question;
a query graph serialization module for serializing the query graph and uniformly encoding the question and the query graph sequence; and
a query graph ranking model construction module for ranking the query graphs in the candidate set according to their semantic similarity scores with the question and selecting the best query graph.
In a third aspect, the present invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of the first aspect when the computer program is executed.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of the first aspect.
Compared with the prior art, the invention has the following significant advantages:
(1) In the relation detection model, a bidirectional attention mechanism realizes attention interaction between the question and the relation. Considering their mutual influence makes the model focus on the parts of the question and the relation that are related to each other, which helps extract the key information for question-relation matching.
(2) In the relation detection model, global features and local features are considered together; extracting local features through local interaction effectively supplements the semantic information lost in the aggregation operation.
(3) In the query graph ranking task, the query graph is serialized according to the main relation path and the different constraint sub-paths, so that certain structural characteristics of the query graph are retained. This overcomes the drawback caused by the structural difference between the query graph and the question and allows the question and the query graph to be encoded uniformly.
(4) In the query graph ranking model, BERT encodes character-level representations of the entity mention and the entity name, which enhances the model's robustness to out-of-vocabulary words and improves its overall performance.
Drawings
FIG. 1 is a structural diagram of the relation detection model.
FIG. 2 is a diagram of the query graph serialization process.
FIG. 3 is a structural diagram of the query graph ranking model.
Detailed Description
The technical scheme of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by one of ordinary skill in the art without creative effort, based on the embodiments of the present invention, fall within the scope of the present invention.
On the basis of having completed the linking of the natural language question to knowledge graph entities, the invention proposes two models, one for the relation detection task and one for the query graph ranking task, so as to improve the accuracy and efficiency of knowledge graph question answering.
FIG. 1 shows the structure of the relation detection model, whose purpose is to select, from the candidate relation set, the main relation path that best matches the question. The model comprises the following steps:
(1) Question and relation preprocessing. The question is represented at word level: first, using the result of entity linking, the entity mention in the question is replaced with the generic mark <e>; the question is then segmented into a word sequence W_q. A candidate relation is represented at two granularities, word level and relation level: the relations forming a relation path are split into a word sequence and a relation-name sequence, and the union of the two, W_r, is taken as the relation representation.
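As a concrete illustration of step (1), the following sketch shows one plausible preprocessing, assuming whitespace tokenization and an entity-linking result given as a mention string; the helper names and the underscore-splitting of relation names are assumptions, not details from the patent.

```python
from typing import List

def preprocess_question(question: str, entity_mention: str) -> List[str]:
    """Replace the linked entity mention with the generic mark <e>, then segment into W_q."""
    masked = question.replace(entity_mention, "<e>")
    return masked.split()                         # word sequence W_q

def preprocess_relation_path(relations: List[str]) -> List[str]:
    """Represent a candidate relation path at word level and relation level, then merge into W_r."""
    word_level = [w for rel in relations for w in rel.replace("_", " ").split()]
    relation_level = list(relations)              # relation-name tokens kept whole
    return word_level + relation_level            # combined sequence W_r

# Hypothetical example:
# preprocess_question("who is the captain of team X", "team X")
# preprocess_relation_path(["sports_team.captain", "person.name"])
```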
(2) Question and relation encoding. First, a word embedding layer maps the question sequence W_q and the relation sequence W_r to their corresponding word embedding sequences E_q and E_r; the word embedding layer is initialized with GloVe pre-trained word vectors. Then two Bi-LSTMs semantically encode E_q and E_r respectively, yielding the semantic vector representations Q and R of the question and the candidate relation.
(3) Bidirectional attention mechanism. First, the attention weights with which the question and the relation influence each other are computed for each pair of word vectors q_i ∈ Q and r_j ∈ R using a trainable parameter matrix W_a. Then the attention-weighted sums of the question and the relation under the attention mechanism are computed, giving the attention representations of the question and the relation, where n denotes the number of words in the question and m denotes the number of words in the relation sequence.
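The exact attention formulas in step (3) are rendered as images in the source, so the sketch below shows one plausible realization of the bidirectional attention: a bilinear score Q·W_a·Rᵀ normalized by softmax in both directions. The bilinear form and the softmax directions are assumptions.

```python
import torch
import torch.nn as nn

class BiAttention(nn.Module):
    """One plausible bidirectional attention between question vectors Q and relation vectors R."""
    def __init__(self, dim: int):
        super().__init__()
        self.W_a = nn.Parameter(torch.empty(dim, dim))   # trainable parameter matrix W_a
        nn.init.xavier_uniform_(self.W_a)

    def forward(self, Q: torch.Tensor, R: torch.Tensor):
        # Q: (n, dim) question word vectors, R: (m, dim) relation word vectors
        scores = Q @ self.W_a @ R.T                      # (n, m) pairwise interaction scores
        attn_q2r = torch.softmax(scores, dim=1)          # each question word attends over relation words
        attn_r2q = torch.softmax(scores, dim=0)          # each relation word attends over question words
        R_tilde = attn_q2r @ R                           # (n, dim) relation summary aligned to the question
        Q_tilde = attn_r2q.T @ Q                         # (m, dim) question summary aligned to the relation
        return Q_tilde, R_tilde
```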
(4) Global comparison score. Max pooling is applied to the question and relation representations respectively to obtain the global semantic vectors q_g and r_g, and their cosine similarity is computed as the global comparison score of the question and the relation.
(5) Local interaction score. First, the vectors at corresponding positions of Q and the attention representation of the relation are spliced to obtain the interaction vector sequence C; then a new Bi-LSTM extracts local semantic features from C to obtain the local feature representation T; finally, a feed-forward neural network and max pooling reduce the dimension of T, giving the local interaction score of the question and the relation.
(6) Semantic similarity. The global comparison score and the local interaction score are added to obtain the total semantic similarity score of the question and the relation.
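Building on the BiAttention sketch above, the following illustrates steps (4) to (6): a global cosine score from max-pooled vectors, a local interaction score from a second Bi-LSTM over the spliced sequence, and their sum. Whether the raw encodings or the attention representations are pooled in step (4), as well as the hidden size and the shape of the feed-forward layer, are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchScorer(nn.Module):
    """Global comparison score + local interaction score = total semantic similarity."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.local_lstm = nn.LSTM(2 * dim, hidden, batch_first=True, bidirectional=True)
        self.ffn = nn.Linear(2 * hidden, 1)              # feed-forward reduction of T

    def forward(self, Q: torch.Tensor, R: torch.Tensor, R_tilde: torch.Tensor) -> torch.Tensor:
        # Step (4): global score. Pooling the raw Bi-LSTM outputs Q and R here is an assumption;
        # the patent's figures may pool the attention representations instead.
        q_g = Q.max(dim=0).values
        r_g = R.max(dim=0).values
        global_score = F.cosine_similarity(q_g, r_g, dim=0)

        # Step (5): splice Q with its aligned relation summary R_tilde, re-encode with a new Bi-LSTM,
        # then reduce with a feed-forward layer and max pooling.
        C = torch.cat([Q, R_tilde], dim=-1).unsqueeze(0)  # (1, n, 2*dim)
        T, _ = self.local_lstm(C)                         # (1, n, 2*hidden)
        local_score = self.ffn(T).max(dim=1).values.squeeze()

        # Step (6): sum of the two scores.
        return global_score + local_score
```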
To overcome the encoding difficulty caused by the structural difference between the query graph and the question, the invention serializes the query graph before encoding and then uniformly encodes the question and the query graph sequence. FIG. 2 illustrates an example of the query graph serialization process, which comprises the following steps:
(1) Query graph splitting. The query graph is first split into five parts according to the constraint types defined in the query graph expansion method proposed by Luo et al.: the main relation path, entity constraint sub-path, type constraint sub-path, time constraint sub-path and ordinal constraint sub-path. Then the entity name in the main relation path is replaced with the generic mark [unused1], and the literal value of the answer node in the main relation path is replaced with the answer type.
In question "who is the first team leader of chinese men after 2012? "for example, the splitting result of the corresponding query graph is shown in fig. 2, and because the problem is a multi-constraint complex problem, the query graph further comprises four types of constraints besides the main relationship path. The main relation path { athlete, basketball athlete } indicates the main topic of the question, the main relation path { title, captain } is used for restricting the answering entity, the type restriction sub-path { man } indicates the type of the answer, the time restriction sub-path { >,2012} restricts the time range of the question, and the ordinal restriction sub-path { incumbent time, earliest } corresponds to the constraint of 'first incumbent' in the question. The information of various sub-paths corresponds to the constraint in the problem one by one, and the semantic information of the problem is completely represented.
(2) Sub-path serialization. Each sub-path is traversed by depth-first search, and the information of the corresponding nodes and directed edges is recorded in order, giving a serialized representation of each sub-path.
(3) Overall query graph serialization. The sub-path sequences are merged in the order main relation path, entity constraint sub-path, type constraint sub-path, time constraint sub-path, ordinal constraint sub-path, giving the serialized representation of the complete query graph.
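The following sketch illustrates the serialization of steps (1) to (3) on a toy query graph loosely modeled on the FIG. 2 example. The dictionary layout, the token values and the separator are illustrative assumptions, and each sub-path is assumed to be already flattened into node and edge tokens by the depth-first traversal of step (2).

```python
from typing import Dict, List

# Fixed merge order from step (3): main path first, then the four constraint sub-paths.
SUBPATH_ORDER = ["main", "entity", "type", "time", "ordinal"]

def serialize_query_graph(graph: Dict[str, List[str]]) -> str:
    """Concatenate the sub-path token sequences in the fixed order into one sequence."""
    parts = []
    for key in SUBPATH_ORDER:
        tokens = graph.get(key, [])
        if tokens:
            parts.append(" ".join(tokens))
    return " ; ".join(parts)

# Toy example in the spirit of the "first captain after 2012" question (values are illustrative):
toy_graph = {
    "main":    ["[unused1]", "athlete", "basketball player", "<answer:person>"],
    "entity":  ["title", "captain"],
    "type":    ["man"],
    "time":    [">", "2012"],
    "ordinal": ["incumbent time", "earliest"],
}
print(serialize_query_graph(toy_graph))
```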
FIG. 3 shows the structure of the query graph ranking model. The query graph ranking task is to rank the query graphs in the candidate set according to their semantic similarity scores with the question and to select the best query graph. The steps are as follows:
(1) Question and query graph encoding. The question and query graph sequences are encoded with a BERT model as the encoder. First, the entity mention in the question is replaced with the generic mark [unused0], and the question and the query graph sequence are represented as word sequences W_q and W_g. Then the BERT special tokens [CLS] and [SEP] are introduced, W_q and W_g are fed into the BERT model as a "sentence pair", and the output vector t_g corresponding to [CLS] is taken as the overall feature representation of the question and query graph sequence.
(2) Entity encoding. Since entity names are often proper nouns, word-level representations contain many out-of-vocabulary words, so character-level representations are used when encoding entities. First, the entity mention in the question and the entity name in the query graph sequence are represented as character sequences W_eq and W_eg. Then the [CLS] and [SEP] tokens are added, the "sentence pair" W_eq and W_eg is encoded with the BERT model, and the output vector t_e corresponding to [CLS] is taken as the overall semantic representation of the entity mention and the entity name.
(3) Comparison score. First, the output vectors corresponding to [CLS] from the BERT models of steps (1) and (2) are concatenated; then a linear layer reduces the dimension of the concatenated vector; finally, a new linear layer computes the comparison score of the question and the query graph.
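A minimal sketch of the ranking score computation in steps (1) to (3), written against the Hugging Face transformers API. The checkpoint name bert-base-chinese, the hidden size and the way the character-level sequences are built are assumptions rather than details from the patent.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class QueryGraphRanker(nn.Module):
    """Score a (question, serialized query graph) pair with BERT sentence-pair encodings."""
    def __init__(self, model_name: str = "bert-base-chinese", hidden: int = 768):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.bert = BertModel.from_pretrained(model_name)
        self.reduce = nn.Linear(2 * hidden, hidden)   # shrink the concatenated [CLS] vectors
        self.score = nn.Linear(hidden, 1)             # final comparison score

    def _cls(self, text_a: str, text_b: str) -> torch.Tensor:
        # Encode the pair as "[CLS] text_a [SEP] text_b [SEP]" and take the [CLS] vector.
        enc = self.tokenizer(text_a, text_b, return_tensors="pt", truncation=True)
        return self.bert(**enc).last_hidden_state[:, 0]

    def forward(self, question: str, graph_seq: str, mention: str, entity_name: str) -> torch.Tensor:
        t_g = self._cls(question, graph_seq)                        # question / query-graph pair
        # Rough character-level rendering of the entity pair (space-separated characters).
        t_e = self._cls(" ".join(mention), " ".join(entity_name))
        return self.score(self.reduce(torch.cat([t_g, t_e], dim=-1)))
```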
Based on the same inventive concept, the invention also provides a knowledge-graph-oriented query graph generation system, comprising:
a relation detection model construction module for selecting, from the candidate relation set, the main relation path that best matches the question;
a query graph serialization module for serializing the query graph and uniformly encoding the question and the query graph sequence; and
a query graph ranking model construction module for ranking the query graphs in the candidate set according to their semantic similarity scores with the question and selecting the best query graph.
The specific implementation of each module is the same as in the knowledge-graph-oriented query graph generation method and is not repeated here.
The invention provides a new approach to the query graph generation task in knowledge graph question answering and improves overall question-answering performance by completing each subtask of the query graph generation process, including relation detection, query graph serialization and query graph ranking. In addition, compared with the prior art, the proposed model is based on end-to-end matching, introduces no hand-designed features, and is simple to implement.
Claims (10)
1. A knowledge-graph-oriented query graph generation method, characterized by comprising the following steps:
constructing a relation detection model and selecting, from the candidate relation set, the main relation path that best matches the question;
serializing the query graph and uniformly encoding the question and the query graph sequence; and
constructing a query graph ranking model, ranking the query graphs in the candidate set according to their semantic similarity scores with the question, and selecting the best query graph.
2. The knowledge-graph-oriented query graph generation method of claim 1, wherein constructing the relation detection model and selecting, from the candidate relation set, the main relation path that best matches the question is specifically as follows:
(1) question and relation preprocessing: the question is represented at word level; first, using the result of entity linking, the entity mention in the question is replaced with the generic mark <e>, and the question is then segmented into a word sequence W_q; a candidate relation is represented at two granularities, word level and relation level: the relations forming a relation path are split into a word sequence and a relation-name sequence, and the union of the two, W_r, is taken as the relation representation;
(2) question and relation encoding: first, a word embedding layer maps the question sequence W_q and the relation sequence W_r to their corresponding word embedding sequences E_q and E_r, the word embedding layer being initialized with GloVe pre-trained word vectors; then two Bi-LSTMs semantically encode E_q and E_r respectively, yielding the semantic vector representations Q and R of the question and the candidate relation;
(3) bidirectional attention mechanism: the attention weights with which the question and the relation influence each other are first computed for each pair of word vectors q_i ∈ Q and r_j ∈ R using a trainable parameter matrix W_a; then the attention-weighted sums of the question and the relation under the attention mechanism are computed, where n denotes the number of words in the question and m denotes the number of words in the relation sequence, giving the attention representations of the question and the relation;
(4) global comparison score: max pooling is applied to the question and relation representations respectively to obtain the global semantic vectors q_g and r_g, and their cosine similarity is computed as the global comparison score of the question and the relation;
(5) local interaction score: first, the vectors at corresponding positions of Q and the attention representation of the relation are spliced to obtain the interaction vector sequence C; then a new Bi-LSTM extracts local semantic features from C to obtain the local feature representation T; finally, a feed-forward neural network and max pooling reduce the dimension of T, giving the local interaction score of the question and the relation;
(6) semantic similarity: the global comparison score and the local interaction score are added to obtain the total semantic similarity score of the question and the relation.
3. The knowledge-graph-oriented query graph generation method of claim 1, wherein serializing the query graph and uniformly encoding the question and the query graph sequence is specifically as follows:
(1) query graph splitting: the query graph is first split into five parts according to the constraint types defined when the query graph is expanded: the main relation path, entity constraint sub-path, type constraint sub-path, time constraint sub-path and ordinal constraint sub-path; then the entity name in the main relation path is replaced with the generic mark [unused1], and the literal value of the answer node in the main relation path is replaced with the answer type;
(2) sub-path serialization: each sub-path is traversed by depth-first search, and the information of the corresponding nodes and directed edges is recorded in order, giving a serialized representation of each sub-path;
(3) overall query graph serialization: the sub-path sequences are merged in the order main relation path, entity constraint sub-path, type constraint sub-path, time constraint sub-path, ordinal constraint sub-path, giving the serialized representation of the complete query graph.
4. The knowledge-graph-oriented query graph generation method of claim 1, wherein constructing the query graph ranking model, ranking the query graphs in the candidate set according to their semantic similarity scores with the question and selecting the best query graph is specifically as follows:
(1) question and query graph encoding: the question and query graph sequences are encoded with a BERT model as the encoder; first, the entity mention in the question is replaced with the generic mark [unused0], and the question and the query graph sequence are represented as word sequences W_q and W_g; then the BERT special tokens [CLS] and [SEP] are introduced, W_q and W_g are fed into the BERT model as a "sentence pair", and the output vector t_g corresponding to [CLS] is taken as the overall feature representation of the question and query graph sequence;
(2) entity encoding: first, the entity mention in the question and the entity name in the query graph sequence are represented as character sequences W_eq and W_eg; then the [CLS] and [SEP] tokens are added, the "sentence pair" W_eq and W_eg is encoded with the BERT model, and the output vector t_e corresponding to [CLS] is taken as the overall semantic representation of the entity mention and the entity name;
(3) comparison score: first, the output vectors corresponding to [CLS] from the BERT models of steps (1) and (2) are concatenated; then a linear layer reduces the dimension of the concatenated vector; finally, a new linear layer computes the comparison score of the question and the query graph.
5. A knowledge-graph-oriented query graph generation system, comprising:
a relation detection model construction module for selecting, from the candidate relation set, the main relation path that best matches the question;
a query graph serialization module for serializing the query graph and uniformly encoding the question and the query graph sequence; and
a query graph ranking model construction module for ranking the query graphs in the candidate set according to their semantic similarity scores with the question and selecting the best query graph.
6. The knowledge-graph-oriented query graph generation system of claim 5, wherein the relation detection model construction module selects, from the candidate relation set, the main relation path that best matches the question, specifically as follows:
(1) question and relation preprocessing: the question is represented at word level; first, using the result of entity linking, the entity mention in the question is replaced with the generic mark <e>, and the question is then segmented into a word sequence W_q; a candidate relation is represented at two granularities, word level and relation level: the relations forming a relation path are split into a word sequence and a relation-name sequence, and the union of the two, W_r, is taken as the relation representation;
(2) question and relation encoding: first, a word embedding layer maps the question sequence W_q and the relation sequence W_r to their corresponding word embedding sequences E_q and E_r, the word embedding layer being initialized with GloVe pre-trained word vectors; then two Bi-LSTMs semantically encode E_q and E_r respectively, yielding the semantic vector representations Q and R of the question and the candidate relation;
(3) bidirectional attention mechanism: the attention weights with which the question and the relation influence each other are first computed for each pair of word vectors q_i ∈ Q and r_j ∈ R using a trainable parameter matrix W_a; then the attention-weighted sums of the question and the relation under the attention mechanism are computed, where n denotes the number of words in the question and m denotes the number of words in the relation sequence, giving the attention representations of the question and the relation;
(4) global comparison score: max pooling is applied to the question and relation representations respectively to obtain the global semantic vectors q_g and r_g, and their cosine similarity is computed as the global comparison score of the question and the relation;
(5) local interaction score: first, the vectors at corresponding positions of Q and the attention representation of the relation are spliced to obtain the interaction vector sequence C; then a new Bi-LSTM extracts local semantic features from C to obtain the local feature representation T; finally, a feed-forward neural network and max pooling reduce the dimension of T, giving the local interaction score of the question and the relation;
(6) semantic similarity: the global comparison score and the local interaction score are added to obtain the total semantic similarity score of the question and the relation.
7. The knowledge-graph-oriented query graph generation system of claim 5, wherein the query graph serialization module serializes the query graph and uniformly encodes the question and the query graph sequence, specifically as follows:
(1) query graph splitting: the query graph is first split into five parts according to the constraint types defined when the query graph is expanded: the main relation path, entity constraint sub-path, type constraint sub-path, time constraint sub-path and ordinal constraint sub-path; then the entity name in the main relation path is replaced with the generic mark [unused1], and the literal value of the answer node in the main relation path is replaced with the answer type;
(2) sub-path serialization: each sub-path is traversed by depth-first search, and the information of the corresponding nodes and directed edges is recorded in order, giving a serialized representation of each sub-path;
(3) overall query graph serialization: the sub-path sequences are merged in the order main relation path, entity constraint sub-path, type constraint sub-path, time constraint sub-path, ordinal constraint sub-path, giving the serialized representation of the complete query graph.
8. The knowledge-graph-oriented query graph generation system of claim 5, wherein the query graph ranking model construction module ranks the query graphs in the candidate set according to their semantic similarity scores with the question and selects the best query graph, specifically as follows:
(1) question and query graph encoding: the question and query graph sequences are encoded with a BERT model as the encoder; first, the entity mention in the question is replaced with the generic mark [unused0], and the question and the query graph sequence are represented as word sequences W_q and W_g; then the BERT special tokens [CLS] and [SEP] are introduced, W_q and W_g are fed into the BERT model as a "sentence pair", and the output vector t_g corresponding to [CLS] is taken as the overall feature representation of the question and query graph sequence;
(2) entity encoding: first, the entity mention in the question and the entity name in the query graph sequence are represented as character sequences W_eq and W_eg; then the [CLS] and [SEP] tokens are added, the "sentence pair" W_eq and W_eg is encoded with the BERT model, and the output vector t_e corresponding to [CLS] is taken as the overall semantic representation of the entity mention and the entity name;
(3) comparison score: first, the output vectors corresponding to [CLS] from the BERT models of steps (1) and (2) are concatenated; then a linear layer reduces the dimension of the concatenated vector; finally, a new linear layer computes the comparison score of the question and the query graph.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1-4 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310363988.7A CN116383357A (en) | 2023-04-06 | 2023-04-06 | Knowledge graph-oriented query graph generation method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116383357A true CN116383357A (en) | 2023-07-04 |
Family
ID=86964185
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117194633A (en) * | 2023-09-12 | 2023-12-08 | 河海大学 | Dam emergency response knowledge question-answering system based on multi-level multipath and implementation method |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |