CN109241078B - Knowledge graph organization query method based on mixed database

Info

Publication number: CN109241078B
Application number: CN201811005179.4A
Authority: CN
Inventors: 李新川; 姚宏; 陈仁谣; 李圣文; 梁庆中; 郑坤
Original assignee: China University of Geosciences
Current assignee: China University of Geosciences
Priority date: 2018-08-30
Filing date: 2018-08-30
Publication date: 2021-07-20
Anticipated expiration: 2038-08-30
Also published as: CN109241078A

Abstract

The invention relates to a knowledge graph organization query method based on a mixed database, which comprises the following steps: acquiring a triple set from a preset data set; distinguishing an entity triple set and a relation triple set within the triple set; storing the entity triple set in Neo4j to obtain a knowledge base with entities; constructing an index on that knowledge base to obtain a knowledge base with indexes and entities; storing the relation triple set in Neo4j to obtain a knowledge base with indexes, entities and relations; storing entity ambiguity information in MySQL to construct an entity ambiguity word list; and adding the constructed entity ambiguity word list to the knowledge base with indexes, entities and relations to obtain a complete knowledge base. By combining the advantages of a relational database and a graph database, the invention provides a knowledge graph organization method that suits large-scale, open-domain general knowledge graphs and that improves query efficiency while optimizing the storage structure of the knowledge graph.

Description

Knowledge graph organization query method based on mixed database

Technical Field

The invention particularly relates to a knowledge graph organization query method based on a mixed database.

Background

As an efficient way of organizing and retrieving information, the knowledge graph has attracted intense research interest since Google introduced it in 2012. Entity extraction, attribute extraction, extraction of relations between entities, knowledge reasoning, and knowledge representation learning are all active research topics. However, few publications discuss how to design the underlying storage of the graph, or how to design that storage together with the query interface, and those that do are incomplete and scattered. Storage and query usually go together: efficient queries need a good storage structure to support them, and storage needs to be continuously optimized according to the characteristics of the queries.

Conventional databases, such as relational databases, can cluster data well according to the Schema-layer information of the knowledge graph and are efficient when accessing data of a given class. However, the Schema-level information of the data must be known before storage, and once the Schema is fixed it is difficult to change substantially. For a large-scale, open-domain knowledge graph, the types of entities and relations are usually numerous and complex, so the Schema-level information of the graph is hard to determine. In addition, relational databases struggle with multi-table join queries (typically with a join depth greater than 2), yet such queries are a very basic requirement of a knowledge graph.

NoSQL databases include key-value databases, column-family databases, document-oriented databases, graph databases, and so on. Among these, the data structure of a graph database is closest to a knowledge graph: it is a large graph model consisting of many entity nodes and the associations between them, which can represent the relations among concrete or abstract things well and also meets the need for local access to the graph. However, how to store information that does not fit the graph data structure, such as ambiguity information between entities, remains a problem to be solved.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a knowledge graph organization query method based on a hybrid database that addresses the above shortcomings of conventional relational database and graph database technology.

A knowledge graph organization query method based on a mixed database comprises the following steps:

step 1, acquiring a triple set in a preset data set;

step 2, distinguishing an entity triple set and a relation triple set from the triple set obtained in step 1;

step 3, storing the entity triple set on Neo4j to obtain a knowledge base with entities;

step 4, constructing indexes for the entity nodes stored in the knowledge base with entities to obtain a knowledge base with indexes and entities;

step 5, storing the relation triple set on Neo4j to obtain a knowledge base with indexes, entities and relations;

step 6, storing entity ambiguity information on MySQL to construct an entity ambiguity word list;

step 7, storing the entity ambiguity word list constructed in step 6 into the knowledge base with indexes, entities and relations obtained in step 5 to obtain a complete knowledge base;

step 8, inputting an entity to be queried, and querying the complete knowledge base obtained in step 7 with the two-level MySQL + Neo4j query method to obtain complete entity information.

Further, the preset data set in step 1 refers to general descriptions of entities and relations, and is any one or a combination of structured data, unstructured data and semi-structured data.

Further, the specific storage method in step 3 is as follows: distinguishing the different entity nodes in the entity triple set and storing them.
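Steps 2 and 3 can be illustrated with a minimal sketch. The text does not state the rule used to tell entity triples from relation triples, so the sketch assumes one: a triple counts as a relation triple when its object is itself the subject of some triple, and as an entity (attribute) triple otherwise. The function names `split_triples` and `build_entity_nodes` and the sample triples are hypothetical.

```python
from collections import defaultdict


def split_triples(triples):
    """Split (subject, predicate, object) triples into entity and relation triples.

    Assumed rule (not stated in the source): a triple is a relation triple when
    its object is itself the subject of some triple; otherwise it is an entity
    (attribute) triple.
    """
    subjects = {s for s, _, _ in triples}
    entity_triples, relation_triples = [], []
    for s, p, o in triples:
        if o in subjects:
            relation_triples.append((s, p, o))
        else:
            entity_triples.append((s, p, o))
    return entity_triples, relation_triples


def build_entity_nodes(entity_triples):
    """Group entity triples by subject so each distinct entity becomes one node."""
    nodes = defaultdict(dict)
    for s, p, o in entity_triples:
        nodes[s][p] = o
    return dict(nodes)


# Hypothetical sample data.
triples = [
    ("Qilixiang (song)", "language", "Mandarin"),
    ("Qilixiang (song)", "duration", "4:56"),
    ("Qilixiang (album)", "release year", "2004"),
    ("Qilixiang (song)", "album", "Qilixiang (album)"),
]
entity_triples, relation_triples = split_triples(triples)
nodes = build_entity_nodes(entity_triples)
```

Here `nodes` holds one property dictionary per distinct entity, which is the shape a graph store would persist as one node each.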

Further, the specific storage method in step 5 is as follows: distinguishing the head and tail entity nodes in the relation triple set, then querying the head and tail entities in the knowledge base with indexes and entities obtained in step 4; if both are hit, constructing a relation between the head and tail nodes; if not, discarding the relation.
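The hit-or-discard rule of step 5 can be sketched as follows. A plain dictionary stands in for the index built in step 4, and the names `store_relations` and `entity_index` are hypothetical; this is an illustration under those assumptions, not the patented implementation.

```python
def store_relations(relation_triples, entity_index):
    """Attach relations only when both endpoints are found in the entity index.

    entity_index maps an entity name to its node id (a stand-in for the index
    built in step 4). Returns (edges, discarded).
    """
    edges, discarded = [], []
    for head, relation, tail in relation_triples:
        head_id = entity_index.get(head)
        tail_id = entity_index.get(tail)
        if head_id is not None and tail_id is not None:
            # Both head and tail are hit: build the relation between the nodes.
            edges.append((head_id, relation, tail_id))
        else:
            # At least one endpoint is missing: the relation is discarded.
            discarded.append((head, relation, tail))
    return edges, discarded


# Hypothetical sample data.
entity_index = {"Qilixiang (song)": 0, "Qilixiang (album)": 1}
relation_triples = [
    ("Qilixiang (song)", "album", "Qilixiang (album)"),
    ("Qilixiang (song)", "composer", "Unknown Entity"),
]
edges, discarded = store_relations(relation_triples, entity_index)
```

Looking up both endpoints through the index before creating an edge is what makes the index from step 4 a prerequisite for step 5.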

Further, the entity ambiguity in step 6 refers to polysemy (one word with multiple meanings) and synonymy among entities.

Further, the two-level query structure of MySQL + Neo4j specifically includes:

(1) inputting an entity to be queried, first performing an SQL query in the MySQL database, and judging whether the query hits: if the SQL query hits, judging that the entity to be queried is ambiguous, returning all ambiguous entities corresponding to it to the user, disambiguating them, and inputting the disambiguated entity into the Neo4j database for a CQL (Cypher Query Language) query; if the SQL query misses, judging that the entity to be queried is not ambiguous, and passing it directly to the Neo4j database for a CQL query;

(2) taking the entity to be queried or the disambiguated entity as the input of the Neo4j database, performing the CQL query, and obtaining complete entity information as the final output.

Further, the method for judging whether the SQL query hits is as follows: comparing the entity to be queried with the entity ambiguity word list obtained in step 6; if a match exists, the query hits; otherwise, the query misses.
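The hit test against the entity ambiguity word list can be sketched with an in-memory mapping. The sketch assumes the word list is available as (entity name, ambiguous entity) rows; the function name `build_vocabulary` and the placeholder values S1, S2, S9, E1 to E3 are illustrative only.

```python
from collections import defaultdict


def build_vocabulary(rows):
    """Group (entity_name, ambiguous_entity) rows into name -> [entities] pairs."""
    vocabulary = defaultdict(list)
    for name, entity in rows:
        if entity not in vocabulary[name]:  # skip duplicate rows
            vocabulary[name].append(entity)
    return dict(vocabulary)


rows = [("S1", "E1"), ("S1", "E2"), ("S2", "E3"), ("S1", "E1")]
vocabulary = build_vocabulary(rows)

hit = vocabulary.get("S1")   # match found: the candidates are returned
miss = vocabulary.get("S9")  # None: the name is absent, so the query misses
```

A miss is signalled simply by the absence of the name, so an unambiguous entity costs one lookup before it is passed on to the graph query.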

The invention has the advantages that: the knowledge graph organization method based on the mixed database is provided by combining the advantages of the relational database and the graph database, is suitable for the knowledge graph in the general large-scale open field, optimizes the storage structure of the knowledge graph and improves the query efficiency of the knowledge graph.

Drawings

The invention will be further described with reference to the accompanying drawings and examples, in which:

FIG. 1 is a flow chart of a knowledge-graph organization query method based on a hybrid database according to the present invention;

FIG. 2 is a two-level query structure diagram of MySQL + Neo4j of the present invention.

Detailed Description

For a more clear understanding of the technical features, objects and effects of the present invention, embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

As shown in FIG. 1, the knowledge graph organization query method based on a hybrid database includes:

step 1, acquiring a triple set in a preset data set, wherein the preset data set refers to general descriptions of entities and relations and comprises structured data, unstructured data and semi-structured data;

step 2, distinguishing an entity triple set and a relation triple set from the triple set obtained in step 1;

step 3, storing the entity triple set on Neo4j: distinguishing the different entity nodes in the entity triple set and storing them to obtain a knowledge base with entities;

step 4, constructing indexes for the entity nodes stored in the knowledge base with entities to obtain a knowledge base with indexes and entities;

step 5, storing the relation triple set on Neo4j: distinguishing the head and tail entity nodes in the relation triple set, then querying the head and tail entities in the knowledge base with indexes and entities obtained in step 4; if both are hit, establishing a relation between the head and tail nodes, and otherwise discarding the relation, to obtain a knowledge base with indexes, entities and relations;

step 6, storing entity ambiguity information on MySQL to construct an entity ambiguity word list, wherein entity ambiguity refers to polysemy (one word with multiple meanings) and synonymy among entities;

step 7, storing the entity ambiguity word list constructed in step 6 into the knowledge base with indexes, entities and relations obtained in step 5 to obtain a complete knowledge base.
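For the Neo4j side of the steps above, the statements might look like the following sketch. The Cypher text is illustrative: the label `Entity`, the property `name`, the relationship type `RELATED`, the index name `entity_name`, and the helper `build_statements` are all assumptions rather than names from the source, and the index statement uses the form accepted by newer Neo4j releases (older releases use a different syntax).

```python
# Illustrative Cypher; label, property, and relationship names are assumptions.
CREATE_ENTITY = "MERGE (e:Entity {name: $name}) SET e += $props"
# Index syntax differs between Neo4j versions; this is the newer form.
CREATE_INDEX = "CREATE INDEX entity_name IF NOT EXISTS FOR (e:Entity) ON (e.name)"
# MATCH yields no rows when either endpoint is missing, so nothing is created
# in that case, which mirrors discarding the relation on a miss.
CREATE_RELATION = (
    "MATCH (h:Entity {name: $head}), (t:Entity {name: $tail}) "
    "MERGE (h)-[:RELATED {type: $relation}]->(t)"
)


def build_statements(nodes, relation_triples):
    """Return (cypher, params) pairs in the order of steps 3, 4 and 5."""
    statements = [
        (CREATE_ENTITY, {"name": name, "props": props})
        for name, props in nodes.items()
    ]
    statements.append((CREATE_INDEX, {}))
    statements += [
        (CREATE_RELATION, {"head": h, "relation": r, "tail": t})
        for h, r, t in relation_triples
    ]
    return statements


# Hypothetical sample data.
statements = build_statements(
    {"Qilixiang (song)": {"duration": "4:56"}, "Qilixiang (album)": {}},
    [("Qilixiang (song)", "album", "Qilixiang (album)")],
)
```

The sketch only assembles statements and parameters; with a Neo4j client, each pair would be sent to the database in order, entities first so that the index and the relation lookups have nodes to work on.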

The two-stage query method of MySQL + Neo4j is as follows: first, MySQL is queried to check whether the entity has entity ambiguity information; if so, the entity is disambiguated and then queried in Neo4j; otherwise, the entity is queried in Neo4j directly. As shown in FIG. 2, the query process is as follows:

1. SQL query (as shown by the number 1 in FIG. 2)

Since it is not known in advance whether the input entity name is ambiguous, an SQL query is first run on it in the MySQL database; that is, the input entity name is matched against the first column of the ambiguity word list in FIG. 2. The first column of the word list holds the entity name and the second column holds the ambiguous entities; for example, the key-value pair <S1, <E1, E2>> indicates that the entity name S1 is ambiguous and that the ambiguous entities E1 and E2 both point to the same string S1. Depending on whether the query hits, one of the following two cases applies:

1) SQL query hit:

that is, the input entity name is ambiguous (as shown in FIG. 2, the input entity name Sm is ambiguous, so the query hit returns the ambiguous entities Ek to Ek+n that point to the same string Sm). All ambiguous entities Ek to Ek+n corresponding to the input are returned to the user and disambiguated (reference numeral 2 in FIG. 2; the specific disambiguation method depends on the application scenario), and the disambiguated entity Ek+i is input into the Neo4j database for a CQL query (reference numeral 3 in FIG. 2).

2) SQL query miss:

namely, the input entity name is not ambiguous, and the CQL query is directly carried out.

2. CQL query (i.e., query to knowledge base in FIG. 2)

Whether or not the SQL query hits, the first stage yields only an entity name. To obtain the complete information of the entity, that entity name must be used as the input of the Neo4j database for a CQL query, whose result, the complete entity information, is the final response to the user's input.
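The two-stage flow described above can be sketched end to end with stand-ins: an in-memory `sqlite3` table plays the role of the MySQL ambiguity word list, and a dictionary plays the role of the Neo4j knowledge base. The table name `ambiguity`, its columns, and the functions `two_level_query` and `pick_song` are hypothetical, and the disambiguation callback is deliberately trivial because the source leaves that step to the application.

```python
import sqlite3


def two_level_query(name, conn, graph, disambiguate):
    """Stage 1: look the name up in the word list. Stage 2: query the graph."""
    rows = conn.execute(
        "SELECT entity FROM ambiguity WHERE name = ? ORDER BY entity", (name,)
    ).fetchall()
    if rows:
        # Hit: the name is ambiguous, so choose one of the candidates.
        candidates = [row[0] for row in rows]
        entity = disambiguate(name, candidates)
    else:
        # Miss: the name is unambiguous and is passed through unchanged.
        entity = name
    return entity, graph.get(entity)


def pick_song(name, candidates):
    """Toy disambiguation rule for this example only."""
    return next(c for c in candidates if "song" in c)


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ambiguity (name TEXT, entity TEXT)")
conn.executemany(
    "INSERT INTO ambiguity VALUES (?, ?)",
    [("Qilixiang", "Qilixiang (album)"), ("Qilixiang", "Qilixiang (song)")],
)
graph = {
    "Qilixiang (song)": {"duration": "4:56"},
    "Qilixiang (album)": {"release year": "2004"},
}

hit = two_level_query("Qilixiang", conn, graph, pick_song)
miss = two_level_query("Qilixiang (song)", conn, graph, pick_song)
```

Both calls end at the same graph lookup; the only difference is whether the name passes through disambiguation first.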

Specific query examples are as follows:

Query example 1: the input entity name has entity ambiguity

1) Input entity: Qilixiang (七里香)

2) SQL query: look up the ambiguity word list in MySQL

3) SQL query hit (indicating that the input entity name "Qilixiang" is ambiguous), returning the ambiguous entities that point to "Qilixiang":

Qilixiang (Jay Chou's 2004 album)

Qilixiang (plant of the genus Murraya, family Rutaceae)

Qilixiang (song sung by Jay Chou)

Qilixiang (title of a poem and of a poetry collection)

Qilixiang (Thai TV series)

Qilixiang (traditional Chinese medicine)

Qilixiang (novel)

………………

4) Entity disambiguation:

Assume that entity disambiguation is performed here according to context.

The context is: "Jay Chou's Qilixiang is a song that I like."

So the entity disambiguated according to context is: Qilixiang (song sung by Jay Chou)

5) CQL query:

the disambiguated entity "Qilixiang (song sung by Jay Chou)" is queried for entity information in Neo4j, and the final output is obtained:

Qilixiang (song sung by Jay Chou)

BaiduTAG: musical work/single

Chinese name: Qilixiang (七里香)

Release date: 2004

Original singer: Jay Chou

Lyricist: Fang Wenshan

Album: Qilixiang (Jay Chou's 2004 album)

Duration: 4:56

Language: Mandarin Chinese

Arranger: Zhong Xingmin

Composer: Jay Chou

Music style: Chinese style

………………
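Step 4) of query example 1 disambiguates by context, and the text leaves the concrete disambiguation method to the application scenario. One simple possibility, shown here purely as an assumption, is to pick the candidate whose descriptor shares the most words with the context. The descriptors are simplified English paraphrases of a few of the candidates listed above, and the names `tokens` and `disambiguate_by_context` are hypothetical.

```python
import re


def tokens(text):
    """Lowercase word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def disambiguate_by_context(candidates, context):
    """Pick the candidate whose descriptor shares the most tokens with the context.

    candidates maps each ambiguous entity to a short descriptor. Ties are broken
    by the order in which the candidates are listed.
    """
    context_tokens = tokens(context)
    return max(
        candidates,
        key=lambda entity: len(tokens(candidates[entity]) & context_tokens),
    )


# Simplified, hypothetical descriptors for a few of the candidates.
candidates = {
    "Qilixiang (album)": "Jay Chou 2004 album",
    "Qilixiang (plant)": "Murraya plant of Rutaceae",
    "Qilixiang (song)": "song sung by Jay Chou",
    "Qilixiang (TV series)": "Thai TV series",
}
context = "Jay Chou's Qilixiang is a song that I like."
chosen = disambiguate_by_context(candidates, context)
```

Word overlap is only a placeholder heuristic; an application could substitute user selection or any other disambiguation method without changing the two-stage query around it.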

Query example 2: the input entity name has no entity ambiguity

1) Input entity: Qilixiang (song sung by Jay Chou)

2) SQL query: look up the ambiguity word list in MySQL

3) SQL query miss (indicating that the input entity name is not ambiguous)

4) CQL query:

the entity information is queried in Neo4j, and the final output is obtained:

Qilixiang (song sung by Jay Chou)

BaiduTAG: musical work/single

Chinese name: Qilixiang (七里香)

Release date: 2004

Original singer: Jay Chou

Lyricist: Fang Wenshan

Duration: 4:56

Language: Mandarin Chinese

Arranger: Zhong Xingmin

Composer: Jay Chou

Music style: Chinese style

………………

While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A knowledge graph organization query method based on a mixed database is characterized by comprising the following steps:

step 1, acquiring a triple set in a preset data set;

2. The method of claim 1, wherein the preset data set in step 1 refers to general descriptions of entities and relations, and is any one or a combination of structured data, unstructured data and semi-structured data.

3. The knowledge graph organization query method based on the hybrid database according to claim 1, wherein the specific storage method in step 3 is: distinguishing the different entity nodes in the entity triple set and storing them.

4. The knowledge graph organization query method based on the hybrid database according to claim 1, wherein the specific storage method in step 5 is: distinguishing the head and tail entity nodes in the relation triple set, then querying the head and tail entities in the knowledge base with indexes and entities obtained in step 4; if both are hit, constructing a relation between the head and tail nodes; if not, discarding the relation.

5. The method of claim 1, wherein the entity ambiguity in step 6 refers to polysemy (one word with multiple meanings) and synonymy among entities.

6. The knowledge-graph organization query method based on the hybrid database according to claim 1, wherein the two-stage query structure of MySQL + Neo4j specifically comprises:

7. The method of claim 6, wherein the method for judging whether the SQL query hits is: comparing the entity to be queried with the entity ambiguity word list obtained in step 6; if a match exists, the query hits; otherwise, the query misses.