WO2020135050A1

WO2020135050A1 - Knowledge mapping system and map server thereof

Info

Publication number: WO2020135050A1
Application number: PCT/CN2019/124555
Authority: WO
Inventors: 周游; 顾江; 刘涛
Original assignee: 颖投信息科技(上海)有限公司
Priority date: 2018-12-29
Filing date: 2019-12-11
Publication date: 2020-07-02
Also published as: CN109670089A

Abstract

A knowledge mapping system and a map server (10) thereof, the map server (10) comprising: a map database interface (11), configured for receiving a user data operation request, and, in accordance with the type of the data operation request, calling a corresponding interface so as to perform an operation with regard to a distributed data storage module; a map data write interface (12), configured for using a type of data to be written so as to create or update node data or edge data within a distributed data storage module, and for returning a unique index of said data in the distributed data storage module; a map data query interface (13), configured for using query conditions to obtain data stored in a distributed data storage module, and for returning same to a user in accordance with preset node-data and edge-data formats; and a distributed data storage module (14), configured for providing data storage and query services to a map server. The system is able to effectively solve the problem of map databases being unable to adapt to map data management and search in big-data environments.

Description

Knowledge graph system and graph server

This application claims priority to Chinese invention patent application No. 201811635242.2, entitled "Knowledge Graph System and its Graph Server" and filed on December 29, 2018, the entire content of which is hereby incorporated by reference.

Technical field

The present application relates to the technical field of knowledge graph processing, and in particular, to a knowledge graph system and graph server.

Background technique

A knowledge graph is a database that realizes semantic search by storing real-world entities and their relationships, and that stores and queries data in a graph data structure. Each entity is identified by a globally unique identifier (ID), "property-value" pairs (PVP, Property-Value Pair) are used to represent the intrinsic characteristics of the entity, and relationships are used to connect two entities and represent the association between them. The knowledge graph can be regarded as one huge graph: the nodes of the graph represent entities, and the edges represent the relationships between nodes (edges are composed of attributes and relationships).

Taking a financial knowledge graph as an example, it represents companies, management, news events, and users' personal preferences as entities and establishes links between them, making financial data search more efficient and providing investors with targeted investment advice.

Graph databases are the mainstream choice for storing and querying knowledge graph data. At present, Neo4j is a relatively advanced native graph database that provides native graph data storage, retrieval and processing. Neo4j is specially optimized for graph storage, which can greatly improve the efficiency and speed of graph traversal. Neo4j provides Cypher as its graph query language, which has simple semantics and is convenient to use.

However, in practical applications Neo4j is better suited to lightweight scenarios. Under large data loads, graph data insertion and traversal performance is poor; in addition, due to the limitations of its software architecture, Neo4j can only work on a single machine, so system scalability and fault tolerance are out of the question. With the rapid growth of enterprise data volume, Neo4j in a stand-alone deployment is clearly unable to meet the data management and retrieval requirements of a knowledge graph.

Summary of the invention

This application provides a knowledge graph system and a graph server, which are used to solve the problem that existing graph databases cannot cope with graph data management and retrieval at large data volumes.

A graph server disclosed in the present application includes a graph database interface, a graph data writing interface, a graph data query interface, and a distributed data storage module, wherein: the graph database interface is configured to receive a user's data operation request and, according to the type of the data operation request, call the graph data writing interface or the graph data query interface to implement the operation on the distributed data storage module; the graph data writing interface is configured to create or update node data or edge data in the distributed data storage module according to the type of data to be written in the data operation request, and to return the unique index of the data in the distributed data storage module; the graph data query interface is configured to obtain the data stored in the distributed data storage module according to the query conditions in the data operation request, and to return it to the user in the preset node and edge data formats; the distributed data storage module is a distributed file system or a distributed database and is configured to provide data storage and query services for the graph server.

Preferably, the graph server further includes a query disassembly module configured to disassemble a query request with a complexity greater than a preset condition into multiple sub-query requests, and call the graph data query interface in sequence or concurrently to implement the user's data query request.

Preferably, the graph server is further provided with a memory cache, which is configured to cache data recently accessed by the user and/or data whose number of query hits is greater than or equal to a preset hot data threshold.

Preferably, the graph server is provided with a service discovery mechanism for the distributed data storage module; each storage server of the distributed data storage module is provided with a heartbeat detection interface to report its device status to the graph server in real time; when a new storage server joins or an existing storage server exits, the graph server automatically updates the configuration of the distributed data storage module through the service discovery mechanism, and switches the storage and query services to the corresponding storage server.

Preferably, the graph server further includes a first data preprocessing module configured to extract structured or unstructured original data and convert it into node data and/or edge data of the graph database.

A knowledge graph system disclosed in this application includes a client and the graph server described above; the client is connected to the graph server through a network; the client includes a user interface configured to receive a user's data operation request, send it to the graph database interface of the graph server through the network, and receive and display the data operation results of the graph server.

Preferably, the client further includes a second data pre-processing module configured to convert the structured or unstructured original data into node data or edge data of the graph database.

Preferably, the client further includes an intermediate persistent file system configured to temporarily store node data and edge data processed by the second data preprocessing module.

Preferably, the user interface establishes a connection with the graph server by means of hypertext transfer protocol, websocket protocol or remote procedure call protocol.

Preferably, the data operation request adopts the syntax format of Gremlin, GSQL or SPARQL language.

Compared with the prior art, this application has the following advantages:

The server embodiment of this application decouples the system by setting interfaces at various key points. The storage layer can rapidly scale out horizontally as the data volume grows. In a multi-machine deployment, it can effectively solve the problem of an inaccessible server or unavailable data. The flexible interface definition not only enables the system to adapt to various database types, but also allows an appropriate storage method to be selected according to business needs; it also allows the graph data query interface to be deployed flexibly across multiple machines to adapt to highly concurrent application scenarios.

In a further preferred embodiment, each interface uses a unified graph traversal language for interaction without concern for the implementation of the underlying architecture, thereby ensuring the stability of upper-layer applications.

Brief description of the drawings

The drawings are only for the purpose of showing the preferred embodiments, and are not considered to limit the present application. Furthermore, throughout the drawings, the same reference symbols are used to denote the same components. In the drawings:

FIG. 1 is a schematic structural diagram of an embodiment of a graph server according to this application;

FIG. 2 is a schematic structural diagram of an embodiment of a knowledge graph system according to this application;

FIG. 3 is a schematic diagram of a graph data writing process according to an embodiment of this application;

FIG. 4 is a schematic diagram of a graph data query process according to an embodiment of this application.

Detailed description

In order to make the above objects, features and advantages of the present application more obvious and understandable, the present application will be described in further detail below with reference to the accompanying drawings and specific embodiments.

In the description of the present application, it should be understood that the terms "first" and "second" are used for descriptive purposes only, and cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Thus, the features defined as "first" and "second" may explicitly or implicitly include one or more of the features. The meaning of "plurality" is two or more, unless specifically defined otherwise. The terms "including", "comprising" and similar terms should be understood as open terms, i.e. "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one other embodiment". Related definitions of other terms will be given in the description below.

One of the inventive concepts of the present application is, in view of the problems of existing graph databases, to restructure the entire knowledge graph system into an application layer, a query engine, and an underlying storage layer, so as to decouple the system and reduce the dependencies between modules.

The underlying storage layer implements a common interface for graph data storage and query (that is, a virtual graph data layer). By encapsulating common graph processing operations in the graph data layer, the underlying database only needs to provide basic operations such as insert, delete, update, and query, which reduces the coupling to the underlying database and makes it replaceable. With distributed storage, the graph data can be scaled out and redundantly backed up on a file system composed of multiple machines to achieve data consistency and fault tolerance, so that the underlying storage has high I/O performance and a flexible schema definition and is suitable for storing nodes, edges and their attributes.

The query engine parses the query language, generates a query plan for graph traversal, and calls the underlying storage interface to store and retrieve data. After parsing the query language, the query engine can optimize operations such as shortest path planning and data aggregation. Distributed computing and caching can improve query performance according to data volume and computing resources, and provide high-concurrency, low-latency services.

The application layer provides a unified query language and connection methods: it connects to the graph server remotely through protocols such as the Hypertext Transfer Protocol (HTTP), websocket (a protocol defined by RFC 6455 for full-duplex communication over a single TCP connection) and Remote Procedure Call (RPC), and interacts with the query engine through graph query languages such as Gremlin, GSQL, and SPARQL.

Referring to FIG. 1, there is shown a schematic structural diagram of an embodiment of a graph server of the present application, including graph database interface 11, graph data writing interface 12, graph data query interface 13 and distributed data storage module 14.

The graph database interface 11 is used to receive a data operation request issued by a user, and call the graph data writing interface 12 or the graph data query interface 13 to implement operations on the distributed data storage module 14 according to the type of the data operation request.

The types of data operation requests include the creation or update of graph database node data, the creation or update of edge data, and the query of graph database. In specific implementation, the data operation request sent to the graph database interface 11 can use the syntax format of graph data manipulation languages such as Gremlin, GSQL, SPARQL, etc. Taking Gremlin as an example, suppose a node and an edge need to be created in the graph database g. Use the g.addV() command to issue a node creation request, and the g.addE() command to issue an edge creation request.

The graph data writing interface 12 is used to create or update node data or edge data in the distributed data storage module 14 according to the type of data to be written (node data or edge data) in the data operation request, and to return the unique index of the data in the distributed data storage module 14.

The graph data query interface 13 is used to obtain the data stored in the distributed data storage module 14 according to the query conditions in the data operation request, and return it to the user in a preset data format of nodes and edges.
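The division of labour among interfaces 11, 12 and 13 can be illustrated with a minimal Python sketch. It is not the implementation described in this application: the class names `InMemoryStore` and `GraphServer`, the request dictionary layout (`op`, `data`, `where`) and the `node-1`/`edge-3` index format are all assumptions made for illustration, and a single in-process dictionary stands in for the distributed data storage module 14.

```python
import itertools
from typing import Any, Dict, List, Optional


class InMemoryStore:
    """Hypothetical stand-in for the distributed data storage module.

    It offers only basic put/get/scan operations, which is all the
    graph data layer is assumed to need from the underlying storage.
    """

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, Any]] = {}

    def put(self, key: str, record: Dict[str, Any]) -> None:
        self._records[key] = record

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        return self._records.get(key)

    def scan(self) -> List[Dict[str, Any]]:
        return list(self._records.values())


class GraphServer:
    """Illustrative graph server: one entry point, two internal interfaces."""

    def __init__(self, store: InMemoryStore) -> None:
        self._store = store
        self._ids = itertools.count(1)

    def handle(self, request: Dict[str, Any]) -> Any:
        """Graph database interface: dispatch on the request type."""
        op = request["op"]
        if op in ("upsert_node", "upsert_edge"):
            return self._write(request)
        if op == "query":
            return self._query(request["where"])
        raise ValueError(f"unsupported operation: {op}")

    def _write(self, request: Dict[str, Any]) -> str:
        """Graph data writing interface: create or update, return the index."""
        kind = "node" if request["op"] == "upsert_node" else "edge"
        key = request.get("id") or f"{kind}-{next(self._ids)}"
        self._store.put(key, {**request["data"], "id": key, "kind": kind})
        return key

    def _query(self, where: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Graph data query interface: return records matching all conditions."""
        return [
            record
            for record in self._store.scan()
            if all(record.get(k) == v for k, v in where.items())
        ]
```

Because the store is only asked to put, get and scan, it could in principle be swapped for another backend without touching `GraphServer`, which is the decoupling this section describes.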

For example, consider a social graph database (g) composed of person nodes, with friendship relationships between the person nodes. To query the 2-hop neighbors of "Zhang San" in graph object g, taking the Gremlin language as an example, the following traversal can be used:

g.V().has('name', 'Zhang San')
  .repeat(bothE().hasLabel('friendship').otherV().hasLabel('person'))
  .times(2)

The above example starts from the node "Zhang San", follows friendship edges to nodes of type person, and repeats this step twice, thereby finding the 2-hop neighbors of "Zhang San".
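To make the traversal semantics concrete, here is a small Python sketch that walks an in-memory edge list in the same way. The function name `friend_walk`, the tuple layout of edges and the sample data are assumptions for illustration only; this is not how a Gremlin server executes the query. Because the traversal places no restriction on revisiting nodes, a two-step walk can return to the starting person and can reach the same person more than once, so a caller that wants distinct neighbors has to deduplicate and drop the start node itself.

```python
from typing import Dict, List, Tuple

# (source id, edge label, target id); edges are followed in both directions
Edge = Tuple[str, str, str]


def friend_walk(labels: Dict[str, str], edges: List[Edge], start: str, hops: int = 2) -> List[str]:
    """Follow 'friendship' edges to 'person' nodes, repeated `hops` times."""
    frontier = [start]
    for _ in range(hops):
        next_frontier = []
        for vertex in frontier:
            for source, label, target in edges:
                if label != "friendship":
                    continue
                if source == vertex:
                    other = target
                elif target == vertex:
                    other = source
                else:
                    continue
                if labels.get(other) == "person":
                    next_frontier.append(other)
        frontier = next_frontier
    return frontier


# Made-up sample data: A is friends with B and D, B is friends with C.
labels = {name: "person" for name in "ABCDE"}
edges = [
    ("A", "friendship", "B"),
    ("B", "friendship", "C"),
    ("A", "friendship", "D"),
    ("C", "colleague", "E"),
]
walk = friend_walk(labels, edges, "A")
distinct_two_hop = sorted(set(walk) - {"A"})
```

Here `walk` still contains the start node "A" (reached back through B and through D), while `distinct_two_hop` keeps only "C".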

The distributed data storage module 14 is a distributed file system or a distributed database, and is used to provide data storage and query services for the graph server 10.

In a specific implementation, the service discovery mechanism of the distributed data storage module 14 can be set on the graph server 10; meanwhile, a heartbeat detection interface is set on each storage server of the distributed data storage module 14 to report the device status to the graph server 10 in real time. When a new storage server is added or an existing storage server is withdrawn, the graph server 10 can automatically update the configuration of the distributed data storage module through the above service discovery mechanism, and switch the storage and query services to the corresponding storage server.

This application encapsulates common graph processing operations in the graph data layer (that is, graph data writing interface 12 and graph data query interface 13), so that the underlying data storage module only needs to provide basic operations such as insert, delete, update, and query. This reduces the degree of coupling to the underlying data storage module and makes the underlying data storage module replaceable.
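A minimal sketch of such a heartbeat-driven registry is shown below. The class `StorageRegistry`, the five-second timeout and the CRC-based routing rule are assumptions chosen for illustration; the application does not specify how liveness is decided or how requests are assigned to storage servers.

```python
import time
import zlib
from typing import Dict, List, Optional


class StorageRegistry:
    """Tracks storage servers by their most recent heartbeat (illustrative)."""

    def __init__(self, timeout_s: float = 5.0) -> None:
        self._timeout_s = timeout_s
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, server_id: str, now: Optional[float] = None) -> None:
        """Register a new server or refresh an existing one."""
        self._last_seen[server_id] = time.monotonic() if now is None else now

    def deregister(self, server_id: str) -> None:
        """Remove a server that announced its exit."""
        self._last_seen.pop(server_id, None)

    def live_servers(self, now: Optional[float] = None) -> List[str]:
        """Servers whose last heartbeat is within the timeout window."""
        now = time.monotonic() if now is None else now
        return sorted(
            server
            for server, seen in self._last_seen.items()
            if now - seen <= self._timeout_s
        )

    def route(self, key: str, now: Optional[float] = None) -> str:
        """Pick a live server for a key; assumed rule: CRC32 modulo live count."""
        live = self.live_servers(now)
        if not live:
            raise RuntimeError("no storage server available")
        return live[zlib.crc32(key.encode("utf-8")) % len(live)]
```

The optional `now` argument exists only so the behaviour can be exercised deterministically; a server that stops sending heartbeats simply drops out of `live_servers` and requests are routed to the remaining ones.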

The underlying data storage module can customize the storage format according to its storage structure. For example, if the underlying data storage module is a relational database, the nodes and edges can be stored as a table with the following two-dimensional structure:

Node storage structure:

Node id	Node attribute 1	Node attribute 2
A	A	A

Edge storage structure:

Edge id	Source node id	Target node id	Edge attribute 1	Edge attribute 2
A	A	A	A	A
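As one possible concrete form of these two tables, the following Python sketch uses the standard `sqlite3` module with an in-memory database. The table names `node` and `edge`, the attribute columns (`name`, `kind`, `label`, `since`) and the sample rows are assumptions for illustration; the application only prescribes an id column plus attribute columns for nodes, and an id, source node id, target node id plus attribute columns for edges.

```python
import sqlite3
from typing import List


def create_graph_tables(conn: sqlite3.Connection) -> None:
    """Create one table for nodes and one for edges."""
    conn.executescript(
        """
        CREATE TABLE node (
            id   TEXT PRIMARY KEY,
            name TEXT,  -- node attribute 1 (illustrative)
            kind TEXT   -- node attribute 2 (illustrative)
        );
        CREATE TABLE edge (
            id        TEXT PRIMARY KEY,
            source_id TEXT NOT NULL REFERENCES node(id),
            target_id TEXT NOT NULL REFERENCES node(id),
            label     TEXT,  -- edge attribute 1 (illustrative)
            since     TEXT   -- edge attribute 2 (illustrative)
        );
        """
    )


def out_neighbors(conn: sqlite3.Connection, node_id: str) -> List[str]:
    """Names of nodes reachable over one outgoing edge, via a join."""
    rows = conn.execute(
        "SELECT n.name FROM edge e JOIN node n ON n.id = e.target_id "
        "WHERE e.source_id = ? ORDER BY n.name",
        (node_id,),
    ).fetchall()
    return [name for (name,) in rows]


conn = sqlite3.connect(":memory:")
create_graph_tables(conn)
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [("n1", "Zhang San", "person"), ("n2", "Li Si", "person")],
)
conn.execute(
    "INSERT INTO edge VALUES (?, ?, ?, ?, ?)",
    ("e1", "n1", "n2", "friendship", "2018-12-29"),
)
```

With this layout a one-hop lookup is a single join, while multi-hop traversals need repeated joins or repeated queries, which is where the graph data layer above the store takes over.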

If the underlying data storage module is unstructured data, the storage format that can be used is:

node

In a further preferred embodiment, in order to improve query performance, the graph server is also provided with a query disassembly module, which is used to disassemble a query request whose complexity is greater than a preset condition into multiple sub-query requests, and to call the graph data query interface 13 sequentially or concurrently to implement the user's data query request.

For example, for a shortest path query request within 5 steps, when the graph data query interface 13 can only complete path queries within 3 steps, the request can be disassembled into two sub-query requests. The second sub-query can use the output of the first sub-query as its input, and the second sub-query can be further decomposed into multiple concurrent calls to the graph data query interface 13 as needed.

By disassembling a complex query into multiple calls to the graph data query interface, this application can expand a single query with limited capability into the concurrent operation of multiple query servers, thereby greatly improving the query efficiency of the graph database.

In another further preferred embodiment, in order to further improve the query response speed for graph data, the graph server is further provided with a memory cache for caching data recently accessed by the user and/or data whose number of query hits is greater than or equal to a preset hot data threshold.
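The 5-step example can be sketched in Python as follows. The names `bounded_query` and `shortest_path_length`, the adjacency-dictionary graph and the constant `MAX_HOPS_PER_QUERY = 3` are assumptions for illustration; `bounded_query` merely stands in for one call to the graph data query interface. The sketch runs the sub-queries one after the other and, as written, covers requests of up to twice the per-query limit; the second sub-query could also be split per boundary node and issued concurrently.

```python
from typing import Dict, List, Optional, Set

Graph = Dict[str, List[str]]  # adjacency lists, one entry per node

MAX_HOPS_PER_QUERY = 3  # assumed capability limit of a single query call


def bounded_query(graph: Graph, sources: Set[str], max_hops: int) -> Dict[str, int]:
    """Stand-in for one query call: hop distance from the nearest source
    for every node within `max_hops` (breadth-first expansion)."""
    if max_hops > MAX_HOPS_PER_QUERY:
        raise ValueError("a single query call cannot exceed the hop limit")
    distance = {source: 0 for source in sources}
    frontier = set(sources)
    for hop in range(1, max_hops + 1):
        next_frontier = set()
        for vertex in frontier:
            for neighbor in graph.get(vertex, []):
                if neighbor not in distance:
                    distance[neighbor] = hop
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return distance


def shortest_path_length(graph: Graph, src: str, dst: str, max_steps: int = 5) -> Optional[int]:
    """Disassemble a `max_steps` query into two bounded sub-queries."""
    first_hops = min(max_steps, MAX_HOPS_PER_QUERY)
    first = bounded_query(graph, {src}, first_hops)
    if dst in first:
        return first[dst]
    remaining = max_steps - first_hops
    # Output of the first sub-query (nodes exactly at its hop limit)
    # becomes the input of the second sub-query.
    boundary = {node for node, hops in first.items() if hops == first_hops}
    if remaining <= 0 or not boundary:
        return None
    second = bounded_query(graph, boundary, remaining)
    return first_hops + second[dst] if dst in second else None


# Made-up sample: an undirected chain a - b - c - d - e - f - g.
names = "abcdefg"
chain: Graph = {n: [] for n in names}
for left, right in zip(names, names[1:]):
    chain[left].append(right)
    chain[right].append(left)
```

The decomposition is sound because any target that lies farther than the first sub-query's limit can only be reached through a node sitting exactly on that limit, so expanding from the boundary set loses no shortest path.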

In high concurrency scenarios, an optimization method is to cache all (or most) graph data in memory. At this time, the implementation of distributed data storage modules can be divided into distributed memory data systems and distributed persistent storage systems. When writing graph data, write to the persistence system and update to the memory cache. When querying graph data, the distributed memory data system is used to quickly find and calculate, so as to achieve efficient data operation. When the data fails or the system restarts, the memory cache data can be recovered from the persistent system to ensure data security.
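The write and read paths described above can be sketched as follows. The class `CachedGraphStore`, the capacity of two recent entries and the hit threshold of three are assumptions for illustration, and a plain dictionary stands in for the distributed persistent storage system. Writes go to the persistent store first and then refresh the cache; reads are served from memory when possible, and a key is pinned as hot once its hit count reaches the threshold.

```python
from collections import OrderedDict
from typing import Any, Dict, Set


class CachedGraphStore:
    """Write-through cache over a persistent store (illustrative sketch)."""

    def __init__(self, persistent: Dict[str, Any],
                 recent_capacity: int = 2, hot_threshold: int = 3) -> None:
        self._persistent = persistent          # stand-in for persistent storage
        self._recent_capacity = recent_capacity
        self._hot_threshold = hot_threshold
        self._recent: "OrderedDict[str, Any]" = OrderedDict()  # recently accessed
        self._hot: Dict[str, Any] = {}         # pinned once hits >= threshold
        self._hits: Dict[str, int] = {}

    def write(self, key: str, value: Any) -> None:
        """Persist first, then update the in-memory copies."""
        self._persistent[key] = value
        if key in self._hot:
            self._hot[key] = value
        self._remember(key, value)

    def read(self, key: str) -> Any:
        """Serve from memory if cached, otherwise fall back to persistence."""
        self._hits[key] = self._hits.get(key, 0) + 1
        if key in self._hot:
            value = self._hot[key]
        elif key in self._recent:
            value = self._recent[key]
        else:
            value = self._persistent[key]
        if self._hits[key] >= self._hot_threshold:
            self._hot[key] = value
        self._remember(key, value)
        return value

    def _remember(self, key: str, value: Any) -> None:
        """Keep only the most recently accessed entries."""
        self._recent[key] = value
        self._recent.move_to_end(key)
        while len(self._recent) > self._recent_capacity:
            self._recent.popitem(last=False)

    def cached_keys(self) -> Set[str]:
        return set(self._recent) | set(self._hot)
```

Since every write reaches the persistent store before the cache, a restarted server can start with an empty cache and refill it from persistence on demand, which corresponds to the recovery path mentioned above.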

Referring to FIG. 2, a schematic structural diagram of an embodiment of the knowledge graph system of the present application is shown, including a client 20 connected through a network and the above-mentioned graph server 10 shown in FIG. 1; wherein:

The client 20 is provided with a user interface 21 for receiving a user's data operation request, and sending it to the graph database interface of the graph server 10 through the network, and receiving and displaying the data operation result of the graph server 10.

In specific implementation, the user interface 21 may establish a connection with the graph server 10 through protocols such as HTTP, websocket, or RPC; the data operation request may use the grammatical format of Gremlin, GSQL, or SPARQL language.

In a further preferred embodiment, in order to accommodate the bulk import of raw data in different data formats in a big data scenario, the knowledge graph system may also be provided with a data preprocessing module and an intermediate persistent file system, where the data preprocessing module is used to extract structured or unstructured original data and convert it into node data and/or edge data of the graph database, and the intermediate persistent file system is used to temporarily store the node data and edge data processed by the data preprocessing module.

In a specific implementation, the above data preprocessing module can be deployed on the client (second data preprocessing module), on the graph server (first data preprocessing module), or on both the client and the graph server, according to actual needs. The intermediate persistent file system can be the Hadoop Distributed File System (HDFS), Simple Storage Service (S3), Object Storage Service (OSS), or the like, as needed.

Taking a data preprocessing module deployed on the client as an example: for data from a relational database source, the data model of nodes and edges can be defined by table name or SQL, the metadata can be pulled into the intermediate persistent file system, and the original data can be pulled incrementally from the original database, for example by specifying a timestamp, to generate graph data. For a graph database source, node data and edge data can be exported separately and written into the intermediate persistent file system. The data in the intermediate persistent file system is then read, node and edge creation requests are generated, a connection with the graph server is established, and the requests are sent to the graph server over HTTP or another protocol to complete the writing of node data and edge data.
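A rough sketch of the relational-source case follows. The row layout (`person`, `company`, `updated_at`), the `works_at` edge label and the JSON Lines serialization are assumptions for illustration; the application does not fix a data model or an intermediate file format. ISO-formatted date strings are assumed so that plain string comparison orders them correctly.

```python
import json
from typing import Any, Dict, Iterable, List, Tuple

Record = Dict[str, Any]


def rows_to_graph(rows: Iterable[Record], since: str = "") -> Tuple[List[Record], List[Record]]:
    """Convert relational rows into node and edge records.

    Only rows whose `updated_at` is later than `since` are converted,
    which models an incremental pull by timestamp.
    """
    nodes: Dict[str, Record] = {}
    edges: List[Record] = []
    for row in rows:
        if row["updated_at"] <= since:
            continue
        person_id = f"person:{row['person']}"
        company_id = f"company:{row['company']}"
        nodes[person_id] = {"id": person_id, "label": "person", "name": row["person"]}
        nodes[company_id] = {"id": company_id, "label": "company", "name": row["company"]}
        edges.append({
            "source": person_id,
            "target": company_id,
            "label": "works_at",
            "updated_at": row["updated_at"],
        })
    return list(nodes.values()), edges


def to_json_lines(records: Iterable[Record]) -> str:
    """Serialize records one per line for the intermediate file system."""
    return "\n".join(
        json.dumps(record, sort_keys=True, ensure_ascii=False) for record in records
    )


# Made-up sample rows from a hypothetical employment table.
rows = [
    {"person": "Zhang San", "company": "Acme", "updated_at": "2018-12-01"},
    {"person": "Li Si", "company": "Acme", "updated_at": "2018-12-20"},
]
all_nodes, all_edges = rows_to_graph(rows)
new_nodes, new_edges = rows_to_graph(rows, since="2018-12-10")
```

Keying nodes by a derived id means the shared company appears once even though two rows mention it; the serialized lines could then be read back and turned into node and edge creation requests for the graph server.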

The graph data writing and query process of the knowledge graph system of the present application will be described below with reference to FIGS. 3 and 4, respectively.

Referring to FIG. 3, a graph data storage and modification process according to an embodiment of the present application is shown, including:

Step S31: The data preprocessing module extracts the structured and unstructured original data, converts it into node data and/or edge data in the form of a graph database, and stores it in the intermediate persistent file system.

Step S32: call the graph database interface and send a request to create or update node data and edge data.

The data in the intermediate persistent file system is read, and node and edge creation requests are generated in the syntax of a graph data manipulation language such as Gremlin, GSQL, or SPARQL; after a connection with the server is established, the requests are sent to the graph server over HTTP or another protocol.

Step S33: The graph server responds and parses the request.

Taking Gremlin as an example, after receiving the client's request, the graph database interface of the graph server parses the request according to Gremlin syntax in the current session. According to Gremlin grammar, the request may include graph data write, modify or query operations. The graph server calls the corresponding interface according to the type of the requested operation (graph data write and modify requests are implemented by calling the graph data writing interface; graph data query requests are implemented by calling the graph data query interface).

The graph data writing interface persists the data according to the type of data written (node, edge), and returns the unique index of the data in the distributed data storage module.

The graph data query interface finds the data stored in the distributed data storage module according to the conditions of the incoming query, parses it into a preset data format (node, edge) and returns.

Through the current configuration and service discovery mechanism, the graph server establishes a connection with each storage server of the distributed data storage module at startup, and dynamically sends heartbeat monitoring to detect the availability of the storage server. When the graph data write interface receives the write operation request, the write information is serialized and sent to the distributed data storage module.

Step S34: The distributed data storage module writes the graph data to the file system or other persistent storage to complete the persistence.

Any storage system provided with a storage layer interface can be used as the distributed data storage module. For example, it can be a distributed database or a distributed file system. After starting, a storage server needs to register with the service discovery mechanism (to ensure that the graph server can discover it), and provides a heartbeat detection interface to report its device status in real time. Redundant data replication is implemented between storage servers to ensure fault tolerance. When a new node joins or an old node exits, the graph server can automatically update the configuration of the distributed data storage module through the service discovery mechanism and switch to the corresponding storage server.

Referring to FIG. 4, a graph data query process of an embodiment of the present application is shown, including:

Step S41: The user initiates a graph data query request through the user interface of the client.

Step S42: The graph database interface of the graph server responds and parses the query request.

For complex queries, the graph database interface may need to disassemble the query into multiple calls to the graph data query interface. For example, a shortest path query request within 5 steps can be disassembled into two sub-query requests, with the output of the first sub-query used as the input of the second sub-query.

The graph data query interface accesses the data on the hard disk or cached in memory according to the query conditions and the established index conditions.

In a specific implementation, aggregation operations such as count and avg can be pushed down to the database of the distributed data storage module for calculation. However, graph computations such as sub-graph operations and shortest path queries require querying the database multiple times and holding the data in memory for further calculation.

By constructing a virtual graph data layer (that is, the graph data writing interface and the graph data query interface), this application can reduce the requirements placed on the underlying database. Because common graph processing operations are encapsulated in the graph data layer, the underlying database only needs to provide basic operations such as insert, delete, update, and query, thereby reducing the degree of coupling to the underlying database and making the underlying database replaceable.
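The contrast between the two cases can be sketched with the standard `sqlite3` module standing in for the underlying database. The table layout and the function names `count_edges` and `reachable` are assumptions for illustration. The count is answered by a single SQL aggregate executed inside the database, whereas the reachability computation issues one query per visited node and accumulates the intermediate result in memory.

```python
import sqlite3
from typing import Set


def count_edges(conn: sqlite3.Connection, label: str) -> int:
    """Aggregation pushed down: the database computes the count itself."""
    row = conn.execute(
        "SELECT COUNT(*) FROM edge WHERE label = ?", (label,)
    ).fetchone()
    return row[0]


def reachable(conn: sqlite3.Connection, start: str, hops: int) -> Set[str]:
    """Graph computation: repeated queries, with the visited set kept in memory."""
    seen = {start}
    frontier = {start}
    for _ in range(hops):
        next_frontier = set()
        for vertex in frontier:
            targets = conn.execute(
                "SELECT target_id FROM edge WHERE source_id = ?", (vertex,)
            ).fetchall()
            for (target,) in targets:
                if target not in seen:
                    seen.add(target)
                    next_frontier.add(target)
        frontier = next_frontier
    return seen


# Made-up sample data: a -> b -> c -> d plus one edge with another label.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (source_id TEXT, target_id TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?, ?)",
    [
        ("a", "b", "friendship"),
        ("b", "c", "friendship"),
        ("c", "d", "friendship"),
        ("a", "d", "colleague"),
    ],
)
```

In this sample, `reachable` follows edges regardless of label, so the `colleague` edge shortens the route from `a` to `d` to a single hop.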

It should be noted that the above device embodiments are preferred embodiments, and the units and modules involved are not necessarily required by this application. The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to each other. The above-described embodiments are only schematic: the modules described as separate components may or may not be physically separated, and may be located in one place or distributed over multiple network elements (taking the data preprocessing module in the system embodiment described above as an example, it can be deployed on the client, on the graph server, or on both the client and the graph server according to actual needs). Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

Specific examples are used herein to explain the principle and implementation of this application. The descriptions of the above examples are only intended to help in understanding the method and core ideas of this application; meanwhile, those of ordinary skill in the art may, following the ideas of this application, make changes to the specific implementation and scope of application. In summary, the content of this specification should not be understood as a limitation of this application.

Claims

1. A graph server, including a graph database interface, a graph data writing interface, a graph data query interface and a distributed data storage module, wherein:

the graph database interface is configured to receive a user's data operation request, and call the graph data writing interface or the graph data query interface according to the type of the data operation request to implement operations on the distributed data storage module;

the graph data writing interface is configured to create or update node data or edge data in the distributed data storage module according to the type of data to be written in the data operation request, and return the unique index of the data in the distributed data storage module;

the graph data query interface is configured to obtain data stored in the distributed data storage module according to the query conditions in the data operation request, and return it to the user according to a preset data format of nodes and edges;

the distributed data storage module is a distributed file system or a distributed database, and is configured to provide data storage and query services for the graph server.
2. The graph server according to claim 1, wherein the graph server further includes a query disassembly module configured to disassemble a query request with a complexity greater than a preset condition into multiple sub-query requests, and to call the graph data query interface in sequence or concurrently to implement the user's data query request.
3. The graph server according to claim 1, wherein the graph server is further provided with a memory cache for caching data recently accessed by the user and/or data whose number of query hits is greater than or equal to a preset hot data threshold.
4. The graph server according to claim 1, wherein the graph server is provided with a service discovery mechanism for the distributed data storage module; each storage server of the distributed data storage module is provided with a heartbeat detection interface to report the device status to the graph server in real time; when a new storage server is added or an existing storage server is withdrawn, the graph server automatically updates the configuration of the distributed data storage module through the service discovery mechanism, and switches the storage and query services to the corresponding storage server.
5. The graph server according to claim 1, wherein the graph server further includes a first data preprocessing module configured to extract structured or unstructured original data and convert it into node data and/or edge data of the graph database.
6. A knowledge graph system including a client and the graph server according to any one of claims 1 to 5; the client is connected to the graph server through a network;

the client includes a user interface configured to receive a user's data operation request and send it to the graph database interface of the graph server through the network, and to receive and display the graph server's data operation result.
7. The knowledge graph system according to claim 6, wherein the client further includes a second data preprocessing module configured to convert the structured or unstructured original data into node data or edge data of the graph database.
8. The knowledge graph system according to claim 7, wherein the client further includes an intermediate persistent file system configured to temporarily store node data and edge data processed by the second data preprocessing module.
9. The knowledge graph system according to claim 6, wherein the user interface establishes a connection with the graph server through the hypertext transfer protocol, the websocket protocol or a remote procedure call protocol.
10. The knowledge graph system according to claim 6, wherein the data operation request adopts the syntax format of the Gremlin, GSQL or SPARQL language.