WO2020135050A1 - Knowledge mapping system and map server thereof - Google Patents

Publication number
WO2020135050A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
graph
server
query
interface
Prior art date
Application number
PCT/CN2019/124555
Other languages
French (fr)
Chinese (zh)
Inventor
周游
顾江
刘涛
Original Assignee
颖投信息科技(上海)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 颖投信息科技(上海)有限公司 filed Critical 颖投信息科技(上海)有限公司
Publication of WO2020135050A1 publication Critical patent/WO2020135050A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/903: Querying
    • G06F16/9032: Query formulation

Definitions

  • the present application relates to the technical field of knowledge graph processing, and in particular, to a knowledge graph system and graph server.
  • the knowledge graph is a database that realizes semantic search by storing various entities and their relationships in the real world, and stores and queries data in a graph data structure.
  • each entity is identified by a globally unique identifier (ID), its intrinsic characteristics are represented by "property-property value" pairs (PVP, Property Value Pair), and a relation connects two entities to represent the association between them.
  • the financial knowledge graph represents companies, management, news events, and users' personal preferences as entities and establishes links between them, making financial data search more efficient and providing investors with targeted investment advice.
  • Neo4j is a relatively advanced native graph database that provides native graph data storage, retrieval, and processing.
  • Neo4j's storage is specially optimized for graphs, which considerably improves the efficiency and speed of graph traversal.
  • Neo4j provides Cypher as its graph query language, which has concise semantics and is easy to use.
  • in practice, Neo4j is better suited to lightweight scenarios: under large data loads, its graph data insertion and traversal performance is poor. In addition, owing to limitations of its software architecture, Neo4j can only run on a single machine, so system scalability and fault tolerance are out of the question. With the rapid growth of enterprise data volumes, a stand-alone Neo4j deployment clearly can no longer meet the data management and retrieval needs of knowledge graphs.
  • This application provides a knowledge graph system and graph server that address the problem that existing graph databases cannot cope with graph data management and retrieval at large data volumes.
  • a graph server disclosed in the present application includes a graph database interface, a graph data writing interface, a graph data query interface, and a distributed data storage module, wherein: the graph database interface is configured to receive a user's data operation request and, according to the type of the data operation request, call the graph data writing interface or the graph data query interface to operate on the distributed data storage module; the graph data writing interface is configured to create or update node or edge data in the distributed data storage module according to the type of data to be written in the data operation request, and to return the unique index of that data in the distributed data storage module; the graph data query interface is configured to obtain the data stored in the distributed data storage module according to the query conditions in the data operation request, and to return it to the user in the preset node and edge data format; the distributed data storage module is a distributed file system or a distributed database, and is configured to provide data storage and query services for the graph server.
  • the graph server further includes a query disassembly module configured to disassemble a query request with a complexity greater than a preset condition into multiple sub-query requests, and call the graph data query interface in sequence or concurrently to implement the user's data query request.
  • the graph server is further provided with a memory cache, which is configured to cache data recently accessed by the user and/or data whose number of query hits is greater than or equal to a preset hot-data threshold.
  • the graph server is provided with a service discovery mechanism for the distributed data storage module; each storage server of the distributed data storage module is provided with a heartbeat detection interface to report the device status to the graph server in real time; when a new storage server joins or an existing storage server exits, the graph server automatically updates the configuration of the distributed data storage module through the service discovery mechanism, and switches the storage and query service to the corresponding storage server.
  • the graph server further includes a first data preprocessing module configured to extract structured or unstructured original data and convert it into node data and/or edge data of the graph database.
  • a knowledge graph system disclosed in this application includes a client and the graph server described above; the client is connected to the graph server through a network; the client includes a user interface configured to receive the user's data operation request, send it through the network to the graph database interface of the graph server, and receive and display the data operation results of the graph server.
  • the client further includes a second data pre-processing module configured to convert the structured or unstructured original data into node data or edge data of the graph database.
  • the client further includes an intermediate persistent file system configured to temporarily store node data and edge data processed by the second data preprocessing module.
  • the user interface establishes a connection with the graph server by means of hypertext transfer protocol, websocket protocol or remote procedure call protocol.
  • the data operation request adopts the syntax format of Gremlin, GSQL or SPARQL language.
  • the graph server embodiment of this application decouples the system by placing interfaces at the key points.
  • the storage layer can scale out quickly as data volume grows; with a multi-machine deployment, this effectively solves the problem of an inaccessible server or unavailable data.
  • the flexible interface definition lets the system adapt to different database types, so that a suitable storage method can be chosen according to business needs; it also allows the graph data query interface to be deployed flexibly across multiple machines to suit highly concurrent application scenarios.
  • each interface uses a unified graph traversal language for interaction without concern for the implementation of the underlying architecture, thereby ensuring the stability of upper-layer applications.
  • FIG. 1 is a schematic structural diagram of an embodiment of a graph server according to this application.
  • FIG. 2 is a schematic structural diagram of an embodiment of a knowledge graph system according to this application.
  • FIG. 3 is a schematic diagram of a graph data writing process according to an embodiment of this application.
  • FIG. 4 is a schematic diagram of a graph data query process according to an embodiment of the present application.
  • first and second are used for descriptive purposes only, and cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features.
  • the features defined as “first” and “second” may explicitly or implicitly include one or more of the features.
  • the meaning of “plurality” is two or more, unless specifically defined otherwise.
  • the term “including” and similar terms should be understood as open-ended terms, i.e. “including but not limited to”.
  • the term “based on” means “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one other embodiment”.
  • Related definitions of other terms will be given in the description below.
  • One of the inventive concepts of the present application is to address the problems of existing graph databases by restructuring the whole knowledge graph system into an application layer, a query engine, and an underlying storage layer, which decouples the system and reduces the dependencies between modules.
  • the underlying storage layer implements a common interface for graph data storage and query (that is, a virtual graph data layer); because common graph processing operations are encapsulated in the graph data layer, the underlying database only needs to provide basic operations such as create, delete, update, and query.
  • the graph data can be scaled out and redundantly backed up to a file system composed of multiple machines to achieve data consistency and fault tolerance, so that the underlying storage has high I/O performance and a flexible schema definition suited to storing nodes, edges, and their attributes.
  • the query engine parses the query language, generates a query plan for graph traversal, and calls the underlying storage interface to store and retrieve data; after parsing the query language, the query engine can optimize operations such as shortest-path planning and data aggregation.
  • Distributed computing and caching can improve query performance according to data volume and computing resources, and provide high-concurrency, low-latency services.
  • the application layer provides a unified query language and connection methods: clients connect to the graph server remotely over protocols such as the Hypertext Transfer Protocol (HTTP), websocket (a protocol defined by the RFC 6455 standard for full-duplex communication on a single TCP connection), and Remote Procedure Call (RPC), and interact with the query engine through graph query languages such as Gremlin, GSQL, and SPARQL.
  • Referring to FIG. 1, a schematic structural diagram of an embodiment of a graph server of the present application is shown, including a graph database interface 11, a graph data writing interface 12, a graph data query interface 13, and a distributed data storage module 14.
  • the graph database interface 11 is used to receive a data operation request issued by a user, and call the graph data writing interface 12 or the graph data query interface 13 to implement operations on the distributed data storage module 14 according to the type of the data operation request.
  • the types of data operation requests include the creation or update of graph database node data, the creation or update of edge data, and the query of graph database.
  • the data operation request sent to the graph database interface 11 can use the syntax format of graph data manipulation languages such as Gremlin, GSQL, SPARQL, etc. Taking Gremlin as an example, suppose a node and an edge need to be created in the graph database g. Use the g.addV() command to issue a node creation request, and the g.addE() command to issue an edge creation request.
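To make the Gremlin example concrete, the following is a minimal sketch that builds `g.addV()` and `g.addE()` request strings of the kind a client might send to the graph database interface 11. The helper names (`gremlin_literal`, `add_vertex`, `add_edge`) and the labels are hypothetical and do not come from the application; the exact syntax a server accepts may also depend on its Gremlin version.

```python
def gremlin_literal(value):
    """Render a Python value as a Gremlin (Groovy-style) literal."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape backslashes and single quotes for a single-quoted string.
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def add_vertex(label, **props):
    """Build a g.addV() traversal string that creates one node."""
    steps = [f"g.addV({gremlin_literal(label)})"]
    for key, value in props.items():
        steps.append(f".property({gremlin_literal(key)}, {gremlin_literal(value)})")
    return "".join(steps)


def add_edge(label, from_id, to_id, **props):
    """Build a g.addE() traversal string linking two existing nodes by id."""
    steps = [
        f"g.addE({gremlin_literal(label)})",
        f".from(__.V({gremlin_literal(from_id)}))",
        f".to(__.V({gremlin_literal(to_id)}))",
    ]
    for key, value in props.items():
        steps.append(f".property({gremlin_literal(key)}, {gremlin_literal(value)})")
    return "".join(steps)


if __name__ == "__main__":
    print(add_vertex("company", name="Acme"))
    print(add_edge("manages", 1, 2, since=2018))
```

Building strings keeps the sketch free of any driver dependency; a real client would more likely use a Gremlin driver with parameter bindings rather than string concatenation.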
  • the graph data writing interface 12 is used to create or update the node data or edge data in the distributed data storage module 14 according to the type of data to be written (node data or edge data) in the data operation request, and to return the unique index of that data in the distributed data storage module 14.
  • the graph data query interface 13 is used to obtain the data stored in the distributed data storage module 14 according to the query conditions in the data operation request, and return it to the user in a preset data format of nodes and edges.
  • the distributed data storage module 14 is a distributed file system or a distributed database, and is used to provide data storage and query services for the graph server 10.
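The division of labour between components 11 to 14 can be sketched as follows. This is an illustrative in-memory model, not the implementation described in the application: the class names, the request dictionary layout (`op`, `kind`, `data`, `conditions`), and the `put`/`scan` storage methods are all assumptions made for the example.

```python
import itertools


class InMemoryStorage:
    """Stand-in for the distributed data storage module (14): basic operations only."""

    def __init__(self):
        self._records = {}
        self._ids = itertools.count(1)

    def put(self, record, index=None):
        # Create when no index is given, otherwise overwrite the existing record.
        if index is None:
            index = next(self._ids)
        self._records[index] = record
        return index

    def scan(self, predicate):
        return [(i, r) for i, r in self._records.items() if predicate(r)]


class GraphWriteInterface:
    """Sketch of the graph data writing interface (12)."""

    def __init__(self, storage):
        self.storage = storage

    def write(self, kind, data, index=None):
        if kind not in ("node", "edge"):
            raise ValueError(f"unknown data type: {kind}")
        # Return the unique index of the stored data.
        return self.storage.put({"kind": kind, **data}, index)


class GraphQueryInterface:
    """Sketch of the graph data query interface (13)."""

    def __init__(self, storage):
        self.storage = storage

    def query(self, kind, **conditions):
        def matches(record):
            return record["kind"] == kind and all(
                record.get(k) == v for k, v in conditions.items()
            )
        # Return results in one fixed node/edge format.
        return [{"index": i, **r} for i, r in self.storage.scan(matches)]


class GraphDatabaseInterface:
    """Sketch of the graph database interface (11): dispatch by request type."""

    def __init__(self, storage):
        self.writer = GraphWriteInterface(storage)
        self.reader = GraphQueryInterface(storage)

    def handle(self, request):
        op = request["op"]
        if op in ("create", "update"):
            return self.writer.write(request["kind"], request["data"], request.get("index"))
        if op == "query":
            return self.reader.query(request["kind"], **request.get("conditions", {}))
        raise ValueError(f"unsupported operation: {op}")
```

Because the two graph interfaces touch storage only through `put` and `scan`, the backend could in principle be swapped for a database or file system without changing the dispatch code, which is the decoupling the text describes.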
  • the service discovery mechanism of the distributed data storage module 14 can be set on the graph server 10; meanwhile, a heartbeat detection interface is set on each storage server of the distributed data storage module 14 to report the device status to the graph server 10 in real time.
  • when a new storage server joins or an existing storage server exits, the graph server 10 can automatically update the configuration of the distributed data storage module through the above service discovery mechanism, and switch the storage and query service to the corresponding storage server.
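One way to picture the heartbeat-based service discovery is a registry that records the last heartbeat time of each storage server and treats servers that have been silent for too long as having exited. The sketch below is an assumption-laden illustration: the `ServiceRegistry` class, the timeout value, and the hash-based `route` method are hypothetical, and timestamps are passed in explicitly so the behaviour is deterministic.

```python
import zlib


class ServiceRegistry:
    """Tracks storage servers by their most recent heartbeat."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout      # seconds of silence before a server is considered gone
        self._last_seen = {}        # server id -> time of last heartbeat

    def heartbeat(self, server_id, now):
        """Register a new storage server or refresh an existing one."""
        self._last_seen[server_id] = now

    def live_servers(self, now):
        """Servers whose last heartbeat is within the timeout window."""
        return sorted(
            s for s, seen in self._last_seen.items() if now - seen <= self.timeout
        )

    def route(self, key, now):
        """Pick a live server for a key using a stable hash."""
        live = self.live_servers(now)
        if not live:
            raise RuntimeError("no storage server available")
        return live[zlib.crc32(key.encode("utf-8")) % len(live)]


if __name__ == "__main__":
    registry = ServiceRegistry(timeout=3.0)
    registry.heartbeat("storage-1", now=0.0)
    registry.heartbeat("storage-2", now=0.0)
    registry.heartbeat("storage-1", now=4.0)       # storage-2 stays silent
    print(registry.live_servers(now=5.0))           # storage-2 has timed out
```

Simple modulo routing reassigns many keys whenever membership changes; a production system would more plausibly combine consistent hashing with the redundant replication the application mentions.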
  • This application encapsulates common graph processing operations in the graph data layer (that is, graph data writing interface 12 and graph data query interface 13), so that the underlying data storage module only needs to provide basic operations such as create, delete, update, and query. This reduces the coupling to the underlying data storage module and makes the underlying data storage module replaceable.
  • the underlying data storage module can customize the storage format according to its storage structure. For example, if the underlying data storage module is a relational database, the nodes and edges can be stored as a table with the following two-dimensional structure:
  • the storage format that can be used is:
  • the graph server is also provided with a query disassembly module, which disassembles a query request whose complexity is greater than a preset condition into multiple sub-query requests and calls the graph data query interface 13 sequentially or concurrently to fulfil the user's data query request.
  • the above request can be disassembled into two sub-query requests; the second sub-query takes the output of the first sub-query as its input, and can itself be decomposed into multiple concurrent calls to the graph data query interface 13 as needed.
  • by breaking a complex query into multiple calls to the graph data query interface, this application turns a single query of limited capability into the concurrent operation of multiple query servers, which greatly improves the query efficiency of the graph database.
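The two-stage decomposition can be sketched with a two-hop neighbour query: the first sub-query fetches the direct neighbours of a start node, and the second sub-query, which takes that output as its input, is itself split into one concurrent call per first-hop node. The toy adjacency data and the function names `query_neighbors` and `two_hop_neighbors` are assumptions for illustration; `query_neighbors` merely stands in for one call to the graph data query interface.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy adjacency data standing in for the stored graph (hypothetical).
EDGES = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": []}


def query_neighbors(node):
    """Stand-in for a single call to the graph data query interface."""
    return EDGES.get(node, [])


def two_hop_neighbors(start, max_workers=4):
    """Answer a two-hop query as two chained sub-queries."""
    # Sub-query 1: direct neighbours of the start node.
    first_hop = query_neighbors(start)

    # Sub-query 2: uses the output of sub-query 1 as input and fans out
    # into one concurrent call per first-hop node.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(query_neighbors, first_hop))

    second_hop = {node for neighbours in results for node in neighbours}
    return sorted(second_hop - {start})


if __name__ == "__main__":
    print(two_hop_neighbors("a"))
```

Threads suit this sketch because each sub-query would spend most of its time waiting on storage; with several query servers, the same fan-out could be spread across machines instead.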
  • in order to further improve the query response speed of graph data, the graph server is further provided with a memory cache for caching the data recently accessed by the user and/or the data whose number of query hits is greater than or equal to the preset hot-data threshold.
  • an optimization method is to cache all (or most) graph data in memory.
  • the implementation of distributed data storage modules can be divided into distributed memory data systems and distributed persistent storage systems.
  • when data is written, it is written to the persistent system and to the memory cache.
  • the distributed memory data system is used for fast lookup and computation, so as to achieve efficient data operation.
  • the memory cache data can be recovered from the persistent system to ensure data security.
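The caching behaviour can be sketched as a write-through cache with two tiers: a small least-recently-used tier for recently accessed data, and a "hot" tier for data whose query count reaches a threshold. Everything here is an assumption for illustration: the `GraphCache` class, the capacity and threshold values, the choice to persist before caching, and the use of a plain dictionary as the persistent system.

```python
from collections import Counter, OrderedDict


class GraphCache:
    """Write-through cache over a dict-like persistent store."""

    def __init__(self, store, capacity=2, hot_threshold=3):
        self.store = store                  # stand-in for the persistent system
        self.capacity = capacity            # size of the recently-accessed tier
        self.hot_threshold = hot_threshold  # query count that makes data "hot"
        self.recent = OrderedDict()         # least recently used entries first
        self.hot = {}
        self.hits = Counter()

    def _remember(self, key, value):
        if key in self.hot:
            self.hot[key] = value
            return
        self.recent[key] = value
        self.recent.move_to_end(key)
        while len(self.recent) > self.capacity:
            self.recent.popitem(last=False)  # evict the least recently used entry

    def write(self, key, value):
        self.store[key] = value              # persist first (an assumed ordering)
        self._remember(key, value)

    def read(self, key):
        self.hits[key] += 1
        if key in self.hot:
            return self.hot[key]
        if key in self.recent:
            self.recent.move_to_end(key)
            value = self.recent[key]
        else:
            value = self.store[key]          # cache miss: fall back to persistence
            self._remember(key, value)
        if self.hits[key] >= self.hot_threshold:
            self.hot[key] = value            # promote to the hot tier
            self.recent.pop(key, None)
        return value

    def clear_memory(self):
        """Simulate losing the in-memory tiers; data stays recoverable from the store."""
        self.recent.clear()
        self.hot.clear()
```

After `clear_memory()`, the next `read` repopulates the cache from the persistent store, which mirrors the recovery property described above.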
  • Referring to FIG. 2, a schematic structural diagram of an embodiment of the knowledge graph system of the present application is shown, including a client 20 and the above-mentioned graph server 10 shown in FIG. 1, connected through a network; wherein:
  • the client 20 is provided with a user interface 21 for receiving a user's data operation request, and sending it to the graph database interface of the graph server 10 through the network, and receiving and displaying the data operation result of the graph server 10.
  • the user interface 21 may establish a connection with the graph server 10 through protocols such as HTTP, websocket, or RPC; the data operation request may use the grammatical format of Gremlin, GSQL, or SPARQL language.
  • the knowledge graph system may also be provided with a data preprocessing module and an intermediate persistent file system, where the data preprocessing module is used to extract the structured or unstructured original data and convert it into node data and/or edge data of the graph database.
  • the intermediate persistent file system is used to temporarily store node data and edge data processed by the data preprocessing module.
  • the above-mentioned data pre-processing module can be deployed on the client (second data pre-processing module), on the graph server (first data pre-processing module), or on both the client and the graph server, according to actual needs.
  • the intermediate persistent file system can be, as needed, the Hadoop Distributed File System (HDFS), Simple Storage Service (S3), Object Storage Service (OSS), or the like.
  • node data and edge data can be exported separately and written into the intermediate persistent file system. The data in the intermediate persistent file system is then read, node and edge creation requests are generated respectively, a connection is established with the graph server, and the requests are sent to the graph server over HTTP or another protocol to complete the writing of node data and edge data.
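The export-then-load flow can be sketched as follows, using a local temporary directory in place of an intermediate persistent file system such as HDFS, S3, or OSS. The row layout (`person`, `company`, `role`), the JSON Lines file format, the request dictionaries, and all function names are assumptions for illustration; sending the requests over HTTP is left out.

```python
import json
import tempfile
from pathlib import Path


def extract_graph(rows):
    """Turn structured (person, company, role) rows into node and edge records."""
    nodes, edges = {}, []
    for row in rows:
        # Keying by id deduplicates nodes that appear in several rows.
        nodes[row["person"]] = {"id": row["person"], "label": "person"}
        nodes[row["company"]] = {"id": row["company"], "label": "company"}
        edges.append({"src": row["person"], "dst": row["company"], "label": row["role"]})
    return list(nodes.values()), edges


def export_jsonl(records, path):
    """Write one JSON record per line to the intermediate file."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_requests(path, op):
    """Read the intermediate file back and wrap each record in a creation request."""
    with open(path, encoding="utf-8") as handle:
        return [{"op": op, "data": json.loads(line)} for line in handle if line.strip()]


if __name__ == "__main__":
    rows = [
        {"person": "Li", "company": "Acme", "role": "manages"},
        {"person": "Li", "company": "Beta", "role": "advises"},
    ]
    nodes, edges = extract_graph(rows)
    with tempfile.TemporaryDirectory() as workdir:
        export_jsonl(nodes, Path(workdir) / "nodes.jsonl")
        export_jsonl(edges, Path(workdir) / "edges.jsonl")
        requests = load_requests(Path(workdir) / "nodes.jsonl", "create_node")
        requests += load_requests(Path(workdir) / "edges.jsonl", "create_edge")
    print(len(requests), "requests ready to send")
```

Writing nodes and edges to separate files lets node creation finish before edge creation starts, so that every edge refers to nodes that already exist.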
  • Referring to FIG. 3, a graph data storage and modification process according to an embodiment of the present application is shown, including:
  • Step S31: The data preprocessing module extracts the structured and unstructured original data, converts it into node data and/or edge data in the form of a graph database, and stores it in the intermediate persistent file system.
  • Step S32: Call the graph database interface and send a request to create or update node data and edge data.
  • Step S33: The graph server responds and parses the request.
  • the graph database interface of the graph server parses the request through Gremlin syntax in the current session.
  • the request may include graph data write, modify or query operations.
  • the graph server calls the corresponding interface according to the currently requested operation type (for graph data writing and modifying requests, it is implemented by calling the graph data writing interface; for graph data query requests, it is achieved by calling the graph data query interface).
  • the graph data writing interface persists the data according to the type of data written (node, edge), and returns the unique index of the data in the distributed data storage module.
  • the graph data query interface finds the data stored in the distributed data storage module according to the conditions of the incoming query, parses it into a preset data format (node, edge) and returns.
  • the graph server establishes a connection with each storage server of the distributed data storage module at startup, and dynamically sends heartbeat monitoring to detect the availability of the storage server.
  • when the graph data write interface receives the write operation request, the write information is serialized and sent to the distributed data storage module.
  • Step S34: The distributed data storage module writes the graph data to the file system or other persistent storage to complete the persistence.
  • Any storage system provided with a storage layer interface can be used as a distributed data storage module.
  • it can be a distributed database or a distributed file system.
  • the storage server needs to register with the service discovery mechanism (so that the graph server can discover it), and provides a heartbeat detection interface to report the device status in real time. Redundant data replication is implemented between storage servers to ensure fault tolerance.
  • the graph server can automatically update the configuration of the distributed data storage module through the service discovery mechanism and switch to the corresponding storage server.
  • Referring to FIG. 4, a graph data query process of an embodiment of the present application is shown, including:
  • Step S41: The user initiates a graph data query request through the user interface of the client.
  • Step S42: The graph database interface of the graph server responds and parses the query request.
  • the graph database interface may need to disassemble the request into multiple query calls to the graph data query interface. For example, a shortest-path query within 5 steps can be disassembled into two sub-query requests, with the output of the first sub-query used as the input of the second sub-query.
  • the graph data query interface accesses the data on the hard disk or cached in memory according to the query conditions and the established index conditions.
  • aggregation operations such as count and avg can be pushed down to the database of the distributed data storage module to perform calculations.
  • for graph computations such as sub-graph operations and shortest path queries, the database must be queried multiple times and the intermediate data held in memory for further calculation.
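The contrast between pushed-down aggregation and multi-round graph computation can be sketched with Python's built-in `sqlite3` module standing in for the database of the distributed data storage module. The `edges` table schema and the function names are assumptions for illustration: `COUNT` and `AVG` run inside the database in a single statement, whereas the bounded shortest-path search issues one query per step and keeps the visited set in memory.

```python
import sqlite3


def build_db(edges):
    """Create an in-memory edge table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (src TEXT, dst TEXT, weight REAL)")
    conn.executemany("INSERT INTO edges VALUES (?, ?, ?)", edges)
    return conn


def out_edge_stats(conn, node):
    """Aggregation pushed down: the database computes COUNT and AVG itself."""
    return conn.execute(
        "SELECT COUNT(*), AVG(weight) FROM edges WHERE src = ?", (node,)
    ).fetchone()


def shortest_path_length(conn, start, goal, max_steps=5):
    """Breadth-first search issuing one database query per step."""
    if start == goal:
        return 0
    visited, frontier = {start}, [start]
    for step in range(1, max_steps + 1):
        # One placeholder per frontier node; very large frontiers would need
        # batching to stay under SQLite's bound-parameter limit.
        marks = ",".join("?" * len(frontier))
        rows = conn.execute(
            f"SELECT DISTINCT dst FROM edges WHERE src IN ({marks})", frontier
        ).fetchall()
        next_frontier = [dst for (dst,) in rows if dst not in visited]
        if goal in next_frontier:
            return step
        if not next_frontier:
            return None
        visited.update(next_frontier)   # intermediate state is held in memory
        frontier = next_frontier
    return None


if __name__ == "__main__":
    conn = build_db([("a", "b", 1.0), ("a", "c", 3.0), ("b", "d", 1.0), ("d", "e", 1.0)])
    print(out_edge_stats(conn, "a"))
    print(shortest_path_length(conn, "a", "e"))
```

The `max_steps` bound corresponds to a query such as "shortest path within 5 steps": the search gives up once the bound is reached rather than exploring the whole graph.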
  • This application can reduce the data support requirements for the underlying database by constructing a virtual graph data layer (that is, graph data writing interface and graph data query interface).
  • the above device embodiments are preferred embodiments, and the units and modules involved are not necessarily required by this application.
  • the embodiments in this specification are described in a progressive manner. Each embodiment focuses on the differences from other embodiments, and the same or similar parts between the embodiments may refer to each other.
  • the above-described embodiments are only schematic: the modules described as separate components may or may not be physically separated, and may be located in one place or distributed across multiple network elements. Taking the data pre-processing module in the system embodiment as an example, it can be deployed on the client, on the graph server, or on both the client and the graph server according to actual needs.
  • some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A knowledge mapping system and a map server (10) thereof, the map server (10) comprising: a map database interface (11), configured for receiving a user data operation request, and, in accordance with the type of the data operation request, calling a corresponding interface so as to perform an operation with regard to a distributed data storage module; a map data write interface (12), configured for using a type of data to be written so as to create or update node data or edge data within a distributed data storage module, and for returning a unique index of said data in the distributed data storage module; a map data query interface (13), configured for using query conditions to obtain data stored in a distributed data storage module, and for returning same to a user in accordance with preset node-data and edge-data formats; and a distributed data storage module (14), configured for providing data storage and query services to a map server. The system is able to effectively solve the problem of map databases being unable to adapt to map data management and search in big-data environments.

Description

Knowledge graph system and graph server thereof
This application claims priority to Chinese invention patent application No. 201811635242.2, titled "Knowledge graph system and graph server thereof", filed on December 29, 2018, the entire content of which is incorporated herein by reference.
Technical field
The present application relates to the technical field of knowledge graph processing, and in particular to a knowledge graph system and a graph server thereof.
Background
A knowledge graph is a database that enables semantic search by storing the entities that exist in the real world and the relationships between them; it stores and queries data in a graph data structure. Each entity is identified by a globally unique identifier (ID), its intrinsic characteristics are represented by "property-property value" pairs (PVP, Property Value Pair), and a relation connects two entities to represent the association between them. A knowledge graph can be regarded as one huge graph whose nodes represent entities and whose edges represent the relationships between nodes (edges are composed of properties and relations).
Taking the financial knowledge graph as an example, it represents companies, management, news events, and users' personal preferences as entities and establishes links between them, making financial data search more efficient and providing investors with targeted investment advice.
For knowledge graph data, storing and querying the data in a graph database is the mainstream choice. At present, Neo4j is a relatively advanced native graph database that provides native graph data storage, retrieval, and processing. Neo4j's storage is specially optimized for graphs, which considerably improves the efficiency and speed of graph traversal, and it provides Cypher as its graph query language, which has concise semantics and is easy to use.
In practice, however, Neo4j is better suited to lightweight scenarios: under large data loads, its graph data insertion and traversal performance is poor. In addition, owing to limitations of its software architecture, Neo4j can only run on a single machine, so system scalability and fault tolerance are out of the question. With the rapid growth of enterprise data volumes, a stand-alone Neo4j deployment clearly can no longer meet the data management and retrieval needs of knowledge graphs.
发明内容Summary of the invention
本申请提供一种知识图谱系统及其图服务器,用于解决现有图数据库不能适应大数据量场景下的图数据管理和检索的问题。This application provides a knowledge graph system and graph server, which are used to solve the problem that graph data management and retrieval under the situation of large data volume cannot be adapted to the existing graph database.
本申请公开的一种图服务器,包括图数据库接口、图数据写入接口、图数据查询接口和分布式数据存储模块,其中:所述图数据库接口配置为接收用户的数据操作请求,并根据所述数据操作请求的类型调用图数据写入接口或图数据查询接口实现对分布式数据存储模块的操作;所述图数据写入接口配置为根据所述数据操作请求中的待写入数据类型,创建或更新分布式数据存储模块中的节点或边的数据,并返回所述数据在分布式数据存储模块中的唯一索引;所述图数据查询接口配置为根据所述数据操作请求中的查询条件,获得存储在分布式数据存储模块中的数据,并按预设的节点和边的数据格式返回给用户;所述分布式数据存储模块为分布式文件系统或分布式数据库,配置为为图服务器提供数据存储和查询服务。A graph server disclosed in the present application includes a graph database interface, a graph data writing interface, a graph data query interface, and a distributed data storage module, wherein: the graph database interface is configured to receive user data operation requests and The type of the data operation request calls a graph data writing interface or a graph data query interface to implement the operation on the distributed data storage module; the graph data writing interface is configured according to the type of data to be written in the data operation request, Create or update the data of nodes or edges in the distributed data storage module, and return the unique index of the data in the distributed data storage module; the graph data query interface is configured according to the query conditions in the data operation request To obtain the data stored in the distributed data storage module and return it to the user according to the preset node and edge data format; the distributed data storage module is a distributed file system or a distributed database and is configured as a graph server Provide data storage and query services.
Preferably, the graph server further includes a query disassembly module configured to split a query request whose complexity exceeds a preset condition into multiple sub-query requests, and to call the graph data query interface sequentially or concurrently to fulfil the user's data query request.
Preferably, the graph server is further provided with a memory cache configured to cache data recently accessed by the user and/or data whose number of query hits is greater than or equal to a preset hot data threshold.
Preferably, the graph server is provided with a service discovery mechanism for the distributed data storage module; each storage server of the distributed data storage module is provided with a heartbeat detection interface that reports device status to the graph server in real time; when a new storage server joins or an existing storage server exits, the graph server automatically updates the configuration of the distributed data storage module through the service discovery mechanism and switches the storage and query services to the corresponding storage server.
Preferably, the graph server further includes a first data preprocessing module configured to extract structured or unstructured raw data and convert it into node data and/or edge data of the graph database.
A knowledge graph system disclosed in the present application includes a client and the graph server described above; the client is connected to the graph server through a network; the client includes a user interface configured to receive a user's data operation request, send it over the network to the graph database interface of the graph server, and receive and display the data operation result from the graph server.
Preferably, the client further includes a second data preprocessing module configured to convert structured or unstructured raw data into node data or edge data of the graph database.
Preferably, the client further includes an intermediate persistent file system configured to temporarily store the node data and edge data produced by the second data preprocessing module.
Preferably, the user interface establishes a connection with the graph server by means of the Hypertext Transfer Protocol, the websocket protocol or a remote procedure call protocol.
Preferably, the data operation request uses the syntax of the Gremlin, GSQL or SPARQL language.
Compared with the prior art, the present application has the following advantages:
The graph server embodiment of the present application decouples the system by placing interfaces at each key point. The storage layer can scale out quickly as the data volume grows, and in a multi-machine deployment this effectively addresses the problem of a server being inaccessible or data being unavailable. The flexible interface definition not only allows the system to adapt to different database types, so that a suitable storage method can be chosen according to business needs, but also allows the graph data query interface to be deployed flexibly across multiple machines to suit high-concurrency application scenarios.
In a further preferred embodiment, the interfaces interact through a unified graph traversal language without needing to be concerned with the underlying architecture, which keeps upper-layer applications stable.
Brief Description of the Drawings
The drawings are provided only to illustrate the preferred embodiments and are not to be regarded as limiting the present application. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
FIG. 1 is a schematic structural diagram of an embodiment of the graph server of the present application;
FIG. 2 is a schematic structural diagram of an embodiment of the knowledge graph system of the present application;
FIG. 3 is a schematic diagram of the graph data writing process according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the graph data query process according to an embodiment of the present application.
Detailed Description
To make the above objects, features and advantages of the present application clearer and easier to understand, the present application is described in further detail below with reference to the accompanying drawings and specific embodiments.
In the description of the present application, it should be understood that the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more such features. "Multiple" means two or more, unless specifically defined otherwise. The terms "including", "comprising" and similar terms should be understood as open-ended, that is, "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one other embodiment". Definitions of other terms are given in the description below.
One of the inventive concepts of the present application is as follows: to address the problems of existing graph databases, the architecture of the whole knowledge graph system is reorganized into an application layer, a query engine and an underlying storage layer, which decouples the system and reduces dependencies between modules. The underlying storage layer implements a common interface for graph data storage and query (that is, a virtual graph data layer). Because common graph processing operations are encapsulated in the graph data layer, the underlying database only needs to provide basic operations such as create, read, update and delete, which reduces coupling to the underlying database and makes it replaceable. Distributed storage allows graph data to be scaled out and redundantly backed up to a file system composed of multiple machines, providing data consistency and fault tolerance, so that the underlying storage offers high I/O performance and a flexible schema definition suited to storing nodes, edges and their attributes. The query engine parses the query language, generates a query plan for graph traversal, and calls the underlying storage interface to store and retrieve data; after parsing the query language, the query engine can optimize operations such as shortest path planning and data aggregation, and can use distributed computing and caching to improve query performance according to data volume and computing resources, providing high-concurrency, low-latency service. The application layer provides a unified query language and connection methods: for example, clients can connect remotely to the graph server over protocols such as the HyperText Transfer Protocol (HTTP), websocket (a protocol defined by RFC 6455 for full-duplex communication over a single TCP connection) and Remote Procedure Call (RPC), and interact with the query engine through graph query languages such as Gremlin, GSQL and SPARQL.
Referring to FIG. 1, a schematic structural diagram of an embodiment of the graph server of the present application is shown, including a graph database interface 11, a graph data writing interface 12, a graph data query interface 13 and a distributed data storage module 14.
The graph database interface 11 is used to receive a data operation request issued by a user and, according to the type of the data operation request, call the graph data writing interface 12 or the graph data query interface 13 to operate on the distributed data storage module 14.
The types of data operation request include creating or updating node data in the graph database, creating or updating edge data, querying the graph database, and so on. In a specific implementation, the data operation request sent to the graph database interface 11 may use the syntax of a graph data manipulation language such as Gremlin, GSQL or SPARQL. Taking Gremlin as an example, assuming that a node and an edge need to be created in the graph database g, the g.addV() command may be used to issue the node creation request and the g.addE() command to issue the edge creation request.
The graph data writing interface 12 is used to create or update node data or edge data in the distributed data storage module 14 according to the type of data to be written (node data or edge data) in the data operation request, and to return the unique index of that data in the distributed data storage module 14.
The graph data query interface 13 is used to obtain the data stored in the distributed data storage module 14 according to the query conditions in the data operation request, and to return it to the user in a preset node and edge data format.
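As a non-limiting illustration, the cooperation of the graph database interface, the graph data writing interface and the graph data query interface might be sketched in Python as follows. The class names InMemoryStore and GraphServer, the dictionary-based request format and the "node-1" style index are assumptions introduced only for this sketch; the application does not prescribe them.

```python
import itertools
from typing import Any, Dict, List


class InMemoryStore:
    """Hypothetical stand-in for the distributed data storage module (basic operations only)."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, Any]] = {}

    def put(self, key: str, record: Dict[str, Any]) -> None:
        self._records[key] = record

    def scan(self) -> List[Dict[str, Any]]:
        return list(self._records.values())


class GraphServer:
    """Illustrative graph server: one entry point plus a write and a query interface."""

    def __init__(self, store: InMemoryStore) -> None:
        self._store = store
        self._sequence = itertools.count(1)

    def handle(self, request: Dict[str, Any]) -> Any:
        """Graph database interface: dispatch on the type of the data operation request."""
        if request["type"] in ("create", "update"):
            return self.write(request)
        if request["type"] == "query":
            return self.query(request["conditions"])
        raise ValueError(f"unsupported request type: {request['type']!r}")

    def write(self, request: Dict[str, Any]) -> str:
        """Graph data writing interface: create or update a node or edge, return its unique index."""
        kind = request["kind"]  # "node" or "edge"
        key = request.get("id") or f"{kind}-{next(self._sequence)}"
        self._store.put(key, {**request["data"], "id": key, "kind": kind})
        return key

    def query(self, conditions: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Graph data query interface: return records matching every query condition."""
        return [
            record
            for record in self._store.scan()
            if all(record.get(field) == value for field, value in conditions.items())
        ]
```

Because the store in this sketch only exposes put and scan, it could in principle be swapped for another backend without changing the two interfaces.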
For example, consider a social graph database (g) made up of person nodes, where person nodes are connected by friendship relationships. To query the 2-hop neighbors of "Zhang San" in graph object g, taking Gremlin as an example, the following traversal may be used:
    g.V().has('name', 'Zhang San')
      .repeat(bothE().hasLabel('friendship').otherV().hasLabel('person'))
      .times(2)
The example above starts from the node "Zhang San", follows friendship edges to person-type nodes, and repeats this step twice, thereby finding the 2-hop neighbors of "Zhang San".
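To make the semantics of the repeated step concrete, the same 2-hop expansion might be sketched in Python over an in-memory edge list. The function name k_hop_neighbors and the sample names other than "Zhang San" are assumptions for this sketch. Like the traversal above, it does not exclude the start node, which can be reached again by walking out and back along the same edge.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def k_hop_neighbors(edges: Iterable[Tuple[str, str]], start: str, hops: int = 2) -> Set[str]:
    """Return the nodes reached after exactly `hops` steps along undirected edges."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        # bothE(): an edge can be followed in either direction.
        adjacency[a].add(b)
        adjacency[b].add(a)

    frontier = {start}
    for _ in range(hops):
        # One repeat() iteration: move every current node to all of its neighbors.
        frontier = {neighbor for node in frontier for neighbor in adjacency[node]}
    return frontier


friendships = [("Zhang San", "Li Si"), ("Li Si", "Wang Wu"), ("Wang Wu", "Zhao Liu")]
two_hop = k_hop_neighbors(friendships, "Zhang San", hops=2)
```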
The distributed data storage module 14 is a distributed file system or a distributed database, and is used to provide data storage and query services for the graph server 10.
In a specific implementation, a service discovery mechanism for the distributed data storage module 14 may be provided on the graph server 10; at the same time, a heartbeat detection interface is provided on each storage server of the distributed data storage module 14 to report device status to the graph server 10 in real time. When a new storage server joins or an existing storage server exits, the graph server 10 can automatically update the configuration of the distributed data storage module through the service discovery mechanism and switch the storage and query services to the corresponding storage server.
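One possible reading of this mechanism is sketched below in Python. The class name ServiceRegistry, the timeout-based liveness rule and the hash-based routing are assumptions made for illustration; the application does not specify how liveness is judged or how requests are distributed. Timestamps are passed in explicitly so that the behavior is deterministic.

```python
import zlib
from typing import Dict, List


class ServiceRegistry:
    """Illustrative service discovery: storage servers stay live while heartbeats keep arriving."""

    def __init__(self, timeout: float) -> None:
        self._timeout = timeout
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, server: str, now: float) -> None:
        """Called by a storage server's heartbeat interface; also registers new servers."""
        self._last_seen[server] = now

    def live_servers(self, now: float) -> List[str]:
        """Servers whose most recent heartbeat is within the timeout window."""
        return sorted(
            server for server, seen in self._last_seen.items() if now - seen <= self._timeout
        )

    def route(self, key: str, now: float) -> str:
        """Pick a live storage server for a key; the choice changes as servers join or exit."""
        servers = self.live_servers(now)
        if not servers:
            raise RuntimeError("no storage server is currently available")
        return servers[zlib.crc32(key.encode("utf-8")) % len(servers)]


registry = ServiceRegistry(timeout=3.0)
registry.heartbeat("storage-1", now=0.0)
registry.heartbeat("storage-2", now=0.0)
registry.heartbeat("storage-1", now=5.0)  # storage-2 has stopped reporting
```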
The present application encapsulates common graph processing operations in the graph data layer (that is, the graph data writing interface 12 and the graph data query interface 13), so that the underlying data storage module only needs to provide basic operations such as create, read, update and delete; this reduces coupling to the underlying data storage module and makes it replaceable.
The underlying data storage module may define its own storage format according to its own storage structure. For example, if the underlying data storage module is a relational database, nodes and edges may be stored as tables with the following two-dimensional structure:
Node storage structure:
Node id | Node attribute 1 | Node attribute 2
A       | A                | A
Edge storage structure:
Edge id | Source node id | Target node id | Edge attribute 1 | Edge attribute 2
A       | A              | A              | A                | A
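As an illustration of the relational layout, the two tables might be created and queried with Python's built-in sqlite3 module as sketched below. The table names, the concrete attribute columns (label, name, since) and the helper neighbors are assumptions for this sketch, and SQLite merely stands in for whichever relational database is used.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (
        id    TEXT PRIMARY KEY,  -- node id
        label TEXT,              -- node attribute 1
        name  TEXT               -- node attribute 2
    );
    CREATE TABLE edges (
        id        TEXT PRIMARY KEY,                     -- edge id
        source_id TEXT NOT NULL REFERENCES nodes(id),   -- source node id
        target_id TEXT NOT NULL REFERENCES nodes(id),   -- target node id
        label     TEXT,                                 -- edge attribute 1
        since     INTEGER                               -- edge attribute 2
    );
    """
)
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [("n1", "person", "Zhang San"), ("n2", "person", "Li Si")],
)
conn.execute("INSERT INTO edges VALUES (?, ?, ?, ?, ?)", ("e1", "n1", "n2", "friendship", 2018))


def neighbors(connection: sqlite3.Connection, node_id: str) -> List[str]:
    """Ids of nodes connected to node_id by an edge in either direction."""
    rows = connection.execute(
        "SELECT target_id FROM edges WHERE source_id = ? "
        "UNION SELECT source_id FROM edges WHERE target_id = ?",
        (node_id, node_id),
    ).fetchall()
    return sorted(row[0] for row in rows)
```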
If the underlying data storage module stores unstructured data, the following storage form may be used:
Node:
Figure PCTCN2019124555-appb-000001
In a further preferred embodiment, to improve query performance, the graph server is further provided with a query disassembly module, which splits a query request whose complexity exceeds a preset condition into multiple sub-query requests and calls the graph data query interface 13 sequentially or concurrently to fulfil the user's data query request.
For example, for a shortest path query within 5 steps, when the graph data query interface 13 can only complete path queries within 3 steps, the request can be split into two sub-query requests, with the second sub-query taking the output of the first sub-query as its input; the second sub-query can also be further split, as needed, into multiple concurrent calls to the graph data query interface 13.
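The 5-step example can be sketched in Python as follows. The function bounded_bfs stands in for a single call to the graph data query interface and refuses requests beyond 3 hops; shortest_within plays the role of the query disassembly module. Both names, the adjacency-dictionary graph and the use of a thread pool are assumptions for this sketch, which only handles limits of up to twice the per-call maximum.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional

MAX_HOPS_PER_CALL = 3  # assumed capability of one graph data query call


def bounded_bfs(graph: Dict[str, List[str]], start: str, max_hops: int) -> Dict[str, int]:
    """Stand-in for one query call: hop distances from start, up to max_hops."""
    if max_hops > MAX_HOPS_PER_CALL:
        raise ValueError("a single call supports at most 3 hops")
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def shortest_within(graph: Dict[str, List[str]], source: str, target: str, limit: int = 5) -> Optional[int]:
    """Shortest path length within `limit` hops, built from calls of at most 3 hops each."""
    first_hops = min(limit, MAX_HOPS_PER_CALL)
    first = bounded_bfs(graph, source, first_hops)  # first sub-query
    if target in first:
        return first[target]
    remaining = limit - first_hops
    if remaining <= 0:
        return None
    # Second sub-query: its input is the frontier produced by the first one,
    # and it is issued as several concurrent calls, one per frontier node.
    frontier = [node for node, hops in first.items() if hops == first_hops]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda node: bounded_bfs(graph, node, remaining), frontier))
    candidates = [first_hops + result[target] for result in results if target in result]
    return min(candidates) if candidates else None


chain = ["a", "b", "c", "d", "e", "f", "g"]
graph: Dict[str, List[str]] = {name: [] for name in chain}
for left, right in zip(chain, chain[1:]):
    graph[left].append(right)
    graph[right].append(left)
```

The split is sound because any shortest path longer than 3 hops must pass through a node at distance exactly 3 from the source, so searching onward from those frontier nodes cannot miss it.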
By splitting a complex query into multiple calls to the graph data query interface as described above, the present application can extend a single query server of limited capability into multiple query servers operating concurrently, thereby considerably improving the query efficiency of the graph database.
In another further preferred embodiment, to further improve the query response speed for graph data, the graph server is further provided with a memory cache for caching data recently accessed by the user and/or data whose number of query hits is greater than or equal to a preset hot data threshold.
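A memory cache combining both criteria might look like the Python sketch below. The class name HotDataCache, the least-recently-used eviction of the "recent" area and the rule that hot entries are never evicted are assumptions made for illustration; the application only states which data is cached, not how the cache is managed.

```python
from collections import Counter, OrderedDict
from typing import Any, Callable, Dict


class HotDataCache:
    """Illustrative cache: a small recently-used area plus entries promoted once they become hot."""

    def __init__(self, recent_capacity: int, hot_threshold: int) -> None:
        self._capacity = recent_capacity
        self._threshold = hot_threshold
        self._recent: "OrderedDict[str, Any]" = OrderedDict()
        self._hot: Dict[str, Any] = {}
        self._hits: Counter = Counter()

    def get(self, key: str, load: Callable[[str], Any]) -> Any:
        """Return the value for key, calling load(key) only on a cache miss."""
        self._hits[key] += 1
        if key in self._hot:
            return self._hot[key]
        if key in self._recent:
            self._recent.move_to_end(key)  # mark as most recently used
            value = self._recent[key]
        else:
            value = load(key)
            self._recent[key] = value
            if len(self._recent) > self._capacity:
                self._recent.popitem(last=False)  # evict the least recently used entry
        if self._hits[key] >= self._threshold:
            # Hit count reached the hot data threshold: keep the entry permanently.
            self._hot[key] = value
            self._recent.pop(key, None)
        return value


loads = []


def load_from_storage(key: str) -> str:
    """Hypothetical loader standing in for a read from the distributed data storage module."""
    loads.append(key)
    return key.upper()


cache = HotDataCache(recent_capacity=1, hot_threshold=2)
for requested in ["a", "a", "b", "c", "a", "b"]:
    cache.get(requested, load_from_storage)
```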
In high-concurrency scenarios, one optimization is to cache all (or most) of the graph data in memory. In this case the distributed data storage module can be implemented as a distributed in-memory data system plus a distributed persistent storage system. When graph data is written, it is written to the persistent system and the memory cache is updated. When graph data is queried, lookups and computation are completed quickly in the distributed in-memory data system, making data operations efficient. When data becomes invalid or the system restarts, the cached data can be restored from the persistent system to ensure data safety.
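The write, read and recovery paths described above can be sketched in Python as a write-through cache. The class name CachedGraphStore and the use of plain dictionaries for both the persistent system and the in-memory system are assumptions for this sketch; a real deployment would use distributed components for each.

```python
from typing import Any, Dict


class CachedGraphStore:
    """Illustrative write-through cache in front of a persistent store."""

    def __init__(self, persistent: Dict[str, Any]) -> None:
        self._persistent = persistent  # stand-in for the distributed persistent storage system
        self._memory: Dict[str, Any] = {}  # stand-in for the distributed in-memory data system

    def write(self, key: str, value: Any) -> None:
        # Persist first, then update the in-memory copy, so memory never holds unsaved data.
        self._persistent[key] = value
        self._memory[key] = value

    def read(self, key: str) -> Any:
        # Serve from memory when possible; fall back to the persistent system on a miss.
        if key not in self._memory:
            self._memory[key] = self._persistent[key]
        return self._memory[key]

    def cached_keys(self) -> set:
        return set(self._memory)

    def recover(self) -> None:
        """Rebuild the in-memory data from the persistent system, e.g. after a restart."""
        self._memory = dict(self._persistent)


disk: Dict[str, Any] = {}
store = CachedGraphStore(disk)
store.write("node-1", {"label": "person", "name": "Zhang San"})

restarted = CachedGraphStore(disk)  # a fresh process sharing the same persistent data
restarted.recover()
```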
Referring to FIG. 2, a schematic structural diagram of an embodiment of the knowledge graph system of the present application is shown, including a client 20 and the graph server 10 shown in FIG. 1, connected through a network, wherein:
The client 20 is provided with a user interface 21 for receiving a user's data operation request, sending it over the network to the graph database interface of the graph server 10, and receiving and displaying the data operation result from the graph server 10.
In a specific implementation, the user interface 21 may establish a connection with the graph server 10 through a protocol such as HTTP, websocket or RPC; the data operation request may use the syntax of the Gremlin, GSQL or SPARQL language.
In a further preferred embodiment, to support bulk import of raw data in different formats in big data scenarios, the knowledge graph system may further be provided with a data preprocessing module and an intermediate persistent file system, wherein: the data preprocessing module is used to extract structured or unstructured raw data and convert it into node data and/or edge data of the graph database; and the intermediate persistent file system is used to temporarily store the node data and edge data produced by the data preprocessing module.
In a specific implementation, the data preprocessing module may, according to actual needs, be deployed on the client (the second data preprocessing module), on the graph server (the first data preprocessing module), or on both. The intermediate persistent file system may be, as needed, the Hadoop Distributed File System (HDFS), the Simple Storage Service (S3), the Object Storage Service (OSS) or the like.
Taking deployment of the data preprocessing module on the client as an example, for data originating from a relational database, the data model of nodes and edges can be defined by table name or SQL; the metadata is pulled and written to the intermediate persistent file system, and raw data is pulled incrementally from the source database, for example by specifying a timestamp, to generate graph data. For a graph database source, node data and edge data can be exported separately and written to the intermediate file system. The data in the intermediate persistent file system is then read, creation requests for nodes and edges are generated, and, after a connection with the graph server has been established, the requests are sent to the graph server over a protocol such as HTTP to complete the writing of node data and edge data.
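The incremental extraction step might be sketched in Python as follows. The row layout (person, friend, updated_at), the function name extract_graph and the returned watermark are assumptions for this sketch; it shows only the conversion of relational rows into node and edge records, not the writing to an intermediate file system.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def extract_graph(rows: List[Row], since: int = 0) -> Tuple[List[Row], List[Row], int]:
    """Convert rows newer than `since` into node and edge records.

    Returns the nodes, the edges and the latest timestamp seen, which can be
    passed back as `since` on the next run to pull only newer rows.
    """
    nodes: Dict[str, Row] = {}
    edges: List[Row] = []
    latest = since
    for row in rows:
        if row["updated_at"] <= since:
            continue  # already imported in an earlier run
        for person in (row["person"], row["friend"]):
            nodes[person] = {"id": person, "label": "person"}  # de-duplicated by id
        edges.append({"source": row["person"], "target": row["friend"], "label": "friendship"})
        latest = max(latest, row["updated_at"])
    return list(nodes.values()), edges, latest


source_rows = [
    {"person": "Zhang San", "friend": "Li Si", "updated_at": 100},
    {"person": "Li Si", "friend": "Wang Wu", "updated_at": 200},
]
all_nodes, all_edges, watermark = extract_graph(source_rows)
new_nodes, new_edges, _ = extract_graph(source_rows, since=100)
```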
The graph data writing and query processes of the knowledge graph system of the present application are described below with reference to FIG. 3 and FIG. 4 respectively.
Referring to FIG. 3, the graph data storage and modification process of an embodiment of the present application is shown, including:
Step S31: the data preprocessing module extracts structured and unstructured raw data, converts it into node data and/or edge data in graph database form, and stores it in the intermediate persistent file system.
Step S32: the graph database interface is called, and a request to create or update node data and edge data is sent.
The data in the intermediate file system is read, and node and edge creation requests are generated using the syntax of a graph data manipulation language such as Gremlin, GSQL or SPARQL; after a connection with the server has been established, the requests are sent to the graph server over a protocol such as HTTP.
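As an illustration of step S32, creation requests in Gremlin syntax might be generated from staged records as in the Python sketch below. The helper names quote, add_vertex_request and add_edge_request are assumptions, as is the choice to identify people by a name property. Building query text by string concatenation is used here only to show the shape of the requests; a real client would more likely rely on the parameter binding offered by its Gremlin driver.

```python
from typing import Any, Dict


def quote(value: Any) -> str:
    """Render a Python value as a Gremlin literal (strings, booleans and numbers only)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, str):
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"
    return str(value)


def add_vertex_request(label: str, properties: Dict[str, Any]) -> str:
    """Gremlin text that creates one node with the given properties."""
    steps = [f"g.addV({quote(label)})"]
    steps += [f".property({quote(key)}, {quote(value)})" for key, value in properties.items()]
    return "".join(steps)


def add_edge_request(label: str, source_name: str, target_name: str) -> str:
    """Gremlin text that creates one edge between two person nodes looked up by name."""
    return (
        f"g.V().has('person', 'name', {quote(source_name)}).as('a')"
        f".V().has('person', 'name', {quote(target_name)})"
        f".addE({quote(label)}).from('a')"
    )


vertex_request = add_vertex_request("person", {"name": "Zhang San", "age": 30})
edge_request = add_edge_request("friendship", "Zhang San", "Li Si")
```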
Step S33: the graph server responds to and parses the request.
Taking Gremlin as an example, after the graph database interface of the graph server receives the client's request, it parses the request according to Gremlin syntax within the current session. Under Gremlin syntax, a request may contain graph data write, modify or query operations. The graph server calls the corresponding interface according to the type of the requested operation: graph data write and modify requests are handled by calling the graph data writing interface, and graph data query requests are handled by calling the graph data query interface.
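A deliberately naive way to decide which interface a Gremlin request should be routed to is sketched below in Python. It inspects step names with a regular expression rather than using a real Gremlin parser, and the set of steps treated as writes (addV, addE, property, drop) and the function name classify are assumptions made for this sketch.

```python
import re

# Steps assumed, for this sketch, to indicate a write or modify operation.
WRITE_STEPS = {"addV", "addE", "property", "drop"}


def classify(request: str) -> str:
    """Return "write" if the request contains a mutating step, otherwise "query"."""
    # Blank out string literals first so that text inside quotes is not mistaken for a step.
    without_strings = re.sub(r"'(?:\\.|[^'\\])*'", "''", request)
    steps = set(re.findall(r"\b([A-Za-z_]\w*)\s*\(", without_strings))
    return "write" if steps & WRITE_STEPS else "query"


def dispatch(request: str) -> str:
    """Name of the interface that would handle the request."""
    return {
        "write": "graph data writing interface",
        "query": "graph data query interface",
    }[classify(request)]
```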
The graph data writing interface persists the data according to the type of data written (node or edge) and returns the unique index of the data in the distributed data storage module.
The graph data query interface locates the data stored in the distributed data storage module according to the conditions of the incoming query, converts it into the preset data format (node, edge) and returns it.
Using the current configuration and the service discovery mechanism, the graph server establishes a connection with each storage server of the distributed data storage module at startup, and continuously sends heartbeat checks to monitor the availability of the storage servers. When the graph data writing interface receives a write request, it serializes the information to be written and sends it to the distributed data storage module.
Step S34: the distributed data storage module writes the graph data to a file system or other persistent storage, completing persistence.
Any storage system provided with the storage layer interface can serve as the distributed data storage module; for example, it may be a distributed database or a distributed file system. After starting, a storage server needs to register with the service discovery mechanism (to ensure that the graph server can discover it) and provide a heartbeat detection interface to report device status in real time. Data is replicated redundantly between storage servers to ensure fault tolerance. When a new node joins or an old node exits, the graph server can automatically update the configuration of the distributed data storage module through the service discovery mechanism and switch to the corresponding storage server.
Referring to FIG. 4, the graph data query process of an embodiment of the present application is shown, including:
Step S41: the user initiates a graph data query request through the user interface of the client.
Step S42: the graph database interface of the graph server responds to and parses the query request.
For a complex query, the graph database interface may need to split it into multiple calls to the graph data query interface. For example, a shortest path query within 5 steps can be split into two sub-query requests, with the second sub-query taking the output of the first sub-query as its input.
The graph data query interface accesses data on disk or cached in memory according to the query conditions and the indexes that have been built.
In a specific implementation, aggregation operations such as count and avg can be pushed down to the database of the distributed data storage module for computation. Graph computations such as sub-graph operations and shortest path queries, however, require querying the database multiple times and holding the retrieved data in memory for further computation.
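The contrast between the two kinds of operation can be illustrated in Python with the built-in sqlite3 module standing in for the underlying database. The table layout and the function names average_age and shortest_path_length are assumptions for this sketch. The aggregate is computed entirely inside the database, while the shortest path search issues one lookup per visited node and keeps the distances found so far in memory.

```python
import sqlite3
from collections import deque
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (id TEXT PRIMARY KEY, age INTEGER);
    CREATE TABLE edges (source_id TEXT, target_id TEXT);
    INSERT INTO nodes VALUES ('n1', 30), ('n2', 40), ('n3', 50);
    INSERT INTO edges VALUES ('n1', 'n2'), ('n2', 'n3');
    """
)


def average_age(connection: sqlite3.Connection) -> float:
    """Aggregation pushed down: the database computes the average itself."""
    return connection.execute("SELECT AVG(age) FROM nodes").fetchone()[0]


def shortest_path_length(connection: sqlite3.Connection, source: str, target: str) -> Optional[int]:
    """Graph computation: repeated neighbor lookups, with intermediate state held in memory."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distance[node]
        rows = connection.execute(
            "SELECT target_id FROM edges WHERE source_id = ? "
            "UNION SELECT source_id FROM edges WHERE target_id = ?",
            (node, node),
        ).fetchall()
        for (neighbor,) in rows:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return None
```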
By building the virtual graph data layer described above (that is, the graph data writing interface and the graph data query interface), the present application lowers the data support requirements placed on the underlying database. Because common graph processing operations are encapsulated in the graph data layer, the underlying database only needs to provide basic operations such as create, read, update and delete, which reduces coupling to the underlying database and makes it replaceable.
It should be noted that the above device embodiments are preferred embodiments, and the units and modules involved are not necessarily required by the present application. The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and for parts that are the same or similar the embodiments may be referred to one another. The embodiments described above are merely illustrative: modules described as separate components may or may not be physically separate, and may be located in one place or distributed across multiple network units (taking the data preprocessing module in the above system embodiment as an example, it may be deployed on the client, on the graph server, or on both, according to actual needs). Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
Specific examples are used herein to explain the principle and implementation of the present application; the description of the above embodiments is only intended to help in understanding the method of the present application and its core idea. At the same time, for those of ordinary skill in the art, changes may be made to the specific implementation and scope of application in accordance with the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

  1. A graph server, including a graph database interface, a graph data writing interface, a graph data query interface and a distributed data storage module, wherein:
    the graph database interface is configured to receive a user's data operation request and, according to the type of the data operation request, call the graph data writing interface or the graph data query interface to operate on the distributed data storage module;
    the graph data writing interface is configured to create or update node or edge data in the distributed data storage module according to the type of data to be written in the data operation request, and to return the unique index of the data in the distributed data storage module;
    the graph data query interface is configured to obtain the data stored in the distributed data storage module according to the query conditions in the data operation request, and to return it to the user in a preset node and edge data format;
    the distributed data storage module is a distributed file system or a distributed database, configured to provide data storage and query services for the graph server.
  2. The graph server according to claim 1, wherein the graph server further includes a query disassembly module configured to split a query request whose complexity exceeds a preset condition into multiple sub-query requests, and to call the graph data query interface sequentially or concurrently to fulfil the user's data query request.
  3. The graph server according to claim 1, wherein the graph server is further provided with a memory cache for caching data recently accessed by the user and/or data whose number of query hits is greater than or equal to a preset hot data threshold.
  4. The graph server according to claim 1, wherein the graph server is provided with a service discovery mechanism for the distributed data storage module; each storage server of the distributed data storage module is provided with a heartbeat detection interface that reports device status to the graph server in real time; when a new storage server joins or an existing storage server exits, the graph server automatically updates the configuration of the distributed data storage module through the service discovery mechanism and switches the storage and query services to the corresponding storage server.
  5. The graph server according to claim 1, wherein the graph server further includes a first data preprocessing module configured to extract structured or unstructured raw data and convert it into node data and/or edge data of the graph database.
  6. A knowledge graph system, including a client and the graph server according to any one of claims 1 to 5; the client is connected to the graph server through a network;
    the client includes a user interface configured to receive a user's data operation request, send it over the network to the graph database interface of the graph server, and receive and display the data operation result from the graph server.
  7. 根据权利要求6所述的知识图谱系统,其中,所述客户端还包括第二数据预处理模块,配置为将结构化或非结构化的原始数据转换为图数据库的节点数据或边数据。The knowledge graph system according to claim 6, wherein the client further includes a second data preprocessing module configured to convert the structured or unstructured original data into node data or edge data of the graph database.
  8. The knowledge graph system according to claim 7, wherein the client further comprises an intermediate persistent file system configured to temporarily store the node data and edge data processed by the second data preprocessing module.
  9. The knowledge graph system according to claim 6, wherein the user interface establishes a connection with the graph server via the Hypertext Transfer Protocol (HTTP), the WebSocket protocol, or a remote procedure call (RPC) protocol.
  10. The knowledge graph system according to claim 6, wherein the data operation request uses the syntax of the Gremlin, GSQL, or SPARQL language.
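
The memory cache of claim 3 can be illustrated with a minimal sketch. The claim does not specify an eviction policy or data layout, so everything below is an assumption for illustration: the class name `HotRecentCache`, the `recent_capacity` parameter, the least-recently-used eviction of the "recent" set, and the `load` callback standing in for the storage layer are all hypothetical.

```python
from collections import OrderedDict


class HotRecentCache:
    """Keeps recently accessed entries plus entries whose hit count reaches a threshold."""

    def __init__(self, recent_capacity, hot_threshold):
        self.recent_capacity = recent_capacity  # hypothetical size limit for recent data
        self.hot_threshold = hot_threshold      # stands in for the preset hot-data threshold
        self.recent = OrderedDict()             # key -> value, least recently used first
        self.hot = {}                           # key -> value, kept regardless of recency
        self.hits = {}                          # key -> number of query hits so far

    def get(self, key, load):
        """Return the value for key, calling load(key) only on a cache miss."""
        self.hits[key] = self.hits.get(key, 0) + 1
        if key in self.hot:
            return self.hot[key]
        if key in self.recent:
            value = self.recent.pop(key)
        else:
            value = load(key)  # fall through to the storage layer
        if self.hits[key] >= self.hot_threshold:
            self.hot[key] = value  # hit count reached the threshold: promote to hot
        else:
            self.recent[key] = value  # re-insert as most recently used
            if len(self.recent) > self.recent_capacity:
                self.recent.popitem(last=False)  # evict the least recently used entry
        return value
```

In this sketch an entry that reaches the threshold is never evicted, whereas a merely recent entry can be pushed out by newer accesses; a real deployment would presumably also bound the hot set.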
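
The heartbeat-based service discovery of claim 4 might look roughly like the following sketch. The claim states only that storage servers report their status and that the graph server updates its configuration when servers join or exit; the `StorageRegistry` class, the timeout rule for deciding that a server has exited, and the byte-sum routing function are assumptions introduced here, not details from the application.

```python
class StorageRegistry:
    """Tracks live storage servers from the heartbeats they send."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds  # hypothetical liveness window
        self.last_seen = {}                     # server id -> time of last heartbeat

    def heartbeat(self, server_id, now):
        """Record a heartbeat; a previously unknown server thereby joins."""
        self.last_seen[server_id] = now

    def live_servers(self, now):
        """Drop servers whose heartbeat is overdue and return the rest, sorted."""
        self.last_seen = {
            sid: seen
            for sid, seen in self.last_seen.items()
            if now - seen <= self.timeout_seconds
        }
        return sorted(self.last_seen)

    def route(self, key, now):
        """Pick a live server for a key using a simple deterministic hash."""
        servers = self.live_servers(now)
        if not servers:
            raise RuntimeError("no storage server available")
        return servers[sum(key.encode("utf-8")) % len(servers)]
```

Timestamps are passed in explicitly so the behaviour is easy to reason about: a server that stops sending heartbeats disappears from `live_servers` once the window elapses, and subsequent `route` calls only return servers that are still reporting.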
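
The conversion performed by the data preprocessing modules of claims 5 and 7 can be sketched for the structured case. The input format (subject, relation, object rows), the function name `rows_to_graph`, and the shape of the node and edge records are hypothetical; the application does not define them, and extraction from unstructured data is not covered here.

```python
def rows_to_graph(rows):
    """Convert (subject, relation, object) rows into node and edge records."""
    nodes = {}  # entity name -> node record, in order of first appearance
    edges = []
    for subject, relation, obj in rows:
        for name in (subject, obj):
            if name not in nodes:
                # Assign a sequential identifier the first time an entity is seen.
                nodes[name] = {"id": len(nodes) + 1, "name": name}
        edges.append({
            "source": nodes[subject]["id"],
            "target": nodes[obj]["id"],
            "label": relation,
        })
    return list(nodes.values()), edges
```

Each distinct entity becomes one node, and each input row becomes one edge that refers to its endpoints by identifier, which is the form a graph database import typically expects.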
PCT/CN2019/124555 2018-12-29 2019-12-11 Knowledge mapping system and map server thereof WO2020135050A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811635242.2A CN109670089A (en) 2018-12-29 2018-12-29 Knowledge mapping system and its figure server
CN201811635242.2 2018-12-29

Publications (1)

Publication Number Publication Date
WO2020135050A1 true WO2020135050A1 (en) 2020-07-02

Family

ID=66147029

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/124555 WO2020135050A1 (en) 2018-12-29 2019-12-11 Knowledge mapping system and map server thereof

Country Status (2)

Country Link
CN (1) CN109670089A (en)
WO (1) WO2020135050A1 (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670089A (en) * 2018-12-29 2019-04-23 颖投信息科技(上海)有限公司 Knowledge mapping system and its figure server
CN110427359A (en) * 2019-06-27 2019-11-08 苏州浪潮智能科技有限公司 A kind of diagram data treating method and apparatus
CN110347711B (en) * 2019-07-10 2022-02-08 北京百度网讯科技有限公司 Fragment storage graph database query method and device
CN110489986B (en) * 2019-08-22 2021-03-23 网易(杭州)网络有限公司 Response method and system of graph data function and electronic equipment
CN110717056B (en) * 2019-09-03 2023-05-23 平安科技(深圳)有限公司 Update maintenance method and device for Neo4j graph database and computer readable storage medium
CN110598059B (en) * 2019-09-16 2022-07-05 北京百度网讯科技有限公司 Database operation method and device
CN110941619B (en) * 2019-12-02 2023-05-16 浪潮软件股份有限公司 Definition method of graph data storage model and structure for various usage scenes
CN111090653B (en) * 2019-12-20 2023-12-15 东软集团股份有限公司 Data caching method and device and related products
CN111177189B (en) * 2019-12-20 2024-04-05 北京航天云路有限公司 Client optimization system and method based on user behavior analysis
CN111177478A (en) * 2019-12-24 2020-05-19 北京明略软件系统有限公司 Query method, device and system
CN111274333A (en) * 2020-01-20 2020-06-12 北京明略软件系统有限公司 Map relation updating method, device, server and storage medium
CN111309750A (en) * 2020-03-31 2020-06-19 中国邮政储蓄银行股份有限公司 Data updating method and device for graph database
CN111538854B (en) * 2020-04-27 2023-08-08 北京百度网讯科技有限公司 Searching method and device
CN111897971B (en) * 2020-07-29 2023-04-07 中国电力科学研究院有限公司 Knowledge graph management method and system suitable for field of power grid dispatching control
CN112182238B (en) * 2020-09-22 2022-12-27 苏州浪潮智能科技有限公司 Knowledge graph construction system and method based on graph database
CN112256927B (en) * 2020-10-21 2024-06-04 网易(杭州)网络有限公司 Knowledge graph data processing method and device based on attribute graph
CN113177142A (en) * 2021-03-23 2021-07-27 杭州费尔斯通科技有限公司 Method, system, equipment and storage medium for storing extended graph database
CN113468275A (en) * 2021-07-28 2021-10-01 浙江大华技术股份有限公司 Data importing method and device of graph database, storage medium and electronic equipment
CN115203488B (en) * 2022-09-15 2022-12-06 国网智能电网研究院有限公司 Graph database management method and device and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105210058A (en) * 2012-12-14 2015-12-30 微软技术许可有限责任公司 Graph query processing using plurality of engines
CN106354729A (en) * 2015-07-16 2017-01-25 阿里巴巴集团控股有限公司 Graph data handling method, device and system
CN206003092U (en) * 2016-05-30 2017-03-08 深圳市华傲数据技术有限公司 Chart database system
CN106484824A (en) * 2016-09-28 2017-03-08 华东师范大学 Knowledge mapping isomery storing framework middleware based on multivariate data storehouse supporting assembly
CN107832323A (en) * 2017-09-14 2018-03-23 北京知道未来信息技术有限公司 A kind of distributed implementation system and method based on chart database
US10102291B1 (en) * 2015-07-06 2018-10-16 Google Llc Computerized systems and methods for building knowledge bases using context clouds
CN109670089A (en) * 2018-12-29 2019-04-23 颖投信息科技(上海)有限公司 Knowledge mapping system and its figure server

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6076092A (en) * 1997-08-19 2000-06-13 Sun Microsystems, Inc. System and process for providing improved database interfacing using query objects
CN103425793B (en) * 2013-08-28 2017-03-01 五八同城信息技术有限公司 Method for utilizing database purchase layer to access data base in instant communicating system
CN104573086A (en) * 2015-01-28 2015-04-29 浪潮集团有限公司 Database access component and generating method thereof

Also Published As

Publication number Publication date
CN109670089A (en) 2019-04-23

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 19904639

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 19904639

Country of ref document: EP

Kind code of ref document: A1