WO2023131218A1 - Graph data storage

Info

Publication number: WO2023131218A1
Application number: PCT/CN2023/070606
Authority: WO
Inventors: 张达
Original assignee: 支付宝(杭州)信息技术有限公司
Priority date: 2022-01-07
Filing date: 2023-01-05
Publication date: 2023-07-13
Also published as: CN114077680B; CN114077680A

Abstract

The present description relates to a graph data storage method, system and apparatus. Graph data comprises a node and an edge. The storage method comprises: storing, in a node table of a data block, node information of several nodes in graph data, wherein the node information comprises node identifiers; storing, in an edge table of the data block, edge information of edges of the several nodes, wherein the edge information comprises node identifiers of target nodes connected to the edges; storing, in a node attribute table of the data block, attribute information of the several nodes; and storing, in an edge attribute table of the data block, attribute information of the edges of the several nodes.

Description

Storage of graph data

Technical field

One or more embodiments of this specification relate to the field of computers, and in particular to a method, system, and device for storing graph data.

Background

Currently, various databases can be used to store and manage graph data. With the emergence of new Internet applications such as social networks, the mobile Internet and the Internet of Things (IoT), the interaction data generated by various entities (such as users, systems and sensors) is increasing exponentially, and the scale and complexity of graph data have increased significantly. When storing and managing massive and complex graph data, a database needs high read and write efficiency to support graph processing operations such as data traversal, relationship queries, and expansion of one-hop subgraphs (a one-hop subgraph is the subgraph formed by a node and the edges connected to that node).

Therefore, there is an urgent need for a graph data storage method, system, and device to realize functions such as efficient storage of graph data and complex relational query of graph data.

Summary of the invention

One aspect of this specification provides a method for storing graph data, where the graph data includes nodes and edges. The storage method includes: storing node information of several nodes in the graph data in a point table of a data block, the node information including a node identifier; storing edge information of the edges of the several nodes in an edge table of the data block, the edge information including the node identifier of the target node connected to the edge; storing attribute information of the several nodes in a point attribute table of the data block; and storing attribute information of the edges of the several nodes in an edge attribute table of the data block.

Another aspect of this specification provides a graph data storage system, where the graph data includes nodes and edges. The storage system includes: a node information storage module, used to store node information of several nodes in the graph data in a point table of a data block, the node information including a node identifier; an edge information storage module, used to store edge information of the edges of the several nodes in an edge table of the data block, the edge information including the node identifier of the target node connected to the edge; a node attribute information storage module, used to store attribute information of the several nodes in a point attribute table of the data block; and an edge attribute information storage module, used to store attribute information of the edges of the several nodes in an edge attribute table of the data block.

Another aspect of this specification provides a graph data storage device. The device includes a processor and a memory; the memory is used to store instructions, and the processor is used to execute the instructions to implement the graph data storage method.

Another aspect of this specification provides a graph data file, where the graph data includes nodes and edges. The file includes several data blocks, and each data block includes: a point table, used to store node information of at least some nodes in the graph data, the node information including a node identifier; an edge table, used to store edge information of the edges of those nodes, the edge information including the node identifier of the target node connected to the edge; a point attribute table, used to store attribute information of those nodes; and an edge attribute table, used to store attribute information of the edges of those nodes.

Description of the drawings

This specification will be further described in terms of exemplary embodiments, which will be described in detail with the accompanying drawings. These examples are non-limiting, and in these examples, the same number indicates the same structure, wherein:

Fig. 1 is a schematic diagram of an application scenario of an exemplary graph data storage system according to some embodiments of the present specification;

Fig. 2 is a schematic diagram of a point table according to some embodiments of the present specification;

Fig. 3 is a schematic diagram of an edge table according to some embodiments of the present specification;

Fig. 4 is a schematic diagram of a point/edge attribute table according to some embodiments of the present specification;

Fig. 5 is a system block diagram of graph data storage according to some embodiments of the present specification;

Fig. 6 is a schematic diagram of a data block structure according to some embodiments of this specification;

Fig. 7 is an exemplary flow chart of graph data storage according to some embodiments of the present specification;

Fig. 8 is an exemplary flow chart of querying graph data according to some embodiments of the present specification.

Detailed description

In order to more clearly illustrate the technical solutions of the embodiments of the present specification, the following briefly introduces the drawings that need to be used in the description of the embodiments. Apparently, the accompanying drawings in the following description are only some examples or embodiments of this specification, and those skilled in the art can also apply this specification to other similar scenarios. Unless otherwise apparent from context or otherwise indicated, like reference numerals in the figures represent like structures or operations.

It should be understood that "system", "device", "unit" and/or "module" as used in this specification are a way of distinguishing different components, elements, parts, sections or assemblies at different levels. However, these words may be replaced by other expressions if the other expressions achieve the same purpose.

As indicated in the specification and claims, the terms "a", "an" and/or "the" are not specific to the singular and may include the plural unless the context clearly indicates an exception. Generally speaking, the terms "comprising" and "including" only suggest the inclusion of explicitly identified steps and elements; these steps and elements do not constitute an exclusive list, and the method or device may also contain other steps or elements.

Flowcharts are used in this specification to illustrate the operations performed by the system according to the embodiments of this specification. It should be understood that the preceding or following operations are not necessarily performed in the exact order shown. Instead, the steps may be processed in reverse order or simultaneously. Other operations can also be added to these procedures, or one or more steps can be removed from them.

Fig. 1 is a schematic diagram of an application scenario of an exemplary graph data storage system according to some embodiments of the present specification.

With the emergence of new Internet applications such as social networks, the mobile Internet, and the Internet of Things (IoT), the data generated between different entities (such as users, systems, and sensors) is increasing exponentially, and the internal dependencies and complexity of the data are increasing as well. Graph data is commonly used to describe and characterize the relationships between different entities. Graph data is composed of multiple nodes and the edges connecting them. The nodes represent entities, and the edges between nodes represent the relationships between entities. Entities can be real objects or institutions in the physical world, or abstract concepts; examples include companies, equipment, people, goods, storage locations, means of transportation, images, computer programs, and accounts. Entities can have attribute information. Taking a "person" entity as an example, the attribute information may include age, gender, occupation, employer, home address, and so on. For a company, the attribute information may include the registered address, legal representative, business scope, registered capital, and other information. Edges between entities (that is, edge information) reflect the relationships between entities. For example, there may be an employment relationship between a person entity and a company entity, and there may be a friend relationship between Zhang San and Li Si. Edges can also have attribute information. For example, the attribute information of an employment relationship can include the time it was established and the employment type (formal or temporary).

With the development of Internet technology, the scale of graph data is getting larger and larger. How to store graph data so as to efficiently call the stored data has become a problem to be solved.

In some embodiments, the graph data can be stored in a relational database, which stores the nodes and edges of the graph data separately. However, relational databases are poorly suited to storing graph data. For example, because the graph data is huge, it has to be sharded across separate databases and tables, so the nodes and the edges of those nodes are split up and stored apart. When querying the graph data, it is then necessary to interact with different databases (for example, different storage devices), or to perform multiple reads and writes, to obtain the target query nodes and their edges.

To make up for the above shortcomings of relational databases, some embodiments propose a graph data storage method based on a graph database. In a graph database, the relationships between data play an important role, and the database can store massive and complex data together with the relationships among that data. Specifically, such a graph database divides the nodes and edges of the graph data among different key-value (KV) storage engines and builds a proxy layer on top of them to provide graph query services. However, on the one hand, because of the added proxy layer, data needs to be cached in different data areas multiple times during a query, which increases the complexity of the whole query process. On the other hand, since nodes and edges are stored separately, retrieving a one-hop subgraph (that is, the subgraph formed by a node, the edges connected to the node, and the nodes at the other end of those edges) requires querying the node and all the edges connected to it separately. In other words, obtaining the result of a one-hop subgraph query requires many read and write operations, so retrieval efficiency is very low. In addition, to keep this query process efficient, the graph database needs an independent server cluster for deployment, operation and maintenance, so that enough memory is available for the multiple read and write operations of a graph query. This leads to high equipment operation and maintenance costs.

To address the shortcomings of the above technologies, some embodiments of this specification provide a storage method for graph data, including: storing the node information, edge information, node attribute information, and edge attribute information of several nodes in the graph data in, respectively, the point table, edge table, point attribute table and edge attribute table of the same data block. In this way, the node information and edge information of the relevant nodes can be obtained by reading the data block once, which effectively reduces the number of reads and writes during graph processing. For example, when a one-hop subgraph query is required, the data block needs to be read only once, and query efficiency is significantly improved.
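By way of a non-limiting illustration, the idea of keeping all four tables in one data block can be sketched as follows. This is a minimal in-memory sketch, not the on-disk format of the embodiments; the class name `DataBlock`, its field names, and the methods `add_node` and `one_hop` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DataBlock:
    # Hypothetical in-memory layout; all names are illustrative assumptions.
    point_table: list = field(default_factory=list)       # node identifiers
    edge_table: list = field(default_factory=list)        # per node: target node identifiers
    point_attr_table: list = field(default_factory=list)  # per node: attribute dict
    edge_attr_table: list = field(default_factory=list)   # per node: list of edge attribute dicts

    def add_node(self, node_id, attrs, edges):
        """Store a node, its attributes and its edges [(target_id, edge_attrs), ...] together."""
        self.point_table.append(node_id)
        self.point_attr_table.append(attrs)
        self.edge_table.append([target for target, _ in edges])
        self.edge_attr_table.append([edge_attrs for _, edge_attrs in edges])

    def one_hop(self, node_id):
        """Return (node attrs, [(target_id, edge_attrs), ...]) using only this block."""
        i = self.point_table.index(node_id)
        return self.point_attr_table[i], list(zip(self.edge_table[i], self.edge_attr_table[i]))


block = DataBlock()
block.add_node(1, {"type": "person"}, [(2, {"relation": "employment"})])
block.add_node(2, {"type": "company"}, [])
```

Because everything a one-hop query needs sits in the same object, `block.one_hop(1)` answers the query without touching any other store, which mirrors the single read described above.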

In some embodiments of this specification, the storage order of the edges in the edge table can be made consistent with the storage order of the several nodes in the point table; the storage order of the attribute information of the several nodes in the point attribute table can be made consistent with the storage order of the several nodes in the point table; and the storage order of the edge attribute information in the edge attribute table can be made consistent with the storage order of the edges of the several nodes in the edge table. In this way, the point table, edge table and attribute tables are aligned with one another. After node A is found, the positions of all edges of node A in the edge table can be determined quickly, and the attribute information of those edges in the edge attribute table can then be located quickly. Such an arrangement avoids excessive data reading, writing and caching during a graph query, so the whole process does not require a resident service cluster to support it.

It should be noted that, in the embodiments of this specification, the graph data is stored in multiple data blocks sequentially, and the information of a node and the information of its edges are stored in the same data block. Large-scale graph data can therefore be stored in multiple data blocks or multiple graph files (a graph file contains multiple data blocks). This enables one or more embodiments of this specification to distribute the storage of graph data across multiple devices and to support parallel queries (for example, different devices query different data blocks), which further improves query efficiency.

In some embodiments, the application scenario of the graph data storage system is shown in FIG. 1; the scenario 100 may include a storage device 110-1, a storage device 110-2, ...

The storage device 110-1, the storage device 110-2, the storage device 110-3, ... may include a processor and a mass storage device, a removable memory, a volatile read-write memory, a read-only memory (ROM), etc., or any combination thereof, for data storage, management of resources, and processing of data and/or information from at least one component of the system or from external data sources (e.g., cloud data centers). In some embodiments, each of storage device 110-1, storage device 110-2, storage device 110-3, ... may be a single server or a group of servers. The server group may be centralized or distributed (for example, the storage device 110-1 may be a distributed system), may be dedicated, or may be provided concurrently by other devices or systems. In some embodiments, storage device 110-1, storage device 110-2, storage device 110-3, ... may be local or remote. In some embodiments, the storage device 110-1, the storage device 110-2, the storage device 110-3, ... may be implemented on a cloud platform, or provided in a virtual manner. By way of example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-layer cloud, etc., or any combination thereof.

In some embodiments, any one or more of storage device 110-1, storage device 110-2, ..., storage device 110-n can store one or more graph files and support parallel querying of graph data. A graph file may include multiple data blocks, and each data block is used to store the node information, the edge information, and the attribute information of the nodes and edges for all or part of the nodes in the graph data. Specifically, reference numeral 200 in FIG. 1 shows a typical data block structure: each data block includes a point table 210, an edge table 220, a point attribute table 230, an edge attribute table 240 and a table element 250.

The processing device 120 can generate or acquire graph data, write the graph data into multiple data blocks or multiple graph files, and distribute the multiple data blocks or graph files to the storage device 110-1, storage device 110-2, ..., storage device 110-n for storage. In some embodiments, the processing device 120 can obtain a query request and distribute it to each storage device, so that each storage device performs the query on its locally stored graph data or data blocks and returns the query result to the processing device 120. In some embodiments, when the scale of the graph data is not large, a single storage device may be used to store the graph files, and in this case the processing device 120 may be omitted.
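As a rough illustration of how a processing device might fan a query out to several storage devices and collect the result, consider the sketch below. It is only an assumption-laden stand-in: each "storage device" is modelled as a plain dictionary, and the names `storage_devices`, `query_device` and `distributed_query` are hypothetical. A thread pool is used here purely for convenience; the specification does not prescribe any particular concurrency mechanism.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for storage devices: each maps node_id -> target node ids.
storage_devices = [
    {1: [2, 3], 2: [1]},
    {3: [1], 4: []},
]


def query_device(device, node_id):
    """Query one device's local data; returns None when the node is not stored there."""
    return device.get(node_id)


def distributed_query(node_id, devices):
    """Send the same query to every device in parallel and return the first hit."""
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        results = list(pool.map(lambda d: query_device(d, node_id), devices))
    hits = [r for r in results if r is not None]
    return hits[0] if hits else None
```

Calling `distributed_query(3, storage_devices)` returns the edge list held by the second device, while a node that no device stores yields `None`.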

In some embodiments, the scenario 100 may also include a network (not shown in the figure). The network can connect the components of the system and/or connect the system with external parts. The network enables communication between the components of the system and between the system and external parts, facilitating the exchange of data and/or information. In some embodiments, the network may be any one or more of a wired network or a wireless network. For example, the network may include a cable network, a fiber optic network, a telecommunications network, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth network, a ZigBee network, near field communication (NFC), an internal bus, an internal line, a cable connection, etc., or any combination thereof. In some embodiments, the network connection between the parts of the system may use one of the above methods or several of them. In some embodiments, the network may have various topologies such as point-to-point, shared, and centralized, or a combination of multiple topologies.

Fig. 5 is a system block diagram of graph data storage according to some embodiments of the present specification.

As shown in FIG. 5, the system 500 is arranged on any processing device that can execute programs (such as any one of storage device 110-1, storage device 110-2, ..., storage device 110-n in FIG. 1), and specifically includes: a node information storage module 510, used to store the node information of several nodes in the graph data in the point table of a data block, the node information including a node identifier; an edge information storage module 520, used to store the edge information of the edges of the several nodes in the edge table of the data block, the edge information including the node identifier of the target node connected to the edge; a node attribute information storage module 530, used to store the attribute information of the several nodes in the point attribute table of the data block; and an edge attribute information storage module 540, used to store the attribute information of the edges of the several nodes in the edge attribute table of the data block.

In some embodiments, the storage order of the edges of the several nodes in the edge table is consistent with the storage order of the several nodes in the point table; the storage order of the attribute information of the several nodes in the point attribute table is consistent with the storage order of the several nodes in the point table; and the storage order of the edge attribute information of the several nodes in the edge attribute table is consistent with the storage order of the edges of the several nodes in the edge table.

In some embodiments, the edge table includes an edge table index area and an edge table data area. The edge information of the edges of the several nodes is stored in the edge table data area. The edge table index area stores the edge index information of the several nodes, and the edge index information includes the storage address information, in the edge table data area, of the edge information of the corresponding node. The storage order of the edge index information of the several nodes is consistent with the storage order of the several nodes in the point table.

In some embodiments, the node information further includes storage address information of edges of nodes, and the storage address information of edges in the point table is storage address information of index information corresponding to edges in the edge table.

In some embodiments, the edge information of different edges of the same node is stored contiguously in the edge table data area, and the storage order of the edge information of the edges of the several nodes is consistent with the storage order of the several nodes in the point table.
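The index area and data area of the edge table can be illustrated with the following minimal sketch. It assumes, for illustration only, that the "storage address information" is an (offset, count) pair into a flat list of target node identifiers; the function names `build_edge_table` and `edges_of` are hypothetical.

```python
def build_edge_table(adjacency):
    """Build (index_area, data_area) from [(node_id, [target_id, ...]), ...].

    The input is expected in point-table order, so index_area[i] describes the
    edges of the i-th node of the point table as an (offset, count) pair.
    """
    index_area, data_area = [], []
    for _node_id, targets in adjacency:
        index_area.append((len(data_area), len(targets)))
        data_area.extend(targets)  # edges of one node are stored contiguously
    return index_area, data_area


def edges_of(position, index_area, data_area):
    """Return the target node ids of the node at the given point-table position."""
    offset, count = index_area[position]
    return data_area[offset:offset + count]


adjacency = [(10, [11, 12]), (11, []), (12, [10])]
index_area, data_area = build_edge_table(adjacency)
```

Because the index area follows the point-table order, the position of a node in the point table is enough to find its index entry, and a single slice of the data area then returns all of its edges.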

In some embodiments, the edge index information also includes the edge type; the edge information also includes the node type of the target node; the edge information of the same node is stored sequentially in the edge table data area according to the edge type.

In some embodiments, the edge attribute table includes an edge attribute table index area and an edge attribute table data area. The attribute information of the edges of the several nodes is stored in the edge attribute table data area. The edge attribute table index area stores the edge attribute index information of the edges of the several nodes, and the edge attribute index information includes the storage address information of the edge attribute information in the edge attribute table data area. The storage order of the edge attribute index information of the edges of the several nodes is consistent with the storage order of the edge information of those edges in the edge table data area.

In some embodiments, the node information further includes node types, and the node information of the several nodes is stored in the point table in order of node identifier.

In some embodiments, the point attribute table includes a point attribute table index area and a point attribute table data area. The attribute information of the several nodes is stored in the point attribute table data area. The point attribute table index area stores the node attribute index information of the several nodes, and the node attribute index information includes the storage address information of the attribute information of the node in the point attribute table data area. The storage order of the node attribute index information of the several nodes is consistent with the storage order of the several nodes in the point table.
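An attribute table with an index area and a data area can be sketched in the same spirit. The sketch below assumes, purely for illustration, that each attribute record is serialized as JSON and that the index entry is a (byte offset, byte length) pair; the specification does not state a serialization format, and the names `build_attr_table` and `read_attrs` are hypothetical. The same layout could serve for the edge attribute table.

```python
import json


def build_attr_table(attr_dicts):
    """Serialize attribute dicts (in point-table order) into (index_area, data_area)."""
    index_area, data_area = [], bytearray()
    for attrs in attr_dicts:
        payload = json.dumps(attrs, sort_keys=True).encode("utf-8")
        index_area.append((len(data_area), len(payload)))  # (byte offset, byte length)
        data_area += payload
    return index_area, bytes(data_area)


def read_attrs(position, index_area, data_area):
    """Decode the attributes of the node at the given point-table position."""
    offset, length = index_area[position]
    return json.loads(data_area[offset:offset + length].decode("utf-8"))


attrs_list = [{"age": 30}, {"registered_capital": 100, "scope": "retail"}]
attr_index, attr_data = build_attr_table(attrs_list)
```

Variable-length records are the reason an index area is useful here: without it, the reader would have to scan the data area from the start to find the i-th record.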

In some embodiments, the system 500 further includes a table element generation module 550, which is used to generate the table element of the data block. The table element includes the storage address information of each table in the data block and the node identifier of the first node in the point table of the data block.
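One possible way to picture a table element is as a small fixed-size header that records where each table starts. The sketch below is an assumption, not the format of the embodiments: it places the table element at the start of the block, encodes it as five little-endian 32-bit unsigned integers (four table offsets plus the first node identifier), and uses the hypothetical names `HEADER`, `pack_block` and `read_table`.

```python
import struct

# Assumed layout: offsets of the point, edge, point attribute and edge attribute
# tables, followed by the identifier of the first node in the point table.
HEADER = struct.Struct("<5I")


def pack_block(point, edge, point_attr, edge_attr, first_node_id):
    """Concatenate four already-serialized tables behind a table element header."""
    tables = [point, edge, point_attr, edge_attr]
    offsets, pos = [], HEADER.size
    for table in tables:
        offsets.append(pos)
        pos += len(table)
    return HEADER.pack(*offsets, first_node_id) + b"".join(tables)


def read_table(block, which):
    """Return the raw bytes of table number `which` (0..3) using the header offsets."""
    *offsets, _first_node_id = HEADER.unpack_from(block, 0)
    ends = offsets[1:] + [len(block)]
    return block[offsets[which]:ends[which]]


example_block = pack_block(b"PP", b"EEE", b"A", b"BBBB", 7)
```

With such a header, a reader can jump straight to any table, for example `read_table(example_block, 1)` for the edge table, without parsing the tables that precede it.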

In some embodiments, the data block includes encoded information. The system 500 also includes a vocabulary generation module 560, which is used to generate a vocabulary of the graph file; the vocabulary includes the mapping relationship between the encoded information in each data block of the graph file and the original information.

In some embodiments, the system 500 also includes a data block index generation module 570, which is used to generate the data block index of the graph file. The data block index of the graph file includes the storage address information of each data block in the graph file and the node identifier of the first node in each data block.

In some embodiments, the system 500 further includes a graph file element generation module 580, which is used to generate a graph file element. The graph file element includes, for each data block, the graph file in which the data block is located and the serial number of the data block in that graph file, as well as the node identifier of the first node and the node identifier of the last node in each graph file.

In some embodiments, a data block is the smallest read/write unit.

In some embodiments, the edges of the graph data include outgoing edges and incoming edges; the edge table includes an outgoing edge table and an incoming edge table; the edge attribute table includes an outgoing edge attribute table and an incoming edge attribute table; and the node information also includes the storage address information of the outgoing edges and the storage address information of the incoming edges of the node.

It should be understood that the system and its modules shown in FIG. 5 can be implemented in various ways. For example, in some embodiments, the system and its modules may be implemented by hardware, software, or a combination of software and hardware. The hardware part can be implemented using dedicated logic; the software part can be stored in a memory and executed by an appropriate instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will understand that the above methods and systems can be implemented using computer-executable instructions and/or processor control code, provided for example on a carrier medium such as a magnetic disk, CD or DVD-ROM, in a programmable memory such as a read-only memory (firmware), or on a data carrier such as an optical or electronic signal carrier. The system and its modules in this specification can be implemented by hardware circuits such as very-large-scale integrated circuits or gate arrays, by semiconductors such as logic chips and transistors, or by programmable hardware devices such as field-programmable gate arrays and programmable logic devices; they can also be implemented by software executed by various types of processors, or by a combination of the above hardware circuits and software (for example, firmware).

Fig. 6 is a schematic diagram of a data block structure according to some embodiments of the present specification.

The format of the stored file used in one or more embodiments of this specification is further described below with reference to FIG. 6.

The stored file 600 includes a graph file element and one or more graph files. The graph file element records, for each data block, the graph file in which it is located and its serial number within that graph file, as well as the node identifier of the first node and the node identifier of the last node in each graph file. A node identifier indicates the number of the node in the graph data and is used to trace the position of the node in the graph data. For example, the node identifiers can be set as node 1, node 2, ..., node m, and so on. In some embodiments, nodes in the graph data can be stored in multiple data blocks or graph files based on their node identifiers, so that the graph file containing a target query node can be determined quickly. The graph file element can be understood as index information over multiple graph files, which can be called and accessed by a host computer or a server (for example, through an SDK).

A graph file may include multiple data blocks. In some embodiments, a graph file may include a fixed number of data blocks; for example, a graph file may include 1024 data blocks. The data block is the smallest read/write unit and is used to store and write data. When storing graph data, the processing device can sequentially write the graph data into one or more data blocks according to the data block format. A data block can have a fixed size, such as 64 bytes or 128 bytes. When a data block is full, a new data block is created and writing continues until the complete graph data has been written. In some embodiments, the data in a data block may come from the same graph data or from different graph data. Specifically, the data block includes a point table, a point attribute table, an edge table, and an edge attribute table. In some embodiments, the data block can also include a table element, which includes the storage address information of each table in the data block and the node identifier of the first node in the point table of the data block. The table element can be regarded as the index information inside the data block, which makes it easy to quickly locate the storage location of each table. For more details about the point table, point attribute table, edge table and edge attribute table, refer to the detailed description of the corresponding part of FIG. 7, which is not repeated here.
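The sequential write described here, where a new data block is started once the current one is full, can be sketched as follows. The sketch makes a simplifying assumption: block capacity is counted in entries (one per node plus one per edge) rather than in bytes, and a node is never split from its edges. The function name `write_blocks` and the capacity unit are hypothetical.

```python
def write_blocks(nodes, capacity):
    """Pack [(node_id, [target_id, ...]), ...] into blocks of at most `capacity` entries.

    A node and all of its edges always land in the same block. A node whose own
    size exceeds `capacity` still gets a block of its own (assumed behaviour).
    """
    blocks, current, used = [], [], 0
    for node_id, targets in nodes:
        size = 1 + len(targets)  # one entry for the node, one per edge
        if current and used + size > capacity:
            blocks.append(current)  # current block is full: start a new one
            current, used = [], 0
        current.append((node_id, targets))
        used += size
    if current:
        blocks.append(current)
    return blocks


example_nodes = [(1, [2, 3]), (2, [1]), (3, [1]), (4, [])]
example_blocks = write_blocks(example_nodes, capacity=5)
```

Keeping a node together with its edges is what later allows a one-hop subgraph to be answered from a single block.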

In some embodiments, in addition to multiple data blocks, the graph file may also include file footer information, data block indexes, and vocabulary.

The vocabulary of the graph file is used to record the mapping relationship between the encoded information and the original information. The vocabulary can further be used to encode or decode at least part of the information in the graph file. For example, information such as edge types and node types can be represented by numbers, such as the number 1 for user-type nodes and the number 2 for company-type nodes. When node types are stored in the point table, numbers such as 1 and 2 can then be used to represent the corresponding types. Representing text with shorter numbers or letters can effectively reduce the actual storage space of the graph data. Correspondingly, mapping relationships such as "1" - user node and "2" - company node are recorded in the vocabulary.
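A vocabulary of this kind can be illustrated by a small two-way mapping. The sketch below is only one possible realization; the class name `Vocabulary` and the choice to assign codes in order of first appearance, starting from 1, are assumptions made for the example.

```python
class Vocabulary:
    """Two-way mapping between original strings and short integer codes."""

    def __init__(self):
        self._to_code = {}
        self._to_original = []

    def encode(self, original):
        """Return the code for `original`, assigning the next free code if needed."""
        if original not in self._to_code:
            self._to_original.append(original)
            self._to_code[original] = len(self._to_original)  # codes start at 1
        return self._to_code[original]

    def decode(self, code):
        """Return the original string for a previously assigned code."""
        return self._to_original[code - 1]


vocab = Vocabulary()
codes = [vocab.encode(t) for t in ["user", "company", "user"]]
```

Here the repeated type "user" is stored as the same small integer each time, and `vocab.decode` recovers the text when the data is read back.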

The data block index of the graph file includes the storage address information of each data block in the graph file and the node identifier of the first node in each data block. Using the data block index of the graph file, the data block containing the target query node can be determined quickly.
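If the data blocks are ordered by node identifier, the lookup in the data block index can be done with a binary search over the first node identifiers. The following sketch assumes that ordering, represents the index as a list of (first node identifier, byte offset) pairs, and uses the hypothetical names `block_index` and `locate_block`; the offsets are made-up example values.

```python
import bisect

# Hypothetical data block index: (first_node_id, byte_offset), sorted by first_node_id.
block_index = [(1, 0), (120, 4096), (260, 8192)]


def locate_block(node_id, index):
    """Return the byte offset of the block that may contain node_id, or None."""
    firsts = [first for first, _offset in index]
    # Rightmost block whose first node identifier is <= node_id.
    i = bisect.bisect_right(firsts, node_id) - 1
    return None if i < 0 else index[i][1]
```

Note that the index only narrows the search to one candidate block; whether the node is actually present is decided by searching the point table inside that block.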

The file footer information includes the total number of nodes in the data block, the total number of edges, and file extension areas (such as file protocol, compression algorithm, correction information, etc.).

Fig. 8 is an exemplary flow chart of querying graph data according to some embodiments of the present specification. In the following, in conjunction with the process 800 shown in FIG. 8, the method of using the storage file is described, taking as an example the case where the target query node is known and its N-hop subgraph is to be found. The N-hop subgraph includes the edges within N hops of the target query node and the nodes on each of those edges. In step 810, the storage device receives a query request from a service end or a processing device, and the query request includes the node identifier of the target query node. In step 820, the storage device first accesses the graph file element and, using the node identifier of the first node and the node identifier of the last node of each graph file stored in the graph file element, determines which graph file the target query node is stored in (e.g., graph file V). In step 830, based on the node identifier of the first node in each data block stored in the data block index of that graph file (the data block index of graph file V), the target data block where the target query node is located is determined. In step 840, the target data block is located based on the storage address information of each data block in the graph file stored in the data block index; specifically, the target data block can be obtained. In step 850, within the target data block, the point table can be located based on the table element of the data block, and the node information of the target query node can be found in the point table based on the node identifier. When the node information in the point table is stored in the order of node identifiers, the node information of the target query node can be determined quickly, for example by binary search.
Since the point table, edge table, point attribute table, and edge attribute table are located in the same data block and are aligned with each other, a single read operation (such as loading the data block into memory) is sufficient: based on the storage order of the target query node in the point table, or on the storage address information of its edges in the node information, one or more of the edge information, the point attribute information and the edge attribute information of the target query node can be obtained, as in step 860, and the one-hop subgraph of the target query node is thereby found. Further, the node identifiers of the first-hop neighbor nodes of the target query node (the nodes on the first-hop edges of the target query node) are obtained from the one-hop subgraph, and the above steps are repeated to find the one-hop subgraph of each first-hop neighbor node, which yields the two-hop subgraph of the target query node; by analogy, the N-hop subgraph of the target query node is obtained.
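The repeated one-hop expansion described for process 800 can be sketched as a breadth-first traversal. This is only an illustrative sketch: the function `n_hop_subgraph`, the `one_hop` callback and the in-memory adjacency `ADJ` are hypothetical stand-ins for the per-data-block lookup of steps 820 to 860 and are not part of this specification.

```python
# Illustrative sketch only; all names here are hypothetical.
def n_hop_subgraph(target, n, one_hop):
    """Collect the edges within n hops of target by repeated one-hop lookups."""
    edges = set()
    visited = {target}
    frontier = [target]
    for _ in range(n):
        next_frontier = []
        for node in frontier:
            for neighbor in one_hop(node):
                edges.add((node, neighbor))
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return edges


# Hypothetical adjacency standing in for the storage-backed one-hop lookup.
ADJ = {1: [2, 3], 2: [4], 3: [], 4: [5], 5: []}


def lookup(node):
    return ADJ.get(node, [])
```

With `n = 1` the result is the one-hop subgraph of the target; each further iteration expands only the neighbor nodes discovered in the previous hop, so no node is expanded twice.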

It should be noted that, in one or more embodiments of this specification, the edges of graph data may include outgoing edges and incoming edges. In such embodiments, the edge table described in this specification can be further divided into an outgoing edge table and an incoming edge table; correspondingly, the edge attribute table includes an outgoing edge attribute table and an incoming edge attribute table, and the node information includes the storage address information of the outgoing edges and the storage address information of the incoming edges of the node.

Fig. 7 is an exemplary flowchart of graph data storage according to some embodiments of the present specification. In some embodiments, an exemplary process for storing graph data is shown in process 700, wherein process 700 may include steps 710, 720, ..., 780. A detailed description of process 700 is as follows.

Step 710, storing the node information of several nodes in the graph data in the point table of the data block.

In some embodiments, step 710 may be performed by the node information storage module 510. The node information storage module 510 fills the node information into the point table in order, based on the predefined format of the point table. Graph data includes nodes and edges. In some embodiments, the node information storage module 510 may select several nodes from the graph data for storage. The several nodes can be all of the nodes of the graph data, or only some of them.

FIG. 2 is a schematic diagram of an exemplary point table 210. Node information of several nodes is stored in the point table, and the node information includes node identifiers. The node identifier is the number that denotes the node in the graph data, and is used to trace the position of the node in the graph data. Exemplarily, the node identifiers can be set as node 1, node 2, ..., node m and so on. In some embodiments, the node information in the point table is stored in the order of the node identifiers. Exemplarily, the node information storage module 510 may select several nodes with consecutive node identifiers from the graph data, and store the node information of these nodes sequentially in ascending or descending order of the node identifiers.

In some embodiments, the node information also includes storage address information of the edge corresponding to the node, and the storage address information of the edge indicates the storage location of the edge in the edge table, for example, it may be the storage address information of the edge index information in the edge table. Wherein, the storage address information may be an absolute address, or an offset relative to a certain starting position. Exemplarily, the storage address information of the edge index information in the edge table may be an absolute address, or an offset relative to the starting position of the edge table. With such a setting, when performing graph query, after locating a certain target node, the data of the edge connected to the target node can be directly determined based on the storage address information of the target node's edge in the point table.

In general, a node can have multiple edges. In some embodiments, the node information storage module 510 can record the storage address information of every edge of the node in the point table, that is, one piece of node information can record the storage address information of all edges connected to the node. However, in some implementation scenarios a node has a large number of edges (for example, a merchant node can be connected to thousands of user nodes), and storing the storage address information of all edges of the node in this way occupies a large amount of storage resources and is very inefficient. Therefore, in some embodiments of the present specification, the edge information of the same node can be stored contiguously in the edge table. For example, node A has 5 edges and node B has 3 edges. In the edge table, the edge information of the five edges of node A is stored contiguously in an area (e.g., an area of 12×5=60 bytes) starting from a first storage location (e.g., the 16th byte in the edge table), and the edge information of node B is stored contiguously in another area (e.g., an area of 12×3=36 bytes) starting from a second storage location (e.g., the 76th byte in the edge table). In this way, as shown in FIG. 2, the edge storage address information of each node stored in the point table needs to include only the initial storage location of its edges in the edge table (e.g., the edge storage address information of node A is the first storage location, and that of node B is the second storage location). That is, the storage area between the location given by the edge storage address information of one node and the location given by the edge storage address information of the next node in the point table is regarded as the storage area of the edges of the former node.
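The arithmetic of the node A and node B example can be sketched as follows. The row layout of `POINT_TABLE`, the constant `EDGE_TABLE_END` and the function `edge_region` are assumptions made for illustration; only the 12-byte record size and the start positions 16 and 76 come from the example above.

```python
# Illustrative sketch only; names and the end-of-table value are assumptions.
EDGE_RECORD_SIZE = 12  # bytes per piece of edge information, as in the example

# Hypothetical point table rows: (node identifier, start of its edges in the edge table)
POINT_TABLE = [("A", 16), ("B", 76)]
EDGE_TABLE_END = 112  # assumed end of the stored edge information: 76 + 3 * 12


def edge_region(k):
    """Return (start, end, edge count) for the k-th node of the point table."""
    start = POINT_TABLE[k][1]
    if k + 1 < len(POINT_TABLE):
        end = POINT_TABLE[k + 1][1]  # the next node's start bounds this node's area
    else:
        end = EDGE_TABLE_END         # the last node runs to the end of the edge data
    return start, end, (end - start) // EDGE_RECORD_SIZE
```

Because only one start position is kept per node, the size of the point table does not grow with the number of edges of a node.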

In some embodiments, an edge has a direction, and a node may have outgoing edges and/or incoming edges, where an incoming edge is an edge pointing to the node, and an outgoing edge is an edge starting from the node and pointing to another node. Therefore, in some embodiments, the edge storage address information in the node information of the point table can be further divided into the storage address information of the incoming edges and the storage address information of the outgoing edges. Correspondingly, the edge table can include two types: an in-edge table and an out-edge table. The in-edge table stores only the edge information of incoming edges, and the out-edge table stores only the edge information of outgoing edges. The storage address information of the outgoing/incoming edges in the node information and the storage method of the outgoing/incoming edge information in the out-edge/in-edge table are similar to those described above, and will not be repeated here. For more description of the storage address information of the edges, refer to the corresponding description of step 720.

In some embodiments, the node information may also include node type information. Since a node can describe any entity or object in the physical world, nodes can be of different types, for example, a user-type node, a company-type node, a location-type node, and so on. The node type (not shown in the figure) may be stored between the node identifier of each node and the storage address information of the edge as shown in FIG. 2. Generally speaking, the types of nodes can be enumerated. In order to facilitate the representation and storage of node types, in some embodiments, the node types can also be encoded in the graph file through a vocabulary, and the point table stores only the encoded node type. When the node type of a node needs to be read from the point table, the code can be decoded, again based on the vocabulary, into a node type with clear semantics, such as "user class node". Encoding and decoding within the file through the vocabulary simplifies the representation of the node type and thereby further reduces the storage space. For more description of the vocabulary, refer to the description of FIG. 6, which will not be repeated here.
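A minimal sketch of such a vocabulary is given below, assuming that codes are small integers assigned in order of first appearance. The class `Vocabulary` and its methods are hypothetical; the specification does not prescribe how codes are assigned.

```python
# Illustrative sketch only; the class and the code assignment rule are assumptions.
class Vocabulary:
    """Maps type descriptions to small integer codes and back."""

    def __init__(self):
        self._code_of = {}   # description -> code
        self._name_of = []   # code -> description

    def encode(self, name):
        # Assign the next free code the first time a description is seen.
        if name not in self._code_of:
            self._code_of[name] = len(self._name_of)
            self._name_of.append(name)
        return self._code_of[name]

    def decode(self, code):
        return self._name_of[code]
```

The point table would then hold only the integer code for each node, while the mapping itself is stored once per graph file.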

In some embodiments, the node information may also be stored first in the order of node types, and then in the order of node identifiers. For example, user class nodes can be stored together, and stored sequentially according to node identifiers among the multiple user class nodes. When sorting by node type, the node types can be arranged according to the pinyin alphabetical order of the first character of the node type description text, or according to the first letter of its first word. The point table 210 shown in FIG. 2 also includes a header identification bit indicating whether the table has an index area. In some embodiments, the point table does not include an index area, and its header identification bit stores "0".

Step 720, storing the edge information of the edges of the several nodes in the edge table of the data block.

In some embodiments, step 720 may be performed by the edge information storage module 520. The edge information storage module 520 fills the data into the edge table in sequence, based on the predefined format of the edge table.

In some embodiments, the edge table may include an edge table index area and an edge table data area. It can be understood that, since an edge can be described by the two target nodes it connects, the edge information can include the node identifiers of the target nodes connected by the edge. In some embodiments, the edge information is stored in the edge table data area. For example, the edge table data area stores pairs of node identifiers of target nodes, wherein each pair of node identifiers corresponds to one edge. The edge table index area stores the index information of the edge information of each edge in the edge table, for example, the storage address information, in the edge table data area, of the node identifiers of the target nodes corresponding to each edge.

FIG. 3 is a schematic diagram of an exemplary edge table 220. In the figure, the header identification bit indicates whether the table has an index area. Exemplarily, a header identification bit of "1" indicates that there is an index area, and a header identification bit of "0" indicates that there is no index area. Since every edge table contains an index area, the header identification bit of an edge table is 1. The index area length indicates the total length of the edge table index area, such as the number of bytes occupied by the edge table index area. The index area length thus indicates the position at which the edge table data area begins. The edge table index area is used to store the index information of each edge; for example, the index information of edge A points to the position of the data of edge A in the edge table data area. The edge table data area is used to store the edge information of each edge. In some embodiments, the edge information may also include the node types of the target nodes. In some embodiments, the storage length of each piece of edge information is the same. For example, for each edge, 4 bytes are used to store the node types of the two target nodes, and 8 bytes are used to store the node identifiers of the two target nodes.
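A fixed-length piece of edge information of this kind can be sketched with Python's standard `struct` module. The split of the 4 bytes into two 2-byte node type codes, the split of the 8 bytes into two 4-byte node identifiers, the little-endian byte order, and the helper names `pack_edge` and `read_edge` are all assumptions for illustration; the specification fixes only the 4-byte and 8-byte totals.

```python
import struct

# Assumed layout: two 2-byte node type codes, then two 4-byte node identifiers,
# little-endian and without padding, 12 bytes in total.
EDGE_RECORD = struct.Struct("<HHII")


def pack_edge(src_type, dst_type, src_id, dst_id):
    """Serialize one piece of edge information into 12 bytes."""
    return EDGE_RECORD.pack(src_type, dst_type, src_id, dst_id)


def read_edge(data_area, i):
    """Read the i-th piece of edge information from an edge table data area."""
    return EDGE_RECORD.unpack_from(data_area, i * EDGE_RECORD.size)
```

Because every record has the same length, the i-th record is found by multiplication rather than by scanning the data area.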

In some embodiments, the storage order of the edge index information is consistent with the storage order of the nodes in the point table (also referred to as the alignment of the edge table and the point table). For example, starting from the beginning of the edge table index area, the edge index information of the first node in the point table is stored, then the edge index information of the second node, and so on. In the edge table data area, the edge information of each edge can be stored sequentially according to the storage order of the edge index information in the edge table index area. Thus, the index information of the corresponding edges can be found according to the position of the node in the point table. For example, once a node is determined to be the k-th node in the storage order of the point table, the k-th edge index information can be read directly, and the storage location, in the edge table data area, of the edges corresponding to the k-th node can then be found based on the k-th edge index information.

In some embodiments, the storage order of the edge information in the edge table is consistent with the storage order of the nodes in the point table, and the edge information of the same node is stored contiguously. For example, node A is connected to three nodes K, M, and L, and node B is connected to two nodes Q and G. Node A is stored first in the point table, and node B is stored second. In this case, the edge information of the three edges A-K, A-M, and A-L, followed by the edge information of the two edges B-Q and B-G, is stored sequentially from the starting position of the edge table data area. In this way, as shown in FIG. 3, the edge index information stored in the edge table index area needs to include only the initial storage location, in the edge table, of the edge information of the edges corresponding to the node (e.g., the edge index information corresponding to node A includes the storage address information of edge A-K, and the edge index information corresponding to node B includes the storage address information of edge B-Q). That is, in the edge table, the storage area between the location indicated by the edge index information of one node and the location indicated by the edge index information of the next node is regarded as holding the edge information of the edges of the former node.
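The writing side of this layout can be sketched with plain Python lists in place of byte areas. The functions `build_edge_table` and `edges_of` are hypothetical, and positions are counted in records rather than bytes to keep the sketch short.

```python
# Illustrative sketch only; lists stand in for the index area and the data area.
def build_edge_table(adjacency):
    """adjacency: list of (node, [neighbors]) in point-table order.

    Returns (index, data), where index[k] is the position in data at which
    the edges of the k-th node start.
    """
    index, data = [], []
    for node, neighbors in adjacency:
        index.append(len(data))                    # start of this node's edges
        data.extend((node, nb) for nb in neighbors)
    return index, data


def edges_of(index, data, k):
    """Return the edges of the k-th node using only the start positions."""
    end = index[k + 1] if k + 1 < len(index) else len(data)
    return data[index[k]:end]
```

For the example above, `build_edge_table([("A", ["K", "M", "L"]), ("B", ["Q", "G"])])` yields start positions 0 and 3, so the entry for node A points at edge A-K and the entry for node B points at edge B-Q.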

Optionally, in some embodiments, the edge table index area also includes the edge type of each edge. For example, in FIG. 3, the edge index information of edge A stores not only the address information but also the edge type. The edge type can reflect the interactive relationship between two entities, such as a litigation relationship between two enterprises or an economic transaction relationship between two enterprises. In some embodiments, when the same node corresponds to multiple edges and the multiple edges belong to different types, the edge information of the edges of the same node can be stored in the edge table data area in order of edge type. In this case, the edge index information corresponding to the node in the edge table index area may include multiple edge types and multiple pieces of storage address information, wherein the multiple edge types are stored contiguously, and the multiple pieces of storage address information are also stored contiguously. As shown in FIG. 3, assuming that node B has multiple edges and these edges belong to two edge types, two edge types and two pieces of storage address information can be stored contiguously in the edge index information of node B. The first piece of storage address information is the storage address information, in the edge table data area, of the edge information belonging to the first edge type among the multiple edges of node B (for example, the initial storage location of that edge information in the edge table data area), and the second piece of storage address information is the storage address information, in the edge table data area, of the edge information belonging to the second edge type among the multiple edges of node B (for example, the initial storage location of that edge information in the edge table data area).
With this arrangement, when a graph query is performed, all edges of a given edge type of a given node can be located quickly.
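The per-type lookup can be sketched as follows, assuming an index entry that holds two parallel lists (edge type codes and start positions). The entry layout, the sample data, including the third edge B-R, and the function `edges_of_type` are hypothetical and serve only to illustrate the idea.

```python
# Illustrative sketch only; the entry layout and the sample data are assumptions.
NODE_B_INDEX = {"types": [0, 1], "starts": [3, 5]}   # edge type codes, start positions
DATA = [("A", "K"), ("A", "M"), ("A", "L"),          # edges of node A
        ("B", "Q"), ("B", "G"),                      # node B, edge type 0
        ("B", "R")]                                  # node B, edge type 1
NEXT_NODE_START = len(DATA)  # node B is last, so its edges run to the end


def edges_of_type(entry, data, edge_type, next_node_start):
    """Return the edges of one type for one node, or [] if it has none."""
    if edge_type not in entry["types"]:
        return []
    i = entry["types"].index(edge_type)
    start = entry["starts"][i]
    # The next start position of the same node, or the next node's start, bounds the run.
    if i + 1 < len(entry["starts"]):
        end = entry["starts"][i + 1]
    else:
        end = next_node_start
    return data[start:end]
```

Since the edges of one node are grouped by type, one start position per type is enough to delimit each group.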

In some embodiments, as with the node type, the edge type is encoded inside the graph file using a vocabulary, and the edge table stores only the internal encoding of the edge type. For more description of the vocabulary, refer to the corresponding description of FIG. 6, which will not be repeated here.

In some embodiments, edges have directions and nodes may have outgoing and/or incoming edges. Correspondingly, the edge table can include two types: an in-edge table and an out-edge table. The in-edge table stores only the relevant data of incoming edges, and the out-edge table stores only the relevant data of outgoing edges. The storage method of the relevant data of the outgoing/incoming edges in the out-edge/in-edge table is similar to the foregoing content, and will not be repeated here.

Step 730, storing the attribute information of the several nodes in the point attribute table of the data block.

In some embodiments, step 730 may be performed by the node attribute information storage module 530. The node attribute information storage module 530 fills the data into the point attribute table in sequence, based on the predefined format of the point attribute table.

FIG. 4 is a schematic diagram of an exemplary attribute table 240. In some embodiments, the point attribute table and the edge attribute table may have the same format. Therefore, the attribute table 240 can also be regarded as a point attribute table. The point attribute table includes a point attribute table index area and a point attribute table data area. The attribute information of the nodes is stored in the point attribute table data area; the point attribute table index area stores the point attribute index information of each node, and the point attribute index information includes the storage address information of the node's attribute information in the point attribute table data area. As shown in FIG. 4, each piece of attribute index information can point to one piece of attribute data.

In some embodiments, similar to the alignment of the edge table and the point table, the point attribute table may also be aligned with the point table. Specifically, the storage order of the point attribute index information in the point attribute table is consistent with the storage order of the node information in the point table. With such a setting, it is possible to locate the point attribute index information according to the storage order of the nodes in the point table, and further obtain the attribute information of the node from the point attribute table data area based on the point attribute index information.
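The aligned lookup can be sketched as a binary search over the point table followed by a positional read of the point attribute table. The sample contents of `NODE_IDS`, `ATTR_OFFSETS` and `ATTR_DATA` and the function `point_attributes` are assumptions for illustration; real attribute data would carry its own encoding.

```python
from bisect import bisect_left

# Illustrative sketch only; the sample tables below are assumptions.
NODE_IDS = [10, 11, 12]        # point table, sorted by node identifier
ATTR_OFFSETS = [0, 5, 8]       # point attribute index area, in the same order
ATTR_DATA = b"alicebobcarol"   # point attribute data area


def point_attributes(node_id):
    """Return the raw attribute bytes of a node, or None if it is not stored here."""
    k = bisect_left(NODE_IDS, node_id)            # position in the point table
    if k == len(NODE_IDS) or NODE_IDS[k] != node_id:
        return None
    # The same position k selects the entry in the aligned attribute index area.
    end = ATTR_OFFSETS[k + 1] if k + 1 < len(ATTR_OFFSETS) else len(ATTR_DATA)
    return ATTR_DATA[ATTR_OFFSETS[k]:end]
```

No node identifier needs to be stored in the attribute table itself, because the position found in the point table is reused as the position in the attribute index area.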

In some embodiments, the attribute table 240 may also include a header flag "1" and the length of the index area.

Step 740, storing the attribute information of the edges of the several nodes in the edge attribute table of the data block.

In some embodiments, step 740 may be performed by the edge attribute information storage module 540. The edge attribute information storage module 540 fills data into the edge attribute table in sequence, based on the predefined format of the edge attribute table.

Similarly, the attribute table 240 can also be regarded as an edge attribute table. The attribute information of the edges of the several nodes is stored in the edge attribute table data area; the edge attribute table index area stores the attribute index information of each edge, and the edge attribute index information includes the storage address information of the attribute information of the edge in the edge attribute table data area.

In some embodiments, the storage order of the edge attribute index information in the edge attribute table index area is consistent with the storage order of the edge information of each edge in the edge table data area.

In some embodiments, edges have directions and nodes may have outgoing and/or incoming edges. Correspondingly, the edge attribute table may include two types: an incoming edge attribute table and an outgoing edge attribute table, wherein only the attribute information of the incoming edge is stored in the incoming edge attribute table, and the attribute information of the outgoing edge is stored in the outgoing edge attribute table. The storage method of the attribute information of the outgoing/incoming edge in the outgoing/incoming edge attribute table is similar to the foregoing content, and will not be repeated here.

In some embodiments, the process 700 further includes step 750: generating the table element of the data block. In some embodiments, step 750 may be performed by the table element generation module 550.

The table element includes the storage address information of each table in the data block and the node identifier of the first node in each point table in the data block. For more descriptions about the table elements, refer to the corresponding description in FIG. 6 , which will not be repeated here.
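One way to picture the table element is as a small directory written alongside the tables of a data block. In the sketch below, the function `assemble_block`, the dictionary layout of the table element and the use of (offset, length) pairs as storage address information are all assumptions; the actual layout is the one described with reference to FIG. 6.

```python
# Illustrative sketch only; the table element layout is an assumption.
def assemble_block(tables, first_node_id):
    """tables: dict mapping a table name to its serialized bytes.

    Returns (block bytes, table element). The table element records, for each
    table, its offset and length inside the block, plus the first node identifier.
    """
    body = bytearray()
    offsets = {}
    for name, payload in tables.items():
        offsets[name] = (len(body), len(payload))
        body += payload
    return bytes(body), {"first_node_id": first_node_id, "tables": offsets}
```

A reader that has loaded the block can then jump directly to any table by slicing the block with the recorded offset and length.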

So far, the generation of a data block is completed. In some embodiments, multiple data blocks may be generated according to steps 710-740, and the multiple data blocks constitute a graph file. The graph file can also include information such as a vocabulary and a data block index.

In some embodiments, the process 700 further includes step 760: generating a vocabulary of the graph file. In some embodiments, step 760 may be performed by the vocabulary generation module 560 .

In some embodiments, the data block includes encoded information; in this case, the vocabulary of the graph file can also be generated. The vocabulary includes the mapping relationship between the encoded information in each data block of the graph file and the original information. For more description of the vocabulary, refer to the corresponding description of FIG. 6, which will not be repeated here.

In some embodiments, the process 700 further includes step 770: generating a data block index of the graph file. In some embodiments, step 770 may be performed by the data block index generation module 570.

The data block index of the graph file includes the storage address information of each data block in the graph file and the node identifier of the first node in each data block, and is used to determine which data block the target query node is in. For more description of the data block index, refer to the corresponding description of FIG. 6, which will not be repeated here.

So far, one graph file has been generated based on the graph data. In some embodiments, multiple graph files can be generated to form a storage file. The storage file may also include a graph file element.

In some embodiments, the process 700 further includes step 780: generating a graph file element.

The graph file element includes, for each data block, the graph file in which the data block is located and the serial number of the data block within that graph file, as well as the node identifier of the first node and the node identifier of the last node in each graph file, which are used to determine which graph file the target query node is in. For more description of the graph file element, refer to the corresponding description of FIG. 6, which will not be repeated here.
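The two-level lookup through the graph file element and the data block index can be sketched with binary search over sorted first-node identifiers. The sample ranges in `FILE_RANGES` and `BLOCK_FIRST_IDS` and the function `locate` are hypothetical, and the sketch assumes that node identifiers are stored in ascending order across graph files and data blocks.

```python
from bisect import bisect_right

# Illustrative sketch only; the sample ranges below are assumptions.
# Hypothetical graph file element: (first node id, last node id) of each graph file.
FILE_RANGES = [(1, 100), (101, 250), (251, 400)]
# Hypothetical data block index of each graph file: first node id of each data block.
BLOCK_FIRST_IDS = {0: [1, 40, 80], 1: [101, 150, 200], 2: [251, 300]}


def locate(node_id):
    """Return (graph file number, data block number), or None if out of range."""
    firsts = [first for first, _ in FILE_RANGES]
    f = bisect_right(firsts, node_id) - 1      # last file whose first id <= node_id
    if f < 0 or node_id > FILE_RANGES[f][1]:
        return None
    b = bisect_right(BLOCK_FIRST_IDS[f], node_id) - 1
    return f, b
```

Only the small element and index structures are consulted here; the data block itself is read once, after its number is known.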

The possible beneficial effects of the embodiments of this specification include but are not limited to: 1) several nodes of the graph data, the edges of these nodes, and their attribute information are stored in one data block, so that the edges and attribute information related to a node can be found within the block without multiple read and write operations; 2) the graph data is stored in multiple data blocks in an orderly manner, so that large-scale graph data can be distributed and stored on multiple devices, and when a graph query is performed, multiple devices can query in parallel (for example, different devices query different data blocks), which saves retrieval time and improves the response speed of graph queries; 3) the alignment of the point table, the edge table and the attribute tables is realized, which saves the storage space of the edge tables and attribute tables. It should be noted that different embodiments may have different beneficial effects. In different embodiments, the possible beneficial effects may be any one or a combination of the above, or any other possible beneficial effects.

The basic concepts have been described above. Obviously, for those skilled in the art, the above detailed disclosure is only an example and does not constitute a limitation on this specification. Although not expressly stated here, those skilled in the art may make various modifications, improvements and corrections to this specification. Such modifications, improvements and corrections are suggested in this specification, and therefore still fall within the spirit and scope of the exemplary embodiments of this specification.

Meanwhile, this specification uses specific words to describe the embodiments of this specification. For example, "one embodiment", "an embodiment", and/or "some embodiments" refer to a certain feature, structure or characteristic related to at least one embodiment of this specification. Therefore, it should be emphasized and noted that references to "an embodiment", "one embodiment" or "an alternative embodiment" two or more times in different places in this specification do not necessarily refer to the same embodiment. In addition, certain features, structures or characteristics in one or more embodiments of this specification may be properly combined.

In addition, those skilled in the art will understand that various aspects of this specification can be illustrated and described in terms of several patentable categories or situations, including any new and useful process, machine, product or composition of matter, or any new and useful improvement thereof. Correspondingly, various aspects of this specification may be entirely executed by hardware, may be entirely executed by software (including firmware, resident software, microcode, etc.), or may be executed by a combination of hardware and software. The above hardware or software may be referred to as a "block", "module", "engine", "unit", "component" or "system". Additionally, aspects of this specification may be embodied as a computer product comprising computer readable program code on one or more computer readable media.

A computer storage medium may contain a propagated data signal embodying a computer program code, for example, in baseband or as part of a carrier wave. The propagated signal may have various manifestations, including electromagnetic form, optical form, etc., or a suitable combination. A computer storage medium may be any computer-readable medium, other than a computer-readable storage medium, that can be used to communicate, propagate, or transfer a program for use by being coupled to an instruction execution system, apparatus, or device. Program code residing on a computer storage medium may be transmitted over any suitable medium, including radio, electrical cable, fiber optic cable, RF, or the like, or combinations of any of the foregoing.

The computer program code required for the operation of each part of this specification can be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET and Python, conventional procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP and ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages. The program code may run entirely on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or processing device. In the latter case, the remote computer can be connected to the user's computer through any form of network, such as a local area network (LAN) or a wide area network (WAN), or connected to an external computer (for example, through the Internet), or used in a cloud computing environment, or used as a service, such as software as a service (SaaS).

In addition, unless explicitly stated in the claims, the order of processing elements and sequences described in this specification, the use of numbers and letters, or the use of other names are not intended to limit the order of the processes and methods of this specification. While the foregoing disclosure has discussed, by way of various examples, some embodiments of the invention that are presently believed to be useful, it should be understood that such detail is for illustrative purposes only and that the appended claims are not limited to the disclosed embodiments; rather, the claims are intended to cover all modifications and equivalent combinations that fall within the spirit and scope of the embodiments of this specification. For example, while the system components described above may be implemented as hardware devices, they may also be implemented as a software-only solution, such as installing the described system on an existing processing device or mobile device.

In the same way, it should be noted that, in order to simplify the presentation of this specification and help the understanding of one or more embodiments of the invention, the foregoing description of the embodiments sometimes combines multiple features into one embodiment, drawing or description thereof. This method of disclosure does not, however, imply that the subject matter of the specification requires more features than are recited in the claims. Indeed, the features of an embodiment may be fewer than all the features of a single embodiment disclosed above.

In some embodiments, numbers describing quantities of components and attributes are used. It should be understood that such numbers used in the description of the embodiments are, in some examples, qualified by the modifiers "about", "approximately" or "substantially". Unless otherwise stated, "about", "approximately" or "substantially" indicates that the stated figure allows for a variation of ±20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that can vary depending upon the desired characteristics of individual embodiments. In some embodiments, numerical parameters should take into account the specified significant digits and adopt a general method of retaining digits. Although the numerical ranges and parameters used in some embodiments of this specification to confirm the breadth of the range are approximations, in specific embodiments such numerical values are set as precisely as practicable.

Each patent, patent application, patent application publication, and other material, such as an article, book, specification, publication or document, cited in this specification is hereby incorporated by reference in its entirety. Application history documents that are inconsistent with or conflict with the content of this specification are excluded, as are documents (currently or later appended to this specification) that limit the broadest scope of the claims of this specification. It should be noted that if there is any inconsistency or conflict between the descriptions, definitions, and/or terms used in the materials accompanying this specification and the contents of this specification, the descriptions, definitions and/or terms used in this specification shall prevail.

Finally, it should be understood that the embodiments described in this specification are only used to illustrate the principles of the embodiments of this specification. Other modifications are also possible within the scope of this description. Therefore, by way of example and not limitation, alternative configurations of the embodiments of this specification may be considered consistent with the teachings of this specification. Accordingly, the embodiments of this specification are not limited to the embodiments explicitly introduced and described in this specification.

Claims

A method for storing graph data, where the graph data includes nodes and edges; the storage method includes:

storing the node information of several nodes in the graph data in the point table of the data block; the node information includes a node identifier;

storing the edge information of the edges of the several nodes in the edge table of the data block; the edge information includes a node identifier of a target node connected to the edge;

storing the attribute information of the several nodes in the point attribute table of the data block;

storing the attribute information of the edges of the several nodes in the edge attribute table of the data block.
The method according to claim 1, wherein the storage order of the edges of the several nodes in the edge table is consistent with the storage order of the several nodes in the point table;

The storage order of the attribute information of the several nodes in the point attribute table is consistent with the storage order of the several nodes in the point table;

The storage order of the edge attribute information of the several nodes in the edge attribute table is consistent with the storage order of the edges of the several nodes in the edge table.
The method according to claim 1 or 2, wherein the edge table includes an edge table index area and an edge table data area;

The edge information of the edges of the several nodes is stored in the edge table data area;

The edge table index area stores edge index information of the several nodes, and the edge index information includes storage address information of the edge information of the corresponding node in the edge table data area;

The storage order of the edge index information of the several nodes in the edge table index area is consistent with the storage order of the several nodes in the point table.
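
For illustration only, the layout recited in claims 1 to 3 might be sketched as follows. This is a minimal in-memory sketch under stated assumptions, not the claimed on-disk format: the names `DataBlock`, `build_block` and `edges_of` are hypothetical, "storage address information" is modelled as an (offset, count) pair into a Python list, and nodes are assumed to be stored in ascending node identifier order.

```python
from dataclasses import dataclass, field


@dataclass
class DataBlock:
    # Point table: node identifiers, in storage order.
    point_table: list = field(default_factory=list)
    # Edge table index area: one (offset, count) entry per node,
    # stored in the same order as the point table (assumed encoding).
    edge_index: list = field(default_factory=list)
    # Edge table data area: target node identifiers; edges of the
    # same node are stored contiguously.
    edge_data: list = field(default_factory=list)
    # Point attribute table: one entry per node, same order as the point table.
    point_attrs: list = field(default_factory=list)
    # Edge attribute table: one entry per edge, same order as edge_data.
    edge_attrs: list = field(default_factory=list)


def build_block(nodes):
    """Build a block from {node_id: (node_attrs, [(target_id, edge_attrs), ...])}."""
    block = DataBlock()
    for node_id in sorted(nodes):  # ascending node ID order is an assumption
        node_attrs, edges = nodes[node_id]
        block.point_table.append(node_id)
        block.point_attrs.append(node_attrs)
        # The index entry records where this node's edges start and how many follow.
        block.edge_index.append((len(block.edge_data), len(edges)))
        for target_id, edge_attrs in edges:
            block.edge_data.append(target_id)
            block.edge_attrs.append(edge_attrs)
    return block


def edges_of(block, node_id):
    """Return [(target_id, edge_attrs), ...] for one node."""
    pos = block.point_table.index(node_id)  # position in the point table
    offset, count = block.edge_index[pos]   # same position in the index area
    return list(zip(block.edge_data[offset:offset + count],
                    block.edge_attrs[offset:offset + count]))
```

Because every table follows the storage order of the point table, a node's position in the point table is enough to reach its index entry, and the index entry is enough to reach its edges and their attributes without scanning the data area.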
The method according to claim 3, wherein the node information further includes storage address information of the edges of the node, and the storage address information of the edges in the point table is the storage address information of the edge index information corresponding to those edges in the edge table.
The method according to claim 3, wherein the edge information of different edges of the same node is stored contiguously in the edge table data area; and the storage order of the edge information of the edges of the several nodes in the edge table data area is consistent with the storage order of the several nodes in the point table.
The method according to claim 5, wherein the edge index information further includes an edge type; the edge information further includes the node type of the target node; the edge information of the edges of the same node is stored in the edge table data area in order of edge type; and the edge index information corresponding to the same node in the edge table index area includes one or more edge types and one or more corresponding pieces of storage address information, wherein the one or more edge types are stored contiguously, and the one or more pieces of storage address information are also stored contiguously.
The method according to claim 3, wherein the edge attribute table includes an edge attribute table index area and an edge attribute table data area;

The attribute information of the edges of the several nodes is stored in the edge attribute table data area;

The edge attribute table index area stores the edge attribute index information of the edges of the several nodes, and the edge attribute index information includes the storage address information of the edge attribute information of the corresponding node in the edge attribute table data area;

The storage order of the edge attribute index information of the edges of the several nodes in the edge attribute table index area is consistent with the storage order of the edge information of the edges of the several nodes in the edge table data area.
The method according to claim 1, wherein the node information further includes node types, and the node information of the several nodes is stored in the point table in order of node type.
The method according to claim 1, wherein the point attribute table includes a point attribute table index area and a point attribute table data area;

The attribute information of the several nodes is stored in the point attribute table data area;

The point attribute table index area stores the node attribute index information of the several nodes, and the node attribute index information includes the storage address information of the node attribute information in the point attribute table data area;

The storage order of the node attribute index information of the several nodes in the point attribute table index area is consistent with the storage order of the several nodes in the point table.
The method according to claim 1, further comprising: generating a table element of the data block, the table element including the storage address information of each table in the data block and the node ID of the first node in the point table of the data block.
The method according to claim 10, wherein the data block includes encoded information; the method further comprises: generating a vocabulary of a graph file comprising a plurality of the data blocks; the vocabulary includes the mapping relationship between the encoded information in each data block in the graph file and the original information.
The method according to claim 10, further comprising: generating a data block index of a graph file including a plurality of the data blocks; the data block index of the graph file includes the storage address information of each data block in the graph file and the node ID of the first node in each data block.
The method according to claim 12, further comprising: generating a graph file element, the graph file element including the graph file in which each data block is located, the serial number of the data block in that graph file, the node ID of the first node in each graph file, and the node ID of the last node in each graph file.
The method according to claim 1, wherein the data block is the minimum read/write unit.
The method according to claim 1, wherein the edges of the graph data include outgoing edges and incoming edges; the edge table includes an outgoing edge table and an incoming edge table; the edge attribute table includes an outgoing edge attribute table and an incoming edge attribute table; and the node information further includes the storage address information of the outgoing edges and the storage address information of the incoming edges of the node.
A storage system for graph data, where the graph data includes nodes and edges; the storage system includes:

A node information storage module, configured to store the node information of several nodes in the graph data in the point table of the data block; the node information includes a node identifier;

An edge information storage module, configured to store the edge information of the edges of the several nodes in the edge table of the data block; the edge information includes a node identifier of a target node connected to the edge;

A node attribute information storage module, configured to store the attribute information of the several nodes in the point attribute table of the data block;

An edge attribute information storage module, configured to store the attribute information of the edges of the several nodes in the edge attribute table of the data block.
A graph data storage device, comprising a storage medium and a processor, the storage medium is used to store computer instructions, and the processor is used to execute the computer instructions to implement the storage method according to any one of claims 1-15.
A storage device for graph data, the graph data includes nodes and edges; the storage device stores several data blocks, wherein each data block includes:

A point table, used to store node information of at least some nodes in the graph data; the node information includes node identifiers;

an edge table, configured to store edge information of the edge of the node; the edge information includes a node identifier of a target node connected to the edge;

A point attribute table, used to store the attribute information of the node;

An edge attribute table, used to store the attribute information of the edges of the node.
A graph data query method, comprising:

receiving a query request, the query request including the node identifier of a target query node;

accessing a graph file element, and determining the target graph file in which the target query node is located from the node identifier of the first node and the node identifier of the last node of each graph file stored in the graph file element;

accessing the data block index of the target graph file, and determining the target data block in which the target query node is located from the node identifier of the first node of each data block in the target graph file stored in the data block index;

reading the target data block based on the storage address information of each data block in the target graph file stored in the data block index;

in the target data block, obtaining the storage address information of the point table from the table element of the target data block, and finding the node information of the target query node in the point table based on the node identifier of the target query node;

based on the storage order of the node information of the target query node in the point table, or on the storage address information of its edges, obtaining one or more of the edge information, point attribute information and edge attribute information of the target query node from one or more of the edge table, the point attribute table and the edge attribute table of the target data block.
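
For illustration only, the first two lookup steps of the query method might be sketched as follows. This is a minimal sketch under stated assumptions rather than the claimed implementation: the structures `GRAPH_FILE_ELEMENT` and `BLOCK_INDEX`, the file names, the addresses and the function names are all hypothetical, and node identifiers are assumed to be stored in ascending order across graph files and data blocks so that a binary search applies. Reading the data block and searching its tables are omitted.

```python
from bisect import bisect_right

# Hypothetical graph file element: one (first_node_id, last_node_id, file_name)
# entry per graph file, sorted by first_node_id.
GRAPH_FILE_ELEMENT = [(1, 40, "graph_0"), (41, 90, "graph_1")]

# Hypothetical data block index per graph file: (first_node_id, block_address)
# entries, sorted by first_node_id.
BLOCK_INDEX = {
    "graph_0": [(1, 0), (21, 4096)],
    "graph_1": [(41, 0), (66, 4096)],
}


def find_graph_file(node_id):
    """Pick the graph file whose [first, last] node ID range covers node_id."""
    firsts = [first for first, _, _ in GRAPH_FILE_ELEMENT]
    i = bisect_right(firsts, node_id) - 1
    if i < 0:
        return None
    _, last, name = GRAPH_FILE_ELEMENT[i]
    return name if node_id <= last else None


def find_block_address(file_name, node_id):
    """Pick the last data block whose first node ID is <= node_id."""
    index = BLOCK_INDEX[file_name]
    firsts = [first for first, _ in index]
    i = bisect_right(firsts, node_id) - 1
    return index[i][1] if i >= 0 else None


def locate(node_id):
    """Return (graph file, block address) for a node, or None if out of range."""
    name = find_graph_file(node_id)
    if name is None:
        return None
    address = find_block_address(name, node_id)
    return None if address is None else (name, address)
```

Under these assumptions, only the two small metadata structures are consulted before a single data block is read, which matches the claim's ordering of the steps.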