WO2023131218A1 - Graph data storage - Google Patents

Info

Publication number
WO2023131218A1
Authority
WO
WIPO (PCT)
Prior art keywords
edge
node
information
attribute
data
Prior art date
Application number
PCT/CN2023/070606
Other languages
French (fr)
Chinese (zh)
Inventor
张达
Original Assignee
支付宝(杭州)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 支付宝(杭州)信息技术有限公司
Publication of WO2023131218A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51 Indexing; Data structures therefor; Storage structures
    • G06F16/53 Querying

Definitions

  • One or more embodiments of this specification relate to the field of computers, and in particular to a method, system, and device for storing graph data.
  • One aspect of this specification provides a method for storing graph data, where the graph data includes nodes and edges. The storage method includes: storing node information of several nodes in the graph data in a point table of a data block, where the node information includes a node identifier; storing the edge information of the edges of the several nodes in the edge table of the data block, where the edge information includes the node identifier of the target node connected to the edge; storing the attribute information of the several nodes in the point attribute table of the data block; and storing the attribute information of the edges of the several nodes in the edge attribute table of the data block.
  • In the storage system for graph data, the graph data includes nodes and edges, and the storage system includes:
  • a node information storage module, used to store the node information of several nodes in the graph data in the point table of a data block, where the node information includes a node identifier;
  • an edge information storage module, used to store the edge information of the edges of the several nodes in the edge table of the data block, where the edge information includes the node identifier of the target node connected to the edge;
  • a node attribute information storage module, used to store the attribute information of the several nodes in the point attribute table of the data block;
  • an edge attribute information storage module, used to store the attribute information of the edges of the several nodes in the edge attribute table of the data block.
  • In the graph data file, the graph data includes nodes and edges. The file includes several data blocks, and each data block includes: a point table, used to store node information of at least some nodes in the graph data, where the node information includes a node identifier; an edge table, used to store the edge information of the edges of those nodes, where the edge information includes the node identifier of the target node connected to the edge; a point attribute table, used to store the attribute information of those nodes; and an edge attribute table, used to store the attribute information of the edges of those nodes.
  • Fig. 1 is a schematic diagram of an application scenario of an exemplary graph data storage system according to some embodiments of the present specification.
  • Fig. 2 is a schematic diagram of a point table according to some embodiments of the present specification.
  • Fig. 3 is a schematic diagram of an edge table according to some embodiments of the present specification.
  • Fig. 4 is a schematic diagram of a point/edge attribute table according to some embodiments of the present specification.
  • Fig. 5 is a system block diagram of graph data storage according to some embodiments of the present specification.
  • Fig. 6 is a schematic diagram of a data block structure according to some embodiments of this specification.
  • Fig. 7 is an exemplary flow chart of graph data storage according to some embodiments of the present specification.
  • Fig. 8 is an exemplary flow chart of querying graph data according to some embodiments of the present specification.
  • The term "system" as used herein is a means of distinguishing different components, elements, parts, sections, or assemblies at different levels.
  • However, these words may be replaced by other expressions if those expressions achieve the same purpose.
  • Fig. 1 is a schematic diagram of an application scenario of an exemplary graph database storage system according to some embodiments of the present specification.
  • The data generated between different entities is increasing exponentially, and the internal dependence and complexity of the data are increasing as well.
  • Graph data is used to describe and characterize the relationships between different entities.
  • Graph data is composed of multiple nodes and the edges connecting those nodes.
  • The nodes in the graph data represent entities, and the edges between nodes represent the relationships between entities.
  • Entities can be real objects, institutions, etc. in the physical world, or abstract concepts, such as companies, equipment, people, goods, storage locations, means of transportation, images, computer programs, accounts, etc. Entities can have attribute information.
  • For a person, for example, attribute information may include age, gender, occupation, employer, or home address.
  • For a company, attribute information may include the registered address, legal representative, business scope, registered capital, and other information.
  • Edges between entities can reflect the relationship between entities. For example, there may be an employment relationship between an entity person and an entity company, and there may be a friend relationship between Zhang San and Li Si. Edges can also have attribute information.
  • The attribute information of an employment relationship can include its establishment time, the employment relationship type (formal or temporary employment), and so on.
  • Graph data can be stored in a relational database; this storage method stores the nodes and edges of the graph data separately.
  • However, relational databases are poorly suited to storing graph data. For example, because the graph data is huge, it must be sharded across separate databases and tables, so the nodes and the edges of those nodes are split up and stored apart.
  • When querying the graph data, it is then necessary to interact with different databases (such as storage devices) to find the target query nodes and their edges, or multiple reads and writes are required to obtain them.
  • A graph data storage method based on graph databases is proposed in some embodiments.
  • In a graph database, the relationships between data play an important role, and the database can store massive, complex data together with the complex relationships among the data.
  • Such a graph database divides the nodes and edges of the graph data among different KV storage engines and builds a proxy layer on top of them to provide graph query services.
  • A one-hop subgraph (that is, a one-hop graph) refers to a node, the edges connected to the node, and the nodes at the other ends of those edges.
  • With this design, querying a one-hop subgraph requires many read and write operations to obtain the result, so retrieval efficiency is very low.
  • In addition, the graph database needs an independent server cluster for deployment, operation, and maintenance to ensure there is enough memory for the multiple read and write operations of a graph query. This brings large equipment operation and maintenance costs.
  • Some embodiments of this specification provide a storage method for graph data, including: storing the node information, edge information, node attribute information, and edge attribute information of several nodes in the graph data in the point table, edge table, point attribute table, and edge attribute table, respectively, of the same data block.
  • In this way, the node information and edge information of the relevant nodes can be obtained by reading the data block once, which effectively reduces the number of reads and writes during graph processing.
  • The data block needs to be read only once, so query efficiency is significantly improved.
  • The storage order of the edges in the edge table can also be made consistent with the storage order of the several nodes in the point table; the storage order of the attribute information of the several nodes in the point attribute table can be made consistent with the storage order of the several nodes in the point table; and the storage order of the edge attribute information in the edge attribute table can be made consistent with the storage order of the edges in the edge table. In this way, the point table, edge table, and attribute tables are aligned. After node A is found, the positions of all edges of node A in the edge table can be quickly determined, and the attribute information of those edges in the edge attribute table can then be quickly located. Such a setting avoids excessive data reads, writes, and caching during a graph query, so the entire process does not require a resident service cluster to support it.
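  • The alignment described above can be sketched with in-memory lists standing in for the four tables of one data block. The names `point_table`, `edge_table`, `point_attrs`, `edge_attrs`, and `one_hop`, as well as the sample values, are hypothetical and are not taken from the specification.

```python
# Hypothetical in-memory stand-ins for the four tables of one data block.
# point_table rows: (node identifier, position of the node's first edge).
point_table = [("A", 0), ("B", 2), ("C", 3)]
# edge_table rows: identifier of the target node, grouped in point-table order.
edge_table = ["B", "C", "C", "A"]
# point_attrs is aligned row-for-row with point_table.
point_attrs = [{"age": 30}, {"age": 41}, {"age": 25}]
# edge_attrs is aligned row-for-row with edge_table.
edge_attrs = [{"since": 2019}, {"since": 2020}, {"since": 2021}, {"since": 2022}]


def one_hop(node_id):
    """Return the node's attributes and its (target, edge attributes) pairs."""
    ids = [nid for nid, _ in point_table]
    i = ids.index(node_id)
    start = point_table[i][1]
    # The next node's first-edge position marks the end of this node's edges.
    end = point_table[i + 1][1] if i + 1 < len(point_table) else len(edge_table)
    edges = [(edge_table[k], edge_attrs[k]) for k in range(start, end)]
    return point_attrs[i], edges
```

  • Because every table shares the same ordering, a position found in one table can be reused directly in the aligned table, without any further lookup.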
  • The application scenario of the graph data storage system is shown in FIG. 1, and the scenario 100 may include a storage device 110-1, a storage device 110-2, ...
  • The storage devices 110-1, 110-2, 110-3, ... may each include a processor and mass storage, removable memory, volatile read-write memory, read-only memory (ROM), etc., or any combination thereof, for storing data, managing resources, and processing data and/or information from at least one component of the system or from external data sources (e.g., cloud data centers).
  • each of storage device 110-1, storage device 110-2, storage device 110-3, ... may be a single server or a group of servers.
  • The server group may be centralized or distributed (for example, the storage device 110-1 may be a distributed system), may be dedicated, or may be provided concurrently by other devices or systems.
  • storage device 110-1, storage device 110-2, storage device 110-3, ... may be local or remote.
  • the storage device 110-1, the storage device 110-2, the storage device 110-3, ... may be implemented on a cloud platform, or provided in a virtual manner.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-layer cloud, etc., or any combination thereof.
  • any one or more of storage device 110-1, storage device 110-2, ..., storage device 110-n can store one or more graph files, and support parallel query of graph data.
  • the graph file may include multiple data blocks, and each data block is used to store node information, edge information, and attribute information corresponding to nodes and edges of all or part of the nodes in the graph data.
  • Each data block includes a point table 210, an edge table 220, a point attribute table 230, an edge attribute table 240, and a table element 250.
  • The processing device 120 can generate or acquire graph data, write the graph data into multiple data blocks or multiple graph files, and distribute the data blocks or graph files to storage device 110-1, storage device 110-2, ..., storage device 110-n for storage.
  • The processing device 120 can obtain a query request and distribute it to each storage device, so that each storage device can perform the query on its locally stored graph data or data blocks and return the query result to the processing device 120.
  • A single storage device may be used to store the graph files, in which case the processing device 120 may be omitted.
  • The scenario 100 may also include a network (not shown in the figure).
  • a network can connect components of a system and/or connect the system with external parts.
  • a network enables communication between the various components of the system and between the system and external parts, facilitating the exchange of data and/or information.
  • the network 130 may be any one or more of a wired network or a wireless network.
  • a network may include a cable network, a fiber optic network, a telecommunications network, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), Bluetooth network, ZigBee network (ZigBee), near field communication (NFC), internal bus, internal line, cable connection, etc. or any combination thereof.
  • the network connection between various parts of the system may adopt one of the above-mentioned methods, or may adopt multiple methods.
  • the network may be in various topologies such as point-to-point, shared, and central, or a combination of multiple topologies.
  • Fig. 5 is a system block diagram for storing a graph database according to some embodiments of the present specification.
  • The system 500 is arranged on any processing device that can execute programs (such as any one of storage device 110-1, storage device 110-2, ..., storage device 110-n in FIG. 1), and specifically includes: a node information storage module 510, used to store the node information of several nodes in the graph data in the point table of the data block, where the node information includes a node identifier; an edge information storage module 520, used to store the edge information of the edges of the several nodes in the edge table of the data block, where the edge information includes the node identifier of the target node connected to the edge; a node attribute information storage module 530, used to store the attribute information of the several nodes in the point attribute table of the data block; and an edge attribute information storage module 540, used to store the attribute information of the edges of the several nodes in the edge attribute table of the data block.
  • The storage order of the edges of the several nodes in the edge table is consistent with the storage order of the several nodes in the point table; the storage order of the attribute information of the several nodes in the point attribute table is consistent with the storage order of the several nodes in the point table; the storage order of the edge attribute information of the several nodes in the edge attribute table is consistent with the storage order of the edges of the several nodes in the edge table.
  • The edge table includes an edge table index area and an edge table data area; the edge information of the edges of the several nodes is stored in the edge table data area; the edge table index area stores the edge index information of the several nodes, and the edge index information includes the storage address information of the edge information of the corresponding node in the edge table data area; the storage order of the edge index information of the several nodes is consistent with the storage order of the several nodes in the point table.
  • the node information further includes storage address information of edges of nodes, and the storage address information of edges in the point table is storage address information of index information corresponding to edges in the edge table.
  • The edge information of different edges of the same node is stored contiguously in the edge table data area; the storage order of the edge information of the edges of the several nodes is the same as the storage order of the several nodes in the point table.
  • the edge index information also includes the edge type; the edge information also includes the node type of the target node; the edge information of the same node is stored sequentially in the edge table data area according to the edge type.
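  • A minimal sketch of an edge table split into an index area and a data area follows. It assumes one index entry per (node, edge type) pair holding a start position and a count; that entry layout, the function name `build_edge_table`, and the tuple shapes are illustrative assumptions, since the specification only states that the index holds storage address information and the edge type.

```python
from itertools import groupby


def build_edge_table(adjacency):
    """Build (index_area, data_area) from nodes given in point-table order.

    adjacency: list of (node_id, [(edge_type, target_id, target_type), ...]).
    """
    index_area, data_area = [], []
    for _node_id, edges in adjacency:
        # Keep all edges of one node contiguous, ordered by edge type.
        ordered = sorted(edges, key=lambda e: e[0])
        entries = []
        for edge_type, group in groupby(ordered, key=lambda e: e[0]):
            group = list(group)
            # Assumed entry layout: (edge type, start position, edge count).
            entries.append((edge_type, len(data_area), len(group)))
            data_area.extend((target, ttype) for _, target, ttype in group)
        index_area.append(entries)  # same order as the point table
    return index_area, data_area
```

  • With this layout, all edges of a given type for a given node can be sliced out of the data area using a single index entry.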
  • The edge attribute table includes an edge attribute table index area and an edge attribute table data area; the attribute information of the edges of the several nodes is stored in the edge attribute table data area; the edge attribute table index area stores the edge attribute index information of the edges of the several nodes, and the edge attribute index information includes the storage address information of the edge attribute information in the edge attribute table data area; the storage order of the edge attribute index information is consistent with the storage order of the edge information of the edges in the edge table data area.
  • the node information further includes node types, and the node information of the several nodes is stored in the point table in order of node identification.
  • The point attribute table includes a point attribute table index area and a point attribute table data area; the attribute information of the several nodes is stored in the point attribute table data area; the point attribute table index area stores the node attribute index information of the several nodes, and the node attribute index information includes the storage address information of the node's attribute information in the point attribute table data area; the storage order of the node attribute index information of the several nodes is consistent with the storage order of the several nodes in the point table.
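  • An attribute table with an index area and a data area can be sketched as below. Serializing each attribute record as JSON and indexing it by (offset, length) are assumptions made for illustration; the specification does not prescribe an encoding, and `build_attr_table` and `read_attrs` are hypothetical names.

```python
import json


def build_attr_table(attr_dicts):
    """Pack attribute records in order; return (index_area, data_area)."""
    index_area, data_area = [], bytearray()
    for attrs in attr_dicts:
        blob = json.dumps(attrs, sort_keys=True).encode("utf-8")
        # Assumed index entry: (offset into the data area, record length).
        index_area.append((len(data_area), len(blob)))
        data_area += blob
    return index_area, bytes(data_area)


def read_attrs(index_area, data_area, position):
    """Read the record at `position`, i.e. the row number in the aligned table."""
    offset, length = index_area[position]
    return json.loads(data_area[offset:offset + length].decode("utf-8"))
```

  • Since the index area follows the order of the point table (or the edge table data area), the row number of a node or edge serves directly as `position`.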
  • The system 500 further includes a table element generation module 550, used to generate the table element of the data block; the table element includes the storage address information of each table in the data block and the node identifier of the first node in the point table of the data block.
  • The data block includes encoded information; the system 500 also includes a vocabulary generation module 560, used to generate a vocabulary of the graph file; the vocabulary includes the mapping relationship between the encoded information in each data block of the graph file and the original information.
  • The system 500 also includes a data block index generation module 570, used to generate the data block index of the graph file; the data block index includes the storage address information of each data block in the graph file and the node identifier of the first node in each data block.
  • The system 500 further includes a graph file element generation module 580, used to generate a graph file element; the graph file element includes the graph file in which each data block is located, the serial number of the data block within that graph file, the node identifier of the first node in each graph file, and the node identifier of the last node in each graph file.
  • a data block is the smallest read/write unit.
  • The edges of the graph data include outgoing edges and incoming edges; the edge table includes an outgoing edge table and an incoming edge table; the edge attribute table includes an outgoing edge attribute table and an incoming edge attribute table; and the node information also includes the storage address information of the node's outgoing edges and of its incoming edges.
  • the device and its modules shown in FIG. 5 can be implemented in various ways.
  • the device and its modules may be implemented by hardware, software, or a combination of software and hardware.
  • the hardware part can be implemented by using dedicated logic;
  • the software part can be stored in a memory and executed by an appropriate instruction executing device, such as a microprocessor or specially designed hardware.
  • Processor control code may be provided, for example, on a carrier medium such as a magnetic disk, CD, or DVD-ROM, in a memory such as a read-only memory (firmware), or on a data carrier such as an optical or electronic signal carrier.
  • The device and its modules in this specification can be implemented by hardware circuits such as very-large-scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field-programmable gate arrays and programmable logic devices; by software executed by various types of processors; or by a combination of such hardware circuits and software (for example, firmware).
  • Fig. 6 is a schematic diagram of a data block structure according to some embodiments of the present specification.
  • Stored file 600 includes a graph file element and one or more graph files.
  • the graph file element includes the graph file where each data block in each graph file is located, the serial number of the data block in the graph file, the node identifier of the first node in each graph file, and the node identifier of the last node in each graph file.
  • The node identifier is the number of the node in the graph data and is used to trace the position of the node in the graph data. For example, the node identifiers can be set as node 1, node 2, ..., node m, and so on.
  • nodes in the graph data can be stored in multiple data blocks or graph files based on node identifiers, so as to quickly determine which graph file the target search node is in.
  • The graph file element can be understood as index information over multiple graph files, which can be called and accessed by a host computer or a server (for example, through an SDK).
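  • As a rough sketch, the graph file element can be modeled as a list of per-file identifier ranges that is scanned to find the file holding a target node. The dictionary keys, the file names, and `find_graph_file` are hypothetical, and integer node identifiers are assumed.

```python
# Hypothetical graph file element: one entry per graph file, recording the
# identifiers of the first and last node stored in that file.
graph_file_element = [
    {"file": "graph_0", "first_id": 1, "last_id": 500},
    {"file": "graph_1", "first_id": 501, "last_id": 1200},
    {"file": "graph_2", "first_id": 1201, "last_id": 1800},
]


def find_graph_file(node_id, element=graph_file_element):
    """Return the name of the graph file whose identifier range covers node_id."""
    for entry in element:
        if entry["first_id"] <= node_id <= entry["last_id"]:
            return entry["file"]
    return None  # the node is not stored in any listed graph file
```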
  • a graph file may include multiple data blocks.
  • a graph file may include a fixed number of data blocks, for example, a graph file may include 1024 data blocks.
  • The data block is the smallest read/write unit.
  • The processing device can write the graph data sequentially into one or more data blocks according to the data block format.
  • a data block can have a fixed size, such as 64 bytes, 128 bytes, etc. When a data block is full, a new data block is created to continue writing until a complete graph data is written.
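  • The fill-then-roll-over behaviour described above can be sketched as follows. The record representation (byte strings) and the function name `write_blocks` are assumptions for illustration; a real writer would also emit the tables and table element of each block.

```python
def write_blocks(records, block_size):
    """Pack byte records into fixed-capacity blocks, in order.

    A new block is started whenever the next record would not fit.
    """
    blocks, current, used = [], [], 0
    for rec in records:
        if len(rec) > block_size:
            raise ValueError("record larger than a data block")
        if used + len(rec) > block_size:
            blocks.append(current)  # the current block is full: seal it
            current, used = [], 0
        current.append(rec)
        used += len(rec)
    if current:
        blocks.append(current)  # seal the last, possibly partial, block
    return blocks
```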
  • The data in a data block may come from the same graph data or from different graph data.
  • the data block includes a point table, a point attribute table, an edge table, and an edge attribute table.
  • The data block can also include a table element, which includes the storage address information of each table in the data block and the node identifier of the first node in the point table of the data block. The table element can be regarded as the index information inside the data block, which makes it easy to quickly locate the storage location of each table.
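  • One way to picture the table element is as a small fixed-size header placed in front of the four tables. The header layout below (four 4-byte offsets followed by an 8-byte first node identifier, little-endian) and the names `pack_block` and `read_table` are assumptions for this sketch, not the layout used in the specification.

```python
import struct

# Assumed table element layout: offsets of the point table, edge table,
# point attribute table, and edge attribute table, then the first node id.
HEADER = struct.Struct("<4IQ")


def pack_block(point, edge, point_attr, edge_attr, first_node_id):
    """Concatenate the four tables behind a table element that locates them."""
    tables = [point, edge, point_attr, edge_attr]
    offsets, cursor = [], HEADER.size
    for table in tables:
        offsets.append(cursor)
        cursor += len(table)
    return HEADER.pack(*offsets, first_node_id) + b"".join(tables)


def read_table(block, which):
    """Return table number `which` (0..3) using only the table element."""
    *offsets, _first_node_id = HEADER.unpack_from(block, 0)
    end = offsets[which + 1] if which + 1 < len(offsets) else len(block)
    return block[offsets[which]:end]
```

  • Reading the header alone is enough to jump to any table, which matches the role of the table element as an index inside the data block.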
  • the graph file may also include file footer information, data block indexes, and vocabulary.
  • The vocabulary of the graph file records the mapping relationship between the encoded information and the original information, and it can be used to encode or decode at least part of the information in the graph file. For example, information such as edge types and node types can be represented by numbers, such as 1 for user-type nodes and 2 for company-type nodes, so that numbers such as 1 and 2 can be stored in the point table to represent the corresponding node types. Representing text with shorter numbers or letters can effectively reduce the actual storage space of the graph data. Correspondingly, mappings such as "1" - user node and "2" - company node are recorded in the vocabulary.
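  • A vocabulary of this kind can be sketched as a two-way mapping between text and small integer codes. The class name `Vocabulary`, its methods, and the choice to assign codes in order of first appearance starting from 1 are assumptions for illustration.

```python
class Vocabulary:
    """Two-way mapping between original text and compact integer codes."""

    def __init__(self):
        self._code_of = {}   # original text -> code
        self._text_of = []   # code - 1 -> original text

    def encode(self, text):
        """Return the code for `text`, assigning a new code on first use."""
        if text not in self._code_of:
            self._text_of.append(text)
            self._code_of[text] = len(self._text_of)  # codes start at 1
        return self._code_of[text]

    def decode(self, code):
        """Return the original text recorded for `code`."""
        return self._text_of[code - 1]
```

  • The tables then store only the integer codes, and the vocabulary is consulted when a human-readable type is needed.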
  • The data block index of the graph file includes the storage address information of each data block in the graph file and the node identifier of the first node in each data block.
  • Using the data block index of the graph file, it is possible to quickly determine which data block the target query node is in.
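  • Locating a block from the first node identifier of each block can be sketched with the standard `bisect` module. The sketch assumes integer node identifiers stored in ascending order across blocks; `find_block` and the sample addresses are hypothetical.

```python
from bisect import bisect_right


def find_block(first_ids, addresses, node_id):
    """Return the address of the block whose identifier range covers node_id.

    first_ids: ascending first-node identifiers, one per data block.
    addresses: storage address of each block, in the same order.
    """
    pos = bisect_right(first_ids, node_id) - 1
    if pos < 0:
        return None  # node_id precedes the first block
    return addresses[pos]
```

  • For example, with `first_ids = [1, 100, 250]` and `addresses = [0, 4096, 8192]`, node 120 falls in the second block, so `find_block` returns 4096.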
  • the file footer information includes the total number of nodes in the data block, the total number of edges, and file extension areas (such as file protocol, compression algorithm, correction information, etc.).
  • Fig. 8 is an exemplary flow chart of querying graph data according to some embodiments of the present specification.
  • the method of using the stored file will be described by taking the known target query node and finding the N-hop subgraph of the target query node as an example.
  • the N-hop subgraph includes N-hop edges of the target query node and nodes on each edge.
  • the storage device receives a query request from a service end or a processing device.
  • the query request includes a node identifier of a target query node.
  • The storage device accesses the graph file element and, using the node identifiers of the first node and the last node of each graph file stored there, determines which graph file the target query node is stored in (e.g., graph file V), as in step 820. Then, based on the node identifier of the first node in each data block stored in the data block index of that graph file (the data block index of graph file V), it determines the target data block where the target query node is located, as in step 830.
  • Based on the storage address information of each data block stored in the data block index, the target data block can then be obtained, as in step 840.
  • Within the target data block, the point table can be located based on the table element, and the node information of the target query node can be found in the point table based on its node identifier.
  • For example, a binary search can quickly locate the node information of the target query node, as in step 850.
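  • A binary search over a point table of fixed-width records might look like the following. The record layout (an 8-byte node identifier followed by a 4-byte edge offset, little-endian) and the name `find_node` are assumptions for this sketch; the search relies on records being stored in ascending identifier order.

```python
import struct

# Assumed point-table record: 8-byte node id, 4-byte edge storage offset.
RECORD = struct.Struct("<QI")


def find_node(point_table, node_id):
    """Binary-search a packed point table; return (node id, edge offset) or None."""
    lo, hi = 0, len(point_table) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        nid, edge_offset = RECORD.unpack_from(point_table, mid * RECORD.size)
        if nid == node_id:
            return nid, edge_offset
        if nid < node_id:
            lo = mid + 1
        else:
            hi = mid
    return None  # node_id is not stored in this point table
```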
  • Based on the node information of the target query node, the one-hop subgraph of the target query node can be obtained through a single read operation (such as loading the data block into memory).
  • Obtain the node identifiers of the first-hop neighbor nodes of the target query node (the nodes on its first-hop edges) in the one-hop subgraph, and repeat the above steps to find the one-hop subgraph of each first-hop neighbor node; this yields the two-hop subgraph of the target query node. Continuing in the same way yields the N-hop subgraph of the target query node.
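  • The hop-by-hop expansion can be sketched as a breadth-first traversal. Here a plain dictionary `adjacency` stands in for the one-hop lookup against the stored data blocks; that substitution, and the name `n_hop_subgraph`, are assumptions for illustration.

```python
def n_hop_subgraph(adjacency, start, n):
    """Collect the nodes and edges reachable from `start` within n hops.

    adjacency: dict mapping a node id to the list of its target node ids,
    standing in for a one-hop lookup in the stored graph file.
    """
    visited = {start}
    frontier = [start]
    edges = []
    for _ in range(n):
        next_frontier = []
        for node in frontier:
            for target in adjacency.get(node, []):
                edges.append((node, target))
                if target not in visited:
                    visited.add(target)
                    next_frontier.append(target)
        frontier = next_frontier  # neighbors found at this hop seed the next hop
    return visited, edges
```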
  • the edges of graph data may include outgoing edges and incoming edges.
  • The edge table described in this specification can be further divided into an outgoing edge table and an incoming edge table; correspondingly, the edge attribute table includes an outgoing edge attribute table and an incoming edge attribute table, and the node information includes the storage address information of the node's outgoing edges and of its incoming edges.
  • Fig. 7 is an exemplary flowchart of graph data storage according to some embodiments of the present specification.
  • Process 700 may include steps 710, 720, ..., 780; a detailed description of process 700 follows.
  • Step 710: storing the node information of several nodes in the graph data in the point table of the data block.
  • step 710 may be performed by the node information storage module 510 .
  • the node information storage module 510 fills the node information into the point table in order based on the format of the set point table.
  • Graph data includes nodes and edges.
  • the node information storage module 510 may select several nodes from the graph data for storage. Several nodes can be all the nodes of the graph data, or some of them.
  • FIG. 2 is a schematic diagram of an exemplary point table 210 .
  • Node information of several nodes is stored in the point table, and the node information includes node identifiers.
  • the node identifier is the number indicating the node in the graph data, and is used to trace the position of the node in the graph data.
  • The node identifiers can be set as node 1, node 2, ..., node m, and so on.
  • the node information stored in the point table is stored based on the order of node identification.
  • the node information storage module 510 may select several nodes with consecutive node IDs from the graph data, and store the node information of these nodes sequentially according to the ascending or descending order of the node IDs.
  • the node information also includes storage address information of the edge corresponding to the node, and the storage address information of the edge indicates the storage location of the edge in the edge table, for example, it may be the storage address information of the edge index information in the edge table.
  • the storage address information may be an absolute address, or an offset relative to a certain starting position.
  • the storage address information of the edge index information in the edge table may be an absolute address, or an offset relative to the starting position of the edge table.
  • a node can contain multiple edges.
  • the node information storage module 510 can record the storage address information of each edge of the node in the point table, that is, a node information can record the storage address information of all edges connected to the node.
  • the edge information of the same node can be continuously stored in the edge table. For example, node A has 5 edges and node B has 3 edges.
  • Node B's edge information is stored contiguously in another area (e.g., an area of 12 × 3 bytes) starting from the second storage location (e.g., the 76th byte in the edge table).
  • The edge storage address information of each node stored in the point table may include only the starting storage location of its edges in the edge table (for example, the edge storage address information of node A is the first storage location, and that of node B is the second storage location). That is, the storage area between the edge storage address of one node and the edge storage address of the next node in the point table is regarded as the storage area of the edges of the former node.
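  • A minimal sketch of how start locations alone determine each node's edge range follows. The 12-byte edge size and the 76-byte start for node B follow the example in the text; the start offset 16 used for node A in the usage note, the end offset 112, and the name `edge_ranges` are assumptions for illustration.

```python
EDGE_SIZE = 12  # bytes per edge record, as in the example in the text


def edge_ranges(start_offsets, edge_table_end):
    """Derive (start, end, edge count) per node from consecutive start offsets.

    start_offsets: each node's first-edge offset, in point-table order.
    edge_table_end: offset just past the last stored edge.
    """
    bounds = list(start_offsets) + [edge_table_end]
    return [
        (bounds[i], bounds[i + 1], (bounds[i + 1] - bounds[i]) // EDGE_SIZE)
        for i in range(len(start_offsets))
    ]
```

  • For instance, `edge_ranges([16, 76], 112)` gives node A the range 16 to 76 (5 edges) and node B the range 76 to 112 (3 edges), so no per-node length field is needed.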
  • An edge may have a direction, so a node may have outgoing edges and/or incoming edges, where an incoming edge is an edge pointing to the node, and an outgoing edge is an edge starting from the node and pointing to another node.
  • the edge storage address information in the node information can be further divided into the storage address information of the incoming edge and the storage address information of the outgoing edge.
  • the edge table can include two types: an in-edge table and an out-edge table. The in-edge table stores only the edge information of incoming edges, and the out-edge table stores the edge information of outgoing edges.
  • the storage address information of the outgoing/incoming edge in the node information and the storage method of the outgoing/incoming edge information in the outgoing/incoming edge table are similar to those described above, and will not be repeated here.
  • the node information may also include node type information. Since a node can describe any entity or object in the physical world, it can be of different types. For example, a user-type node, a company-type node, a location-type node, and so on.
  • the node type (not shown in the figure) may be stored between the node identifier of each node and the storage address information of the edge as shown in FIG. 2 .
  • the node types can be enumerated exhaustively.
  • the node types can also be encoded in the graph file through the vocabulary, so that the point table stores only the encoded node type.
  • when it is necessary to read the node type of a node from the point table, the encoded value can be decoded, again based on the vocabulary, into a node type with clear semantics, such as "user class node".
  • encoding and decoding within the file through the vocabulary simplifies the representation of the node type and thus further reduces the storage space.
  • for more descriptions about the vocabulary, refer to the description of FIG. 6, which will not be repeated here.
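The vocabulary-based encoding of node types can be illustrated with a small sketch. The `Vocabulary` class and its integer codes are hypothetical; the specification does not state how codes are assigned.

```python
class Vocabulary:
    """Maps type descriptions to small integer codes and back (hypothetical helper)."""

    def __init__(self):
        self._code_of = {}   # description text -> code
        self._text_of = []   # code -> description text

    def encode(self, text):
        # Assign the next free code the first time a description is seen.
        if text not in self._code_of:
            self._code_of[text] = len(self._text_of)
            self._text_of.append(text)
        return self._code_of[text]

    def decode(self, code):
        return self._text_of[code]


vocab = Vocabulary()
codes = [vocab.encode(t) for t in ["user", "company", "user", "location"]]
decoded = [vocab.decode(c) for c in codes]
```

The point table would then hold only the integer codes, while the vocabulary is kept once per graph file.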
  • the node information may also be stored in the order of node types first, and then in the order of node identifiers.
  • user class nodes can be stored together, and stored sequentially according to node identifiers among multiple user class nodes.
  • the node types can be ordered by the pinyin initial of the first character of the node type description text, or by the first letter of its first word.
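A two-level ordering of this kind (node type first, then node identifier) might look like the following sketch. The dictionary fields and the plain alphabetical comparison of type names are assumptions made for illustration.

```python
# Hypothetical node records; field names are illustrative only.
nodes = [
    {"id": 7, "type": "user"},
    {"id": 3, "type": "company"},
    {"id": 2, "type": "user"},
    {"id": 9, "type": "company"},
]

# Sort by node type first, then by node identifier within each type.
ordered = sorted(nodes, key=lambda n: (n["type"], n["id"]))
order = [(n["type"], n["id"]) for n in ordered]
```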
  • the point table 210 shown in FIG. 2 also includes a header identification bit for indicating whether the table has an index area. In some embodiments, the point table does not include an index area, and its header identification bit stores "0".
  • Step 720 storing the edge information of the edges of the several nodes in the edge table of the data block.
  • step 720 may be performed by the edge information storage module 520.
  • the edge information storage module 520 fills the data into the edge table in sequence based on the configured edge table format.
  • the edge table may include an edge table index area and an edge table data area. It can be understood that, since an edge can be described by the two target nodes it connects, the edge information can include the node identifiers of the target nodes connected to the edge.
  • the edge information is stored in the edge table data area.
  • the edge table data area stores pairs of node IDs of target nodes, wherein each pair of node IDs corresponds to one edge.
  • the edge table index area stores the index information of the edge information of each edge in the edge table, for example, includes the storage address information of the node identifier of the target node corresponding to each edge in the edge table data area.
  • the header flag indicates whether the table has an index area. Exemplarily, setting the header identification bit to "1" indicates that there is an index area; setting the header identification bit to "0" indicates that there is no index area. Since all edge tables contain index areas, the table header flag is 1.
  • the index area length indicates the total length of the edge table index area, such as the number of bytes occupied by the edge table index area. The length of the index area therefore indicates the position at which the edge table data area begins.
  • the edge table index area is used to store the index information of each edge, for example, the index information of edge A points to the position of the data of edge A in the edge table data area.
  • the edge table data area is used to store the edge information of each edge.
  • the edge information may also include the node types of the target nodes.
  • the storage length of each piece of edge information is the same. For example, for each edge, 4 bytes are used to store the node types of the two target nodes, and 8 bytes are used to store the node identifiers of the two target nodes.
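A fixed-length edge record of this kind can be sketched with Python's `struct` module. The split into two 2-byte type codes and two 4-byte node identifiers, and the little-endian byte order, are assumptions; the text only gives the 4-byte and 8-byte totals.

```python
import struct

# Assumed layout: two 2-byte type codes followed by two 4-byte node IDs,
# little-endian, no padding, 12 bytes per record.
EDGE_RECORD = struct.Struct("<HHII")


def pack_edge(src_type, dst_type, src_id, dst_id):
    """Serialize one edge into a fixed-length record."""
    return EDGE_RECORD.pack(src_type, dst_type, src_id, dst_id)


def unpack_edge(buf, index):
    """Read the index-th record; fixed length turns the lookup into a multiplication."""
    return EDGE_RECORD.unpack_from(buf, index * EDGE_RECORD.size)


data = pack_edge(0, 1, 100, 200) + pack_edge(0, 0, 100, 300)
second = unpack_edge(data, 1)
```

With equal-length records, the n-th edge of a node can be located without scanning the preceding records.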
  • the storage order of the edge index information is consistent with the storage order of the nodes in the point table (also referred to as the alignment of the edge table and the point table). For example, starting from the beginning of the edge table index area, the edge index information of the first node in the point table is stored, then the edge index information of the second node, and so on.
  • the edge table data area can store the edge information of each edge sequentially, following the storage order of the edge index information in the edge table index area.
  • the index information of the corresponding edge can be found according to the position of the node in the point table.
  • the storage order of the edge information in the edge table is consistent with the storage order of the nodes in the point table, and the edge information of the same node is stored together contiguously.
  • node A is connected to three nodes K, M, and L
  • node B is connected to two nodes Q and G.
  • the storage order of node A in the point table is the first
  • the storage order of node B in the point table is the second.
  • the edge information of the three edges A-K, A-M, and A-L, and the edge information of the two edges B-Q and B-G are stored sequentially from the starting position of the edge table data area.
  • the edge index information stored in the edge table index area may include only the initial storage position, in the edge table, of the edge information of the edges corresponding to each node (for example, the edge index information corresponding to node A includes the storage address information of edge A-K, and the edge index information corresponding to node B includes the storage address information of edge B-Q). That is, in the edge table, the storage area between the position indicated by one node's edge index information and the position indicated by the next node's edge index information is regarded as holding the edge information of the former node's edges.
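The edge table layout described above (header flag, index area length, index area aligned with the point table, data area) can be sketched as follows. The field widths, the byte order, the integer node IDs, and the function names `build_edge_table` and `edges_of` are assumptions for illustration only.

```python
import struct

HEADER = struct.Struct("<BI")  # header flag, index area length in bytes (assumed widths)
OFFSET = struct.Struct("<I")   # one start offset per node, relative to the data area
EDGE = struct.Struct("<II")    # a pair of node IDs per edge


def build_edge_table(adjacency):
    """adjacency: list of (node_id, [neighbor_ids]) in point table order."""
    index, data = bytearray(), bytearray()
    for node_id, neighbors in adjacency:
        index += OFFSET.pack(len(data))  # where this node's edges begin
        for neighbor in neighbors:
            data += EDGE.pack(node_id, neighbor)
    return HEADER.pack(1, len(index)) + bytes(index) + bytes(data)


def edges_of(table, position):
    """Return the edges of the node stored at `position` in the point table."""
    flag, index_len = HEADER.unpack_from(table, 0)
    assert flag == 1, "edge tables always carry an index area"
    index_start = HEADER.size
    data_start = index_start + index_len
    node_count = index_len // OFFSET.size
    (start,) = OFFSET.unpack_from(table, index_start + position * OFFSET.size)
    if position + 1 < node_count:
        # The next node's start offset marks the end of this node's edges.
        (end,) = OFFSET.unpack_from(table, index_start + (position + 1) * OFFSET.size)
    else:
        end = len(table) - data_start
    return [EDGE.unpack_from(table, data_start + off) for off in range(start, end, EDGE.size)]


# Made-up integer IDs standing in for nodes A, B, K, M, L, Q and G.
A, B, K, M, L, Q, G = 1, 2, 11, 13, 12, 17, 7
table = build_edge_table([(A, [K, M, L]), (B, [Q, G])])
```

Calling `edges_of(table, 0)` yields the three edges of node A, and `edges_of(table, 1)` the two edges of node B, using only the node's position in the point table.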
  • the edge table index area also includes the edge type of each edge.
  • the edge index information of edge A not only stores the address information, but also includes the edge type.
  • the edge type can reflect the interactive relationship between two entities, such as the litigation relationship between two enterprises or the economic transaction relationship between two enterprises.
  • the edge index information corresponding to the node in the edge table index area may include multiple edge types and multiple storage address information, wherein the multiple edge types are continuously stored, and the multiple storage address information are also continuously stored.
  • for example, if node B has multiple edges belonging to two edge types, two edge types and two pieces of storage address information can be stored contiguously in the edge index information of node B. The first storage address information is that of the edge information belonging to the first edge type among the edges of node B in the edge data area (for example, the initial storage location of that edge information in the edge data area); the second storage address information is that of the edge information belonging to the second edge type among the edges of node B in the edge data area (for example, the initial storage location of that edge information in the edge data area).
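One way to picture a per-node index entry that holds several edge types is sketched below. It assumes that edges of the same type are stored contiguously and that each edge record is 12 bytes long; the helper names are hypothetical.

```python
EDGE_SIZE = 12  # assumed fixed edge record length in bytes


def index_entry(type_counts, first_offset):
    """Build one node's index entry.

    type_counts: list of (edge_type_code, number_of_edges); edges of the
    same type are assumed to be contiguous in the data area.
    Returns (types, starts, end_offset), with the types and the start
    addresses kept as two separate contiguous runs.
    """
    types, starts = [], []
    offset = first_offset
    for type_code, count in type_counts:
        types.append(type_code)
        starts.append(offset)
        offset += count * EDGE_SIZE
    return types, starts, offset


def range_for_type(types, starts, end_offset, wanted):
    """Byte range of the edges of one type, or None if the node has none."""
    if wanted not in types:
        return None
    i = types.index(wanted)
    end = starts[i + 1] if i + 1 < len(starts) else end_offset
    return starts[i], end


# Node B: 2 edges of type 0 followed by 3 edges of type 1, starting at offset 36.
types, starts, end_offset = index_entry([(0, 2), (1, 3)], 36)
```

A query restricted to one edge type can then read only the matching byte range instead of all edges of the node.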
  • as with the node type, the edge type can be encoded inside the graph file using a vocabulary, so that the edge table stores only the internal encoding of the edge type.
  • for more descriptions about the vocabulary, refer to the corresponding description in FIG. 6, which will not be repeated here.
  • edges have directions and nodes may have outgoing and/or incoming edges.
  • the edge table can include two types: an in-edge table and an out-edge table.
  • the in-edge table stores only the relevant data of incoming edges
  • the out-edge table stores the relevant data of outgoing edges.
  • the storage method of the relevant data of the outgoing/incoming edge in the outgoing/incoming edge table is similar to the foregoing content, and will not be repeated here.
  • Step 730 storing the attribute information of the several nodes in the point attribute table of the data block.
  • step 730 may be performed by the node attribute information storage module 530 .
  • the node attribute information storage module 530 fills the data into the point attribute table in sequence based on the format of the set point attribute table.
  • FIG. 4 is a schematic diagram of an exemplary attribute table 240 .
  • the point attribute table and the edge attribute table may have the same format. Therefore, the attribute table 240 can also be regarded as a point attribute table.
  • the point attribute table includes the point attribute table index area and the point attribute table data area, and the attribute information of the point is stored in the point attribute table data area; the point attribute table index area stores the point attribute index information of the point, and the point attribute index information includes the point The storage address information of the attribute information in the point attribute table data area.
  • each attribute index information can point to an attribute data.
  • the point attribute table may also be aligned with the point table.
  • the storage order of the point attribute index information in the point attribute table is consistent with the storage order of the node information in the point table. With such a setting, it is possible to locate the point attribute index information according to the storage order of the nodes in the point table, and further obtain the attribute information of the node from the point attribute table data area based on the point attribute index information.
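The aligned attribute table can be sketched in the same spirit. Here the attribute values are variable-length byte strings, so the index area stores one start offset per node; the field widths and the function names are assumptions, not part of the specification.

```python
import struct


def build_attribute_table(values):
    """values: one bytes object per node, in point table order."""
    index, data = bytearray(), bytearray()
    for value in values:
        index += struct.pack("<I", len(data))  # start offset within the data area
        data += value
    header = struct.pack("<BI", 1, len(index))  # header flag 1, index area length
    return header + bytes(index) + bytes(data)


def attribute_at(table, position):
    """Fetch the attribute bytes of the node stored at `position` in the point table."""
    _flag, index_len = struct.unpack_from("<BI", table, 0)
    index_start = struct.calcsize("<BI")
    data_start = index_start + index_len
    count = index_len // 4
    (start,) = struct.unpack_from("<I", table, index_start + 4 * position)
    if position + 1 < count:
        (end,) = struct.unpack_from("<I", table, index_start + 4 * (position + 1))
    else:
        end = len(table) - data_start
    return table[data_start + start:data_start + end]


attr_table = build_attribute_table([b"age=30", b"", b"city=Hangzhou"])
```

Because the index area follows the point table order, no node identifier needs to be repeated inside the attribute table.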
  • the attribute table 240 may also include a header flag "1" and the length of the index area.
  • Step 740: storing the attribute information of the edges of the several nodes in the edge attribute table of the data block.
  • step 740 may be performed by the edge attribute information storage module 540 .
  • the edge attribute information storage module 540 fills data into the edge attribute table in sequence based on the format of the set edge attribute table.
  • the attribute table 240 can also be regarded as an edge attribute table.
  • the attribute information of the edges of several nodes is stored in the edge attribute table data area; the edge attribute table index area stores the attribute index information of each edge, and the edge attribute index information includes the attribute information of the edge in the edge attribute table data area storage address information.
  • the storage order of the edge attribute index information in the edge attribute table index area is consistent with the storage order of the edge information of each edge in the edge table data area.
  • edges have directions and nodes may have outgoing and/or incoming edges.
  • the edge attribute table may include two types: an incoming edge attribute table and an outgoing edge attribute table, wherein only the attribute information of the incoming edge is stored in the incoming edge attribute table, and the attribute information of the outgoing edge is stored in the outgoing edge attribute table.
  • the storage method of the attribute information of the outgoing/incoming edge in the outgoing/incoming edge attribute table is similar to the foregoing content, and will not be repeated here.
  • the process 700 further includes step 750: generating the table element of the data block.
  • step 750 may be performed by the table element generation module 550.
  • the table element includes the storage address information of each table in the data block and the node identifier of the first node in each point table in the data block. For more descriptions about the table elements, refer to the corresponding description in FIG. 6 , which will not be repeated here.
  • multiple data blocks may be generated according to steps 710-740, and the multiple data blocks constitute a graph file.
  • the graph file can also include information such as a vocabulary and a data block index.
  • the process 700 further includes step 760: generating a vocabulary of the graph file.
  • step 760 may be performed by the vocabulary generation module 560 .
  • the data block includes encoding information
  • the vocabulary of the graph file can also be generated.
  • the vocabulary includes the mapping relationship between the encoded information in each data block of the graph file and the original information. For more descriptions about the vocabulary, refer to the corresponding description in FIG. 6, which will not be repeated here.
  • the process 700 further includes step 770: generating a data block index of the graph file.
  • step 770 may be performed by the data block index generation module 570 .
  • the data block index of the graph file includes the storage address information of each data block in the graph file and the node identifier of the first node in each data block, and is used to determine which data block the target query node is in. For more descriptions about the data block index, refer to the corresponding description in FIG. 6, which will not be repeated here.
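A lookup against such a data block index could be sketched with a binary search over the first node identifiers. This assumes that nodes are stored in ascending identifier order across blocks; the index contents and the function name are made up for the example.

```python
from bisect import bisect_right

# (first_node_id, block_offset) per data block, sorted by first_node_id.
# The values are made up for illustration.
block_index = [(1, 0), (1000, 4096), (5000, 8192)]
first_ids = [first for first, _ in block_index]


def locate_block(node_id):
    """Return the offset of the block that may contain node_id, or None."""
    # Find the last block whose first node ID is <= node_id.
    i = bisect_right(first_ids, node_id) - 1
    return block_index[i][1] if i >= 0 else None
```

For instance, `locate_block(999)` falls in the first block, while `locate_block(1000)` falls in the second.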
  • a graph file is generated based on the graph data, and in some embodiments multiple graph files can be generated to form a storage file.
  • the storage file may also include a graph file element.
  • the process 700 further includes step 780: generating a graph file element.
  • the graph file element includes, for each data block, the graph file in which it is located and its serial number within that graph file, as well as the node identifier of the first node and the node identifier of the last node in each graph file; these node identifiers are used to determine which graph file the target query node is in.
  • for more descriptions about the graph file element, refer to the corresponding description in FIG. 6, which will not be repeated here.
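Choosing the graph file from the first and last node identifiers can be illustrated with a simple range check. The file names and identifier ranges below are invented, and the dictionary layout is only one possible representation of the graph file element.

```python
# Hypothetical graph file element: first and last node ID per graph file.
graph_files = [
    {"name": "graph-0", "first": 1, "last": 9999},
    {"name": "graph-1", "first": 10000, "last": 24999},
]


def locate_graph_file(node_id):
    """Return the name of the graph file whose ID range covers node_id, or None."""
    for entry in graph_files:
        if entry["first"] <= node_id <= entry["last"]:
            return entry["name"]
    return None
```

Storing the last node identifier as well lets a query reject identifiers that fall outside every file without opening any of them.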
  • the possible beneficial effects of the embodiments of this specification include but are not limited to: 1) several nodes of the graph data, the edges of these nodes, and the related attribute information are stored in one data block, so the edge and attribute information related to a node can be found within the block without multiple read and write operations; 2) the graph data is stored in multiple data blocks in an orderly manner, so large-scale graph data can be distributed across multiple devices, and when a graph query is performed multiple devices can query in parallel (for example, different devices query different data blocks), which saves retrieval time and improves the response speed of graph queries; 3) the point table, edge table, and attribute tables are aligned, which saves storage space in the edge tables and attribute tables.
  • different embodiments may have different beneficial effects.
  • the possible beneficial effects may be any one or a combination of the above, or any other possible beneficial effects.
  • aspects of this specification can be illustrated and described in terms of several patentable categories or situations, including any new and useful process, machine, product or composition of matter, or any new and useful improvement thereof.
  • various aspects of this specification may be entirely executed by hardware, may be entirely executed by software (including firmware, resident software, microcode, etc.), or may be executed by a combination of hardware and software.
  • the above hardware or software may be referred to as “block”, “module”, “engine”, “unit”, “component” or “system”.
  • aspects of this specification may be embodied as a computer product comprising computer readable program code on one or more computer readable media.
  • a computer storage medium may contain a propagated data signal embodying a computer program code, for example, in baseband or as part of a carrier wave.
  • the propagated signal may have various manifestations, including electromagnetic form, optical form, etc., or a suitable combination.
  • a computer storage medium may be any computer-readable medium, other than a computer-readable storage medium, that can be used to communicate, propagate, or transfer a program for use by being coupled to an instruction execution system, apparatus, or device.
  • Program code residing on a computer storage medium may be transmitted over any suitable medium, including radio, electrical cable, fiber optic cable, RF, or the like, or combinations of any of the foregoing.
  • the computer program code required for the operation of each part of this specification can be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python, etc., conventional procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages.
  • the program code may run entirely on the user's computer, or as a stand-alone software package, or run partly on the user's computer and partly on a remote computer, or entirely on the remote computer or processing device.
  • the remote computer can be connected to the user's computer through any form of network, such as a local area network (LAN) or wide area network (WAN), or connected to an external computer (such as through the Internet), or used in a cloud computing environment, or used as a service such as software as a service (SaaS).
  • numbers describing the quantity of components and attributes are used, and it should be understood that such numbers used in the description of the embodiments are, in some examples, modified by the terms "about", "approximately" or "substantially". Unless otherwise stated, "about", "approximately" or "substantially" indicates that the stated figure allows for a variation of ±20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that can vary depending upon the desired characteristics of individual embodiments. In some embodiments, numerical parameters should take into account the specified significant digits and adopt a general method of retaining digits. Although the numerical ranges and parameters used in some embodiments of this specification to establish the breadth of their ranges are approximations, in specific embodiments such numerical values are set as precisely as practicable.

Abstract

The present description relates to a graph data storage method, system and apparatus. Graph data comprises a node and an edge. The storage method comprises: storing, in a node table of a data block, node information of several nodes in graph data, wherein the node information comprises node identifiers; storing, in an edge table of the data block, edge information of edges of the several nodes, wherein the edge information comprises node identifiers of target nodes connected to the edges; storing, in a node attribute table of the data block, attribute information of the several nodes; and storing, in an edge attribute table of the data block, attribute information of the edges of the several nodes.

Description

Storage of graph data
Technical field
One or more embodiments of this specification relate to the field of computers, and in particular to a method, system, and device for storing graph data.
Background
Currently, various databases can be used to store and manage graph data. With the emergence of new Internet applications such as social networks, the mobile Internet, and the Internet of Things (IoT), the interaction data generated by various entities (such as users, systems, and sensors) is growing exponentially, and the scale and complexity of graph data have increased significantly. When storing and managing massive, complex graph data, the database needs high read and write efficiency to support graph processing operations such as efficient data traversal, association queries, and one-hop subgraph expansion (a one-hop subgraph is the subgraph formed by a node and the edges connected to that node).
Therefore, there is an urgent need for a graph data storage method, system, and device that provide efficient storage of graph data and support functions such as complex relationship queries over graph data.
Summary of the invention
One aspect of this specification provides a method for storing graph data, where the graph data includes nodes and edges. The storage method includes: storing node information of several nodes in the graph data in a point table of a data block, the node information including a node identifier; storing the edge information of the edges of the several nodes in the edge table of the data block, the edge information including the node identifier of the target node connected to the edge; storing the attribute information of the several nodes in the point attribute table of the data block; and storing the attribute information of the edges of the several nodes in the edge attribute table of the data block.
Another aspect of this specification provides a graph data storage system, where the graph data includes nodes and edges. The storage system includes: a node information storage module, used to store the node information of several nodes in the graph data in the point table of a data block, the node information including a node identifier; an edge information storage module, used to store the edge information of the edges of the several nodes in the edge table of the data block, the edge information including the node identifier of the target node connected to the edge; a node attribute information storage module, used to store the attribute information of the several nodes in the point attribute table of the data block; and an edge attribute information storage module, used to store the attribute information of the edges of the several nodes in the edge attribute table of the data block.
Another aspect of this specification provides a graph data storage device. The device includes a processor and a memory; the memory is used to store instructions, and the processor is used to execute the instructions to implement the graph data storage method.
Another aspect of this specification provides a graph data file, where the graph data includes nodes and edges. The file includes several data blocks, wherein each data block includes: a point table, used to store the node information of at least some nodes in the graph data, the node information including a node identifier; an edge table, used to store the edge information of the edges of the nodes, the edge information including the node identifier of the target node connected to the edge; a point attribute table, used to store the attribute information of the nodes; and an edge attribute table, used to store the attribute information of the edges of the nodes.
Brief description of the drawings
This specification will be further described by way of exemplary embodiments, which will be described in detail with reference to the accompanying drawings. These embodiments are non-limiting, and in these embodiments the same reference numbers indicate the same structures, wherein:
Fig. 1 is a schematic diagram of an application scenario of an exemplary graph data storage system according to some embodiments of this specification;
Fig. 2 is a schematic diagram of a point table according to some embodiments of this specification;
Fig. 3 is a schematic diagram of an edge table according to some embodiments of this specification;
Fig. 4 is a schematic diagram of a point/edge attribute table according to some embodiments of this specification;
Fig. 5 is a block diagram of a system for graph data storage according to some embodiments of this specification;
Fig. 6 is a schematic diagram of a data block structure according to some embodiments of this specification;
Fig. 7 is an exemplary flow chart of graph data storage according to some embodiments of this specification;
Fig. 8 is an exemplary flow chart of graph data query according to some embodiments of this specification.
Detailed description
In order to explain the technical solutions of the embodiments of this specification more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some examples or embodiments of this specification, and those of ordinary skill in the art can, without creative effort, apply this specification to other similar scenarios based on these drawings. Unless apparent from the context or otherwise stated, the same reference numerals in the figures represent the same structures or operations.
It should be understood that "system", "device", "unit" and/or "module" as used in this specification is a way of distinguishing different components, elements, parts, sections or assemblies at different levels. However, these words may be replaced by other expressions if the other expressions achieve the same purpose.
As used in this specification and the claims, unless the context clearly indicates an exception, the words "a", "an", "one" and/or "the" do not refer specifically to the singular and may also include the plural. Generally speaking, the terms "comprising" and "including" only indicate the inclusion of the explicitly identified steps and elements; these steps and elements do not constitute an exclusive list, and the method or device may also contain other steps or elements.
Flowcharts are used in this specification to illustrate the operations performed by the system according to the embodiments of this specification. It should be understood that the preceding or following operations are not necessarily performed in the exact order shown. Instead, the steps may be processed in reverse order or simultaneously. Other operations can also be added to these processes, or one or more steps can be removed from them.
Fig. 1 is a schematic diagram of an application scenario of an exemplary graph database storage system according to some embodiments of this specification.
With the emergence of new Internet applications such as social networks, the mobile Internet, and the Internet of Things (IoT), the data generated between different entities (such as users, systems, and sensors) is growing exponentially, and the internal dependencies and complexity of the data are increasing. Graph data is usually used to describe and characterize the relationships between different entities. Graph data consists of multiple nodes and the edges connecting them, where the nodes represent entities and the edges between nodes represent the relationships between entities. An entity can be an object or institution that actually exists in the physical world, or an abstract concept, for example a company, a device, a person, goods, a storage location, a means of transportation, an image, a computer program, or an account. Entities can have attribute information. Taking the entity "person" as an example, the attribute information includes age, gender, occupation, employer, home address, and so on; for a company, the attribute information includes the registered address, legal representative, business scope, registered capital, and other information. Edges between entities (i.e., edge information) can reflect the relationships between entities. For example, there may be an employment relationship between a person entity and a company entity, and Zhang San and Li Si may be friends. Edges can also have attribute information. For example, the attribute information of an employment relationship can include the time it was established and the type of employment (formal or temporary).
随着互联网技术的发展,图数据的规模越来越大,如何对图数据进行存储以实现对存储好的数据进行高效地调用成为了有待解决的问题。With the development of Internet technology, the scale of graph data is getting larger and larger. How to store graph data so as to efficiently call the stored data has become a problem to be solved.
在一些实施例中,可以将图数据存储进入关系型数据库中,这类存储方式会将图数据中的节点和边分离存储。然而,关系型数据库在存储图数据时表现出了较多的不适应性。例如,因为图数据庞大,图数据需要分库分表存储,进而会将节点以及这些节点的边拆分存储,再进行图数据查询时,需要不同数据库(如存储设备)之间交互,找到目标查询节点及其边,又或者需要多次读写才能获取目标查询节点及其边。In some embodiments, the graph data can be stored in a relational database, and this storage method will store the nodes and edges in the graph data separately. However, relational databases show more inadaptability when storing graph data. For example, because the graph data is huge, the graph data needs to be stored in separate databases and tables, and then the nodes and the edges of these nodes will be split and stored. When querying the graph data, it is necessary to interact with different databases (such as storage devices) to find the target Query nodes and their edges, or multiple reads and writes are required to obtain the target query nodes and their edges.
To make up for these shortcomings of relational databases, some embodiments propose storing graph data in a graph database. In a graph database the relationships between data are of primary importance, and the database can store massive, complexly related data together with the relationships among it. Specifically, such a graph database distributes the nodes and edges of the graph data across different KV storage engines, and a proxy layer is built on top of the graph database to provide the graph query service. This approach has two drawbacks. First, because of the added proxy layer, data must be cached several times in different data areas during a query, which increases the complexity of the whole query process. Second, because nodes and edges are stored separately, retrieving a one-hop subgraph (the subgraph formed by a node, the edges connected to that node, and the nodes at the other ends of those edges) requires querying the node and each of its connected edges separately. In other words, many read and write operations are needed to obtain the result of a single one-hop subgraph query, so retrieval efficiency is low. In addition, to keep such queries efficient, the graph database must be deployed and operated on dedicated cluster servers (computers) with enough memory for the repeated read and write operations of the query process, which brings considerable equipment operation and maintenance costs.
To address the shortcomings of the above technologies, some embodiments of this specification provide a method for storing graph data, including: storing the node information, edge information, node attribute information, and edge attribute information of several nodes in the graph data in the point table, edge table, point attribute table, and edge attribute table, respectively, of the same data block. In this way, reading the data block once is enough to obtain the node information and edge information of the relevant nodes, which effectively reduces the number of reads and writes during graph processing. For example, a one-hop subgraph query can be completed by reading the data block once, which significantly improves query efficiency.
In some embodiments of this specification, the storage order of the edges in the edge table can also be made consistent with the storage order of the several nodes in the point table, the storage order of the nodes' attribute information in the point attribute table can be made consistent with the storage order of the nodes in the point table, and the storage order of the edges' attribute information in the edge attribute table can be made consistent with the storage order of the edges in the edge table. In this way, the point table, edge table, and attribute tables are aligned. Once node A has been found, the positions of all of node A's edges in the edge table can be determined quickly, and the attribute information of those edges in the edge attribute table can then be located quickly. With this arrangement, a graph query does not need excessive data reads and writes or caching, so the whole process does not need a resident service cluster to support it.
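By way of illustration only, the alignment described above can be sketched in Python. This is a simplified in-memory model, not the on-disk format of the specification: the class name `DataBlock`, its field names, and the `edge_starts` list (one start position per node plus a final end marker) are assumptions made for the example.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class DataBlock:
    node_ids: list      # point table, sorted by node identifier
    node_attrs: list    # point attribute table, same order as node_ids
    edge_starts: list   # per node: position of its first edge; one extra end marker
    edge_targets: list  # edge table: target node identifiers, grouped by source node
    edge_attrs: list    # edge attribute table, same order as edge_targets


def one_hop(block, node_id):
    """Return a node, its attributes, and its edges using only this block."""
    i = bisect_left(block.node_ids, node_id)
    if i == len(block.node_ids) or block.node_ids[i] != node_id:
        return None  # the node is not stored in this block
    lo, hi = block.edge_starts[i], block.edge_starts[i + 1]
    return {
        "node": node_id,
        "attrs": block.node_attrs[i],
        # Edge k and edge attribute k sit at the same position in both tables.
        "edges": list(zip(block.edge_targets[lo:hi], block.edge_attrs[lo:hi])),
    }


block = DataBlock(
    node_ids=[1, 2, 3],
    node_attrs=[{"name": "A"}, {"name": "B"}, {"name": "C"}],
    edge_starts=[0, 2, 3, 3],
    edge_targets=[2, 3, 3],
    edge_attrs=[{"since": 2019}, {"since": 2020}, {"since": 2021}],
)
```

Because every table follows the order of the point table, the position of a node in the point table is enough to reach its attributes, its edges, and the attributes of those edges without any further lookup structure.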
It should be noted that, in the embodiments of this specification, the graph data is stored sequentially in multiple data blocks, and a node's information and its edge information are stored in the same data block. Large-scale graph data can therefore be stored in multiple data blocks or in multiple graph files (a graph file contains multiple data blocks). This allows one or more embodiments of this specification to store graph data in a distributed manner across multiple devices and to support parallel queries (for example, different devices query different data blocks), further improving query efficiency.
In some embodiments, the application scenario of the graph data storage system is shown in FIG. 1. The scenario 100 may include a storage device 110-1, a storage device 110-2, ..., a storage device 110-n, and a processing device 120.
The storage device 110-1, the storage device 110-2, the storage device 110-3, ... may include a processor and a mass storage device, a removable storage device, a volatile read-write memory, a read-only memory (ROM), or the like, or any combination thereof, for storing data, managing resources, and processing data and/or information from at least one component of the system or from an external data source (for example, a cloud data center). In some embodiments, each of the storage device 110-1, the storage device 110-2, the storage device 110-3, ... may be a single server or a server group. The server group may be centralized or distributed (for example, the server 110-1 may be a distributed system), and may be dedicated or may have its service provided concurrently by other devices or systems. In some embodiments, the storage device 110-1, the storage device 110-2, the storage device 110-3, ... may be local or remote. In some embodiments, the storage device 110-1, the storage device 110-2, the storage device 110-3, ... may be implemented on a cloud platform or provided in a virtual manner. By way of example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-layer cloud, or the like, or any combination thereof.
In some embodiments, any one or more of the storage device 110-1, the storage device 110-2, ..., the storage device 110-n can store one or more graph files and support parallel queries of the graph data. A graph file may include multiple data blocks, and each data block stores the node information and edge information of all or some of the nodes in the graph data, together with the attribute information of those nodes and edges. Specifically, 200 in FIG. 1 shows a typical data block structure: each data block includes a point table 210, an edge table 220, a point attribute table 230, an edge attribute table 240, and a table element 250.
The processing device 120 can generate or acquire graph data, write the graph data into multiple data blocks or multiple graph files, and distribute those data blocks or graph files to the storage device 110-1, the storage device 110-2, ..., the storage device 110-n for storage. In some embodiments, the processing device 120 can obtain a query request and distribute it to each storage device, so that each storage device runs the query against its locally stored graph files or data blocks and returns the query result to the processing device 120. In some embodiments, when the graph data is not large, a single storage device may store its graph files, in which case the processing device 120 may be omitted.
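The distribution of a query request to the storage devices can be sketched as follows. This is only an illustrative sketch: each storage device is modeled as a hypothetical Python callable created by `make_device`, and `dispatch_query` stands in for the processing device; a real deployment would use network calls instead of a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def make_device(local_nodes):
    """Hypothetical storage device: answers from its own node -> edges mapping."""
    def query(node_id):
        return local_nodes.get(node_id)  # None when the node is not stored here
    return query


def dispatch_query(devices, node_id):
    """Send the same query to every device in parallel and collect the hits."""
    if not devices:
        return []
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        answers = list(pool.map(lambda device: device(node_id), devices))
    return [answer for answer in answers if answer is not None]


devices = [make_device({1: [2, 3]}), make_device({4: [1]})]
```

Calling `dispatch_query(devices, 4)` returns only the answer of the device that actually stores node 4; devices that do not hold the node contribute nothing.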
In some embodiments, the scenario 100 may also include a network (not shown in the figure). The network can connect the components of the system and/or connect the system to external parts. The network enables communication among the components of the system and between the system and external parts, facilitating the exchange of data and/or information. In some embodiments, the network 130 may be any one or more of a wired network or a wireless network. For example, the network may include a cable network, a fiber-optic network, a telecommunications network, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth network, a ZigBee network, near field communication (NFC), an in-device bus, an in-device line, a cable connection, or the like, or any combination thereof. In some embodiments, the network connections between the parts of the system may use one of the above types or several of them. In some embodiments, the network may have a point-to-point, shared, or centralized topology, or a combination of multiple topologies.
FIG. 5 is a block diagram of a system for graph data storage according to some embodiments of this specification.
As shown in FIG. 5, the system 500 is deployed on any processing device capable of executing programs (such as any one of the server 110-1, the storage device 110-2, ..., the storage device 110-n in FIG. 1), and specifically includes: a node information storage module 510, configured to store the node information of several nodes in the graph data in the point table of a data block, the node information including a node identifier; an edge information storage module 520, configured to store the edge information of the edges of the several nodes in the edge table of the data block, the edge information including the node identifier of the target node connected by the edge; a node attribute information storage module 530, configured to store the attribute information of the several nodes in the point attribute table of the data block; and an edge attribute information storage module 540, configured to store the attribute information of the edges of the several nodes in the edge attribute table of the data block.
In some embodiments, the storage order of the edges of the several nodes in the edge table is consistent with the storage order of the several nodes in the point table; the storage order of the attribute information of the several nodes in the point attribute table is consistent with the storage order of the several nodes in the point table; and the storage order of the attribute information of the edges of the several nodes in the edge attribute table is consistent with the storage order of those edges in the edge table.
In some embodiments, the edge table includes an edge table index area and an edge table data area. The edge information of the edges of the several nodes is stored in the edge table data area. The edge table index area stores the index information of the edges of the several nodes; the index information of an edge includes the storage address information, in the edge table data area, of the edge information of the corresponding node's edges. The storage order of the index information of the edges of the several nodes is consistent with the storage order of the several nodes in the point table.
In some embodiments, the node information further includes storage address information of the node's edges; the storage address information of an edge in the point table is the storage address information, in the edge table, of the index information of the corresponding edge.
In some embodiments, the edge information of different edges of the same node is stored contiguously in the edge table data area; the storage order of the edge information of the edges of the several nodes is consistent with the storage order of the several nodes in the point table.
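A minimal byte-level sketch of an edge table with an index area and a data area is given below. The record layouts are assumptions chosen for the example and are not prescribed by this specification: each edge record is 12 bytes (an 8-byte target node identifier and a 4-byte target node type), and each index entry holds a byte offset into the data area plus an edge count.

```python
import struct

EDGE_RECORD = struct.Struct("<QI")  # assumed: 8-byte target node ID, 4-byte target node type
INDEX_ENTRY = struct.Struct("<II")  # assumed: byte offset into the data area, edge count


def build_edge_table(edges_per_node):
    """edges_per_node: one list of (target_id, target_type) per node, in point-table order."""
    index_area = bytearray()
    data_area = bytearray()
    for edges in edges_per_node:
        # The index entry records where this node's contiguous run of edges begins.
        index_area += INDEX_ENTRY.pack(len(data_area), len(edges))
        for target_id, target_type in edges:
            data_area += EDGE_RECORD.pack(target_id, target_type)
    return bytes(index_area), bytes(data_area)


def read_edges(index_area, data_area, position):
    """Return the edges of the node stored at `position` in the point table."""
    offset, count = INDEX_ENTRY.unpack_from(index_area, position * INDEX_ENTRY.size)
    return [
        EDGE_RECORD.unpack_from(data_area, offset + k * EDGE_RECORD.size)
        for k in range(count)
    ]
```

Since the index entries follow the order of the point table, the position of a node in the point table directly selects its index entry, and the contiguous layout lets all of its edges be read as one run of bytes.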
In some embodiments, the index information of an edge further includes an edge type; the edge information further includes the node type of the target node; and the edge information of the edges of the same node is stored in the edge table data area in order of edge type.
In some embodiments, the edge attribute table includes an edge attribute table index area and an edge attribute table data area. The attribute information of the edges of the several nodes is stored in the edge attribute table data area. The edge attribute table index area stores the edge attribute index information of the edges of the several nodes; the edge attribute index information of an edge includes the storage address information of that edge's attribute information in the edge attribute table data area. The storage order of the edge attribute index information is consistent with the storage order of the edge information of those edges in the edge table data area.
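The following sketch illustrates one way an attribute table with an index area and a data area could hold variable-length attribute records. The JSON encoding, the 4-byte offsets, and the extra end-of-data entry appended to the index are assumptions made for the example only.

```python
import json
import struct

OFFSET = struct.Struct("<I")  # assumed: 4-byte offset into the attribute data area


def build_attr_table(attrs_in_edge_order):
    """attrs_in_edge_order: one attribute dict per edge, in edge-table order."""
    index_area, data_area = bytearray(), bytearray()
    for attrs in attrs_in_edge_order:
        index_area += OFFSET.pack(len(data_area))
        data_area += json.dumps(attrs, sort_keys=True).encode("utf-8")
    index_area += OFFSET.pack(len(data_area))  # end marker for the last record
    return bytes(index_area), bytes(data_area)


def read_attrs(index_area, data_area, k):
    """Return the attributes of the k-th edge of the edge table data area."""
    (start,) = OFFSET.unpack_from(index_area, k * OFFSET.size)
    (end,) = OFFSET.unpack_from(index_area, (k + 1) * OFFSET.size)
    return json.loads(data_area[start:end].decode("utf-8"))
```

The fixed-size index entries make the k-th edge's attributes addressable in constant time even though the attribute records themselves differ in length.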
In some embodiments, the node information further includes a node type, and the node information of the several nodes is stored in the point table in order of node identifier.
In some embodiments, the point attribute table includes a point attribute table index area and a point attribute table data area. The attribute information of the several nodes is stored in the point attribute table data area. The point attribute table index area stores the node attribute index information of the several nodes; the node attribute index information of a node includes the storage address information of that node's attribute information in the point attribute table data area. The storage order of the node attribute index information of the several nodes is consistent with the storage order of the several nodes in the point table.
In some embodiments, the system 500 further includes a table element generation module 550, configured to generate the table element of the data block. The table element includes the storage address information of each table in the data block and the node identifier of the first node in each point table in the data block.
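As a rough sketch, a table element can be derived from the serialized tables of a block. The function name `build_table_element`, the dictionary layout, and the choice to record each table as an (offset, length) pair are assumptions for illustration.

```python
def build_table_element(tables, first_node_id):
    """tables: mapping of table name -> serialized bytes, in on-disk layout order."""
    offsets = {}
    position = 0
    for name, payload in tables.items():
        # Record where each table starts inside the block and how long it is.
        offsets[name] = (position, len(payload))
        position += len(payload)
    return {"first_node_id": first_node_id, "tables": offsets}
```

With this element, a reader that has loaded the block can jump straight to any of the four tables, and can tell from `first_node_id` whether a queried node could be in the block at all.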
In some embodiments, the data block includes encoded information; the system 500 further includes a vocabulary generation module 560, configured to generate the vocabulary of a graph file. The vocabulary includes the mapping between the encoded information in the data blocks of the graph file and the original information.
In some embodiments, the system 500 further includes a data block index generation module 570, configured to generate the data block index of a graph file. The data block index of the graph file includes the storage address information of each data block in the graph file and the node identifier of the first node in each data block.
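Assuming that nodes are distributed over the data blocks in ascending order of node identifier, the data block index can be searched with a binary search on the first node identifier of each block. The sketch below models the index as a sorted list of (first node identifier, block address) pairs; the function name `find_block` is hypothetical.

```python
from bisect import bisect_right


def find_block(block_index, node_id):
    """block_index: list of (first_node_id, block_address), sorted by first_node_id.

    Returns the address of the only block that can contain node_id, or None
    when node_id is smaller than the first identifier of the first block.
    """
    first_ids = [first for first, _ in block_index]
    i = bisect_right(first_ids, node_id) - 1
    if i < 0:
        return None
    return block_index[i][1]
```

The result is a candidate block: whether the node is actually present is decided afterwards by searching the point table of that block.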
In some embodiments, the system 500 further includes a graph file element generation module 580, configured to generate a graph file element. The graph file element includes, for each data block, the graph file in which it is located and its data block serial number within that graph file, as well as the node identifier of the first node and the node identifier of the last node in each graph file.
In some embodiments, the data block is the smallest read/write unit.
In some embodiments, the edges of the graph data include outgoing edges and incoming edges; the edge table includes an outgoing edge table and an incoming edge table; the edge attribute table includes an outgoing edge attribute table and an incoming edge attribute table; and the node information further includes the storage address information of the node's outgoing edges and the storage address information of its incoming edges.
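The split into outgoing and incoming edge tables can be illustrated with a small in-memory sketch. The function name `build_edge_tables` and the use of dictionaries keyed by node identifier are assumptions; only nodes belonging to the block being built receive entries.

```python
def build_edge_tables(node_ids, edges):
    """node_ids: identifiers in point-table order; edges: (source, target) pairs."""
    out_table = {node_id: [] for node_id in node_ids}
    in_table = {node_id: [] for node_id in node_ids}
    for source, target in edges:
        if source in out_table:
            out_table[source].append(target)  # outgoing edge of the source node
        if target in in_table:
            in_table[target].append(source)   # incoming edge of the target node
    return out_table, in_table
```

Storing both directions lets a query follow an edge from either endpoint by reading only the block of that endpoint.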
It should be understood that the system and its modules shown in FIG. 5 can be implemented in various ways. For example, in some embodiments, the apparatus and its modules may be implemented in hardware, in software, or in a combination of software and hardware. The hardware part can be implemented with dedicated logic; the software part can be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will understand that the above methods and apparatus can be implemented using computer-executable instructions and/or processor control code, provided for example on a carrier medium such as a magnetic disk, CD, or DVD-ROM, in a programmable memory such as a read-only memory (firmware), or on a data carrier such as an optical or electronic signal carrier. The apparatus and its modules in this specification can be implemented by hardware circuits such as very-large-scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field-programmable gate arrays and programmable logic devices; they can also be implemented by software executed by various types of processors, or by a combination of such hardware circuits and software (for example, firmware).
FIG. 6 is a schematic diagram of a data block structure according to some embodiments of this specification.
The format of the storage file used in one or more embodiments of this specification is further described below with reference to FIG. 6.
The storage file 600 includes a graph file element and one or more graph files. The graph file element includes, for each data block, the graph file in which it is located and its data block serial number within that graph file, as well as the node identifier of the first node and the node identifier of the last node in each graph file. A node identifier is the number of a node in the graph data and is used to trace the node's position in the graph data. For example, the node identifiers can be set to node 1, node 2, ..., node m, and so on. In some embodiments, the nodes of the graph data can be stored in multiple data blocks or graph files according to their node identifiers, so that the graph file containing a target query node can be determined quickly. The graph file element can be understood as index information for the graph files; it can be retrieved and accessed by a host computer or a server (for example, through an SDK).
A graph file may include multiple data blocks. In some embodiments, a graph file contains a fixed number of data blocks; for example, a graph file may include 1024 data blocks. The data block is the smallest read/write unit and is used to store and write data. When graph data is stored, the data block is the smallest write unit, and the processing device writes the graph data into one or more data blocks in sequence according to the data block format. A data block can have a fixed size, such as 64 bytes or 128 bytes. When a data block is full, a new data block is created and writing continues until the complete graph data has been written. In some embodiments, the data in a data block comes from the same graph data; it may also come from different graph data. A data block specifically includes a point table, a point attribute table, an edge table, and an edge attribute table. In some embodiments, the data block may also include a table element, which includes the storage address information of each table in the data block and the node identifier of the first node in the point table of the data block. The table element can be regarded as index information inside the data block, making it easy to locate the storage position of each table quickly. For more details on the point table, point attribute table, edge table, and edge attribute table, see the detailed description corresponding to FIG. 7, which is not repeated here.
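The rule of opening a new data block when the current one is full can be sketched as a greedy packing loop. The sketch assumes that each node comes with the total number of bytes needed for its node, edge, and attribute records, and that a node is never split across blocks; how a node larger than one block would be handled is not addressed by this passage, so the sketch simply places it alone in its own block.

```python
def pack_into_blocks(node_sizes, capacity):
    """node_sizes: (node_id, size_in_bytes) pairs in storage order; capacity in bytes."""
    blocks, current, used = [], [], 0
    for node_id, size in node_sizes:
        # Close the current block when the next node would not fit into it.
        if current and used + size > capacity:
            blocks.append(current)
            current, used = [], 0
        current.append(node_id)
        used += size
    if current:
        blocks.append(current)
    return blocks
```

For example, with a capacity of 64 bytes and node sizes of 40, 40, 20, and 100 bytes, the nodes end up in three blocks: the first node alone, the second and third together, and the oversized fourth node alone.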
In some embodiments, in addition to multiple data blocks, a graph file may also include file footer information, a data block index, and a vocabulary.
The vocabulary of a graph file records the mapping between encoded information and original information, and can be used to encode or decode at least part of the information in the graph file. For example, information such as edge types and node types can be represented by numbers: the number 1 may represent a user-type node and the number 2 a company-type node, so node types can be stored in the point table as the numbers 1, 2, and so on. Representing text with shorter numbers or letters effectively reduces the storage space actually occupied by the graph data. Correspondingly, the vocabulary records mappings such as "1" for user-type node and "2" for company-type node.
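A minimal sketch of such a vocabulary follows. The function name `build_vocab` and the policy of assigning codes 1, 2, ... in order of first appearance are assumptions made for the example.

```python
def build_vocab(values):
    """Assign a small integer code to each distinct value, in order of first appearance."""
    code_of = {}
    for value in values:
        if value not in code_of:
            code_of[value] = len(code_of) + 1
    # Reverse mapping, used to decode stored codes back to the original text.
    value_of = {code: value for value, code in code_of.items()}
    return code_of, value_of
```

The tables then store only the integer codes, while the vocabulary kept once per graph file is enough to translate them back, for example code 1 to "user" and code 2 to "company".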
The data block index of a graph file includes the storage address information of each data block in the graph file and the node identifier of the first node in each data block. With the data block index, the data block containing a target query node can be determined quickly.
The file footer information includes the total number of nodes in the data blocks, the total number of edges, and a file extension area (for example, file protocol, compression algorithm, and correction information).
FIG. 8 is an exemplary flowchart of a graph data query according to some embodiments of this specification.
The use of the storage file is described below with reference to the process 800 shown in FIG. 8, taking as an example the case where the target query node is known and its N-hop subgraph is to be found. The N-hop subgraph includes the edges within N hops of the target query node and the nodes on those edges. The storage device receives a query request from a service end or a processing device (step 810); the query request includes the node identifier of the target query node. First, the storage device accesses the graph file element (step 820) and uses the node identifiers of the first node and the last node of each graph file stored there to determine which graph file stores the target node (for example, graph file V). Next, based on the node identifier of the first node of each data block stored in the data block index of that graph file (the data block index of graph file V), it determines the target data block containing the target query node (step 830). It then locates the target data block using the storage address information of each data block stored in the data block index and obtains the target data block (step 840). In the target data block, the point table can be located through the table element, and the node information of the target query node can be found in the point table by its node identifier; when the node information in the point table is stored in order of node identifier, a binary search can quickly find the node information of the target query node (step 850). Because the point table, edge table, point attribute table, and edge attribute table are located in the same data block and are aligned with one another, a single read operation (such as loading the data block into memory) is enough. Based on the storage order of the target query node's node information in the point table, or on the storage address information of its edges, one or more of the edge information, point attribute information, and edge attribute information of the target query node are obtained from one or more of the edge table, point attribute table, and edge attribute table of the target data block (step 860), which yields the one-hop subgraph of the target query node. Further, the node identifiers of the first-hop neighbor nodes of the target query node (the nodes on its first-hop edges) are obtained from the one-hop subgraph, and the above steps are repeated to find the one-hop subgraph of each first-hop neighbor node, giving the two-hop subgraph of the target query node; continuing in the same way gives the N-hop subgraph of the target query node.
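The lookup chain described above (graph file element, then data block index, then point table, then repeated one-hop expansion) can be sketched with in-memory structures. All names and data layouts here are assumptions for illustration: the graph file element is a list of (first node ID, last node ID, file name) tuples, each block is a dictionary with a sorted `ids` list and a parallel `edges` list, and the `seen` set that avoids expanding the same node twice is an addition not stated in the text.

```python
from bisect import bisect_left, bisect_right


def locate_file(file_element, node_id):
    """Pick the graph file whose [first, last] node ID range covers node_id."""
    for first, last, name in file_element:
        if first <= node_id <= last:
            return name
    return None


def locate_block(blocks, node_id):
    """Pick the block whose first node ID is the largest one not above node_id."""
    firsts = [block["ids"][0] for block in blocks]
    i = bisect_right(firsts, node_id) - 1
    return blocks[i] if i >= 0 else None


def one_hop(store, node_id):
    """Return the target node IDs of node_id's edges, or [] if it is not stored."""
    name = locate_file(store["element"], node_id)
    if name is None:
        return []
    block = locate_block(store["files"][name], node_id)
    if block is None:
        return []
    i = bisect_left(block["ids"], node_id)  # binary search in the point table
    if i == len(block["ids"]) or block["ids"][i] != node_id:
        return []
    return block["edges"][i]


def n_hop(store, start, n):
    """Collect the (source, target) edges within n hops of start."""
    seen, frontier, edges = {start}, [start], []
    for _ in range(n):
        next_frontier = []
        for source in frontier:
            for target in one_hop(store, source):
                edges.append((source, target))
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return edges


store = {
    "element": [(1, 4, "f0"), (5, 8, "f1")],
    "files": {
        "f0": [{"ids": [1, 2], "edges": [[2, 5], [3]]},
               {"ids": [3, 4], "edges": [[], [1]]}],
        "f1": [{"ids": [5, 6], "edges": [[6], []]}],
    },
}
```

In this toy store, `n_hop(store, 1, 2)` first expands node 1 to nodes 2 and 5, then expands those two neighbors, touching a second graph file for node 5.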
It should be noted that, in one or more embodiments of this specification, the edges of the graph data may include outgoing edges and incoming edges. In such embodiments, the edge table described in this specification can be further divided into an outgoing edge table and an incoming edge table; the corresponding edge attribute table likewise includes an outgoing edge attribute table and an incoming edge attribute table; and the corresponding node information further includes the storage address information of the node's outgoing edges and the storage address information of its incoming edges.
FIG. 7 is an exemplary flowchart of graph data storage according to some embodiments of this specification.
In some embodiments, an exemplary process for storing graph data is shown as process 700, which may include step 710, step 720, ..., step 780. Process 700 is described in detail below.
Step 710: store the node information of several nodes in the graph data in the point table of a data block.
In some embodiments, step 710 may be performed by the node information storage module 510. The node information storage module 510 fills the node information into the point table in order, following the predefined format of the point table. The graph data includes nodes and edges. In some embodiments, the node information storage module 510 may select several nodes from the graph data for storage. The several nodes may be all of the nodes of the graph data or only some of them.
FIG. 2 is a schematic diagram of an exemplary point table 210. The point table stores the node information of several nodes, and the node information includes a node identifier. A node identifier is the number of a node in the graph data and is used to trace the node's position in the graph data. For example, the node identifiers can be set to node 1, node 2, ..., node m, and so on. In some embodiments, the node information in the point table is stored in order of node identifier. For example, the node information storage module 510 may select several nodes with consecutive node identifiers from the graph data and store their node information one after another in ascending or descending order of node identifier.
In some embodiments, the node information also includes storage address information for the node's edges. The storage address information of an edge indicates where the edge is stored in the edge table; for example, it may be the storage address of the edge's index information in the edge table. The storage address information may be an absolute address or an offset relative to some starting position. For example, the storage address of an edge's index information in the edge table may be an absolute address or an offset relative to the start of the edge table. With this arrangement, once a target node has been located during a graph query, the data of the edges connected to that node can be determined directly from the edge storage address information recorded for the node in the point table.
In general, a node may have multiple edges. In some embodiments, the node information storage module 510 may record the storage address information of every edge of a node in the point table, so that a single node information entry records the storage address information of all edges connected to that node. In some scenarios, however, a node has a very large number of edges (for example, a merchant node may be connected to thousands of user nodes), and storing the address of every edge in this way consumes a large amount of storage and is inefficient. Therefore, in some embodiments of this specification, the edge information of the same node is stored contiguously in the edge table. For example, suppose node A has 5 edges and node B has 3 edges. In the edge table, the edge information of node A's 5 edges is stored contiguously in one region (for example, a region of 12×5 = 60 bytes) starting at a first storage location (for example, the 16th byte of the edge table), and the edge information of node B is stored contiguously in another region (for example, a region of 12×3 = 36 bytes) starting at a second storage location (for example, the 76th byte of the edge table).
In this way, as shown in FIG. 2, the edge storage address information recorded for each node in the point table need only include the starting storage location of that node's edges in the edge table (for example, the edge storage address information of node A is the first storage location, and that of node B is the second storage location). In other words, the region of the edge table between the location recorded for one node and the location recorded for the next node is taken to hold the edges of the former node.
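The contiguous layout described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation in this specification: the names `EDGE_RECORD_SIZE`, `point_table`, `EDGE_TABLE_END`, and `edge_range` are hypothetical, the 12-byte record size and the byte offsets 16 and 76 come from the example in the text, and the last node's region is assumed to run to the end of the edge table.

```python
EDGE_RECORD_SIZE = 12  # bytes per edge record, as in the example above

# Hypothetical point table: (node identifier, start offset of its edges in the edge table).
point_table = [("A", 16), ("B", 76)]
# Assumed end of the edge table: node B's 3 edges end at byte 76 + 36 = 112.
EDGE_TABLE_END = 76 + 3 * EDGE_RECORD_SIZE


def edge_range(k):
    """Return (start, end, edge_count) for the k-th node in the point table."""
    start = point_table[k][1]
    # The region runs up to the next node's start offset, or to the table end.
    end = point_table[k + 1][1] if k + 1 < len(point_table) else EDGE_TABLE_END
    return start, end, (end - start) // EDGE_RECORD_SIZE


print(edge_range(0))  # (16, 76, 5)
print(edge_range(1))  # (76, 112, 3)
```

Storing one start offset per node, rather than one address per edge, keeps each point table entry a fixed size regardless of how many edges the node has.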
In some embodiments, edges are directed, and a node may have outgoing edges and/or incoming edges. An incoming edge is an edge that points to the node, and an outgoing edge is an edge that starts at the node and points to another node. Accordingly, in some embodiments, the edge storage address information in the node information of the point table may be further divided into storage address information for incoming edges and storage address information for outgoing edges. Correspondingly, there may be two kinds of edge table, an in-edge table and an out-edge table: the in-edge table stores only the edge information of incoming edges, and the out-edge table stores the edge information of outgoing edges. The storage address information of outgoing and incoming edges in the node information, and the way their edge information is stored in the out-edge and in-edge tables, are similar to what is described above and are not repeated here. For more on edge storage address information, see the description of step 720.
In some embodiments, the node information may also include the type of the node. Because a node can represent any entity or object in the physical world, nodes can be of different types, for example user nodes, company nodes, and location nodes. The node type (not shown in the figure) may be stored between each node's node identifier and its edge storage address information in FIG. 2. In general, the set of node types is finite and can be enumerated. To make node types easier to represent and store, in some embodiments a vocabulary is used to encode node types internally within the graph file, and the point table stores only the encoded node type. When the node type of a node needs to be read from the point table, the code is decoded through the vocabulary back into a semantically explicit node type, such as "user node". Encoding and decoding within the file through the vocabulary makes the representation of node types compact and further reduces storage space. For more on the vocabulary, see the description of FIG. 6, which is not repeated here.
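A vocabulary of the kind described above could look like the sketch below. The `Vocabulary` class and its `encode` and `decode` methods are hypothetical names chosen for illustration, and assigning consecutive integer codes in order of first appearance is an assumption; the specification does not state how codes are assigned.

```python
class Vocabulary:
    """Maps original strings (e.g. node types) to compact integer codes and back."""

    def __init__(self):
        self._code_of = {}  # original text -> code
        self._text_of = []  # code -> original text

    def encode(self, text):
        # Assign the next free code the first time a text is seen.
        if text not in self._code_of:
            self._code_of[text] = len(self._text_of)
            self._text_of.append(text)
        return self._code_of[text]

    def decode(self, code):
        return self._text_of[code]


vocab = Vocabulary()
codes = [vocab.encode(t) for t in ["user", "company", "user", "location"]]
print(codes)                   # [0, 1, 0, 2]
print(vocab.decode(codes[1]))  # company
```

The point table would then hold only the small integer codes, while the vocabulary is stored once per graph file.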
In some embodiments, node information may instead be ordered first by node type and then by node identifier. For example, user nodes may be stored together and, within the group of user nodes, ordered by node identifier. When ordering by node type, the types may be sorted alphabetically by the pinyin of the first character of the type's description text, or by the first letter of its first word. The point table 210 shown in FIG. 2 also includes a header flag that indicates whether the table has an index area. In some embodiments, the point table has no index area, and its header flag stores "0".
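The type-then-identifier ordering can be illustrated with a composite sort key. This is a minimal sketch: the `nodes` records and their field names are made up for the example, and plain string comparison of the type text stands in for the alphabetical ordering mentioned above.

```python
# Hypothetical node records; only the ordering logic is of interest here.
nodes = [
    {"id": 7, "type": "user"},
    {"id": 2, "type": "company"},
    {"id": 3, "type": "user"},
    {"id": 9, "type": "company"},
]

# Sort by node type first, then by node identifier within each type.
ordered = sorted(nodes, key=lambda n: (n["type"], n["id"]))
print([(n["type"], n["id"]) for n in ordered])
# [('company', 2), ('company', 9), ('user', 3), ('user', 7)]
```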
Step 720: store the edge information of the edges of the several nodes in the edge table of the data block.
In some embodiments, step 720 may be performed by the edge information storage module 520. The edge information storage module 520 fills the data into the edge table in order, following the predefined edge table format.
In some embodiments, the edge table may include an edge table index area and an edge table data area. Because an edge can be characterized by the two target nodes it connects, the edge information may include the node identifiers of the target nodes connected by the edge. In some embodiments, the edge information is stored in the edge table data area; for example, the edge table data area stores pairs of target node identifiers, where each pair corresponds to one edge. The edge table index area stores the index information of each edge's edge information in the edge table, for example the storage address, within the edge table data area, of the target node identifiers corresponding to each edge.
FIG. 3 is a schematic diagram of an exemplary edge table 220. In the figure, the header flag indicates whether the table has an index area. For example, a header flag of "1" indicates that there is an index area, and a header flag of "0" indicates that there is none. Because every edge table contains an index area, its header flag is 1. The index area length gives the total length of the edge table index area, for example the number of bytes it occupies, and therefore indicates the position at which the edge table data area begins. The edge table index area stores the index information of each edge; for example, the index information of edge A points to the position of edge A's data in the edge table data area. The edge table data area stores the edge information of each edge. In some embodiments, the edge information may also include the node types of the target nodes. In some embodiments, every edge information record has the same storage length. For example, for each edge, 4 bytes store the node types of the two target nodes and 8 bytes store the node identifiers of the two target nodes.
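A fixed-length edge record of this shape can be sketched with Python's standard `struct` module. The split into two 16-bit type codes and two 32-bit node identifiers, the little-endian byte order, and the helper names `pack_edge` and `unpack_edge` are assumptions made for illustration; the text only gives the totals of 4 and 8 bytes.

```python
import struct

# Assumed layout: two 16-bit node-type codes (4 bytes) followed by
# two 32-bit node identifiers (8 bytes), little-endian, no padding.
EDGE_RECORD = struct.Struct("<HHII")


def pack_edge(src_type, dst_type, src_id, dst_id):
    """Serialize one edge into a fixed-length record."""
    return EDGE_RECORD.pack(src_type, dst_type, src_id, dst_id)


def unpack_edge(data_area, i):
    """Read the i-th record; fixed length makes its offset i * record size."""
    return EDGE_RECORD.unpack_from(data_area, i * EDGE_RECORD.size)


data_area = pack_edge(0, 1, 100, 200) + pack_edge(0, 0, 100, 101)
print(EDGE_RECORD.size)           # 12
print(unpack_edge(data_area, 1))  # (0, 0, 100, 101)
```

Because every record has the same length, the position of the i-th edge follows from simple arithmetic, without a per-edge address.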
In some embodiments, the storage order of the edge index information matches the storage order of the nodes in the point table (this may be called alignment of the edge table with the point table). For example, starting from the beginning of the edge table index area, the edge index information of the first node in the point table is stored first, followed by the edge index information of the second node, and so on. In the edge table data area, the edge information of each edge is stored in the same order as the edge index information in the edge table index area. As a result, the edge index information for a node can be found from the node's position in the point table. For example, once a node is known to be the k-th node in the point table, the k-th edge index entry can be read directly, and the storage location of the k-th node's edges in the edge table data area can then be found from that entry.
In some embodiments, the storage order of the edge information in the edge table matches the storage order of the nodes in the point table, and the edge information of the same node is stored contiguously. For example, suppose node A is connected to nodes K, M, and L, node B is connected to nodes Q and G, node A is first in the point table, and node B is second. Then, starting from the beginning of the edge table data area, the edge information of edges A-K, A-M, and A-L is stored first, followed by the edge information of edges B-Q and B-G. In this way, as shown in FIG. 3, the edge index information stored in the edge table index area need only include the starting storage location, in the edge table, of the edge information for the corresponding node (for example, the edge index information for node A includes the storage address of edge A-K, and the edge index information for node B includes the storage address of edge B-Q). In other words, the region of the edge table data area between the location indexed for one node and the location indexed for the next node holds the edge information of the former node.
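The alignment between the point table and the edge table can be sketched as below, using the A/B example from the text. The function names `build_edge_table` and `edges_of` are hypothetical, and list positions stand in for byte addresses to keep the example short.

```python
def build_edge_table(point_order, adjacency):
    """Build an index area and a data area aligned with the point table order."""
    index_area = []  # index_area[k] = position of the first edge of the k-th node
    data_area = []   # (source, target) pairs, contiguous per node
    for node in point_order:
        index_area.append(len(data_area))
        for target in adjacency.get(node, []):
            data_area.append((node, target))
    return index_area, data_area


def edges_of(k, index_area, data_area):
    """Return the edges of the k-th node in the point table."""
    start = index_area[k]
    # The node's region ends where the next node's region begins.
    end = index_area[k + 1] if k + 1 < len(index_area) else len(data_area)
    return data_area[start:end]


index_area, data_area = build_edge_table(
    ["A", "B"], {"A": ["K", "M", "L"], "B": ["Q", "G"]}
)
print(index_area)                          # [0, 3]
print(edges_of(1, index_area, data_area))  # [('B', 'Q'), ('B', 'G')]
```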
Optionally, in some embodiments, the edge table index area also includes the edge type of each edge. For example, in FIG. 3 the edge index information of edge A includes an edge type in addition to the storage address information. The edge type can reflect the relationship between two entities, such as a litigation relationship or an economic transaction relationship between two enterprises. In some embodiments, when a node has multiple edges that belong to different types, the edge information of that node's edges may be stored in the edge table data area in edge-type order. In that case, the edge index information for the node in the edge table index area may include multiple edge types and multiple pieces of storage address information, where the edge types are stored contiguously and the pieces of storage address information are also stored contiguously.
As shown in FIG. 3, suppose node B has multiple edges that belong to two edge types. Two edge types and two pieces of storage address information can then be stored contiguously in the edge index information of node B. The first piece of storage address information gives the location in the edge table data area of node B's edges of the first edge type (for example, the starting storage location of those edges), and the second gives the location in the edge table data area of node B's edges of the second edge type (for example, the starting storage location of those edges). With this arrangement, all edges of a given edge type for a given node can be located quickly during a graph query.
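Locating a node's edges of one type from such an index entry can be sketched as follows. The function `edges_of_type`, the tuple layout of `entry_b`, and the sample edges are hypothetical; the sketch assumes the end of the last type's region is given by the start of the next node's edges (`next_start`), and uses list positions in place of byte addresses.

```python
def edges_of_type(entry, next_start, data_area, wanted_type):
    """Return a node's edges of one type.

    entry = (edge_types, starts): parallel lists stored contiguously,
    with one start position per edge type.
    """
    edge_types, starts = entry
    if wanted_type not in edge_types:
        return []
    i = edge_types.index(wanted_type)
    # A type's region ends at the next type's start, or at the next node's start.
    end = starts[i + 1] if i + 1 < len(starts) else next_start
    return data_area[starts[i]:end]


# Node B's edges are grouped by type; node C's edges begin at position 3.
data_area = [("B", "Q"), ("B", "G"), ("B", "H"), ("C", "A")]
entry_b = (["litigation", "transaction"], [0, 2])

print(edges_of_type(entry_b, 3, data_area, "litigation"))   # [('B', 'Q'), ('B', 'G')]
print(edges_of_type(entry_b, 3, data_area, "transaction"))  # [('B', 'H')]
```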
In some embodiments, edge types, like node types, may be encoded internally within the graph file using the vocabulary, so that the edge table stores only the internal codes of the edge types. For more on the vocabulary, see the corresponding description of FIG. 6, which is not repeated here.
In some embodiments, edges are directed, and a node may have outgoing edges and/or incoming edges. Correspondingly, there may be two kinds of edge table, an in-edge table and an out-edge table: the in-edge table stores only the data of incoming edges, and the out-edge table stores the data of outgoing edges. The data of outgoing and incoming edges is stored in the out-edge and in-edge tables in a manner similar to that described above, which is not repeated here.
Step 730: store the attribute information of the several nodes in the point attribute table of the data block.
In some embodiments, step 730 may be performed by the node attribute information storage module 530. The node attribute information storage module 530 fills the data into the point attribute table in order, following the predefined point attribute table format.
FIG. 4 is a schematic diagram of an exemplary attribute table 240. In some embodiments, the point attribute table and the edge attribute table have the same format, so the attribute table 240 can be regarded as a point attribute table. The point attribute table includes a point attribute table index area and a point attribute table data area. The attribute information of each point is stored in the point attribute table data area. The point attribute table index area stores the point attribute index information of each point, which includes the storage address of that point's attribute information in the point attribute table data area. As shown in FIG. 4, each piece of attribute index information points to one piece of attribute data.
In some embodiments, just as the edge table is aligned with the point table, the point attribute table may also be aligned with the point table. Specifically, the storage order of the point attribute index information in the point attribute table matches the storage order of the node information in the point table. With this arrangement, the point attribute index information of a node can be located from the node's position in the point table, and the node's attribute information can then be read from the point attribute table data area based on that index information.
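The index-area and data-area split of an attribute table aligned with the point table can be sketched as below. The function names are hypothetical, and serializing each node's attributes as JSON is purely an assumption for the example; the specification does not state how attribute data is encoded.

```python
import json


def build_attribute_table(attributes_in_point_order):
    """Build an index area (start offsets) and a data area (serialized attributes)."""
    index_area, data_area = [], bytearray()
    for attrs in attributes_in_point_order:
        index_area.append(len(data_area))  # where this node's attributes start
        data_area += json.dumps(attrs, sort_keys=True).encode("utf-8")
    return index_area, bytes(data_area)


def read_attributes(k, index_area, data_area):
    """Read the attributes of the k-th node in the point table."""
    start = index_area[k]
    end = index_area[k + 1] if k + 1 < len(index_area) else len(data_area)
    return json.loads(data_area[start:end].decode("utf-8"))


index_area, data_area = build_attribute_table(
    [{"name": "Alice"}, {"name": "Shop", "city": "Hangzhou"}]
)
print(read_attributes(1, index_area, data_area))
```

Since the k-th index entry belongs to the k-th node of the point table, no node identifier needs to be repeated in the attribute table.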
In some embodiments, the attribute table 240 may also include a header flag of "1" and an index area length.
Step 740: store the attribute information of the edges of the several nodes in the edge attribute table of the data block.
In some embodiments, step 740 may be performed by the edge attribute information storage module 540. The edge attribute information storage module 540 fills the data into the edge attribute table in order, following the predefined edge attribute table format.
Similarly, the attribute table 240 can also be regarded as an edge attribute table. The attribute information of the edges of the several nodes is stored in the edge attribute table data area. The edge attribute table index area stores the attribute index information of each edge, which includes the storage address of that edge's attribute information in the edge attribute table data area.
In some embodiments, the storage order of the edge attribute index information in the edge attribute table index area matches the storage order of the edge information of each edge in the edge table data area.
In some embodiments, edges are directed, and a node may have outgoing edges and/or incoming edges. Correspondingly, there may be two kinds of edge attribute table, an in-edge attribute table and an out-edge attribute table: the in-edge attribute table stores only the attribute information of incoming edges, and the out-edge attribute table stores the attribute information of outgoing edges. The attribute information of outgoing and incoming edges is stored in the out-edge and in-edge attribute tables in a manner similar to that described above, which is not repeated here.
In some embodiments, process 700 further includes step 750: generating the table metadata of the data block. In some embodiments, step 750 may be performed by the table metadata generation module 550.
The table metadata includes the storage address information of each table in the data block and the node identifier of the first node in each point table in the data block. For more on the table metadata, see the corresponding description of FIG. 6, which is not repeated here.
At this point, the generation of one data block is complete. In some embodiments, multiple data blocks may be generated according to steps 710 to 740, and the multiple data blocks make up a graph file. The graph file may also include information such as a vocabulary and a data block index.
In some embodiments, process 700 further includes step 760: generating the vocabulary of the graph file. In some embodiments, step 760 may be performed by the vocabulary generation module 560.
In some embodiments, the data blocks contain encoded information, in which case a vocabulary of the graph file may also be generated. The vocabulary contains the mapping between the encoded information in each data block of the graph file and the original information. For more on the vocabulary, see the corresponding description of FIG. 6, which is not repeated here.
In some embodiments, process 700 further includes step 770: generating the data block index of the graph file. In some embodiments, step 770 may be performed by the data block index generation module 570.
The data block index of the graph file includes the storage address information of each data block in the graph file and the node identifier of the first node in each data block. It is used to determine which data block contains the target query node. For more on the data block index, see the corresponding description of FIG. 6, which is not repeated here.
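One way such an index could be used to pick a data block is a binary search over the first node identifiers, sketched below with the standard `bisect` module. The sample `block_index` values and the function `find_block` are hypothetical, and the sketch assumes numeric node identifiers stored in ascending order across blocks. It returns only the candidate block; whether the node actually exists must still be checked inside that block.

```python
import bisect

# Hypothetical data block index: (first node identifier, storage offset of the block).
block_index = [(1, 0), (1001, 4096), (2001, 8192)]
first_ids = [first for first, _ in block_index]


def find_block(node_id):
    """Return (block number, storage offset) of the block that may hold node_id."""
    # The last block whose first node identifier is <= node_id.
    i = bisect.bisect_right(first_ids, node_id) - 1
    if i < 0:
        raise KeyError(node_id)
    return i, block_index[i][1]


print(find_block(1500))  # (1, 4096)
print(find_block(2001))  # (2, 8192)
```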
At this point, one graph file has been generated from the graph data. In some embodiments, multiple graph files may be generated to make up a storage file. The storage file may also include graph file metadata.
In some embodiments, process 700 further includes step 780: generating the graph file metadata.
The graph file metadata includes, for each data block, the graph file in which the data block is located and the sequence number of the data block within that graph file, as well as the node identifier of the first node and the node identifier of the last node in each graph file. It is used to determine which graph file contains the target query node. For more on the graph file metadata, see the corresponding description of FIG. 6, which is not repeated here.
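Selecting a graph file from the first and last node identifiers can be sketched as a simple range check. The `file_meta` entries, the file names, and the function `find_graph_file` are hypothetical, and node identifiers are assumed to be numeric and non-overlapping across files.

```python
# Hypothetical graph file metadata: (graph file name, first node id, last node id).
file_meta = [
    ("graph_0", 1, 5000),
    ("graph_1", 5001, 12000),
]


def find_graph_file(node_id, file_meta):
    """Return the graph file whose node identifier range covers node_id, else None."""
    for name, first_id, last_id in file_meta:
        if first_id <= node_id <= last_id:
            return name
    return None


print(find_graph_file(7000, file_meta))   # graph_1
print(find_graph_file(20000, file_meta))  # None
```

A query would first choose the graph file this way and then use that file's data block index to choose the data block.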
The beneficial effects that the embodiments of this specification may bring include, but are not limited to: 1) several nodes of the graph data, together with their edges and attribute information, are stored in one data block, so that during a graph query the edges and attribute information related to a node can be found within a single data block without multiple read and write operations; 2) the graph data is stored in order across multiple data blocks, so that large-scale graph data can be distributed over multiple devices and queried in parallel (for example, different devices query different data blocks), which shortens retrieval time and improves the response speed of graph queries; 3) the point table, edge table, and attribute tables are aligned, which saves storage space in the edge table and attribute tables. It should be noted that different embodiments may produce different beneficial effects. In a given embodiment, the beneficial effects may be any one or a combination of those listed above, or any other beneficial effect that may be obtained.
The basic concepts have been described above. Obviously, for those skilled in the art, the above detailed disclosure is only an example and does not limit this specification. Although not explicitly stated here, those skilled in the art may make various modifications, improvements, and corrections to this specification. Such modifications, improvements, and corrections are suggested in this specification, and therefore remain within the spirit and scope of its exemplary embodiments.
In addition, this specification uses specific terms to describe its embodiments. Terms such as "one embodiment", "an embodiment", and/or "some embodiments" refer to a feature, structure, or characteristic related to at least one embodiment of this specification. It should therefore be emphasized that two or more references to "an embodiment", "one embodiment", or "an alternative embodiment" in different places in this specification do not necessarily refer to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of this specification may be combined as appropriate.
Moreover, those skilled in the art will understand that aspects of this specification may be illustrated and described in terms of several patentable classes or contexts, including any new and useful process, machine, product, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of this specification may be executed entirely by hardware, entirely by software (including firmware, resident software, microcode, and the like), or by a combination of hardware and software. Any of the above hardware or software may be referred to as a "data block", "module", "engine", "unit", "component", or "system". In addition, aspects of this specification may take the form of a computer product embodied in one or more computer-readable media, the product including computer-readable program code.
A computer storage medium may contain a propagated data signal with computer program code embodied in it, for example in baseband or as part of a carrier wave. The propagated signal may take various forms, including electromagnetic, optical, or any suitable combination. A computer storage medium may be any computer-readable medium, other than a computer-readable storage medium, that can communicate, propagate, or transport a program for use by being connected to an instruction execution system, apparatus, or device. Program code on a computer storage medium may be transmitted over any suitable medium, including radio, electrical cable, fiber optic cable, RF, or the like, or any combination of the foregoing.
The computer program code required for the operation of each part of this specification may be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, and Python; conventional procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, and ABAP; dynamic programming languages such as Python, Ruby, and Groovy; or other programming languages. The program code may run entirely on the user's computer, run on the user's computer as a stand-alone software package, run partly on the user's computer and partly on a remote computer, or run entirely on a remote computer or processing device. In the latter case, the remote computer may be connected to the user's computer through any form of network, such as a local area network (LAN) or a wide area network (WAN), or connected to an external computer (for example, through the Internet), or used in a cloud computing environment, or used as a service such as software as a service (SaaS).
Furthermore, unless explicitly stated in the claims, the order of the processing elements and sequences described in this specification, the use of numbers and letters, and the use of other names are not intended to limit the order of the processes and methods of this specification. Although the above disclosure discusses, through various examples, some embodiments of the invention currently considered useful, it should be understood that such details are for illustration only. The appended claims are not limited to the disclosed embodiments; on the contrary, they are intended to cover all modifications and equivalent combinations that fall within the spirit and scope of the embodiments of this specification. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software alone, such as by installing the described system on an existing processing device or mobile device.
Similarly, it should be noted that, to simplify the presentation of this disclosure and thereby aid understanding of one or more embodiments of the invention, the foregoing description sometimes groups multiple features into a single embodiment, drawing, or description thereof. This method of disclosure does not, however, imply that the subject matter of this specification requires more features than are recited in the claims. Indeed, an embodiment may have fewer than all the features of a single embodiment disclosed above.
Some embodiments use numbers to describe quantities of components and attributes. It should be understood that such numbers are, in some examples, qualified by the modifiers "about", "approximately", or "substantially". Unless otherwise stated, "about", "approximately", or "substantially" indicates that the stated number allows a variation of ±20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may vary depending on the desired characteristics of individual embodiments. In some embodiments, numerical parameters should take into account the specified number of significant digits and apply ordinary rounding. Although the numerical ranges and parameters used to define the breadth of scope in some embodiments are approximations, in specific embodiments such values are set as precisely as practicable.
Each patent, patent application, patent application publication, and other material cited in this specification, such as articles, books, specifications, publications, and documents, is hereby incorporated by reference in its entirety. Excluded are application history documents that are inconsistent with or conflict with the content of this specification, as well as documents (currently or later appended to this specification) that limit the broadest scope of the claims of this specification. If the descriptions, definitions, and/or use of terms in the materials accompanying this specification are inconsistent with or conflict with the content of this specification, the descriptions, definitions, and/or use of terms in this specification shall prevail.
Finally, it should be understood that the embodiments described in this specification only illustrate the principles of the embodiments of this specification. Other variations may also fall within the scope of this specification. Therefore, by way of example and not limitation, alternative configurations of the embodiments of this specification may be regarded as consistent with the teachings of this specification. Accordingly, the embodiments of this specification are not limited to those explicitly introduced and described herein.

Claims (19)

  1. A method for storing graph data, the graph data comprising nodes and edges, the storage method comprising:
    storing node information of several nodes in the graph data in a point table of a data block, the node information comprising a node identifier;
    storing edge information of the edges of the several nodes in an edge table of the data block, the edge information comprising the node identifier of the target node connected by the edge;
    storing attribute information of the several nodes in a point attribute table of the data block;
    storing attribute information of the edges of the several nodes in an edge attribute table of the data block.
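As a rough illustration of claim 1, the four tables of one data block can be modelled in memory as parallel lists. This is a minimal sketch, not the on-disk layout of the specification: `DataBlock`, `build_block`, and the input shapes are hypothetical names, and sorting nodes by identifier is an assumption made here so that row i of every table refers to the same node.

```python
from dataclasses import dataclass, field


@dataclass
class DataBlock:
    """Hypothetical in-memory model of one data block with its four tables."""
    point_table: list = field(default_factory=list)       # node identifiers
    edge_table: list = field(default_factory=list)        # per node: target node identifiers
    point_attr_table: list = field(default_factory=list)  # per node: attribute dict
    edge_attr_table: list = field(default_factory=list)   # per node: one attribute dict per edge


def build_block(nodes, edges):
    """nodes: {node_id: attrs}; edges: list of (source_id, target_id, attrs)."""
    block = DataBlock()
    for node_id in sorted(nodes):  # assumption: nodes are stored in identifier order
        outgoing = [(dst, attrs) for src, dst, attrs in edges if src == node_id]
        block.point_table.append(node_id)
        block.point_attr_table.append(nodes[node_id])
        block.edge_table.append([dst for dst, _ in outgoing])
        block.edge_attr_table.append([attrs for _, attrs in outgoing])
    return block
```

Because all four lists are filled in the same loop, the position of a node in `point_table` is enough to find its edges and attributes in the other three tables.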
  2. The method according to claim 1, wherein the storage order of the edges of the several nodes in the edge table is consistent with the storage order of the several nodes in the point table;
    the storage order of the attribute information of the several nodes in the point attribute table is consistent with the storage order of the several nodes in the point table;
    the storage order of the attribute information of the edges of the several nodes in the edge attribute table is consistent with the storage order of the edges of the several nodes in the edge table.
  3. The method according to claim 1 or 2, wherein the edge table comprises an edge table index area and an edge table data area;
    the edge information of the edges of the several nodes is stored in the edge table data area;
    the edge table index area stores index information of the edges of the several nodes, the index information of an edge comprising storage address information, in the edge table data area, of the edge information of the edges of the corresponding node;
    the storage order of the index information of the edges of the several nodes in the edge table index area is consistent with the storage order of the several nodes in the point table.
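The split of the edge table into an index area and a data area in claim 3 resembles a compressed adjacency layout. The sketch below assumes that "storage address information" is an offset into a flat data area and appends a final sentinel offset so that each node's slice has a known end; both choices, and the function names, are illustrative assumptions rather than details taken from the claims.

```python
def build_edge_table(point_table, adjacency):
    """Return (index_area, data_area) for the nodes listed in point_table.

    adjacency: {node_id: [target_id, ...]} (hypothetical input shape).
    index_area[i] is the offset in data_area where the edges of
    point_table[i] begin; one extra sentinel entry marks the end.
    """
    index_area, data_area = [], []
    for node_id in point_table:           # same order as the point table
        index_area.append(len(data_area))
        data_area.extend(adjacency.get(node_id, []))
    index_area.append(len(data_area))     # sentinel: end of the last slice
    return index_area, data_area


def edges_of(position, index_area, data_area):
    """Edges of the node stored at row `position` of the point table."""
    return data_area[index_area[position]:index_area[position + 1]]
```

With this layout a node without edges costs one index entry and no space in the data area.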
  4. The method according to claim 3, wherein the node information further comprises storage address information of the edges of the node, and the storage address information of the edges in the point table is the storage address information, in the edge table, of the index information of the corresponding edges.
  5. The method according to claim 3, wherein the edge information of different edges of the same node is stored contiguously in the edge table data area; the storage order of the edge information of the edges of the several nodes in the edge table data area is consistent with the storage order of the several nodes in the point table.
  6. The method according to claim 5, wherein the index information of an edge further comprises an edge type; the edge information further comprises the node type of the target node; the edge information of the edges of the same node is stored in the edge table data area in order of edge type, and the index information of the edges corresponding to the same node in the edge table index area comprises one or more edge types and one or more pieces of storage address information corresponding thereto, wherein the one or more edge types are stored contiguously and the one or more pieces of storage address information are also stored contiguously.
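The typed index of claim 6 can be pictured as two short parallel arrays per node, one of edge types and one of starting offsets. The following sketch is written under assumptions: the tuple shapes, the per-node end offset, and the names `build_typed_edge_table` and `edges_of_type` are all hypothetical and chosen only to make the idea concrete.

```python
def build_typed_edge_table(point_table, edges_by_node):
    """edges_by_node: {node_id: [(edge_type, target_type, target_id), ...]}.

    Returns (index_area, data_area). Each index entry is
    (edge_types, offsets, end): the edge types of one node stored together,
    the matching start offsets stored together, and the end of the node's slice.
    """
    index_area, data_area = [], []
    for node_id in point_table:
        edge_types, offsets = [], []
        # Sorting groups a node's edges by edge type inside the data area.
        for edge_type, target_type, target_id in sorted(edges_by_node.get(node_id, [])):
            if not edge_types or edge_types[-1] != edge_type:
                edge_types.append(edge_type)
                offsets.append(len(data_area))
            data_area.append((target_type, target_id))
        index_area.append((edge_types, offsets, len(data_area)))
    return index_area, data_area


def edges_of_type(position, edge_type, index_area, data_area):
    """All edges of one type for the node at row `position` of the point table."""
    edge_types, offsets, end = index_area[position]
    if edge_type not in edge_types:
        return []
    i = edge_types.index(edge_type)
    stop = offsets[i + 1] if i + 1 < len(offsets) else end
    return data_area[offsets[i]:stop]
```

Keeping the types in one run and the offsets in another lets a reader scan only the type list to decide whether a node has any edge of the requested type.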
  7. The method according to claim 3, wherein the edge attribute table comprises an edge attribute table index area and an edge attribute table data area;
    the attribute information of the edges of the several nodes is stored in the edge attribute table data area;
    the edge attribute table index area stores edge attribute index information of the edges of the several nodes, the edge attribute index information comprising storage address information, in the edge attribute table data area, of the attribute information of the edges of the corresponding node;
    the storage order of the edge attribute index information of the edges of the several nodes in the edge attribute table index area is consistent with the storage order of the edge information of the edges of the several nodes in the edge table data area.
  8. The method according to claim 1, wherein the node information further comprises a node type, and the node information of the several nodes is stored in the point table in order of node type.
  9. The method according to claim 1, wherein the point attribute table comprises a point attribute table index area and a point attribute table data area;
    the attribute information of the several nodes is stored in the point attribute table data area;
    the point attribute table index area stores node attribute index information of the several nodes, the node attribute index information comprising storage address information, in the point attribute table data area, of the attribute information of the node;
    the storage order of the node attribute index information of the several nodes in the point attribute table index area is consistent with the storage order of the several nodes in the point table.
  10. The method according to claim 1, further comprising: generating a table meta of the data block, the table meta comprising storage address information of each table in the data block and the node identifier of the first node in the point table of the data block.
  11. The method according to claim 10, wherein the data block comprises encoded information; the method further comprises: generating a vocabulary for a graph file comprising a plurality of the data blocks; the vocabulary comprises a mapping between the encoded information in each data block of the graph file and the original information.
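The vocabulary of claim 11 maps encoded values back to original values. One common way to obtain such a mapping is dictionary encoding, sketched below; the claims do not say which encoding is used, so this technique and the names `encode_with_vocabulary` and `decode` are assumptions for illustration only.

```python
def encode_with_vocabulary(values):
    """Replace each original value with a small integer code.

    Returns (codes, vocabulary), where vocabulary[code] is the original value.
    Codes are assigned in order of first appearance.
    """
    code_of = {}
    codes = []
    for value in values:
        # setdefault assigns the next free code the first time a value is seen.
        codes.append(code_of.setdefault(value, len(code_of)))
    vocabulary = list(code_of)  # dicts keep insertion order, so index == code
    return codes, vocabulary


def decode(codes, vocabulary):
    """Recover the original values from their codes."""
    return [vocabulary[code] for code in codes]
```

Storing the codes in the data blocks and the vocabulary once per file keeps repeated strings, such as edge type names, out of every block.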
  12. The method according to claim 10, further comprising: generating a data block index of a graph file comprising a plurality of the data blocks; the data block index of the graph file comprises storage address information of each data block in the graph file and the node identifier of the first node in each data block.
  13. The method according to claim 12, further comprising: generating a graph file meta, the graph file meta comprising, for each data block, the graph file in which the data block is located and the sequence number of the data block within that graph file, the node identifier of the first node in each graph file, and the node identifier of the last node in each graph file.
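To make the data block index of claim 12 concrete, the sketch below concatenates serialized blocks into one byte string and records, for each block, where it starts and which node comes first. JSON serialization, the extra `length` field, and the names `write_graph_file` and `read_block` are assumptions introduced here; the claims only require storage address information and the first node identifier.

```python
import json


def write_graph_file(blocks):
    """blocks: list of dicts, each with a non-empty 'point_table' list.

    Returns (payload, block_index). Each index entry holds the byte offset
    and length of one serialized block plus the block's first node identifier.
    """
    payload = bytearray()
    block_index = []
    for block in blocks:
        raw = json.dumps(block).encode("utf-8")
        block_index.append({
            "offset": len(payload),                     # storage address of the block
            "length": len(raw),                         # assumption: needed to slice it back out
            "first_node_id": block["point_table"][0],
        })
        payload += raw
    return bytes(payload), block_index


def read_block(payload, entry):
    """Load exactly one block using its index entry."""
    start = entry["offset"]
    return json.loads(payload[start:start + entry["length"]].decode("utf-8"))
```

Reading through the index means a lookup touches only the bytes of one block, which fits the idea of the data block as the unit of reading and writing.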
  14. The method according to claim 1, wherein the data block is the minimum read/write unit.
  15. The method according to claim 1, wherein the edges of the graph data comprise outgoing edges and incoming edges; the edge table comprises an outgoing edge table and an incoming edge table; the edge attribute table comprises an outgoing edge attribute table and an incoming edge attribute table; and the node information further comprises storage address information of the outgoing edges and storage address information of the incoming edges of the node.
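Keeping separate outgoing and incoming edge tables, as in claim 15, lets both directions be read without scanning every edge. The sketch below builds the two tables row-aligned with the point table; the function name and the plain `(source, target)` edge shape are hypothetical, and attributes are left out for brevity.

```python
def build_directional_edge_tables(point_table, edges):
    """edges: list of (source_id, target_id).

    Returns (out_edges, in_edges); row i of each list belongs to point_table[i].
    """
    row_of = {node_id: i for i, node_id in enumerate(point_table)}
    out_edges = [[] for _ in point_table]
    in_edges = [[] for _ in point_table]
    for source_id, target_id in edges:
        if source_id in row_of:   # outgoing edge, stored with its source node
            out_edges[row_of[source_id]].append(target_id)
        if target_id in row_of:   # incoming edge, stored with its target node
            in_edges[row_of[target_id]].append(source_id)
    return out_edges, in_edges
```

Each edge is stored twice in this sketch, which trades space for direct access from either endpoint.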
  16. A storage system for graph data, the graph data comprising nodes and edges, the storage system comprising:
    a node information storage module, configured to store node information of several nodes in the graph data in a point table of a data block, the node information comprising a node identifier;
    an edge information storage module, configured to store edge information of the edges of the several nodes in an edge table of the data block, the edge information comprising the node identifier of the target node connected by the edge;
    a node attribute information storage module, configured to store attribute information of the several nodes in a point attribute table of the data block;
    an edge attribute information storage module, configured to store attribute information of the edges of the several nodes in an edge attribute table of the data block.
  17. A graph data storage apparatus, comprising a storage medium and a processor, the storage medium being configured to store computer instructions and the processor being configured to execute the computer instructions to implement the storage method according to any one of claims 1-15.
  18. A storage device for graph data, the graph data comprising nodes and edges; the storage device stores several data blocks, each data block comprising:
    a point table, configured to store node information of at least some of the nodes in the graph data, the node information comprising a node identifier;
    an edge table, configured to store edge information of the edges of the nodes, the edge information comprising the node identifier of the target node connected by the edge;
    a point attribute table, configured to store attribute information of the nodes;
    an edge attribute table, configured to store attribute information of the edges of the nodes.
  19. A graph data query method, comprising:
    receiving a query request, the query request comprising the node identifier of a target query node;
    accessing a graph file meta, and determining the target graph file in which the target query node is located from the node identifier of the first node and the node identifier of the last node of each graph file stored in the graph file meta;
    accessing the data block index of the target graph file, and determining the target data block in which the target query node is located from the node identifier of the first node of each data block of the target graph file stored in the data block index;
    reading the target data block based on the storage address information of each data block of the target graph file stored in the data block index;
    in the target data block, obtaining the storage address information of the point table from the table meta of the target data block, and finding the node information of the target query node in the point table based on the node identifier of the target query node;
    based on the storage order of the node information of the target query node in the point table, or on the storage address information of its edges, obtaining one or more of the edge information, point attribute information, and edge attribute information of the target query node from one or more of the edge table, the point attribute table, and the edge attribute table of the target data block.
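The lookup path of claim 19 can be sketched as a chain of range checks and binary searches. Everything below is a simplified in-memory model under stated assumptions: node identifiers are taken to be sorted within and across data blocks, "addresses" are plain list positions or dictionary keys, and the names `query_node`, `file_meta`, `block_index`, and `table_meta` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def query_node(node_id, file_meta, files):
    """Return the edges and attributes of node_id, or None if it is not stored."""
    # Step 1: the file-level metadata gives the first and last node of each file.
    name = next((m["file"] for m in file_meta
                 if m["first_node_id"] <= node_id <= m["last_node_id"]), None)
    if name is None:
        return None
    # Step 2: the block index lists the first node of each block; pick the
    # last block whose first node identifier is <= node_id.
    block_index = files[name]["block_index"]
    firsts = [entry["first_node_id"] for entry in block_index]
    b = bisect_right(firsts, node_id) - 1
    if b < 0:
        return None
    # Step 3: read only that block, using its stored address.
    block = files[name]["blocks"][block_index[b]["address"]]
    # Step 4: the per-block table metadata says where each table lives.
    tables = block["table_meta"]
    point_table = block[tables["point_table"]]
    row = bisect_left(point_table, node_id)
    if row == len(point_table) or point_table[row] != node_id:
        return None
    # Step 5: the row position in the point table selects the matching rows
    # of the other tables, which are stored in the same order.
    return {
        "edges": block[tables["edge_table"]][row],
        "attributes": block[tables["point_attr_table"]][row],
        "edge_attributes": block[tables["edge_attr_table"]][row],
    }
```

In this model a query reads one metadata record, one block index, and one data block, regardless of how many blocks the graph occupies.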
PCT/CN2023/070606 2022-01-07 2023-01-05 Graph data storage WO2023131218A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210014665.2A CN114077680B (en) 2022-01-07 2022-01-07 Graph data storage method, system and device
CN202210014665.2 2022-01-07

Publications (1)

Publication Number Publication Date
WO2023131218A1 (en) 2023-07-13

Family

ID=80284470

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/070606 WO2023131218A1 (en) 2022-01-07 2023-01-05 Graph data storage

Country Status (2)

Country Link
CN (1) CN114077680B (en)
WO (1) WO2023131218A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114077680B (en) * 2022-01-07 2022-05-17 支付宝(杭州)信息技术有限公司 Graph data storage method, system and device
CN114282073B (en) * 2022-03-02 2022-07-15 支付宝(杭州)信息技术有限公司 Data storage method and device and data reading method and device
CN116204683A (en) * 2022-09-15 2023-06-02 阿里巴巴(中国)有限公司 Dynamic image data storage system, reading system and corresponding method
CN115481298B (en) * 2022-11-14 2023-03-14 阿里巴巴(中国)有限公司 Graph data processing method and electronic equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572740A (en) * 2013-10-23 2015-04-29 华为技术有限公司 Data storage method and device
US20180101559A1 (en) * 2016-10-06 2018-04-12 Microsoft Technology Licensing, Llc Diverse addressing of graph database entities by database applications
CN109189994A (en) * 2018-06-27 2019-01-11 北京中科睿芯科技有限公司 A kind of CAM structure storage system calculating application towards figure
CN111512303A (en) * 2017-12-29 2020-08-07 电子技术公司 Hierarchical graphics data structure
CN112287182A (en) * 2020-10-30 2021-01-29 杭州海康威视数字技术股份有限公司 Graph data storage and processing method and device and computer storage medium
CN112559631A (en) * 2020-12-15 2021-03-26 北京百度网讯科技有限公司 Data processing method and device of distributed graph database and electronic equipment
CN113609347A (en) * 2021-10-08 2021-11-05 支付宝(杭州)信息技术有限公司 Data storage and query method, device and database system
CN113722520A (en) * 2021-11-02 2021-11-30 支付宝(杭州)信息技术有限公司 Graph data query method and device
CN114077680A (en) * 2022-01-07 2022-02-22 支付宝(杭州)信息技术有限公司 Method, system and device for storing graph data

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8055864B2 (en) * 2007-08-06 2011-11-08 International Business Machines Corporation Efficient hierarchical storage management of a file system with snapshots
US9047189B1 (en) * 2013-05-28 2015-06-02 Amazon Technologies, Inc. Self-describing data blocks of a minimum atomic write size for a data store
CN104133970A (en) * 2014-08-06 2014-11-05 浪潮(北京)电子信息产业有限公司 Data space management method and device
US20180173755A1 (en) * 2016-12-16 2018-06-21 Futurewei Technologies, Inc. Predicting reference frequency/urgency for table pre-loads in large scale data management system using graph community detection
CN107657027B (en) * 2017-09-27 2021-09-21 北京小米移动软件有限公司 Data storage method and device
US10810075B2 (en) * 2018-04-23 2020-10-20 EMC IP Holding Company Generating a social graph from file metadata

Also Published As

Publication number Publication date
CN114077680B (en) 2022-05-17
CN114077680A (en) 2022-02-22

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23737081

Country of ref document: EP

Kind code of ref document: A1