CN117785889A

CN117785889A - Index management method for graph database and related equipment

Info

Publication number: CN117785889A
Application number: CN202410200092.1A
Authority: CN
Inventors: 唐浩栋; 吴涛
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2024-02-22
Filing date: 2024-02-22
Publication date: 2024-03-29

Abstract

This specification provides an index management method for a graph database and related equipment. The method comprises: acquiring a target index to be stored that corresponds to target graph data, the target index comprising a primary key, a secondary key and an attribute data storage address of the target graph data, wherein the primary key is data obtained by mapping the graph data identifier of the target graph data, and the secondary key comprises metadata corresponding to the target graph data; and storing the secondary key and the attribute data storage address on a disk, and storing, in a memory, the disk addresses at which the secondary key and the attribute data storage address are stored.

Description

Index management method for graph database and related equipment

Technical Field

One or more embodiments of the present disclosure relate to the field of database technologies, and in particular, to an index management method for a graph database and related devices.

Background

A graph is a complex data structure composed of points and edges, each of which may carry a plurality of attributes, wherein points may represent entities and the edges connecting the points may represent relationships between them.

In graph queries, graph data indexing is a relatively important technique. The index built for each point and edge in the graph is typically stored in memory managed by the storage engine of the graph database so that the storage engine can quickly retrieve the index and read the corresponding graph data based on the index.

However, as graph data continues to grow in scale, the number of indexes grows with it, which places a heavy burden on memory.

Disclosure of Invention

In view of this, one or more embodiments of the present disclosure provide an index management method for a graph database and related devices.

In a first aspect, the present specification provides a method of index management for a graph database, the method comprising:

acquiring a target index to be stored, which corresponds to target graph data; the target index comprises a primary key, a secondary key and an attribute data storage address of the target graph data; the primary key is data obtained after mapping the graph data identifier of the target graph data; the secondary key comprises metadata corresponding to the target graph data;

and storing the secondary key and the attribute data storage address on a disk, and storing, in a memory, the disk addresses at which the secondary key and the attribute data storage address are stored.

In a second aspect, the present specification provides an index management apparatus for a graph database, the apparatus comprising:

an acquisition unit, configured to acquire a target index to be stored, which corresponds to target graph data; the target index comprises a primary key, a secondary key and an attribute data storage address of the target graph data; the primary key is data obtained after mapping the graph data identifier of the target graph data; the secondary key comprises metadata corresponding to the target graph data;

and a storage unit, configured to store the secondary key and the attribute data storage address on a disk, and to store, in a memory, the disk addresses at which the secondary key and the attribute data storage address are stored.

In a third aspect, the present specification provides a graph data query method for a graph database, where a graph data index in the graph database includes a primary key, a secondary key, and an attribute data storage address of the graph data; the primary key is data obtained after mapping the graph data identifier related to the graph data; the secondary key comprises metadata corresponding to the graph data; the method comprises:

receiving a query statement for the graph database, and parsing it to obtain the query conditions, contained in the query statement, that relate to the graph data to be queried;

obtaining the disk addresses, stored in a memory, of a plurality of secondary keys and attribute data storage addresses, and reading the plurality of secondary keys and attribute data storage addresses from the disk according to the disk addresses;

searching the plurality of secondary keys for a secondary key whose metadata satisfies the query conditions, and acquiring attribute data of the graph data to be queried according to the attribute data storage address corresponding to that secondary key.

In a fourth aspect, the present specification provides a graph data query apparatus for a graph database, where a graph data index in the graph database includes a primary key, a secondary key, and an attribute data storage address of the graph data; the primary key is data obtained after mapping the graph data identifier related to the graph data; the secondary key comprises metadata corresponding to the graph data; the apparatus comprises:

a receiving unit, configured to receive a query statement for the graph database, and to parse it to obtain the query conditions, contained in the query statement, that relate to the graph data to be queried;

an acquisition unit, configured to obtain the disk addresses, stored in the memory, of a plurality of secondary keys and attribute data storage addresses, and to read the plurality of secondary keys and attribute data storage addresses from the disk according to the disk addresses;

and a query unit, configured to search the plurality of secondary keys for a secondary key whose metadata satisfies the query conditions, and to acquire the attribute data of the graph data to be queried according to the attribute data storage address corresponding to that secondary key.

Accordingly, the present specification also provides a computer apparatus comprising: a memory and a processor; the memory has stored thereon a computer program executable by the processor; when the processor runs the computer program, the index management method for the graph database according to the first aspect or the graph data query method for the graph database according to the third aspect is executed.

Accordingly, the present specification also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the index management method for a graph database as described in the first aspect above, or the graph data query method for a graph database as described in the third aspect.

In summary, in one aspect, the present application first defines the graph data index as comprising three parts, namely a primary key, a secondary key, and an attribute data storage address. The primary key is normalized data obtained by mapping a graph data identifier input by a user (such as a point id or the start point id of an edge); it has a fixed format, which reduces the memory space occupied and makes the index easier to manage. The secondary key contains point or edge metadata, so that a large amount of graph data that does not satisfy the query conditions can be filtered out in advance at the index layer, further reducing disk IO.

On the other hand, when memory space is insufficient, the secondary key and the attribute data storage address in the index can be stored on the disk, and the disk addresses at which they are stored can then be stored in the memory. This realizes a combined memory-and-disk index organization, which greatly reduces memory overhead and lowers the cost of storing and managing the index.

Drawings

FIG. 1 is a schematic diagram of an index management system architecture for a graph database, provided in an exemplary embodiment;

FIG. 2 is a flow chart of a method of index management for a graph database, provided in an exemplary embodiment;

FIG. 3 is a schematic diagram of a storage structure of an LSM tree provided by an exemplary embodiment;

FIG. 4 is a schematic diagram of an index management device for a graph database according to an exemplary embodiment;

FIG. 5 is a schematic diagram of a computer device according to an exemplary embodiment.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with aspects of one or more embodiments of the present description as detailed in the accompanying claims.

It should be noted that: in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, individual steps described in this specification, in other embodiments, may be described as being split into multiple steps; while various steps described in this specification may be combined into a single step in other embodiments.

The term "plurality" as used herein refers to two or more.

As described in the background above, a graph is a complex data structure that consists of points and edges, including multiple attributes (properties), where points may represent entities and edges connecting between points may represent relationships between points. Edges tend to have directionality, so two points that an edge connects to can be the start (source) and end (target) of the edge, respectively. The point-edge data may include information such as a point id, a tag (label), a type, and a time stamp, in addition to attribute data of points and edges. It should be appreciated that for an edge, a point id may include a start point id and an end point id corresponding thereto.

Graph data storage is a mature and important technique. In a conventional graph data storage scheme, graph data is usually split and stored in a key-value database in the form of keys and values, wherein the key comprises a point id input by a user or the start point id of an edge, and the value may comprise attribute data of the point or the edge. However, since a key-value database has no graph semantics, a graph index cannot be established, which makes querying the graph data very difficult.

In order to solve the above problem, in one embodiment of the present application, a graph storage engine is first constructed for graph data, and then indexes can be constructed for points and edges in the graph, and the indexes are stored in a memory managed by the storage engine for use in querying the graph.

In one embodiment of the present application, the graph data index may be formed directly from the point id or the start point id of the edge input by the user. In general, however, the point ids or start point ids input by different users, and even by the same user, vary in size: they may be 4 bytes, 8 bytes, or even 32 bytes. This irregularity makes index management difficult, and larger point ids or start point ids directly cause the index to occupy a large amount of memory space, which seriously increases memory overhead and fails to meet users' actual needs.

Based on the above, the present specification provides a technical solution in which data obtained by normalizing a point id or the start point id of an edge input by a user is used as the primary key of an index, so as to reduce the memory space occupied. In addition, when memory space is insufficient, the index (comprising point/edge metadata and the attribute data storage address) can be stored on the disk, so that a combined memory-and-disk index organization is constructed and memory overhead is greatly reduced.

In implementation, the target index to be stored corresponding to the target graph data is acquired first; the target index comprises a primary key, a secondary key and an attribute data storage address of the target graph data. The primary key is data obtained after mapping the graph data identifier of the target graph data; the secondary key includes metadata corresponding to the target graph data. Further, the application may store the secondary key and the attribute data storage address on a disk, and store, in a memory, the disk addresses at which the secondary key and the attribute data storage address are stored.

In the above technical solution, on one hand, the present application defines the graph data index as comprising three parts, namely a primary key, a secondary key, and an attribute data storage address. The primary key is normalized data obtained by mapping a graph data identifier input by a user (such as a point id or the start point id of an edge); it has a fixed format, which reduces the memory space occupied and makes the index easier to manage. The secondary key contains point or edge metadata, so that a large amount of graph data that does not satisfy the query conditions can be filtered out in advance at the index layer, further reducing disk IO. On the other hand, when memory space is insufficient, the secondary key and the attribute data storage address in the index can be stored on the disk, and the disk addresses at which they are stored can then be stored in the memory. This realizes a combined memory-and-disk index organization, which greatly reduces memory overhead and lowers the cost of storing and managing the index.

Referring to fig. 1, fig. 1 is a schematic diagram of an index management system architecture for a graph database according to an exemplary embodiment. One or more embodiments provided herein may be embodied in the system architecture shown in fig. 1 or a similar system architecture. As shown in FIG. 1, the index management system may include a storage engine 100 in a graph database, as well as memory 101 and disk 102 managed by the storage engine 100.

The specific type of the disk 102 is not particularly limited in this application. In an illustrated embodiment, the disk 102 may be a local disk managed by the storage engine, or may be a remote cloud storage disk connected through a network (referred to simply as a cloud disk), for example a cloud disk in a distributed file system (Distributed File System, DFS), which is not specifically limited in this specification.

It should be understood that, in this application, on the premise that graph data storage is implemented, a corresponding graph data index is built for each piece of graph data (i.e. point/edge data), and the indexes are stored. The storage method of the graph data is not particularly limited in the present application. In an embodiment, the attribute data in the graph data may be stored in the memory. Alternatively, in an embodiment, considering that the attribute data of points and edges is generally large, the attribute data may be stored separately on the local disk or the cloud disk, and a multi-layer storage structure such as a log-structured merge tree (LSM-Tree) may be used, which is not limited in this specification.

First, the present application defines the graph data index as including: a primary key, a secondary key, and an attribute data storage address of the graph data.

The primary key of the index in the application is not formed directly from the point id or the start point id of an edge input by the user. Instead, the point id or start point id input by the user is received and normalized, and the processed data is used as the primary key of the index.

In an illustrated embodiment, the normalization process described above may be a mapping process to map a point id or a starting point id of an edge into data having a predefined format. In an illustrated embodiment, a point id or a start point id of an edge input by a user may be mapped to an integer greater than or equal to 0 through a mapping process. By way of example, the user-entered point id or the starting point id of an edge may be mapped to an integer of a fixed size (e.g., 4 bytes) starting from 0, such as 200, 1022, 1024, etc., which may be the primary key of the index.

The specific mapping method is not particularly limited in this application. In an illustrated embodiment, dictionary mapping or any other possible mapping may be employed.
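
As a rough illustration of the dictionary mapping mentioned above, the sketch below assigns each user-supplied id the next unused integer, starting from 0, and encodes it as a fixed 4-byte value. The class name `IdDictionary`, the little-endian encoding and the in-memory `dict` are assumptions made for this example only; the specification does not prescribe a particular mapping.

```python
import struct


class IdDictionary:
    """Hypothetical dictionary mapping from user ids to fixed-size integers."""

    def __init__(self):
        self._forward = {}  # user-supplied id (bytes) -> mapped integer

    def map_id(self, user_id: bytes) -> int:
        # Reuse an existing mapping, or assign the next integer starting from 0.
        if user_id not in self._forward:
            self._forward[user_id] = len(self._forward)
        return self._forward[user_id]

    def primary_key(self, user_id: bytes) -> bytes:
        # Assumed encoding: unsigned 32-bit little-endian, so always 4 bytes.
        return struct.pack("<I", self.map_id(user_id))


ids = IdDictionary()
short_key = ids.primary_key(b"u1")            # 2-byte user id
long_key = ids.primary_key(b"account-" * 4)   # 32-byte user id
```

Whatever the size of the id supplied by the user, the resulting primary key has the same fixed width, which is the property the normalization step relies on.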

The secondary key of the index in the present application may contain meta data (i.e. metadata of a point or an edge). In an illustrated embodiment, the metadata may comprise any one or a combination of the following: whether the graph object included in the graph data is a point, whether the graph object included in the graph data is an in-edge, a time stamp of the graph data, a tag of the graph data, time information of the graph data, and the like, which are not particularly limited in this specification.

In an illustrated embodiment, the metadata may have a fixed structure and size. For example, the size of the metadata may be fixed at 8 bytes in total, and the structure may include a plurality of fields arranged in sequence, which describe in order whether the graph object is a point, whether it is an in-edge, the time stamp, the tag, the time information, and so on.
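
One possible way to picture such a fixed 8-byte metadata structure is sketched below. The specification only states the total size and the kinds of fields; the individual field widths (one flag byte, a 1-byte tag, 2 bytes of time information, a 4-byte time stamp), the field order and the function names are all assumptions chosen for illustration.

```python
import struct

# Assumed layout (8 bytes, little-endian): flags | tag | time_info | timestamp
META_FORMAT = "<BBHI"
IS_POINT = 0x01
IS_IN_EDGE = 0x02


def pack_metadata(is_point, is_in_edge, tag, time_info, timestamp):
    """Pack the metadata fields into a fixed-size byte string."""
    flags = (IS_POINT if is_point else 0) | (IS_IN_EDGE if is_in_edge else 0)
    return struct.pack(META_FORMAT, flags, tag, time_info, timestamp)


def unpack_metadata(raw):
    """Recover the metadata fields from the packed form."""
    flags, tag, time_info, timestamp = struct.unpack(META_FORMAT, raw)
    return {
        "is_point": bool(flags & IS_POINT),
        "is_in_edge": bool(flags & IS_IN_EDGE),
        "tag": tag,
        "time_info": time_info,
        "timestamp": timestamp,
    }


meta = pack_metadata(False, True, tag=7, time_info=30, timestamp=1_700_000_000)
```

Because every entry has the same width, a query condition on any field can be checked by reading a fixed offset, without parsing variable-length data.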

In an illustrated embodiment, the secondary key may also contain any one or a combination of the following data: the write time (write ts) of the graph data, the end point id (target id) of the graph data, the system sequence number (sequence id) of the graph data, and the like, which are not particularly limited in this specification. In an illustrated embodiment, the write ts may occupy 4 bytes, the sequence id may occupy 4 bytes, and the target id may occupy n bytes, where the specific size of the target id (i.e., the value of n) may depend on the user's definition, which is not specifically limited in this specification. The write ts is a write time assigned by the system when data is written; operations such as deleting specified data can be performed based on the write time, which will not be described in detail herein. Each point or edge written into the graph database is assigned a unique id, namely the sequence id, which is subsequently used during merging (compaction) and will not be described in detail herein.
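
The sizes given above can be made concrete with a small encoding sketch. It assumes, purely for illustration, that the secondary key is the concatenation of the 8-byte metadata, the 4-byte write ts, the 4-byte sequence id and the n-byte target id, in that order and in little-endian form; the specification does not fix the order or the byte order, and the function names are hypothetical.

```python
import struct


def encode_secondary_key(metadata: bytes, write_ts: int, sequence_id: int,
                         target_id: bytes) -> bytes:
    """Concatenate metadata (8 B), write ts (4 B), sequence id (4 B), target id (n B)."""
    if len(metadata) != 8:
        raise ValueError("metadata is assumed to be exactly 8 bytes")
    return metadata + struct.pack("<II", write_ts, sequence_id) + target_id


def decode_secondary_key(raw: bytes):
    """Split an encoded secondary key back into its four parts."""
    metadata = raw[:8]
    write_ts, sequence_id = struct.unpack_from("<II", raw, 8)
    target_id = raw[16:]  # the remaining n bytes
    return metadata, write_ts, sequence_id, target_id


key = encode_secondary_key(b"\x00" * 8, write_ts=1_700_000_000,
                           sequence_id=42, target_id=b"bob")
```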

Wherein the attribute data storage address is associated with a storage mode of the attribute data. For example, if the attribute data is stored in the memory, the attribute data storage address is a memory address; if the attribute data is stored in the local disk, the attribute data storage address is a local disk address; if the attribute data is stored in the cloud disk, the attribute data storage address is a cloud disk address, which is not specifically limited in the present specification.

In an illustrated embodiment, if the storage space of the memory 101 is insufficient, the application may first store the secondary key and the attribute data storage address in the index on the disk 102, and then store the disk addresses at which they are stored on the disk 102 into the memory 101, thereby constructing a combined memory-and-disk index organization and greatly reducing memory overhead.

Further, as shown in fig. 1, a memory object for storing graph data indexes is predefined in the memory 101, and the storage space corresponding to the memory object may include a plurality of sub-storage spaces respectively corresponding to different storage space identifiers. As shown in fig. 1, the storage space corresponding to the memory object may include, for example: sub-storage space 1 corresponding to storage space identifier 1, sub-storage space 2 corresponding to storage space identifier 2, sub-storage space 3 corresponding to storage space identifier 3, ..., and sub-storage space Y corresponding to storage space identifier Y.

It should be noted that the specific type of the memory object is not particularly limited in the present application. In an embodiment, the memory object may be an array object, and the storage space corresponding to the array object may include a plurality of storage blocks (blocks) corresponding to different array indices, where the values of the array are stored in the blocks, as described in detail below with reference to fig. 2. In an illustrated embodiment, the present application may also use a hash table or any other possible structure to store the index in the memory, which is not specifically limited in this specification. In general, an array consumes less memory than other structures.

The method for managing the index of graph data in the present application will be described below based on the predefined index structure and the memory object in the memory 101 for storing the index of graph data.

First, the storage engine 100 may acquire a target index to be stored corresponding to target graph data. The target index may include a primary key, a secondary key, and an attribute data storage address of the target graph data.

The primary key is the data obtained after the storage engine 100 maps the graph data identifier associated with the target graph data. It should be understood that, if the graph object included in the target graph data is a point, the graph data identifier is the id of the point; if the graph object included in the target graph data is an edge, the graph data identifier is the id of the start point of the edge.

The secondary key may include metadata corresponding to the target graph data, and specifically may include: whether the graph object included in the target graph data is a point, whether the graph object included in the target graph data is an in-edge, a time stamp of the target graph data, a tag of the target graph data, time information of the target graph data, and the like, which are not particularly limited in this specification.

Further, storage engine 100 may determine whether memory 101 satisfies the condition for storing the target index described above.

In one illustrated embodiment, if the memory 101 does not meet the condition for storing the target index, for example because the remaining storage capacity of the memory 101 is low or even insufficient to store the target index, the storage engine 100 may calculate the target storage space identifier corresponding to the target index based on the primary key of the target index.

Further, the storage engine 100 may store the secondary key and the attribute data storage address in the target index on the disk 102, and store the disk addresses at which they are stored on the disk 102 into the target sub-storage space corresponding to the target storage space identifier in the memory 101. This combined memory-and-disk index organization effectively reduces memory overhead.
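
The write path just described can be pictured with the toy model below, in which a `bytearray` stands in for the disk 102 and a Python list stands in for the array-style memory object in the memory 101. How the storage space identifier is calculated from the primary key is not specified here; the sketch assumes the mapped primary key is used directly as the array index. The class and method names are hypothetical.

```python
class IndexStore:
    """Toy model: a bytearray plays the disk, a list of blocks plays the memory."""

    def __init__(self):
        self.disk = bytearray()
        self.blocks = []  # array object: position = storage space identifier

    def space_id(self, primary_key: int) -> int:
        # Assumption: the mapped primary key is used directly as the array index.
        return primary_key

    def store(self, primary_key: int, secondary_key: bytes, attr_address: bytes):
        record = secondary_key + attr_address
        offset = len(self.disk)
        self.disk += record                    # secondary key + address go to disk
        sid = self.space_id(primary_key)
        while len(self.blocks) <= sid:         # grow the array on demand
            self.blocks.append([])
        # Only the disk address (offset, length) is kept in memory.
        self.blocks[sid].append((offset, len(record)))
        return offset, len(record)

    def load(self, primary_key: int):
        sid = self.space_id(primary_key)
        return [bytes(self.disk[o:o + n]) for o, n in self.blocks[sid]]


store = IndexStore()
store.store(2, b"meta-A", b"addr-A")
store.store(2, b"meta-B", b"addr-B")
```

In this model each memory entry costs a fixed pair of integers regardless of how large the secondary key on disk is, which is where the memory saving would come from.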

In an illustrated embodiment, the present application may store a plurality of indexes in batches, where the target index may be one of the plurality of indexes, and the plurality of secondary keys and attribute data storage addresses corresponding to the plurality of indexes may be written to the disk 102 in batches. For details, refer to the description of the embodiment corresponding to fig. 2 below, which is not repeated here.

Further, the graph data query method in the present application will be explained below based on the graph data index management manner discussed above.

First, a query engine (or computation engine) in the database may receive a query statement for the graph database and parse the query statement to obtain a graph data identifier and a query condition associated with the graph data to be queried contained in the query statement.

Further, the query engine may send the parsed graph data identifier related to the graph data to be queried and the query condition to the storage engine 100. Accordingly, the storage engine 100 receives the graph data identification and the query condition.

Further, the storage engine 100 may perform normalization processing (e.g., the dictionary mapping processing described above) on the graph data identifier to obtain the primary key of the index of the graph data to be queried. Further, the storage engine 100 may calculate, according to the primary key, the storage space identifier corresponding to the index of the graph data to be queried.

Further, the storage engine 100 may determine the sub-storage space corresponding to the storage space identifier, and read the plurality of secondary keys and attribute data storage addresses from the disk 102 according to the disk addresses stored in that sub-storage space.

Further, the storage engine 100 may search the plurality of secondary keys for a secondary key whose metadata satisfies the above query condition, and read out the attribute data of the graph data to be queried according to the attribute data storage address corresponding to that secondary key.
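
The query flow above can be sketched end to end as follows. All data structures are hypothetical stand-ins: `id_dictionary` plays the normalization step, `memory_blocks` the sub-storage spaces holding disk addresses, `disk` the stored pairs of metadata and attribute data storage address, and `attributes` the attribute data itself. The identity mapping from primary key to storage space identifier is an assumption.

```python
# Hypothetical stand-ins for the structures involved in a query.
id_dictionary = {"alice": 0}        # graph data identifier -> primary key
memory_blocks = {0: [100, 101]}     # storage space identifier -> disk addresses
disk = {                            # disk address -> (metadata, attribute address)
    100: ({"is_point": False, "tag": "transfer"}, "attr-0"),
    101: ({"is_point": False, "tag": "follow"}, "attr-1"),
}
attributes = {"attr-0": {"amount": 25}, "attr-1": {"since": 2021}}


def query(graph_id, condition):
    primary_key = id_dictionary[graph_id]        # normalization step
    space_id = primary_key                       # assumed: identity mapping
    results = []
    for address in memory_blocks[space_id]:      # disk addresses held in memory
        metadata, attr_address = disk[address]   # read the index entry from disk
        if condition(metadata):                  # filter at the index layer
            # Attribute data is fetched only for entries that pass the filter.
            results.append(attributes[attr_address])
    return results


matches = query("alice", lambda m: m["tag"] == "transfer")
```

Entries whose metadata fails the condition never trigger a read of the attribute data, which illustrates how filtering at the index layer can reduce disk IO.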

It should be appreciated that fig. 1 is merely illustrative, and in some possible embodiments, more or fewer structures than those shown in fig. 1 may be included in the index management system for a graph database, for example, a query engine in a graph database, etc., which is not specifically limited in this disclosure.

Referring to fig. 2, fig. 2 is a flowchart of an index management method for a graph database according to an exemplary embodiment. The method can be applied to the index management system for the graph database shown in fig. 1, and particularly can be applied to the storage engine 100. As shown in fig. 2, the method may specifically include the following steps S201 to S204.

Step S201, obtaining a target index to be stored, which corresponds to target graph data; the target index comprises a primary key, a secondary key and an attribute data storage address of the target graph data; the primary key is data obtained after mapping the graph data identifier of the target graph data; the secondary key includes metadata corresponding to the target graph data.

When the target graph data is stored or updated, the application can construct a corresponding target index for the target graph data and store the target index. The target index may include a primary key, a secondary key, and an attribute data storage address of the target graph data.

As described above, the storage engine may acquire the graph data identifier (e.g., the id of a point, or the id of the start point of an edge) related to the target graph data input by a user, perform mapping processing on the graph data identifier, and use the fixed-format data obtained after the mapping processing as the primary key of the target index. In an illustrated embodiment, the graph data identifier may be a point id or the start point id of an edge entered directly by a user; alternatively, in an illustrated embodiment, the graph data identifier may also be determined by the graph database according to a related operation initiated by a user, which is not specifically limited in this specification.

The secondary key in the target index may include metadata corresponding to the target graph data, a write time of the target graph data, the end point id of the target graph data, a system sequence number of the target graph data, and the like, which are not specifically limited in this specification. For details, refer to the description of the embodiment corresponding to fig. 1, which is not repeated here.

It should be noted that, unlike conventional key-value (KV) data, graph data has a specific access pattern: point and edge data sharing the same start point id are, with high probability, accessed together. Because the edges sharing one start point id may be very numerous, quickly locating the desired edge among them is important for graph queries. For this reason, the present application adds further information, such as the metadata and the end point id, to the secondary key of the index, so that a large number of indexes that do not satisfy the query conditions can be filtered out at the index layer. The desired edges can thus be located quickly among the many edges sharing the same start point id, improving graph query efficiency.

As described above, depending on how the graph data is stored, the attribute data storage address in the target index may be a memory address, a local disk address, or a cloud disk address, which is not specifically limited in this specification. In an embodiment, the present application may also store a copy of the attribute data on the cloud disk while storing the attribute data on the local disk, so as to maintain data security.

In an illustrated embodiment, the attribute data may be stored on the disk in the form of files based on an LSM tree storage structure; accordingly, the attribute data storage address may specifically point to a file identifier (fid) and a file offset of the disk file, and the like, which is not specifically limited in this specification.
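
An address made of a file identifier and a file offset can be resolved with a single seek, as the sketch below shows. It does not model an LSM tree; it only illustrates the (fid, offset) addressing. The file naming scheme, the 4-byte length prefix in front of each record and the function names are assumptions introduced for the example.

```python
import os
import struct
import tempfile


def file_path(directory: str, fid: int) -> str:
    return os.path.join(directory, f"{fid:06d}.dat")  # hypothetical naming scheme


def append_attributes(directory: str, fid: int, payload: bytes):
    """Append a length-prefixed record and return its (fid, offset) address."""
    path = file_path(directory, fid)
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    with open(path, "ab") as f:
        f.write(struct.pack("<I", len(payload)) + payload)
    return fid, offset


def read_attributes(directory: str, address) -> bytes:
    """Resolve a (fid, offset) address back to the stored attribute data."""
    fid, offset = address
    with open(file_path(directory, fid), "rb") as f:
        f.seek(offset)
        (length,) = struct.unpack("<I", f.read(4))
        return f.read(length)


with tempfile.TemporaryDirectory() as data_dir:
    first = append_attributes(data_dir, 1, b'{"name": "alice"}')
    second = append_attributes(data_dir, 1, b'{"name": "bob"}')
    loaded = read_attributes(data_dir, second)
```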

Step S202, storing the secondary key and the attribute data storage address on a disk, and storing, in a memory, the disk addresses at which the secondary key and the attribute data storage address are stored.

As described above, the method can first store the secondary key and the attribute data storage address in the index on the disk, and then store the disk addresses at which they are stored into the memory, thereby constructing a combined memory-and-disk index organization.

In one illustrated embodiment, the present application predefines memory objects for storing graph data indexes in memory managed by a storage engine. The storage space corresponding to the memory object may be divided into a plurality of sub-storage spaces, and each sub-storage space may correspond to a different storage space identifier.

In an embodiment, the memory object may be an array object, and the memory space corresponding to the array object may include a plurality of memory blocks corresponding to different array indices.

In an embodiment, after the storage engine obtains the target index to be stored, it may first determine whether the memory managed by the storage engine meets the condition of storing the target index.

Illustratively, the above condition may include: the remaining storage capacity of the memory being greater than the size of the target index, or the remaining storage capacity of the memory being greater than a preset threshold, and so on. In short, the condition determines whether the memory space is sufficient, which is not specifically limited in this specification.
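
The two example conditions can be written as a small predicate. The function names and the returned placement labels are hypothetical. This passage does not detail what happens when memory is sufficient, so the sketch simply assumes the whole index then stays in memory.

```python
from typing import Optional


def memory_can_hold(remaining_bytes: int, index_size: int,
                    preset_threshold: Optional[int] = None) -> bool:
    """Return True if memory is considered sufficient to store the index."""
    if preset_threshold is not None:
        # Variant 2: remaining capacity must exceed a preset threshold.
        return remaining_bytes > preset_threshold
    # Variant 1: remaining capacity must exceed the size of the target index.
    return remaining_bytes > index_size


def choose_placement(remaining_bytes: int, index_size: int,
                     preset_threshold: Optional[int] = None) -> str:
    if memory_can_hold(remaining_bytes, index_size, preset_threshold):
        return "memory"        # assumed behaviour when memory is sufficient
    return "memory+disk"       # secondary key and address go to disk
```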

In an embodiment, if the memory managed by the storage engine does not meet the condition for storing the target index, the application may, without expanding the memory, store the target index by combining the existing memory with the disk, reducing memory overhead through a sparse index in memory plus a secondary index on disk.

First, the storage engine may calculate a target storage space identification corresponding to a target index based on a primary key in the target index.

Further, the storage engine may store the auxiliary key and the attribute data storage address of the target index into a disk managed by the storage engine, and store the disk addresses of the auxiliary key and the attribute data storage address in the disk into the target sub-storage space corresponding to the target storage space identifier obtained by the calculation.

Specifically, the storage engine may first serialize the auxiliary key and the attribute data storage address, and then store the serialized auxiliary key and attribute data storage address in the disk.

In an embodiment, the auxiliary key and the attribute data storage address may be stored in a disk based on the LSM tree storage structure in the form of a file, and accordingly, the disk addresses of the auxiliary key and the attribute data storage address may specifically point to a file identifier and a file offset of a disk file, and the like, which are not specifically limited in this specification.
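The serialize-then-store step and the resulting disk address (file identifier plus file offset) can be sketched as follows. This is a minimal illustration only: the fixed record layout, the `DiskAddress` type, and the use of in-memory `io.BytesIO` objects in place of real disk files are assumptions made for the example, not details given in this specification.

```python
import io
import struct
from typing import Dict, NamedTuple, Tuple


class DiskAddress(NamedTuple):
    file_id: int  # identifier of the disk file holding the record
    offset: int   # byte offset of the record inside that file


# Hypothetical fixed-size record: 8-byte timestamp, 1-byte out-edge flag,
# 8-byte attribute data storage address (little-endian, no padding).
RECORD = struct.Struct("<QBQ")


def append_record(files: Dict[int, io.BytesIO], file_id: int,
                  timestamp: int, is_out_edge: bool, attr_addr: int) -> DiskAddress:
    """Serialize one auxiliary key + attribute address and append it to a file."""
    f = files.setdefault(file_id, io.BytesIO())
    offset = f.seek(0, io.SEEK_END)  # append-only write, as in an LSM-style file
    f.write(RECORD.pack(timestamp, int(is_out_edge), attr_addr))
    return DiskAddress(file_id, offset)


def read_record(files: Dict[int, io.BytesIO], addr: DiskAddress) -> Tuple[int, bool, int]:
    """Read the record at a disk address and deserialize it."""
    f = files[addr.file_id]
    f.seek(addr.offset)
    timestamp, flag, attr_addr = RECORD.unpack(f.read(RECORD.size))
    return timestamp, bool(flag), attr_addr


files: Dict[int, io.BytesIO] = {}
first = append_record(files, 7, 1_700_000_000, True, 4096)
second = append_record(files, 7, 1_700_000_500, False, 8192)
```

Only the small `DiskAddress` values would need to be kept in memory; the records themselves stay in the files.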

In an embodiment, the storage engine may store the auxiliary key and the attribute data storage address of the target index in a local disk managed by the storage engine, where the disk addresses of the auxiliary key and the attribute data storage address are the local disk addresses, and point to the file identifier and the file offset of the local disk file.

In an embodiment, the storage engine may store the auxiliary key and the attribute data storage address of the target index in the cloud disk of the DFS, where the disk addresses of the auxiliary key and the attribute data storage address are cloud disk addresses, and point to the file identifier and the file offset of the cloud disk file, which are not specifically limited in this specification.

In an embodiment, the application stores the auxiliary key and the attribute data storage address in the local disk, and at the same time stores part of the auxiliary keys and attribute data storage addresses in the cloud disk to maintain data security. Accordingly, the disk addresses of the auxiliary key and the attribute data storage address may include a local disk address and a cloud disk address, which point to the file identifier and file offset of the local disk file and to the file identifier and file offset of the cloud disk file, respectively, which are not specifically limited in this specification. Alternatively, in an embodiment where the auxiliary key and the attribute data storage address are stored in the local disk and the cloud disk at the same time, the disk address stored in the target sub-storage space of the memory may preferably include only the local disk address, which is not specifically limited in this specification.

The primary key of the target index may be an integer greater than or equal to 0. Accordingly, in an illustrated embodiment, the storage engine may calculate the target storage space identifier corresponding to the target index as follows: divide the primary key by a preset value N and round down, and take the resulting value as the target storage space identifier corresponding to the target index. The sub-storage space corresponding to each storage space identifier can store the disk addresses of N auxiliary keys and attribute data storage addresses.

Taking the case where the memory object is an array object as an example, the array may include 10 blocks in total, corresponding to array subscripts 0-9. If N is 1024, the block corresponding to each array subscript can store the disk addresses of 1024 auxiliary keys and attribute data storage addresses. If the primary key of the target index is 1023, dividing the primary key by the preset value 1024 and rounding down gives 0, and correspondingly, the disk addresses of the auxiliary key and the attribute data storage address of the target index can be stored in the block with subscript 0. If the primary key of the target index is 1027, dividing the primary key by the preset value 1024 and rounding down gives 1, and correspondingly, the disk addresses of the auxiliary key and the attribute data storage address of the target index can be stored in the block with subscript 1.
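The calculation in the example above amounts to integer (floor) division. A minimal sketch, in which the function name `block_index` is chosen for illustration only:

```python
N = 1024  # preset value N from the example: entries per block


def block_index(primary_key: int, n: int = N) -> int:
    """Map a non-negative integer primary key to its array subscript."""
    if primary_key < 0:
        raise ValueError("primary key must be an integer >= 0")
    return primary_key // n  # divide by N and round down


subscript_a = block_index(1023)
subscript_b = block_index(1027)
```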

In an embodiment, the disk addresses of the auxiliary keys and attribute data storage addresses corresponding to the plurality of indexes stored in each block may further be ordered by the primary keys of the indexes. For example, if the primary key of the target index is 1024, the remainder of dividing the primary key by the preset value 1024 is 0, and correspondingly, the disk addresses of the auxiliary key and the attribute data storage address of the target index may be placed at the 1st position in the block with subscript 1. Likewise, if the primary key of the target index is 1028, the remainder of dividing the primary key by the preset value 1024 is 4, and correspondingly, the disk addresses of the auxiliary key and the attribute data storage address of the target index may be placed at the 5th position in the block with subscript 1.
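Combining the quotient (block subscript) with the remainder (position inside the block) gives a two-level lookup. The class below is a sketch under the assumption that each block is a fixed-length list of N slots and that a disk address is a `(file identifier, file offset)` pair; the class and method names are hypothetical.

```python
from typing import List, Optional, Tuple

DiskAddress = Tuple[int, int]  # (file identifier, file offset), assumed shape


class SparseMemoryIndex:
    """In-memory array of blocks; each block holds N disk-address slots."""

    def __init__(self, num_blocks: int = 10, n: int = 1024) -> None:
        self.n = n
        self.blocks: List[List[Optional[DiskAddress]]] = [
            [None] * n for _ in range(num_blocks)
        ]

    def locate(self, primary_key: int) -> Tuple[int, int]:
        # quotient -> block subscript, remainder -> 0-based slot in the block
        return divmod(primary_key, self.n)

    def put(self, primary_key: int, address: DiskAddress) -> None:
        block, slot = self.locate(primary_key)
        self.blocks[block][slot] = address

    def get(self, primary_key: int) -> Optional[DiskAddress]:
        block, slot = self.locate(primary_key)
        return self.blocks[block][slot]


index = SparseMemoryIndex()
index.put(1028, (3, 68))
```

Note that the slot returned by `locate` is 0-based, so the remainder 4 corresponds to the 5th position described in the text.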

In an embodiment, if the memory managed by the storage engine meets the condition of storing the target index, the method can directly store the target index into the memory in a memory dense index manner, which is more convenient and efficient.

First, the storage engine may determine the primary key of the target index as the target storage space identifier corresponding to the target index. Further, the storage engine may store the auxiliary key and the attribute data storage address of the target index in the target sub-storage space corresponding to the target storage space identifier. That is, when the memory space is sufficient and a dense memory index is used, the primary key of an index is equal to its storage space identifier, and the sub-storage space corresponding to each storage space identifier stores the auxiliary key and the attribute data storage address of only one index. In other words, when the memory space is sufficient and a dense memory index is used, the preset value N used when calculating the storage space identifier from the primary key of the index is 1.

Taking the memory object as an array object, the array may include 1024 blocks corresponding to array indices 0-1023. If the primary key of the target index is 0, the secondary key and the attribute data storage address of the target index may be stored in a block with a subscript of 0. If the primary key of the target index is 10, the secondary key and the attribute data storage address of the target index may be stored in a block with a subscript of 10. If the primary key of the target index is 220, the secondary key of the target index and the attribute data storage address may be stored in a block with a subscript of 220.
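The choice between the dense and sparse layouts, and the dense layout itself, can be sketched as follows. The condition used here (remaining capacity greater than the index size) is one of the conditions mentioned earlier; the function `choose_n`, the class `DenseMemoryIndex`, and the dictionary used as an auxiliary key are illustrative assumptions.

```python
from typing import Any, List, Optional, Tuple


def choose_n(remaining_bytes: int, index_bytes: int, sparse_n: int = 1024) -> int:
    """Return N = 1 (dense index) if memory suffices, else the sparse N."""
    return 1 if remaining_bytes > index_bytes else sparse_n


class DenseMemoryIndex:
    """Dense layout: the primary key is used directly as the array subscript."""

    def __init__(self, num_blocks: int = 1024) -> None:
        self.blocks: List[Optional[Tuple[Any, int]]] = [None] * num_blocks

    def put(self, primary_key: int, auxiliary_key: Any, attr_addr: int) -> None:
        # With N = 1, primary_key // N == primary_key, so no division is needed.
        self.blocks[primary_key] = (auxiliary_key, attr_addr)


dense = DenseMemoryIndex()
dense.put(220, {"is_out_edge": True}, 4096)
```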

In addition, in an embodiment, on the premise of using a memory sparse index and a disk secondary index, the application can also use a multi-layer storage structure of the LSM tree to manage the index.

Referring to fig. 3, fig. 3 is a schematic diagram of a storage structure of an LSM tree according to an exemplary embodiment. As shown in FIG. 3, multiple storage layers based on LSM trees may be included in the disk, such as Level 0-Level M layers, M being an integer greater than or equal to 1. In general, the storage capacity of each storage layer in the disk may gradually increase from the Level 0 layer to the Level M layer, and the storage capacity of each layer may be 10 times that of the previous layer, which is not specifically limited in this specification. As shown in fig. 3, a plurality of files are stored in at least some storage layers of the plurality of storage layers, for example, a file 1 is stored in a Level 0 layer, a file 2, a file 3, etc. are stored in a Level 1 layer, a file 4, a file 5, etc. are stored in a Level 2 layer, a file 6, etc. are stored in a Level M layer, which is not particularly limited in this specification. Wherein, each file can contain a plurality of auxiliary keys and attribute data storage addresses corresponding to a plurality of indexes.

It should be noted that indexes are generally written and updated in batches, and each batch operation forms an independent file that may contain the auxiliary keys and attribute data storage addresses corresponding to the indexes written or updated in that batch. In an illustrated embodiment, the auxiliary keys and attribute data storage addresses within the same block of the array are often written to disk together in the form of a file. Every time an index is written or updated, the disk addresses of the new auxiliary key and attribute data storage address are appended after the existing value in the array, forming a value list that contains a series of disk addresses of auxiliary keys and attribute data storage addresses. After multiple updates the value list may become very long, so, for the sake of memory capacity and read performance, the value list needs to be compacted (merged) so that data corresponding to the same array subscript is gathered together as much as possible.
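One way to picture value list compaction is shown below. The sketch assumes that each disk file is a list of `(primary key, payload)` records, that a value list entry is a `(file identifier, position)` pair ordered from oldest to newest, and that a newer record for the same primary key supersedes older ones. None of these details are fixed by this specification; they are chosen only to make the idea concrete.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, str]    # (primary key, serialized payload), assumed shape
Address = Tuple[int, int]   # (file identifier, position in file), assumed shape


def compact_value_list(files: Dict[int, List[Record]],
                       value_list: List[Address],
                       new_file_id: int) -> List[Address]:
    """Gather the records of one block into a single new file.

    Returns the shortened value list pointing into the new file.
    """
    latest: Dict[int, str] = {}
    for file_id, pos in value_list:       # oldest entry first
        primary_key, payload = files[file_id][pos]
        latest[primary_key] = payload     # a newer entry overwrites an older one
    merged = sorted(latest.items())       # keep records ordered by primary key
    files[new_file_id] = merged
    return [(new_file_id, pos) for pos in range(len(merged))]


files = {1: [(1024, "a"), (1025, "b")], 2: [(1024, "a2")]}
value_list = [(1, 0), (1, 1), (2, 0)]
compacted = compact_value_list(files, value_list, new_file_id=3)
```

After compaction the block's data sits in one file, so a read touches fewer files and the in-memory list is shorter.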

In an illustrated embodiment, in a case where file merging is required for a first storage layer of the plurality of storage layers, the storage engine may select a first file to be merged from a plurality of first files stored in the first storage layer; the first file to be merged may include the auxiliary key and the attribute data storage address corresponding to the first index.

Further, the storage engine may determine, from a plurality of second files stored in a second storage layer of the plurality of storage layers, a second file that also includes an auxiliary key and an attribute data storage address corresponding to the first index, and merge the first file to be merged with the second file. The second storage layer is the next layer adjacent to the first storage layer.

In an embodiment, the merging operation in the LSM tree may be triggered by an asynchronous thread. Each compaction may first collect the number and size of the files in each storage layer and calculate, according to a preset algorithm, a priority score for each storage layer that needs to be compacted; the higher the score, the higher the priority. For example, the ratio between the number of files in each storage layer and the storage capacity of that layer (number of files / storage capacity) may be calculated; the greater the ratio, the higher the priority.
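The example scoring rule (number of files divided by storage capacity) can be sketched as follows. The sketch assumes the capacity of each layer is expressed in the same unit as the file count, which this specification does not state; the function name is hypothetical.

```python
from typing import List, Tuple


def pick_level_to_compact(levels: List[Tuple[int, int]]) -> int:
    """Return the subscript of the storage layer with the highest score.

    `levels[i]` is (number of files, storage capacity) for Level i.
    """
    scores = [count / capacity for count, capacity in levels]
    return max(range(len(levels)), key=scores.__getitem__)


# Level 0: 3 files of 4; Level 1: 25 of 40; Level 2: 100 of 400.
chosen = pick_level_to_compact([(3, 4), (25, 40), (100, 400)])
```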

For example, taking merging the index file 2 stored in the Level 1 layer shown in fig. 3 as an example, the file 2 includes an auxiliary key and an attribute data storage address corresponding to the primary key 1, and if the files 4 and 5 stored in the Level 2 layer also include auxiliary keys and attribute data storage addresses corresponding to the primary key 1, the storage engine may merge the file 2 stored in the Level 1 layer with the files 4 and 5 stored in the Level 2 layer.
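The merge in this example can be sketched as below. Each file is modeled as a dictionary from primary key to record, and records from the upper layer are assumed to be newer and therefore to win on conflict; that precedence rule and the naming of the merged file are assumptions for illustration, not statements of this specification.

```python
from typing import Dict

IndexFile = Dict[int, str]  # primary key -> serialized record, assumed shape


def merge_into_next_level(upper: Dict[str, IndexFile],
                          lower: Dict[str, IndexFile],
                          file_name: str) -> str:
    """Merge one file of the upper layer with overlapping files of the next layer."""
    source = upper.pop(file_name)
    # Files in the next layer that contain at least one of the same primary keys.
    overlapping = [name for name, f in lower.items() if source.keys() & f.keys()]
    merged: IndexFile = {}
    for name in overlapping:
        merged.update(lower.pop(name))
    merged.update(source)  # assumed: the upper layer holds the newer records
    new_name = "+".join([file_name, *overlapping])
    lower[new_name] = merged
    return new_name


level1 = {"file2": {1: "new", 7: "x"}, "file3": {50: "y"}}
level2 = {"file4": {1: "old", 2: "b"}, "file5": {1: "older", 3: "c"}, "file6": {90: "z"}}
merged_name = merge_into_next_level(level1, level2, "file2")
```

Here file 2 is merged with files 4 and 5 because all three contain primary key 1, while file 6 is left untouched.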

As described above, first, merging operations can be performed automatically on the index files of each layer in the disk through the LSM tree, which reduces duplicate data. Second, reading of index files can be accelerated. Third, because the DFS does not support random writes, a conventional B+ tree would require the distributed file system to support random writes, whereas the LSM tree structure has no such requirement and thus avoids the problem.

Next, a graph data query method in the present application will be explained based on the graph data index management method discussed above.

First, a query engine in a graph database may receive query statements for the graph database. Further, the query engine may parse the query statement to obtain graph data identifications and query conditions associated with the graph data to be queried contained in the query statement.

Further, the query engine can send the graph data identification and query conditions, which are obtained through analysis and are related to the graph data to be queried, to the storage engine. Accordingly, the storage engine receives the graph data identification and the query condition.

Further, the storage engine may normalize the graph data identifier related to the graph data to be queried to obtain a primary key corresponding to the index of the graph data to be queried.

Further, the storage engine may calculate, according to the primary key, a storage space identifier corresponding to the index of the graph data to be queried, determine the sub-storage space corresponding to the storage space identifier, and obtain the data stored in the sub-storage space.

Further, the storage engine may determine whether the data stored in the sub-storage space are all disk address data, and query target graph data satisfying the query condition based on the determined result.

In an embodiment, if the data stored in the sub-storage space consists of the disk addresses of a plurality of auxiliary keys and attribute data storage addresses, that is, a sparse memory index plus a secondary disk index is currently adopted, the storage engine may read the auxiliary keys and attribute data storage addresses from the disk based on these disk addresses. It should be understood that the auxiliary keys and attribute data storage addresses stored in the disk are serialized data, so after reading them from the disk the storage engine needs to deserialize them to obtain usable auxiliary keys and attribute data storage addresses.

Further, the storage engine may search the acquired auxiliary keys for an auxiliary key whose metadata satisfies the above query condition, and acquire the attribute data of the graph data to be queried according to the attribute data storage address corresponding to that auxiliary key.

In an illustrated embodiment, if the data stored in the sub-storage space consists of a plurality of auxiliary keys and attribute data storage addresses, that is, a dense memory index is currently adopted, the storage engine may directly search these auxiliary keys for an auxiliary key whose metadata satisfies the above query condition, and acquire the attribute data of the graph data to be queried according to the attribute data storage address corresponding to that auxiliary key.
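The two query branches can be sketched in one function. The sketch assumes that a disk address is a `(file identifier, file offset)` tuple, that the disk is modeled as a dictionary from such addresses to JSON-serialized bytes, and that an in-memory (dense) entry is already a dictionary with `aux` and `attr_addr` fields. These shapes and the name `query_block` are hypothetical.

```python
import json
from typing import Any, Callable, Dict, List, Tuple

N = 1024  # entries per block in the sparse layout (value from the example in the text)


def query_block(blocks: List[List[Any]],
                disk: Dict[Tuple[int, int], bytes],
                primary_key: int,
                condition: Callable[[dict], bool],
                n: int = N) -> List[int]:
    """Return attribute data storage addresses whose auxiliary key matches."""
    entries = [e for e in blocks[primary_key // n] if e is not None]
    if entries and all(isinstance(e, tuple) for e in entries):
        # Sparse memory index: every entry is a disk address, so the
        # records are read from disk and deserialized first.
        records = [json.loads(disk[address]) for address in entries]
    else:
        # Dense memory index: the records are already in memory.
        records = entries
    return [r["attr_addr"] for r in records if condition(r["aux"])]


disk = {
    (1, 0): json.dumps({"aux": {"out_edge": True}, "attr_addr": 100}).encode(),
    (1, 40): json.dumps({"aux": {"out_edge": False}, "attr_addr": 200}).encode(),
}
sparse_blocks: List[List[Any]] = [[None] * N for _ in range(2)]
sparse_blocks[1][0] = (1, 0)
sparse_blocks[1][4] = (1, 40)
sparse_hits = query_block(sparse_blocks, disk, 1024, lambda aux: aux["out_edge"])
```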

For example, the process of querying graph data is described below for the case of a sparse memory index plus a secondary disk index, in which the memory object is an array object, the array includes 10 blocks in total corresponding to array subscripts 0-9, and the block corresponding to each array subscript stores the disk addresses of 1024 auxiliary keys and attribute data storage addresses.

First, the query condition parsed from the query statement may be: query the transaction records with a transfer amount of more than 500 yuan from October to December 2023. If the primary key obtained by normalizing the graph data identifier parsed from the query statement is 1024, the storage engine calculates from the primary key that the array subscript is 1.

Further, the storage engine can directly and quickly read out the disk addresses of the 1024 auxiliary keys and attribute data storage addresses from the block corresponding to array subscript 1 in the memory. In this way, by relying on the fast random read and write characteristic of the array, the time complexity of the index lookup can be reduced to O(1), which greatly improves the overall efficiency of the graph data query.

Further, the storage engine may read the 1024 auxiliary keys and attribute data storage addresses from the disk based on the disk addresses that were read out.

Further, according to the metadata contained in each of the 1024 auxiliary keys, the present application may filter out at least one auxiliary key indicating that the graph object is an outgoing edge and that the time information falls within October to December 2023. Then, the storage engine may acquire the corresponding attribute data based on the attribute data storage addresses respectively corresponding to the at least one auxiliary key.
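The metadata filter in this step can be sketched as follows. The `AuxKey` fields are a simplified, assumed subset of the metadata described in this specification, and the attribute data storage address is carried alongside the key for convenience.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class AuxKey:
    is_out_edge: bool  # metadata: whether the graph object is an outgoing edge
    day: date          # metadata: time information of the graph data
    attr_addr: int     # attribute data storage address paired with this key


def select_addresses(keys: Iterable[AuxKey], start: date, end: date) -> List[int]:
    """Keep outgoing edges whose time information lies in [start, end]."""
    return [k.attr_addr for k in keys if k.is_out_edge and start <= k.day <= end]


keys = [
    AuxKey(True, date(2023, 10, 5), 10),
    AuxKey(False, date(2023, 11, 1), 20),   # not an outgoing edge
    AuxKey(True, date(2023, 9, 30), 30),    # outside the time range
    AuxKey(True, date(2023, 12, 31), 40),
]
selected = select_addresses(keys, date(2023, 10, 1), date(2023, 12, 31))
```

Conditions on the attribute data itself, such as the transfer amount, would be checked only after the attribute data has been read from the selected addresses.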

Further, the storage engine may also send the obtained attribute data to the query engine, so that the query engine can perform further graph data query and graph calculation based on the attribute data, which is not specifically limited in this specification.

Corresponding to the implementation of the method flow, the embodiment of the specification also provides an index management device for the graph database. Referring to fig. 4, fig. 4 is a schematic structural diagram of an index management apparatus for a graph database according to an exemplary embodiment, and the apparatus 40 may be applied to the storage engine 100 shown in fig. 1. As shown in fig. 4, the apparatus 40 includes:

an obtaining unit 401, configured to obtain a target index to be stored, which corresponds to target graph data; the target index comprises a primary key, an auxiliary key and an attribute data storage address of the target graph data; the primary key is data obtained after mapping the graph data identifier of the target graph data; the auxiliary key comprises metadata corresponding to the target graph data;

and the storage unit 402 is configured to store the auxiliary key and the attribute data storage address in a disk, and store the disk addresses of the auxiliary key and the attribute data storage address in the disk in a memory.

In an embodiment, the memory includes a predefined memory object for storing the graph data index; the storage space corresponding to the memory object is divided into a plurality of sub-storage spaces, and each sub-storage space corresponds to a different storage space identifier; the apparatus 40 further comprises a computing unit 403 for:

Calculating a target storage space identifier corresponding to the target index based on the primary key;

the storage unit 402 is specifically configured to:

and storing the auxiliary key and the disk address of the attribute data storage address in the disk into a target sub-storage space corresponding to the target storage space identifier in a memory.

In an illustrated embodiment, the memory object is an array object, and a storage space corresponding to the array object comprises a plurality of storage blocks respectively corresponding to different array subscripts;

the computing unit 403 is specifically configured to: calculating a target array index corresponding to the target index based on the primary key;

the storage unit 402 is specifically configured to: and storing the auxiliary key and the disk address of the attribute data storage address in the disk into a target storage block corresponding to the target array subscript in a memory.

In an embodiment, the graph object contained in the target graph data is a point, and the graph data identifier is the id of the point; alternatively, the graph object contained in the target graph data is an edge, and the graph data identifier is the id of the starting point of the edge.

In an illustrated embodiment, the metadata includes any one or more of the combinations of data shown below: whether the graph object contained in the target graph data is a point, whether the graph object contained in the target graph data is an incoming edge, a time stamp of the target graph data, a label of the target graph data and time information of the target graph data.

In an illustrated embodiment, the secondary key further includes any one or more of the combinations of data shown below: the writing time of the target graph data, the id of the end point of the target graph data and the system serial number of the target graph data.

In an embodiment, the disk is a local disk, and the disk addresses of the auxiliary key and the attribute data storage address are local disk addresses;

and/or the disk is a cloud disk, and the disk addresses of the auxiliary key and the attribute data storage address are cloud disk addresses.

In an illustrated embodiment, the primary key is an integer greater than or equal to 0; the computing unit 403 is specifically configured to:

divide the primary key by a preset value N and round down, and take the resulting value as the target storage space identifier corresponding to the target index; the sub-storage space corresponding to each storage space identifier is used for storing the disk addresses of N auxiliary keys and attribute data storage addresses.

In an illustrated embodiment, the apparatus 40 further comprises a graph data querying unit 404 configured to:

receiving a query statement aiming at the graph database, and analyzing to obtain graph data identifiers and query conditions, which are contained in the query statement and are related to graph data to be queried;

mapping the graph data identifier to obtain a primary key corresponding to the index of the graph data to be queried, and further calculating according to the primary key to obtain a storage space identifier corresponding to the index;

determining a sub storage space corresponding to the storage space identifier, and acquiring disk addresses of a plurality of auxiliary keys and attribute data storage addresses stored in the sub storage space;

reading the auxiliary keys and the attribute data storage addresses from the disk based on the disk addresses;

In an embodiment, the disk includes a plurality of storage layers based on an LSM tree, and at least some of the storage layers store a plurality of files, each file including a secondary key and an attribute data storage address corresponding to a plurality of indexes.

In an illustrated embodiment, the apparatus 40 further comprises a file merging unit 405 configured to:

selecting a first file to be combined from a plurality of first files stored in a first storage layer in the plurality of storage layers; the first file to be combined contains an auxiliary key and an attribute data storage address corresponding to the first index;

determining a second file containing an auxiliary key and an attribute data storage address corresponding to the first index from a plurality of second files stored in a second storage layer of the plurality of storage layers, and merging the first file to be merged with the second file; the second storage layer is the next layer adjacent to the first storage layer.

The implementation process of the functions and roles of the units in the above device 40 is specifically described in the above corresponding embodiments of fig. 1 to 3, and will not be described in detail herein. It should be understood that the apparatus 40 may be implemented in software, or may be implemented in hardware or a combination of hardware and software. Taking software implementation as an example, the device in a logic sense is formed by reading corresponding computer program instructions into a memory by a processor (CPU) of the device. In addition to the CPU and the memory, the device in which the above apparatus is located generally includes other hardware such as a chip for performing wireless signal transmission and reception, and/or other hardware such as a board for implementing a network communication function.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical modules; that is, they may be located in one place or distributed over a plurality of network modules. Some or all of the units or modules may be selected according to actual needs to achieve the purposes of this specification. Those of ordinary skill in the art can understand and implement this without creative effort.

The apparatus, units, modules illustrated in the above embodiments may be implemented in particular by a computer chip or entity or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, vehicle-mounted computer, or a combination of any of these devices.

Corresponding to the method embodiments described above, embodiments of the present disclosure also provide a computer device. Referring to fig. 5, fig. 5 is a schematic structural diagram of a computer device according to an exemplary embodiment. The computer device shown in fig. 5 may be equipped with a storage engine, where a memory managed by the storage engine includes a predefined memory object for storing the graph data index; the storage space corresponding to the memory object comprises a plurality of sub-storage spaces respectively corresponding to different storage space identifiers. As shown in fig. 5, the computer device includes a processor 1001 and a memory 1002, and may further include an input device 1004 (e.g., keyboard, etc.) and an output device 1005 (e.g., display, etc.). The processor 1001, memory 1002, input devices 1004, and output devices 1005 may be connected by a bus or other means. As shown in fig. 5, the memory 1002 includes a computer-readable storage medium 1003, which computer-readable storage medium 1003 stores a computer program executable by the processor 1001. The processor 1001 may be a CPU, microprocessor, or integrated circuit for controlling the execution of the above method embodiments. 
The processor 1001, when executing the stored computer program, may perform the steps of the index management method for a graph database in the embodiment of the present specification, including: acquiring a target index to be stored, which corresponds to target graph data; the target index comprises a primary key, an auxiliary key and an attribute data storage address of the target graph data; the primary key is data obtained after mapping the graph data identifier of the target graph data; the auxiliary key comprises metadata corresponding to the target graph data; storing the auxiliary key and the attribute data storage address in a disk, storing the disk addresses of the auxiliary key and the attribute data storage address in the disk in a memory, and the like.

For a detailed description of each step of the index management method for the graph database, please refer to the previous contents, and a detailed description thereof will not be repeated here.

Corresponding to the above method embodiments, embodiments of the present description also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the index management method for a graph database in the embodiments of the present description. Please refer to the above description of the corresponding embodiments of fig. 1-3, and detailed descriptions thereof are omitted herein.

The foregoing description of the preferred embodiments is provided for the purpose of illustration only, and is not intended to limit the scope of the disclosure, since any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the disclosure are intended to be included within the scope of the disclosure.

In a typical configuration, the terminal device includes one or more CPUs, input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data.

Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.

It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, embodiments of the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, embodiments of the present description may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

Claims

1. An index management method for a graph database, the method comprising:

2. The method of claim 1, wherein the memory includes a predefined memory object for storing the graph data index; the storage space corresponding to the memory object is divided into a plurality of sub-storage spaces, and each sub-storage space corresponds to a different storage space identifier; the method further comprises the steps of:

the storing the auxiliary key and the disk address of the attribute data storage address in the disk into a memory includes:

3. The method of claim 2, wherein the memory object is an array object, and the memory space corresponding to the array object includes a plurality of memory blocks respectively corresponding to different array subscripts;

the calculating, based on the primary key, a target storage space identifier corresponding to the target index includes: calculating a target array index corresponding to the target index based on the primary key;

the storing the disk addresses of the auxiliary key and the attribute data storage address in the disk into a target sub-storage space corresponding to the target storage space identifier in the memory includes:

and storing the auxiliary key and the disk address of the attribute data storage address in the disk into a target storage block corresponding to the target array subscript in a memory.

4. The method of claim 1, wherein the graph object contained in the target graph data is a point, and the graph data identifier is the id of the point; alternatively, the graph object contained in the target graph data is an edge, and the graph data identifier is the id of the starting point of the edge.

5. The method of claim 1, wherein the metadata comprises a combination of any one or more of the data shown below: whether the graph object contained in the target graph data is a point, whether the graph object contained in the target graph data is an incoming edge, a time stamp of the target graph data, a label of the target graph data and time information of the target graph data.

6. The method of claim 1, wherein the auxiliary key further comprises a combination of any one or more of the data shown below: the writing time of the target graph data, the id of the end point of the target graph data, and the system serial number of the target graph data.

7. The method of claim 1, wherein the disk is a local disk, and the disk addresses of the auxiliary key and the attribute data storage address are local disk addresses;

8. The method of claim 2, wherein the primary key is an integer greater than or equal to 0; the calculating, based on the primary key, a target storage space identifier corresponding to the target index includes:

9. The method of claim 8, wherein the method further comprises:

10. The method according to any one of claims 1-9, wherein the disk comprises a plurality of storage layers based on an LSM tree, a plurality of files are stored in at least some of the plurality of storage layers, and each file comprises the auxiliary keys and attribute data storage addresses corresponding to a plurality of indexes.

11. The method according to claim 10, wherein the method further comprises:

12. A graph data query method for a graph database, wherein a graph data index in the graph database comprises a primary key, an auxiliary key and an attribute data storage address of graph data; the primary key is data obtained after mapping the graph data identifier related to the graph data; the auxiliary key comprises metadata corresponding to the graph data; the method comprises the following steps:

13. An index management apparatus for a graph database, the apparatus comprising:

14. A graph data query device for a graph database, wherein a graph data index in the graph database comprises a primary key, an auxiliary key and an attribute data storage address of graph data; the primary key is data obtained after mapping the graph data identifier related to the graph data; the auxiliary key comprises metadata corresponding to the graph data; the device comprises:

15. A computer device, comprising: a memory and a processor; the memory stores a computer program executable by the processor; the processor, when running the computer program, performs the method of any one of claims 1 to 12.

16. A computer-readable storage medium, wherein a computer program is stored thereon, and the computer program, when executed by a processor, implements the method according to any one of claims 1 to 12.
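
For illustration only, the storage arrangement recited in claims 1 to 3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `IndexEntry`, `GraphIndex`, `subscript`, `put` and `get` are hypothetical, a Python list stands in for the disk, and the record position in that list stands in for the disk address. The claims as reproduced here do not state how the array subscript is calculated from the primary key, so the modulo operation below is a placeholder assumption, as is keeping the primary key next to each disk address in memory so that entries sharing a storage block can be told apart.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class IndexEntry:
    primary_key: int      # non-negative integer mapped from the graph data identifier
    auxiliary_key: bytes  # metadata (e.g. point/edge flag, label, timestamp)
    attr_address: int     # storage address of the attribute data


class GraphIndex:
    """Keeps (auxiliary key, attribute address) on 'disk' and only disk addresses in memory."""

    def __init__(self, num_blocks: int = 8) -> None:
        # Stand-in for the disk: an append-only list of records.
        self.disk: List[Tuple[bytes, int]] = []
        # Array object in memory: one storage block per array subscript.
        self.blocks: List[List[Tuple[int, int]]] = [[] for _ in range(num_blocks)]

    def subscript(self, primary_key: int) -> int:
        # ASSUMPTION: modulo is a placeholder; the source does not give the formula.
        if primary_key < 0:
            raise ValueError("primary key must be an integer >= 0")
        return primary_key % len(self.blocks)

    def put(self, entry: IndexEntry) -> int:
        # Write the auxiliary key and attribute data storage address to disk.
        self.disk.append((entry.auxiliary_key, entry.attr_address))
        disk_address = len(self.disk) - 1
        # Store only the disk address (plus the primary key, an assumption) in memory.
        block = self.blocks[self.subscript(entry.primary_key)]
        block.append((entry.primary_key, disk_address))
        return disk_address

    def get(self, primary_key: int) -> List[Tuple[bytes, int]]:
        # Locate the storage block, then read the matching records back from disk.
        block = self.blocks[self.subscript(primary_key)]
        return [self.disk[addr] for pk, addr in block if pk == primary_key]


if __name__ == "__main__":
    index = GraphIndex(num_blocks=4)
    index.put(IndexEntry(5, b"point|label=user", 1000))
    index.put(IndexEntry(9, b"edge|out", 2000))
    print(index.get(5))
```

In this sketch the memory footprint per index is one small tuple, while the variable-length auxiliary key stays on disk, which mirrors the stated goal of reducing the memory overhead of the index.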