CN118069365A

CN118069365A - Data rapid loading method, system, equipment and storage medium for distributed graph database

Info

Publication number: CN118069365A
Application number: CN202410253830.9A
Authority: CN
Inventors: 于骞; 付新; 王学海; 徐奇
Original assignee: Dameng Data Technology Jiangsu Co ltd
Current assignee: Dameng Data Technology Jiangsu Co ltd
Priority date: 2024-03-06
Filing date: 2024-03-06
Publication date: 2024-05-24

Abstract

The invention discloses a method, system, device and storage medium for rapid data loading into a distributed graph database. The rowId of the storage system is introduced into the parsing system in advance, and an association-information scheme for double-stored edges in a distributed graph database system is designed, which simplifies cross-node processing logic in the distributed setting and lets the storage system write data to disk quickly according to the pre-allocated rowId. A hash scheme that maintains the mapping between point keys and rowIds under a memory-limited system is also provided; in most scenarios the rowId can be obtained directly, without comparison against on-disk data, so that fast loading is achieved on a small-memory system.

Description

Data rapid loading method, system, equipment and storage medium for distributed graph database

Technical Field

The invention relates to the technical field of database processing, in particular to a data rapid loading method, a system, equipment and a storage medium for a distributed graph database.

Background

In a distributed graph database system, point data is distributed across different storage nodes (in a sharded graph database system, across different shard nodes), so the start and end vertices of an edge may not reside on the same storage unit. There are two common ways to handle this. In the first, an edge is stored only with its start vertex or only with its end vertex; however, if edges are stored with the start vertex and a user queries the adjacent edges of a given end vertex, additional interactions are needed to fetch the edges held with the start vertex. In the second, the edge is stored with both the start vertex and the end vertex. Storing edge information twice inevitably inflates edge storage, but because graph databases are mostly used for data analysis, this approach, which speeds up queries, is currently the mainstream one.

To implement the edge double-storage scheme in a distributed graph database system, each edge must be stored on both the start-vertex and the end-vertex storage units during data loading (if the start and end vertices belong to the same storage unit, only one copy of the edge data is needed). When the same edge is stored on two storage units, some association information must be added to indicate that the edges on the two units are in fact the same edge. Edge information generally includes the key of the start vertex and the key of the end vertex; if two edges on two nodes have the same start vertex and end vertex and belong to the same relationship, they can be regarded as copies of the same edge on different nodes. Relational databases typically use a rowId to uniquely identify a row, and in a native graph store the array subscript at which an edge or point is stored plays a role analogous to the rowId in a relational database (hereinafter, rowId refers to the unique identifier of a piece of data in the graph database). The same edge has a different rowId on each storage node, while the rowIds of its start and end vertices are the same in both copies. Choosing the association information to add to an edge therefore raises the following problems: (1) which association information to add during loading, so that it reflects the relationships between points and edges and between an edge and its copy on another node, without adding so much that maintenance cost rises; (2) how the loading system can use the association information to increase loading speed; (3) how to maintain the association information of massive data in a loading system with limited memory.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method, system, device and storage medium for rapid data loading into a distributed graph database, offering a hash scheme that maintains the mapping between point keys and rowIds under a memory-limited system, so that the rowId can be obtained directly in most scenarios without comparison against on-disk data, thereby achieving fast loading on a small-memory system.

In order to solve the technical problems, the invention provides a data rapid loading method for a distributed graph database, which comprises the following steps:

Step 1, a point or edge has a unique identifier, rowId, within its storage node that indicates its position, and the start and end vertices of an edge are represented by recording their rowIds;

Step 2, when the same edge is stored twice, the copy on one storage unit maintains its relationship to the other copy by recording the rowId of the same edge on the other storage unit;

Step 3, a hash table is selected to maintain the key-value mapping, with the key of a point as the key in the hash and the rowId of the point as the value, so that the pair <key, rowId> maintains the mapping from point key to rowId.

Preferably, in step 1, a cross-node edge carries its own information together with its own rowId, the rowId of the start vertex, the rowId of the end vertex, and the rowId of the copy of the same edge stored at the remote end.

Preferably, in step 3, a hash table is selected to maintain the key-value mapping, and the process of inserting a point comprises the following steps:

(a) Combine key-i and rowId-i into a record, insert it into the file buffer and obtain the record's file offset lsn-i; if the file buffer is full, hand it to the flush thread to be flushed to disk asynchronously and write the record into the second file buffer;

(b) Compute the hash value of key-i and take it modulo the number M of non-conflict array units to obtain the non-conflict array subscript arr-i; if the conflict flag of array unit arr-i is false and its other fields are in the initialized state, the unit holds no data yet and no hash conflict has occurred, so lsn-i and rowId-i are recorded in the corresponding fields of the unit and the insertion is complete;

(c) If the conflict flag of array unit arr-i in step (b) is false but its other fields are not in the initialized state, the unit already holds data and a hash conflict has occurred; the contents of unit arr-i are copied into the conflict array, whose units are used sequentially starting from position 0; the subscript of the copied data in the conflict array is recorded as pre-i, lsn-i and pre-i are stored in the corresponding fields of unit arr-i, its conflict flag is set to true, and the insertion is complete;

(d) If the conflict flag of array unit arr-i in step (b) is true, hash conflicts have already occurred at this unit; proceeding as in step (c), the existing data is moved to the conflict array and the new data is inserted.

Preferably, in step 3, a hash table is selected to maintain the key-value mapping, and the process of searching for a point comprises the following steps:

(e) Compute the hash value of key-i and take it modulo the number M of non-conflict array units to obtain the non-conflict array subscript arr-i; if the conflict flag of arr-i is false and its lsn field is in the initialized state, no rowId-i has been stored for this key and the query returns no data;

(f) If the conflict flag of arr-i in step (e) is false and its lsn field is not in the initialized state, no hash conflict has occurred at this unit; bytes 9-16 contain the rowId-i being queried, so the query ends and rowId-i is returned;

(g) If the conflict flag of arr-i in step (e) is true, a hash conflict has occurred at this unit; lsn-i is used to locate the record in the file, its key is read and compared with key-i, and if they are equal the query ends and the rowId-i stored in the file is returned;

(h) If key-i differs from the key read in step (g), pre-i is read from arr-i, pre-i being the subscript in the conflict array of the previous conflicting value; letting arr-t be the conflict array unit at pre-i, the key in the file at lsn-t is read and compared with key-i; if they are not equal, pre-t is followed to the preceding conflicting entry, and this repeats until a unit whose conflict flag is false is reached, which indicates that there is no earlier conflicting data.

Correspondingly, a rapid data loading system for a distributed graph database comprises a file parsing system, a rowId allocation system, a hash system and a graph database system. The file parsing system reads the data file and parses the text data into point or edge information according to the description information of the table; the rowId allocation system allocates a unique identifier to each piece of parsed data; the hash system maintains the mapping between point keys and rowIds; and the graph database system persists the loaded data.

Preferably, the hash system maintains the mapping between point keys and rowIds: the key of a point serves as the key in the hash, the rowId of the point serves as the value, and the pair <key, rowId> maintains the mapping from point key to rowId.

Preferably, the hash table is selected to maintain a mapping relationship of key-value pairs, and the inserting process for the points comprises the following steps:

Preferably, the hash table is selected to maintain a mapping relationship of key-value pairs, and the searching process for the point comprises the following steps:

Correspondingly, a rapid data loading device for a distributed graph database comprises: one or more processors;

a storage means for storing one or more programs and user data;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the rapid data loading method for a distributed graph database as recited in any one of claims 1 to 4.

Correspondingly, a storage medium for rapid data loading into a distributed graph database stores a computer program which, when executed by a processor, implements the rapid data loading method for a distributed graph database according to any one of claims 1 to 4.

The beneficial effects of the invention are as follows. By introducing the rowId of the storage system into the parsing system in advance, an association-information scheme for double-stored edges in a distributed graph database system is designed, cross-node processing logic in the distributed setting is simplified, and the storage system writes data to disk quickly according to the pre-allocated rowId. A hash scheme that maintains the mapping between point keys and rowIds under a memory-limited system is provided; in most scenarios the rowId can be obtained directly, without comparison against on-disk data, so that fast loading is achieved on a small-memory system.

Drawings

FIG. 1 is a schematic flow chart of the method of the present invention.

Fig. 2 is a schematic diagram of a system structure according to the present invention.

Fig. 3 is a diagram showing a hash insert according to the present invention.

Fig. 4 is a diagram illustrating a hash query according to the present invention.

Detailed Description

As shown in fig. 1, a method for quickly loading data to a distributed graph database includes the following steps:

As shown in Fig. 2, a rapid data loading system for a distributed graph database comprises a file parsing system, a rowId allocation system, a hash system and a graph database system. The file parsing system reads the data file and parses the text data into point or edge information according to the description information of the table; the rowId allocation system allocates a unique identifier to each piece of parsed data; the hash system maintains the mapping between point keys and rowIds; and the graph database system persists the loaded data.

The invention provides a method and system for rapid loading into a distributed graph database. In a distributed graph database system that requires edge double storage, rowId is used to design compact association information for points and edges that maintains their mutual dependencies. The rowId information linking points and edges is set in advance, while the point and edge data is being parsed, which simplifies rowId handling on the storage nodes. For point information, an efficient hash scheme maintains the mapping between point keys and rowIds, and a persistent variant of the scheme is developed for memory-limited systems to solve the problem of insufficient memory during loading. The specific design is as follows:

A point or edge has a unique identifier, rowId, within its storage node that indicates its position, so the start and end vertices of an edge can be represented by recording their rowIds. Similarly, under edge double storage, the copy of an edge on one storage unit can maintain its relationship to the other copy by recording the rowId of the same edge on the other storage unit. In summary, a cross-node edge carries its own information together with its own rowId, the rowId of the start vertex, the rowId of the end vertex, and the rowId of the copy of the same edge stored at the remote end. The data structure of an edge can be simplified as follows:

e-info:e-rowId:v-start-rowId:v-end-rowId:e-remote-rowId

Likewise, the data structure of the points can be simplified as follows:

v-info:v-rowId

Here info contains the original key and value information, v-start-rowId and v-end-rowId are the rowIds of the start and end vertices, and e-remote-rowId is the rowId of the remote copy of the edge.
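The two record layouts above can be pictured with a minimal Python sketch. The class and field names (`VertexRecord`, `EdgeRecord`, `are_copies_of_same_edge`) are illustrative assumptions, not names from the invention, and `info` is modelled as a plain dictionary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VertexRecord:
    info: dict      # original key and value information (v-info)
    row_id: int     # v-rowId: position of the vertex on its storage node


@dataclass
class EdgeRecord:
    info: dict                            # original key and value information (e-info)
    row_id: int                           # e-rowId on the node holding this copy
    start_row_id: int                     # v-start-rowId
    end_row_id: int                       # v-end-rowId
    remote_row_id: Optional[int] = None   # e-remote-rowId; None if stored only once


def are_copies_of_same_edge(a: EdgeRecord, b: EdgeRecord) -> bool:
    """True when a and b are the two stored copies of one cross-node edge."""
    return (
        a.start_row_id == b.start_row_id
        and a.end_row_id == b.end_row_id
        and a.remote_row_id == b.row_id
        and b.remote_row_id == a.row_id
    )
```

With this layout, two copies are recognised as the same edge purely from fixed-width rowIds, without comparing vertex keys.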

Normally rowId is generated by the storage node itself and assigned to newly inserted data; in MySQL, for example, rowId allocation is performed by the engine layer and is hidden from external users. In the edge double-storage scheme above, rowId replaces the original key information in order to simplify the association information, so the corresponding rowId information must be set before the data is inserted. The parsing system may request rowIds from the storage system in batches during loading, or obtain the storage system's starting rowId before loading begins, and then assign a rowId to each piece of point or edge data as it is parsed. No loading logic other than the parsing system then needs to handle rowId, and the storage system can locate the storage position directly from the pre-allocated rowId and store the loaded data. This simplifies rowId-related processing logic in the storage system and increases its capacity for concurrent processing.
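The batch pre-allocation described above might look like the following sketch. `RowIdAllocator` and `make_fake_storage` are hypothetical helpers introduced only for illustration; how the storage system actually reserves a range (for example by message) is not specified here and is replaced by a local counter.

```python
class RowIdAllocator:
    """Hands out rowIds from ranges reserved with the storage system (hypothetical helper)."""

    def __init__(self, reserve_batch, batch_size=1000):
        # reserve_batch(n) is assumed to return the first rowId of a freshly reserved range of n ids.
        self._reserve = reserve_batch
        self._batch_size = batch_size
        self._next = 0
        self._limit = 0   # exclusive upper bound of the current range

    def allocate(self):
        """Return the next pre-allocated rowId, reserving a new batch when the range is used up."""
        if self._next >= self._limit:
            self._next = self._reserve(self._batch_size)
            self._limit = self._next + self._batch_size
        row_id = self._next
        self._next += 1
        return row_id


def make_fake_storage(start=0):
    """Stand-in for a storage node: returns a reserve function backed by a simple counter."""
    state = {"next": start}

    def reserve(n):
        first = state["next"]
        state["next"] += n
        return first

    return reserve


allocator = RowIdAllocator(make_fake_storage(start=100), batch_size=3)
ids = [allocator.allocate() for _ in range(5)]
```

Because each parsed record receives its rowId before it is sent, the storage side in this sketch never has to generate identifiers itself.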

Because edges in a graph database system depend on points for their existence, the import order is fixed: points must be imported first, and only then can the edge information attached to them be imported. The parsing system is therefore responsible not only for allocating rowIds to the imported data but also for maintaining the mapping between the keys of point data and their rowIds, so that when edge data is imported later, the rowIds of an edge's start and end vertices can be obtained from their keys. A hash table is the usual choice for maintaining key-value mappings: the key of a point serves as the key in the hash, the rowId of the point serves as the value, and the pair <key, rowId> maintains the mapping from point key to rowId. The number of points in a graph database can reach hundreds of millions or even a billion. Assuming an average key length of 100 bytes, a hash table holding points on the order of one hundred million requires nearly 10 GB, and because of the dependency between points and edges, the hash tables of several labels must be kept in the loading system for some time, so loads at the scale of hundreds of millions cannot be completed on a small-memory system. A hash system scheme that combines the characteristics of the graph database and rowId is designed as follows:

Assume that the number of vertices under a given label is N (an estimated value). Create a one-dimensional array of N x M units (M is configurable and takes a floating-point value between 1 and 4), named the non-conflict array, in which each unit occupies 16 bytes. Because an array needs contiguous memory, if the memory cannot be obtained in a single allocation, the array can be subdivided into several smaller arrays. The 16 bytes of each unit are divided as follows: byte 1 is a flag that indicates whether a conflict exists; bytes 2-8 are an offset (lsn) that records the position of the key and rowId in the persistent file; and bytes 9-16 hold either the rowId or a pointer to the previous record (pre). The array unit format is as follows:

1flag:2-8lsn:9-16rowId or pre
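As an illustration of this 16-byte layout, the sketch below packs and unpacks one array unit. The little-endian byte order and the all-ones `LSN_INIT` sentinel for the initialized state are assumptions made for the example; the text does not fix either choice.

```python
UNIT_SIZE = 16
LSN_INIT = (1 << 56) - 1   # assumed sentinel: all seven lsn bytes set means "initialized, no data"


def pack_unit(conflict: bool, lsn: int, value: int) -> bytes:
    """Encode one unit: byte 1 = flag, bytes 2-8 = lsn, bytes 9-16 = rowId or pre."""
    return (
        bytes([1 if conflict else 0])
        + lsn.to_bytes(7, "little")
        + value.to_bytes(8, "little")
    )


def unpack_unit(raw: bytes):
    """Decode a 16-byte unit into (conflict, lsn, value)."""
    if len(raw) != UNIT_SIZE:
        raise ValueError("an array unit must be exactly 16 bytes")
    conflict = raw[0] != 0
    lsn = int.from_bytes(raw[1:8], "little")
    value = int.from_bytes(raw[8:16], "little")
    return conflict, lsn, value


unit = pack_unit(conflict=True, lsn=60, value=1)
```

A fixed-size encoding like this is what makes the memory footprint independent of key length: the key itself lives only in the persistent file at offset lsn.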

Similarly, create a one-dimensional array of N x P units (P is configurable and takes a floating-point value between 0 and 1), named the conflict array, in which each unit also occupies 16 bytes and is divided in the same way as a non-conflict array unit. In addition, a file buffer of size S is needed (configurable, 8 MB by default). The buffer temporarily holds a batch of key and rowId values; once it is full, its contents are persisted to the file. Several buffers can be prepared and used in rotation, so that a buffer is always available while another is being written to the file. The buffer write format is as follows:

rowId1:key1:rowId2:key2...:rowId-n:key-n
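A minimal sketch of such a buffer is given below. The binary record encoding (8-byte rowId, 2-byte key length, then the key bytes) is an assumption chosen so that records can be read back by offset, and the flush is done synchronously on a single buffer rather than by a separate flush thread with a second buffer. `KeyFileBuffer` is a hypothetical name.

```python
import io
import struct

HEADER = struct.Struct("<QH")   # assumed record header: rowId (8 bytes) + key length (2 bytes)


class KeyFileBuffer:
    """Appends (rowId, key) records and returns each record's file offset (lsn)."""

    def __init__(self, sink, capacity=8 * 1024 * 1024):
        self.sink = sink          # seekable binary file-like object
        self.capacity = capacity  # buffer size S
        self.buf = bytearray()
        self.flushed = 0          # number of bytes already persisted to sink

    def append(self, row_id: int, key: bytes) -> int:
        record = HEADER.pack(row_id, len(key)) + key
        if self.buf and len(self.buf) + len(record) > self.capacity:
            self.flush()
        lsn = self.flushed + len(self.buf)
        self.buf += record
        return lsn

    def flush(self):
        """Persist the buffered records (a real loader would do this asynchronously)."""
        self.sink.seek(0, io.SEEK_END)
        self.sink.write(bytes(self.buf))
        self.flushed += len(self.buf)
        self.buf.clear()

    def read(self, lsn: int):
        """Return (rowId, key) for the record stored at offset lsn."""
        if lsn >= self.flushed:                 # record is still in the memory buffer
            start = lsn - self.flushed
            row_id, klen = HEADER.unpack_from(self.buf, start)
            body = start + HEADER.size
            return row_id, bytes(self.buf[body:body + klen])
        self.sink.seek(lsn)                     # record has been persisted
        row_id, klen = HEADER.unpack(self.sink.read(HEADER.size))
        return row_id, self.sink.read(klen)
```

The offset returned by `append` is the lsn stored in the array units, so a lookup that has to compare keys needs only one seek into the file.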

The file offset at which each key and rowId pair in the buffer will be stored is its lsn, which is recorded in bytes 2-8 of the conflict and non-conflict array units. A hash system based on rowId and the point-edge relationship therefore comprises two arrays, one file buffer and the corresponding persistent file. If system memory is sufficient, all of this information can be kept in memory. If memory is limited, the key and rowId pairs are persisted to disk through the file buffer first, and the two arrays can then be persisted as well to reduce memory usage further. Suppose the number of points N is 100 million, the non-conflict array coefficient M = 1.5 and the conflict array coefficient P = 0.5, with the arrays held in system memory and the keys and rowIds persisted using double buffers, which requires 16 MB of buffer space. This scheme ensures that most lookups of rowId by key complete in the non-conflict array without consulting the persistent file of keys and rowIds. Because each array unit occupies only 16 bytes, memory usage is about 3 GB, far less than a hash that stores the keys themselves, and it is independent of key length. On a very small-memory system, the two arrays can also be persisted through memory mapping, reducing memory usage further to the 16 MB of the two buffers. To illustrate the implementation of this hash scheme, the key steps of the insertion and search procedures are given below:
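As a rough check of the memory figures quoted above, before turning to those steps, the arithmetic can be written out as follows. The variable names are illustrative, and the key-storing estimate counts key bytes only, ignoring any per-entry overhead.

```python
N = 100_000_000          # estimated number of points under the label
M, P = 1.5, 0.5          # non-conflict and conflict array coefficients
UNIT_BYTES = 16          # size of one array unit
AVG_KEY_BYTES = 100      # assumed average key length
BUFFER_BYTES = 8 * 1024 * 1024

array_bytes = (int(N * M) + int(N * P)) * UNIT_BYTES   # both arrays together
buffer_bytes = 2 * BUFFER_BYTES                        # double buffering
key_hash_bytes = N * AVG_KEY_BYTES                     # keys alone in a key-storing hash

print(f"arrays:  {array_bytes / 1e9:.1f} GB")
print(f"buffers: {buffer_bytes // (1024 * 1024)} MB")
print(f"keys:    {key_hash_bytes / 1e9:.1f} GB")
```

The two arrays come to 3.2 x 10^9 bytes, which matches the "about 3 GB" figure, against 10^10 bytes for the keys alone.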

Insertion process: insert key-i and rowId-i

Step one: Combine key-i and rowId-i into a record, insert it into the file buffer and obtain the record's file offset lsn-i. If the file buffer is full, hand it to the flush thread to be flushed to disk asynchronously and write the record into the second file buffer.

Step two: Compute the hash value of key-i and take it modulo the number of non-conflict array units (N x M) to obtain the non-conflict array subscript arr-i. If the conflict flag of array unit arr-i is false and its other fields are in the initialized state, the unit holds no data yet and no hash conflict has occurred; record lsn-i and rowId-i in the corresponding fields of the unit, and the insertion is complete.

Step three: If the conflict flag of array unit arr-i in step two is false but its other fields are not in the initialized state, the unit already holds data and a hash conflict has occurred. Copy the contents of unit arr-i into the conflict array, whose units are used sequentially starting from position 0, and record the subscript of the copied data in the conflict array as pre-i. Then store lsn-i and pre-i in the corresponding fields of unit arr-i and set its conflict flag to true; the insertion is complete.

Step four: If the conflict flag of array unit arr-i in step two is true, hash conflicts have already occurred at this unit. Proceed as in step three to move the existing data to the conflict array and insert the new data.
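The four insertion steps can be sketched in memory as follows. This is a simplified model under stated assumptions: units are Python tuples rather than packed 16-byte cells, `None` stands for the initialized state, the persistent file is replaced by a list whose index plays the role of lsn, and `zlib.crc32` is an arbitrary stand-in for the unspecified hash function. `PointHash` is a hypothetical name.

```python
import zlib

EMPTY = (False, None, None)   # (conflict flag, lsn, rowId or pre); None = initialized state


class PointHash:
    def __init__(self, n_units, n_conflict):
        self.units = [EMPTY] * n_units         # non-conflict array
        self.conflict = [EMPTY] * n_conflict   # conflict array, used from position 0
        self.next_conflict = 0
        self.records = []                      # stand-in for the persisted (rowId, key) file

    def slot(self, key: bytes) -> int:
        return zlib.crc32(key) % len(self.units)

    def insert(self, key: bytes, row_id: int) -> None:
        # Step one: append the record and obtain its offset.
        lsn = len(self.records)
        self.records.append((row_id, key))

        # Step two: empty unit, so store lsn and rowId directly.
        i = self.slot(key)
        flag, old_lsn, _ = self.units[i]
        if not flag and old_lsn is None:
            self.units[i] = (False, lsn, row_id)
            return

        # Steps three and four: move the current unit to the conflict array,
        # then store the new lsn with a pointer (pre) to the moved copy.
        if self.next_conflict >= len(self.conflict):
            raise RuntimeError("conflict array is full")
        pre = self.next_conflict
        self.conflict[pre] = self.units[i]
        self.next_conflict += 1
        self.units[i] = (True, lsn, pre)
```

Note that after a conflict the newest rowId is no longer held in the array unit; it is recoverable only from the record at lsn, which is why a conflicting lookup has to read the file.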

Search process: look up the rowId-i corresponding to key-i

Step one: Compute the hash value of key-i and take it modulo the number of non-conflict array units (N x M) to obtain the non-conflict array subscript arr-i. If the conflict flag of arr-i is false and its lsn field is in the initialized state, no rowId-i has been stored for this key and the query returns no data.

Step two: If the conflict flag of arr-i in step one is false and its lsn field is not in the initialized state, no hash conflict has occurred at this unit; bytes 9-16 contain the rowId-i being queried, so the query ends and rowId-i is returned.

Step three: If the conflict flag of arr-i in step one is true, a hash conflict has occurred at this unit. Use lsn-i to locate the record in the file, read its key and compare it with key-i. If they are equal, the query ends and the rowId-i stored in the file is returned.

Step four: If key-i differs from the key read in step three, read pre-i from arr-i; pre-i is the subscript in the conflict array of the previous conflicting value. Let arr-t be the conflict array unit at pre-i: read the key in the file at lsn-t and compare it with key-i. If they are not equal, follow pre-t to the preceding conflicting entry, and repeat until a unit whose conflict flag is false is reached, which indicates that there is no earlier conflicting data.
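The search steps can be sketched over the same simplified model: units are `(flag, lsn, value)` tuples, `None` marks the initialized state, and `records[lsn]` stands in for reading `(rowId, key)` from the persistent file. The `slot` argument is the already-computed subscript arr-i, and the optional `verify` flag is an added assumption that performs the key check the text treats as optional.

```python
def lookup(units, conflict, records, slot, key, verify=False):
    """Return the rowId stored for key, or None if nothing is found."""
    flag, lsn, value = units[slot]

    if not flag:
        if lsn is None:
            return None                    # step one: the unit holds no data
        # Step two: no conflict, so value is the rowId and no file read is needed.
        if verify and records[lsn][1] != key:
            return None
        return value

    # Steps three and four: walk the conflict chain, comparing keys from the file.
    while True:
        row_id, stored_key = records[lsn]
        if stored_key == key:
            return row_id
        flag, lsn, value = conflict[value]   # value was pre: follow it
        if not flag:
            # End of the chain: this unit holds a rowId directly.
            if verify and records[lsn][1] != key:
                return None
            return value


records = [(10, b"a"), (20, b"b"), (30, b"c")]
units = [(True, 2, 1)]
conflict = [(False, 0, 10), (True, 1, 0)]
```

Without `verify`, a key that was never inserted but hashes to an occupied slot returns the rowId at the end of the chain; this mirrors the comparison-free fast path and relies on the loader only looking up keys of points that were imported earlier.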

Note that rowId merely denotes a unique data identifier in the graph database, and its specific content should not be regarded as limiting the scope of the invention. Likewise, the way the persistent file is written and its format are given only as examples for ease of description and do not limit the scope of the invention. The invention introduces rowId into the parsing system in advance as association information and designs a hash scheme that maintains the mapping between keys and rowIds; the scheme suits loading systems with memory of any size and provides a fast, comparison-free lookup in most scenarios (those without conflicts).

Embodiment one: point loading

Step 1: The parsing system obtains the starting rowId of the storage system, or obtains rowId information from the storage system in batches, for example by message.

Step 2: The parsing system parses the data file to obtain the point information and, after verifying that the information is valid, allocates a rowId to each point from the rowId information obtained in step 1.

Step 3: The parsing system stores the key and rowId of each point in the hash system of the invention. An insertion example is shown in Fig. 3:

The insertion steps are as follows:

(1) The data key=18, rowId=600 is newly inserted. It is first written to the file buffer, which yields the offset lsn=60.

(2) Assume that the hash value of the key is 1. The unit at non-conflict array subscript 1 already holds data, so the original data, (1) 50 and 0, is moved to the conflict array and its insertion position preId (1) is recorded.

(3) The new values, (1) 60 and preId, are written at non-conflict array subscript 1.

Step 4: The parsing system sends the point information, including its rowId, to the storage system, which locates the storage position from the rowId and stores the point information.
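The insertion example in step 3 can be replayed with a few lines of Python. Units are written as `(flag, lsn, value)` tuples; the array sizes are arbitrary, and the content of conflict array unit 0, `(0, 20, 200)`, is an assumption taken from the values that appear in the query example of embodiment two.

```python
INIT = (0, None, None)   # initialized (empty) unit


def insert_unit(units, conflict, next_free, slot, lsn, row_id):
    """Apply the insertion rules to one slot; return the next free conflict subscript."""
    flag, old_lsn, _ = units[slot]
    if flag == 0 and old_lsn is None:
        units[slot] = (0, lsn, row_id)      # empty unit: store the rowId directly
        return next_free
    conflict[next_free] = units[slot]       # move the existing unit to the conflict array
    units[slot] = (1, lsn, next_free)       # new unit points back to it through pre
    return next_free + 1


# State before inserting key=18: non-conflict unit 1 already holds (1) 50 with pre 0.
units = [INIT, (1, 50, 0), INIT, INIT]
conflict = [(0, 20, 200), INIT, INIT]
next_free = 1

# key=18, rowId=600 was written to the file buffer at lsn=60 and hashes to slot 1.
next_free = insert_unit(units, conflict, next_free, slot=1, lsn=60, row_id=600)
```

After the call, conflict array unit 1 holds the moved data `(1, 50, 0)` and non-conflict unit 1 holds `(1, 60, 1)`; the rowId 600 itself is kept only in the file record at lsn=60.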

Embodiment two: edge loading

Step 2: The parsing system parses the data file to obtain the edge information and, after verifying that the information is valid, allocates a rowId to each edge from the rowId information obtained in step 1. If the parsing system determines that an edge must be stored across storage units, it clones a remote copy of the edge information and allocates a rowId to it, and each of the two copies records the other's rowId as its remote rowId (remote-rowId). In addition, the parsing system looks up rowIds in the hash system by the keys of the edge's start and end vertices and assigns the correct start and end vertex rowIds to the edge. A query example is shown in Fig. 4:

The query steps are as follows, assuming that the rowId for key=1 is to be found:

(1) Assume that the hash value of the key is 1. The data (1) 60 at array unit 1 is fetched; its highest bit is 1, which indicates a conflict, so the search key is compared with the key at lsn=60 in the file. Assume they are not equal, and the search continues.

(2) The next data to examine is in unit 1 of the conflict array. The unit data (1) 50 is fetched; its highest bit is 1, which indicates that a conflict still exists, so the search key is compared with the key at lsn=50. Assume they are not equal, and the search continues.

(3) The search moves on to unit 0 of the conflict array, whose conflict flag is 0, which indicates that there is no further conflicting data. rowId=200 can be taken as the data being queried; if necessary, the key at lsn=20 can be read to check that it is equal.
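The three query steps can be traced with the sketch below, which also records which file offsets had to be read. The `key_at` dictionary stands in for the persistent file; the key and rowId placed at lsn=50 (9 and 500) are hypothetical values chosen only so that the comparison fails, since the example does not state them.

```python
def trace_lookup(units, conflict, key_at, slot, key):
    """Return (rowId, list of lsn values whose keys were compared)."""
    flag, lsn, value = units[slot]
    visited = []
    if flag == 0:
        # No conflict: value is the rowId, or the unit is still empty.
        return (value if lsn is not None else None), visited
    while flag == 1:
        visited.append(lsn)
        stored_key, stored_row_id = key_at[lsn]
        if stored_key == key:
            return stored_row_id, visited
        flag, lsn, value = conflict[value]   # follow pre into the conflict array
    return value, visited                    # chain end: the unit stores the rowId itself


units = [(0, None, None), (1, 60, 1)]
conflict = [(0, 20, 200), (1, 50, 0)]
key_at = {60: (18, 600), 50: (9, 500), 20: (1, 200)}   # lsn -> (key, rowId)

row_id, visited = trace_lookup(units, conflict, key_at, slot=1, key=1)
```

For key=1 the trace compares keys at lsn=60 and lsn=50, then takes rowId=200 from conflict array unit 0 without a third file read, which corresponds to steps (1) to (3).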

Step 3: The parsing system sends the edge data that needs double storage to the two storage nodes respectively; each storage system locates the storage position from the rowId and stores the edge information.
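The edge preparation in step 2 of this embodiment, cloning a remote copy and cross-linking the two rowIds, might be sketched as follows. Everything named here is hypothetical: `key_to_row_id` stands in for the hash system, `node_of` for whatever placement rule maps a vertex to its storage node, and `alloc` for a per-node rowId allocator.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Edge:
    info: str
    row_id: int
    start_row_id: int
    end_row_id: int
    remote_row_id: Optional[int] = None


def build_edge_copies(info, start_key, end_key, key_to_row_id, node_of, alloc):
    """Return [(node, Edge)]: one copy, or two cross-linked copies for a cross-node edge."""
    start_rid = key_to_row_id[start_key]    # looked up by vertex key
    end_rid = key_to_row_id[end_key]
    start_node, end_node = node_of(start_key), node_of(end_key)

    local = Edge(info, alloc(start_node), start_rid, end_rid)
    if start_node == end_node:
        return [(start_node, local)]        # same storage unit: store once

    remote = replace(local, row_id=alloc(end_node))   # clone with its own rowId
    local.remote_row_id = remote.row_id
    remote.remote_row_id = local.row_id
    return [(start_node, local), (end_node, remote)]


# Example wiring with stand-in helpers.
counters = {0: 100, 1: 500}


def alloc(node):
    row_id = counters[node]
    counters[node] += 1
    return row_id


key_to_row_id = {"a": 1, "b": 2}
copies = build_edge_copies("knows", "a", "b", key_to_row_id,
                           lambda k: 0 if k == "a" else 1, alloc)
```

Each element of `copies` is then sent to its own storage node, which can place the record directly at the pre-allocated rowId.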

Claims

1. The data rapid loading method for the distributed graph database is characterized by comprising the following steps of:

2. The rapid data loading method for a distributed graph database of claim 1, wherein in step 1, a cross-node edge carries its own information together with its own rowId, the rowId of the start vertex, the rowId of the end vertex, and the rowId of the copy of the same edge stored at the remote end.

3. The method for quickly loading data into a distributed graph database according to claim 1, wherein in step 3, a hash table is selected to maintain a mapping relationship between key-value pairs, and the process of inserting points comprises the steps of:

4. The method for quickly loading data into a distributed graph database according to claim 1, wherein in step 3, a hash table is selected to maintain a mapping relationship between key-value pairs, and the process of searching for points includes the steps of:

5. A loading system for the rapid data loading method for a distributed graph database as claimed in claim 1, comprising: a file parsing system, a rowId allocation system, a hash system and a graph database system; the file parsing system reads the data file and parses the text data into point or edge information according to the description information of the table, the rowId allocation system allocates a unique identifier to each piece of parsed data, the hash system maintains the mapping between point keys and rowIds, and the graph database system persists the loaded data.

6. The rapid data loading system for a distributed graph database of claim 5, wherein the hash system maintains the mapping between point keys and rowIds, the key of a point serves as the key in the hash, the rowId of the point serves as the value in the hash, and the pair <key, rowId> maintains the mapping from point key to rowId.

7. The distributed graph database oriented data rapid loading system of claim 6 wherein the hash table is selected to maintain a mapping of key-value pairs, the insertion process for points comprising the steps of:

8. The distributed graph database oriented data rapid loading system of claim 6 wherein the hash table is selected to maintain a mapping of key-value pairs, the lookup process for points comprising the steps of:

9. A distributed graph database oriented data rapid loading device, comprising: one or more processors;

a storage means for storing one or more programs, user data;

10. A data fast loading storage medium for a distributed graph database, wherein a computer program is stored thereon, which when executed by a processor, implements a data fast loading method for a distributed graph database according to any one of claims 1 to 4.