CN118069365A - Data rapid loading method, system, equipment and storage medium for distributed graph database - Google Patents

Data rapid loading method, system, equipment and storage medium for distributed graph database

Info

Publication number
CN118069365A
Authority
CN
China
Prior art keywords
conflict
rowid
key
data
hash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410253830.9A
Other languages
Chinese (zh)
Inventor
于骞
付新
王学海
徐奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dameng Data Technology Jiangsu Co ltd
Original Assignee
Dameng Data Technology Jiangsu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dameng Data Technology Jiangsu Co ltd
Priority to CN202410253830.9A
Publication of CN118069365A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a fast data loading method, system, device and storage medium for a distributed graph database. The rowId of the storage system is introduced into the parsing system in advance, and a scheme for handling the association information of double-stored edges in a distributed graph database system is designed. This simplifies cross-node processing logic in the distributed setting and lets the storage system write data to disk quickly according to the pre-allocated rowId. A hash scheme is also provided for maintaining the relationship between point keys and rowIds in a system with limited memory; in most scenarios the rowId can be obtained directly, without comparison against data on disk, so that fast loading is achieved on a small-memory system.

Description

Data rapid loading method, system, equipment and storage medium for distributed graph database
Technical Field
The invention relates to the technical field of database processing, in particular to a data rapid loading method, a system, equipment and a storage medium for a distributed graph database.
Background
In a distributed graph database system, the start and end vertices of an edge may not reside on the same storage unit, because point data is distributed across different storage nodes (in a sharded graph database system, across different shard nodes). There are two common ways to handle this. In the first, an edge is stored only with its start vertex or only with its end vertex; if the edge is stored with the start vertex and the user queries adjacent edges from the end vertex, an extra interaction is needed to fetch the edge from the node holding the start vertex. In the second, the edge is stored with both the start vertex and the end vertex. Double-storing edge information inevitably inflates edge storage, but because graph databases are mostly used for data analysis, this query-accelerating approach is currently the mainstream one.
To implement the edge double-storage scheme in a distributed graph database system, each edge must be stored on the storage units of both its start vertex and its end vertex during data loading (if the two vertices belong to the same storage unit, only one copy of the edge is stored). When the same edge is stored on two storage units, association information must be added to indicate that the two copies are in fact the same edge. Edge information generally includes the key of the start vertex and the key of the end vertex; if two edges on two nodes have the same start and end vertices and belong to the same relationship, they can be regarded as copies of one edge stored on different nodes. Relational databases typically use a rowId to uniquely identify a row, and the array subscript under which an edge or point is stored in a native graph store plays a role analogous to the rowId in a relational database (hereinafter, rowId refers to the unique identifier of a piece of data in the graph database). The same edge has a different rowId on each storage node, while the rowIds of its start and end vertices should be the same in both copies. Choosing the association information to add to an edge therefore raises the following problems: (1) which association information to add at load time, so that it reflects the relationships between points and edges and between an edge and its copy on another node, without adding so much that maintenance cost rises; (2) how the loading system can use the association information to increase loading speed; (3) how to maintain the association information for massive data in a loading system with limited memory.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a fast data loading method, system, device and storage medium for a distributed graph database. The invention provides a hash scheme that maintains the relationship between point keys and rowIds in a system with limited memory; in most scenarios the rowId can be obtained directly, without comparison against data on disk, so that fast loading is achieved on a small-memory system.
In order to solve the technical problems, the invention provides a data rapid loading method for a distributed graph database, which comprises the following steps:
Step 1, each point or edge has a unique identifier, rowId, within its storage node that represents its position, and the start and end vertices of an edge are represented by recording their rowIds;
Step 2, when the same edge is double-stored, the copy on one storage unit records the rowId of the copy on the other storage unit, thereby maintaining the relationship between the two copies;
Step 3, a hash table is used to maintain key-value mappings: the key of a point serves as the hash key and the rowId of the point as the hash value, so that each pair <key, rowId> maintains the mapping from point key to rowId.
Preferably, in step 1, a cross-node edge carries, in addition to its own information, its own rowId, the rowId of the start vertex, the rowId of the end vertex, and the rowId of the remotely stored copy of the same edge.
Preferably, in step 3, where a hash table is used to maintain the key-value mapping, inserting a point comprises the following steps:
(a) Combining key-i and rowId-i into one record, inserting the record into a file buffer, and obtaining the file offset lsn-i of the record; if the file buffer is full, it is handed to a flush thread to be flushed to disk asynchronously and the record is written into a second file buffer;
(b) Calculating the hash value of key-i and taking it modulo the number of non-conflict array units to obtain the non-conflict array subscript arr-i; if the conflict flag of array unit arr-i is false and its other fields are in the initialized state, the unit holds no data yet and no hash conflict has occurred, so lsn-i and rowId-i are recorded in the corresponding fields of the unit and the insertion is complete;
(c) If the conflict flag of array unit arr-i in step (b) is false but its other fields are not in the initialized state, the unit already holds data and a hash conflict has occurred; the contents of array unit arr-i are then copied into the conflict array, whose units are allocated sequentially starting from position 0; the number of the conflict array unit receiving the copy is recorded as pre-i, lsn-i and pre-i are stored in the corresponding fields of unit arr-i, the conflict flag is set to true, and the insertion is complete;
(d) If the conflict flag of array unit arr-i in step (b) is true, hash conflicts have already occurred at this unit; the existing contents are moved and the new data is inserted as in step (c).
Preferably, in step 3, where a hash table is used to maintain the key-value mapping, looking up a point comprises the following steps:
(e) Calculating the hash value of key-i and taking it modulo the number of non-conflict array units to obtain the non-conflict array subscript arr-i; if the conflict flag of arr-i is false and the lsn of arr-i is in the initialized state, no rowId-i exists for the key and the lookup returns no data;
(f) If the conflict flag of arr-i in step (e) is false and the lsn is not in the initialized state, no hash conflict has occurred at the unit, bytes 9-16 of the unit hold the rowId-i being looked up, the lookup ends, and rowId-i is returned;
(g) If the conflict flag of arr-i in step (e) is true, a hash conflict has occurred at the unit; the key stored in the file is read at the file offset given by the lsn of the unit and compared with key-i, and if they are the same, the lookup ends and the rowId-i stored in the file is returned;
(h) If key-i differs from the key read in step (g), pre-i is read from arr-i; pre-i is the subscript in the conflict array of the previous conflicting entry. Denoting that conflict array unit by arr-t, the key in the file is read through lsn-t and compared with key-i; if they are not equal, the preceding conflicting entry is fetched through pre-t, and so on, until a unit whose conflict flag is false is reached, which indicates that no earlier conflicting entry exists.
Correspondingly, a fast data loading system for a distributed graph database comprises a file parsing system, a rowId allocation system, a hash system and a graph database system. The file parsing system reads the data files and parses the text data into point or edge information according to the table description; the rowId allocation system assigns a unique identifier to each piece of parsed data; the hash system maintains the mapping between point keys and rowIds; and the graph database system persists the loaded data.
Preferably, the hash system maintains the mapping between point keys and rowIds: the key of a point serves as the hash key and the rowId of the point as the hash value, so that each pair <key, rowId> maintains the mapping from point key to rowId.
Preferably, where a hash table is used to maintain the key-value mapping, inserting a point comprises the following steps:
(a) Combining key-i and rowId-i into one record, inserting the record into a file buffer, and obtaining the file offset lsn-i of the record; if the file buffer is full, it is handed to a flush thread to be flushed to disk asynchronously and the record is written into a second file buffer;
(b) Calculating the hash value of key-i and taking it modulo the number of non-conflict array units to obtain the non-conflict array subscript arr-i; if the conflict flag of array unit arr-i is false and its other fields are in the initialized state, the unit holds no data yet and no hash conflict has occurred, so lsn-i and rowId-i are recorded in the corresponding fields of the unit and the insertion is complete;
(c) If the conflict flag of array unit arr-i in step (b) is false but its other fields are not in the initialized state, the unit already holds data and a hash conflict has occurred; the contents of array unit arr-i are then copied into the conflict array, whose units are allocated sequentially starting from position 0; the number of the conflict array unit receiving the copy is recorded as pre-i, lsn-i and pre-i are stored in the corresponding fields of unit arr-i, the conflict flag is set to true, and the insertion is complete;
(d) If the conflict flag of array unit arr-i in step (b) is true, hash conflicts have already occurred at this unit; the existing contents are moved and the new data is inserted as in step (c).
Preferably, where a hash table is used to maintain the key-value mapping, looking up a point comprises the following steps:
(e) Calculating the hash value of key-i and taking it modulo the number of non-conflict array units to obtain the non-conflict array subscript arr-i; if the conflict flag of arr-i is false and the lsn of arr-i is in the initialized state, no rowId-i exists for the key and the lookup returns no data;
(f) If the conflict flag of arr-i in step (e) is false and the lsn is not in the initialized state, no hash conflict has occurred at the unit, bytes 9-16 of the unit hold the rowId-i being looked up, the lookup ends, and rowId-i is returned;
(g) If the conflict flag of arr-i in step (e) is true, a hash conflict has occurred at the unit; the key stored in the file is read at the file offset given by the lsn of the unit and compared with key-i, and if they are the same, the lookup ends and the rowId-i stored in the file is returned;
(h) If key-i differs from the key read in step (g), pre-i is read from arr-i; pre-i is the subscript in the conflict array of the previous conflicting entry. Denoting that conflict array unit by arr-t, the key in the file is read through lsn-t and compared with key-i; if they are not equal, the preceding conflicting entry is fetched through pre-t, and so on, until a unit whose conflict flag is false is reached, which indicates that no earlier conflicting entry exists.
Correspondingly, a fast data loading device for a distributed graph database is characterized by comprising: one or more processors;
a storage means for storing one or more programs and user data;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the fast data loading method for a distributed graph database as recited in any one of claims 1 to 4.
Correspondingly, a storage medium for fast data loading for a distributed graph database is characterized in that a computer program is stored thereon which, when executed by a processor, implements the fast data loading method for a distributed graph database according to any one of claims 1 to 4.
The beneficial effects of the invention are as follows. By introducing the rowId of the storage system into the parsing system in advance, a scheme for handling the association information of double-stored edges in a distributed graph database system is designed; this simplifies cross-node processing logic in the distributed setting and lets the storage system write data to disk quickly according to the pre-allocated rowId. A hash scheme is provided for maintaining the relationship between point keys and rowIds in a system with limited memory; in most scenarios the rowId can be obtained directly, without comparison against data on disk, so that fast loading is achieved on a small-memory system.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of the system structure of the present invention.
FIG. 3 is a diagram showing a hash insertion example according to the present invention.
FIG. 4 is a diagram showing a hash lookup example according to the present invention.
Detailed Description
As shown in FIG. 1, a fast data loading method for a distributed graph database includes the following steps:
Step 1, each point or edge has a unique identifier, rowId, within its storage node that represents its position, and the start and end vertices of an edge are represented by recording their rowIds;
Step 2, when the same edge is double-stored, the copy on one storage unit records the rowId of the copy on the other storage unit, thereby maintaining the relationship between the two copies;
Step 3, a hash table is used to maintain key-value mappings: the key of a point serves as the hash key and the rowId of the point as the hash value, so that each pair <key, rowId> maintains the mapping from point key to rowId.
As shown in FIG. 2, a fast data loading system for a distributed graph database comprises a file parsing system, a rowId allocation system, a hash system and a graph database system. The file parsing system reads the data files and parses the text data into point or edge information according to the table description; the rowId allocation system assigns a unique identifier to each piece of parsed data; the hash system maintains the mapping between point keys and rowIds; and the graph database system persists the loaded data.
The invention provides a method and system for fast loading into a distributed graph database. In a distributed graph database system that requires edge double storage, rowIds are used to design compact association information through which points and edges maintain their mutual dependencies. When point and edge data are parsed, the rowId information linking them is set in advance, which simplifies rowId handling on the storage nodes. For point information, an efficient hash scheme maintains the mapping between point keys and rowIds, and a persistent variant of the scheme is derived for systems with limited memory to address insufficient memory during loading. The specific design is as follows:
A point or edge has a unique identifier, rowId, within its storage node that indicates its position, so the start and end vertices of an edge can be represented by recording their rowIds. Similarly, when an edge is double-stored, the copy on one storage unit can record the rowId of the copy on the other storage unit to maintain the relationship between the two. In summary, a cross-node edge carries its own information, its own rowId, the rowId of the start vertex, the rowId of the end vertex, and the rowId of the copy stored on the remote node. The data structure of an edge can be simplified as follows:
e-info:e-rowId:v-start-rowId:v-end-rowId:e-remote-rowId
Likewise, the data structure of the points can be simplified as follows:
v-info:v-rowId
Here info contains the original key and value information, v-start-rowId and v-end-rowId are the rowIds of the start and end vertices, and e-remote-rowId is the rowId of the remote copy of the edge.
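The two layouts above can be modelled as plain records. The following is a minimal Python sketch, not the patent's implementation: the class names, field names and the `layout()` helper are assumptions introduced for illustration, and `info` is reduced to a single string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VertexRecord:
    info: str      # v-info: original key and value information (simplified to a string)
    row_id: int    # v-rowId

    def layout(self) -> str:
        # Render the record in the simplified "v-info:v-rowId" form.
        return f"{self.info}:{self.row_id}"


@dataclass
class EdgeRecord:
    info: str                             # e-info
    row_id: int                           # e-rowId on this storage unit
    start_row_id: int                     # v-start-rowId
    end_row_id: int                       # v-end-rowId
    remote_row_id: Optional[int] = None   # e-remote-rowId; None when only one copy exists

    def layout(self) -> str:
        # Render "e-info:e-rowId:v-start-rowId:v-end-rowId:e-remote-rowId".
        remote = "" if self.remote_row_id is None else str(self.remote_row_id)
        return f"{self.info}:{self.row_id}:{self.start_row_id}:{self.end_row_id}:{remote}"


edge = EdgeRecord("knows", row_id=7, start_row_id=1, end_row_id=2, remote_row_id=9)
print(edge.layout())
```

Because every reference is a fixed-width rowId rather than a variable-length key, the association information on an edge stays the same size regardless of how long the vertex keys are.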
Typically a rowId is generated by the storage node itself and assigned to newly inserted data; in MySQL, for example, rowId allocation is done by the engine layer and hidden from external users. In the edge double-storage scheme above, rowIds replace the original key information in order to simplify the association information, so the corresponding rowIds must be set before the data is inserted. The parsing system can request rowIds from the storage system in batches during loading, or obtain the storage system's starting rowId before loading begins, and then assign a rowId to each point or edge as it is parsed. No loading logic other than the parsing system then needs to handle rowId-related processing, and the storage system can locate the storage position directly from the pre-allocated rowId and store the loaded data there. This simplifies the rowId-related logic in the storage system and increases its capacity for concurrent processing.
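As an illustration of batch pre-allocation, the sketch below hands out rowIds on the parsing side and only "asks the storage system" when the current batch is exhausted. The class name `RowIdAllocator`, the batch size, and the assumption that batches are contiguous ranges are all hypothetical; the request to the storage system is replaced by a local counter.

```python
class RowIdAllocator:
    """Parser-side rowId allocation from batches (illustrative only)."""

    def __init__(self, start: int, batch_size: int = 1000):
        # `start` stands in for the starting rowId obtained from the storage system.
        self._next_batch_start = start
        self._batch_size = batch_size
        self._current = start
        self._limit = start  # exclusive end of the current batch (empty at first)

    def _request_batch(self) -> None:
        # Placeholder for a round trip to the storage system; here the next
        # batch is simply assumed to be the following contiguous range.
        self._current = self._next_batch_start
        self._limit = self._current + self._batch_size
        self._next_batch_start = self._limit

    def allocate(self) -> int:
        # Called once per parsed point or edge.
        if self._current >= self._limit:
            self._request_batch()
        row_id = self._current
        self._current += 1
        return row_id


allocator = RowIdAllocator(start=100, batch_size=2)
print([allocator.allocate() for _ in range(5)])
```

With this arrangement the storage side never has to generate identifiers during the load; it only writes each record at the position its rowId already names.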
Because edges in a graph database depend on points, the import order is fixed: points must be imported first, and only then can the edges attached to them be imported. The parsing system is therefore responsible not only for allocating rowIds to imported data but also for maintaining the mapping between point keys and rowIds, so that when edge data is imported later, the rowIds of an edge's start and end vertices can be obtained from their keys. A hash table is the usual choice for maintaining key-value mappings: the key of a point is the hash key and the rowId of the point is the hash value, so each pair <key, rowId> maintains the mapping from point key to rowId. The number of points in a graph database can reach hundreds of millions or even billions. Assuming an average key length of 100 bytes, a hash table for one hundred million points needs nearly 10 GB, and because of the dependency between points and edges, the hash tables of several labels must be kept in the loading system for some time, so loading at this scale cannot be completed on a small-memory system. A hash scheme that exploits the characteristics of graph databases and rowIds is therefore designed as follows:
Assume the number of points under a given label is N (an estimate). A one-dimensional array of N × M units is created (M is a configurable floating-point coefficient between 1 and 4) and named the non-conflict array; each array unit occupies 16 bytes. Because an array needs contiguous memory, if the memory cannot be obtained in a single allocation, the array can be split into several smaller arrays. The 16 bytes of each unit are divided as follows: byte 1 is a flag indicating whether a conflict exists, bytes 2-8 are an offset (lsn) recording the position of the key and rowId in the persistent file, and bytes 9-16 hold either the rowId or a pointer (pre) to the previous record. The array unit format is as follows:
1flag:2-8lsn:9-16rowId or pre
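A minimal sketch of this 16-byte layout in Python follows. The function names and the big-endian byte order are assumptions; the text specifies only the field boundaries (byte 1, bytes 2-8, bytes 9-16).

```python
UNIT_SIZE = 16  # bytes per array unit


def pack_unit(conflict: bool, lsn: int, value: int) -> bytes:
    """Encode one unit: byte 1 = flag, bytes 2-8 = lsn, bytes 9-16 = rowId or pre."""
    flag = bytes([1 if conflict else 0])
    # Byte order is an assumption; 7 bytes for the offset, 8 for rowId/pre.
    return flag + lsn.to_bytes(7, "big") + value.to_bytes(8, "big")


def unpack_unit(unit: bytes):
    """Decode a unit back into (conflict, lsn, rowId_or_pre)."""
    if len(unit) != UNIT_SIZE:
        raise ValueError("an array unit must be exactly 16 bytes")
    conflict = unit[0] != 0
    lsn = int.from_bytes(unit[1:8], "big")
    value = int.from_bytes(unit[8:16], "big")
    return conflict, lsn, value


raw = pack_unit(conflict=True, lsn=60, value=1)
print(len(raw), unpack_unit(raw))
```

A 7-byte offset addresses a persistent file of up to 2^56 bytes, so the fixed unit size does not constrain the file size in practice.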
Similarly, a one-dimensional array of N × P units is created (P is a configurable floating-point coefficient between 0 and 1) and named the conflict array; each unit occupies 16 bytes and is divided in the same way as a non-conflict array unit. In addition, a file buffer of size S is needed (configurable, default 8 MB). The buffer temporarily holds a batch of key and rowId values; once it is full, its contents are persisted to the file. Several buffers can be prepared and used in rotation so that a buffer is always available while another is being written to the file. The buffer write format is as follows:
rowId1:key1:rowId2:key2...:rowId-n:key-n
The file offset of each key and rowId record written through the buffer is its lsn, which is stored in bytes 2-8 of the conflict array and non-conflict array units. A hash system based on rowIds and the point-edge relationship therefore consists of two arrays, a file buffer, and the corresponding persistent file. If system memory is sufficient, all of this can be kept in memory; if memory is limited, the key and rowId pairs are persisted to disk through the file buffer first, and the two arrays can then also be persisted to reduce memory usage further. For example, with N = 100 million points, non-conflict array coefficient M = 1.5 and conflict array coefficient P = 0.5, the arrays are kept in memory and the keys and rowIds are persisted using double buffering, which requires 16 MB of buffers. This scheme ensures that most key-to-rowId lookups complete in the non-conflict array without consulting the persistent file. Because each array unit occupies only 16 bytes, memory usage is about 3 GB, far less than a hash table that stores the keys themselves, and it is independent of key length. On a very small memory system, the two arrays can also be persisted through memory mapping, reducing memory usage further to the 16 MB of the two buffers. The key steps of the insertion and lookup procedures are described next.
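The figures in this example can be checked with a few lines of arithmetic. The sketch below only restates the numbers given in the text (N, M, P, 16-byte units, two 8 MB buffers, 100-byte average keys); the variable names are illustrative.

```python
N = 100_000_000        # estimated number of points
M, P = 1.5, 0.5        # non-conflict and conflict array coefficients
UNIT = 16              # bytes per array unit
AVG_KEY = 100          # assumed average key length in bytes

non_conflict_units = int(N * M)
conflict_units = int(N * P)

array_bytes = (non_conflict_units + conflict_units) * UNIT
buffer_bytes = 2 * 8 * 1024 * 1024   # double buffering with 8 MB buffers
key_bytes = N * AVG_KEY              # keys alone in a conventional hash table

print(f"arrays:  {array_bytes / 1e9:.1f} GB")
print(f"buffers: {buffer_bytes // (1024 * 1024)} MB")
print(f"keys in a conventional table: {key_bytes / 1e9:.1f} GB")
```

The array footprint depends only on N, M and P, which is why the text can state that memory usage is independent of key length.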
The insertion process, inserting key-i and rowId-i, comprises the following steps:
Step one: combine key-i and rowId-i into one record, insert the record into a file buffer, and obtain the file offset lsn-i of the record. If the file buffer is full, hand it to a flush thread to be flushed to disk asynchronously and write the record into a second file buffer.
Step two: calculate the hash value of key-i and take it modulo the number of units in the non-conflict array (N × M) to obtain the non-conflict array subscript arr-i. If the conflict flag of array unit arr-i is false and its other fields are in the initialized state, the unit holds no data yet; record lsn-i and rowId-i in the corresponding fields of the unit, and the insertion is complete.
Step three: if the conflict flag of array unit arr-i in step two is false but its other fields are not in the initialized state, the unit already holds data and a hash conflict has occurred. The contents of array unit arr-i are copied into the conflict array, whose units are allocated sequentially starting from position 0. The number of the conflict array unit receiving the copy is recorded as pre-i; lsn-i and pre-i are stored in the corresponding fields of unit arr-i, the conflict flag is set to true, and the insertion is complete.
Step four: if the conflict flag of array unit arr-i in step two is true, hash conflicts have already occurred at this unit; the existing contents are moved and the new data is inserted as in step three.
The lookup process, finding the rowId-i corresponding to key-i, comprises the following steps:
Step one: calculate the hash value of key-i and take it modulo the number of units in the non-conflict array (N × M) to obtain the non-conflict array subscript arr-i. If the conflict flag of arr-i is false and the lsn of arr-i is in the initialized state, no rowId-i exists for the key and the lookup returns no data.
Step two: if the conflict flag of arr-i in step one is false and the lsn is not in the initialized state, no hash conflict has occurred at the unit; bytes 9-16 of the unit hold the rowId-i being looked up, the lookup ends, and rowId-i is returned.
Step three: if the conflict flag of arr-i in step one is true, a hash conflict has occurred at the unit. The key stored in the file is read at the file offset given by the lsn of the unit and compared with key-i; if they are the same, the lookup ends and the rowId-i stored in the file is returned.
Step four: if key-i differs from the key read in step three, pre-i is read from arr-i; pre-i is the subscript in the conflict array of the previous conflicting entry. Denoting that conflict array unit by arr-t, the key in the file is read through lsn-t and compared with key-i; if they are not equal, the preceding conflicting entry is fetched through pre-t, and so on, until a unit whose conflict flag is false is reached, which indicates that no earlier conflicting entry exists.
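The insertion and lookup steps above can be modelled in a few dozen lines. The following Python sketch is a toy in-memory model under stated assumptions, not the patented implementation: the class name `RowIdHash`, the `EMPTY` sentinel for the initialized state, the optional `verify` argument, and the use of a Python list in place of the buffered persistent file (so that lsn is a list index) are all introduced here for illustration.

```python
class RowIdHash:
    """Toy model of the two-array hash scheme (illustrative only)."""

    EMPTY = -1  # assumption: sentinel marking a field in its initialized state

    def __init__(self, main_units: int, conflict_units: int):
        # Each unit is [conflict_flag, lsn, rowid_or_pre].
        self.main = [[False, self.EMPTY, self.EMPTY] for _ in range(main_units)]
        self.conflict = [[False, self.EMPTY, self.EMPTY] for _ in range(conflict_units)]
        self.used = 0   # conflict units are handed out sequentially from 0
        self.log = []   # stand-in for the persistent file; lsn is the list index

    def insert(self, key, row_id):
        lsn = len(self.log)
        self.log.append((row_id, key))                 # step one: append the record
        unit = self.main[hash(key) % len(self.main)]   # step two: locate the unit
        if unit[1] == self.EMPTY:                      # unit unused: store directly
            unit[1], unit[2] = lsn, row_id
            return
        if self.used == len(self.conflict):
            raise MemoryError("conflict array is full")
        pre = self.used                                # steps three and four
        self.conflict[pre] = list(unit)                # copy the old contents
        self.used += 1
        unit[:] = [True, lsn, pre]                     # new entry points back via pre

    def lookup(self, key, verify=False):
        unit = self.main[hash(key) % len(self.main)]
        if unit[1] == self.EMPTY:                      # step one: nothing stored
            return None
        while unit[0]:                                 # steps three and four: walk the chain
            row_id, stored_key = self.log[unit[1]]
            if stored_key == key:
                return row_id
            unit = self.conflict[unit[2]]
        # Step two, or the end of a chain: the unit holds the rowId itself.
        if verify and self.log[unit[1]][1] != key:
            return None
        return unit[2]


table = RowIdHash(main_units=4, conflict_units=4)
for k, r in [(1, 200), (5, 500), (9, 600)]:   # small ints hash to themselves in CPython
    table.insert(k, r)
print(table.lookup(9), table.lookup(5), table.lookup(1), table.lookup(2))
```

In the non-conflict case the lookup never touches the record log, which corresponds to the comparison-free path described in the text. The cost of skipping the comparison is that a key that was never inserted but maps to an occupied unit returns that unit's rowId; the `verify` flag models an optional key check against the file for callers that need to rule this out.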
It should be noted that rowId merely denotes a unique data identifier in the graph database, and its specific content should not be regarded as limiting the scope of the invention; likewise, the writing method and format of the persistent file are given only as examples for convenience of illustration and do not limit the scope of the invention. The invention introduces the rowId into the parsing system in advance as association information and designs a hash scheme for maintaining the mapping between keys and rowIds that suits loading systems of any memory size and provides fast lookup without key comparison in most (non-conflict) scenarios.
Embodiment one: point loading
Step 1: the parsing system obtains the storage system's starting rowId, or obtains rowId information from the storage system in batches, for example by message.
Step 2: the parsing system parses the data file to obtain point information and, after verifying that the information is valid, assigns a rowId to each point from the rowId information obtained in step 1.
Step 3: the parsing system stores the key and rowId of each point in the hash system of the invention. An insertion example is shown in FIG. 3:
The insertion steps are as follows:
(1) To insert the new data key = 18, rowId = 600, the record is first written into the file buffer, yielding the offset lsn = 60.
(2) Assume the hash of the key maps to non-conflict array subscript 1, which already holds data. The existing contents (flag 1, lsn = 50, pre = 0) are moved to the conflict array, and the position at which they are stored there is recorded as preId = 1.
(3) The new values (flag 1, lsn = 60, pre = preId) are written at non-conflict array subscript 1.
Step 4: the parsing system sends the point information, including the rowId, to the storage system, which locates the storage position from the rowId and stores the point information.
Embodiment two: edge loading
Step 1: the parsing system obtains the start rowId of the storage system or obtains rowId information of the storage system in batches by means of a message or the like.
Step 2: The parsing system first parses the data file to obtain the relevant edge information and, after verifying that the information is valid, assigns a rowId to each edge from the rowId information requested in Step 1. If the parsing system determines that an edge must be stored across storage units, it clones a far-end copy of the edge information and allocates a rowId to the copy; each of the two edge records then stores the other's rowId as its own far-end rowId (remote-rowId). In addition, the parsing system must assign the correct start and end vertices to the edge by looking up their rowIds in the hash system using the keys of the start and end vertices. A query example is shown in FIG. 4.
The query steps are as follows, assuming that the rowId for key=1 is being looked up:
(1) Assume that the hash value of the key is 1. The data (1) 60 is fetched from unit 1 of the non-conflict array; the highest bit is 1, which indicates a conflict, so the search key is compared with the key stored at lsn=60 in the file. Assuming they are not equal, the search continues.
(2) The next entry is in unit 1 of the conflict array. The unit data (1) 50 is fetched; the highest bit is 1, which indicates that a conflict still exists, so the search key is compared with the key stored at lsn=50. Assuming the keys are not equal, the search continues.
(3) The search proceeds to unit 0 of the conflict array, whose conflict flag is 0, indicating that there is no further conflict data. rowId=200 can therefore be taken as the result of the query; if necessary, the key at lsn=20 can be read and checked for equality.
Step 3: The parsing system sends the edge data that requires double storage to the two storage nodes respectively; each storage system locates the storage position from the rowId information and stores the edge information.
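The query walk described for FIG. 4 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the actual implementation: the state is written out by hand, the file is modelled as a dictionary from lsn to record, M = 17 and Python's built-in `hash` are chosen only so that the keys collide in unit 1, and the record at lsn=50 (key 35, rowId 350) is made up because the text does not give it. The function name `lookup` and its `verify` parameter are likewise hypothetical.

```python
M = 17  # assumed number of non-conflict array units; 1, 18, 35 map to unit 1

# lsn -> (key, rowId), standing in for records read from the persistent file.
# The record at lsn=50 is invented for illustration.
log = {20: (1, 200), 50: (35, 350), 60: (18, 600)}

# Each unit is (conflict flag, lsn, payload). The payload is a rowId when the
# flag is False and a conflict-array subscript (preId) when it is True.
slots = [None] * M
slots[1] = (True, 60, 1)
overflow = [(False, 20, 200), (True, 50, 0)]


def lookup(key, verify=False):
    """Return the rowId for key, or None if the unit is empty."""
    unit = slots[hash(key) % M]
    if unit is None:
        return None                      # initialized state: nothing stored
    while True:
        conflict, lsn, payload = unit
        if not conflict:
            # End of the chain: the rowId is taken without comparison,
            # unless the optional key check is requested.
            if verify and log[lsn][0] != key:
                return None
            return payload
        stored_key, row_id = log[lsn]    # conflict: compare with key in file
        if stored_key == key:
            return row_id
        unit = overflow[payload]         # follow preId to the earlier entry
```

With this state, `lookup(1)` follows the same three steps as the text: it compares against the keys at lsn=60 and lsn=50, then reaches unit 0 of the conflict array and returns 200. A key that was never inserted but maps to the same unit, such as 52 here, also comes back with 200 unless `verify=True` is passed, which corresponds to the optional key check mentioned in step (3).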

Claims (10)

1. A rapid data loading method for a distributed graph database, characterized by comprising the following steps:
Step 1, each point or edge has a unique identifier rowId in its storage node that represents its position information, and the start vertex and end vertex of an edge are represented by recording the rowIds of the start vertex and the end vertex;
Step 2, when the same edge is double-stored, the edge in one storage unit maintains the relationship between the two copies by recording the rowId of the same edge in the other storage unit;
Step 3, a hash table is selected to maintain the mapping of key-value pairs, with the key of a point as the key in the hash and the rowId of the point as the value in the hash, so that each pair <key, rowId> maintains the mapping from a point's key to its rowId.
2. The rapid data loading method for a distributed graph database according to claim 1, wherein in step 1 a cross-node edge is identified by its own information and rowId, the rowId of its start vertex, the rowId of its end vertex, and the rowId of the copy of the same edge stored at the far end.
3. The rapid data loading method for a distributed graph database according to claim 1, wherein in step 3 a hash table is selected to maintain the mapping of key-value pairs, and the process of inserting a point comprises the following steps:
(a) combining key-i and rowId-i into a record, inserting the record into a file buffer, and obtaining the file offset lsn-i of the record; if the file buffer is full, handing it to a flush thread to be flushed to disk asynchronously and writing the record into a second file buffer;
(b) calculating the hash value of key-i and taking it modulo the number M of non-conflict array units to obtain the non-conflict array subscript arr-i; if the conflict flag of array unit arr-i is false and its other positions are in the initialized state, the unit does not yet store data and no hash conflict has occurred, so lsn-i and rowId-i are recorded at the corresponding positions of the array unit and the insertion is complete;
(c) if the conflict flag of array unit arr-i in step (b) is false but its other positions are not in the initialized state, the unit already stores data and a hash conflict has occurred; the contents of array unit arr-i are then copied into the conflict array, whose units are used sequentially starting from position 0; the number of the conflict array unit that receives the copy is recorded as pre-i, lsn-i and pre-i are stored at the corresponding positions of unit arr-i, the conflict flag is changed to true, and the insertion is complete;
(d) if the conflict flag of array unit arr-i in step (b) is true, indicating that hash conflicts have occurred multiple times, the insertion of the conflicting data and of the new data is completed with reference to step (c).
4. The rapid data loading method for a distributed graph database according to claim 1, wherein in step 3 a hash table is selected to maintain the mapping of key-value pairs, and the process of looking up a point comprises the following steps:
(e) calculating the hash value of key-i and taking it modulo the number M of non-conflict array units to obtain the non-conflict array subscript arr-i; if the conflict flag of arr-i is false and the lsn-i of arr-i is in the initialized state, no rowId-i is found and the query returns no data;
(f) if the conflict flag of arr-i in step (e) is false and lsn-i is not in the initialized state, no hash conflict has occurred at the array unit; bytes 9 to 16 of the unit hold the rowId-i being queried, the query ends, and rowId-i is returned;
(g) if the conflict flag of arr-i in step (e) is true, a hash conflict has occurred at the array unit; the key stored in the file is read at the file offset given by lsn-i and compared with key-i; if they are the same, the query ends and the rowId-i stored in the file is returned;
(h) if key-i differs from the key in step (g), pre-i is read from arr-i, pre-i being the subscript in the conflict array of the previous conflicting entry; denoting the array unit at pre-i as arr-t, the key in the file is read through lsn-t and compared with key-i; if they are not equal, the previous conflicting entry is obtained through pre-t, and so on, until the conflict flag of an array unit is false, which indicates that no earlier conflicting entry exists.
5. A loading system for the rapid data loading method for a distributed graph database according to claim 1, comprising: a file parsing system, a rowId allocation system, a hash system and a graph database system; the file parsing system is responsible for reading the data file and parsing the text data into point or edge information according to the description information of the table; the rowId allocation system is responsible for allocating a unique identifier (ID) to the parsed data; the hash system maintains the mapping between the keys of points and their rowIds; and the graph database system persistently records the loaded data.
6. The rapid data loading system for a distributed graph database according to claim 5, wherein the hash system maintains the mapping between the keys of points and their rowIds, with the key of a point as the key in the hash and the rowId of the point as the value in the hash, so that each pair <key, rowId> maintains the mapping from a point's key to its rowId.
7. The rapid data loading system for a distributed graph database according to claim 6, wherein a hash table is selected to maintain the mapping of key-value pairs, and the process of inserting a point comprises the following steps:
(a) combining key-i and rowId-i into a record, inserting the record into a file buffer, and obtaining the file offset lsn-i of the record; if the file buffer is full, handing it to a flush thread to be flushed to disk asynchronously and writing the record into a second file buffer;
(b) calculating the hash value of key-i and taking it modulo the number M of non-conflict array units to obtain the non-conflict array subscript arr-i; if the conflict flag of array unit arr-i is false and its other positions are in the initialized state, the unit does not yet store data and no hash conflict has occurred, so lsn-i and rowId-i are recorded at the corresponding positions of the array unit and the insertion is complete;
(c) if the conflict flag of array unit arr-i in step (b) is false but its other positions are not in the initialized state, the unit already stores data and a hash conflict has occurred; the contents of array unit arr-i are then copied into the conflict array, whose units are used sequentially starting from position 0; the number of the conflict array unit that receives the copy is recorded as pre-i, lsn-i and pre-i are stored at the corresponding positions of unit arr-i, the conflict flag is changed to true, and the insertion is complete;
(d) if the conflict flag of array unit arr-i in step (b) is true, indicating that hash conflicts have occurred multiple times, the insertion of the conflicting data and of the new data is completed with reference to step (c).
8. The rapid data loading system for a distributed graph database according to claim 6, wherein a hash table is selected to maintain the mapping of key-value pairs, and the process of looking up a point comprises the following steps:
(e) calculating the hash value of key-i and taking it modulo the number M of non-conflict array units to obtain the non-conflict array subscript arr-i; if the conflict flag of arr-i is false and the lsn-i of arr-i is in the initialized state, no rowId-i is found and the query returns no data;
(f) if the conflict flag of arr-i in step (e) is false and lsn-i is not in the initialized state, no hash conflict has occurred at the array unit; bytes 9 to 16 of the unit hold the rowId-i being queried, the query ends, and rowId-i is returned;
(g) if the conflict flag of arr-i in step (e) is true, a hash conflict has occurred at the array unit; the key stored in the file is read at the file offset given by lsn-i and compared with key-i; if they are the same, the query ends and the rowId-i stored in the file is returned;
(h) if key-i differs from the key in step (g), pre-i is read from arr-i, pre-i being the subscript in the conflict array of the previous conflicting entry; denoting the array unit at pre-i as arr-t, the key in the file is read through lsn-t and compared with key-i; if they are not equal, the previous conflicting entry is obtained through pre-t, and so on, until the conflict flag of an array unit is false, which indicates that no earlier conflicting entry exists.
9. A distributed graph database oriented data rapid loading device, comprising: one or more processors;
a storage means for storing one or more programs, user data;
The one or more programs, when executed by one or more processors, cause the one or more processors to implement the method for fast loading of data for a distributed graph database as recited in any one of claims 1 to 4.
10. A data fast loading storage medium for a distributed graph database, wherein a computer program is stored thereon, which when executed by a processor, implements a data fast loading method for a distributed graph database according to any one of claims 1 to 4.
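The cross-node edge record of claims 1 and 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `EdgeRecord` and `make_dual_stored_edge`, the field names, and the use of `itertools.count` as a stand-in for each storage unit's pre-allocated rowId range are hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass, replace
from itertools import count
from typing import Optional


@dataclass
class EdgeRecord:
    row_id: int                          # the edge's own rowId
    start_row_id: int                    # rowId of the start vertex
    end_row_id: int                      # rowId of the end vertex
    remote_row_id: Optional[int] = None  # rowId of the far-end copy, if any


def make_dual_stored_edge(start_row_id, end_row_id, local_ids, remote_ids):
    """Create an edge and its far-end clone, cross-linked by rowId."""
    local = EdgeRecord(next(local_ids), start_row_id, end_row_id)
    # Clone the edge information and give the copy its own rowId.
    remote = replace(local, row_id=next(remote_ids))
    # Each copy records the other's rowId as its remote rowId.
    local.remote_row_id = remote.row_id
    remote.remote_row_id = local.row_id
    return local, remote


# Assumed pre-allocated rowId ranges for two storage units.
local_ids = count(1000)
remote_ids = count(5000)
local, remote = make_dual_stored_edge(200, 600, local_ids, remote_ids)
```

Because both copies carry the vertex rowIds and each other's rowId before they are sent, each storage node can place its copy from the rowId alone, without consulting the other node.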
CN202410253830.9A 2024-03-06 2024-03-06 Data rapid loading method, system, equipment and storage medium for distributed graph database Pending CN118069365A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410253830.9A CN118069365A (en) 2024-03-06 2024-03-06 Data rapid loading method, system, equipment and storage medium for distributed graph database

Publications (1)

Publication Number Publication Date
CN118069365A true CN118069365A (en) 2024-05-24

Family

ID=91107206

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410253830.9A Pending CN118069365A (en) 2024-03-06 2024-03-06 Data rapid loading method, system, equipment and storage medium for distributed graph database

Country Status (1)

Country Link
CN (1) CN118069365A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination