CN112579575A - Method for quickly constructing database index structure

Info

Publication number: CN112579575A
Application number: CN202011580768.2A
Authority: CN
Inventors: 王培培; 陈乃阔; 吴之光; 牛晓威; 张明瑞
Original assignee: Chaoyue Technology Co Ltd
Current assignee: Chaoyue Technology Co Ltd
Priority date: 2020-12-28
Filing date: 2020-12-28
Publication date: 2021-03-30

Abstract

A method for quickly constructing a database index structure comprises the following steps: sorting the index data to be inserted by using a sorting algorithm; creating leaf nodes, and parallelly inserting the sorted index data into the leaf nodes in a hash storage mode; constructing a tree structure based on leaf nodes, and determining a layer structure of the tree structure and internal nodes or root nodes contained in each layer; according to the mapping relation among the leaf nodes, the internal nodes and the root nodes determined by the tree structure, computing key values of the internal nodes and key values of the root nodes in parallel; and parallelly inserting the obtained key values into corresponding internal nodes or root nodes. The invention provides a database index structure combining key value indexes and hash indexes, and provides a rapid construction method based on the index structure.

Description

Method for quickly constructing database index structure

Technical Field

The invention relates to the technical field of databases, in particular to a method for quickly constructing a database index structure.

Background

With the rapid development of the information industry, the volume of data is growing exponentially. Data security has become a focal point of competition among countries, and database products built on domestic platforms have become a target of competitive research among database vendors.

Common database acceleration technologies, such as GPU and FPGA acceleration, are implemented using the OpenCL high-level programming framework. OpenCL-based high-level synthesis for general-purpose computing is gradually maturing, which makes FPGA algorithm design easier and more efficient and lets engineers design at a higher level of abstraction without attending to low-level hardware details. However, the OpenCL framework is designed mainly for the x86 platform. Because no dynamic link library supports domestic processors, OpenCL cannot deliver practical value for database acceleration on autonomous domestic platforms such as Feiteng (Phytium), Loongson, and Shenwei (Sunway). The traditional FPGA development method depends little on the processor platform: as long as the driver meets the protocol requirements, it is compatible with the interfaces of any platform, and it is widely used in heterogeneous acceleration designs for autonomous platforms.

Therefore, a database structure suitable for a domestic platform and a method for quickly constructing a database index structure by fully utilizing the characteristics of the domestic platform are needed.

Disclosure of Invention

In order to solve the technical problems in the background art, in one aspect of the present invention, a method for quickly constructing a database index structure is provided, where the method includes: sorting the index data to be inserted by using a sorting algorithm; creating leaf nodes, and parallelly inserting the sorted index data into the leaf nodes in a hash storage mode; constructing a tree structure based on the leaf nodes, and determining a layer structure of the tree structure and internal nodes or root nodes contained in each layer; according to the mapping relation among the leaf nodes, the internal nodes and the root nodes determined by the tree structure, computing key values of the internal nodes and key values of the root nodes in parallel; and parallelly inserting the obtained key values into corresponding internal nodes or root nodes.

In one or more embodiments, the creating a leaf node and inserting the sorted index data into the leaf node in parallel in a hash storage manner includes: configuring preset parameters; determining the number of required leaf nodes according to the preset parameters and the number of the index data to be inserted; creating a corresponding number of leaf nodes, and allocating a corresponding hash storage space for each leaf node, wherein the hash storage space of each leaf node comprises a hash bucket for storing index data and an overflow bucket for resolving data collision; and parallelly inserting the sorted index data into the hash bucket of each leaf node.

In one or more embodiments, the preset parameters include: the number of hash buckets each leaf node has on average, the capacity of each hash bucket, and the fill factor of each leaf node.

In one or more embodiments, the allocating a respective hash storage space for each of the leaf nodes includes: one or more hash buckets are allocated for each leaf node and are grouped into hash bucket chains, and an overflow bucket is allocated for each hash bucket chain.

In one or more embodiments, the computing the key values of the internal nodes and the key values of the root nodes in parallel according to the mapping relationships among the leaf nodes, the internal nodes and the root nodes determined by the tree structure includes: numbering internal nodes from bottom to top and from left to right based on the tree structure; determining the layer number of each internal node in the tree, the node number at the layer and the key number in the node; calculating leaf offset between internal nodes of each layer and leaf offset between index key values; calculating a target leaf node corresponding to each key value; and extracting the boundary value of the index data in the target leaf node as a key value of an internal node or a key value of a root node.

In one or more embodiments, the value boundaries of the index data are determined by the sorting rules of the sorting algorithm on the index data.

In one or more embodiments, the method further comprises: the fill factor β for each internal node is set to determine the number of key values in each internal node.

In one or more embodiments, the method comprises: in response to receiving a data query request, searching a target leaf node based on key values of the root node and the internal node; and carrying out hash calculation in the target leaf node, and obtaining target index data according to the hash result.

In one or more embodiments, the method comprises:

in response to receiving a request for deleting index data of a target leaf node, deleting the index data, performing defragmentation on the index data, and judging whether the deleted index data is boundary data of the target leaf node;

in response to the deleted index data being boundary data of the target leaf node, traversing all hash buckets within the target leaf node to obtain a minimum value of index data within the target leaf node; and taking the minimum value as a key value of the corresponding internal node or the root node, and upwards adjusting the structure of the internal node of the tree.

In one or more embodiments, the data deletion method further comprises determining whether the target leaf node underflows; in response to the target leaf node underflowing, transferring, by an adjacent leaf node, a portion of the index data to the target node; or merging the target leaf node and the adjacent leaf node, and adjusting the structure of the internal node of the tree upwards.

The beneficial effects of the invention include the following: the invention provides a database index structure that combines a key-value index with a hash index, and provides a method for quickly constructing the index structure based on its structural characteristics. Among other advantages, the key values of all nodes in the tree-structure part can be inserted in parallel in a single pass.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.

FIG. 1 is a flowchart illustrating a method for rapidly building a data index structure according to the present invention;

FIG. 2 is a schematic diagram of a tree structure according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a database index structure according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.

It should be noted that the expressions "first" and "second" in the embodiments of the present invention are used to distinguish two entities or parameters that share the same name but are not identical. "First" and "second" are used merely for convenience of description, should not be construed as limiting the embodiments of the present invention, and are not explained again in the following embodiments.

In order to fully exert the characteristics of an FPGA (field programmable gate array) in a domestic platform, in one aspect of the invention, the invention provides a method for quickly constructing a database index structure, wherein the database index combines the characteristics of a tree index structure and a hash index structure, and not only has accurate range search performance, but also has certain random search performance; in the process of constructing the database index structure, the invention realizes the rapid construction of the database index structure by calculating the key values of the root nodes and the internal nodes of all the tree structures in parallel and completing the key value insertion operation of the root nodes and all the internal nodes at one time. The more detailed process of the invention is as follows:

FIG. 1 is a flowchart illustrating a method for quickly constructing a data index structure according to the present invention. Referring to FIG. 1, the workflow of the method includes: step S1, sorting the index data to be inserted by using a sorting algorithm; step S2, creating leaf nodes, and inserting the sorted index data into the leaf nodes in parallel in a hash storage mode; step S3, constructing a tree structure based on the leaf nodes, and determining the layer structure of the tree structure and the internal nodes or root node contained in each layer; step S4, calculating the key values of the internal nodes and the key values of the root node in parallel according to the mapping relation among the leaf nodes, the internal nodes and the root node determined by the tree structure; and step S5, inserting the obtained key values into the corresponding internal nodes or root node in parallel.

In step S1, the index data to be inserted is sorted by using a sorting algorithm. The index data is sorted so that the key values of the root node and the internal nodes of the tree structure can be calculated. The index data itself is stored in the leaf nodes, which use a hash storage mode; therefore, once it has been determined which index data belongs to each leaf node, the FPGA can insert the data into the corresponding leaf nodes in parallel, without inserting it in sorted order. In an alternative embodiment, the index data may be sorted in descending or ascending order of the values of the data to be inserted.

For step S2, creating a leaf node, and inserting sorted index data into the leaf node in parallel in a hash storage manner; the leaf node of the invention adopts a hash storage mode, and the hash storage space of the leaf node is composed of a hash bucket for storing index data and an overflow bucket for solving data collision.

Specifically, the process of creating the leaf node includes step S2.1, configuring preset parameters; s2.2, determining the number of required leaf nodes according to the preset parameters and the number of the index data to be inserted; and S2.3, creating a corresponding number of leaf nodes, and distributing corresponding hash storage space for each leaf node. The preset parameters comprise the number of hash buckets which each leaf node has on average, the capacity of each hash bucket and the filling factor of each leaf node.

More specifically, let the average number of hash buckets per leaf node be buckets, the capacity of each hash bucket be b, and the fill factor of each leaf node be α. The number of hash buckets actually used to store index data in each leaf node is then buckets × α, and the amount of index data that can actually be inserted into each leaf node is buckets × b × α. N index data to be inserted therefore require L₀ leaf nodes, where

$$L_0 = \left\lceil \frac{N}{\mathit{buckets} \times b \times \alpha} \right\rceil \tag{1}$$

After the number of leaf nodes is determined, hash storage space must be allocated to each leaf node. Specifically, one or more hash buckets are allocated to each leaf node and linked into a hash bucket chain, and an overflow bucket is allocated to each hash bucket chain. Since the hash bucket capacity b is already determined, the size of the hash storage space to allocate to each leaf node follows from the number of hash buckets in the leaf node and the size of the overflow bucket. When the storage space of a hash bucket chain is insufficient, the overflow bucket temporarily stores the incoming data and the leaf node is split; after the new leaf node is generated, the index data held in the overflow bucket is transferred to the hash buckets of the new leaf node, which completes the insertion of the new index data. Both the creation and the splitting of leaf nodes can be completed in parallel by the FPGA. Because the leaf nodes use a hash storage mode, once it has been determined which index data is to be inserted into each leaf node, the FPGA can perform the insertions in parallel without inserting the index data in order.
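The leaf-creation step can be sketched in Python as follows. This is a minimal illustration, not the FPGA implementation: the `LeafNode` class, the use of Python's built-in `hash`, and the way sorted keys are split into consecutive runs are all assumptions made for the example, and the sketch assumes buckets × b × α is a whole number. Leaf splitting on overflow is not modelled; the overflow bucket simply holds colliding keys.

```python
import math


class LeafNode:
    """Hypothetical leaf layout: a chain of hash buckets plus one overflow bucket."""

    def __init__(self, num_buckets, bucket_capacity):
        self.bucket_capacity = bucket_capacity
        self.buckets = [[] for _ in range(num_buckets)]
        self.overflow = []

    def insert(self, key):
        # Assumed hash function: Python's built-in hash modulo the bucket count.
        bucket = self.buckets[hash(key) % len(self.buckets)]
        if len(bucket) < self.bucket_capacity:
            bucket.append(key)
        else:
            # A full bucket is a collision; the key goes to the overflow bucket.
            self.overflow.append(key)

    def keys(self):
        stored = [key for bucket in self.buckets for key in bucket]
        return stored + self.overflow


def leaf_count(n, buckets, b, alpha):
    """Formula (1): number of leaf nodes needed for n index entries."""
    return math.ceil(n / (buckets * b * alpha))


def build_leaves(sorted_keys, buckets, b, alpha):
    """Split sorted keys into consecutive runs and hash each run into its leaf."""
    if not sorted_keys:
        return []
    count = leaf_count(len(sorted_keys), buckets, b, alpha)
    per_leaf = math.ceil(len(sorted_keys) / count)
    leaves = [LeafNode(buckets, b) for _ in range(count)]
    # Each leaf only touches its own slice, so the iterations are independent
    # and could run in parallel; this sketch runs them one after another.
    for i, leaf in enumerate(leaves):
        for key in sorted_keys[i * per_leaf:(i + 1) * per_leaf]:
            leaf.insert(key)
    return leaves


leaves = build_leaves(list(range(1, 43)), buckets=4, b=1, alpha=0.75)
print(len(leaves))  # 14 leaves for N = 42, buckets = 4, b = 1, alpha = 0.75
```

Because every leaf receives a disjoint slice of the sorted data, the order of insertion inside a leaf does not matter, which is the property the description relies on for parallel insertion.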

For step S3, constructing a tree structure based on the leaf nodes to determine the layer structure of the tree structure and the internal node or root node included in each layer, the specific process includes:

Preferably, the number of layers of the tree structure and the number of internal nodes contained in each layer are determined from the number of leaf nodes, using the following formulas.

The number of layers of the tree structure is calculated as:

$$h = \left\lceil \log_{m \times \beta + 1} L_0 \right\rceil \tag{2}$$

where h is the height (i.e., the number of layers) of the tree structure, m is the record capacity of an internal node, β is the fill factor of an internal node, and L₀ is the number of leaf nodes.

The number of internal nodes contained in each layer of the tree structure is calculated as:

$$L_i = \left\lceil \frac{L_{i-1}}{m \times \beta + 1} \right\rceil \tag{3}$$

where Lᵢ is the number of internal nodes in the i-th layer. Using formula (2) and formula (3), the number of internal nodes in each layer is determined layer by layer, upward from the leaf nodes, until the root node is reached; this completes the construction of the tree structure.

The total number of index nodes (internal nodes and root node) of the tree structure is calculated as:

$$L_{\mathit{total}} = \sum_{i=1}^{h} L_i \tag{4}$$

The tree structure constructed at this point is shown in FIG. 2. FIG. 2 is a schematic diagram of a tree structure according to an embodiment of the invention. At this point the tree structure only establishes the mapping relation among the leaf nodes, the internal nodes and the root node. On the basis of this mapping relation, the invention uses the parallel processing capability of the FPGA to calculate the index key values of every internal node and the root node in parallel and completes the insertion of the index key values in a single pass, thereby quickly generating the database index structure of the invention.
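The layer calculation can be illustrated with a short Python sketch. It assumes that each index node has m × β + 1 children (the base that appears in the offset formulas (5) and (6)) and obtains the layer sizes by repeated ceiling division, which avoids floating-point logarithms; the function name `tree_shape` is hypothetical.

```python
def tree_shape(leaf_nodes, m, beta):
    """Return the node count of each index layer, from layer 1 up to the root."""
    if leaf_nodes < 1:
        raise ValueError("at least one leaf node is required")
    fanout = int(m * beta) + 1  # assumed number of children per index node
    if fanout < 2:
        raise ValueError("m * beta must be at least 1")
    layers = []
    count = leaf_nodes
    while True:
        count = -(-count // fanout)  # integer ceiling division
        layers.append(count)
        if count == 1:  # a single node is the root
            break
    return layers


layers = tree_shape(leaf_nodes=14, m=2, beta=1)
print(layers)       # [5, 2, 1]
print(len(layers))  # 3 layers
print(sum(layers))  # 8 index nodes in total
```

For 14 leaf nodes with m = 2 and β = 1 this gives five layer-1 nodes, two layer-2 nodes and one root, which agrees with the node numbering 0 to 7 used in the worked example.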

For step S4, concurrently calculating the key values of the internal nodes and the key value of the root node according to the mapping relationship between the leaf nodes, the internal nodes and the root node determined by the tree structure; after the tree structure is constructed, the process of calculating the key values of each internal node and the root node comprises the following steps: numbering internal nodes from bottom to top and from left to right based on the tree structure; determining the layer number of each internal node in the tree, the node number at the layer and the key number in the node; calculating leaf offset between internal nodes of each layer and leaf offset between index key values; calculating a target leaf node corresponding to each key value; and extracting the boundary value of the index data in the target leaf node as the key value of the internal node or the key value of the root node. Wherein the boundary value of the index data is determined by the sorting rule of the index data by the sorting algorithm.

Before calculating the key values of the internal nodes and the root node, the capacities (i.e., the number of key values that can be stored) of the internal nodes and the root node and a filling factor need to be set to determine the number of key values in each internal node. The formula for calculating the leaf offset between the internal nodes of each layer is as follows:

$$\mathit{nodeOffset} = (m \times \beta + 1)^{n} \tag{5}$$

The formula for calculating the leaf offset between index key values is:

$$\mathit{keyOffset} = (m \times \beta + 1)^{n-1} \tag{6}$$

where m is the capacity of the internal node, β is the fill factor of the internal node, and n is the number of the layer in which the internal node is located.

The formula for calculating the target leaf node corresponding to each key value is:

$$\mathit{target} = \mathit{NodeNo} \times \mathit{nodeOffset} + (\mathit{KeyNo} + 1) \times \mathit{keyOffset} \tag{7}$$

where target is the number of the target leaf node, NodeNo is the number of the internal node within its layer (the internal nodes of each layer are numbered from 0, from left to right), and KeyNo is the number of the key value (the key values in each internal node are numbered from 0, from left to right).

And finally, extracting the boundary value of the index data in the target leaf node as a key value of an internal node or a key value of a root node.
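Formulas (5) to (7) translate directly into a small function. The sketch below is illustrative only; the name `target_leaf` and the truncation of m × β to an integer are assumptions.

```python
def target_leaf(layer, node_no, key_no, m, beta):
    """Leaf number whose boundary value becomes key `key_no` of the given node.

    layer   -- n, the layer of the internal node (layer 1 sits above the leaves)
    node_no -- NodeNo, position of the node within its layer, counted from 0
    key_no  -- KeyNo, position of the key within the node, counted from 0
    """
    base = int(m * beta) + 1
    node_offset = base ** layer        # formula (5)
    key_offset = base ** (layer - 1)   # formula (6)
    return node_no * node_offset + (key_no + 1) * key_offset  # formula (7)


print(target_leaf(layer=1, node_no=1, key_no=0, m=2, beta=1))  # 4
print(target_leaf(layer=2, node_no=0, key_no=1, m=2, beta=1))  # 6
```

Each call depends only on its own arguments, so the targets of all key values can be evaluated independently of one another.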

For step S5, the obtained key values are inserted into the corresponding internal nodes or root node in parallel. The key values of the internal nodes and the root node are inserted into pre-allocated storage space. Fully parallel insertion would be the fastest approach, but a single memory cannot serve multiple read/write operations at the same time, so key values written to the same memory are in practice stored serially. The method adopted by the invention is as follows:

1) key values that are not stored in the same memory are processed fully in parallel;

2) key values that must be stored in the same memory are calculated fully in parallel, the calculated key values are temporarily held in an extra storage space, and the buffered key values are then written to the final storage space in a single pass. This saves the time that serially alternating calculation and storage would take, and significantly improves the efficiency of the algorithm at the cost of extra space.
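The compute-in-parallel, stage, then write-once pattern of item 2) can be sketched in software as follows. Python threads stand in for the FPGA's parallel units purely for illustration; the function `build_key_values`, the `slots` tuples and the dictionary used as final storage are hypothetical, and `leaf_minima` is assumed to hold the boundary (minimum) value of each leaf.

```python
from concurrent.futures import ThreadPoolExecutor


def build_key_values(slots, leaf_minima, m, beta, workers=4):
    """Compute all key values in parallel, then store them in one pass.

    slots       -- list of (layer, node_no, key_no) tuples to fill
    leaf_minima -- leaf_minima[i] is the boundary value of leaf i
    """
    base = int(m * beta) + 1
    keys_per_node = int(m * beta)

    def compute(slot):
        layer, node_no, key_no = slot
        target = node_no * base ** layer + (key_no + 1) * base ** (layer - 1)
        return slot, leaf_minima[target]

    # Phase 1: every key value is computed independently; the results are
    # collected in a staging buffer instead of being written immediately.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        staged = list(pool.map(compute, slots))

    # Phase 2: a single pass copies the staged values into final storage.
    nodes = {}
    for (layer, node_no, key_no), value in staged:
        node = nodes.setdefault((layer, node_no), [None] * keys_per_node)
        node[key_no] = value
    return nodes


minima = [i * 10 for i in range(14)]  # made-up boundary values for 14 leaves
slots = [(1, 1, 0), (1, 1, 1), (2, 0, 0), (2, 0, 1)]
nodes = build_key_values(slots, minima, m=2, beta=1)
print(nodes[(1, 1)])  # [40, 50]
print(nodes[(2, 0)])  # [30, 60]
```

The staging list costs extra memory but keeps the writers from contending for the same storage, which is the trade-off the description names.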

To more clearly and completely illustrate the calculation process of the key values, taking the construction of the tree structure shown in fig. 2 as an example:

Referring to FIG. 2, suppose a total of N = 42 index data are to be inserted; the average number of hash buckets per leaf node is buckets = 4; the capacity of each hash bucket is b = 1 (i.e., each hash bucket can store 1 piece of index data); the fill factor of each leaf node is α = 0.75; the key value capacity of an internal node is m = 2; and the fill factor of an internal node is β = 1.

By formula (1), the number L₀ of required leaf nodes is:

$$L_0 = \left\lceil \frac{42}{4 \times 1 \times 0.75} \right\rceil = 14$$

That is, 14 leaf nodes need to be created. Next, based on the number of leaf nodes, the number of layers of the tree structure is calculated by formula (2) as:

$$h = \left\lceil \log_{2 \times 1 + 1} 14 \right\rceil = 3$$

That is, the tree structure constructed on the leaf nodes has 3 layers (h = 3). Next, the number of internal nodes in layer 1 (the layer of internal nodes that index the leaf nodes directly) is calculated by formula (3) as:

$$L_1 = \left\lceil \frac{14}{2 \times 1 + 1} \right\rceil = 5$$

constructing a tree structure from the determined layer structure and the internal nodes or root nodes contained in each layer; numbering the tree structure from left to right and from bottom to top; wherein the leaf nodes are numbered from 0 to 15 in sequence; the serial numbers of the internal nodes are 0-6 in sequence, and the serial number of the root node is 7.

Next, based on the tree structure constructed above, the process of calculating the key values of each internal node and the root node is as follows:

Take the calculation of the index key values of the internal node numbered 1 as an example. From the previously set values m = 2 and β = 1, each internal node holds 2 key values, numbered 0 and 1. Internal node 1 is in layer 1, i.e., n = 1. Since internal node 1 is at the 2nd position in its layer and numbering starts from 0, its NodeNo is 1. Assuming the 1st key value is being calculated, KeyNo = 0. Formula (5) and formula (6) then give, respectively:

$$\mathit{nodeOffset} = (2 \times 1 + 1)^{1} = 3$$

$$\mathit{keyOffset} = (2 \times 1 + 1)^{1-1} = 1$$

The target for the 1st key value of internal node 1 is then:

$$\mathit{target} = 1 \times 3 + (0 + 1) \times 1 = 4$$

That is, the minimum value 34 in the leaf node numbered 4 (the 5th from the left) is used as the 1st index key value of internal node 1.

Similarly, when the 2nd key value of internal node 1 is calculated, KeyNo = 1 and the other parameters are unchanged, so the target of the 2nd key value is:

$$\mathit{target} = 1 \times 3 + (1 + 1) \times 1 = 5$$

That is, the minimum value 41 in the leaf node numbered 5 (the 6th from the left) is used as the 2nd index key value of internal node 1.

Take the calculation of the index key values of the internal node numbered 5 as a further example. When the 1st key value of internal node 5 is calculated, n = 2, NodeNo = 0, and KeyNo = 0:

$$\mathit{nodeOffset} = (2 \times 1 + 1)^{2} = 9$$

$$\mathit{keyOffset} = (2 \times 1 + 1)^{2-1} = 3$$

The target for the 1st key value of internal node 5 is:

$$\mathit{target} = 0 \times 9 + (0 + 1) \times 3 = 3$$

That is, the minimum value 23 in the leaf node numbered 3 (the 4th from the left) is used as the key value.

Similarly, when the 2nd key value of internal node 5 is calculated, NodeNo = 0, KeyNo = 1, nodeOffset = 9, and keyOffset = 3, so that:

$$\mathit{target} = 0 \times 9 + (1 + 1) \times 3 = 6$$

That is, the minimum value 51 in the leaf node numbered 6 (the 7th from the left) is used as the key value.

The key values of the internal nodes and the key value of the root node are calculated sequentially through the steps, and finally, the index structure of the database is constructed, and the index structure is shown in fig. 3. Fig. 3 is a schematic diagram of a database index structure according to an embodiment of the present invention.
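The per-node calculations above can be enumerated for the whole example tree in a few lines of Python. This is a sketch under two assumptions that the text does not state explicitly: index nodes are numbered globally from bottom to top and left to right, and a key whose target falls beyond the last leaf is simply omitted. The function name `all_targets` is hypothetical.

```python
def all_targets(leaf_nodes, m, beta):
    """Map each index node number to the leaf numbers that supply its keys."""
    base = int(m * beta) + 1
    keys_per_node = int(m * beta)
    result = {}
    number = 0          # global node number, bottom to top, left to right
    layer = 1
    count = leaf_nodes
    while True:
        count = -(-count // base)  # nodes in this layer (ceiling division)
        for node_no in range(count):
            targets = []
            for key_no in range(keys_per_node):
                target = node_no * base ** layer + (key_no + 1) * base ** (layer - 1)
                if target < leaf_nodes:  # assumption: skip out-of-range targets
                    targets.append(target)
            result[number] = targets
            number += 1
        if count == 1:
            break
        layer += 1
    return result


targets = all_targets(leaf_nodes=14, m=2, beta=1)
print(targets[1])  # [4, 5]
print(targets[5])  # [3, 6]
```

The entries for nodes 1 and 5 reproduce the targets 4, 5 and 3, 6 derived by hand, and every entry is computed without reference to any other, which is what allows the key values to be calculated simultaneously.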

It can be seen from the above process of calculating key values that the present invention can calculate key values of each internal node and root node at the same time, therefore, the present invention can calculate each key value in parallel by using parallel processing capability of the FPGA, and insert it into the corresponding internal node or root node at one time, thereby completing construction of the database index structure of the present invention quickly.

If an existing tree construction algorithm were used, an index tree whose internal nodes have a key value capacity of m and m + 1 children would be built; constructing an index tree holding the same index records as FIG. 2 (N = 42) in that way yields a tree 5 levels high. By building from the leaf nodes up to the root node, the invention generates an initial database index structure with minimal storage space; when new index data needs to be inserted, the insertion is handled by the leaf-node splitting process.

Further, in one or more embodiments, the method comprises: in response to receiving the data query request, searching a target leaf node based on key values of the root node and each internal node; and carrying out hash calculation in the target leaf node, and obtaining target index data according to the hash result.
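A query therefore has two stages: a key-value comparison to locate the leaf, then a hash calculation inside it. The Python sketch below illustrates this in simplified form; the multi-layer descent is collapsed into one sorted list of separators, leaves are modelled as plain lists of buckets without capacity limits, and the names `make_leaf`, `find_leaf` and `lookup` are hypothetical.

```python
from bisect import bisect_right


def make_leaf(keys, num_buckets=4):
    """Hypothetical leaf: a list of hash buckets, with no capacity limit."""
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        buckets[hash(key) % num_buckets].append(key)
    return buckets


def find_leaf(separators, key):
    """separators[i] is the smallest key stored in leaf i + 1."""
    return bisect_right(separators, key)


def lookup(leaves, separators, key):
    buckets = leaves[find_leaf(separators, key)]
    # Inside the target leaf only the bucket selected by the hash is examined.
    return key in buckets[hash(key) % len(buckets)]


keys = list(range(1, 43))
leaves = [make_leaf(keys[i:i + 3]) for i in range(0, len(keys), 3)]
separators = [keys[i] for i in range(3, len(keys), 3)]
print(lookup(leaves, separators, 17))   # True
print(lookup(leaves, separators, 100))  # False
```

The separator comparison gives the range-search behaviour of a tree index, while the in-leaf hash step gives the point-lookup behaviour of a hash index.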

Further, in one or more embodiments, the method comprises: in response to receiving a request for deleting index data of a target leaf node, deleting the index data, performing defragmentation on the index data, and judging whether the deleted index data is boundary data (index data corresponding to a boundary value) of the target leaf node; traversing all hash buckets in the target leaf node to obtain the minimum value of the index data in the target leaf node in response to the deleted index data being the boundary data of the target leaf node; taking the minimum value as a key value corresponding to the internal node or the root node, and upwards adjusting the structure of the internal node of the tree; judging whether the target leaf node has underflow; in response to the target leaf node underflow, transferring part of the index data to the target node by the adjacent leaf node; or merging the target leaf node and the adjacent leaf nodes and adjusting the structure of the internal nodes of the tree upwards. Because each execution process of the algorithm can fully utilize the parallel throughput of the FPGA, the parallelism degree is extremely high, and therefore, the method can realize simultaneous deletion of a plurality of records, and has higher operation efficiency in practical application compared with the traditional mode of deleting the records one by one.
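The boundary-update part of the deletion flow can be sketched as follows. This simplified Python example models leaves as lists of hash buckets and keeps a single list of separators in place of the internal nodes; underflow handling (borrowing from or merging with an adjacent leaf) and defragmentation are left out, and all function names are hypothetical.

```python
def make_leaf(keys, num_buckets=4):
    """Hypothetical leaf: a list of hash buckets, with no capacity limit."""
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        buckets[hash(key) % num_buckets].append(key)
    return buckets


def leaf_min(buckets):
    """Scan every hash bucket of a leaf for its smallest key (None if empty)."""
    keys = [key for bucket in buckets for key in bucket]
    return min(keys) if keys else None


def delete_key(leaves, separators, leaf_index, key):
    """Delete key from one leaf; refresh its separator if the minimum changed."""
    buckets = leaves[leaf_index]
    bucket = buckets[hash(key) % len(buckets)]
    if key not in bucket:
        return False
    was_boundary = key == leaf_min(buckets)
    bucket.remove(key)
    # separators[i] is the smallest key of leaf i + 1; leaf 0 has no separator.
    if was_boundary and leaf_index > 0:
        new_min = leaf_min(buckets)
        if new_min is not None:
            separators[leaf_index - 1] = new_min
    return True


leaves = [make_leaf([10, 11, 12]), make_leaf([20, 21, 22])]
separators = [20]
delete_key(leaves, separators, 1, 20)
print(separators)  # [21]
```

Only a deletion of the boundary value forces a scan of all buckets in the leaf; any other deletion touches a single bucket and leaves the key values above it unchanged.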

The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.

It should be understood that, as used herein, the singular form "a" is intended to include the plural form as well, unless the context clearly supports an exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.

The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.

Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples. Within the idea of the embodiments of the invention, technical features of the above embodiment or of different embodiments may also be combined, and many other variations of the different aspects of the embodiments exist that are not described in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within their scope.

Claims

1. A method for quickly constructing a database index structure is characterized by comprising the following steps:

sorting the index data to be inserted by using a sorting algorithm;

creating leaf nodes, and parallelly inserting the sorted index data into the leaf nodes in a hash storage mode;

constructing a tree structure based on the leaf nodes, and determining a layer structure of the tree structure and internal nodes or root nodes contained in each layer;

according to the mapping relation among the leaf nodes, the internal nodes and the root nodes determined by the tree structure, computing key values of the internal nodes and key values of the root nodes in parallel;

and parallelly inserting the obtained key values into corresponding internal nodes or root nodes.

2. The method for rapidly building the database index structure according to claim 1, wherein the creating leaf nodes and the inserting the sorted index data into the leaf nodes in parallel in a hash storage manner comprises:

configuring preset parameters;

determining the number of required leaf nodes according to the preset parameters and the number of the index data to be inserted;

creating a corresponding number of leaf nodes, and allocating a corresponding hash storage space for each leaf node, wherein the hash storage space of each leaf node comprises a hash bucket for storing index data and an overflow bucket for resolving data collision;

and parallelly inserting the sorted index data into the hash bucket of each leaf node.

3. The method for rapidly building the database index structure according to claim 2, wherein the preset parameters comprise:

the number of hash buckets each leaf node has on average, the capacity of each hash bucket, and the fill factor of each leaf node.

4. The method for rapidly building the database index structure according to claim 2, wherein the allocating a corresponding hash storage space for each of the leaf nodes comprises:

one or more hash buckets are allocated for each leaf node and are grouped into hash bucket chains, and an overflow bucket is allocated for each hash bucket chain.

5. The method for rapidly building the database index structure according to claim 1, wherein the computing the key values of the internal nodes and the key values of the root nodes in parallel according to the mapping relationships among the leaf nodes, the internal nodes and the root nodes determined by the tree structure comprises:

numbering internal nodes from bottom to top and from left to right based on the tree structure;

determining the layer number of each internal node in the tree, the node number at the layer and the key number in the node;

calculating leaf offset between internal nodes of each layer and leaf offset between index key values;

calculating a target leaf node corresponding to each key value;

and extracting the boundary value of the index data in the target leaf node as a key value of an internal node or a key value of a root node.

6. The method for rapidly building the database index structure according to claim 5, wherein the boundary value of the index data is determined by the sorting rule of the sorting algorithm on the index data.

7. The method for rapid building of a database index structure of claim 5, wherein the method further comprises:

the fill factor β for each internal node is set to determine the number of key values in each internal node.

8. The method for rapidly building the database index structure according to any one of claims 1 to 7, wherein the method comprises the following steps:

in response to receiving a data query request, searching a target leaf node based on key values of the root node and the internal node;

and carrying out hash calculation in the target leaf node, and obtaining target index data according to the hash result.

9. The method for rapidly building the database index structure according to any one of claims 1 to 7, wherein the method comprises the following steps:

in response to the deleted index data being boundary data of the target leaf node, traversing all hash buckets within the target leaf node to obtain a minimum value of index data within the target leaf node;

and taking the minimum value as a key value of the corresponding internal node or the root node, and upwards adjusting the structure of the internal node of the tree.

10. The data deletion method of claim 9, wherein the method further comprises:

judging whether the target leaf node underflows;

in response to the target leaf node underflowing, transferring, by an adjacent leaf node, a portion of the index data to the target node; or

merging the target leaf node and the adjacent leaf node, and upwards adjusting the structure of the internal node of the tree.