CN112000847A - GPU parallel-based adaptive radix tree dynamic indexing method - Google Patents
- Publication number: CN112000847A (application CN202010836011.9A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9027—Trees
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/24569—Query processing with adaptation to specific hardware, e.g. adapted for using GPUs or SSDs
Abstract
The invention provides a GPU-parallel adaptive radix tree dynamic indexing method. First, an adaptive radix tree data structure is constructed: the first two layers use tree nodes of type Node256, while the third and fourth layers, built with a most-significant-bit-first radix sort, use tree nodes sized to the number of branches, realizing a dynamic data structure. Because the radix sort is stable, the latest update in the original batch of data still follows the older updates after sorting; a deduplication pass then removes the redundant old updates and keeps only the latest ones. After deduplication, each duplicate-free segment of the sequence is inserted into its corresponding node, completing the construction of the whole adaptive radix tree. The insertion, query and deletion operations on the data can then be performed in parallel using the GPU's parallel computing capability.
Description
Technical Field
The invention relates to the technical field of parallel indexing for computer databases, and in particular to a GPU-parallel adaptive radix tree dynamic indexing method.
Background
Indexing is one of the key technologies of modern information retrieval, search applications and data mining: to locate data precisely within large collections, many types of data index have been proposed for different kinds of data and different query requirements. In recent years the explosive growth of data has extended this research to large-scale, high-dimensional and sparse data sets, and efficient index construction and data query over such data sets has become a major research direction. At the same time, the demand for processing power has grown with data volume, and the parallelism of the graphics processing unit (GPU) offers an opportunity: thanks to its excellent computing performance, the GPU has been widely adopted in fields such as bioinformatics and financial transactions and has become an important component of big-data processing systems. A dynamic data structure can adjust its layout to accommodate updates as data is inserted and deleted; however, designing an efficient dynamic data structure on a GPU raises many problems, such as keeping hundreds or thousands of available GPU cores busy, avoiding thread divergence, and reducing global memory accesses by using the on-chip shared memory and caches. To fully exploit the powerful computational resources the GPU provides, the design must take its particular hardware architecture into account. Update operations are more challenging than queries, and become a bottleneck, because they require inter-thread synchronization rather than running independently.
At present there are four designs that implement a dynamic index structure on a GPU. The GPU Sorted Array (SA) maintains a single ordered sequence: when a batch of data arrives it is first sorted and then merged with the existing sequence into a new ordered sequence, and searches use binary search. Random access gives the binary search poor spatial locality, so the cache cannot be well utilized. The GPU LSM (Log-Structured Merge tree) stores data in multiple levels, each level twice the size of the previous one; an inserted batch is first placed into the first level, and whenever a level fills it is merged down into the next level until the batch is completely stored in the tree. Since in the GPU LSM a new value sits at a higher level than the location of its old value, a lookup proceeds from the first level downward, performing a binary search at each level until the key is found or the last level reports a miss. Both the SA and the LSM rely on sort-and-merge to maintain one or more ordered sequences for binary search, so their insertion speed depends on the batch size: the larger the batch, the better the GPU's characteristics are exploited and the faster the sorting and merging. Thanks to its multi-level structure, the LSM does not need to merge with all existing data on every insert as the SA does, so its average update speed is better than the SA's. However, as the total amount of data in the structure grows, the merge cost on insertion rises and the search speed drops sharply; because of its multi-level queries, the LSM searches more slowly than the SA.
The GPU B-Tree locks nodes during updates to prevent conflicting modifications; conflicts are rare when the inserted keys are uniformly distributed. If the input sequence is ordered, however, the insertion positions cluster, lock conflicts increase and insertion slows, so the input should not be pre-sorted. Concurrency is limited because one thread group (a warp, i.e. a group of 32 threads) can process only one data item at a time, so throughput stays essentially flat as the batch size grows. The B-Tree's advantages are that its node size can match the cache-line size, giving good spatial locality, that lookups need no locks, and that its search speed exceeds that of the LSM and the SA. Slab Hash provides a dynamically extensible hash structure: it optimizes the linked-list layout and replaces the allocator of the CUDA (Compute Unified Device Architecture) toolkit with a very fast dynamic memory allocator, SlabAlloc. This gives the structure very high insertion and search speed, but, like traditional hashing methods, it does not support operations such as range queries.
The radix tree (Radix-Tree) is a tree structure determined by the distribution of the keys rather than by their insertion order; a key's location is not found by repeated comparisons as in a B-Tree but, somewhat like hashing, is determined by the successive parts of the key value itself. Because nodes are independent during insertion, no lock or atomic operations are needed, the search space inside a node has good locality, and the cache can be fully utilized. The optimized Adaptive Radix Tree (ART) can dynamically change a node's type according to its utilization, improving space efficiency. However, the search, insertion and deletion operations on the four node types of the adaptive radix tree index were all designed as single-threaded serial code and cannot run in parallel; the existing GPU method uses a serial tree structure whose node types cannot change dynamically, warps and the cache cannot be fully utilized during queries, and performance is poor.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a GPU-parallel adaptive radix tree dynamic indexing method, which comprises the following steps:
step 1: constructing an adaptive radix tree data structure with four layers, comprising:
step 1.1: creating tree nodes of the type of a first layer Node256 and a second layer Node256, and connecting the second layer Node and the first layer Node;
step 1.2: performing high-order-first radix sequencing on S integer data to be stored, creating a third layer and a fourth layer data structure of the adaptive radix tree, connecting the second layer and the third layer, and connecting the third layer and the fourth layer;
step 1.3: perform the deduplication operation with each thread in the warp processing one 8-bit key: if the x-th data item processed by the x-th thread of the warp equals the (x+1)-th data item, processing of the x-th item ends, where x = 0,1,…,X and X denotes the total number of data items processed by one warp;
step 1.4: insert the duplicate-free data sequence of each branch obtained after deduplication into the corresponding node of the adaptive radix tree;
step 2: perform the query operation on data using warps against the constructed adaptive radix tree data structure;
step 3: perform the insertion operation on data using warps against the constructed adaptive radix tree data structure;
step 4: perform the deletion operation on data using warps against the constructed adaptive radix tree data structure.
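Each of the four layers above consumes one 8-bit slice of a 32-bit key, most significant byte first. As a minimal illustrative sketch (the helper name is ours, not the patent's):

```python
def key_bytes(key):
    """Split a 32-bit key into the four 8-bit chunks consumed by the four
    tree layers, most significant byte first (the patent's bits 0-7 are
    the highest-order bits)."""
    return [(key >> shift) & 0xFF for shift in (24, 16, 8, 0)]
```

For example, the key 0x12345678 is routed by 0x12 at the first layer, 0x34 at the second, 0x56 at the third, and 0x78 inside the fourth-layer leaf node.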
Step 1.1 comprises:
step 1.1.1: create a node N_{1,1} of Node256 type as the first layer of the adaptive radix tree, i.e. the first-layer branch of the tree; the i-th value of node N_{1,1} is denoted N_{1,1}.value[i], i = 0,1,…,255;
step 1.1.2: taking N_{1,1}.value[i] as the parent, create a node N_{2,i} of Node256 type as the i-th branch of the second layer of the adaptive radix tree; the j-th value of node N_{2,i} is denoted N_{2,i}.value[j], j = 0,1,…,255;
step 1.1.3: letting i = 0,1,…,255, create a Node256-type node with a warp of the GPU for each value of the first layer, obtaining the 256 branches of the second layer of the adaptive radix tree;
step 1.1.4: and storing the pointer of the ith node of the second layer in the ith value of the first layer.
Step 1.2 comprises:
step 1.2.1: divide the data items among the S items whose bits 0-7 are identical into the same first-layer branch;
step 1.2.2: within each branch of the first layer, divide the data whose bits 8-15 are identical into the same branch of the second layer, so that the data stored in each second-layer branch share the same prefix;
step 1.2.3: using warps of the GPU, perform a most-significant-bit-first radix sort on bits 16-23 of the data in each branch of the second layer in parallel, dividing data whose bits 16-23 are identical into one group; the same-prefix data in the i-th branch of the second layer are divided into H_i groups, and the number of data items in the h_i-th group of the i-th branch is denoted S_{i,h_i}, with h_i = 0,1,…,H_i;
step 1.2.4: for each value in the i-th branch of the second layer in turn, create a node N^u_{3,i,h_i} of type Node_u, u ∈ {4,16,48,256}, whose capacity is greater than or equal to the group size S_{i,h_i}, as the (i,h_i)-th branch of the third layer of the adaptive radix tree, the node types comprising Node4, Node16, Node48 and Node256;
step 1.2.5: letting h_i = 0,1,…,H_i and i = 0,1,…,255, use the warps of the GPU together with each value of the second layer to create node types whose capacity is greater than or equal to the amount of data to be stored, obtaining all branches of the third layer of the adaptive radix tree;
step 1.2.6: store the pointer to each third-layer node N_{3,i,h_i} in the h_i-th value of the i-th branch of the second layer;
step 1.2.7: using warps of the GPU, perform a most-significant-bit-first radix sort on bits 24-31 of the data in each branch of the third layer in parallel, dividing data whose bits 24-31 are identical into one group; the data in the k-th branch of the third layer are divided into H_k groups, and the number of data items in the h_k-th group of the k-th branch is denoted S_{k,h_k}, with h_k = 0,1,…,H_k;
step 1.2.8: for each value in the k-th branch of the third layer in turn, create a node N^v_{4,k,h_k} of type Node_v, v ∈ {4,16,48,256}, whose capacity is greater than or equal to the group size S_{k,h_k}, as the (k,h_k)-th branch of the fourth layer of the adaptive radix tree;
step 1.2.9: letting h_k = 0,1,…,H_k and k range over all branches of the third layer, use the warps of the GPU to create for each node of the fourth layer a node type whose capacity is greater than or equal to the amount of data to be stored, obtaining all branches of the fourth layer of the adaptive radix tree;
step 1.2.10: store the pointer to each fourth-layer node N_{4,k,h_k} in the h_k-th value of the k-th branch of the third layer.
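The choice of a "node whose capacity is greater than or equal to the group size" in steps 1.2.4 and 1.2.8 amounts to picking the smallest of the four ART capacities. A sketch (the helper name is illustrative):

```python
def pick_node_capacity(group_size):
    """Return the smallest node capacity u in {4, 16, 48, 256} able to hold
    `group_size` entries, i.e. choose Node4, Node16, Node48 or Node256."""
    for capacity in (4, 16, 48, 256):
        if group_size <= capacity:
            return capacity
    raise ValueError("an 8-bit key slice yields at most 256 branches")
```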
The step 2 comprises the following steps:
step 2.1: assign one warp to each branch whose prefix in bits 0-23 is identical, to perform the query operation;
step 2.2: for nodes of Node4 and Node16 type, according to the number F of data items stored in the node, every F threads of a warp form a group and use thread voting to judge whether the queried data exists in the node; for nodes of Node48 and Node256 type, whether the queried data exists is judged directly from the 8-bit key; if the data exists, the value in the node is returned, and if not, a null value is returned to indicate that the data was not found.
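Serially, the per-node-type lookups of step 2.2 reduce to the following sketches; on the GPU the Node4/Node16 scan is performed by F threads combined with a warp vote, and the function names and array layouts here are our illustrative assumptions:

```python
def query_small_node(keys, values, count, key):
    """Node4/Node16: scan the first `count` slots; on the GPU each slot is
    checked by its own thread and the matches are combined with a vote."""
    for slot in range(count):
        if keys[slot] == key:
            return values[slot]
    return None  # data not found

def query_node48(key_index, values, key):
    """Node48: the 256-entry key array maps the 8-bit key to a slot in the
    48-entry value array; None marks an unused key."""
    slot = key_index[key]
    return values[slot] if slot is not None else None

def query_node256(values, key):
    """Node256: the 8-bit key is used directly as the subscript."""
    return values[key]
```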
The step 3 comprises the following steps:
step 3.1: assign one warp to each branch whose prefix in bits 0-23 is identical, to perform the insertion operation;
step 3.2: judging whether a node to be inserted exists or not, if not, executing node creation and new data insertion;
step 3.3: if the node exists, each thread queries in parallel whether its new data item already exists in the node; if so, the value in the node is updated. If not, judge whether the sum β of the number of new data items and the number of data items already in the node exceeds the capacity β′ of the node's current type. If β ≤ β′, insert the value and the bit-24-31 key of each new data item into the positions starting from F, the number of data items already in the node; if β > β′, convert the node into a new node type whose capacity is greater than or equal to β, and then insert the new data into the node of the new type.
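The capacity check of step 3.3 can be sketched as follows; the dict-based node layout and the helper name are our assumptions, not the patent's:

```python
def insert_batch(node, new_items):
    """Sketch of step 3.3: keys already present are updated in place; if the
    remaining new items would overflow the node (β > β′), the node first grows
    to the smallest sufficient type, then the new keys/values are appended at
    positions F, F+1, ... where F is the current data count."""
    fresh = []
    for key, value in new_items:
        if key in node['keys']:                      # existing key: update its value
            node['values'][node['keys'].index(key)] = value
        else:
            fresh.append((key, value))
    beta = len(node['keys']) + len(fresh)            # β = existing count + new count
    if beta > node['capacity']:                      # β exceeds β′: convert node type
        node['capacity'] = next(c for c in (4, 16, 48, 256) if beta <= c)
    for key, value in fresh:                         # append after position F
        node['keys'].append(key)
        node['values'].append(value)
    return node
```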
Step 4 comprises the following steps:
step 4.1: assign one warp to each branch whose prefix in bits 0-23 is identical, to perform the deletion operation;
step 4.2: judge whether the data to be deleted exists in the node; if it does, then for nodes of Node4, Node16 and Node48 type, first find the last value stored in the node, overwrite the position to be deleted with its key and value respectively, and update the number of data items in the node; for a Node256-type node, find the storage position of the data to be deleted, delete the value, and then update the count in the node.
The invention has the beneficial effects that:
the invention provides a GPU (graphics processing unit) -parallel-based adaptive radix tree dynamic indexing method, which is based on a GPU multi-thread processing mode and provides a data structure of an adaptive radix tree.
Drawings
FIG. 1 is a diagram illustrating an adaptive radix tree data structure according to the present invention.
FIG. 2 is a diagram illustrating four node types of the adaptive radix tree according to the present invention.
FIG. 3 is a schematic diagram of parallel data query in the invention, in which (a) shows a Node16-type node querying data by thread voting, (b) shows a Node48-type node querying the value array by accessing the entry of the key array indexed by the key, and (c) shows a Node256-type node locating the value position directly by the key.
FIG. 4 is a schematic diagram of data deletion of four Node types in the present invention, in which (a) shows a schematic diagram of data deletion of Node4 and Node16 type nodes, and (b) shows a schematic diagram of data deletion of Node48 type nodes.
Detailed Description
The invention is further described with reference to the following figures and specific examples.
A GPU-parallel adaptive radix tree dynamic indexing method is a dynamic index design based on GPU parallelism: it makes full use of the GPU's parallel computing capability and, combined with the adaptive radix tree structure, parallelizes the batch construction, insertion, deletion and query operations of the index. First, following the data distribution in a radix tree, a most-significant-bit-first radix sort is applied to the batch of data to be processed, and once the sequence is ordered it is grouped several times. The first pass counts the highest 8 bits of each 32-bit key, dividing the batch into at most 256 segments, which are then processed in parallel. Each segment is then subdivided by the second 8 bits, and again by the third 8 bits in the same way; the result of this division is the per-segment statistics for the current third layer, according to which the insertion into the third layer is performed. At this point the whole batch has been divided, in parallel, into many small segments, each internally ordered and sharing the same 24-bit prefix, but possibly still containing redundant updates. Because radix sort is a stable sorting method, the latest update in the original batch is guaranteed to remain behind the older updates after sorting. Deduplication can therefore proceed from front to back, removing the redundant old updates and keeping the latest ones. After deduplication, each duplicate-free data sequence is inserted into, or queried in, the corresponding node of its segment using the redesigned parallel operations. The method comprises the following steps:
step 1: construct an adaptive radix tree data structure with four layers, as shown in fig. 1; a Node256-type node is created for the first layer of the tree, and the node structures are shown in fig. 2. Node4 is the smallest type, consisting of an array Key[4] of length 4 storing keys and a pointer array Value[4] of the same length; a key and its pointer sit at corresponding positions. Node16 stores 5 to 16 values and is similar to Node4 but with total length 16, consisting of a key array Key[16] and a pointer array Value[16] of the same length. Node48 stores 17 to 48 values and consists of a key array Key[256] of length 256 and a pointer array Value[48] of length 48. Node256 is the largest node type, storing 49 to 256 values; it consists only of a pointer array Value[256] of length 256, and the value is located directly by using the key as the subscript, so a single access suffices to locate it. To improve processing efficiency, the second layer is set to 256 Node256-type nodes, whose pointers are stored at the corresponding positions in the value array of the first-layer node; the value arrays in the fourth-layer leaf nodes store the values, while each upper layer stores pointers to the corresponding nodes of the next layer. The method specifically comprises the following steps:
step 1.1: creating tree nodes of the type of a first layer Node256 and a second layer Node256, and connecting the second layer Node and the first layer Node, wherein the steps comprise:
step 1.1.1: create a node N_{1,1} of Node256 type as the first layer of the adaptive radix tree, i.e. the first-layer branch of the tree; the i-th value of node N_{1,1} is denoted N_{1,1}.value[i], i = 0,1,…,255;
step 1.1.2: taking N_{1,1}.value[i] as the parent, create a node N_{2,i} of Node256 type as the i-th branch of the second layer of the adaptive radix tree; the j-th value of node N_{2,i} is denoted N_{2,i}.value[j], j = 0,1,…,255;
step 1.1.3: letting i = 0,1,…,255, create a Node256-type node with a warp of the GPU for each value of the first layer, obtaining the 256 branches of the second layer of the adaptive radix tree;
step 1.1.4: and storing the pointer of the ith node of the second layer in the ith value of the first layer.
The batch of data is sorted with a most-significant-bit-first radix sort; every 8 bits divide it into at most 256 segments, and data sharing the same first 16 bits fall into the same branch. The position information of every branch (namely the prefix sums) is obtained during the MSB-first radix sort, so the start of each branch can be found in parallel and processing continues downward. Bits 16-23 of each branch are then sorted in parallel, tree nodes of the appropriate size are created as the third layer according to the number of branches, and their pointers are stored into the corresponding positions of the second-layer nodes; the same sorting is applied to bits 24-31, and after creating nodes of the corresponding types in the fourth layer, their pointers are stored into the corresponding positions of the third layer. The specific steps are as follows:
step 1.2: performing high-order-first radix sequencing on S integer data to be stored, creating a third layer and a fourth layer data structure of an adaptive radix tree, connecting the second layer and the third layer, and connecting the third layer and the fourth layer, wherein the method comprises the following steps:
step 1.2.1: divide the data items among the S items whose bits 0-7 are identical into the same first-layer branch;
step 1.2.2: within each branch of the first layer, divide the data whose bits 8-15 are identical into the same branch of the second layer, so that the data stored in each second-layer branch share the same prefix;
step 1.2.3: using warps of the GPU, perform a most-significant-bit-first radix sort on bits 16-23 of the data in each branch of the second layer in parallel, dividing data whose bits 16-23 are identical into one group; the same-prefix data in the i-th branch of the second layer are divided into H_i groups, and the number of data items in the h_i-th group of the i-th branch is denoted S_{i,h_i}, with h_i = 0,1,…,H_i;
step 1.2.4: for each value in the i-th branch of the second layer in turn, create a node N^u_{3,i,h_i} of type Node_u, u ∈ {4,16,48,256}, whose capacity is greater than or equal to the group size S_{i,h_i}, as the (i,h_i)-th branch of the third layer of the adaptive radix tree, the node types comprising Node4, Node16, Node48 and Node256;
step 1.2.5: letting h_i = 0,1,…,H_i and i = 0,1,…,255, use the warps of the GPU together with each value of the second layer to create node types whose capacity is greater than or equal to the amount of data to be stored, obtaining all branches of the third layer of the adaptive radix tree;
step 1.2.6: store the pointer to each third-layer node N_{3,i,h_i} in the h_i-th value of the i-th branch of the second layer;
step 1.2.7: using warps of the GPU, perform a most-significant-bit-first radix sort on bits 24-31 of the data in each branch of the third layer in parallel, dividing data whose bits 24-31 are identical into one group; the data in the k-th branch of the third layer are divided into H_k groups, and the number of data items in the h_k-th group of the k-th branch is denoted S_{k,h_k}, with h_k = 0,1,…,H_k;
step 1.2.8: for each value in the k-th branch of the third layer in turn, create a node N^v_{4,k,h_k} of type Node_v, v ∈ {4,16,48,256}, whose capacity is greater than or equal to the group size S_{k,h_k}, as the (k,h_k)-th branch of the fourth layer of the adaptive radix tree;
step 1.2.9: letting h_k = 0,1,…,H_k and k range over all branches of the third layer, use the warps of the GPU to create for each node of the fourth layer a node type whose capacity is greater than or equal to the amount of data to be stored, obtaining all branches of the fourth layer of the adaptive radix tree;
step 1.2.10: store the pointer to each fourth-layer node N_{4,k,h_k} in the h_k-th value of the k-th branch of the third layer.
Step 1.3: each thread in the warp processes one 8-bit key to perform the deduplication: if the x-th data item processed by the x-th thread of the warp equals the adjacent (x+1)-th item, processing of the x-th item ends, removing the redundant old update and keeping the latest one, where x = 0,1,…,X and X denotes the total number of data items processed by the warp;
step 1.4: insert the duplicate-free data sequence of each branch obtained after deduplication into the corresponding node of the adaptive radix tree;
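Because the radix sort is stable, the latest update of a key always follows the older ones, so the front-to-back deduplication of step 1.3 can be sketched serially as follows (the helper name is illustrative):

```python
def dedup_keep_latest(batch):
    """Stable-sort (key, value) updates by key; since Python's sort is stable,
    the newest update of each key stays last, so dropping every element whose
    right neighbour carries the same key keeps exactly the latest updates."""
    ordered = sorted(batch, key=lambda kv: kv[0])
    return [kv for i, kv in enumerate(ordered)
            if i + 1 == len(ordered) or ordered[i + 1][0] != kv[0]]
```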
and aiming at the constructed adaptive radix tree data structure, allocating a warp process to each branch with the same front 24bit prefix, wherein the processing modes of each batch of data comprise query, insertion and deletion operations.
Step 2: aiming at the constructed adaptive radix tree data structure, performing query operation on data by using warp, wherein the query operation comprises the following steps:
step 2.1: assign one warp to each branch whose prefix in bits 0-23 is identical, to perform the query operation;
step 2.2: for nodes of Node4 and Node16 type, according to the number F of data items stored in the node, every F threads of a warp form a group and use thread voting to judge whether the queried data exists in the node; for nodes of Node48 and Node256 type, whether the queried data exists is judged directly from the 8-bit key; if the data exists, the value in the node is returned, and if not, a null value is returned to indicate that the data was not found.
As shown in FIG. 3, diagram (a) shows a Node16-type node querying data by thread voting: the node already holds 7 data items, Keys denotes the 8-bit data to be queried in the node (here the random values 1, 0, 255 and 3), and the thread group warp contains 32 threads T0-T31, where T0, T7, T14 and T21 process the 0th data item in the node, T1, T8, T15 and T22 process the 1st, T2, T9, T16 and T23 process the 2nd, and T6, T13, T20 and T27 process the 6th. Diagram (b) shows a Node48-type node querying data: Keys denotes the 8-bit data to be queried (the random values 1, 15, 46, …, 0, 16); each warp can query at most 32 data items simultaneously in parallel, and 0, 1, …, 15, 16, …, 46, 47 are the subscripts of the values in the value array. Diagram (c) shows a Node256-type node locating a value directly by its key; 0, 1, …, 47, 48, …, 254, 255 are the subscripts of the values in the value array.
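The Node48 layout of Fig. 3(b), a 256-entry key array indexing a 48-entry value array, can be sketched as follows (the class name, sentinel and append-only insert are our illustrative assumptions):

```python
class Node48:
    """The 256-entry key array maps an 8-bit key to a slot of the 48-entry
    value array; EMPTY marks an unused key, and inserts append so the
    occupied prefix of the value array stays contiguous."""
    EMPTY = -1

    def __init__(self):
        self.key = [Node48.EMPTY] * 256
        self.value = [None] * 48
        self.count = 0

    def insert(self, k, v):
        # Append the value at the next free slot and record the slot
        # under the 8-bit key.
        self.key[k] = self.count
        self.value[self.count] = v
        self.count += 1

    def lookup(self, k):
        slot = self.key[k]
        return None if slot == Node48.EMPTY else self.value[slot]
```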
Step 3: perform the insertion operation on data using warps against the constructed adaptive radix tree data structure, the insertion operation comprising the following steps:
step 3.1: assign one warp to each branch whose prefix in bits 0-23 is identical, to perform the insertion operation;
step 3.2: judging whether a node to be inserted exists or not, if not, executing node creation and new data insertion;
step 3.3: if the node exists, each thread queries in parallel whether its new data item already exists in the node; if so, the value in the node is updated. If not, first judge whether the sum β of the number of new data items and the number of data items already in the node exceeds the capacity β′ of the node's current type. If β ≤ β′, insert the value and the bit-24-31 key of each new data item into the positions starting from F, the number of data items already in the node; if β > β′, convert the node into a new node type whose capacity is greater than or equal to β, and then insert the new data into the node of the new type.
Step 4: perform the deletion operation on data using warps against the constructed adaptive radix tree data structure, the deletion operation comprising the following steps:
step 4.1: assign one warp to each branch whose prefix in bits 0-23 is identical, to perform the deletion operation;
step 4.2: judge whether the data to be deleted exists in the node; if it does, then for nodes of Node4, Node16 and Node48 type, first find the last value stored in the node, overwrite the position to be deleted with its key and value respectively, and update the number of data items in the node; for a Node256-type node, find the storage position of the data to be deleted, delete the value, and then update the count in the node.
In the traditional adaptive radix tree (ART), all node types except Node256 allow deletion at an arbitrary position, which leaves gaps among the non-null entries of the value array; searching for an empty position during insertion then raises the query cost and reduces parallelism. The insertion operation here appends to the value array, which requires the array to stay contiguous so that the next empty position is easy to find. To support deletion without affecting this contiguity, when Node4-, Node16- and Node48-type nodes process a deletion, for each key handled they locate the value corresponding to the last key in the node and use that key and value to overwrite the entry to be deleted, so contiguity is preserved. As shown in fig. 4(a), in Node4 and Node16 the position of the last non-null value is obtained from the data count in the node, while the key position of the last value of a Node48 is found by the 32 threads of the warp jointly searching for that value in the key array of length 256. As shown in fig. 4(b), each warp contains 32 threads; the dashed box 1 marks the key 0 to be deleted and its value stored at position 1 of the value array. The last value in the node is stored at position 46 of the value array; the 46th value overwrites the 1st, the key-array entry that pointed to position 46 is updated to the value's new position 1, and the slot 1 stored at key 0's position is then set to null. For a Node256-type node, after direct positioning, the value at the corresponding position is deleted and the data count in the node is updated.
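Serially, the contiguity-preserving swap-with-last deletion described above for Node4/Node16 reduces to the following sketch (the helper name is illustrative):

```python
def delete_key(keys, values, count, key):
    """Overwrite the deleted slot with the node's last key/value pair and
    shrink the count, keeping the occupied prefix of both arrays gap-free;
    returns the new data count (unchanged if the key is absent)."""
    for slot in range(count):
        if keys[slot] == key:
            last = count - 1
            keys[slot], values[slot] = keys[last], values[last]
            return last
    return count
```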
Claims (6)
1. A GPU parallel-based adaptive radix tree dynamic indexing method is characterized by comprising the following steps:
step 1: constructing an adaptive radix tree data structure with four layers, comprising:
step 1.1: creating Node256-type tree nodes for the first layer and the second layer, and connecting the second-layer nodes to the first-layer node;
step 1.2: performing high-order-first radix sorting on the S integer data to be stored, creating the third-layer and fourth-layer data structures of the adaptive radix tree, connecting the second layer with the third layer, and connecting the third layer with the fourth layer;
step 1.3: processing an 8-bit key by each thread in the warp to perform deduplication operation, and if the xth data processed by the xth thread in the warp is equal to the xth +1 data, ending the processing of the xth data, wherein X is 0,1, …, and X denotes the total number of data processed in the warp;
step 1.4: inserting the sequence of the non-repeated data in each branch after the duplication removal into the node of the adaptive basis number tree;
step 2: performing a query operation on data using warps, for the constructed adaptive radix tree data structure;
step 3: performing an insertion operation on data using warps, for the constructed adaptive radix tree data structure;
step 4: performing a deletion operation on data using warps, for the constructed adaptive radix tree data structure.
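Steps 1.2–1.4 of claim 1 amount to grouping 32-bit keys by successive 8-bit digits, sorting so the latest update of each key comes last, and discarding older duplicates. A serial Python sketch of that pipeline (function name hypothetical; the patent performs these steps warp-parallel on the GPU):

```python
def build_branches(pairs):
    """Group (key, value) updates by their four 8-bit digits, newest last.

    pairs: list of (32-bit key, value) in arrival order; later entries
    are newer updates.  Returns {(b0, b1, b2, b3): value}, keeping only
    the latest value per key, mirroring the sort-then-deduplicate step.
    """
    # A stable sort by key keeps the newest update of equal keys last,
    # matching the high-order-first radix sort behaviour in the patent.
    ordered = sorted(enumerate(pairs), key=lambda t: (t[1][0], t[0]))
    latest = {}
    for _, (key, val) in ordered:    # newer duplicates overwrite older ones
        digits = tuple((key >> shift) & 0xFF for shift in (24, 16, 8, 0))
        latest[digits] = val
    return latest
```

For instance, with updates `[(0x01020304, "old"), (0x01020304, "new")]`, the digit path (1, 2, 3, 4) retains only "new", which is the deduplication outcome steps 1.3–1.4 require before node insertion.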
2. The GPU-parallel-based adaptive radix tree dynamic indexing method according to claim 1, wherein step 1.1 comprises:
step 1.1.1: creating a node N1,1 of the Node256 type as the first layer of the adaptive radix tree, i.e. the first branch layer of the adaptive radix tree, node N1,1 being represented by its 256 values indexed i = 0, 1, …, 255, the ith value being the ith entry of the first-layer node;
step 1.1.2: for the ith value of the first-layer node, creating a node N2,i of the Node256 type as the ith branch of the second layer of the adaptive radix tree, node N2,i being represented by its 256 values indexed j = 0, 1, …, 255, the jth value being the jth entry of the ith node of the second layer;
step 1.1.3: letting i = 0, 1, …, 255, creating a Node256-type node with a warp of the GPU for each value of the first layer, obtaining the 256 branches of the second layer of the adaptive radix tree;
step 1.1.4: and storing the pointer of the ith node of the second layer in the ith value of the first layer.
3. The GPU-parallel-based adaptive radix tree dynamic indexing method according to claim 1, wherein step 1.2 comprises:
step 1.2.1: dividing data whose bits 0-7 are identical among the S data into the same first-layer branch;
step 1.2.2: dividing data whose bits 8-15 are identical within each branch of the first layer into the same branch of the second layer, so that the data stored in each branch of the second layer share the same prefix;
step 1.2.3: performing high-order-first radix sorting on bits 16-23 of the data in each branch of the second layer in parallel with warps of the GPU, dividing data whose bits 16-23 are identical into one group; the data with the same prefix in the ith branch of the second layer is divided into Hi groups, the hi-th group in the ith branch of the second layer holding a recorded number of data, where hi = 0, 1, …, Hi;
step 1.2.4: for each value in the ith branch of the second layer, sequentially creating a node whose capacity is greater than or equal to the number of data to be stored, as the corresponding branch of the third layer of the adaptive radix tree, the uth value of that third-layer node being indexed u = 1, 2, …, U with U ∈ {4, 16, 48, 256}, the node types comprising the Node4, Node16, Node48 and Node256 types;
step 1.2.5: letting hi = 0, 1, …, Hi and i = 0, 1, …, 255, creating with warps of the GPU, for each value of the second layer, a node type whose capacity is greater than or equal to the amount of data to be stored, obtaining all branches of the third layer of the adaptive radix tree;
step 1.2.6: storing the pointer of each third-layer node in the hi-th value of the ith branch of the second layer;
step 1.2.7: performing high-order-first radix sorting on bits 24-31 of the data in each branch of the third layer in parallel with warps of the GPU, dividing data whose bits 24-31 are identical within each branch into one group; the data in the kth branch of the third layer is divided into Hk groups, the hk-th group in the kth branch of the third layer holding a recorded number of data, where hk = 0, 1, …, Hk;
step 1.2.8: for each value in the kth branch of the third layer, sequentially creating a node whose capacity is greater than or equal to the number of data to be stored, as the corresponding branch of the fourth layer of the adaptive radix tree, the vth value of that fourth-layer node being indexed v = 1, 2, …, V with V ∈ {4, 16, 48, 256};
step 1.2.9: letting hk = 0, 1, …, Hk, creating with warps of the GPU, as each node of the fourth layer, a node type whose capacity is greater than or equal to the size of the data to be stored, obtaining all branches of the fourth layer of the adaptive radix tree.
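Steps 1.2.4 and 1.2.8 both select the smallest node type whose capacity covers a group's size. A sketch of that selection rule, assuming the capacities are those of the four named node types (Node4, Node16, Node48, Node256):

```python
def pick_node_type(n):
    """Smallest adaptive-radix-tree node capacity able to hold n entries.

    Mirrors the 'capacity greater than or equal to the number of data'
    rule of steps 1.2.4/1.2.8; returns one of 4, 16, 48 or 256.
    """
    for capacity in (4, 16, 48, 256):
        if n <= capacity:
            return capacity
    raise ValueError("a node indexes at most 256 distinct 8-bit keys")
```

A group of 5 keys therefore gets a Node16, while a group of exactly 48 still fits a Node48; only groups larger than 48 pay for a full Node256.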
4. The GPU-parallel-based adaptive radix tree dynamic indexing method according to claim 1, wherein the step 2 comprises:
step 2.1: allocating one warp to process each branch sharing the same prefix in bits 0-23, to perform the query operation;
step 2.2: for nodes of the Node4 and Node16 types, according to the number F of data stored in the node, grouping every F threads in each warp and judging by thread voting whether the data to be queried exists in the node; for nodes of the Node48 and Node256 types, judging according to the 8-bit key whether the data to be queried exists in the node; if it exists, returning the value in the node, and if not, returning a null value indicating the data was not found.
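The thread-voting lookup of step 2.2 can be simulated serially: on the GPU, each of F threads compares one stored slot with the query and a warp ballot combines the per-lane results. The sketch below (hypothetical helper name) emulates that ballot with a list of matching lanes:

```python
def warp_vote_lookup(keys, values, query):
    """Simulate the F-thread ballot used for Node4/Node16 lookups.

    keys/values: parallel arrays of the node's stored entries.
    Each simulated 'thread' f compares slot f with the query; the
    ballot is the set of matching lanes.  Returns the stored value,
    or None (the null result of step 2.2) if the key is absent.
    """
    F = len(keys)                                        # data count in node
    ballot = [f for f in range(F) if keys[f] == query]   # one lane per slot
    return values[ballot[0]] if ballot else None
```

In CUDA this ballot would be a single warp-vote intrinsic over the F lanes, so all slots of a small node are checked in one step rather than by a sequential scan.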
5. The GPU-parallel-based adaptive radix tree dynamic indexing method according to claim 1, wherein step 3 comprises:
step 3.1: allocating one warp to process each branch sharing the same prefix in bits 0-23, to perform the insertion operation;
step 3.2: judging whether the node to be inserted into exists, and if not, creating the node and inserting the new data;
step 3.3: if the node exists, each thread querying in parallel whether the new data to be inserted already exists in the node; if it does, updating the value in the node; if not, judging whether the sum β of the number of new data to be inserted and the number of data already in the node exceeds the capacity β' of the node type to be inserted into; if it does not, inserting the value and the 24-31-bit key of each new datum at the next free positions in the node, determined from the amount of data already in the node; if β exceeds β', converting the node into a new node type whose capacity is greater than or equal to β, and then inserting the new data into the node of the new type.
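A serial sketch of step 3.3's update-append-or-grow behaviour (node represented as parallel key/value lists plus a capacity; names hypothetical; capacities follow the four named node types):

```python
NODE_CAPACITIES = (4, 16, 48, 256)

def insert_batch(node, new_pairs):
    """Insert (key, value) pairs, growing the node type when needed.

    node: dict with 'keys', 'values' and 'capacity'.  Existing keys are
    updated in place; new entries are appended at the next free
    positions so the value array stays contiguous, as step 3.3 requires.
    """
    fresh = []
    for k, v in new_pairs:
        if k in node["keys"]:
            node["values"][node["keys"].index(k)] = v   # update in place
        else:
            fresh.append((k, v))
    beta = len(node["keys"]) + len(fresh)               # required size beta
    if beta > node["capacity"]:                         # beta exceeds beta'
        node["capacity"] = min(c for c in NODE_CAPACITIES if c >= beta)
    for k, v in fresh:                                  # append at the tail
        node["keys"].append(k)
        node["values"].append(v)
```

Inserting three new keys into a Node4 already holding two entries makes β = 5 > β' = 4, so the node is promoted to a Node16 before the append, exactly the conversion branch of step 3.3.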
6. The GPU-parallel-based adaptive radix tree dynamic indexing method according to claim 1, wherein step 4 comprises:
step 4.1: allocating one warp to process each branch sharing the same prefix in bits 0-23, to perform the deletion operation;
step 4.2: judging whether the data to be deleted exists in the node; if it does, for nodes of the Node4, Node16 and Node48 types, first finding the last value stored in the node, overwriting the position to be deleted with its key and value respectively, and updating the count of data in the node; for a node of the Node256 type, finding the storage position of the data to be deleted, deleting the value, and then updating the count in the node.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010836011.9A CN112000847B (en) | 2020-08-19 | 2020-08-19 | GPU parallel-based adaptive radix tree dynamic indexing method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112000847A true CN112000847A (en) | 2020-11-27 |
CN112000847B CN112000847B (en) | 2021-07-20 |
Family
ID=73473037
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010836011.9A Active CN112000847B (en) | 2020-08-19 | 2020-08-19 | GPU parallel-based adaptive radix tree dynamic indexing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112000847B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170262464A1 (en) * | 2013-12-13 | 2017-09-14 | Oracle International Corporation | System and method for supporting elastic data metadata compression in a distributed data grid |
CN107590160A (en) * | 2016-07-08 | 2018-01-16 | 阿里巴巴集团控股有限公司 | A kind of method and device for monitoring radix tree internal structure |
CN110363294A (en) * | 2018-03-26 | 2019-10-22 | 辉达公司 | Neural network is indicated using the path in network to improve the performance of neural network |
Non-Patent Citations (1)
Title |
---|
XIAO RENZHI et al.: "A Survey on Data Consistency Research for Non-Volatile Memory", Journal of Computer Research and Development *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112631631A (en) * | 2020-12-29 | 2021-04-09 | 中国科学院计算机网络信息中心 | Update sequence maintenance method for GPU accelerated multi-step prefix tree |
CN112784117A (en) * | 2021-01-06 | 2021-05-11 | 北京信息科技大学 | High-level radix tree construction method and construction system for mass data |
CN112784117B (en) * | 2021-01-06 | 2023-06-02 | 北京信息科技大学 | Advanced radix tree construction method and construction system for mass data |
CN113626432A (en) * | 2021-08-03 | 2021-11-09 | 浪潮云信息技术股份公司 | Improvement method of self-adaptive radix tree supporting any Key value |
CN113626432B (en) * | 2021-08-03 | 2023-10-13 | 上海沄熹科技有限公司 | Improved method of self-adaptive radix tree supporting arbitrary Key value |
CN115438027A (en) * | 2022-11-07 | 2022-12-06 | 中水淮河规划设计研究有限公司 | Model library management system of C/S, B/S mixed architecture |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112000847B (en) | GPU parallel-based adaptive radix tree dynamic indexing method | |
CN110083601B (en) | Key value storage system-oriented index tree construction method and system | |
US4823310A (en) | Device for enabling concurrent access of indexed sequential data files | |
US7805427B1 (en) | Integrated search engine devices that support multi-way search trees having multi-column nodes | |
CN112000846B (en) | Method for grouping LSM tree indexes based on GPU | |
US20140337375A1 (en) | Data search and storage with hash table-based data structures | |
CN110888886B (en) | Index structure, construction method, key value storage system and request processing method | |
US8086641B1 (en) | Integrated search engine devices that utilize SPM-linked bit maps to reduce handle memory duplication and methods of operating same | |
US20040019737A1 (en) | Multiple-RAM CAM device and method therefor | |
Ramamohanarao et al. | Recursive linear hashing | |
Challa et al. | DD-Rtree: A dynamic distributed data structure for efficient data distribution among cluster nodes for spatial data mining algorithms | |
Li et al. | A gpu accelerated update efficient index for knn queries in road networks | |
CN100479436C (en) | Management and maintenance method for static multi-interface range matching table | |
US7987205B1 (en) | Integrated search engine devices having pipelined node maintenance sub-engines therein that support database flush operations | |
CN113779154B (en) | Construction method and application of distributed learning index model | |
CN112000845B (en) | Hyperspatial hash indexing method based on GPU acceleration | |
Fu et al. | GPR-Tree: a global parallel index structure for multiattribute declustering on cluster of workstations | |
Ghanem et al. | Bulk operations for space-partitioning trees | |
US7953721B1 (en) | Integrated search engine devices that support database key dumping and methods of operating same | |
Liu et al. | Pea hash: a performant extendible adaptive hashing index | |
CN111274456B (en) | Data indexing method and data processing system based on NVM (non-volatile memory) main memory | |
CA2439243C (en) | Organising data in a database | |
CN117131012B (en) | Sustainable and extensible lightweight multi-version ordered key value storage system | |
CN111949439B (en) | Database-based data file updating method and device | |
Huang et al. | A Primer on Database Indexing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||