CN111966678A - Optimization method for effectively improving B+ tree retrieval efficiency on GPU - Google Patents


Info

Publication number
CN111966678A
CN111966678A (application CN202010640423.5A)
Authority
CN
China
Prior art keywords
tree
gpu
node
query
array
Prior art date
Legal status
Pending
Application number
CN202010640423.5A
Other languages
Chinese (zh)
Inventor
张为华
蒋金虎
宋昶衡
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date
Filing date
Publication date
Application filed by Fudan University
Priority to CN202010640423.5A
Publication of CN111966678A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2246Trees, e.g. B+trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24532Query optimisation of parallel queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24534Query rewriting; Transformation
    • G06F16/24549Run-time optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/24569Query processing with adaptation to specific hardware, e.g. adapted for using GPUs or SSDs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of heterogeneous computing, and particularly relates to an optimization method for effectively improving B+ tree retrieval efficiency on a GPU. The invention comprises the following steps: designing a new B+ tree data structure, and designing a search method that improves the query efficiency of this data structure. The new B+ tree data structure splits the traditional B+ tree into two parts, a key region and a child node region, and replaces the bulky child node pointer information of the B+ tree with a compact prefix-sum array. The optimized search method comprises a sort-based search method and a search method based on reducing the thread group size. The method effectively resolves the mismatch between B+ tree programs and the GPU memory hierarchy, reduces memory-access and execution divergence on the GPU, and improves the resource utilization of B+ tree retrieval programs on the GPU. Experimental results show that query throughput in the HyperSpace system reaches about 3.5 billion queries per second, nearly 3.4 times higher than recent published results.

Description

Optimization method for effectively improving B+ tree retrieval efficiency on GPU
Technical Field
The invention belongs to the technical field of heterogeneous computing, and particularly relates to an optimization method for effectively improving B+ tree retrieval efficiency on a GPU (Graphics Processing Unit), which can be used for index structures in big data systems.
Background
In recent years, with the rapid development of internet technology and revolutionary progress in information technology, the world has entered a big data era. The amount of data generated annually worldwide has grown explosively: the amount of data produced globally in 2020 is expected to reach about 40 ZB (roughly 8,000 times that of 2003). Against this background, applications that analyze and predict with big data have become increasingly popular and play an ever more important role in people's lives; for example, many applications use big data processing technology to provide personalized recommendation services, making daily activities such as shopping and reading more convenient. However, while these applications bring many conveniences, they continuously face the challenge of exploding data volumes; for example, Netflix statistics from 2018 show that on an average day its stream processing system needs to upload about 12 PB of data to AWS data centers for processing.
In the face of exponential growth in data size, how to efficiently retrieve and process big data has become an important issue of general interest in industry. The main current solution is to organize data efficiently with index structures to improve retrieval and processing efficiency, such as indexing database tables with hash structures in SQL Server, or clustering and indexing data with B+ trees in the MySQL database. Among index structures, the B+ tree is the most important and most widely applied: it not only provides data access operations such as Search and Update in logarithmic time complexity, but also organizes the underlying data in order. As early as the 1970s, B+ trees were applied to various file systems to reduce the overhead of disk access. Today, as more application scenarios emerge, B+ tree structures are increasingly used in a variety of data systems, such as key-value storage systems, databases, and Online Analytical Processing (OLAP) systems. Therefore, as data volumes increase dramatically, the performance of the B+ tree index structure directly affects the performance of these systems.
With the advent of the big data era, the classic B+ tree index structure also faces many difficulties and challenges. First, the growth in data volume dramatically expands the size of the data structures associated with data storage, which places higher demands on data structure design; for example, Google's search engine must index hundreds of millions of web pages worldwide, i.e., web page data exceeding one billion GB in size. Second, in the big data era the number of data access requests grows, and with it the system's demand for concurrent request processing; for example, during Alibaba's 2017 "Double Eleven" Tmall shopping festival, the Alibaba Cloud system had to process millions of transaction operations per second. Therefore, in an era when both the amount of stored data and the number of concurrent requests grow exponentially, improving the retrieval efficiency of the B+ tree data structure has become one of the key problems in the systems field.
Meanwhile, with the rapid development of hardware technology, more and more new types of hardware have appeared and are widely used in various fields. Among them, graphics processing units (GPUs) were applied to graphics and image rendering as early as the 1990s thanks to their rich computing resources. With the rapid development of GPU architectures and programming platforms over the last decade, GPUs have gradually become one of the most commonly used many-core processors besides CPUs: they can not only handle graphics and image workloads, but also provide a general solution for highly parallel computing problems in other fields. Since the GPU has rich compute and memory resources, and concurrent B+ tree lookups exhibit both high parallelism and a large amount of memory access, accelerating the traditional B+ tree structure on the GPU is a potentially feasible solution. In recent years much research has focused on this direction, and results have been published in succession. However, even though the GPU provides far richer hardware resources than the CPU, existing research has not achieved the ideal processing effect.
The original B+ tree struggles to obtain an ideal speedup on the GPU mainly because the traditional B+ tree structure was designed and optimized for the CPU architecture. Facing a GPU architecture and programming model quite different from the CPU's, the traditional B+ tree structure adapts poorly, which creates several problems. First, for a conventional B+ tree, a query request must traverse layer by layer from the Root Node to a Leaf Node, and such traversal results in a large number of inefficient indirect address accesses. These indirect accesses usually require frequent reads of Global Memory, the memory with the highest access latency on the GPU, which greatly affects memory access and execution efficiency. Second, the target values of concurrently executed query requests are usually random, so the search paths of concurrent requests over the B+ tree differ greatly, and adjacent queries rarely share the same tree traversal path. When such requests are processed simultaneously by one thread bundle (Warp) on the GPU, this disparity leads to two kinds of divergence: Warp Divergence and Memory Divergence. Warp divergence hurts the computational efficiency of the GPU program, while memory divergence makes it difficult for the GPU to coalesce memory transactions, reducing the data transfer throughput of global memory on the GPU.
Based on this analysis of the mismatches between the current B+ tree structure and the GPU, the invention proposes a novel B+ tree data structure and an optimized search scheme, which together effectively improve the efficiency of B+ tree retrieval on the GPU.
Disclosure of Invention
The invention aims to provide an optimization method that effectively improves B+ tree retrieval efficiency on a GPU, so as to solve the mismatch between the current B+ tree structure and the GPU memory hierarchy.
The optimization method comprises the following steps: first, designing a new B+ tree data structure; then, on this basis, designing an optimized search method that improves the query efficiency of this data structure.
The invention first designs a new B+ tree data structure, called the HyperSpace tree structure. This structure divides the traditional B+ tree into two parts, a Key Region and a Child Region, and replaces the child node pointer information (child references) of the B+ tree with a much smaller prefix-sum array. Such a structure makes it possible to cache the tree in the small, low-latency memories on the GPU, while the positions of child nodes can be obtained through simple calculation.
A specific HyperSpace tree structure is shown in FIG. 1(b). The key region is organized as a one-dimensional array that stores the key information of all tree nodes of a conventional B+ tree. The node key information is stored in the key region array contiguously from left to right in breadth-first traversal order. As can be seen from FIG. 1(a) and FIG. 1(b), the first element (index 0) of the key region stores the key values of the Root Node of the original tree, the second element stores the key values of the first node of the second level, and so on. Each element in the array has a fixed size, equal to the size of the key-value portion of a conventional B+ tree node (i.e., for a fan-out of $m$, the $m-1$ keys it holds). The length of the array equals the number of tree nodes in the conventional B+ tree structure.
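As a concrete illustration of this layout, the following sketch flattens a conventional pointer-based B+ tree into a key-region array by breadth-first traversal. It is a minimal sketch, not the patent's actual code; the Node and KeyEntry types, the field names, and the fan-out constant are assumptions.

```cuda
// Hypothetical host-side flattening of a conventional B+ tree into the
// HyperSpace key region, in breadth-first (level) order.
#include <cstdint>
#include <queue>
#include <vector>

constexpr int FANOUT = 64;                 // example fan-out used later in the text
constexpr int KEYS_PER_NODE = FANOUT - 1;  // 63 keys per node

struct Node {                              // conventional pointer-based node
    int64_t keys[KEYS_PER_NODE];
    Node*   children[FANOUT];
    int     num_children;                  // 0 for leaf nodes
};

struct KeyEntry { int64_t keys[KEYS_PER_NODE]; };  // one key-region element

// key_region[i] holds the keys of the i-th node in BFS order.
std::vector<KeyEntry> flatten_key_region(Node* root) {
    std::vector<KeyEntry> key_region;
    std::queue<Node*> bfs;
    bfs.push(root);
    while (!bfs.empty()) {
        Node* n = bfs.front(); bfs.pop();
        KeyEntry e;
        for (int i = 0; i < KEYS_PER_NODE; ++i) e.keys[i] = n->keys[i];
        key_region.push_back(e);
        for (int c = 0; c < n->num_children; ++c) bfs.push(n->children[c]);
    }
    return key_region;   // length equals the total number of tree nodes
}
```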
In the HyperSpace tree structure, the concept of prefix sums (Prefix Sums) is introduced in the design of the child node region. The prefix sum is a widely applied data-processing pattern in computing, used for example in radix sort and in evaluating high-order recurrences. Each element of a prefix-sum array represents the result of accumulating all preceding elements, including the element itself. Constructing a prefix-sum array takes a binary associative operator $\oplus$ and a one-dimensional input array $[a_0, a_1, \ldots, a_{n-1}]$, and outputs the one-dimensional array $[a_0,\; a_0 \oplus a_1,\; \ldots,\; a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1}]$. If the binary operator is set to addition, the result is the most common Prefix-Sum Array.
In the child node region, each element of the prefix-sum array corresponds one-to-one with a key-value element of the key region; that is, the child information of the B+ tree is also stored contiguously in the same breadth-first traversal order as the key region. Each element of the array represents the cumulative sum of the numbers of children of all tree nodes preceding the current node in the level-order traversal (consistent with the prefix-sum concept). Each element also has a concrete physical meaning: the index position, in the key region, of the first child of the corresponding tree node. As can be seen from FIG. 1(b), the prefix-sum array generated from the conventional B+ tree of FIG. 1(a) is [1, 4, 6, 7, 9, …]. It means that the first child of Node 0 (the Root Node) is at index 1 of the key region array, the first child of Node 1 (the first node in the second level) is at index 4 (i.e., 1 + 3), the first child of Node 2 (the second node in the second level) is at index 6 (i.e., 1 + 3 + 2), and so on. From the prefix-sum array, the positions of all children of a node can be obtained efficiently by calculation, and the number of children of each node is obtained by subtracting the node's own element value from that of the following element.
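The child-region arithmetic just described can be sketched as follows; the function and variable names are assumptions, but the numbers reproduce the FIG. 1 example.

```cuda
// Hypothetical construction of the child region from per-node child counts.
// prefix[i] = key-region index of the first child of BFS node i.
#include <cstdint>
#include <vector>

std::vector<int32_t> build_child_region(const std::vector<int32_t>& num_children) {
    std::vector<int32_t> prefix(num_children.size());
    int32_t running = 1;                   // node 0 (the root) occupies index 0
    for (size_t i = 0; i < num_children.size(); ++i) {
        prefix[i] = running;               // first child of node i
        running += num_children[i];
    }
    return prefix;
}

// For FIG. 1(a), child counts [3, 2, 1, 2, ...] yield [1, 4, 6, 7, 9, ...]:
// the j-th child of node i is at key-region index prefix[i] + j, and
// node i has prefix[i + 1] - prefix[i] children.
```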
Since the child node information is organized as a prefix-sum array, the whole B+ tree structure is about 50% smaller than the traditional B+ tree. This allows the HyperSpace tree structure to make full use of the limited on-chip memory of the GPU for caching, replacing high-latency global memory accesses with low-latency cache accesses. In addition, because this design compresses a tree node's child information into a single integer value, query requests that visit the same tree node all access the same memory location, so B+ tree queries obtain better locality.
Based on the HyperSpace tree structure, the invention further designs an optimized search method that improves the query efficiency of the B+ tree data structure, comprising a sort-based search method and a search method based on reducing the thread group size. Batch update operations on the HyperSpace tree structure are executed on the CPU side, while the two optimized search methods and the concurrent query operations of the HyperSpace tree are executed on the GPU side; the tree structure is synchronized between the CPU and the GPU by a PCIe memory transfer after each CPU-side batch update completes. Specifically:
The sort-based search method sorts all query requests by target key before they start to execute. With high probability, sorted adjacent queries follow similar search paths, which reduces unnecessary memory divergence and query divergence during tree traversal. The resulting memory access pattern is shown in FIG. 2: for example, given the 4 queries 1, 20, 2, 35, the queries are first sorted, and adjacent queries form two groups, one being 1, 2 and the other 20, 35. A host-side sketch of this preprocessing step follows.
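This sketch presorts a batch of query keys on the GPU with Thrust (which radix-sorts integer keys) before the search kernel runs; the exact sorting routine the patent uses is not specified, so this is one plausible realization.

```cuda
// Presort queries so adjacent threads receive nearby target keys.
#include <thrust/device_vector.h>
#include <thrust/sort.h>

void presort_queries(thrust::device_vector<int64_t>& query_keys) {
    // After sorting, queries 1, 20, 2, 35 become 1, 2, 20, 35, so the
    // pairs (1, 2) and (20, 35) land in adjacent threads and are likely
    // to share most of their root-to-leaf traversal paths.
    thrust::sort(query_keys.begin(), query_keys.end());
}
```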
The searching method based on the thread group size reduces the number of threads required by one query request, can effectively reduce unnecessary comparison and improve the utilization rate of computing resources. As shown in fig. 3, for example, given two searches 2 and 6, the original thread group size is 8, then the comparison needs to be made across 8 threads. The searching method can reduce the size of the thread group to 4, and can effectively reduce the comparison times among threads. The search optimization based on the reduced thread group size uses a smaller number of threads to service a query request than the traditional GPU optimized B + tree scheme, so that invalid comparison times can be avoided, and meanwhile, the utilization rate of computing resources on the GPU is improved. When HyperSpace uses less number of threads to service a request, more query requests can be processed by a thread bundle at the same time, thus improving the query parallelism.
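The following is a minimal sketch of a single node-search step under a reduced thread group size; it is not the patent's kernel. GROUP_SIZE, the key layout, and the convention of descending at the first key greater than the target are assumptions.

```cuda
// A warp of 32 threads is split into groups of GROUP_SIZE threads; each
// group serves one query, and the group's threads compare GROUP_SIZE keys
// of a node in parallel using a warp vote.
constexpr int GROUP_SIZE = 4;   // reduced from 8, as in the FIG. 3 example

__device__ int group_find_child(const long long* node_keys, int num_keys,
                                long long target) {
    int lane      = threadIdx.x % 32;          // lane within the warp
    int group     = lane / GROUP_SIZE;         // which group this lane is in
    int lane_in_g = lane % GROUP_SIZE;         // position within the group
    unsigned gmask = ((1u << GROUP_SIZE) - 1u) << (group * GROUP_SIZE);

    int branch = num_keys;                     // default: rightmost child
    for (int base = 0; base < num_keys; base += GROUP_SIZE) {
        int k = base + lane_in_g;
        bool gt = (k < num_keys) && (node_keys[k] > target);
        // Each group compares GROUP_SIZE keys with one ballot instruction.
        unsigned vote = __ballot_sync(gmask, gt) >> (group * GROUP_SIZE);
        if (vote != 0) { branch = base + __ffs(vote) - 1; break; }
    }
    return branch;   // index of the first key > target, i.e. the child to take
}
```

With GROUP_SIZE reduced from 8 to 4, a node of 8 keys takes at most two ballot steps per group, but a warp now serves 8 queries instead of 4, which is the parallelism gain the text describes.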
The technical effects are as follows:
the invention mainly takes the prior heterogeneous B + Tree accelerated research result HB + Tree as a comparison object of technical effect and carries out detailed test on HyperSpace based on NVIDIA TITAN V experimental platform.
In the actual comparison of technical effects, the HyperSpace can reach the query throughput rate of 35 hundred million times per second, which is improved by nearly 3.4 times compared with the existing research result (HB + Tree). The HyperSpace B + Tree structure improves the query performance by nearly 1.4 times compared with the HB + Tree, which is mainly attributed to that the HyperSpace B + Tree brings better query execution locality and saves 50% of volume, and the structure provides possibility for better utilizing the multi-level on-chip high-efficiency cache on the GPU. Secondly, the query performance of the HyperSpace B + Tree structure plus the preprocessing operation optimization (HyperSpace Tree + PSA) based on sorting is improved by 2 times compared with that of the HB + Tree, and the performance improvement is mainly attributed to the fact that query sorting is carried out in advance, so that query requests processed in the same thread bundle can access similar Tree traversal paths, and the thread bundle divergence problem and the memory access divergence problem which affect the execution efficiency of the GPU are effectively reduced. Finally, by adding the preprocessing operation (PSA) based on the ordering and the search optimization operation (NTG) based on the thread group to the HyperSpace B + Tree structure, the query efficiency is improved by about 3.4 times compared with that of the HB + Tree, and the improvement of the step is mainly attributed to that the unnecessary comparison operation is reduced by the search optimization operation based on the thread group, the calculation resource of the existing GPU is fully utilized, and the concurrency of B + Tree query on the GPU is maximized.
In addition, based on the HyperSpace query performance results under different configuration conditions, the HyperSpace design has good expansibility, and an ideal query effect can be obtained under different configurations. In a word, the HyperSpace design can effectively improve the retrieval efficiency of the system.
Drawings
FIG. 1 shows a conventional B+ tree structure and the HyperSpace structure: (a) is a traditional B+ tree node structure, and (b) is the HyperSpace tree structure.
FIG. 2 is an access pattern in a memory access for a partially ordered query request.
FIG. 3 is a search optimization scheme based on thread group size.
FIG. 4 is an overall overview of the method.
FIG. 5 is a HyperSpace architecture implementation.
Detailed Description
In a heterogeneous system, the CPU and the GPU usually execute in stages: in general, the CPU executes the parts with complex control logic, while data is uploaded to the GPU in stages and batches of highly parallel operations are then executed there.
The specific workflow of the method in a heterogeneous system is shown in FIG. 4. Batch update operations on the HyperSpace tree structure are executed on the CPU side, while the two optimized search methods and the concurrent query operations of the HyperSpace tree are executed on the GPU side. The tree structure is synchronized between the CPU and the GPU by a PCIe memory transfer after each CPU-side batch update completes, as sketched below.
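The following is a hedged sketch of the staged execution in FIG. 4; the buffer names are assumptions, and it reuses the hypothetical KeyEntry layout and int32_t child region from the earlier sketches.

```cuda
// After each CPU-side batch update, the two flat arrays are pushed to the
// GPU over PCIe; queries then run against the synchronized structure.
#include <cstdint>
#include <cuda_runtime.h>
#include <vector>

void sync_tree_to_gpu(const std::vector<KeyEntry>& key_region,
                      const std::vector<int32_t>&  child_region,
                      KeyEntry* d_keys, int32_t* d_children) {
    // One PCIe transfer per array per batch update.
    cudaMemcpy(d_keys, key_region.data(),
               key_region.size() * sizeof(KeyEntry), cudaMemcpyHostToDevice);
    cudaMemcpy(d_children, child_region.data(),
               child_region.size() * sizeof(int32_t), cudaMemcpyHostToDevice);
    // ... then launch the GPU-side presort and concurrent search kernels.
}
```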
Tree structure
The HyperSpace tree structure divides the traditional B+ tree structure into two parts, a key region and a child node region, and replaces the conventional child-reference area with a prefix-sum array. As shown in FIG. 5, the HyperSpace system uses two-level B+ tree nodes whose index and key-value parts are stored together as the array entries of the HyperSpace B+ tree Key Region, while the child region still stores the child information of a node as a single integer value. In this way, both key region accesses and child region accesses achieve better memory performance on the GPU. The two-level B+ tree nodes of the HyperSpace key region are implemented as two-level indexed B+ tree nodes aligned to the cache line size (64 bytes). A key region element is shown in FIG. 5; it is based on a B+ tree with a fan-out of 64, so a tree node has 63 key values (keys) and 64 child references (child). To preserve cache line alignment, a null key is appended at the end of the key value area, denoted $key_{63}$ in the figure (a padding value larger than any real key). In addition, the implementation superimposes an index layer on the traditional B+ tree node structure, generated from the key values of the lower layer: each index key represents the maximum value of the corresponding key-value region (8 consecutive key values form one region), i.e., $index_i = \max(key_{8i}, \ldots, key_{8i+7})$; an empty index entry is likewise appended to the index area for cache line alignment. When a node is queried, the index area is traversed first to determine which part of the key value area the target key lies in, and the target key is then searched for within that region only. Although this implementation introduces key redundancy and increases the key-related volume of the B+ tree, it effectively secures cache utilization and reduces unnecessary cache line replacement.
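The two-level in-node lookup can be sketched as follows. It is a minimal sketch under stated assumptions: the HyperNode field names are hypothetical, and the pad key is modeled as LLONG_MAX so the region scan always terminates.

```cuda
// 63 real keys plus one pad give 64 keys = 8 regions of 8 keys; an 8-entry
// index layer stores each region's maximum, so an in-node lookup touches at
// most one index line and one 8-key region.
#include <climits>

struct HyperNode {
    long long index[8];   // index[i] = max of keys[8*i .. 8*i+7]; last entry padded
    long long keys[64];   // 63 keys; keys[63] = LLONG_MAX (the null key)
};

__device__ int node_lookup(const HyperNode& n, long long target) {
    // Step 1: the index layer narrows the search to one 8-key region.
    int region = 0;
    while (region < 7 && n.index[region] < target) ++region;
    // Step 2: scan that region; its maximum >= target guarantees the scan stops.
    int pos = region * 8;
    while (n.keys[pos] < target) ++pos;
    return pos;   // position of the first key >= target within the node
}
```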
Sort-based search optimization
In the sort-based search strategy, query requests are fully sorted before actually executing on the GPU, to increase the probability that requests share traversal paths. However, a full sort carries a non-negligible time overhead; to keep the query performance gain above the sorting cost, a partial sorting strategy is introduced. The partial sorting strategy uses a partial bit-based sorting algorithm (unlike exact comparison sorting, a completely precise order is not required here: the sorting only serves the search, so a rough order suffices to improve search efficiency); that is, sorting only the highest $N$ bits of each query key achieves a query effect close to a full sort while effectively reducing the sorting time overhead. To determine the number of sorted bits in the partial sorting algorithm, a model is used to assist the selection of this parameter; the appropriate number of sort bits is then computed from the model and the configuration of the actual HyperSpace B+ tree.

For the HyperSpace system, each integer key value is represented by 64 bits ($B = 64$), and each GPU cache line is 128 bytes and can cache 16 keys ($K = 16$); thus, for a B+ tree of size $2^{23}$ ($T = 2^{23}$), the model below yields that sorting only 19 bits ($N = 19$) achieves the partial ordering of query requests without excessive sorting overhead:

$$N = \min\!\left(B,\ \left\lceil \log_2 \frac{T}{K} \right\rceil\right)$$

(with $T = 2^{23}$ and $K = 16$ this gives $N = 23 - 4 = 19$).
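The sketch below shows one way to realize partial bit sorting with CUB's radix sort, whose begin_bit/end_bit parameters restrict the sort to the top $N$ bits. The model function mirrors the formula above, which is itself a reconstruction matched to the worked numbers in the text, so both are assumptions rather than the patent's exact implementation.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cmath>

// Model-assisted choice of the number of sort bits.
int sort_bits(double T, double K, int B) {
    int n = static_cast<int>(std::ceil(std::log2(T / K)));
    return n < B ? n : B;                 // N cannot exceed the key width
}

// Radix-sort only bits [64 - N, 64), i.e. the highest N bits of each key.
void partial_sort_queries(const unsigned long long* d_in,
                          unsigned long long* d_out, int n, int N) {
    void*  d_temp = nullptr;
    size_t temp_bytes = 0;
    // First call computes the scratch size; second call performs the sort.
    cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, d_in, d_out, n, 64 - N, 64);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, d_in, d_out, n, 64 - N, 64);
    cudaFree(d_temp);
}
```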
Search optimization based on reducing thread group size
In the implementation of the search strategy based on reducing the thread group size, the optimization reduces the number of threads serving one query request. Such an implementation helps reduce unnecessary comparisons and increases the usage of GPU compute resources.
In this implementation, the thread group size for each query is set to 1, and each thread bundle processes at most N queries simultaneously (N being the maximum parallelism of thread bundle processing on the GPU). To choose this configuration, the number of execution steps S of 1000 queries under different thread group sizes was collected on the CPU side, and the most suitable thread group size turned out to be 1 thread. Although the fan-out of the current B+ tree is 64, because of the index layer the effective fan-out can be regarded as 8; this is consistent with the query execution efficiency of a fan-out-8 B+ tree under different thread group sizes verified by our experiments. Therefore, the HyperSpace implementation uses 1 thread to serve each query, i.e., each thread bundle processes 32 queries simultaneously, which also achieves the maximum parallelism of thread bundle processing on the GPU. A kernel sketch of this configuration follows.
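The following sketch shows the group-size-1 configuration, reusing the hypothetical HyperNode layout and node_lookup helper from the earlier sketch; the descent convention (take the child at the position of the first key >= target) and the result encoding are assumptions.

```cuda
// One thread per query: a warp of 32 threads serves 32 queries at once.
__global__ void hyperspace_search(const HyperNode* key_region,
                                  const int*       child_region,  // prefix sums
                                  const long long* queries,
                                  int*             results,
                                  int num_queries, int num_levels) {
    int q = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per query
    if (q >= num_queries) return;
    long long target = queries[q];
    int node = 0;                                    // BFS index of the root
    for (int level = 0; level + 1 < num_levels; ++level) {
        int pos = node_lookup(key_region[node], target);
        node = child_region[node] + pos;             // descend via the prefix sum
    }
    // Encode the hit as (leaf node index, slot within the leaf).
    results[q] = node * 64 + node_lookup(key_region[node], target);
}
```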

Claims (4)

1. An optimization method for effectively improving B+ tree retrieval efficiency on a GPU, characterized by comprising: first, designing a new B+ tree data structure; then, designing an optimized search method for improving the query efficiency of the B+ tree data structure;
the new B+ tree data structure divides the traditional B+ tree into two parts, a key region and a child node region, and replaces the bulky child node pointer information of the B+ tree with a compact prefix-sum array; this structure makes it possible to cache the tree in the small low-latency memories on the GPU, and the positions of child nodes can be obtained through simple calculation; the new B+ tree data structure is called the HyperSpace tree structure;
the key region is organized as a one-dimensional array storing the key information of all tree nodes of a traditional B+ tree; the node key information is stored in the key region array contiguously from left to right in breadth-first traversal order; each element corresponds to a node of the traditional B+ tree: the first element of the key region (index 0) stores the key values of the root node of the original tree, the second element stores the key values of the first node of the second level, and so on; each element in the array has a fixed size, namely the size of the key-value portion of a traditional B+ tree node; the length of the array equals the number of tree nodes in the traditional B+ tree structure;
the child node region introduces the concept of the prefix sum, in which each element of the array represents the result of accumulating all preceding elements, including the element itself; constructing the prefix-sum array takes a binary associative operator $\oplus$ and a one-dimensional input array $[a_0, a_1, \ldots, a_{n-1}]$, and outputs the one-dimensional array $[a_0,\; a_0 \oplus a_1,\; \ldots,\; a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1}]$;
in the child node region, each element of the prefix-sum array corresponds one-to-one with a key-value element of the key region, i.e., the child information of the B+ tree is stored contiguously in the same breadth-first traversal order as the key region; each element of the array represents the cumulative sum of the numbers of children of all tree nodes preceding the current node in the level-order traversal; each element also has a concrete physical meaning, namely the index position, in the key region, of the first child of the corresponding tree node;
according to the prefix-sum array, the positions of all children of a node can be obtained efficiently by calculation, and the number of children of each node is obtained by subtracting the node's own element value from that of the following element.
2. The optimization method according to claim 1, wherein the optimized search method for improving the query efficiency of the B+ tree data structure comprises a sort-based search method and a search method based on reducing the thread group size; batch update operations on the HyperSpace tree structure are executed on the CPU side, while the two optimized search methods and the concurrent query operations of the HyperSpace tree are executed on the GPU side; the tree structure is synchronized between the CPU and the GPU by a PCIe memory transfer after each CPU-side batch update completes; wherein:
the sort-based search method sorts all query requests by target key before the query requests start to execute; with high probability, sorted adjacent queries follow similar search paths, reducing unnecessary memory divergence and query divergence during tree traversal;
the search method based on the thread group size reduces the number of threads required by one query request, so as to effectively reduce unnecessary comparisons and improve the utilization of compute resources.
3. The optimization method according to claim 2, wherein the sort-based search method adopts a partial sorting strategy, i.e., a bit-based sorting algorithm in which only the highest $N$ bits of each query request are sorted, obtaining a query effect equivalent to that of a full sort; to determine the number of sorted bits in the partial sorting algorithm, a model is used to assist the selection of this parameter, with the specific formula:
$$N = \min\!\left(B,\ \left\lceil \log_2 \frac{T}{K} \right\rceil\right),$$
where $B$ is the bit width of each integer key value, $K$ is the number of keys held by one GPU cache line, and $T$ is the size of the B+ tree.
4. The optimization method according to claim 2, wherein in the search method based on the thread group size, the thread group size of each query is set to 1, and each thread bundle processes at most N queries simultaneously, where N is the maximum parallelism of thread bundle processing on the GPU.
CN202010640423.5A 2020-07-06 2020-07-06 Optimization method for effectively improving B + tree retrieval efficiency on GPU Pending CN111966678A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010640423.5A CN111966678A (en) 2020-07-06 2020-07-06 Optimization method for effectively improving B + tree retrieval efficiency on GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010640423.5A CN111966678A (en) 2020-07-06 2020-07-06 Optimization method for effectively improving B + tree retrieval efficiency on GPU

Publications (1)

Publication Number Publication Date
CN111966678A true CN111966678A (en) 2020-11-20

Family

ID=73361586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010640423.5A Pending CN111966678A (en) 2020-07-06 2020-07-06 Optimization method for effectively improving B + tree retrieval efficiency on GPU

Country Status (1)

Country Link
CN (1) CN111966678A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109446293A (en) * 2018-11-13 2019-03-08 嘉兴学院 A kind of parallel higher-dimension nearest Neighbor
CN110888886A (en) * 2019-11-29 2020-03-17 华中科技大学 Index structure, construction method, key value storage system and request processing method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Weihua Zhang et al., "A High Throughput B+tree for SIMD Architectures", IEEE Transactions on Parallel and Distributed Systems *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112631631A (en) * 2020-12-29 2021-04-09 中国科学院计算机网络信息中心 Update sequence maintenance method for GPU accelerated multi-step prefix tree
CN112631631B (en) * 2020-12-29 2021-11-16 中国科学院计算机网络信息中心 Update sequence maintenance method for GPU accelerated multi-step prefix tree
CN112905598A (en) * 2021-03-15 2021-06-04 上海交通大学 Interface-based graph task intermediate result storage method and system for realizing separation
CN112905598B (en) * 2021-03-15 2022-06-28 上海交通大学 Interface-based graph task intermediate result storage method and system for realizing separation
CN113204559A (en) * 2021-05-25 2021-08-03 东北大学 Multi-dimensional KD tree optimization method on GPU
CN113204559B (en) * 2021-05-25 2023-07-28 东北大学 Multidimensional KD tree optimization method on GPU
WO2023038687A1 (en) * 2021-09-08 2023-03-16 Intel Corporation In-network parallel prefix scan


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201120