CN104050175A - Parallel method for realizing two-dimension data neighbor search by using GPU (graphics processing unit) on-chip tree mass - Google Patents

Parallel method for realizing two-dimension data neighbor search by using GPU (graphics processing unit) on-chip tree mass

Info

Publication number
CN104050175A
CN104050175A (application CN201310078697.XA)
Authority
CN
China
Prior art keywords
gpu
quadtree
search
data
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310078697.XA
Other languages
Chinese (zh)
Inventor
易卫东
菅立恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Chinese Academy of Sciences
Original Assignee
University of Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Chinese Academy of Sciences filed Critical University of Chinese Academy of Sciences
Priority to CN201310078697.XA priority Critical patent/CN104050175A/en
Publication of CN104050175A publication Critical patent/CN104050175A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24532 Query optimisation of parallel queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/24569 Query processing with adaptation to specific hardware, e.g. adapted for using GPUs or SSDs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a parallel method for two-dimensional-data neighbor search using a GPU (graphics processing unit) on-chip quadtree forest ("tree mass"), and relates to an efficient quadtree-forest data structure based on NVIDIA's GPU+CUDA (compute unified device architecture) computing architecture and a parallel neighbor search algorithm based on this data structure. The method can provide performance acceleration for neighbor search in a variety of applications. According to the invention, the method comprises the following steps: the data is copied to GPU global memory; a quadtree forest is built in GPU on-chip memory, and the data points are organized into the forest; several quadtrees are searched in parallel in GPU on-chip memory, and neighbors within a certain range are found for each data point; further computation is performed on the found neighbors according to the concrete application requirements. The parallel neighbor search method has excellent performance and important application value in fields such as image processing and geographic information systems.

Description

Parallel method for two-dimensional-data neighbor search using a GPU on-chip quadtree forest
Technical field
The present invention relates to tree data structures and parallel computing, and specifically to the construction of quadtree forests, neighbor search based on this data structure, and programming on the GPU+CUDA framework.
Background technology
A quadtree is a multi-level tree data structure for organizing two-dimensional spatial data, in which each non-leaf node has at most four child nodes. It regularly divides a square region into four subregions; each subregion is further divided into four subregions, and so on until every subregion contains at most one data point. The advantage of the quadtree is that spatial relationships are embedded in the data model, making retrieval and processing very fast. The data structure is widely used for neighbor search in fields such as image processing, geographic information systems, and robotics.
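The subdivision rule described above can be sketched sequentially. The following minimal C++ quadtree is an illustration only, not the patent's code; all names and the depth cap are assumptions. Any leaf that would hold more than one point is split into four equal quadrants:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// A point in the unit square.
struct Point { double x, y; };

// A region quadtree node covering the square [cx-half, cx+half] x [cy-half, cy+half].
struct QuadNode {
    double cx, cy, half;                 // center and half side length
    std::vector<Point> pts;              // points stored here (leaf only)
    std::unique_ptr<QuadNode> child[4];  // NW, NE, SW, SE (null for a leaf)
    bool is_leaf() const { return !child[0]; }
};

// Insert a point, splitting any leaf that would hold more than one point,
// as in the regular subdivision the quadtree definition describes.
void insert(QuadNode* n, Point p, int depth = 0) {
    if (n->is_leaf()) {
        if (n->pts.empty() || depth > 20) {  // depth cap guards duplicate points
            n->pts.push_back(p);
            return;
        }
        // Split: create four equal quadrants, push the resident point down.
        double q = n->half / 2;
        const double ox[4] = {-q, q, -q, q}, oy[4] = {q, q, -q, -q};
        for (int i = 0; i < 4; ++i) {
            n->child[i] = std::make_unique<QuadNode>();
            n->child[i]->cx = n->cx + ox[i];
            n->child[i]->cy = n->cy + oy[i];
            n->child[i]->half = q;
        }
        Point old = n->pts[0];
        n->pts.clear();
        insert(n, old, depth);               // re-route the resident point
    }
    // Route the point into the quadrant that contains it.
    int i = (p.x >= n->cx ? 1 : 0) + (p.y < n->cy ? 2 : 0);
    insert(n->child[i].get(), p, depth + 1);
}
```

After inserting all points, spatially close points end up in sibling leaves, which is the property the search stage exploits.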
The graphics processing unit (GPU) is a rapidly developing parallel computing platform. It originated in graphics rendering, and high-level-language support for convenient GPU programming has since appeared. NVIDIA provides the Compute Unified Device Architecture (CUDA) programming model for its GPUs, with a C-like programming interface. A GPU combined with CUDA offers powerful computing capability and very high memory bandwidth, and is well suited to highly parallel, compute-intensive applications. For example, the NVIDIA Tesla C2070 has a single-precision floating-point peak of 1.03 TFLOPS and a peak bandwidth of 144 GB/s. Despite this computing power, NVIDIA GPUs are inexpensive and commonly found in machines no more costly than PCs or workstations. The GPU+CUDA framework has now been applied to accelerate compute-intensive processing in many fields of scientific computing.
At the hardware level, a CUDA-capable GPU is a set of single-instruction multiple-data (SIMD) streaming multiprocessors, each with 32 stream processors. For example, the NVIDIA Tesla C2070 has 14 streaming multiprocessors, for a total of 448 stream processors. Each streaming multiprocessor has a small amount of fast on-chip shared memory, shared by all of its stream processors, and each stream processor has a number of 32-bit registers. Streaming multiprocessors communicate through the higher-latency off-chip global/device memory. Global memory can be read and written by the host and is consistent across different kernel invocations of the same program. Shared memory is managed explicitly by the programmer. Compared with a CPU, a GPU devotes more of its transistors to compute units, so its floating-point peak is an order of magnitude higher than a CPU's; NVIDIA's optimizations likewise make GPU bandwidth an order of magnitude higher than CPU bandwidth.
At the software level, the CUDA model is a collection of a large number of threads executing in parallel, and a CUDA program runs in a thread-parallel fashion. The basic task unit scheduled by the host and actually executed on the GPU is called a kernel, whose form and function resemble a function defined in C. Computation is organized into a grid of thread blocks (see Fig. 2). At the instruction level, 32 consecutive threads in the same thread block form the minimum execution unit, called a warp. A thread block is a batch of warps running SIMD-parallel and synchronously on the same streaming multiprocessor, and each streaming multiprocessor concurrently executes one or more thread blocks. Each thread uses its index to determine which data it should process. Threads in the same thread block communicate through the on-chip shared memory.
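The index-to-data mapping described here can be modeled sequentially. The sketch below is not CUDA code; it is a plain C++ model (all names invented) in which host-side loops stand in for the parallel grid, and each "thread" derives the element it owns from its block index and thread index:

```cpp
#include <cassert>
#include <vector>

// Model of a kernel body: "thread" (block, tid) doubles one array element.
void kernel_body(int block, int tid, int blockDim, std::vector<int>& data) {
    int i = block * blockDim + tid;   // global thread index -> data index
    if (i < (int)data.size())         // guard: the grid may overshoot the data
        data[i] *= 2;
}

// Host-side "launch": nested loops stand in for the grid of thread blocks,
// each block a batch of blockDim threads.
void launch(int gridDim, int blockDim, std::vector<int>& data) {
    for (int b = 0; b < gridDim; ++b)
        for (int t = 0; t < blockDim; ++t)
            kernel_body(b, t, blockDim, data);
}
```

In real CUDA the two loops disappear: every (block, thread) pair runs concurrently, and `block * blockDim + tid` corresponds to `blockIdx.x * blockDim.x + threadIdx.x`.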
Building the quadtree on NVIDIA's GPU+CUDA parallel computing framework can greatly improve the performance of quadtree-based neighbor search. However, the CUDA computation model has been considered ill-suited to irregular computation such as tree construction and tree search, and to date little work has used CUDA to build tree data structures on the GPU. Kun Zhou et al. built a KD-tree for multidimensional data on the GPU, storing the tree's node information in GPU global memory, and used this data structure to accelerate ray tracing and K-nearest-neighbor search in image rendering. A hybrid quadtree construction method first builds the top several levels of the tree on the CPU, then transfers the data to GPU global memory and builds the remaining levels there. This method also builds the quadtree in the higher-latency GPU global memory, and no formally published paper or released code is available for reference.
In celestial motion simulation, computing the gravitational interactions between bodies involves finding a large number of neighbors of each body, in particular the nearby bodies that exert the strongest gravitational influence on it. Martin Burtscher and Keshav Pingali implemented a high-performance parallel N-body simulation algorithm on NVIDIA's GPU+CUDA framework, building an octree in the simulation to accelerate the neighbor lookup for each body. Experiments showed that this CUDA-based octree greatly improved the performance of the simulation algorithm. However, this tree is still built in the higher-latency global memory.
In research on this problem, existing work concentrates on building a single tree data structure in GPU global memory. This strategy causes a large number of inefficient, unoptimizable accesses to global memory, and because threads compete to insert data points into one tree, the thread parallelism of the CUDA program is low and performance suffers greatly.
Summary of the invention
The object of the present invention is to provide an efficient quadtree forest based on NVIDIA's GPU+CUDA computing architecture and a parallel neighbor search method based on this data structure, providing performance acceleration for neighbor search in a variety of applications.
To achieve the above object, a parallel method for two-dimensional-data neighbor search using a GPU on-chip quadtree forest comprises the following steps:
(1) copy the data to be organized to GPU global memory;
(2) build a quadtree forest in GPU on-chip memory, organizing the data points into the forest;
(3) search several quadtrees in GPU on-chip memory in parallel, finding for each data point its neighbors within a certain range;
(4) perform further computation on the found neighbors according to the application's needs.
Unlike tree data structures previously implemented on GPUs, the present invention builds and searches the quadtrees in the fast but limited GPU on-chip memory, greatly improving memory-access efficiency. Moreover, the forest assigns one small quadtree to each CUDA thread block, so thread parallelism is many times higher than with a single large quadtree, improving construction efficiency. The quadtree-forest-based neighbor search algorithm performs very well and has important application value in fields such as image processing and computer simulation.
Brief description of the drawings
Fig. 1: Structure for two-dimensional-data neighbor search using the GPU on-chip quadtree forest
Fig. 2: Sequential execution of host-side code and parallel execution of device-side code
Fig. 3: Two-dimensional hierarchical recursive division of a region cell
Fig. 4: Quadtree organization of the data points of a region cell
Embodiments
The core idea of the present invention is to build a data structure residing in the GPU's high-speed on-chip memory, the quadtree forest, use it to organize the data points of various applications, and use it during computation as an accelerating data structure to improve the performance of neighbor search over all data points.
The construction of the data structure, the data structure itself, and the algorithm that uses it to search the neighbors of the data points in parallel are described in detail below with reference to the drawings and to pseudocode.
(1) Organization of the data points
The quadtree forest is a forest composed of a large number of small CUDA quadtrees. The present invention divides the entire region covered by the data points into a large number of square region cells, such that each cell contains no more than a set threshold number of data points (in the experimental tests the threshold is 64, i.e. each region cell contains at most 64 data points). The data points of each region cell are organized with one CUDA quadtree. As shown in Fig. 3, the region cell is first divided into four smaller, equal subcells, and each subcell is then divided again iteratively until it contains no more than one data point. Finally, all data points in the region cell of Fig. 3 are organized into the CUDA quadtree shown in Fig. 4, in which spatially close data points are sibling nodes in the tree.
Compared with one large tree, the forest maximizes CUDA thread parallelism during construction; and each CUDA quadtree is small enough to reside entirely in the fast GPU on-chip memory, avoiding the high latency of global-memory access and bringing a considerable performance gain.
(2) Building the quadtree forest in GPU on-chip memory
First, the data set is copied from host memory to GPU global memory. Based on the extent of the data and an empirical estimate of its distribution density, the data region is divided into region cells. Each thread takes one data point and assigns it to a region cell according to its coordinates. If the number of data points in a region cell exceeds the set threshold, the cell is split into four smaller region cells; otherwise, the assignment of data points stops.
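The per-thread cell assignment can be sketched sequentially. The C++ below is an illustration under assumptions (a uniform g x g grid over the unit square; all names invented); a loop stands in for the threads, each of which handles one point:

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

struct Pt { double x, y; };

// Map a point to its region-cell index on a uniform g x g grid over
// [0,1) x [0,1). In the method each CUDA thread does this for the one
// point it holds.
int cell_of(const Pt& p, int g) {
    int cx = std::min((int)(p.x * g), g - 1);   // clamp boundary points
    int cy = std::min((int)(p.y * g), g - 1);
    return cy * g + cx;                         // row-major cell index
}

// Bucket all points into cells; if any cell exceeds the threshold (64 in
// the experiments), the grid would be refined further, signalled here by
// the return value.
bool bucket(const std::vector<Pt>& pts, int g, int threshold,
            std::map<int, std::vector<Pt>>& cells) {
    cells.clear();
    for (const Pt& p : pts) cells[cell_of(p, g)].push_back(p);
    for (auto& kv : cells)
        if ((int)kv.second.size() > threshold) return false;  // needs split
    return true;
}
```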
When generating the quadtree forest, each CUDA thread block is responsible for one region cell and builds one CUDA quadtree; within it, each thread holds one data point and inserts it into the tree. To facilitate neighbor search in a CUDA quadtree, the center point and side length of each subcell form a "virtual node" representing that region, i.e. a non-leaf node of the CUDA quadtree. For example, the "root virtual node" formed from the region cell's center point and side length represents the whole region covered by the cell. If a region cell contains at most n data points, the generated quadtree occupies at most n*4+1 slots, so an array of this fixed size is allocated in shared memory to hold the quadtree information. The information array is filled backward: when a new data point is inserted into the CUDA quadtree, it is stored in the empty slot just before the last stored data point.
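A rough sequential C++ model of this fixed-size layout follows. The slot contents and all names are illustrative assumptions (the patent stores the array in CUDA shared memory): virtual nodes grow from the front of the array, while data points fill backward from the end, as the text describes:

```cpp
#include <cassert>
#include <vector>

// Fixed-size quadtree information array for a cell holding at most n points,
// sized n*4+1 as in the text. Non-leaf "virtual nodes" (cell center + side
// length) grow from the front; data points fill backward from the end, each
// new point landing just before the last one stored.
struct TreeArray {
    struct Slot { double a, b, c; bool used = false; };  // node or point
    std::vector<Slot> slots;
    int next_node = 0;       // next free slot for a virtual node
    int next_point;          // next free slot for a data point (moves left)

    explicit TreeArray(int n) : slots(4 * n + 1), next_point(4 * n) {}

    int push_node(double cx, double cy, double half) {
        slots[next_node] = {cx, cy, half, true};
        return next_node++;
    }
    int push_point(double x, double y) {
        slots[next_point] = {x, y, 0.0, true};
        return next_point--;
    }
    bool full() const { return next_node > next_point; }  // fronts have met
};
```

Because the array has a fixed worst-case size, a thread block can allocate it once in shared memory and later write it to global memory in a single coalesced transfer.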
As shown in the pseudocode of Table 1, inserting a data point is a repeated-attempt process in which threads compete for the chance to insert. First, thread 0 initializes the "root virtual node" and creates four empty leaf nodes. Then each thread locates a suitable leaf node for insertion according to the position of its data point within the region cell. Once such a position is found, the thread that wins the competition acquires a unique write lock on it, exclusively occupies the position, inserts its data point (writing it into the information array), releases the position, and exits the insertion. A thread that loses the competition and fails to lock a suitable leaf node waits and retries the insertion at the next opportunity. In a later round, however, the leaf position a thread finds and locks may already be occupied by another inserted data point; in that case the thread creates a subtree with four empty leaf nodes, inserts it at this position, moves the original data point and its own data point into leaf nodes of this subtree, releases the position, and exits the insertion. The remaining threads keep retrying as shown in Table 1 until their data points have been inserted into leaf nodes of the CUDA quadtree. Finally, the threads of the block cooperatively write the quadtree information array to global memory in one coalesced, memory-access-optimized transfer.
Table 1: Pseudocode example of data-point insertion in GPU on-chip memory
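Table 1 itself is not reproduced in this text. As a rough illustration of the competitive locking it describes, here is a sequential C++ model using an atomic compare-and-swap; the slot layout, the round-robin retry policy, and all names are assumptions, and the real method locks quadtree leaf positions in CUDA shared memory rather than a flat slot array:

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Model of the competitive insertion: each thread repeatedly tries to lock
// a leaf slot with a compare-and-swap (the CUDA version uses an atomic lock
// in shared memory); the winner writes its point and releases the slot,
// losers retry until they find a free slot.
struct Leaves {
    static constexpr int kFree = -1, kLocked = -2;
    std::vector<std::atomic<int>> slot;  // kFree, kLocked, or owner point id
    explicit Leaves(int n) : slot(n) { for (auto& s : slot) s = kFree; }

    // Try leaves round-robin from `start` until the point is placed.
    // Precondition: at least one slot is free, or this loops forever.
    void insert(int point_id, int start) {
        for (int i = start;; i = (int)((i + 1) % slot.size())) {
            int expect = kFree;
            if (slot[i].compare_exchange_strong(expect, kLocked)) {
                slot[i] = point_id;   // write the point, releasing the lock
                return;
            }                          // occupied or locked: try the next leaf
        }
    }
};
```

In the patent's scheme a losing thread does not merely move on: when its chosen leaf is already occupied by a point, it splits that leaf into a four-leaf subtree, which this flat model omits for brevity.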
(3) Searching the CUDA quadtrees in GPU on-chip memory
Four minutes woodlots be take space proximity and as yardstick, data point are organized, and in search, can determine rapidly each data point neighbours' data point around with less computing cost.According to space length, the close region in the responsible region lattice of each CUDA thread block in the certain limit of all data points can directly determine, i.e. these region lattice and some neighbours region lattice thereof.Afterwards, CUDA thread block is searched for tetra-minutes trees of CUDA that these region lattice are corresponding one by one, concurrently searches their data point of closing on for all data points in these region lattice.Tenesmus in tetra-minutes trees of CUDA is searched darker, needs the scope of search less.When region lattice of search, in thread block, the first collaborative disposable data fusion mode with memory access efficiency optimization of thread is read in the shared storage of GPU, the search of then setting in shared storage at a high speed by tetra-minutes tree information arrays of corresponding CUDA from overall situation storage.
The pseudocode of Table 2 shows how one warp of a CUDA thread block searches a quadtree depth-first, concurrently finding nearby points for all data points held by the threads of the warp. The search maintains two pieces of tracking information: the tree depth from the root to the currently visited node, and a stack recording, for each depth, the visiting order of the node among its siblings. After the CUDA quadtree to be searched is loaded into shared memory, all threads of the warp descend from the root along the same path. When the visited position is an internal non-leaf node of the tree, a "virtual node", the threads of the warp test whether the neighborhood ranges of their data points overlap the region covered by the virtual node; if any test succeeds, the warp continues descending into the node's subtree, otherwise it abandons the node and its subtree and searches the next sibling node. When the visited position is a leaf node occupied by a data point, the threads of the warp simultaneously test whether their data points are neighbors of this point, and then decide whether to do further computational analysis according to the specific application. When the visited leaf node is empty, the warp proceeds to its next sibling node.
Table 2: Pseudocode example of quadtree search in GPU on-chip memory
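The descend-or-prune traversal can be sketched sequentially with an explicit stack. The C++ below is an illustrative model, not Table 2's warp code (all names are invented, and one query point stands in for the warp's 32): a subtree is entered only if the query disc overlaps the virtual node's region, and an exact distance test is done at occupied leaves:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <stack>
#include <vector>

struct P { double x, y; };

struct Node {
    double cx, cy, half;        // "virtual node": region center + half side
    int point = -1;             // index into pts if this is an occupied leaf
    std::vector<int> kids;      // child node indices (empty for a leaf)
};

// Depth-first search with an explicit stack, mirroring the traversal of
// Table 2: descend into a virtual node only if the query disc overlaps its
// region; at an occupied leaf, do the exact distance test.
std::vector<int> range_search(const std::vector<Node>& tree,
                              const std::vector<P>& pts,
                              P q, double r) {
    std::vector<int> found;
    if (tree.empty()) return found;
    std::stack<int> st;                    // stack of node indices to visit
    st.push(0);                            // start at the root
    while (!st.empty()) {
        const Node& n = tree[st.top()];
        st.pop();
        // Prune: does the disc around q overlap this node's square region?
        double dx = std::max(std::fabs(q.x - n.cx) - n.half, 0.0);
        double dy = std::max(std::fabs(q.y - n.cy) - n.half, 0.0);
        if (dx * dx + dy * dy > r * r) continue;   // no overlap: skip subtree
        if (n.point >= 0) {                        // occupied leaf: exact test
            double ex = pts[n.point].x - q.x, ey = pts[n.point].y - q.y;
            if (ex * ex + ey * ey <= r * r) found.push_back(n.point);
        }
        for (int k : n.kids) st.push(k);           // descend into children
    }
    return found;
}
```

In the warp version, all 32 threads test their own query points against the same node at each step, and the subtree is entered if any thread's test succeeds, which keeps the warp on a single path and avoids divergence.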
Unlike the single tree residing in global memory in previous studies, CUDA quadtrees that fit in the GPU's fast on-chip memory provide very efficient data access, which is crucial for in-tree search. Meanwhile, searching in units of warps avoids the thread divergence that harms CUDA program performance.
Experimental verification
We implemented the algorithm described in the invention in CUDA and verified the performance of this parallel neighbor search method experimentally. The experiments used an HP Z800 workstation with a 2.93 GHz quad-core Intel Xeon X5647 CPU and 8 GB of host memory. The GPU was an NVIDIA Tesla C2070, a GPU dedicated to general-purpose computing, with 448 stream processors at 1.15 GHz and 6 GB of global memory. The operating system was 64-bit Red Hat Enterprise Linux WS 6.0, with the NVIDIA 270.41.19 GPU driver and the CUDA 4.0 toolkit and software development kit installed.
Tables 3 and 4 compare this method with the CUDA-based octree proposed by Martin Burtscher et al., tested under identical conditions, for the construction stage and the search stage of the two data structures respectively. The present invention outperforms the single tree across the board, with a performance advantage of two orders of magnitude in the tree-construction stage.
Table 3: Comparison of tree construction
Table 4: Comparison of search in the tree

Claims (5)

1. A parallel method for two-dimensional-data neighbor search using a GPU on-chip quadtree forest, comprising the steps of:
(1) copying the data to be organized to GPU global memory;
(2) building a quadtree forest in GPU on-chip memory and organizing the data points into the forest;
(3) searching several quadtrees in GPU on-chip memory in parallel, finding for each data point its neighbors within a certain range;
(4) performing further computation on the found neighbors according to the application's needs.
2. The method according to claim 1, characterized in that the method makes full use of the efficient GPU on-chip memory: both the construction of the quadtree forest and its search are carried out in this memory; the built quadtrees are written to global memory by the threads cooperating in a memory-access-optimized coalesced manner; and when the quadtrees are searched, they are read from global memory into on-chip memory in the same manner and then searched.
3. The method according to claim 1, characterized in that the constructed data structure is a forest composed of many quadtrees, each thread block being responsible for the construction and search of one quadtree, so as to guarantee that the data structure fits in the limited on-chip memory while keeping CUDA thread parallelism as high as possible.
4. The method according to claim 1, characterized in that each thread is responsible for one data point, inserting that data point into a quadtree during tree construction and searching for that data point's surrounding neighbors during neighbor search.
5. The method according to claim 1, characterized in that, when searching neighbors for the data points, the quadtrees are loaded one by one into on-chip memory in a memory-access-optimized manner, and the search work is completed in the efficient on-chip memory.
CN201310078697.XA 2013-03-13 2013-03-13 Parallel method for realizing two-dimension data neighbor search by using GPU (graphics processing unit) on-chip tree mass Pending CN104050175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310078697.XA CN104050175A (en) 2013-03-13 2013-03-13 Parallel method for realizing two-dimension data neighbor search by using GPU (graphics processing unit) on-chip tree mass

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310078697.XA CN104050175A (en) 2013-03-13 2013-03-13 Parallel method for realizing two-dimension data neighbor search by using GPU (graphics processing unit) on-chip tree mass

Publications (1)

Publication Number Publication Date
CN104050175A true CN104050175A (en) 2014-09-17

Family

ID=51503026

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310078697.XA Pending CN104050175A (en) 2013-03-13 2013-03-13 Parallel method for realizing two-dimension data neighbor search by using GPU (graphics processing unit) on-chip tree mass

Country Status (1)

Country Link
CN (1) CN104050175A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105468439A (en) * 2015-11-19 2016-04-06 华东师范大学 Adaptive parallel algorithm for traversing neighbors in fixed radius under CPU-GPU (Central Processing Unit-Graphic Processing Unit) heterogeneous framework
CN117033718A (en) * 2023-09-14 2023-11-10 上海交通大学 Approximate neighbor searching method, system, medium and device based on ray tracing

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101840565A (en) * 2010-04-19 2010-09-22 浙江大学 Octree parallel constructing method based on GPU
CN102651030A (en) * 2012-04-09 2012-08-29 华中科技大学 Social network association searching method based on graphics processing unit (GPU) multiple sequence alignment algorithm
CN102662641A (en) * 2012-04-16 2012-09-12 浙江工业大学 Parallel acquisition method for seed distribution data based on CUDA
US8294704B1 (en) * 2008-03-31 2012-10-23 The Mathworks, Inc. Parallel processing of object subtrees for multiprocessor systems
WO2013032436A1 (en) * 2011-08-29 2013-03-07 Intel Corporation Parallel operation on b+ trees

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8294704B1 (en) * 2008-03-31 2012-10-23 The Mathworks, Inc. Parallel processing of object subtrees for multiprocessor systems
CN101840565A (en) * 2010-04-19 2010-09-22 浙江大学 Octree parallel constructing method based on GPU
WO2013032436A1 (en) * 2011-08-29 2013-03-07 Intel Corporation Parallel operation on b+ trees
CN102651030A (en) * 2012-04-09 2012-08-29 华中科技大学 Social network association searching method based on graphics processing unit (GPU) multiple sequence alignment algorithm
CN102662641A (en) * 2012-04-16 2012-09-12 浙江工业大学 Parallel acquisition method for seed distribution data based on CUDA

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yu Li (余莉): "A fractal image compression method on the CUDA platform", Journal of Chinese Computer Systems (《小型微型计算机系统》) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105468439A (en) * 2015-11-19 2016-04-06 华东师范大学 Adaptive parallel algorithm for traversing neighbors in fixed radius under CPU-GPU (Central Processing Unit-Graphic Processing Unit) heterogeneous framework
CN105468439B (en) * 2015-11-19 2019-03-01 华东师范大学 Adaptive parallel method for traversing fixed-radius neighbors under a CPU-GPU heterogeneous framework
CN117033718A (en) * 2023-09-14 2023-11-10 上海交通大学 Approximate neighbor searching method, system, medium and device based on ray tracing
CN117033718B (en) * 2023-09-14 2024-06-07 上海交通大学 Approximate neighbor searching method, system, medium and device based on ray tracing

Similar Documents

Publication Publication Date Title
Anderson et al. General purpose molecular dynamics simulations fully implemented on graphics processing units
Zhou et al. An FPGA framework for edge-centric graph processing
He et al. GPU-accelerated parallel sparse LU factorization method for fast circuit analysis
US20130226535A1 (en) Concurrent simulation system using graphic processing units (gpu) and method thereof
CN110516316B (en) GPU acceleration method for solving Euler equation by interrupted Galerkin method
Liu Parallel and scalable sparse basic linear algebra subprograms
Dufrechou et al. Solving sparse triangular linear systems in modern GPUs: a synchronization-free algorithm
Guerrero et al. Effective parallelization of non-bonded interactions kernel for virtual screening on gpus
Nagasaka et al. Cache-aware sparse matrix formats for Kepler GPU
Zhang et al. Regularizing irregularity: bitmap-based and portable sparse matrix multiplication for graph data on GPUs
Hu et al. High-dimensional image descriptor matching using highly parallel KD-tree construction and approximate nearest neighbor search
CN106484532B (en) GPGPU parallel calculating method towards SPH fluid simulation
Kim et al. Compute spearman correlation coefficient with Matlab/CUDA
Liu et al. Parallel reconstruction of neighbor-joining trees for large multiple sequence alignments using CUDA
Hu et al. Massive parallelization of approximate nearest neighbor search on KD-tree for high-dimensional image descriptor matching
Wu et al. MIC acceleration of short-range molecular dynamics simulations
Moustafa et al. 3D cartesian transport sweep for massively parallel architectures with PARSEC
CN104050175A (en) Parallel method for realizing two-dimension data neighbor search by using GPU (graphics processing unit) on-chip tree mass
Sun et al. Efficient tensor cores support in tvm for low-latency deep learning
Zhou et al. A Parallel Scheme for Large‐scale Polygon Rasterization on CUDA‐enabled GPUs
Nie et al. Adaptive sparse matrix-vector multiplication on CPU-GPU heterogeneous architecture
Siddiqui et al. Design space exploration of embedded applications on heterogeneous cpu-gpu platforms
Wang et al. Design and optimization of DBSCAN Algorithm based on CUDA
Wang et al. Parallel 3D deterministic particle transport on Intel MIC architecture
Goto Acceleration of computing the Kleene Star in Max-Plus algebra using CUDA GPUs

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140917

WD01 Invention patent application deemed withdrawn after publication