CN110389784A - A kind of compiled query processing method in image processor environment - Google Patents

A kind of compiled query processing method in image processor environment Download PDF

Info

Publication number
CN110389784A
CN110389784A (application CN201910678918.4A)
Authority
CN
China
Prior art keywords
gpu
kernel
image processor
logic
query processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910678918.4A
Other languages
Chinese (zh)
Inventor
赵志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Huituo Investment Center (limited Partnership)
Original Assignee
Harbin Huituo Investment Center (limited Partnership)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Huituo Investment Center (limited Partnership) filed Critical Harbin Huituo Investment Center (limited Partnership)
Priority to CN201910678918.4A priority Critical patent/CN110389784A/en
Publication of CN110389784A publication Critical patent/CN110389784A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30032Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • G06F9/3004Arrangements for executing specific machine instructions to perform operations on memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present invention provides a compiled query processing method in an image processor environment, belonging to the field of processor technology. The invention first constructs compound kernels in the GPU; it then constructs multiple channels, each channel consisting of multiple compound kernels constructed in step 1, each compound kernel being able to execute complex logic; the channels are then set to process data in parallel; when the GPU receives a compiled-query instruction, one channel is allocated to that instruction, each compound kernel executes one logic step, and intermediate data is stored in the GPU's main memory. The invention solves the problem that the query processing speed of existing GPU-style image processors is limited by memory bandwidth, and is applied to query processing in image processors.

Description

Compiled query processing method in an image processor environment
Technical field
The present invention relates to a compiled query processing method and belongs to the field of processor technology.
Background technique
Currently, query processing on GPU (Graphics Processing Unit) style image processors is severely limited by data movement. With a computational throughput of teraflops (trillions of floating-point operations per second) within a single device, even high-bandwidth memory cannot supply data fast enough to be utilized effectively. Compiled query processing is an advanced technique for improving memory efficiency. GPUs are commonly used as powerful accelerators for query processing; because the arithmetic throughput of the coprocessor reaches peak ranges, supplying it with enough data is a challenge, since even hardware with high-bandwidth memory achieves read/write rates of only a few hundred GB/s. For these reasons, memory-intensive applications still suffer under the cost of data movement. The conventional method stores intermediate data in main memory, reads it back from main memory when needed, and then writes the intermediate data of the next step back to main memory again; the number of I/O operations is large, and bandwidth naturally becomes the bottleneck of query processing speed.
Summary of the invention
To solve the problem that the query processing speed of existing GPU-style image processors is limited by memory bandwidth, the present invention provides a compiled query processing method in an image processor environment.
The compiled query processing method in an image processor environment of the present invention is realized by the following technical scheme:
Step 1: constructing compound kernels in the GPU; a kernel denotes a core;
Step 2: constructing multiple channels, each channel consisting of multiple compound kernels constructed in step 1, each compound kernel being able to execute complex logic;
Step 3: setting the channels to process data in parallel;
Step 4: when the GPU receives a compiled-query instruction, allocating one channel to that instruction, each compound kernel executing one logic step, while intermediate data is stored in the GPU's main memory.
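The four steps above can be sketched as a CPU-side simulation. This is an illustrative assumption, not the patent's implementation: names such as `make_compound_kernel`, `make_channel`, and the `gpu_main_memory` dictionary (which stands in for GPU main memory) are invented for the sketch.

```python
# Illustrative CPU-side sketch of steps 1-4: compound kernels grouped into
# channels, with intermediate data kept in (simulated) GPU main memory.
# All names here are assumptions for illustration, not from the patent.

gpu_main_memory = {}  # stands in for the GPU's main memory


def make_compound_kernel(name, *logic_steps):
    """Step 1: a compound kernel bundles several logic steps into one unit."""
    def kernel(data):
        for step in logic_steps:
            data = step(data)
        return data
    kernel.__name__ = name
    return kernel


def make_channel(*kernels):
    """Step 2: a channel is a sequence of compound kernels."""
    def channel(query_id, data):
        for k in kernels:
            data = k(data)
            # Step 4: store each intermediate result in GPU main memory.
            gpu_main_memory[(query_id, k.__name__)] = data
        return data
    return channel


# Two compound kernels, each carrying complex logic internally.
select_hash = make_compound_kernel(
    "select_hash",
    lambda rows: [r for r in rows if r % 2 == 0],  # selection
    lambda rows: [hash(r) % 100 for r in rows],    # hashing
)
count_write = make_compound_kernel(
    "count_write",
    lambda rows: len(rows),                        # count / write-back
)

# Step 3: several channels exist so that queries can be processed in parallel.
channels = [make_channel(select_hash, count_write) for _ in range(4)]

# Step 4: a compiled-query instruction is assigned one channel.
result = channels[0](query_id=7, data=list(range(10)))
print(result)  # number of even values in 0..9 -> 5
```

The sketch only models the data flow; on a real GPU the channels would run concurrently and the intermediates would live in device memory rather than a Python dictionary.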
The most prominent and significant beneficial effects of the present invention are:
The compiled query processing method in an image processor environment according to the present invention changes the inherent processing mode of compiled queries by integrating compiled query processing into an image-processor-accelerated DBMS (database management system), making it fit the massively parallel execution model of GPU-style image processors. The invention parallelizes compiled queries in GPU style and can be applied in different situations. The invention shows that fusing multiple operators into a single GPU kernel reduces bandwidth demand; data can be processed effectively in a single pass, which greatly improves query efficiency and shortens both the memory access time and the kernel execution time. Compared with one-operator-at-a-time execution, the method of the present invention reduces memory traffic by a factor of 7.5, shortening the memory access time, and shortens the kernel execution time by a factor of 9.5.
Detailed description of the invention
Fig. 1 is a schematic diagram of data movement during compiled query execution;
Fig. 2 is a schematic diagram of the composition of a compound kernel in the present invention;
In the figures: MEM: main memory; GPU MEM: GPU main memory; SCRATCHPAD MEM/REGISTERS/CACHE: scratchpad memory/registers/cache; CORES: cores; Compound kernel: compound kernel; Input: input; Probe: probe; Count: count; Prefix sum: prefix sum (intermediate data in the computation); Write: write; PCIe Transfers: bus transfers over PCIe; GPU Global Memory: GPU global memory; On-Chip Memory: on-chip memory; Select/hash: selection/hash step; Aligned write: aligned write-back; Project/join probe: projection/join probe (a join operation may be present); Fusion operator: fused operator.
Specific embodiment
Embodiment 1: this embodiment is described with reference to Fig. 1. The compiled query processing method in an image processor environment provided by this embodiment specifically includes the following steps:
Step 1: constructing compound kernels in the GPU. A kernel denotes a core; the GPU contains many cores, and the cores process data, including logical operations and the like;
Step 2: constructing multiple channels, each channel consisting of multiple compound kernels constructed in step 1. Each compound kernel can execute complex logic (for example, writing back to memory after a logical operation), so nearly all processing takes place inside the cores, which reduces the read/write volume greatly compared with the conventional method. In the conventional method a core carries only a single piece of logic: after the logic computation it writes the data out to main memory, and if a write-back operation is to be performed the data must be read from main memory again; the amount of I/O (input/output) is large and easily becomes a bottleneck;
Step 3: setting the channels so that they can process data in parallel; this is equivalent to constructing multiple channels inside the GPU and having them process data in parallel;
Step 4: when the GPU receives a compiled-query instruction, allocating one channel to that instruction, with each compound kernel executing one logic step. Integrating compiled queries into the GPU in this way greatly accelerates compiled query execution and also raises parallelism, because the GPU has far more cores than a CPU, high compute power, and fast execution. At the same time, storing intermediate data in the GPU's main memory reduces the number of I/O reads and writes and removes many factors that limit performance, such as the bandwidth bottleneck. Fig. 1 shows a schematic diagram of data movement during compiled query execution with the method of the present invention; it can be seen that the invention greatly reduces the number of data movements during compiled queries.
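The I/O argument in step 2 can be made concrete with a minimal sketch. This is an illustrative assumption (the operators, data, and I/O accounting are invented, not taken from the patent): it contrasts the conventional one-operator-per-kernel flow, which writes each intermediate to main memory and reads it back, with a fused compound-kernel flow that touches main memory only twice.

```python
# Sketch (assumed, not the patent's implementation) contrasting the
# conventional one-operator-per-kernel flow with the compound-kernel flow
# by counting trips to main memory.

def run_conventional(data, operators):
    """Each operator writes its result to main memory; the next reads it back."""
    io_ops = 1  # initial read of the input
    for op in operators:
        data = op(data)
        io_ops += 2  # write intermediate to main memory + read it back
    return data, io_ops


def run_compound(data, operators):
    """All operators fused into one compound kernel: one read, one write."""
    io_ops = 2  # read input once, write final result once
    for op in operators:
        data = op(data)  # intermediates stay on-chip (registers/scratchpad)
    return data, io_ops


ops = [
    lambda xs: [x for x in xs if x > 0],  # select
    lambda xs: [x * x for x in xs],       # project
    lambda xs: sum(xs),                   # aggregate
]

res_a, io_a = run_conventional([-2, -1, 1, 2, 3], ops)
res_b, io_b = run_compound([-2, -1, 1, 2, 3], ops)
print(res_a, io_a)  # 14 7
print(res_b, io_b)  # 14 2
```

Both flows compute the same answer, but the fused flow performs a constant two main-memory transactions regardless of pipeline depth, which is the bandwidth saving the embodiment describes.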
Analyzing the macro execution models used by various past systems shows that, to evaluate relational query operators, existing state-of-the-art systems select multiple primitives and execute the corresponding kernel sequence on the GPU; to supply data to the kernels, the macro execution model defines how data transfers are interleaved with kernel execution. Compared with legacy systems, this kernel-to-kernel data movement may cause additional bandwidth demand. To understand this influence, the effect of existing macro execution models on the use of bandwidth at multiple levels (PCIe, GPU global memory, etc.) was studied, and the execution of the Star Schema Benchmark (SSB) queries was analyzed; the queries were executed with CoGaDB at scale factor 10 on an NVIDIA GTX 970 GPU.
A straightforward way to execute a kernel sequence is to transfer all inputs first, execute the kernels, and finally transfer all outputs; keeping intermediate data in GPU global memory and avoiding meaningless transfers is essential. The drawback, however, is that this only works when the inputs, outputs, and intermediates all fit in GPU memory. To process data larger than the coprocessor's memory, each kernel can be executed on data blocks, with batching executing multiple kernels on each block transferred over PCIe. The present invention alleviates the PCIe bandwidth limitation by rearranging the operations of a kernel so that the transfer of intermediate results to the host is cut short, rather than running one kernel per processed column. Batching achieves this by reusing the output of the previous operator (op1) as the input of the next (op2) instead of transferring it to the host; this works as long as the intermediate batch results can be stored in GPU global memory. Transferring data in blocks and executing multiple operators on each block achieves scalability and improves efficiency compared with a single kernel. The batching macro execution model is used by GPUDB and Hetero-DB for coprocessing.
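The batching model described above can be sketched as follows. This is an illustrative assumption (block size, operators, and the list slicing that stands in for PCIe transfers are invented for the sketch): each block runs through several operators before the next block is moved, so op1's output feeds op2 without a host round-trip.

```python
# Sketch (illustrative assumptions only) of the batching macro execution
# model: data moves in blocks, and each block runs through several operators
# before the next block is transferred.

def batched_execute(table, block_size, op1, op2):
    results = []
    for start in range(0, len(table), block_size):
        block = table[start:start + block_size]  # one PCIe transfer (simulated)
        intermediate = op1(block)                # stays in GPU global memory
        results.extend(op2(intermediate))        # reused as input, no host trip
    return results


op1 = lambda block: [x + 1 for x in block]  # e.g. a selection/transform kernel
op2 = lambda block: [x * 2 for x in block]  # e.g. a projection kernel

out = batched_execute(list(range(6)), block_size=2, op1=op1, op2=op2)
print(out)  # [2, 4, 6, 8, 10, 12]
```

The intermediate list never leaves the loop body, mirroring how the batch result stays in GPU global memory instead of being shipped back to the host between operators.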
To use GPU global memory bandwidth more effectively, additional micro-level optimization with a micro execution model is needed, combined with the macro execution model (batching) to achieve both scalability and performance. Existing micro-level optimizations (such as vectorized processing and query compilation) balance memory bandwidth more efficiently by exploiting processor caches. To reconcile the interpretation overhead of the Volcano model with the materialization overhead of operator-at-a-time execution, vectorized processing uses batches sized to fit the processor cache; to use query compilation, fine-grained data parallelism must also be integrated into compiled queries on the GPU.
Embodiment 2: this embodiment differs from Embodiment 1 in that, in step 1, each compound kernel contains computation logic, data-loading logic, and write-back logic; as shown in Fig. 2, a compound kernel links many steps together for sequential execution, so its logic is complex.
Other steps and parameters are the same as in Embodiment 1.
Embodiment 3: this embodiment differs from Embodiments 1 and 2 in that each core executing one logic step in step 4 specifically means:
Compound kernel A executes the first logic step (for example, an operation joining two database tables m and n) and stores the intermediate result in the GPU's main memory; then kernel B, connected to kernel A, performs the next operation (which may be, for example, a projection over the intermediate result of the previous step), and so on.
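Embodiment 3 can be sketched as follows. This is an illustrative assumption (the tables, the `id` join key, and the `gpu_main_memory` dictionary standing in for GPU main memory are all invented for the sketch): kernel A joins m and n and materializes the intermediate result in simulated GPU main memory, and kernel B then projects over that result.

```python
# Illustrative sketch of Embodiment 3 (names and data are assumptions):
# compound kernel A joins tables m and n, storing the intermediate result in
# (simulated) GPU main memory; kernel B then projects over that result.

gpu_main_memory = {}


def kernel_a_join(m, n):
    """First logic step: join m and n on their 'id' field."""
    joined = [{**row_m, **row_n}
              for row_m in m for row_n in n
              if row_m["id"] == row_n["id"]]
    gpu_main_memory["A_out"] = joined  # intermediate kept in GPU main memory
    return joined


def kernel_b_project(columns):
    """Next logic step: projection over kernel A's intermediate result."""
    rows = gpu_main_memory["A_out"]
    return [{c: r[c] for c in columns} for r in rows]


m = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
n = [{"id": 1, "price": 10}, {"id": 3, "price": 30}]

kernel_a_join(m, n)
print(kernel_b_project(["name", "price"]))  # [{'name': 'a', 'price': 10}]
```

Kernel B never touches the host: it reads its input directly from where kernel A left it, which is the chaining the embodiment describes.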
Other steps and parameter are the same as one or two specific embodiments.
The present invention may also have various other embodiments. Without departing from the spirit and substance of the invention, those skilled in the art can make various corresponding changes and modifications according to the present invention, but all such corresponding changes and modifications shall fall within the protection scope of the appended claims of the present invention.

Claims (3)

1. A compiled query processing method in an image processor environment, characterized in that it specifically includes the following steps:
Step 1: constructing compound kernels in the GPU; a kernel denotes a core;
Step 2: constructing multiple channels, each channel consisting of multiple compound kernels constructed in step 1, each compound kernel being able to execute complex logic;
Step 3: setting the channels to process data in parallel;
Step 4: when the GPU receives a compiled-query instruction, allocating one channel to that instruction, each compound kernel executing one logic step, while intermediate data is stored in the GPU's main memory.
2. The compiled query processing method in an image processor environment according to claim 1, characterized in that, in step 1, each compound kernel contains computation logic, data-loading logic, and write-back logic.
3. The compiled query processing method in an image processor environment according to claim 1 or claim 2, characterized in that each core executing one logic step in step 4 specifically means:
Compound kernel A executes the first logic step and stores the intermediate result in the GPU's main memory; then kernel B, connected to kernel A, performs the next operation, and so on.
CN201910678918.4A 2019-07-23 2019-07-23 A kind of compiled query processing method in image processor environment Pending CN110389784A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910678918.4A CN110389784A (en) 2019-07-23 2019-07-23 A kind of compiled query processing method in image processor environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910678918.4A CN110389784A (en) 2019-07-23 2019-07-23 A kind of compiled query processing method in image processor environment

Publications (1)

Publication Number Publication Date
CN110389784A true CN110389784A (en) 2019-10-29

Family

ID=68287530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910678918.4A Pending CN110389784A (en) 2019-07-23 2019-07-23 A kind of compiled query processing method in image processor environment

Country Status (1)

Country Link
CN (1) CN110389784A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101526934A (en) * 2009-04-21 2009-09-09 浪潮电子信息产业股份有限公司 Construction method of GPU and CPU combined processor
CN102201008A (en) * 2011-06-17 2011-09-28 中国科学院软件研究所 GPU (graphics processing unit)-based quick star catalogue retrieving method
WO2012142186A2 (en) * 2011-04-11 2012-10-18 Child Timothy Database acceleration using gpu and multicore cpu systems and methods
CN102938653A (en) * 2012-11-13 2013-02-20 航天恒星科技有限公司 Parallel RS decoding method achieved through graphics processing unit (GPU)
CN103336959A (en) * 2013-07-19 2013-10-02 西安电子科技大学 Vehicle detection method based on GPU (ground power unit) multi-core parallel acceleration
CN104615576A (en) * 2015-03-02 2015-05-13 中国人民解放军国防科学技术大学 CPU+GPU processor-oriented hybrid granularity consistency maintenance method
CN105069015A (en) * 2015-07-13 2015-11-18 山东超越数控电子有限公司 Web acceleration technology implementation method of domestic platform


Similar Documents

Publication Publication Date Title
Funke et al. Pipelined query processing in coprocessor environments
Koliousis et al. Saber: Window-based hybrid stream processing for heterogeneous architectures
Heimel et al. Hardware-oblivious parallelism for in-memory column-stores
Yuan et al. Spark-GPU: An accelerated in-memory data processing engine on clusters
Lee et al. Ysmart: Yet another sql-to-mapreduce translator
US9298768B2 (en) System and method for the parallel execution of database queries over CPUs and multi core processors
Wu et al. Optimizing data warehousing applications for GPUs using kernel fusion/fission
CN103309958B (en) The star-like Connection inquiring optimization method of OLAP under GPU and CPU mixed architecture
EP2585950B1 (en) Apparatus and method for data stream processing using massively parallel processors
Furst et al. Profiling a GPU database implementation: a holistic view of GPU resource utilization on TPC-H queries
US20180150515A1 (en) Query Planning and Execution With Source and Sink Operators
US20100293135A1 (en) Highconcurrency query operator and method
CN102981807B (en) Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment
Rosenfeld et al. Query processing on heterogeneous CPU/GPU systems
CN106446134B (en) Local multi-query optimization method based on predicate specification and cost estimation
US20220342712A1 (en) Method for Processing Task, Processor, Device and Readable Storage Medium
CN104361118A (en) Mixed OLAP (on-line analytical processing) inquiring treating method adapting coprocessor
CN101556534A (en) Large-scale data parallel computation method with many-core structure
CN104731729B (en) A kind of table connection optimization method, CPU and accelerator based on heterogeneous system
Cheng et al. SCANRAW: A database meta-operator for parallel in-situ processing and loading
Yang et al. Efficient FPGA-based graph processing with hybrid pull-push computational model
Kumaigorodski et al. Fast CSV loading using GPUs and RDMA for in-memory data processing
Shehab et al. Accelerating relational database operations using both CPU and GPU co-processor
Breß et al. Exploring the design space of a GPU-aware database architecture
CN110389784A (en) A kind of compiled query processing method in image processor environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20191029