CN109213601A - CPU-GPU-based load-balancing method and device - Google Patents

CPU-GPU-based load-balancing method and device Download PDF

Info

Publication number
CN109213601A
CN109213601A (application CN201811064037.5A)
Authority
CN
China
Prior art keywords
cpu
gpu
duration
data
pipeline
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811064037.5A
Other languages
Chinese (zh)
Other versions
CN109213601B (en)
Inventor
翁楚良
孙婷婷
黄皓
王嘉伦
Current Assignee
East China Normal University
Original Assignee
East China Normal University
Priority date
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN201811064037.5A priority Critical patent/CN109213601B/en
Publication of CN109213601A publication Critical patent/CN109213601A/en
Application granted granted Critical
Publication of CN109213601B publication Critical patent/CN109213601B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Abstract

The purpose of this application is to provide a CPU-GPU-based load-balancing method and device. By constructing a pipeline query execution model on a CPU-GPU heterogeneous database system, the application enables a CPU-GPU heterogeneous data analysis system to support query analysis under big-data scenarios. The method determines the total number of pipelines to be executed; starts the pipeline query execution model to distribute that number of pipelines across the CPU and the GPU, and computes the system execution time of every load-distribution strategy from the separately determined execution time of a single pipeline on the CPU and on the GPU; finally, the strategy with the minimum system execution time is determined as the optimal CPU-GPU allocation strategy. Load balancing of the CPU-GPU heterogeneous data analysis system can reasonably distribute pipeline loads to the different processors and make full use of processor computing resources, which not only improves system performance but also allows the system to reach its best overall performance.

Description

CPU-GPU-based load-balancing method and device
Technical field
This application relates to the computer field, and more particularly to a CPU-GPU-based load-balancing method and device.
Background technique
General-purpose graphics processing units (Graphics Processing Unit, GPU) are widely used in fields such as matrix computation and machine learning. In recent years, the rapidly growing demand of data-intensive applications has driven the development of GPU-based heterogeneous online analytical processing platforms. Because a GPU contains many compute units that can run a large number of threads simultaneously, a data analysis system with the GPU as the primary processor outperforms a traditional CPU analysis system in most cases, shortening execution time by several orders of magnitude.
In a traditional relational query analysis system, when a client sends a query request, the system creates an analysis job, parses the request, and converts it to a logical query plan; a query-plan optimizer then selects the optimal physical query plan for execution according to some principle (such as minimum cost). A physical query plan is a directed acyclic graph (DAG) containing multiple operators, which are executed among themselves in a certain order.
In current CPU-GPU heterogeneous analysis systems, the GPU is the primary processor for query execution, and operator execution is mainly placed on the GPU, while the CPU is mainly responsible for data distribution and collection; when a subsequent operation needs the intermediate result output by a previous step, the CPU also does some processing on that intermediate result.
The analysis workloads handled by data-management and analysis systems are moving toward big-data scenarios: data volume grows exponentially and workloads become heavier. However, because a GPU can only directly process data in its own storage medium, and device-memory capacity is limited, the GPU cannot complete the processing of a large data set in a single load. When the input data or the intermediate results are too large to fit into GPU global memory, analysis efficiency stays low, or the task even fails. The prior art evades this problem by limiting the size of the query table, or transfers the computing task to the CPU as a fallback strategy, but neither is an optimal solution.
In conclusion the use of GPU being that data analysis system accelerates query analysis at present on the heterogeneous platform of CPU-GPU Though it is effective, but still have the following problems: GPU video memory capacity is limited, and the processing of large data sets can not be completed by single load, And the task distribution between CPU and GPU is unbalanced, does not make full use of heterogeneous processor resource.
Summary of the invention
The purpose of this application is to provide a CPU-GPU-based load-balancing method and device, in order to solve the prior-art problems that GPU device-memory capacity is limited, that the processing of large data sets cannot be completed in a single load, and that unbalanced task distribution between the CPU and the GPU leaves heterogeneous processor resources underutilized.
According to one aspect of this application, a CPU-GPU-based load-balancing method is provided, the method comprising:
constructing a pipeline query execution model on a CPU-GPU heterogeneous database system;
determining the total number of pipelines to be executed;
starting the pipeline query execution model to distribute the total number of pipelines across the CPU and the GPU, and computing the system execution time of every load-distribution strategy from the separately determined execution time of a single pipeline on the CPU and on the GPU;
determining the load-distribution strategy whose system execution time is the minimum among all strategies as the optimal CPU-GPU allocation strategy.
Further, in the above method, determining the total number of pipelines to be executed comprises:
obtaining a query statement, wherein the query statement includes the data to be queried;
dividing the data to be queried according to a preset data-fragment size to obtain the data fragments of the data to be queried and their total number;
starting one pipeline for each data fragment of the data to be queried, so that the total number of pipelines to be executed is determined by the total number of data fragments.
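As an illustration of the fragmentation step above, here is a minimal Python sketch; the function and constant names are hypothetical, since the patent specifies only that the fragment size is a preset number of tuples:

```python
FRAGMENT_ROWS = 1_000_000  # preset fragment size in tuples; illustrative value only

def count_pipelines(table_rows, fragment_rows=FRAGMENT_ROWS):
    """One pipeline is started per data fragment, so the pipeline total N
    equals the number of fragments (ceiling division of rows by fragment size)."""
    return (table_rows + fragment_rows - 1) // fragment_rows
```

For example, 80 million rows at one million tuples per fragment would yield N = 80 pipelines to be executed.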
Further, in the above method, starting the pipeline query execution model to distribute the total number of pipelines across the CPU and the GPU, and computing the system execution time of every load-distribution strategy from the separately determined execution time of a single pipeline on the CPU and on the GPU, comprises:
Step 1: start the pipeline query execution model and set the initial load-distribution strategy: the number of pipelines assigned to the CPU is N_CPU = 0 and the number of pipelines assigned to the GPU is N_GPU = N, where N is the total number of pipelines and a positive integer greater than or equal to 1;
Step 2: execute in parallel the pipelines assigned to the CPU and to the GPU, obtaining the CPU execution time and the GPU execution time of the current load-distribution strategy;
Step 3: if the CPU execution time and the GPU execution time are equal, determine that value as the system execution time of the current load-distribution strategy; if they are unequal, determine the larger of the two as the system execution time of the current load-distribution strategy;
Step 4: update the load-distribution strategy: N_CPU = N_CPU + 1 and N_GPU = N_GPU − 1, where N_CPU + N_GPU = N;
Step 5: repeat Steps 2 to 4 until the system execution time of every load-distribution strategy has been obtained.
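Steps 1 to 5 above amount to an exhaustive search over the N + 1 possible splits. A minimal Python sketch, under the assumption that a `run_strategy(n_cpu, n_gpu)` callable (a hypothetical name, not from the patent) executes the assigned pipelines in parallel and returns the measured pair (T_CPU, T_GPU):

```python
def find_best_allocation(n_total, run_strategy):
    """Try every split (N_CPU, N_GPU) with N_CPU + N_GPU = n_total and
    return (system_time, n_cpu, n_gpu) for the split with the minimum
    system execution time, which is max(T_CPU, T_GPU) per strategy."""
    best = None
    for n_cpu in range(n_total + 1):
        n_gpu = n_total - n_cpu
        t_cpu, t_gpu = run_strategy(n_cpu, n_gpu)
        t_sys = max(t_cpu, t_gpu)  # equal times would give the balanced optimum
        if best is None or t_sys < best[0]:
            best = (t_sys, n_cpu, n_gpu)
    return best
```

With a processor pair where, say, each CPU pipeline costs 100 ms and each GPU pipeline 50 ms, the search naturally settles on the split where the two processors finish closest together.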
Further, in the above method, before computing the system execution time of every load-distribution strategy from the separately determined execution time of a single pipeline on the CPU and on the GPU, the method further comprises:
determining the execution time of a single pipeline on the CPU from the data input time T_IN_C, the data execution time T_EXE_C, and the data output time T_OUT_C of a data fragment on the CPU;
determining the execution time of a single pipeline on the GPU from the data input time T_IN_G, the data execution time T_EXE_G, and the data output time T_OUT_G of a data fragment on the GPU.
Further, in the above method, the formulas in Step 2 for obtaining the CPU execution time and the GPU execution time of the current load-distribution strategy are, respectively:
T_CPU = T_IN_C + T_EXE_C + T_OUT_C + max{T_IN_C, T_EXE_C, T_OUT_C} × (N_CPU − 1),
T_GPU = T_IN_G + T_EXE_G + T_OUT_G + max{T_IN_G, T_EXE_G, T_OUT_G} × (N_GPU − 1),
where T_CPU is the CPU execution time of the current load-distribution strategy, and max{T_IN_C, T_EXE_C, T_OUT_C} is the maximum among the data input time T_IN_C, the data execution time T_EXE_C, and the data output time T_OUT_C of a data fragment on the CPU;
T_GPU is the GPU execution time of the current load-distribution strategy, and max{T_IN_G, T_EXE_G, T_OUT_G} is the maximum among the data input time T_IN_G, the data execution time T_EXE_G, and the data output time T_OUT_G of a data fragment on the GPU.
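The two formulas share one shape: the first pipeline runs all three stages in full, and each subsequent pipeline adds only its longest (bottleneck) stage, because adjacent pipelines overlap. A direct Python transcription of that shape (a sketch, not code from the patent):

```python
def processor_time(t_in, t_exe, t_out, n_pipelines):
    """T = t_in + t_exe + t_out + max(t_in, t_exe, t_out) * (n - 1):
    the first pipeline pays for all three stages; each later pipeline is
    offset by the longest stage, which limits how closely they overlap."""
    if n_pipelines == 0:
        return 0.0  # no pipelines assigned to this processor
    return t_in + t_exe + t_out + max(t_in, t_exe, t_out) * (n_pipelines - 1)
```

The same function covers both T_CPU and T_GPU; only the measured stage durations and the pipeline count differ.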
Further, in the above method, the query statement also includes a query condition; a pipeline distributed to the CPU is a thread instance that runs on the CPU according to the query condition, and a pipeline distributed to the GPU is a kernel-function instance that runs on the GPU according to the query condition.
Further, in the above method, after Step 2 executes in parallel the pipelines distributed to the CPU and to the GPU and obtains the CPU execution time and the GPU execution time of the current load-distribution strategy, the method further comprises:
obtaining the execution result of each pipeline on the CPU and the GPU;
obtaining the final query execution result of the query statement from the execution results of all pipelines.
Further, in the above method, the data input time T_IN_C of a data fragment on the CPU is the time taken to copy the data fragment into the memory of the CPU;
the data execution time T_EXE_C of a data fragment on the CPU is the time taken by the thread instance running on the CPU;
the data output time T_OUT_C of a data fragment on the CPU is the time taken to copy the execution result of the corresponding pipeline within the CPU memory;
the data input time T_IN_G of a data fragment on the GPU is the time taken to copy the data fragment from the memory of the CPU to the device memory of the GPU;
the data execution time T_EXE_G of a data fragment on the GPU is the time taken by the kernel-function instance running on the GPU;
the data output time T_OUT_G of a data fragment on the GPU is the time taken to copy the execution result of the corresponding pipeline from the device memory of the GPU back to the memory of the CPU.
According to another aspect of this application, a non-volatile storage medium is also provided, on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the processor implements the CPU-GPU-based load-balancing method described above.
According to another aspect of this application, a device is also provided, comprising:
one or more processors; and
a non-volatile storage medium for storing one or more computer-readable instructions,
which, when executed by the one or more processors, cause the one or more processors to implement the CPU-GPU-based load-balancing method described above.
Compared with the prior art, this application constructs a pipeline query execution model on a CPU-GPU heterogeneous database system, enabling the CPU-GPU heterogeneous data analysis system to support query analysis under big-data scenarios; determines the total number of pipelines to be executed; starts the pipeline query execution model to distribute that number of pipelines across the CPU and the GPU, and computes the system execution time of every load-distribution strategy from the separately determined execution time of a single pipeline on the CPU and on the GPU; and finally determines the strategy with the minimum system execution time as the optimal CPU-GPU allocation strategy. Load balancing of the CPU-GPU heterogeneous data analysis system can reasonably distribute pipeline loads to the different processors and make full use of processor computing resources, which not only improves system performance but also allows the system to reach its best overall performance.
Detailed description of the invention
Other features, objects, and advantages of this application will become more apparent by reading the following detailed description of non-restrictive embodiments, made with reference to the accompanying drawings:
Fig. 1 shows a flow diagram of a CPU-GPU-based load-balancing method according to one aspect of this application;
Fig. 2 shows how the GPU execution time and the CPU execution time of a load-distribution strategy are computed in a CPU-GPU-based load-balancing method according to one aspect of this application;
Fig. 3 shows how the query execution time of a load-distribution strategy is computed in a CPU-GPU-based load-balancing method according to one aspect of this application;
Fig. 4 shows the CPU-GPU load-distribution strategies obtained by applying a first query statement to 80 million rows of data under the pipeline query execution model, according to one aspect of this application;
Fig. 5 shows the CPU-GPU load-distribution strategies obtained by applying the first query statement to 140 million rows of data under the pipeline query execution model, according to one aspect of this application;
Fig. 6 shows the CPU-GPU load-distribution strategies obtained by applying a second query statement to 80 million rows of data under the pipeline query execution model, according to one aspect of this application;
Fig. 7 shows the CPU-GPU load-distribution strategies obtained by applying the second query statement to 140 million rows of data under the pipeline query execution model, according to one aspect of this application;
In the drawings, the same or similar reference numerals denote the same or similar components.
Specific embodiment
The application is described in further detail below with reference to the accompanying drawings.
In a typical configuration of this application, the terminal, the device of the service network, and the trusted party each include one or more processors (CPUs), an input/output interface, a network interface, and memory.
The memory may include non-permanent memory in a computer-readable medium, random-access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact-disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
As shown in Fig. 1, the CPU-GPU-based load-balancing method in an embodiment of this application is suited to the data analysis model of a relational database; it can give full play to the characteristics of heterogeneous processors and distribute work tasks reasonably and efficiently, realizing query analysis under big-data scenarios. The method comprises step S11, step S12, step S13, and step S14, which specifically include:
Step S11: construct a pipeline query execution model on the CPU-GPU heterogeneous database system;
Step S12: determine the total number of pipelines to be executed; for example, the total number of pipelines to be executed is N;
Step S13: start the pipeline query execution model to distribute the N pipelines across the CPU and the GPU, and compute the system execution time of every load-distribution strategy from the separately determined execution time of a single pipeline on the CPU and on the GPU;
Step S14: determine the load-distribution strategy with the minimum system execution time among all strategies as the optimal CPU-GPU allocation strategy.
Steps S11 to S14 realize query analysis under big-data scenarios while balancing the task distribution between the CPU and the GPU, making full use of processor resources and improving system performance.
In one execution of the CPU-GPU load-distribution strategy, the pipeline query execution model and the load-distribution strategy were built on MapD Core, an open-source CPU-GPU heterogeneous database, version 3.3.1, and run on a machine equipped with one NVIDIA Tesla K80 GPU whose graphics card contains about 22 GB of global memory. The corresponding server is equipped with two ten-core Xeon E5-2630 v4 CPUs and 224 GB of RAM; the modified system runs on the CentOS 7 Linux release with the Linux 3.10.0 kernel.
In this embodiment, step S11, constructing a pipeline query execution model on the CPU-GPU heterogeneous database system, proceeds as follows:
obtain a query statement, which contains the data to be queried and one or more query conditions; here, the data to be queried may be, but is not limited to, one or more relational database tables;
divide the data to be queried into N data fragments according to the preset fragment size; when the data to be queried is a relational database table, each data fragment is a sub-table of that table, and the preset fragment size is a preset number of tuples of the table;
start one pipeline for each of the N data fragments (sub-query objects), so that the total number of started pipelines is N; each pipeline may be assigned either to the CPU or to the GPU, and each pipeline distributed to the CPU or the GPU executes the query on its own data fragment. When each pipeline finishes, its execution result r_i is obtained, where i is the number of the pipeline (or the number of its data fragment), i = 1, 2, ..., N−1, N; the final query result R of the query statement is obtained by combining the execution results r_i of all pipelines. When the query ends, the system execution time under the current CPU-GPU load-distribution strategy is recorded, realizing both the record of the system execution time of the current strategy and the collection of the final query execution result.
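The launch-and-collect step can be sketched in Python; `execute_fragment` is a hypothetical stand-in for a pipeline's input/execute/output stages on either processor, since the patent does not prescribe an API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipelines(fragments, execute_fragment):
    """Start one pipeline per data fragment, run them in parallel, and
    gather the per-pipeline results r_1..r_N in fragment order to form
    the final query result R."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(execute_fragment, fragments))
```

`pool.map` preserves input order, so result i corresponds to fragment i regardless of which pipeline finished first.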
In this embodiment, step S12, determining the total number of pipelines to be executed, comprises:
obtaining a query statement, wherein the query statement includes the data to be queried; here, the query statement also includes a query condition, so a pipeline distributed to the CPU is a thread instance run on the CPU according to the query condition, and a pipeline distributed to the GPU is a kernel-function instance run on the GPU according to the query condition;
dividing the data to be queried according to the preset fragment size to obtain the data fragments of the data to be queried and their total number; for example, the total number of data fragments n = (size of the data to be queried) / (preset fragment size);
starting one pipeline for each data fragment of the data to be queried, so that the total number of pipelines to be executed is determined by the total number of data fragments; for example, starting one pipeline per data fragment means n data fragments correspond to n started pipelines, and the total number of pipelines to be executed by the system is N = n.
In this embodiment, step S13, starting the pipeline query execution model to distribute the pipelines across the CPU and the GPU and computing the system execution time of every load-distribution strategy from the separately determined execution time of a single pipeline on the CPU and on the GPU, comprises:
Step 1: start the pipeline query execution model and set the initial load-distribution strategy: the number of pipelines assigned to the CPU is N_CPU = 0 and the number of pipelines assigned to the GPU is N_GPU = N, where N is the total number of pipelines and a positive integer greater than or equal to 1;
Step 2: execute in parallel the pipelines assigned to the CPU and to the GPU, obtaining the CPU execution time T_CPU and the GPU execution time T_GPU of the current load-distribution strategy;
Step 3: if the CPU execution time T_CPU and the GPU execution time T_GPU are equal, determine that value as the system execution time of the current load-distribution strategy, T = T_CPU or T = T_GPU; if they are unequal, determine the larger of the two as the system execution time of the current load-distribution strategy, T = max{T_CPU, T_GPU}. Since the CPU and the GPU start executing at the same time, the system reaches its best performance only when the execution times of the two processors differ as little as possible; in the optimal case, when the difference is 0, the system is considered to have reached the optimal load-balancing state, i.e. the current assignment of pipelines to the CPU and the GPU is the best load-distribution strategy;
Step 4: update the load-distribution strategy: N_CPU = N_CPU + 1 and N_GPU = N_GPU − 1, where N_CPU + N_GPU = N;
Step 5: repeat Steps 2 to 4 until the system execution time of every load-distribution strategy has been obtained, i.e. until all load-distribution strategies have been executed.
In this embodiment, before step S13 computes the system execution time of every load-distribution strategy from the separately determined execution time of a single pipeline on the CPU and on the GPU, the method further comprises:
determining the execution time of a single pipeline on the CPU from the data input time T_IN_C, the data execution time T_EXE_C, and the data output time T_OUT_C of a data fragment on the CPU; that is, the execution time of a single pipeline on the CPU is composed of three stages: the data input time T_IN_C, which is the time taken to copy the data fragment into the memory of the CPU; the data execution time T_EXE_C, which is the time taken by the thread instance running on the CPU; and the data output time T_OUT_C, which is the time taken to copy the execution result of the pipeline within the CPU memory;
determining the execution time of a single pipeline on the GPU from the data input time T_IN_G, the data execution time T_EXE_G, and the data output time T_OUT_G of a data fragment on the GPU; that is, the execution time of a single pipeline on the GPU is composed of three stages: the data input time T_IN_G, which is the time taken to copy the data fragment from the memory of the CPU to the device memory of the GPU; the data execution time T_EXE_G, which is the time taken by the kernel-function instance running on the GPU; and the data output time T_OUT_G, which is the time taken to copy the execution result of the pipeline from the device memory of the GPU back to the memory of the CPU.
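The three-stage breakdown of a single pipeline can be captured in a small Python structure; this is an illustrative sketch, and the names are not from the patent:

```python
from dataclasses import dataclass

@dataclass
class StageTimes:
    """Per-fragment stage durations on one processor: copy-in (input),
    thread/kernel run (execute), and copy of the result (output)."""
    t_in: float
    t_exe: float
    t_out: float

    def single_pipeline(self):
        # a lone pipeline runs its three stages back to back
        return self.t_in + self.t_exe + self.t_out

    def bottleneck(self):
        # the longest stage sets the offset between adjacent overlapped pipelines
        return max(self.t_in, self.t_exe, self.t_out)
```

One instance would hold the CPU-side measurements (T_IN_C, T_EXE_C, T_OUT_C) and another the GPU-side ones (T_IN_G, T_EXE_G, T_OUT_G).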
Because the multiple pipelines assigned to a processor, whether the CPU or the GPU, overlap one another during parallel execution, and two adjacent pipelines are offset by the execution time of exactly one stage, the formulas in Step 2 for the CPU execution time and the GPU execution time of the current load-distribution strategy are as follows.
The CPU execution time T_CPU of the current load-distribution strategy is:
T_CPU = T_IN_C + T_EXE_C + T_OUT_C + max{T_IN_C, T_EXE_C, T_OUT_C} × (N_CPU − 1).
Since the execution times of the three stages of a pipeline differ, on the CPU as well as on the GPU, the longest of the three stages is taken as the offset between two adjacent pipelines in order to record their overlap accurately. For example, from the maximum max{T_IN_C, T_EXE_C, T_OUT_C} among the data input time T_IN_C, the data execution time T_EXE_C, and the data output time T_OUT_C of a data fragment on the CPU, the offset between two adjacent pipelines is obtained; multiplying this stage maximum by the number of remaining pipelines gives the total duration contributed by the overlapped pipelines. Fig. 2 illustrates how the GPU execution time and the CPU execution time are computed under the current load-distribution strategy, with the pipeline execution on the GPU on the left of Fig. 2 and the pipeline execution on the CPU on the right.
As shown in Fig. 3, the GPU execution time T_GPU of the current load-distribution strategy is:
T_GPU = T_IN_G + T_EXE_G + T_OUT_G + max{T_IN_G, T_EXE_G, T_OUT_G} × (N_GPU − 1),
where max{T_IN_G, T_EXE_G, T_OUT_G} is the maximum among the data input time T_IN_G, the data execution time T_EXE_G, and the data output time T_OUT_G of a data fragment on the GPU.
Here, Fig. 3 compares the query execution times before and after the pipelines are divided between the CPU and the GPU under a load-distribution strategy: one marker indicates the finishing time point when all loads are placed on the GPU, and End indicates the finishing time point after the load is balanced across the two processors, the CPU and the GPU.
In this embodiment, after Step 2 executes in parallel the pipelines distributed to the CPU and to the GPU and obtains the CPU execution time and the GPU execution time of the current load-distribution strategy, the method further comprises:
obtaining the execution result of each pipeline on the CPU and the GPU; for example, the execution result of the i-th pipeline after completion is r_i, where i is the number of the pipeline (or the number of its data fragment), i = 1, 2, ..., N−1, N;
obtaining the final query execution result R of the query statement from the execution results of all pipelines, for example R = {r_1, r_2, ..., r_i, ..., r_(N−1), r_N}.
Following the above embodiment of the application, the load allocation strategy whose system execution duration is the minimum among all load allocation strategies in step S14 is taken as the optimal CPU-GPU load allocation strategy, i.e. {OptN_GPU, OptN_CPU} = FindMin(T[]), where OptN_GPU is the load of the GPU under the optimal load allocation strategy (the number of pipelines assigned to the GPU), OptN_CPU is the load of the CPU under the optimal load allocation strategy (the number of pipelines assigned to the CPU), T[] is the array storing the system execution durations under all load allocation strategies, and FindMin finds the minimum among the system execution durations under all load allocation strategies.
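A minimal sketch of the FindMin step, assuming T[] is indexed by the number of pipelines assigned to the CPU (the duration values below are hypothetical):

```python
def find_min(t):
    """Return (OptN_CPU, OptN_GPU, T_min): the index of the smallest system
    execution duration in t, its complement, and the duration itself.
    len(t) - 1 is the total number of pipelines N."""
    n = len(t) - 1
    opt_n_cpu = min(range(len(t)), key=lambda i: t[i])
    return opt_n_cpu, n - opt_n_cpu, t[opt_n_cpu]

# Hypothetical durations in ms for N = 3 pipelines, indexed by N_CPU = 0..3:
print(find_min([881, 587, 690, 950]))  # (1, 2, 587): CPU=1, GPU=2 is optimal
```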
In one practical application scenario of the CPU-GPU load-balancing method provided by this application, Figs. 4 and 5 show the results of executing a first query statement over data sets of 80 million rows and 140 million rows respectively, where the first query statement is: select avg(attr1) from tbl1 group by attr2. In Figs. 4 and 5, the left vertical axis indicates the system execution duration T corresponding to the current load allocation strategy, the right vertical axis (GPU Pipeline Number) indicates the number of pipelines N_GPU assigned to the GPU, and the horizontal axis (CPU Pipeline Number) indicates the number of pipelines N_CPU assigned to the CPU; the total number of pipelines on the CPU and GPU is fixed (N_GPU + N_CPU = 3 in Fig. 4 and N_GPU + N_CPU = 5 in Fig. 5), and Pipeline workload partitions denotes the allocation strategy of the pipeline load. When the number of pipelines assigned to the CPU is 0, all the load is executed on the GPU, i.e. the execution mode of a traditional CPU-GPU heterogeneous processing and analysis system. Fig. 4 shows that when the CPU load is 1 and the GPU load is 2 (one pipeline assigned to the CPU, two assigned to the GPU), the system execution duration T under the current load allocation strategy is 587 milliseconds, the shortest (T_min) among the system execution durations of all load allocation strategies for querying the 80-million-row data; that is, CPU load 1 and GPU load 2 is the optimal load allocation strategy for querying 80 million rows with the first query statement. Fig. 5 shows that when the CPU load is 2 and the GPU load is 3 (two pipelines assigned to the CPU, three assigned to the GPU), the system execution duration under the current load allocation strategy is 936 milliseconds, the shortest (T_min) among all load allocation strategies for the 140-million-row data; that is, CPU load 2 and GPU load 3 is the optimal load allocation strategy for querying 140 million rows with the first query statement. Under the first query statement, the system execution durations when the GPU carries the entire load are 881 milliseconds for 80 million rows and 1265 milliseconds for 140 million rows. Compared with the traditional execution mode, load balancing improves system performance by about 33% and 26% for the two data sizes, where 33% = (the GPU-only system execution duration for 80 million rows − the system execution duration under the load allocation strategy) / (the GPU-only system execution duration for 80 million rows) = (881 ms − 587 ms) / 881 ms, and 26% = (the GPU-only system execution duration for 140 million rows − the system execution duration under the load allocation strategy) / (the GPU-only system execution duration for 140 million rows) = (1265 ms − 936 ms) / 1265 ms.
In another practical application scenario of the CPU-GPU load-balancing method provided by this application, Figs. 6 and 7 show the results of executing a second query statement over data sets of 80 million rows and 140 million rows respectively, where the second query statement is: select count(*) from (select tbl1.attr1 from tbl1 join tbl2 on tbl1.attr1 = tbl2.attr1). Because the second query statement performs a join operation, its number of pipelines is larger than that of the first query statement. Fig. 6 shows that for 80 million rows, when the CPU load is 2 and the GPU load is 7, the system execution duration reaches its minimum T_min = 1361 milliseconds. Fig. 7 shows that for 140 million rows, when the CPU load is 7 and the GPU load is 18, the system execution duration reaches its minimum T_min = 3488 milliseconds and system performance is best. Compared with the system execution durations when the GPU carries the entire load under the second query statement, 1750 milliseconds for 80 million rows and 4845 milliseconds for 140 million rows, load balancing improves system performance by about 22% and 28%, where 22% = (the GPU-only system execution duration for 80 million rows − the system execution duration under the load allocation strategy) / (the GPU-only system execution duration for 80 million rows) = (1750 ms − 1361 ms) / 1750 ms, and 28% = (the GPU-only system execution duration for 140 million rows − the system execution duration under the load allocation strategy) / (the GPU-only system execution duration for 140 million rows) = (4845 ms − 3488 ms) / 4845 ms.
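The percentage improvements quoted in both scenarios follow one formula, (GPU-only duration − balanced duration) / GPU-only duration; a quick check against the reported figures:

```python
def improvement(gpu_only_ms, balanced_ms):
    """Fractional speedup of the balanced strategy over running all
    pipelines on the GPU alone."""
    return (gpu_only_ms - balanced_ms) / gpu_only_ms

# Figures reported for Figs. 4-7:
print(round(improvement(881, 587), 2))    # 0.33 -> ~33% (query 1, 80M rows)
print(round(improvement(1265, 936), 2))   # 0.26 -> ~26% (query 1, 140M rows)
print(round(improvement(1750, 1361), 2))  # 0.22 -> ~22% (query 2, 80M rows)
print(round(improvement(4845, 3488), 2))  # 0.28 -> ~28% (query 2, 140M rows)
```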
According to another aspect of the application, a non-volatile storage medium is also provided, storing computer-readable instructions which, when executed by a processor, cause the processor to implement the CPU-GPU-based load-balancing method described above.
According to another aspect of the application, a device is also provided, comprising:

one or more processors; and

a non-volatile storage medium for storing one or more computer-readable instructions,

wherein, when the one or more computer-readable instructions are executed by the one or more processors, the one or more processors implement the CPU-GPU-based load-balancing method described above.
Here, for the details of each embodiment of the device, reference may be made to the corresponding parts of the embodiments of the CPU-GPU-based load-balancing method executed on the device; they are not repeated here.
In summary, this application builds a pipeline query execution model on a CPU-GPU heterogeneous database system so that a CPU-GPU heterogeneous data analysis system can support query analysis under big-data scenarios; determines the total number of pipelines to be executed; starts the pipeline query execution model to distribute that total number of pipelines to the CPU and the GPU, and computes the system execution duration of every load allocation strategy from the determined execution duration of a single pipeline on the CPU and on the GPU respectively; and finally takes the load allocation strategy corresponding to the minimum among all system execution durations as the optimal CPU-GPU allocation strategy. This load-balancing strategy for a CPU-GPU heterogeneous data analysis system can reasonably distribute the pipeline load across the different processors and make full use of their computing resources, which not only improves system performance but also brings the system to its best overall performance.
It should be noted that the application may be implemented in software and/or a combination of software and hardware, for example using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In one embodiment, the software program of the application may be executed by a processor to implement the steps or functions described above. Likewise, the software program of the application (including related data structures) may be stored in a computer-readable recording medium, for example a RAM, a magnetic or optical drive, a floppy disk, or a similar device. In addition, some steps or functions of the application may be implemented in hardware, for example as circuits that cooperate with a processor to execute each step or function.
In addition, part of the application may be embodied as a computer program product, such as computer program instructions which, when executed by a computer, can invoke or provide the methods and/or technical solutions of the application through the operation of that computer. The program instructions invoking the methods of the application may be stored in a fixed or removable recording medium, and/or transmitted via broadcast or a data stream in another signal-bearing medium, and/or stored in the working memory of a computer device running according to the program instructions. Here, one embodiment of the application includes a device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to run the methods and/or technical solutions based on the foregoing embodiments of the application.
It is obvious to those skilled in the art that the application is not limited to the details of the above exemplary embodiments, and that the application can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as illustrative and not restrictive, and the scope of the application is defined by the appended claims rather than by the above description; it is intended that all changes falling within the meaning and range of equivalency of the claims be embraced within the application. No reference sign in a claim should be construed as limiting the claim concerned. Moreover, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or devices recited in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" denote names and do not indicate any particular order.

Claims (10)

1. A load-balancing method based on CPU-GPU, wherein the method comprises:
building a pipeline query execution model on a CPU-GPU heterogeneous database system;
determining the total number of pipelines to be executed;
starting the pipeline query execution model, distributing said total number of pipelines to the CPU and the GPU, and computing the system execution duration of every load allocation strategy according to the determined execution duration of a single pipeline on the CPU and on the GPU respectively; and
taking the load allocation strategy corresponding to the minimum among all system execution durations as the optimal load allocation strategy.
2. The method according to claim 1, wherein determining the total number of pipelines to be executed comprises:
obtaining a query statement, wherein the query statement includes the data to be queried;
dividing the data to be queried according to a preset data-fragment size to obtain the data fragments of the data to be queried and their total number; and
starting one pipeline for each data fragment of the data to be queried, so that the total number of pipelines to be executed is determined by the total number of data fragments.
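The slicing step in claim 2 can be sketched as follows; the row data and the fragment size are assumptions for illustration:

```python
def slice_data(rows, fragment_size):
    """Divide the data to be queried into fragments of a preset size;
    the fragment count determines the total number of pipelines to start."""
    fragments = [rows[i:i + fragment_size]
                 for i in range(0, len(rows), fragment_size)]
    return fragments, len(fragments)

rows = list(range(100))             # hypothetical 100-row table
fragments, total = slice_data(rows, 30)
print(total)                        # 4 fragments -> 4 pipelines
print([len(f) for f in fragments])  # [30, 30, 30, 10]
```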
3. The method according to claim 2, wherein starting the pipeline query execution model, distributing said total number of pipelines to the CPU and the GPU, and computing the system execution duration of every load allocation strategy according to the determined execution duration of a single pipeline on the CPU and on the GPU respectively, comprises:
step 1: starting the pipeline query execution model and setting the initial load allocation strategy: the number of pipelines assigned to the CPU is N_CPU = 0 and the number of pipelines assigned to the GPU is N_GPU = N, where N is the total number of pipelines and N is a positive integer greater than or equal to 1;
step 2: executing in parallel each pipeline assigned to the CPU and the GPU respectively, and obtaining the CPU execution duration and the GPU execution duration under the current load allocation strategy;
step 3: if the CPU execution duration and the GPU execution duration are equal, taking the CPU execution duration as the system execution duration of the current load allocation strategy; if they are unequal, taking the larger of the CPU execution duration and the GPU execution duration as the system execution duration of the current load allocation strategy;
step 4: updating the load allocation strategy: the number of pipelines assigned to the CPU becomes N_CPU = N_CPU + 1 and the number of pipelines assigned to the GPU becomes N_GPU = N_GPU − 1, where N_CPU + N_GPU = N; and
step 5: repeating steps 2 to 4 until the system execution durations of all load allocation strategies have been obtained.
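Steps 1 to 5 above enumerate every split of the N pipelines between the two processors. A sketch under the simplifying assumption that the per-stage durations are known constants rather than measured during execution (the stage values are hypothetical):

```python
def enumerate_strategies(n, cpu_stages, gpu_stages):
    """Steps 1-5 of claim 3: for every split (N_CPU, N_GPU) with
    N_CPU + N_GPU = n, the system execution duration is the larger of the
    two per-processor durations; returns the array T[] indexed by N_CPU."""
    def duration(stages, count):
        t_in, t_exe, t_out = stages
        if count == 0:
            return 0
        return t_in + t_exe + t_out + max(t_in, t_exe, t_out) * (count - 1)

    return [max(duration(cpu_stages, n_cpu), duration(gpu_stages, n - n_cpu))
            for n_cpu in range(n + 1)]

# Hypothetical (T_IN, T_EXE, T_OUT) stage durations in ms:
t = enumerate_strategies(3, cpu_stages=(4, 10, 2), gpu_stages=(3, 5, 2))
print(t)  # [20, 16, 26, 36]: N_CPU = 1, N_GPU = 2 gives the minimum
```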
4. The method according to claim 3, wherein, before computing the system execution duration of every load allocation strategy according to the determined execution duration of a single pipeline on the CPU and on the GPU respectively, the method further comprises:
determining the execution duration of a single pipeline on the CPU from a data fragment's data input duration T_IN_C, data execution duration T_EXE_C, and data output duration T_OUT_C on the CPU; and
determining the execution duration of a single pipeline on the GPU from a data fragment's data input duration T_IN_G, data execution duration T_EXE_G, and data output duration T_OUT_G on the GPU.
5. The method according to claim 4, wherein, in step 2, the formulas for obtaining the CPU execution duration and the GPU execution duration under the current load allocation strategy are respectively:

T_CPU = T_IN_C + T_EXE_C + T_OUT_C + Max{T_IN_C, T_EXE_C, T_OUT_C} × (N_CPU − 1),

T_GPU = T_IN_G + T_EXE_G + T_OUT_G + Max{T_IN_G, T_EXE_G, T_OUT_G} × (N_GPU − 1),

where T_CPU is the CPU execution duration under the current load allocation strategy and Max{T_IN_C, T_EXE_C, T_OUT_C} is the maximum of a data fragment's data input duration T_IN_C, data execution duration T_EXE_C, and data output duration T_OUT_C on the CPU; and
T_GPU is the GPU execution duration under the current load allocation strategy and Max{T_IN_G, T_EXE_G, T_OUT_G} is the maximum of a data fragment's data input duration T_IN_G, data execution duration T_EXE_G, and data output duration T_OUT_G on the GPU.
6. The method according to claim 5, wherein the query statement further includes a query condition, wherein each pipeline assigned to the CPU is a thread instance run on the CPU according to the query condition, and each pipeline assigned to the GPU is a kernel-function instance run on the GPU according to the query condition.
7. The method according to claim 6, wherein, after step 2 (executing in parallel each pipeline assigned to the CPU and the GPU respectively, and obtaining the CPU execution duration and the GPU execution duration under the current load allocation strategy), the method further comprises:
obtaining the execution result of each pipeline on the CPU and the GPU; and
obtaining the final query execution result of the query statement from the execution result of each pipeline.
8. The method according to claim 7, wherein a data fragment's data input duration T_IN_C on the CPU is the duration of copying the data fragment to the memory of the CPU;
a data fragment's data execution duration T_EXE_C on the CPU is the duration taken by the thread instance run on the CPU;
a data fragment's data output duration T_OUT_C on the CPU is the duration of copying the pipeline's execution result within the memory of the CPU;
a data fragment's data input duration T_IN_G on the GPU is the duration of copying the data fragment from the memory of the CPU to the video memory of the GPU;
a data fragment's data execution duration T_EXE_G on the GPU is the duration taken by the kernel-function instance run on the GPU; and
a data fragment's data output duration T_OUT_G on the GPU is the duration of copying the pipeline's execution result from the video memory of the GPU to the memory of the CPU.
9. A non-volatile storage medium storing computer-readable instructions which, when executed by a processor, cause the processor to implement the method according to any one of claims 1 to 8.
10. A device, comprising:
one or more processors; and
a non-volatile storage medium for storing one or more computer-readable instructions,
wherein, when the one or more computer-readable instructions are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1 to 8.
CN201811064037.5A 2018-09-12 2018-09-12 Load balancing method and device based on CPU-GPU Active CN109213601B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811064037.5A CN109213601B (en) 2018-09-12 2018-09-12 Load balancing method and device based on CPU-GPU


Publications (2)

Publication Number Publication Date
CN109213601A true CN109213601A (en) 2019-01-15
CN109213601B CN109213601B (en) 2021-01-01

Family

ID=64984143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811064037.5A Active CN109213601B (en) 2018-09-12 2018-09-12 Load balancing method and device based on CPU-GPU

Country Status (1)

Country Link
CN (1) CN109213601B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918141A (en) * 2019-03-15 2019-06-21 Oppo广东移动通信有限公司 Thread execution method, device, terminal and storage medium
CN110069527A (en) * 2019-04-22 2019-07-30 电子科技大学 A kind of GPU and CPU isomery accelerated method of data base-oriented
CN110096367A (en) * 2019-05-14 2019-08-06 宁夏融媒科技有限公司 A kind of panorama real-time video method for stream processing based on more GPU
CN110287212A (en) * 2019-06-27 2019-09-27 浪潮商用机器有限公司 A kind of data service handling method, system and associated component
CN110298437A (en) * 2019-06-28 2019-10-01 Oppo广东移动通信有限公司 Separation calculation method, apparatus, storage medium and the mobile terminal of neural network
CN110490300A (en) * 2019-07-26 2019-11-22 苏州浪潮智能科技有限公司 A kind of operation accelerated method, apparatus and system based on deep learning
CN111062855A (en) * 2019-11-18 2020-04-24 中国航空工业集团公司西安航空计算技术研究所 Graph pipeline performance analysis method
CN111240820A (en) * 2020-01-13 2020-06-05 星环信息科技(上海)有限公司 Concurrency quantity increasing speed multiplying determining method, equipment and medium
CN112181689A (en) * 2020-09-30 2021-01-05 华东师范大学 Runtime system for efficiently scheduling GPU kernel under cloud
CN112989082A (en) * 2021-05-20 2021-06-18 南京甄视智能科技有限公司 CPU and GPU mixed self-adaptive face searching method and system
WO2021129873A1 (en) * 2019-12-27 2021-07-01 中兴通讯股份有限公司 Database querying method, device, apparatus, and storage medium
CN115437795A (en) * 2022-11-07 2022-12-06 东南大学 Video memory recalculation optimization method and system for heterogeneous GPU cluster load perception
US11954527B2 (en) 2020-12-09 2024-04-09 Industrial Technology Research Institute Machine learning system and resource allocation method thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101091175A (en) * 2004-09-16 2007-12-19 辉达公司 Load balancing
CN101706741A (en) * 2009-12-11 2010-05-12 中国人民解放军国防科学技术大学 Method for partitioning dynamic tasks of CPU and GPU based on load balance
CN103329100A (en) * 2011-01-21 2013-09-25 英特尔公司 Load balancing in heterogeneous computing environments
US9311152B2 (en) * 2007-10-24 2016-04-12 Apple Inc. Methods and apparatuses for load balancing between multiple processing units


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHEN, WENFENG: "Research and Application of Load-Prediction Scheduling Algorithms in CPU-GPU Heterogeneous High-Performance Computing", China Doctoral Dissertations Full-text Database, Information Science and Technology Series *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918141B (en) * 2019-03-15 2020-11-27 Oppo广东移动通信有限公司 Thread execution method, thread execution device, terminal and storage medium
CN109918141A (en) * 2019-03-15 2019-06-21 Oppo广东移动通信有限公司 Thread execution method, device, terminal and storage medium
CN110069527A (en) * 2019-04-22 2019-07-30 电子科技大学 A kind of GPU and CPU isomery accelerated method of data base-oriented
CN110069527B (en) * 2019-04-22 2021-05-14 电子科技大学 Database-oriented GPU and CPU heterogeneous acceleration method
CN110096367A (en) * 2019-05-14 2019-08-06 宁夏融媒科技有限公司 A kind of panorama real-time video method for stream processing based on more GPU
CN110287212A (en) * 2019-06-27 2019-09-27 浪潮商用机器有限公司 A kind of data service handling method, system and associated component
CN110298437A (en) * 2019-06-28 2019-10-01 Oppo广东移动通信有限公司 Separation calculation method, apparatus, storage medium and the mobile terminal of neural network
CN110298437B (en) * 2019-06-28 2021-06-01 Oppo广东移动通信有限公司 Neural network segmentation calculation method and device, storage medium and mobile terminal
CN110490300A (en) * 2019-07-26 2019-11-22 苏州浪潮智能科技有限公司 A kind of operation accelerated method, apparatus and system based on deep learning
CN110490300B (en) * 2019-07-26 2022-03-15 苏州浪潮智能科技有限公司 Deep learning-based operation acceleration method, device and system
CN111062855A (en) * 2019-11-18 2020-04-24 中国航空工业集团公司西安航空计算技术研究所 Graph pipeline performance analysis method
CN111062855B (en) * 2019-11-18 2023-09-05 中国航空工业集团公司西安航空计算技术研究所 Graphic pipeline performance analysis method
WO2021129873A1 (en) * 2019-12-27 2021-07-01 中兴通讯股份有限公司 Database querying method, device, apparatus, and storage medium
CN111240820A (en) * 2020-01-13 2020-06-05 星环信息科技(上海)有限公司 Concurrency quantity increasing speed multiplying determining method, equipment and medium
CN111240820B (en) * 2020-01-13 2020-11-24 星环信息科技(上海)有限公司 Concurrency quantity increasing speed multiplying determining method, equipment and medium
CN112181689A (en) * 2020-09-30 2021-01-05 华东师范大学 Runtime system for efficiently scheduling GPU kernel under cloud
US11954527B2 (en) 2020-12-09 2024-04-09 Industrial Technology Research Institute Machine learning system and resource allocation method thereof
CN112989082B (en) * 2021-05-20 2021-07-23 南京甄视智能科技有限公司 CPU and GPU mixed self-adaptive face searching method and system
CN112989082A (en) * 2021-05-20 2021-06-18 南京甄视智能科技有限公司 CPU and GPU mixed self-adaptive face searching method and system
CN115437795A (en) * 2022-11-07 2022-12-06 东南大学 Video memory recalculation optimization method and system for heterogeneous GPU cluster load perception

Also Published As

Publication number Publication date
CN109213601B (en) 2021-01-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant