CN109213601A - CPU-GPU-based load-balancing method and device - Google Patents

CPU-GPU-based load-balancing method and device Download PDF

Info

Publication number
CN109213601A
CN109213601A (application CN201811064037.5A)
Authority
CN
China
Prior art keywords
cpu
gpu
duration
data
pipeline
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811064037.5A
Other languages
Chinese (zh)
Other versions
CN109213601B (en)
Inventor
翁楚良
孙婷婷
黄皓
王嘉伦
Current Assignee
East China Normal University
Original Assignee
East China Normal University
Priority date
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN201811064037.5A priority Critical patent/CN109213601B/en
Publication of CN109213601A publication Critical patent/CN109213601A/en
Application granted granted Critical
Publication of CN109213601B publication Critical patent/CN109213601B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Abstract

The purpose of this application is to provide a CPU-GPU-based load-balancing method and device. By constructing a pipeline query execution model on a CPU-GPU heterogeneous database system, the application enables a CPU-GPU heterogeneous data analysis system to support query analysis under big-data scenarios. The method determines the total number of pipelines to be executed; starts the pipeline query execution model to distribute that number of pipelines across the CPU and the GPU, and computes the system execution time of every load-distribution strategy from the separately determined execution time of a single pipeline on the CPU and on the GPU; finally, the strategy with the minimum system execution time is determined as the optimal CPU-GPU allocation strategy. Load balancing of the CPU-GPU heterogeneous data analysis system can reasonably distribute pipeline loads to the different processors and make full use of processor computing resources, which not only improves system performance but also allows the system to reach its best overall performance.

Description

CPU-GPU-based load-balancing method and device
Technical field
This application relates to the computer field, and more particularly to a CPU-GPU-based load-balancing method and device.
Background technique
General-purpose graphics processing units (Graphics Processing Unit, GPU) are widely used in fields such as matrix computation and machine learning. In recent years, the rapidly growing demand of data-intensive applications has driven the development of GPU-based heterogeneous online analytical processing platforms. Because a GPU contains many compute units that can run a large number of threads simultaneously, a data analysis system with the GPU as the primary processor outperforms a traditional CPU analysis system in most cases, shortening execution time by several orders of magnitude.
In a traditional relational query analysis system, when a client sends a query request, the system creates an analysis job, parses the request, and converts it to a logical query plan; a query-plan optimizer then selects the optimal physical query plan for execution according to some principle (such as minimum cost). A physical query plan is a directed acyclic graph (DAG) containing multiple operators, which are executed among themselves in a certain order.
In current CPU-GPU heterogeneous analysis systems, the GPU is the primary processor for query execution, and operator execution is mainly placed on the GPU, while the CPU is mainly responsible for data distribution and collection; when a subsequent operation needs the intermediate result output by a previous step, the CPU also does some processing on that intermediate result.
The analysis workloads handled by data-management and analysis systems are moving toward big-data scenarios: data volume grows exponentially and workloads become heavier. However, because a GPU can only directly process data in its own storage medium, and device-memory capacity is limited, the GPU cannot complete the processing of a large data set in a single load. When the input data or the intermediate results are too large to fit into GPU global memory, analysis efficiency stays low, or the task even fails. The prior art evades this problem by limiting the size of the query table, or transfers the computing task to the CPU as a fallback strategy, but neither is an optimal solution.
In conclusion the use of GPU being that data analysis system accelerates query analysis at present on the heterogeneous platform of CPU-GPU Though it is effective, but still have the following problems: GPU video memory capacity is limited, and the processing of large data sets can not be completed by single load, And the task distribution between CPU and GPU is unbalanced, does not make full use of heterogeneous processor resource.
Summary of the invention
The purpose of this application is to provide a CPU-GPU-based load-balancing method and device, in order to solve the prior-art problems that GPU device-memory capacity is limited, that the processing of large data sets cannot be completed in a single load, and that unbalanced task distribution between the CPU and the GPU leaves heterogeneous processor resources underutilized.
According to one aspect of this application, a CPU-GPU-based load-balancing method is provided, the method comprising:
constructing a pipeline query execution model on a CPU-GPU heterogeneous database system;
determining the total number of pipelines to be executed;
starting the pipeline query execution model to distribute the total number of pipelines across the CPU and the GPU, and computing the system execution time of every load-distribution strategy from the separately determined execution time of a single pipeline on the CPU and on the GPU;
determining the load-distribution strategy whose system execution time is the minimum among all strategies as the optimal CPU-GPU allocation strategy.
Further, in the above method, determining the total number of pipelines to be executed comprises:
obtaining a query statement, wherein the query statement includes the data to be queried;
dividing the data to be queried according to a preset data-fragment size to obtain the data fragments of the data to be queried and their total number;
starting one pipeline for each data fragment of the data to be queried, so that the total number of pipelines to be executed is determined by the total number of data fragments.
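As an illustration of the fragmentation step above, here is a minimal Python sketch; the function and constant names are hypothetical, since the patent specifies only that the fragment size is a preset number of tuples:

```python
FRAGMENT_ROWS = 1_000_000  # preset fragment size in tuples; illustrative value only

def count_pipelines(table_rows, fragment_rows=FRAGMENT_ROWS):
    """One pipeline is started per data fragment, so the pipeline total N
    equals the number of fragments (ceiling division of rows by fragment size)."""
    return (table_rows + fragment_rows - 1) // fragment_rows
```

For example, 80 million rows at one million tuples per fragment would yield N = 80 pipelines to be executed.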
Further, in the above method, starting the pipeline query execution model to distribute the total number of pipelines across the CPU and the GPU, and computing the system execution time of every load-distribution strategy from the separately determined execution time of a single pipeline on the CPU and on the GPU, comprises:
Step 1: start the pipeline query execution model and set the initial load-distribution strategy: the number of pipelines assigned to the CPU is N_CPU = 0 and the number of pipelines assigned to the GPU is N_GPU = N, where N is the total number of pipelines and a positive integer greater than or equal to 1;
Step 2: execute in parallel the pipelines assigned to the CPU and to the GPU, obtaining the CPU execution time and the GPU execution time of the current load-distribution strategy;
Step 3: if the CPU execution time and the GPU execution time are equal, determine that value as the system execution time of the current load-distribution strategy; if they are unequal, determine the larger of the two as the system execution time of the current load-distribution strategy;
Step 4: update the load-distribution strategy: N_CPU = N_CPU + 1 and N_GPU = N_GPU − 1, where N_CPU + N_GPU = N;
Step 5: repeat Steps 2 to 4 until the system execution time of every load-distribution strategy has been obtained.
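Steps 1 to 5 above amount to an exhaustive search over the N + 1 possible splits. A minimal Python sketch, under the assumption that a `run_strategy(n_cpu, n_gpu)` callable (a hypothetical name, not from the patent) executes the assigned pipelines in parallel and returns the measured pair (T_CPU, T_GPU):

```python
def find_best_allocation(n_total, run_strategy):
    """Try every split (N_CPU, N_GPU) with N_CPU + N_GPU = n_total and
    return (system_time, n_cpu, n_gpu) for the split with the minimum
    system execution time, which is max(T_CPU, T_GPU) per strategy."""
    best = None
    for n_cpu in range(n_total + 1):
        n_gpu = n_total - n_cpu
        t_cpu, t_gpu = run_strategy(n_cpu, n_gpu)
        t_sys = max(t_cpu, t_gpu)  # equal times would give the balanced optimum
        if best is None or t_sys < best[0]:
            best = (t_sys, n_cpu, n_gpu)
    return best
```

With a processor pair where, say, each CPU pipeline costs 100 ms and each GPU pipeline 50 ms, the search naturally settles on the split where the two processors finish closest together.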
Further, in the above method, before computing the system execution time of every load-distribution strategy from the separately determined execution time of a single pipeline on the CPU and on the GPU, the method further comprises:
determining the execution time of a single pipeline on the CPU from the data input time T_IN_C, the data execution time T_EXE_C, and the data output time T_OUT_C of a data fragment on the CPU;
determining the execution time of a single pipeline on the GPU from the data input time T_IN_G, the data execution time T_EXE_G, and the data output time T_OUT_G of a data fragment on the GPU.
Further, in the above method, the formulas in Step 2 for obtaining the CPU execution time and the GPU execution time of the current load-distribution strategy are, respectively:
T_CPU = T_IN_C + T_EXE_C + T_OUT_C + max{T_IN_C, T_EXE_C, T_OUT_C} × (N_CPU − 1),
T_GPU = T_IN_G + T_EXE_G + T_OUT_G + max{T_IN_G, T_EXE_G, T_OUT_G} × (N_GPU − 1),
where T_CPU is the CPU execution time of the current load-distribution strategy, and max{T_IN_C, T_EXE_C, T_OUT_C} is the maximum among the data input time T_IN_C, the data execution time T_EXE_C, and the data output time T_OUT_C of a data fragment on the CPU;
T_GPU is the GPU execution time of the current load-distribution strategy, and max{T_IN_G, T_EXE_G, T_OUT_G} is the maximum among the data input time T_IN_G, the data execution time T_EXE_G, and the data output time T_OUT_G of a data fragment on the GPU.
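The two formulas share one shape: the first pipeline runs all three stages in full, and each subsequent pipeline adds only its longest (bottleneck) stage, because adjacent pipelines overlap. A direct Python transcription of that shape (a sketch, not code from the patent):

```python
def processor_time(t_in, t_exe, t_out, n_pipelines):
    """T = t_in + t_exe + t_out + max(t_in, t_exe, t_out) * (n - 1):
    the first pipeline pays for all three stages; each later pipeline is
    offset by the longest stage, which limits how closely they overlap."""
    if n_pipelines == 0:
        return 0.0  # no pipelines assigned to this processor
    return t_in + t_exe + t_out + max(t_in, t_exe, t_out) * (n_pipelines - 1)
```

The same function covers both T_CPU and T_GPU; only the measured stage durations and the pipeline count differ.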
Further, in the above method, the query statement also includes a query condition; a pipeline distributed to the CPU is a thread instance that runs on the CPU according to the query condition, and a pipeline distributed to the GPU is a kernel-function instance that runs on the GPU according to the query condition.
Further, in the above method, after Step 2 executes in parallel the pipelines distributed to the CPU and to the GPU and obtains the CPU execution time and the GPU execution time of the current load-distribution strategy, the method further comprises:
obtaining the execution result of each pipeline on the CPU and the GPU;
obtaining the final query execution result of the query statement from the execution results of all pipelines.
Further, in the above method, the data input time T_IN_C of a data fragment on the CPU is the time taken to copy the data fragment into the memory of the CPU;
the data execution time T_EXE_C of a data fragment on the CPU is the time taken by the thread instance running on the CPU;
the data output time T_OUT_C of a data fragment on the CPU is the time taken to copy the execution result of the corresponding pipeline within the CPU memory;
the data input time T_IN_G of a data fragment on the GPU is the time taken to copy the data fragment from the memory of the CPU to the device memory of the GPU;
the data execution time T_EXE_G of a data fragment on the GPU is the time taken by the kernel-function instance running on the GPU;
the data output time T_OUT_G of a data fragment on the GPU is the time taken to copy the execution result of the corresponding pipeline from the device memory of the GPU back to the memory of the CPU.
According to another aspect of this application, a non-volatile storage medium is also provided, on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the processor implements the CPU-GPU-based load-balancing method described above.
According to another aspect of this application, a device is also provided, comprising:
one or more processors; and
a non-volatile storage medium for storing one or more computer-readable instructions,
which, when executed by the one or more processors, cause the one or more processors to implement the CPU-GPU-based load-balancing method described above.
Compared with the prior art, this application constructs a pipeline query execution model on a CPU-GPU heterogeneous database system, enabling the CPU-GPU heterogeneous data analysis system to support query analysis under big-data scenarios; determines the total number of pipelines to be executed; starts the pipeline query execution model to distribute that number of pipelines across the CPU and the GPU, and computes the system execution time of every load-distribution strategy from the separately determined execution time of a single pipeline on the CPU and on the GPU; and finally determines the strategy with the minimum system execution time as the optimal CPU-GPU allocation strategy. Load balancing of the CPU-GPU heterogeneous data analysis system can reasonably distribute pipeline loads to the different processors and make full use of processor computing resources, which not only improves system performance but also allows the system to reach its best overall performance.
Detailed description of the invention
Other features, objects, and advantages of this application will become more apparent by reading the following detailed description of non-restrictive embodiments, made with reference to the accompanying drawings:
Fig. 1 shows a flow diagram of a CPU-GPU-based load-balancing method according to one aspect of this application;
Fig. 2 shows how the GPU execution time and the CPU execution time of a load-distribution strategy are computed in a CPU-GPU-based load-balancing method according to one aspect of this application;
Fig. 3 shows how the query execution time of a load-distribution strategy is computed in a CPU-GPU-based load-balancing method according to one aspect of this application;
Fig. 4 shows the CPU-GPU load-distribution strategies obtained by applying a first query statement to 80 million rows of data under the pipeline query execution model, according to one aspect of this application;
Fig. 5 shows the CPU-GPU load-distribution strategies obtained by applying the first query statement to 140 million rows of data under the pipeline query execution model, according to one aspect of this application;
Fig. 6 shows the CPU-GPU load-distribution strategies obtained by applying a second query statement to 80 million rows of data under the pipeline query execution model, according to one aspect of this application;
Fig. 7 shows the CPU-GPU load-distribution strategies obtained by applying the second query statement to 140 million rows of data under the pipeline query execution model, according to one aspect of this application;
In the drawings, the same or similar reference numerals denote the same or similar components.
Specific embodiment
The application is described in further detail below with reference to the accompanying drawings.
In a typical configuration of this application, the terminal, the device of the service network, and the trusted party each include one or more processors (CPUs), an input/output interface, a network interface, and memory.
The memory may include non-permanent memory in a computer-readable medium, random-access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact-disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
As shown in Fig. 1, the CPU-GPU-based load-balancing method in an embodiment of this application is suited to the data analysis model of a relational database; it can give full play to the characteristics of heterogeneous processors and distribute work tasks reasonably and efficiently, realizing query analysis under big-data scenarios. The method comprises step S11, step S12, step S13, and step S14, which specifically include:
Step S11: construct a pipeline query execution model on the CPU-GPU heterogeneous database system;
Step S12: determine the total number of pipelines to be executed; for example, the total number of pipelines to be executed is N;
Step S13: start the pipeline query execution model to distribute the N pipelines across the CPU and the GPU, and compute the system execution time of every load-distribution strategy from the separately determined execution time of a single pipeline on the CPU and on the GPU;
Step S14: determine the load-distribution strategy with the minimum system execution time among all strategies as the optimal CPU-GPU allocation strategy.
Steps S11 to S14 realize query analysis under big-data scenarios while balancing the task distribution between the CPU and the GPU, making full use of processor resources and improving system performance.
In one execution of the CPU-GPU load-distribution strategy, the pipeline query execution model and the load-distribution strategy were built on MapD Core, an open-source CPU-GPU heterogeneous database, version 3.3.1, and run on a machine equipped with one NVIDIA Tesla K80 GPU whose graphics card contains about 22 GB of global memory. The corresponding server is equipped with two ten-core Xeon E5-2630 v4 CPUs and 224 GB of RAM; the modified system runs on the CentOS 7 Linux release with the Linux 3.10.0 kernel.
In this embodiment, step S11, constructing a pipeline query execution model on the CPU-GPU heterogeneous database system, proceeds as follows:
obtain a query statement, which contains the data to be queried and one or more query conditions; here, the data to be queried may be, but is not limited to, one or more relational database tables;
divide the data to be queried into N data fragments according to the preset fragment size; when the data to be queried is a relational database table, each data fragment is a sub-table of that table, and the preset fragment size is a preset number of tuples of the table;
start one pipeline for each of the N data fragments (sub-query objects), so that the total number of started pipelines is N; each pipeline may be assigned either to the CPU or to the GPU, and each pipeline distributed to the CPU or the GPU executes the query on its own data fragment. When each pipeline finishes, its execution result r_i is obtained, where i is the number of the pipeline (or the number of its data fragment), i = 1, 2, ..., N−1, N; the final query result R of the query statement is obtained by combining the execution results r_i of all pipelines. When the query ends, the system execution time under the current CPU-GPU load-distribution strategy is recorded, realizing both the record of the system execution time of the current strategy and the collection of the final query execution result.
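The launch-and-collect step can be sketched in Python; `execute_fragment` is a hypothetical stand-in for a pipeline's input/execute/output stages on either processor, since the patent does not prescribe an API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipelines(fragments, execute_fragment):
    """Start one pipeline per data fragment, run them in parallel, and
    gather the per-pipeline results r_1..r_N in fragment order to form
    the final query result R."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(execute_fragment, fragments))
```

`pool.map` preserves input order, so result i corresponds to fragment i regardless of which pipeline finished first.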
In this embodiment, step S12, determining the total number of pipelines to be executed, comprises:
obtaining a query statement, wherein the query statement includes the data to be queried; here, the query statement also includes a query condition, so a pipeline distributed to the CPU is a thread instance run on the CPU according to the query condition, and a pipeline distributed to the GPU is a kernel-function instance run on the GPU according to the query condition;
dividing the data to be queried according to the preset fragment size to obtain the data fragments of the data to be queried and their total number; for example, the total number of data fragments n = (size of the data to be queried) / (preset fragment size);
starting one pipeline for each data fragment of the data to be queried, so that the total number of pipelines to be executed is determined by the total number of data fragments; for example, starting one pipeline per data fragment means n data fragments correspond to n started pipelines, and the total number of pipelines to be executed by the system is N = n.
In this embodiment, step S13, starting the pipeline query execution model to distribute the pipelines across the CPU and the GPU and computing the system execution time of every load-distribution strategy from the separately determined execution time of a single pipeline on the CPU and on the GPU, comprises:
Step 1: start the pipeline query execution model and set the initial load-distribution strategy: the number of pipelines assigned to the CPU is N_CPU = 0 and the number of pipelines assigned to the GPU is N_GPU = N, where N is the total number of pipelines and a positive integer greater than or equal to 1;
Step 2: execute in parallel the pipelines assigned to the CPU and to the GPU, obtaining the CPU execution time T_CPU and the GPU execution time T_GPU of the current load-distribution strategy;
Step 3: if the CPU execution time T_CPU and the GPU execution time T_GPU are equal, determine that value as the system execution time of the current load-distribution strategy, T = T_CPU or T = T_GPU; if they are unequal, determine the larger of the two as the system execution time of the current load-distribution strategy, T = max{T_CPU, T_GPU}. Since the CPU and the GPU start executing at the same time, the system reaches its best performance only when the execution times of the two processors differ as little as possible; in the optimal case, when the difference is 0, the system is considered to have reached the optimal load-balancing state, i.e. the current assignment of pipelines to the CPU and the GPU is the best load-distribution strategy;
Step 4: update the load-distribution strategy: N_CPU = N_CPU + 1 and N_GPU = N_GPU − 1, where N_CPU + N_GPU = N;
Step 5: repeat Steps 2 to 4 until the system execution time of every load-distribution strategy has been obtained, i.e. until all load-distribution strategies have been executed.
In this embodiment, before step S13 computes the system execution time of every load-distribution strategy from the separately determined execution time of a single pipeline on the CPU and on the GPU, the method further comprises:
determining the execution time of a single pipeline on the CPU from the data input time T_IN_C, the data execution time T_EXE_C, and the data output time T_OUT_C of a data fragment on the CPU; that is, the execution time of a single pipeline on the CPU is composed of three stages: the data input time T_IN_C, which is the time taken to copy the data fragment into the memory of the CPU; the data execution time T_EXE_C, which is the time taken by the thread instance running on the CPU; and the data output time T_OUT_C, which is the time taken to copy the execution result of the pipeline within the CPU memory;
determining the execution time of a single pipeline on the GPU from the data input time T_IN_G, the data execution time T_EXE_G, and the data output time T_OUT_G of a data fragment on the GPU; that is, the execution time of a single pipeline on the GPU is composed of three stages: the data input time T_IN_G, which is the time taken to copy the data fragment from the memory of the CPU to the device memory of the GPU; the data execution time T_EXE_G, which is the time taken by the kernel-function instance running on the GPU; and the data output time T_OUT_G, which is the time taken to copy the execution result of the pipeline from the device memory of the GPU back to the memory of the CPU.
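The three-stage breakdown of a single pipeline can be captured in a small Python structure; this is an illustrative sketch, and the names are not from the patent:

```python
from dataclasses import dataclass

@dataclass
class StageTimes:
    """Per-fragment stage durations on one processor: copy-in (input),
    thread/kernel run (execute), and copy of the result (output)."""
    t_in: float
    t_exe: float
    t_out: float

    def single_pipeline(self):
        # a lone pipeline runs its three stages back to back
        return self.t_in + self.t_exe + self.t_out

    def bottleneck(self):
        # the longest stage sets the offset between adjacent overlapped pipelines
        return max(self.t_in, self.t_exe, self.t_out)
```

One instance would hold the CPU-side measurements (T_IN_C, T_EXE_C, T_OUT_C) and another the GPU-side ones (T_IN_G, T_EXE_G, T_OUT_G).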
Because the multiple pipelines assigned to a processor, whether the CPU or the GPU, overlap one another during parallel execution, and two adjacent pipelines are offset by the execution time of exactly one stage, the formulas in Step 2 for the CPU execution time and the GPU execution time of the current load-distribution strategy are as follows.
The CPU execution time T_CPU of the current load-distribution strategy is:
T_CPU = T_IN_C + T_EXE_C + T_OUT_C + max{T_IN_C, T_EXE_C, T_OUT_C} × (N_CPU − 1).
Since the execution times of the three stages of a pipeline differ, on the CPU as well as on the GPU, the longest of the three stages is taken as the offset between two adjacent pipelines in order to record their overlap accurately. For example, from the maximum max{T_IN_C, T_EXE_C, T_OUT_C} among the data input time T_IN_C, the data execution time T_EXE_C, and the data output time T_OUT_C of a data fragment on the CPU, the offset between two adjacent pipelines is obtained; multiplying this stage maximum by the number of remaining pipelines gives the total duration contributed by the overlapped pipelines. Fig. 2 illustrates how the GPU execution time and the CPU execution time are computed under the current load-distribution strategy, with the pipeline execution on the GPU on the left of Fig. 2 and the pipeline execution on the CPU on the right.
As shown in Fig. 3, the GPU execution time T_GPU of the current load-distribution strategy is:
T_GPU = T_IN_G + T_EXE_G + T_OUT_G + max{T_IN_G, T_EXE_G, T_OUT_G} × (N_GPU − 1),
where max{T_IN_G, T_EXE_G, T_OUT_G} is the maximum among the data input time T_IN_G, the data execution time T_EXE_G, and the data output time T_OUT_G of a data fragment on the GPU.
Here, Fig. 3 compares the query execution times before and after the pipelines are divided between the CPU and the GPU under a load-distribution strategy: one marker indicates the finishing time point when all loads are placed on the GPU, and End indicates the finishing time point after the load is balanced across the two processors, the CPU and the GPU.
In this embodiment, after Step 2 executes in parallel the pipelines distributed to the CPU and to the GPU and obtains the CPU execution time and the GPU execution time of the current load-distribution strategy, the method further comprises:
obtaining the execution result of each pipeline on the CPU and the GPU; for example, the execution result of the i-th pipeline after completion is r_i, where i is the number of the pipeline (or the number of its data fragment), i = 1, 2, ..., N−1, N;
obtaining the final query execution result R of the query statement from the execution results of all pipelines, for example R = {r_1, r_2, ..., r_i, ..., r_(N−1), r_N}.
Following the above embodiment of the application, the load allocation strategy whose system execution duration is the minimum among all load allocation strategies in step S14 is taken as the optimal CPU-GPU load allocation strategy, i.e. {OptN_GPU, OptN_CPU} = FindMin(T[]), where OptN_GPU is the load of the GPU under the optimal load allocation strategy (the number of pipelines assigned to the GPU), OptN_CPU is the load of the CPU under the optimal load allocation strategy (the number of pipelines assigned to the CPU), T[] is the array storing the system execution durations under all load allocation strategies, and FindMin finds the minimum among the system execution durations under all load allocation strategies.
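A minimal sketch of the FindMin step, assuming T[] is indexed by the number of pipelines assigned to the CPU (the duration values below are hypothetical):

```python
def find_min(t):
    """Return (OptN_CPU, OptN_GPU, T_min): the index of the smallest system
    execution duration in t, its complement, and the duration itself.
    len(t) - 1 is the total number of pipelines N."""
    n = len(t) - 1
    opt_n_cpu = min(range(len(t)), key=lambda i: t[i])
    return opt_n_cpu, n - opt_n_cpu, t[opt_n_cpu]

# Hypothetical durations in ms for N = 3 pipelines, indexed by N_CPU = 0..3:
print(find_min([881, 587, 690, 950]))  # (1, 2, 587): CPU=1, GPU=2 is optimal
```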
In one practical application scenario of the CPU-GPU load-balancing method provided by this application, Figs. 4 and 5 show the results of executing a first query statement over data sets of 80 million rows and 140 million rows respectively, where the first query statement is: select avg(attr1) from tbl1 group by attr2. In Figs. 4 and 5, the left vertical axis indicates the system execution duration T corresponding to the current load allocation strategy, the right vertical axis (GPU Pipeline Number) indicates the number of pipelines N_GPU assigned to the GPU, and the horizontal axis (CPU Pipeline Number) indicates the number of pipelines N_CPU assigned to the CPU; the total number of pipelines on the CPU and GPU is fixed (N_GPU + N_CPU = 3 in Fig. 4 and N_GPU + N_CPU = 5 in Fig. 5), and Pipeline workload partitions denotes the allocation strategy of the pipeline load. When the number of pipelines assigned to the CPU is 0, all the load is executed on the GPU, i.e. the execution mode of a traditional CPU-GPU heterogeneous processing and analysis system. Fig. 4 shows that when the CPU load is 1 and the GPU load is 2 (one pipeline assigned to the CPU, two assigned to the GPU), the system execution duration T under the current load allocation strategy is 587 milliseconds, the shortest (T_min) among the system execution durations of all load allocation strategies for querying the 80-million-row data; that is, CPU load 1 and GPU load 2 is the optimal load allocation strategy for querying 80 million rows with the first query statement. Fig. 5 shows that when the CPU load is 2 and the GPU load is 3 (two pipelines assigned to the CPU, three assigned to the GPU), the system execution duration under the current load allocation strategy is 936 milliseconds, the shortest (T_min) among all load allocation strategies for the 140-million-row data; that is, CPU load 2 and GPU load 3 is the optimal load allocation strategy for querying 140 million rows with the first query statement. Under the first query statement, the system execution durations when the GPU carries the entire load are 881 milliseconds for 80 million rows and 1265 milliseconds for 140 million rows. Compared with the traditional execution mode, load balancing improves system performance by about 33% and 26% for the two data sizes, where 33% = (the GPU-only system execution duration for 80 million rows − the system execution duration under the load allocation strategy) / (the GPU-only system execution duration for 80 million rows) = (881 ms − 587 ms) / 881 ms, and 26% = (the GPU-only system execution duration for 140 million rows − the system execution duration under the load allocation strategy) / (the GPU-only system execution duration for 140 million rows) = (1265 ms − 936 ms) / 1265 ms.
In another practical application scenario of the CPU-GPU load-balancing method provided by this application, Figs. 6 and 7 show the results of executing a second query statement over data sets of 80 million rows and 140 million rows respectively, where the second query statement is: select count(*) from (select tbl1.attr1 from tbl1 join tbl2 on tbl1.attr1 = tbl2.attr1). Because the second query statement performs a join operation, its number of pipelines is larger than that of the first query statement. Fig. 6 shows that for 80 million rows, when the CPU load is 2 and the GPU load is 7, the system execution duration reaches its minimum T_min = 1361 milliseconds. Fig. 7 shows that for 140 million rows, when the CPU load is 7 and the GPU load is 18, the system execution duration reaches its minimum T_min = 3488 milliseconds and system performance is best. Compared with the system execution durations when the GPU carries the entire load under the second query statement, 1750 milliseconds for 80 million rows and 4845 milliseconds for 140 million rows, load balancing improves system performance by about 22% and 28%, where 22% = (the GPU-only system execution duration for 80 million rows − the system execution duration under the load allocation strategy) / (the GPU-only system execution duration for 80 million rows) = (1750 ms − 1361 ms) / 1750 ms, and 28% = (the GPU-only system execution duration for 140 million rows − the system execution duration under the load allocation strategy) / (the GPU-only system execution duration for 140 million rows) = (4845 ms − 3488 ms) / 4845 ms.
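The percentage improvements quoted in both scenarios follow one formula, (GPU-only duration − balanced duration) / GPU-only duration; a quick check against the reported figures:

```python
def improvement(gpu_only_ms, balanced_ms):
    """Fractional speedup of the balanced strategy over running all
    pipelines on the GPU alone."""
    return (gpu_only_ms - balanced_ms) / gpu_only_ms

# Figures reported for Figs. 4-7:
print(round(improvement(881, 587), 2))    # 0.33 -> ~33% (query 1, 80M rows)
print(round(improvement(1265, 936), 2))   # 0.26 -> ~26% (query 1, 140M rows)
print(round(improvement(1750, 1361), 2))  # 0.22 -> ~22% (query 2, 80M rows)
print(round(improvement(4845, 3488), 2))  # 0.28 -> ~28% (query 2, 140M rows)
```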
According to another aspect of the application, a non-volatile storage medium is also provided, storing computer-readable instructions which, when executed by a processor, cause the processor to implement the CPU-GPU-based load-balancing method described above.
According to another aspect of the application, a device is also provided, comprising:

one or more processors; and

a non-volatile storage medium for storing one or more computer-readable instructions,

wherein, when the one or more computer-readable instructions are executed by the one or more processors, the one or more processors implement the CPU-GPU-based load-balancing method described above.
Here, for the details of each embodiment of the device, reference may be made to the corresponding parts of the embodiments of the CPU-GPU-based load-balancing method executed on the device; they are not repeated here.
In summary, this application builds a pipeline query execution model on a CPU-GPU heterogeneous database system so that a CPU-GPU heterogeneous data analysis system can support query analysis under big-data scenarios; determines the total number of pipelines to be executed; starts the pipeline query execution model to distribute that total number of pipelines to the CPU and the GPU, and computes the system execution duration of every load allocation strategy from the determined execution duration of a single pipeline on the CPU and on the GPU respectively; and finally takes the load allocation strategy corresponding to the minimum among all system execution durations as the optimal CPU-GPU allocation strategy. This load-balancing strategy for a CPU-GPU heterogeneous data analysis system can reasonably distribute the pipeline load across the different processors and make full use of their computing resources, which not only improves system performance but also brings the system to its best overall performance.
It should be noted that the application may be implemented in software and/or a combination of software and hardware, for example using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In one embodiment, the software program of the application may be executed by a processor to implement the steps or functions described above. Likewise, the software program of the application (including related data structures) may be stored in a computer-readable recording medium, for example a RAM, a magnetic or optical drive, a floppy disk, or a similar device. In addition, some steps or functions of the application may be implemented in hardware, for example as circuits that cooperate with a processor to execute each step or function.
In addition, part of the application may be embodied as a computer program product, such as computer program instructions which, when executed by a computer, can invoke or provide the methods and/or technical solutions of the application through the operation of that computer. The program instructions invoking the methods of the application may be stored in a fixed or removable recording medium, and/or transmitted via broadcast or a data stream in another signal-bearing medium, and/or stored in the working memory of a computer device running according to the program instructions. Here, one embodiment of the application includes a device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to run the methods and/or technical solutions based on the foregoing embodiments of the application.
It is obvious to those skilled in the art that the application is not limited to the details of the above exemplary embodiments, and that the application can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as illustrative and not restrictive, and the scope of the application is defined by the appended claims rather than by the above description; it is intended that all changes falling within the meaning and range of equivalency of the claims be embraced within the application. No reference sign in a claim should be construed as limiting the claim concerned. Moreover, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or devices recited in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" denote names and do not indicate any particular order.

Claims (10)

1. A load-balancing method based on CPU-GPU, wherein the method comprises:
building a pipeline query execution model on a CPU-GPU heterogeneous database system;
determining the total number of pipelines to be executed;
starting the pipeline query execution model, distributing said total number of pipelines to the CPU and the GPU, and computing the system execution duration of every load allocation strategy according to the determined execution duration of a single pipeline on the CPU and on the GPU respectively; and
taking the load allocation strategy corresponding to the minimum among all system execution durations as the optimal load allocation strategy.
2. The method according to claim 1, wherein determining the total number of pipelines to be executed comprises:
obtaining a query statement, wherein the query statement includes the data to be queried;
dividing the data to be queried according to a preset data-fragment size to obtain the data fragments of the data to be queried and their total number; and
starting one pipeline for each data fragment of the data to be queried, so that the total number of pipelines to be executed is determined by the total number of data fragments.
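The slicing step in claim 2 can be sketched as follows; the row data and the fragment size are assumptions for illustration:

```python
def slice_data(rows, fragment_size):
    """Divide the data to be queried into fragments of a preset size;
    the fragment count determines the total number of pipelines to start."""
    fragments = [rows[i:i + fragment_size]
                 for i in range(0, len(rows), fragment_size)]
    return fragments, len(fragments)

rows = list(range(100))             # hypothetical 100-row table
fragments, total = slice_data(rows, 30)
print(total)                        # 4 fragments -> 4 pipelines
print([len(f) for f in fragments])  # [30, 30, 30, 10]
```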
3. The method according to claim 2, wherein starting the pipeline query execution model, distributing said total number of pipelines to the CPU and the GPU, and computing the system execution duration of every load allocation strategy according to the determined execution duration of a single pipeline on the CPU and on the GPU respectively, comprises:
step 1: starting the pipeline query execution model and setting the initial load allocation strategy: the number of pipelines assigned to the CPU is N_CPU = 0 and the number of pipelines assigned to the GPU is N_GPU = N, where N is the total number of pipelines and N is a positive integer greater than or equal to 1;
step 2: executing in parallel each pipeline assigned to the CPU and the GPU respectively, and obtaining the CPU execution duration and the GPU execution duration under the current load allocation strategy;
step 3: if the CPU execution duration and the GPU execution duration are equal, taking the CPU execution duration as the system execution duration of the current load allocation strategy; if they are unequal, taking the larger of the CPU execution duration and the GPU execution duration as the system execution duration of the current load allocation strategy;
step 4: updating the load allocation strategy: the number of pipelines assigned to the CPU becomes N_CPU = N_CPU + 1 and the number of pipelines assigned to the GPU becomes N_GPU = N_GPU − 1, where N_CPU + N_GPU = N; and
step 5: repeating steps 2 to 4 until the system execution durations of all load allocation strategies have been obtained.
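Steps 1 to 5 above enumerate every split of the N pipelines between the two processors. A sketch under the simplifying assumption that the per-stage durations are known constants rather than measured during execution (the stage values are hypothetical):

```python
def enumerate_strategies(n, cpu_stages, gpu_stages):
    """Steps 1-5 of claim 3: for every split (N_CPU, N_GPU) with
    N_CPU + N_GPU = n, the system execution duration is the larger of the
    two per-processor durations; returns the array T[] indexed by N_CPU."""
    def duration(stages, count):
        t_in, t_exe, t_out = stages
        if count == 0:
            return 0
        return t_in + t_exe + t_out + max(t_in, t_exe, t_out) * (count - 1)

    return [max(duration(cpu_stages, n_cpu), duration(gpu_stages, n - n_cpu))
            for n_cpu in range(n + 1)]

# Hypothetical (T_IN, T_EXE, T_OUT) stage durations in ms:
t = enumerate_strategies(3, cpu_stages=(4, 10, 2), gpu_stages=(3, 5, 2))
print(t)  # [20, 16, 26, 36]: N_CPU = 1, N_GPU = 2 gives the minimum
```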
4. The method according to claim 3, wherein, before computing the system execution duration of every load allocation strategy according to the determined execution duration of a single pipeline on the CPU and on the GPU respectively, the method further comprises:
determining the execution duration of a single pipeline on the CPU from a data fragment's data input duration T_IN_C, data execution duration T_EXE_C, and data output duration T_OUT_C on the CPU; and
determining the execution duration of a single pipeline on the GPU from a data fragment's data input duration T_IN_G, data execution duration T_EXE_G, and data output duration T_OUT_G on the GPU.
5. The method according to claim 4, wherein, in step 2, the formulas for obtaining the CPU execution duration and the GPU execution duration under the current load allocation strategy are respectively:

T_CPU = T_IN_C + T_EXE_C + T_OUT_C + Max{T_IN_C, T_EXE_C, T_OUT_C} × (N_CPU − 1),

T_GPU = T_IN_G + T_EXE_G + T_OUT_G + Max{T_IN_G, T_EXE_G, T_OUT_G} × (N_GPU − 1),

where T_CPU is the CPU execution duration under the current load allocation strategy and Max{T_IN_C, T_EXE_C, T_OUT_C} is the maximum of a data fragment's data input duration T_IN_C, data execution duration T_EXE_C, and data output duration T_OUT_C on the CPU; and
T_GPU is the GPU execution duration under the current load allocation strategy and Max{T_IN_G, T_EXE_G, T_OUT_G} is the maximum of a data fragment's data input duration T_IN_G, data execution duration T_EXE_G, and data output duration T_OUT_G on the GPU.
6. The method according to claim 5, wherein the query statement further includes a query condition, wherein each pipeline assigned to the CPU is a thread instance run on the CPU according to the query condition, and each pipeline assigned to the GPU is a kernel-function instance run on the GPU according to the query condition.
7. The method according to claim 6, wherein, after step 2 (executing in parallel each pipeline assigned to the CPU and the GPU respectively, and obtaining the CPU execution duration and the GPU execution duration under the current load allocation strategy), the method further comprises:
obtaining the execution result of each pipeline on the CPU and the GPU; and
obtaining the final query execution result of the query statement from the execution result of each pipeline.
8. The method according to claim 7, wherein a data fragment's data input duration T_IN_C on the CPU is the duration of copying the data fragment to the memory of the CPU;
a data fragment's data execution duration T_EXE_C on the CPU is the duration taken by the thread instance run on the CPU;
a data fragment's data output duration T_OUT_C on the CPU is the duration of copying the pipeline's execution result within the memory of the CPU;
a data fragment's data input duration T_IN_G on the GPU is the duration of copying the data fragment from the memory of the CPU to the video memory of the GPU;
a data fragment's data execution duration T_EXE_G on the GPU is the duration taken by the kernel-function instance run on the GPU; and
a data fragment's data output duration T_OUT_G on the GPU is the duration of copying the pipeline's execution result from the video memory of the GPU to the memory of the CPU.
9. A non-volatile storage medium storing computer-readable instructions which, when executed by a processor, cause the processor to implement the method according to any one of claims 1 to 8.
10. A device, comprising:
one or more processors; and
a non-volatile storage medium for storing one or more computer-readable instructions,
wherein, when the one or more computer-readable instructions are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1 to 8.
CN201811064037.5A 2018-09-12 2018-09-12 Load balancing method and device based on CPU-GPU Active CN109213601B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811064037.5A CN109213601B (en) 2018-09-12 2018-09-12 Load balancing method and device based on CPU-GPU


Publications (2)

Publication Number Publication Date
CN109213601A true CN109213601A (en) 2019-01-15
CN109213601B CN109213601B (en) 2021-01-01

Family

ID=64984143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811064037.5A Active CN109213601B (en) 2018-09-12 2018-09-12 Load balancing method and device based on CPU-GPU

Country Status (1)

Country Link
CN (1) CN109213601B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918141A (en) * 2019-03-15 2019-06-21 Oppo广东移动通信有限公司 Thread execution method, device, terminal and storage medium
CN110069527A (en) * 2019-04-22 2019-07-30 电子科技大学 A kind of GPU and CPU isomery accelerated method of data base-oriented
CN110096367A (en) * 2019-05-14 2019-08-06 宁夏融媒科技有限公司 A kind of panorama real-time video method for stream processing based on more GPU
CN110287212A (en) * 2019-06-27 2019-09-27 浪潮商用机器有限公司 A kind of data service handling method, system and associated component
CN110298437A (en) * 2019-06-28 2019-10-01 Oppo广东移动通信有限公司 Separation calculation method, apparatus, storage medium and the mobile terminal of neural network
CN110490300A (en) * 2019-07-26 2019-11-22 苏州浪潮智能科技有限公司 A kind of operation accelerated method, apparatus and system based on deep learning
CN111062855A (en) * 2019-11-18 2020-04-24 中国航空工业集团公司西安航空计算技术研究所 Graph pipeline performance analysis method
CN111240820A (en) * 2020-01-13 2020-06-05 星环信息科技(上海)有限公司 Concurrency quantity increasing speed multiplying determining method, equipment and medium
CN112181689A (en) * 2020-09-30 2021-01-05 华东师范大学 Runtime system for efficiently scheduling GPU kernel under cloud
CN112989082A (en) * 2021-05-20 2021-06-18 南京甄视智能科技有限公司 CPU and GPU mixed self-adaptive face searching method and system
WO2021129873A1 (en) * 2019-12-27 2021-07-01 中兴通讯股份有限公司 Database querying method, device, apparatus, and storage medium
CN115437795A (en) * 2022-11-07 2022-12-06 东南大学 Video memory recalculation optimization method and system for heterogeneous GPU cluster load perception
US11954527B2 (en) 2020-12-09 2024-04-09 Industrial Technology Research Institute Machine learning system and resource allocation method thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101091175A (en) * 2004-09-16 2007-12-19 辉达公司 Load balancing
CN101706741A (en) * 2009-12-11 2010-05-12 中国人民解放军国防科学技术大学 Method for partitioning dynamic tasks of CPU and GPU based on load balance
CN103329100A (en) * 2011-01-21 2013-09-25 英特尔公司 Load balancing in heterogeneous computing environments
US9311152B2 (en) * 2007-10-24 2016-04-12 Apple Inc. Methods and apparatuses for load balancing between multiple processing units


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHEN, WENFENG: "Research and Application of Load-Prediction Scheduling Algorithms in CPU-GPU Heterogeneous High-Performance Computing", China Doctoral Dissertations Full-text Database, Information Science and Technology Series *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918141B (en) * 2019-03-15 2020-11-27 Oppo广东移动通信有限公司 Thread execution method, thread execution device, terminal and storage medium
CN109918141A (en) * 2019-03-15 2019-06-21 Oppo广东移动通信有限公司 Thread execution method, device, terminal and storage medium
CN110069527A (en) * 2019-04-22 2019-07-30 电子科技大学 A kind of GPU and CPU isomery accelerated method of data base-oriented
CN110069527B (en) * 2019-04-22 2021-05-14 电子科技大学 Database-oriented GPU and CPU heterogeneous acceleration method
CN110096367A (en) * 2019-05-14 2019-08-06 宁夏融媒科技有限公司 A kind of panorama real-time video method for stream processing based on more GPU
CN110287212A (en) * 2019-06-27 2019-09-27 浪潮商用机器有限公司 A kind of data service handling method, system and associated component
CN110298437A (en) * 2019-06-28 2019-10-01 Oppo广东移动通信有限公司 Separation calculation method, apparatus, storage medium and the mobile terminal of neural network
CN110298437B (en) * 2019-06-28 2021-06-01 Oppo广东移动通信有限公司 Neural network segmentation calculation method and device, storage medium and mobile terminal
CN110490300A (en) * 2019-07-26 2019-11-22 苏州浪潮智能科技有限公司 A kind of operation accelerated method, apparatus and system based on deep learning
CN110490300B (en) * 2019-07-26 2022-03-15 苏州浪潮智能科技有限公司 Deep learning-based operation acceleration method, device and system
CN111062855A (en) * 2019-11-18 2020-04-24 中国航空工业集团公司西安航空计算技术研究所 Graph pipeline performance analysis method
CN111062855B (en) * 2019-11-18 2023-09-05 中国航空工业集团公司西安航空计算技术研究所 Graphic pipeline performance analysis method
WO2021129873A1 (en) * 2019-12-27 2021-07-01 中兴通讯股份有限公司 Database querying method, device, apparatus, and storage medium
CN111240820A (en) * 2020-01-13 2020-06-05 星环信息科技(上海)有限公司 Concurrency quantity increasing speed multiplying determining method, equipment and medium
CN111240820B (en) * 2020-01-13 2020-11-24 星环信息科技(上海)有限公司 Concurrency quantity increasing speed multiplying determining method, equipment and medium
CN112181689A (en) * 2020-09-30 2021-01-05 华东师范大学 Runtime system for efficiently scheduling GPU kernel under cloud
US11954527B2 (en) 2020-12-09 2024-04-09 Industrial Technology Research Institute Machine learning system and resource allocation method thereof
CN112989082B (en) * 2021-05-20 2021-07-23 南京甄视智能科技有限公司 CPU and GPU mixed self-adaptive face searching method and system
CN112989082A (en) * 2021-05-20 2021-06-18 南京甄视智能科技有限公司 CPU and GPU mixed self-adaptive face searching method and system
CN115437795A (en) * 2022-11-07 2022-12-06 东南大学 Video memory recalculation optimization method and system for heterogeneous GPU cluster load perception

Also Published As

Publication number Publication date
CN109213601B (en) 2021-01-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant