CN109213601B - Load balancing method and device based on CPU-GPU

Load balancing method and device based on CPU-GPU

Info

Publication number
CN109213601B
CN109213601B
Authority
CN
China
Prior art keywords
cpu
gpu
data
execution
duration
Prior art date
Legal status
Active
Application number
CN201811064037.5A
Other languages
Chinese (zh)
Other versions
CN109213601A (en
Inventor
翁楚良
孙婷婷
黄皓
王嘉伦
Current Assignee
East China Normal University
Original Assignee
East China Normal University
Priority date
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN201811064037.5A priority Critical patent/CN109213601B/en
Publication of CN109213601A publication Critical patent/CN109213601A/en
Application granted granted Critical
Publication of CN109213601B publication Critical patent/CN109213601B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Abstract

The method constructs a pipeline query execution model on a CPU-GPU heterogeneous database system so that the CPU-GPU heterogeneous data analysis system can support query analysis in big-data scenarios; determines the total number of pipelines to be executed; starts the pipeline query execution model to distribute that number of pipelines between the CPU and the GPU, and calculates the system execution duration corresponding to every load distribution strategy from the determined execution duration of a single pipeline on the CPU and on the GPU; and finally determines the load distribution strategy corresponding to the minimum system execution duration as the optimal CPU-GPU distribution strategy. This load balancing strategy for the CPU-GPU heterogeneous data analysis system distributes the pipeline load sensibly across the different processors and makes full use of their computing resources, which not only improves system performance but also brings the system to its best overall performance.

Description

Load balancing method and device based on CPU-GPU
Technical Field
The present application relates to the field of computers, and in particular, to a load balancing method and device based on a CPU-GPU.
Background
General-purpose Graphics Processing Units (GPUs) are widely used in fields such as matrix computation and machine learning. In recent years, the rapidly growing demand of data-intensive applications has driven the development of GPU-based heterogeneous online analytical processing platforms. Because the GPU contains many computing units that can run a large number of threads simultaneously, a data analysis system that uses the GPU as its main processor outperforms a conventional CPU-based analysis system in most cases and shortens execution time by several orders of magnitude.
In a conventional relational query analysis system, when a client sends a query request, the system creates an analysis job, parses the request and converts it into a logical query plan, and a query plan optimizer selects an optimal physical query plan according to a given criterion (for example, lowest cost) and executes it. A physical query plan is a Directed Acyclic Graph (DAG) containing a plurality of operators that are executed in sequence.
In current CPU-GPU heterogeneous analysis systems, the GPU is the main processor for query execution and the execution of operators is mostly placed on the GPU, while the CPU is mainly responsible for data distribution and result collection; when a subsequent operation needs to use an intermediate result output by the previous step, the CPU also has to perform some processing on that intermediate result.
The analysis workload handled by data processing and analysis systems now faces big-data scenarios: data volumes grow exponentially and the workload keeps getting heavier. However, the GPU can only directly process data held in its own storage medium, and its video memory capacity is limited, so the GPU cannot process a large data set in a single load. When the input data or the intermediate results are too large to fit into the GPU's global memory, the analysis becomes inefficient and the task may even fail. The prior art circumvents this problem by limiting the size of the table being queried, or by shifting the computation to the CPU as a fallback strategy, but neither is an optimal solution.
In summary, on current CPU-GPU heterogeneous platforms it is effective to use the GPU to accelerate query analysis in a data analysis system, but the following problems remain: the GPU's video memory capacity is limited, so it cannot process a large data set in a single load; and task allocation between the CPU and the GPU is unbalanced, so heterogeneous processor resources are not fully utilized.
Disclosure of Invention
An object of the present application is to provide a CPU-GPU based load balancing method and device, so as to solve the prior-art problems that the GPU, with its limited video memory capacity, cannot process a large data set in a single load, and that unbalanced task allocation between the CPU and the GPU leaves heterogeneous processor resources underutilized.
According to one aspect of the application, a load balancing method based on a CPU-GPU is provided, and the method comprises the following steps:
constructing a pipeline query execution model on a CPU-GPU heterogeneous database system;
determining a total number of pipelines to be executed;
starting the pipeline query execution model, distributing the pipelines corresponding to the total number to the CPU and the GPU, and calculating system execution time lengths corresponding to all load distribution strategies according to the determined execution time lengths of the single pipeline on the CPU and the GPU respectively;
and determining the load distribution strategy corresponding to the minimum value in all the system execution time lengths as an optimal CPU-GPU distribution strategy.
Further, in the above method, the determining the total number of pipelines to be executed includes:
acquiring a query statement, wherein the query statement comprises data to be queried;
dividing the data to be queried according to a preset data fragment size to obtain the data fragments of the data to be queried and the total number of the data fragments;
and respectively starting a corresponding pipeline for each data fragment in the data to be queried, wherein the total number of the pipelines to be executed is determined by the total number of the data fragments.
Further, in the above method, the step of starting the pipeline query execution model to allocate the pipelines corresponding to the total number to the CPU and the GPU, and calculating system execution durations corresponding to all load allocation policies according to the determined execution durations of the single pipeline on the CPU and the GPU, respectively, includes:
step one, starting the pipeline query execution model and setting an initial load distribution strategy: the number of pipelines allocated to the CPU is NCPU = 0 and the number of pipelines allocated to the GPU is NGPU = N, where N is the total number of the pipelines and is a positive integer greater than or equal to 1;
step two, executing each pipeline respectively distributed on the CPU and the GPU in parallel to obtain the CPU execution duration and the GPU execution duration corresponding to the current load distribution strategy;
step three, if the CPU execution duration is equal to the GPU execution duration, determining the CPU execution duration as a system execution duration corresponding to the current load distribution strategy; if the CPU execution duration and the GPU execution duration are not equal, determining the larger value of the CPU execution duration and the GPU execution duration as the system execution duration corresponding to the current load distribution strategy;
step four, updating the load distribution strategy: the number of pipelines allocated to the CPU becomes NCPU = NCPU + 1 and the number of pipelines allocated to the GPU becomes NGPU = NGPU - 1, where NCPU + NGPU = N;
step five, repeating step two to step four until the system execution durations corresponding to all the load distribution strategies are obtained.
Further, in the foregoing method, before calculating the system execution durations corresponding to all the load distribution policies according to the determined execution durations of the single pipeline on the CPU and the GPU, respectively, the method further includes:
determining the execution duration of a single said pipeline on the CPU according to the data input duration TIN_C, the data execution duration TEXE_C and the data output duration TOUT_C of the data fragment on the CPU;
determining the execution duration of a single said pipeline on the GPU according to the data input duration TIN_G, the data execution duration TEXE_G and the data output duration TOUT_G of the data fragment on the GPU.
Further, in the above method, the formulas for obtaining the CPU execution duration and the GPU execution duration corresponding to the current load distribution policy in the second step are respectively:
TCPU = TIN_C + TEXE_C + TOUT_C + Max{TIN_C, TEXE_C, TOUT_C} × (NCPU - 1),
TGPU = TIN_G + TEXE_G + TOUT_G + Max{TIN_G, TEXE_G, TOUT_G} × (NGPU - 1),
wherein TCPU is the CPU execution duration corresponding to the current load distribution strategy, and Max{TIN_C, TEXE_C, TOUT_C} is the maximum of the data input duration TIN_C, the data execution duration TEXE_C and the data output duration TOUT_C of the data fragment on the CPU;
TGPU is the GPU execution duration corresponding to the current load distribution strategy, and Max{TIN_G, TEXE_G, TOUT_G} is the maximum of the data input duration TIN_G, the data execution duration TEXE_G and the data output duration TOUT_G of the data fragment on the GPU.
Further, in the above method, the query statement further includes a query condition, where the pipeline allocated to the CPU is a thread instance running on the CPU according to the query condition; and the pipeline distributed to the GPU is a kernel function instance running on the GPU according to the query condition.
Further, in the above method, after the step two of executing in parallel each pipeline respectively allocated to the CPU and the GPU to obtain the CPU execution duration and the GPU execution duration corresponding to the current load allocation policy, the method further includes:
obtaining the execution result corresponding to each pipeline on the CPU and the GPU;
and obtaining a final query execution result corresponding to the query statement according to the execution result corresponding to each pipeline.
Further, in the above method, the data input duration TIN_C of the data fragment on the CPU is the duration used when the data fragment is copied within the memory of the CPU;
the data execution duration TEXE_C of the data fragment on the CPU is the duration used by the thread instance running on the CPU;
the data output duration TOUT_C of the data fragment on the CPU is the duration used for copying the execution result corresponding to the pipeline within the memory of the CPU;
the data input duration TIN_G of the data fragment on the GPU is the duration used for copying the data fragment from the memory of the CPU to the video memory of the GPU;
the data execution duration TEXE_G of the data fragment on the GPU is the duration used by the kernel function instance running on the GPU;
the data output duration TOUT_G of the data fragment on the GPU is the duration used for copying the execution result corresponding to the pipeline from the video memory of the GPU to the memory of the CPU.
According to another aspect of the present application, there is also provided a non-volatile storage medium having stored thereon computer-readable instructions, which, when executed by a processor, cause the processor to implement the CPU-GPU based load balancing method as described above.
According to another aspect of the present application, there is also provided an apparatus, wherein it comprises:
one or more processors;
a non-volatile storage medium for storing one or more computer-readable instructions,
when executed by the one or more processors, cause the one or more processors to implement a CPU-GPU based load balancing method as described above.
Compared with the prior art, the present application constructs a pipeline query execution model on a CPU-GPU heterogeneous database system so that the CPU-GPU heterogeneous data analysis system can support query analysis in big-data scenarios; determines the total number of pipelines to be executed; starts the pipeline query execution model to distribute that number of pipelines between the CPU and the GPU, and calculates the system execution duration corresponding to every load distribution strategy from the determined execution duration of a single pipeline on the CPU and on the GPU; and finally determines the load distribution strategy corresponding to the minimum system execution duration as the optimal CPU-GPU distribution strategy. The load balancing strategy based on the CPU-GPU heterogeneous data analysis system distributes the pipeline load sensibly across the different processors and makes full use of their computing resources, which not only improves system performance but also brings the system to its best overall performance.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 illustrates a flow diagram of a CPU-GPU based load balancing methodology in accordance with an aspect of the subject application;
FIG. 2 is a schematic diagram illustrating a GPU execution duration and a CPU execution duration respectively corresponding to a load distribution policy in a CPU-GPU based load balancing method according to an aspect of the present application;
FIG. 3 is a schematic diagram illustrating the computation of the execution duration of a query corresponding to a load distribution policy in a CPU-GPU based load balancing method according to an aspect of the present disclosure;
FIG. 4 is a schematic diagram illustrating a CPU-GPU load distribution strategy for 80 million rows of data under a pipeline query execution model using a first query statement in accordance with an aspect of the subject application;
FIG. 5 illustrates a schematic diagram of a CPU-GPU load distribution strategy for 140 million rows of data under a pipeline query execution model using a first query statement, according to an aspect of the subject application;
FIG. 6 is a schematic diagram illustrating a CPU-GPU load distribution strategy for 80 million rows of data under a pipeline query execution model using a second query statement in accordance with an aspect of the subject application;
FIG. 7 is a schematic diagram illustrating a CPU-GPU load distribution strategy for 140 million rows of data under a pipeline query execution model using a second query statement in accordance with an aspect of the subject application;
the same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
As shown in FIG. 1, the CPU-GPU based load balancing method in the embodiment of the present application is applicable to the data analysis model of a relational database; it can fully exploit the characteristics of the heterogeneous processors, allocate work tasks reasonably and efficiently, and realize query analysis in big-data scenarios. The method includes step S11, step S12, step S13 and step S14, and specifically:
the step S11, constructing a pipeline query execution model on the CPU-GPU heterogeneous database system;
said step S12, determining the total number of pipelines to be executed; for example, the total number of pipelines to be executed is N;
the step S13, starting the pipeline query execution model to allocate the pipelines corresponding to the total number N to the CPU and the GPU, and calculating system execution durations corresponding to all load allocation policies according to the determined execution durations of the single pipeline on the CPU and the GPU, respectively;
in step S14, the load distribution policy corresponding to the minimum value in all the system execution durations is determined as the optimal CPU-GPU distribution policy.
Through the steps S11 to S14, query analysis in a large data volume scene is realized, task allocation between the CPU and the GPU can be balanced, processor resources are fully utilized, and system performance is improved.
In the course of executing the load distribution strategy of the CPU-GPU heterogeneous database system, the pipeline query execution model and the load distribution strategy were built on the open-source MapD Core (version 3.3.1) CPU-GPU heterogeneous database and executed on a machine equipped with one NVIDIA Tesla K80 GPU, whose graphics card contains about 22 GB of global memory. The corresponding server is equipped with two ten-core Xeon E5-2630 v4 CPUs and 224 GB of RAM; the modified system runs on a CentOS 7 Linux distribution with the Linux 3.10.0 kernel.
In this embodiment, the step S11 is to construct a pipeline query execution model on the CPU-GPU heterogeneous database system, and includes the following steps:
acquiring a query statement, wherein the query statement comprises data to be queried and one or more query conditions, and the data to be queried can be but is not limited to one or more database relational tables;
dividing the data to be queried into N data fragments according to a preset data division size, wherein when the data to be queried is, preferably, a database relational table, the data fragments are sub-tables of the database relational table and the preset data division size is a preset number of tuples of the database relational table;
starting a corresponding pipeline for each of the N data fragments (sub-query objects), so that the total number of started pipelines is N; each pipeline may be allocated either to the CPU or to the GPU. The operation of querying the corresponding data fragment is executed on each pipeline allocated to the CPU and to the GPU, and after a pipeline completes execution it yields an execution result r_i, where i is the serial number of the pipeline (or the fragment number of the data fragment) and i = 1, 2, ..., N-1, N. The execution results r_i of all pipelines are combined into the final query result R corresponding to the query statement, and the system execution duration under the current CPU-GPU load distribution strategy is recorded when the query finishes, thereby recording the system execution duration of the current CPU-GPU load distribution strategy and collecting the final query execution result.
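As an illustration only (not part of the claimed embodiment), the following Python sketch shows this flow under simplifying assumptions; run_on_cpu and run_on_gpu are hypothetical stand-ins for the thread instance and the kernel function instance that actually process a data fragment:

    import time
    from concurrent.futures import ThreadPoolExecutor

    def execute_query(fragments, n_cpu, run_on_cpu, run_on_gpu):
        # One pipeline per data fragment: the first n_cpu fragments run on the
        # CPU and the remaining fragments on the GPU, with the two processors
        # working at the same time.
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=2) as pool:
            cpu_part = pool.submit(lambda: [run_on_cpu(f) for f in fragments[:n_cpu]])
            gpu_part = pool.submit(lambda: [run_on_gpu(f) for f in fragments[n_cpu:]])
            results = cpu_part.result() + gpu_part.result()  # R = {r_1, ..., r_N}
        elapsed = time.perf_counter() - start                # system execution duration
        return results, elapsed

The duration recorded in this way for each load distribution strategy is what the search over strategies described below compares.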
In this embodiment, the step S12 of determining the total number of pipelines to be executed includes:
acquiring a query statement, wherein the query statement comprises data to be queried; if the query statement further comprises a query condition, the pipeline distributed to the CPU is a thread instance running on the CPU according to the query condition; and the pipeline distributed to the GPU is a kernel function instance running on the GPU according to the query condition.
dividing the data to be queried according to the preset data fragment size to obtain the data fragments of the data to be queried and the total number of the data fragments; for example, the total number n of data fragments of the data to be queried = (size of the data to be queried) / (preset data fragment size);
respectively starting a corresponding pipeline for each data fragment of the data to be queried, where the total number of pipelines to be executed is determined by the total number of data fragments; for example, if a corresponding pipeline is started for each data fragment, then N data fragments start N pipelines, and the total number N of pipelines to be executed by the system is determined by the total number of data fragments.
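A minimal sketch of this computation (illustrative names only; rounding up simply ensures that a trailing partial fragment also gets its own pipeline):

    import math

    def pipeline_total(rows_to_query: int, rows_per_fragment: int) -> int:
        # Total number of pipelines N = total number of data fragments n,
        # i.e. the size of the data to be queried divided by the preset
        # data fragment size.
        return math.ceil(rows_to_query / rows_per_fragment)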
In this embodiment, the step S13 starts the pipeline query execution model to allocate the pipelines corresponding to the total number to the CPU and the GPU, and calculates system execution durations corresponding to all load allocation policies according to the determined execution durations of the single pipeline on the CPU and the GPU, respectively, including:
step one, starting the pipeline query execution model and setting an initial load distribution strategy: the number of pipelines allocated to the CPU is NCPU = 0 and the number of pipelines allocated to the GPU is NGPU = N, where N is the total number of the pipelines and is a positive integer greater than or equal to 1;
step two, executing in parallel each pipeline respectively allocated to the CPU and to the GPU, so as to obtain the CPU execution duration TCPU and the GPU execution duration TGPU corresponding to the current load distribution strategy;
step three, if the CPU execution duration TCPU equals the GPU execution duration TGPU, determining the CPU execution duration as the system execution duration corresponding to the current load distribution strategy, i.e. T = TCPU or T = TGPU; if the CPU execution duration TCPU and the GPU execution duration TGPU are not equal, determining the larger of the two as the system execution duration corresponding to the current load distribution strategy, i.e. T = Max{TCPU, TGPU}. Because the CPU and the GPU start executing at the same time, the system achieves the best performance only when the difference between the execution durations of the two processors is smallest; in the ideal case, when that difference is 0, the system can be considered to have reached the best load balancing state, that is, the corresponding distribution of pipelines between the CPU and the GPU is optimal.
step four, updating the load distribution strategy: the number of pipelines allocated to the CPU becomes NCPU = NCPU + 1 and the number of pipelines allocated to the GPU becomes NGPU = NGPU - 1, where NCPU + NGPU = N;
step five, repeating step two to step four until the system execution durations corresponding to all the load distribution strategies are obtained, that is, until all the load distribution strategies have been executed.
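Steps one to five amount to an exhaustive sweep over every (NCPU, NGPU) split of the N pipelines. A compact Python sketch of that sweep is given below; measure_processor is a hypothetical callback that runs the given number of pipelines on one processor and returns its execution duration, standing in for the parallel execution of step two:

    def find_optimal_split(total_pipelines, measure_processor):
        durations = {}                       # T[] in the text: one entry per strategy
        n_cpu, n_gpu = 0, total_pipelines    # step one: initially everything on the GPU
        while n_gpu >= 0:
            # step two: obtain both processors' durations (in the real system the
            # CPU-side and GPU-side pipelines run concurrently)
            t_cpu = measure_processor('cpu', n_cpu)
            t_gpu = measure_processor('gpu', n_gpu)
            # step three: the system duration is the slower of the two processors
            durations[(n_cpu, n_gpu)] = max(t_cpu, t_gpu)
            # step four: move one pipeline from the GPU to the CPU
            n_cpu, n_gpu = n_cpu + 1, n_gpu - 1
        # step five finished: pick the strategy with the smallest system duration,
        # which is what FindMin(T[]) does in step S14
        return min(durations, key=durations.get)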
In this embodiment, before the step S13 of calculating the system execution durations corresponding to all the load distribution policies according to the determined execution durations of the single pipeline on the CPU and the GPU, respectively, the method further includes:
determining the execution duration of a single pipeline on the CPU according to the data input duration TIN_C, the data execution duration TEXE_C and the data output duration TOUT_C of the data fragment on the CPU; that is, the execution duration of a single pipeline on the CPU consists of three phases, the data input duration TIN_C, the data execution duration TEXE_C and the data output duration TOUT_C of the data fragment on the CPU, where the data input duration TIN_C is the duration used when the data fragment is copied within the memory of the CPU, the data execution duration TEXE_C is the duration used by the thread instance running on the CPU, and the data output duration TOUT_C is the duration used for copying the execution result corresponding to the pipeline within the memory of the CPU;
determining the execution duration of a single pipeline on the GPU according to the data input duration TIN_G, the data execution duration TEXE_G and the data output duration TOUT_G of the data fragment on the GPU; that is, the execution duration of a single pipeline on the GPU consists of three phases, the data input duration TIN_G, the data execution duration TEXE_G and the data output duration TOUT_G of the data fragment on the GPU, where the data input duration TIN_G is the duration used for copying the data fragment from the memory of the CPU to the video memory of the GPU, the data execution duration TEXE_G is the duration used by the kernel function instance running on the GPU, and the data output duration TOUT_G is the duration used for copying the execution result corresponding to the pipeline from the video memory of the GPU to the memory of the CPU.
Because the pipelines allocated to the CPU or to the GPU overlap one another when executed in parallel, and two adjacent pipelines differ only by the execution duration of one phase, the formulas for the CPU execution duration and the GPU execution duration corresponding to the current load distribution strategy in step two are respectively:
the CPU execution duration TCPU corresponding to the current load distribution strategy is:
TCPU = TIN_C + TEXE_C + TOUT_C + Max{TIN_C, TEXE_C, TOUT_C} × (NCPU - 1),
Because the execution durations of the three phases of a pipeline differ from one another, whether on the CPU or on the GPU, the phase with the longest execution duration among the three is taken as the execution time difference between two adjacent pipelines, so that the overlap between adjacent pipelines is recorded accurately. For example, on the CPU the maximum Max{TIN_C, TEXE_C, TOUT_C} of the data input duration TIN_C, the data execution duration TEXE_C and the data output duration TOUT_C of the data fragment gives the execution time difference between two adjacent pipelines; multiplying this maximum phase duration by the number of remaining pipelines yields the total duration of the overlapping portion. FIG. 2 is a schematic diagram of the calculation of the GPU execution duration and the CPU execution duration under the current load distribution strategy, where the left side of FIG. 2 shows the pipeline execution process on the GPU and the right side of FIG. 2 shows the pipeline execution process on the CPU.
As shown in FIG. 3, the GPU execution duration TGPU corresponding to the current load distribution strategy is:
TGPU = TIN_G + TEXE_G + TOUT_G + Max{TIN_G, TEXE_G, TOUT_G} × (NGPU - 1),
where Max{TIN_G, TEXE_G, TOUT_G} is the maximum of the data input duration TIN_G, the data execution duration TEXE_G and the data output duration TOUT_G of the data fragment on the GPU;
here, FIG. 3 is a schematic diagram of the execution duration corresponding to a query after the pipelines are divided between the two processors, CPU and GPU, under a load distribution strategy; the point Origin End indicates the execution end time when all of the load is placed on the GPU, and End indicates the execution end time after the load is balanced across both the CPU and the GPU.
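The two formulas above translate directly into code. The sketch below is illustrative only (the per-fragment phase durations are assumed to be supplied by the caller, for example from profiling one fragment on each processor) and mirrors the overlap rule that each additional pipeline adds only its longest phase:

    def processor_duration(t_in, t_exe, t_out, n_pipelines):
        # The first pipeline pays for all three phases; every further pipeline
        # adds only the longest phase, because adjacent pipelines overlap.
        if n_pipelines == 0:
            return 0.0
        return t_in + t_exe + t_out + max(t_in, t_exe, t_out) * (n_pipelines - 1)

    def system_duration(cpu_phases, gpu_phases, n_cpu, n_gpu):
        # The CPU and the GPU start at the same time, so the system finishes
        # with the slower of the two processors (step three above).
        return max(processor_duration(*cpu_phases, n_cpu),
                   processor_duration(*gpu_phases, n_gpu))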
In this embodiment, after the step two of executing, in parallel, each pipeline respectively allocated to the CPU and the GPU to obtain the CPU execution duration and the GPU execution duration corresponding to the current load allocation policy, the method further includes:
obtaining the execution result corresponding to each pipeline on the CPU and on the GPU; for example, the execution result obtained after the i-th pipeline completes execution is r_i, where i is the serial number of the pipeline (or the fragment number of the data fragment) and i = 1, 2, ..., N-1, N;
obtaining the final query execution result R corresponding to the query statement according to the execution result corresponding to each pipeline, for example R = {r_1, r_2, ..., r_i, ..., r_(N-1), r_N}.
Following the above embodiment of the present application, in step S14 the load distribution strategy corresponding to the minimum value among the system execution durations of all the load distribution strategies is taken as the optimal CPU-GPU load distribution strategy, i.e. {OptNGPU, OptNCPU} = FindMin(T[]), where OptNGPU is the load of the GPU (the number of pipelines allocated to the GPU) under the optimal load distribution strategy, OptNCPU is the load of the CPU (the number of pipelines allocated to the CPU) under the optimal load distribution strategy, T[] is the array storing the system execution durations under all the load distribution strategies, and FindMin finds the minimum value among the system execution durations under all the load distribution strategies.
In a practical application scenario of the CPU-GPU based load balancing method provided in the present application, as shown in FIG. 4 and FIG. 5, a first query statement is used to query 80 million rows and 140 million rows of data respectively, the first query statement being: select avg(attr1) from tbl1 group by attr2. In FIG. 4 and FIG. 5 the left vertical axis represents the system execution duration T corresponding to the current load distribution strategy, the right vertical axis (GPU Pipeline Number) represents the number NGPU of pipelines allocated to the GPU, and the horizontal axis (CPU Pipeline Number) represents the number NCPU of pipelines allocated to the CPU; the total number of pipelines on the CPU and the GPU remains unchanged (i.e. NGPU + NCPU = 3 in FIG. 4 and NGPU + NCPU = 5 in FIG. 5), and Pipeline Workload Partition denotes the allocation strategy for the pipeline load. When the number of pipelines allocated to the CPU is 0, all of the load is executed on the GPU, which is the execution mode of a conventional CPU-GPU heterogeneous processing and analysis system. As can be seen from FIG. 4, when the CPU load equals 1 and the GPU load equals 2 (i.e. the number of pipelines allocated to the CPU is 1 and the number allocated to the GPU is 2), the system execution duration T corresponding to the current load distribution strategy is 587 milliseconds, which is the shortest (T = Tmin) among the system execution durations of all load distribution strategies for the 80-million-row query data, so a CPU load of 1 and a GPU load of 2 is the optimal load distribution strategy for querying 80 million rows of data with the first query statement. As can be seen from FIG. 5, when the CPU load is 2 and the GPU load is 3 (i.e. the number of pipelines allocated to the CPU is 2 and the number allocated to the GPU is 3), the system execution duration corresponding to the current load distribution strategy is 936 milliseconds, which is the shortest (T = Tmin) among the system execution durations of all load distribution strategies executing 140 million rows of data, so a CPU load of 2 and a GPU load of 3 is the optimal load distribution strategy for querying 140 million rows of data with the first query statement. Under the first query statement, the system execution durations for querying 80 million rows and 140 million rows of data with the GPU bearing the entire load are 881 milliseconds and 1265 milliseconds respectively. Compared with this conventional execution mode, the system performance after using the load distribution strategy is improved by about 33% and 26% for the two data volumes respectively, where 33% = (system execution duration with the GPU bearing the entire load for 80 million rows - system execution duration corresponding to the optimal load distribution strategy) / (system execution duration with the GPU bearing the entire load for 80 million rows) = (881 milliseconds - 587 milliseconds) / 881 milliseconds, and 26% = (system execution duration with the GPU bearing the entire load for 140 million rows - system execution duration corresponding to the optimal load distribution strategy) / (system execution duration with the GPU bearing the entire load for 140 million rows) = (1265 milliseconds - 936 milliseconds) / 1265 milliseconds.
In another practical application scenario of the CPU-GPU based load balancing method provided in the present application, as shown in FIG. 6 and FIG. 7, a second query statement is used to query 80 million rows and 140 million rows of data respectively, the second query statement being: select count(*) from (select tbl1.attr1 from tbl1 join tbl2 on tbl1.attr1 = tbl2.attr1); because the second query statement contains a join operation, the number of pipelines is greater than for the first query statement. As can be seen from FIG. 6, with 80 million rows of data the system execution duration reaches its minimum Tmin = 1361 milliseconds when the CPU load is 2 and the GPU load is 7. As can be seen from FIG. 7, with 140 million rows of data the system execution duration reaches its minimum Tmin = 3488 milliseconds when the CPU load is 7 and the GPU load is 18, and the system performance is then optimal. Compared with the 1750 milliseconds and 4845 milliseconds needed to query 80 million rows and 140 million rows of data with the GPU bearing the entire load under the second query statement, the system performance after using the load distribution strategy is improved by about 22% and 28% respectively, where 22% = (system execution duration with the GPU bearing the entire load for 80 million rows - system execution duration corresponding to the optimal load distribution strategy) / (system execution duration with the GPU bearing the entire load for 80 million rows) = (1750 milliseconds - 1361 milliseconds) / 1750 milliseconds, and 28% = (system execution duration with the GPU bearing the entire load for 140 million rows - system execution duration corresponding to the optimal load distribution strategy) / (system execution duration with the GPU bearing the entire load for 140 million rows) = (4845 milliseconds - 3488 milliseconds) / 4845 milliseconds.
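For reference, the improvement percentages reported in both scenarios follow directly from the measured durations; a quick check in Python (numbers taken from the text above):

    def improvement(gpu_only_ms, balanced_ms):
        # Relative gain of the optimal load distribution strategy over running
        # the entire load on the GPU.
        return (gpu_only_ms - balanced_ms) / gpu_only_ms

    print(round(improvement(881, 587), 2))    # ~0.33, first query, 80 million rows
    print(round(improvement(1265, 936), 2))   # ~0.26, first query, 140 million rows
    print(round(improvement(1750, 1361), 2))  # ~0.22, second query, 80 million rows
    print(round(improvement(4845, 3488), 2))  # ~0.28, second query, 140 million rows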
According to another aspect of the present application, there is also provided a non-volatile storage medium having stored thereon computer-readable instructions, which, when executed by a processor, cause the processor to implement the CPU-GPU based load balancing method as described above.
According to another aspect of the present application, there is also provided an apparatus, wherein it comprises:
one or more processors;
a non-volatile storage medium for storing one or more computer-readable instructions,
when executed by the one or more processors, cause the one or more processors to implement the CPU-GPU based load balancing method as described above.
Here, the detailed content of each embodiment of the device may specifically refer to a corresponding part of a method embodiment of a load balancing method based on a CPU-GPU, which is executed in the device, and is not described herein again.
In summary, the present application constructs a pipeline query execution model on a CPU-GPU heterogeneous database system so that the CPU-GPU heterogeneous data analysis system can support query analysis in big-data scenarios; determines the total number of pipelines to be executed; starts the pipeline query execution model to distribute that number of pipelines between the CPU and the GPU, and calculates the system execution duration corresponding to every load distribution strategy from the determined execution duration of a single pipeline on the CPU and on the GPU; and finally determines the load distribution strategy corresponding to the minimum system execution duration as the optimal CPU-GPU distribution strategy. The load balancing strategy based on the CPU-GPU heterogeneous data analysis system distributes the pipeline load sensibly across the different processors and makes full use of their computing resources, which not only improves system performance but also brings the system to its best overall performance.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (10)

1. A load balancing method based on a CPU-GPU (Central Processing Unit-Graphics Processing Unit), wherein the method comprises the following steps:
constructing a model for query analysis of a pipeline on a CPU-GPU heterogeneous database system;
determining a total number of pipelines to be executed;
starting the model for query analysis of a pipeline, distributing the pipelines corresponding to the total number to the CPU and the GPU, and calculating system execution durations corresponding to all load distribution strategies according to the determined execution durations of a single pipeline on the CPU and the GPU respectively;
and determining the load distribution strategy corresponding to the minimum value in all the system execution time lengths as the optimal load distribution strategy.
2. The method of claim 1, wherein the determining a total number of pipelines to execute comprises:
acquiring a query statement, wherein the query statement comprises data to be queried;
dividing the data to be queried according to the size of a preset data fragment to obtain the data fragment of the data to be queried and the total number of the data fragment;
and respectively starting a corresponding pipeline for each data fragment in the data to be queried, wherein the total number of the pipelines to be executed is determined by the total number of the data fragments.
3. The method of claim 2, wherein the initiating the model for performing query analysis on pipelines allocates the pipelines corresponding to the total number to the CPU and the GPU, and calculates system execution durations corresponding to all load allocation policies according to the determined execution durations of the single pipeline on the CPU and the GPU, respectively, comprises:
step one, starting a model for carrying out query analysis on a pipeline, and setting an initial load distribution strategy: the number of pipelines allocated to the CPU is NCPU = 0 and the number of pipelines allocated to the GPU is NGPU = N, wherein N is the total number of the pipelines and is a positive integer greater than or equal to 1;
step two, executing each pipeline respectively distributed on the CPU and the GPU in parallel to obtain the CPU execution duration and the GPU execution duration corresponding to the current load distribution strategy;
step three, if the CPU execution duration is equal to the GPU execution duration, determining the CPU execution duration as a system execution duration corresponding to the current load distribution strategy; if the CPU execution duration and the GPU execution duration are not equal, determining the larger value of the CPU execution duration and the GPU execution duration as the system execution duration corresponding to the current load distribution strategy;
step four, updating the load distribution strategy: the number of pipelines allocated to the CPU becomes NCPU = NCPU + 1, and the number of pipelines allocated to the GPU becomes NGPU = NGPU - 1, where NCPU + NGPU = N;
and step five, repeating the step two to the step four until system execution time lengths corresponding to all the load distribution strategies are obtained.
4. The method of claim 3, wherein before calculating the system execution durations corresponding to all the load distribution policies according to the determined execution durations of the single pipeline on the CPU and the GPU, respectively, further comprises:
determining the execution duration of a single pipeline on the CPU according to the data input duration TIN _ C, the data execution duration TEXE _ C and the data output duration TOUT _ C of the data fragments on the CPU;
and determining the execution duration of a single pipeline on the GPU according to the data input duration TIN _ G, the data execution duration TEXE _ G and the data output duration TOUT _ G of the data fragments on the GPU.
5. The method according to claim 4, wherein the formulas for obtaining the CPU execution duration and the GPU execution duration corresponding to the current load distribution policy in the second step are respectively as follows:
TCPU = TIN_C + TEXE_C + TOUT_C + Max{TIN_C, TEXE_C, TOUT_C} × (NCPU - 1),
TGPU = TIN_G + TEXE_G + TOUT_G + Max{TIN_G, TEXE_G, TOUT_G} × (NGPU - 1),
wherein TCPU is the CPU execution time corresponding to the current load distribution strategy, and Max{TIN_C, TEXE_C, TOUT_C} is the maximum value of the data input time TIN_C, the data execution time TEXE_C and the data output time TOUT_C of the data fragment on the CPU;
TGPU is the GPU execution time corresponding to the current load distribution strategy, and Max{TIN_G, TEXE_G, TOUT_G} is the maximum value of the data input time TIN_G, the data execution time TEXE_G and the data output time TOUT_G of the data fragment on the GPU.
6. The method of claim 5, wherein the query statement further comprises a query condition, wherein the pipeline allocated to the CPU is a thread instance running on the CPU according to the query condition; and the pipeline distributed to the GPU is a kernel function instance running on the GPU according to the query condition.
7. The method according to claim 6, wherein, in the second step, after executing each pipeline respectively allocated to the CPU and the GPU in parallel to obtain a CPU execution duration and a GPU execution duration corresponding to the current load allocation policy, the method further comprises:
obtaining the execution result corresponding to each pipeline on the CPU and the GPU;
and obtaining a final query execution result corresponding to the query statement according to the execution result corresponding to each pipeline.
8. The method according to claim 7, wherein the data input duration TIN_C of the data fragment on the CPU is the duration used when the data fragment is copied in the memory of the CPU;
the data execution duration TEXE_C of the data fragment on the CPU is the duration used by the thread instance running on the CPU;
the data output duration TOUT_C of the data fragment on the CPU is the duration used for copying the execution result corresponding to the pipeline in the memory of the CPU;
the data input duration TIN_G of the data fragment on the GPU is the duration used for copying the data fragment from the memory of the CPU to the video memory of the GPU;
the data execution duration TEXE_G of the data fragment on the GPU is the duration used by the kernel function instance running on the GPU;
the data output duration TOUT_G of the data fragment on the GPU is the duration used for copying the execution result corresponding to the pipeline from the video memory of the GPU to the memory of the CPU.
9. A non-transitory storage medium having stored thereon computer readable instructions which, when executed by a processor, cause the processor to implement the method of any one of claims 1 to 8.
10. An apparatus, wherein it comprises:
one or more processors;
a non-volatile storage medium for storing one or more computer-readable instructions,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.
CN201811064037.5A 2018-09-12 2018-09-12 Load balancing method and device based on CPU-GPU Active CN109213601B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811064037.5A CN109213601B (en) 2018-09-12 2018-09-12 Load balancing method and device based on CPU-GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811064037.5A CN109213601B (en) 2018-09-12 2018-09-12 Load balancing method and device based on CPU-GPU

Publications (2)

Publication Number Publication Date
CN109213601A CN109213601A (en) 2019-01-15
CN109213601B true CN109213601B (en) 2021-01-01

Family

ID=64984143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811064037.5A Active CN109213601B (en) 2018-09-12 2018-09-12 Load balancing method and device based on CPU-GPU

Country Status (1)

Country Link
CN (1) CN109213601B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918141B (en) * 2019-03-15 2020-11-27 Oppo广东移动通信有限公司 Thread execution method, thread execution device, terminal and storage medium
CN110069527B (en) * 2019-04-22 2021-05-14 电子科技大学 Database-oriented GPU and CPU heterogeneous acceleration method
CN110096367A (en) * 2019-05-14 2019-08-06 宁夏融媒科技有限公司 A kind of panorama real-time video method for stream processing based on more GPU
CN110287212A (en) * 2019-06-27 2019-09-27 浪潮商用机器有限公司 A kind of data service handling method, system and associated component
CN110298437B (en) * 2019-06-28 2021-06-01 Oppo广东移动通信有限公司 Neural network segmentation calculation method and device, storage medium and mobile terminal
CN110490300B (en) * 2019-07-26 2022-03-15 苏州浪潮智能科技有限公司 Deep learning-based operation acceleration method, device and system
CN111062855B (en) * 2019-11-18 2023-09-05 中国航空工业集团公司西安航空计算技术研究所 Graphic pipeline performance analysis method
CN113051068A (en) * 2019-12-27 2021-06-29 中兴通讯股份有限公司 Database query method, device, equipment and storage medium
CN111240820B (en) * 2020-01-13 2020-11-24 星环信息科技(上海)有限公司 Concurrency quantity increasing speed multiplying determining method, equipment and medium
CN112181689A (en) * 2020-09-30 2021-01-05 华东师范大学 Runtime system for efficiently scheduling GPU kernel under cloud
TWI756974B (en) 2020-12-09 2022-03-01 財團法人工業技術研究院 Machine learning system and resource allocation method thereof
CN112989082B (en) * 2021-05-20 2021-07-23 南京甄视智能科技有限公司 CPU and GPU mixed self-adaptive face searching method and system
CN115437795B (en) * 2022-11-07 2023-03-24 东南大学 Video memory recalculation optimization method and system for heterogeneous GPU cluster load perception

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101091175A (en) * 2004-09-16 2007-12-19 辉达公司 Load balancing
CN101706741A (en) * 2009-12-11 2010-05-12 中国人民解放军国防科学技术大学 Method for partitioning dynamic tasks of CPU and GPU based on load balance
CN103329100A (en) * 2011-01-21 2013-09-25 英特尔公司 Load balancing in heterogeneous computing environments
US9311152B2 (en) * 2007-10-24 2016-04-12 Apple Inc. Methods and apparatuses for load balancing between multiple processing units

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101091175A (en) * 2004-09-16 2007-12-19 辉达公司 Load balancing
US9311152B2 (en) * 2007-10-24 2016-04-12 Apple Inc. Methods and apparatuses for load balancing between multiple processing units
CN101706741A (en) * 2009-12-11 2010-05-12 中国人民解放军国防科学技术大学 Method for partitioning dynamic tasks of CPU and GPU based on load balance
CN103329100A (en) * 2011-01-21 2013-09-25 英特尔公司 Load balancing in heterogeneous computing environments

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CPU-GPU异构高性能计算中的负载预测调度算法研究及应用 (Research on and Application of Load Prediction Scheduling Algorithms in CPU-GPU Heterogeneous High-Performance Computing); 沈文枫 (Shen Wenfeng); China Doctoral Dissertations Full-text Database, Information Science and Technology; 2017-02-15; p. I138-6 *

Also Published As

Publication number Publication date
CN109213601A (en) 2019-01-15

Similar Documents

Publication Publication Date Title
CN109213601B (en) Load balancing method and device based on CPU-GPU
CN110168516B (en) Dynamic computing node grouping method and system for large-scale parallel processing
Kumar et al. Data placement and replica selection for improving co-location in distributed environments
US11036608B2 (en) Identifying differences in resource usage across different versions of a software application
US9471630B2 (en) Efficient query processing on ordered views
US10496659B2 (en) Database grouping set query
Tan et al. Effectiveness assessment of solid-state drive used in big data services
US8938442B2 (en) Systems and methods for efficient paging of data
CN107704568B (en) A kind of method and device of test data addition
Premchaiswadi et al. Optimizing and tuning MapReduce jobs to improve the large‐scale data analysis process
KR101772333B1 (en) INTELLIGENT JOIN TECHNIQUE PROVIDING METHOD AND SYSTEM BETWEEN HETEROGENEOUS NoSQL DATABASES
CN107451142B (en) Method and apparatus for writing and querying data in database, management system and computer-readable storage medium thereof
Low et al. Scalability of database bulk insertion with multi-threading
Tran et al. Exploring means to enhance the efficiency of GPU bitmap index query processing
Lee et al. Performance analysis of big data ETL process over CPU-GPU heterogeneous architectures
CN108121733B (en) Data query method and device
CN110019448B (en) Data interaction method and device
WO2021061183A1 (en) Shuffle reduce tasks to reduce i/o overhead
Grambow et al. Dockerization impacts in database performance benchmarking
US11914637B2 (en) Image scaling cloud database
US20240061705A1 (en) System and method for executing data processing tasks
CN115964354A (en) Determination method, server and computer storage medium
CN117708136A (en) Spark SQL processing method, device, storage medium and system
Martins et al. Integrating Map-Reduce and Stream-Processing for Efficiency (MRSP)
US20160335321A1 (en) Database management system, computer, and database management method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant