CN109213601B - Load balancing method and device based on CPU-GPU

Load balancing method and device based on CPU-GPU

Info

Publication number
CN109213601B
CN109213601B
Authority
CN
China
Prior art keywords
cpu
gpu
data
execution
duration
Prior art date
Legal status
Active
Application number
CN201811064037.5A
Other languages
Chinese (zh)
Other versions
CN109213601A (en
Inventor
翁楚良
孙婷婷
黄皓
王嘉伦
Current Assignee
East China Normal University
Original Assignee
East China Normal University
Priority date
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN201811064037.5A priority Critical patent/CN109213601B/en
Publication of CN109213601A publication Critical patent/CN109213601A/en
Application granted granted Critical
Publication of CN109213601B publication Critical patent/CN109213601B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Abstract

The method constructs a pipeline query execution model on a CPU-GPU heterogeneous database system so that the CPU-GPU heterogeneous data analysis system can support query analysis in big-data scenarios; determines the total number of pipelines to be executed; starts the pipeline query execution model to distribute that number of pipelines between the CPU and the GPU, and calculates the system execution duration corresponding to every load distribution strategy from the determined execution duration of a single pipeline on the CPU and on the GPU; and finally determines the load distribution strategy corresponding to the minimum system execution duration as the optimal CPU-GPU distribution strategy. This load balancing strategy for the CPU-GPU heterogeneous data analysis system distributes the pipeline load sensibly across the different processors and makes full use of their computing resources, which not only improves system performance but also brings the system to its best overall performance.

Description

Load balancing method and device based on CPU-GPU
Technical Field
The present application relates to the field of computers, and in particular, to a load balancing method and device based on a CPU-GPU.
Background
General-purpose Graphics Processing Units (GPUs) are widely used in fields such as matrix computation and machine learning. In recent years, the rapidly growing demand of data-intensive applications has driven the development of GPU-based heterogeneous online analytical processing platforms. Because the GPU contains many computing units that can run a large number of threads simultaneously, a data analysis system that uses the GPU as its main processor outperforms a conventional CPU-based analysis system in most cases and shortens execution time by several orders of magnitude.
In a conventional relational query analysis system, when a client sends a query request, the system creates an analysis job, parses the request and converts it into a logical query plan, and a query plan optimizer selects an optimal physical query plan according to a given criterion (for example, lowest cost) and executes it. A physical query plan is a Directed Acyclic Graph (DAG) containing a plurality of operators that are executed in sequence.
In current CPU-GPU heterogeneous analysis systems, the GPU is the main processor for query execution and the execution of operators is mostly placed on the GPU, while the CPU is mainly responsible for data distribution and result collection; when a subsequent operation needs to use an intermediate result output by the previous step, the CPU also has to perform some processing on that intermediate result.
The analysis workload handled by data processing and analysis systems now faces big-data scenarios: data volumes grow exponentially and the workload keeps getting heavier. However, the GPU can only directly process data held in its own storage medium, and its video memory capacity is limited, so the GPU cannot process a large data set in a single load. When the input data or the intermediate results are too large to fit into the GPU's global memory, the analysis becomes inefficient and the task may even fail. The prior art circumvents this problem by limiting the size of the table being queried, or by shifting the computation to the CPU as a fallback strategy, but neither is an optimal solution.
In summary, on current CPU-GPU heterogeneous platforms it is effective to use the GPU to accelerate query analysis in a data analysis system, but the following problems remain: the GPU's video memory capacity is limited, so it cannot process a large data set in a single load; and task allocation between the CPU and the GPU is unbalanced, so heterogeneous processor resources are not fully utilized.
Disclosure of Invention
An object of the present application is to provide a CPU-GPU based load balancing method and device, so as to solve the prior-art problems that the GPU, with its limited video memory capacity, cannot process a large data set in a single load, and that unbalanced task allocation between the CPU and the GPU leaves heterogeneous processor resources underutilized.
According to one aspect of the application, a load balancing method based on a CPU-GPU is provided, and the method comprises the following steps:
constructing a pipeline query execution model on a CPU-GPU heterogeneous database system;
determining a total number of pipelines to be executed;
starting the pipeline query execution model, distributing the pipelines corresponding to the total number to the CPU and the GPU, and calculating system execution time lengths corresponding to all load distribution strategies according to the determined execution time lengths of the single pipeline on the CPU and the GPU respectively;
and determining the load distribution strategy corresponding to the minimum value in all the system execution time lengths as an optimal CPU-GPU distribution strategy.
Further, in the above method, the determining the total number of pipelines to be executed includes:
acquiring a query statement, wherein the query statement comprises data to be queried;
dividing the data to be queried according to a preset data fragment size to obtain the data fragments of the data to be queried and the total number of the data fragments;
and respectively starting a corresponding pipeline for each data fragment in the data to be queried, wherein the total number of the pipelines to be executed is determined by the total number of the data fragments.
Further, in the above method, the step of starting the pipeline query execution model to allocate the pipelines corresponding to the total number to the CPU and the GPU, and calculating system execution durations corresponding to all load allocation policies according to the determined execution durations of the single pipeline on the CPU and the GPU, respectively, includes:
step one, starting the pipeline query execution model and setting an initial load distribution strategy: the number of pipelines allocated to the CPU is NCPU = 0 and the number of pipelines allocated to the GPU is NGPU = N, where N is the total number of the pipelines and is a positive integer greater than or equal to 1;
step two, executing each pipeline respectively distributed on the CPU and the GPU in parallel to obtain the CPU execution duration and the GPU execution duration corresponding to the current load distribution strategy;
step three, if the CPU execution duration is equal to the GPU execution duration, determining the CPU execution duration as a system execution duration corresponding to the current load distribution strategy; if the CPU execution duration and the GPU execution duration are not equal, determining the larger value of the CPU execution duration and the GPU execution duration as the system execution duration corresponding to the current load distribution strategy;
step four, updating the load distribution strategy: the number of pipelines allocated to the CPU becomes NCPU = NCPU + 1 and the number of pipelines allocated to the GPU becomes NGPU = NGPU - 1, where NCPU + NGPU = N;
step five, repeating step two to step four until the system execution durations corresponding to all the load distribution strategies are obtained.
Further, in the foregoing method, before calculating the system execution durations corresponding to all the load distribution policies according to the determined execution durations of the single pipeline on the CPU and the GPU, respectively, the method further includes:
determining the execution duration of a single said pipeline on the CPU according to the data input duration TIN_C, the data execution duration TEXE_C and the data output duration TOUT_C of the data fragment on the CPU;
determining the execution duration of a single said pipeline on the GPU according to the data input duration TIN_G, the data execution duration TEXE_G and the data output duration TOUT_G of the data fragment on the GPU.
Further, in the above method, the formulas for obtaining the CPU execution duration and the GPU execution duration corresponding to the current load distribution policy in the second step are respectively:
TCPU = TIN_C + TEXE_C + TOUT_C + Max{TIN_C, TEXE_C, TOUT_C} × (NCPU - 1),
TGPU = TIN_G + TEXE_G + TOUT_G + Max{TIN_G, TEXE_G, TOUT_G} × (NGPU - 1),
wherein TCPU is the CPU execution duration corresponding to the current load distribution strategy, and Max{TIN_C, TEXE_C, TOUT_C} is the maximum of the data input duration TIN_C, the data execution duration TEXE_C and the data output duration TOUT_C of the data fragment on the CPU;
TGPU is the GPU execution duration corresponding to the current load distribution strategy, and Max{TIN_G, TEXE_G, TOUT_G} is the maximum of the data input duration TIN_G, the data execution duration TEXE_G and the data output duration TOUT_G of the data fragment on the GPU.
Further, in the above method, the query statement further includes a query condition, where the pipeline allocated to the CPU is a thread instance running on the CPU according to the query condition; and the pipeline distributed to the GPU is a kernel function instance running on the GPU according to the query condition.
Further, in the above method, after the step two of executing in parallel each pipeline respectively allocated to the CPU and the GPU to obtain the CPU execution duration and the GPU execution duration corresponding to the current load allocation policy, the method further includes:
obtaining the execution result corresponding to each pipeline on the CPU and the GPU;
and obtaining a final query execution result corresponding to the query statement according to the execution result corresponding to each pipeline.
Further, in the above method, the data input duration TIN_C of the data fragment on the CPU is the duration used when the data fragment is copied within the memory of the CPU;
the data execution duration TEXE_C of the data fragment on the CPU is the duration used by the thread instance running on the CPU;
the data output duration TOUT_C of the data fragment on the CPU is the duration used for copying the execution result corresponding to the pipeline within the memory of the CPU;
the data input duration TIN_G of the data fragment on the GPU is the duration used for copying the data fragment from the memory of the CPU to the video memory of the GPU;
the data execution duration TEXE_G of the data fragment on the GPU is the duration used by the kernel function instance running on the GPU;
the data output duration TOUT_G of the data fragment on the GPU is the duration used for copying the execution result corresponding to the pipeline from the video memory of the GPU to the memory of the CPU.
According to another aspect of the present application, there is also provided a non-volatile storage medium having stored thereon computer-readable instructions, which, when executed by a processor, cause the processor to implement the CPU-GPU based load balancing method as described above.
According to another aspect of the present application, there is also provided an apparatus, wherein it comprises:
one or more processors;
a non-volatile storage medium for storing one or more computer-readable instructions,
when executed by the one or more processors, cause the one or more processors to implement a CPU-GPU based load balancing method as described above.
Compared with the prior art, the present application constructs a pipeline query execution model on a CPU-GPU heterogeneous database system so that the CPU-GPU heterogeneous data analysis system can support query analysis in big-data scenarios; determines the total number of pipelines to be executed; starts the pipeline query execution model to distribute that number of pipelines between the CPU and the GPU, and calculates the system execution duration corresponding to every load distribution strategy from the determined execution duration of a single pipeline on the CPU and on the GPU; and finally determines the load distribution strategy corresponding to the minimum system execution duration as the optimal CPU-GPU distribution strategy. The load balancing strategy based on the CPU-GPU heterogeneous data analysis system distributes the pipeline load sensibly across the different processors and makes full use of their computing resources, which not only improves system performance but also brings the system to its best overall performance.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 illustrates a flow diagram of a CPU-GPU based load balancing methodology in accordance with an aspect of the subject application;
FIG. 2 is a schematic diagram illustrating a GPU execution duration and a CPU execution duration respectively corresponding to a load distribution policy in a CPU-GPU based load balancing method according to an aspect of the present application;
FIG. 3 is a schematic diagram illustrating the computation of the execution duration of a query corresponding to a load distribution policy in a CPU-GPU based load balancing method according to an aspect of the present disclosure;
FIG. 4 is a schematic diagram illustrating a CPU-GPU load distribution strategy for 80 million rows of data under a pipeline query execution model using a first query statement in accordance with an aspect of the subject application;
FIG. 5 illustrates a schematic diagram of a CPU-GPU load distribution strategy for 140 million rows of data under a pipeline query execution model using a first query statement, according to an aspect of the subject application;
FIG. 6 is a schematic diagram illustrating a CPU-GPU load distribution strategy for 80 million rows of data under a pipeline query execution model using a second query statement in accordance with an aspect of the subject application;
FIG. 7 is a schematic diagram illustrating a CPU-GPU load distribution strategy for 140 million rows of data under a pipeline query execution model using a second query statement in accordance with an aspect of the subject application;
the same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
As shown in FIG. 1, the CPU-GPU based load balancing method in the embodiment of the present application is applicable to the data analysis model of a relational database; it can fully exploit the characteristics of the heterogeneous processors, allocate work tasks reasonably and efficiently, and realize query analysis in big-data scenarios. The method includes step S11, step S12, step S13 and step S14, and specifically:
the step S11, constructing a pipeline query execution model on the CPU-GPU heterogeneous database system;
said step S12, determining the total number of pipelines to be executed; for example, the total number of pipelines to be executed is N;
the step S13, starting the pipeline query execution model to allocate the pipelines corresponding to the total number N to the CPU and the GPU, and calculating system execution durations corresponding to all load allocation policies according to the determined execution durations of the single pipeline on the CPU and the GPU, respectively;
in step S14, the load distribution policy corresponding to the minimum value in all the system execution durations is determined as the optimal CPU-GPU distribution policy.
Through the steps S11 to S14, query analysis in a large data volume scene is realized, task allocation between the CPU and the GPU can be balanced, processor resources are fully utilized, and system performance is improved.
In the course of executing the load distribution strategy of the CPU-GPU heterogeneous database system, the pipeline query execution model and the load distribution strategy were built on the open-source MapD Core (version 3.3.1) CPU-GPU heterogeneous database and executed on a machine equipped with one NVIDIA Tesla K80 GPU, whose graphics card contains about 22 GB of global memory. The corresponding server is equipped with two ten-core Xeon E5-2630 v4 CPUs and 224 GB of RAM; the modified system runs on a CentOS 7 Linux distribution with the Linux 3.10.0 kernel.
In this embodiment, the step S11 is to construct a pipeline query execution model on the CPU-GPU heterogeneous database system, and includes the following steps:
acquiring a query statement, wherein the query statement comprises data to be queried and one or more query conditions, and the data to be queried can be but is not limited to one or more database relational tables;
dividing the data to be queried into N data fragments according to a preset data division size, wherein when the data to be queried is, preferably, a database relational table, the data fragments are sub-tables of the database relational table and the preset data division size is a preset number of tuples of the database relational table;
starting a corresponding pipeline for each of the N data fragments (sub-query objects), so that the total number of started pipelines is N; each pipeline may be allocated either to the CPU or to the GPU. The operation of querying the corresponding data fragment is executed on each pipeline allocated to the CPU and to the GPU, and after a pipeline completes execution it yields an execution result r_i, where i is the serial number of the pipeline (or the fragment number of the data fragment) and i = 1, 2, ..., N-1, N. The execution results r_i of all pipelines are combined into the final query result R corresponding to the query statement, and the system execution duration under the current CPU-GPU load distribution strategy is recorded when the query finishes, thereby recording the system execution duration of the current CPU-GPU load distribution strategy and collecting the final query execution result.
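As an illustration only (not part of the claimed embodiment), the following Python sketch shows this flow under simplifying assumptions; run_on_cpu and run_on_gpu are hypothetical stand-ins for the thread instance and the kernel function instance that actually process a data fragment:

    import time
    from concurrent.futures import ThreadPoolExecutor

    def execute_query(fragments, n_cpu, run_on_cpu, run_on_gpu):
        # One pipeline per data fragment: the first n_cpu fragments run on the
        # CPU and the remaining fragments on the GPU, with the two processors
        # working at the same time.
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=2) as pool:
            cpu_part = pool.submit(lambda: [run_on_cpu(f) for f in fragments[:n_cpu]])
            gpu_part = pool.submit(lambda: [run_on_gpu(f) for f in fragments[n_cpu:]])
            results = cpu_part.result() + gpu_part.result()  # R = {r_1, ..., r_N}
        elapsed = time.perf_counter() - start                # system execution duration
        return results, elapsed

The duration recorded in this way for each load distribution strategy is what the search over strategies described below compares.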
In this embodiment, the step S12 of determining the total number of pipelines to be executed includes:
acquiring a query statement, wherein the query statement comprises data to be queried; if the query statement further comprises a query condition, the pipeline distributed to the CPU is a thread instance running on the CPU according to the query condition; and the pipeline distributed to the GPU is a kernel function instance running on the GPU according to the query condition.
dividing the data to be queried according to the preset data fragment size to obtain the data fragments of the data to be queried and the total number of the data fragments; for example, the total number n of data fragments of the data to be queried = (size of the data to be queried) / (preset data fragment size);
respectively starting a corresponding pipeline for each data fragment of the data to be queried, where the total number of pipelines to be executed is determined by the total number of data fragments; for example, if a corresponding pipeline is started for each data fragment, then N data fragments start N pipelines, and the total number N of pipelines to be executed by the system is determined by the total number of data fragments.
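A minimal sketch of this computation (illustrative names only; rounding up simply ensures that a trailing partial fragment also gets its own pipeline):

    import math

    def pipeline_total(rows_to_query: int, rows_per_fragment: int) -> int:
        # Total number of pipelines N = total number of data fragments n,
        # i.e. the size of the data to be queried divided by the preset
        # data fragment size.
        return math.ceil(rows_to_query / rows_per_fragment)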
In this embodiment, the step S13 starts the pipeline query execution model to allocate the pipelines corresponding to the total number to the CPU and the GPU, and calculates system execution durations corresponding to all load allocation policies according to the determined execution durations of the single pipeline on the CPU and the GPU, respectively, including:
step one, starting the pipeline query execution model and setting an initial load distribution strategy: the number of pipelines allocated to the CPU is NCPU = 0 and the number of pipelines allocated to the GPU is NGPU = N, where N is the total number of the pipelines and is a positive integer greater than or equal to 1;
step two, executing in parallel each pipeline respectively allocated to the CPU and to the GPU, so as to obtain the CPU execution duration TCPU and the GPU execution duration TGPU corresponding to the current load distribution strategy;
step three, if the CPU execution duration TCPU equals the GPU execution duration TGPU, determining the CPU execution duration as the system execution duration corresponding to the current load distribution strategy, i.e. T = TCPU or T = TGPU; if the CPU execution duration TCPU and the GPU execution duration TGPU are not equal, determining the larger of the two as the system execution duration corresponding to the current load distribution strategy, i.e. T = Max{TCPU, TGPU}. Because the CPU and the GPU start executing at the same time, the system achieves the best performance only when the difference between the execution durations of the two processors is smallest; in the ideal case, when that difference is 0, the system can be considered to have reached the best load balancing state, that is, the corresponding distribution of pipelines between the CPU and the GPU is optimal.
step four, updating the load distribution strategy: the number of pipelines allocated to the CPU becomes NCPU = NCPU + 1 and the number of pipelines allocated to the GPU becomes NGPU = NGPU - 1, where NCPU + NGPU = N;
step five, repeating step two to step four until the system execution durations corresponding to all the load distribution strategies are obtained, that is, until all the load distribution strategies have been executed.
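Steps one to five amount to an exhaustive sweep over every (NCPU, NGPU) split of the N pipelines. A compact Python sketch of that sweep is given below; measure_processor is a hypothetical callback that runs the given number of pipelines on one processor and returns its execution duration, standing in for the parallel execution of step two:

    def find_optimal_split(total_pipelines, measure_processor):
        durations = {}                       # T[] in the text: one entry per strategy
        n_cpu, n_gpu = 0, total_pipelines    # step one: initially everything on the GPU
        while n_gpu >= 0:
            # step two: obtain both processors' durations (in the real system the
            # CPU-side and GPU-side pipelines run concurrently)
            t_cpu = measure_processor('cpu', n_cpu)
            t_gpu = measure_processor('gpu', n_gpu)
            # step three: the system duration is the slower of the two processors
            durations[(n_cpu, n_gpu)] = max(t_cpu, t_gpu)
            # step four: move one pipeline from the GPU to the CPU
            n_cpu, n_gpu = n_cpu + 1, n_gpu - 1
        # step five finished: pick the strategy with the smallest system duration,
        # which is what FindMin(T[]) does in step S14
        return min(durations, key=durations.get)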
In this embodiment, before the step S13 of calculating the system execution durations corresponding to all the load distribution policies according to the determined execution durations of the single pipeline on the CPU and the GPU, respectively, the method further includes:
determining the execution duration of a single pipeline on the CPU according to the data input duration TIN_C, the data execution duration TEXE_C and the data output duration TOUT_C of the data fragment on the CPU; that is, the execution duration of a single pipeline on the CPU consists of three phases, the data input duration TIN_C, the data execution duration TEXE_C and the data output duration TOUT_C of the data fragment on the CPU, where the data input duration TIN_C is the duration used when the data fragment is copied within the memory of the CPU, the data execution duration TEXE_C is the duration used by the thread instance running on the CPU, and the data output duration TOUT_C is the duration used for copying the execution result corresponding to the pipeline within the memory of the CPU;
determining the execution duration of a single pipeline on the GPU according to the data input duration TIN_G, the data execution duration TEXE_G and the data output duration TOUT_G of the data fragment on the GPU; that is, the execution duration of a single pipeline on the GPU consists of three phases, the data input duration TIN_G, the data execution duration TEXE_G and the data output duration TOUT_G of the data fragment on the GPU, where the data input duration TIN_G is the duration used for copying the data fragment from the memory of the CPU to the video memory of the GPU, the data execution duration TEXE_G is the duration used by the kernel function instance running on the GPU, and the data output duration TOUT_G is the duration used for copying the execution result corresponding to the pipeline from the video memory of the GPU to the memory of the CPU.
Because the pipelines allocated to the CPU or to the GPU overlap one another when executed in parallel, and two adjacent pipelines differ only by the execution duration of one phase, the formulas for the CPU execution duration and the GPU execution duration corresponding to the current load distribution strategy in step two are respectively:
the CPU execution duration TCPU corresponding to the current load distribution strategy is:
TCPU = TIN_C + TEXE_C + TOUT_C + Max{TIN_C, TEXE_C, TOUT_C} × (NCPU - 1),
Because the execution durations of the three phases of a pipeline differ from one another, whether on the CPU or on the GPU, the phase with the longest execution duration among the three is taken as the execution time difference between two adjacent pipelines, so that the overlap between adjacent pipelines is recorded accurately. For example, on the CPU the maximum Max{TIN_C, TEXE_C, TOUT_C} of the data input duration TIN_C, the data execution duration TEXE_C and the data output duration TOUT_C of the data fragment gives the execution time difference between two adjacent pipelines; multiplying this maximum phase duration by the number of remaining pipelines yields the total duration of the overlapping portion. FIG. 2 is a schematic diagram of the calculation of the GPU execution duration and the CPU execution duration under the current load distribution strategy, where the left side of FIG. 2 shows the pipeline execution process on the GPU and the right side of FIG. 2 shows the pipeline execution process on the CPU.
As shown in FIG. 3, the GPU execution duration TGPU corresponding to the current load distribution strategy is:
TGPU = TIN_G + TEXE_G + TOUT_G + Max{TIN_G, TEXE_G, TOUT_G} × (NGPU - 1),
where Max{TIN_G, TEXE_G, TOUT_G} is the maximum of the data input duration TIN_G, the data execution duration TEXE_G and the data output duration TOUT_G of the data fragment on the GPU;
here, FIG. 3 is a schematic diagram of the execution duration corresponding to a query after the pipelines are divided between the two processors, CPU and GPU, under a load distribution strategy; the point Origin End indicates the execution end time when all of the load is placed on the GPU, and End indicates the execution end time after the load is balanced across both the CPU and the GPU.
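The two formulas above translate directly into code. The sketch below is illustrative only (the per-fragment phase durations are assumed to be supplied by the caller, for example from profiling one fragment on each processor) and mirrors the overlap rule that each additional pipeline adds only its longest phase:

    def processor_duration(t_in, t_exe, t_out, n_pipelines):
        # The first pipeline pays for all three phases; every further pipeline
        # adds only the longest phase, because adjacent pipelines overlap.
        if n_pipelines == 0:
            return 0.0
        return t_in + t_exe + t_out + max(t_in, t_exe, t_out) * (n_pipelines - 1)

    def system_duration(cpu_phases, gpu_phases, n_cpu, n_gpu):
        # The CPU and the GPU start at the same time, so the system finishes
        # with the slower of the two processors (step three above).
        return max(processor_duration(*cpu_phases, n_cpu),
                   processor_duration(*gpu_phases, n_gpu))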
In this embodiment, after the step two of executing, in parallel, each pipeline respectively allocated to the CPU and the GPU to obtain the CPU execution duration and the GPU execution duration corresponding to the current load allocation policy, the method further includes:
obtaining the execution result corresponding to each pipeline on the CPU and on the GPU; for example, the execution result obtained after the i-th pipeline completes execution is r_i, where i is the serial number of the pipeline (or the fragment number of the data fragment) and i = 1, 2, ..., N-1, N;
obtaining the final query execution result R corresponding to the query statement according to the execution result corresponding to each pipeline, for example R = {r_1, r_2, ..., r_i, ..., r_(N-1), r_N}.
Following the above embodiment of the present application, in step S14 the load distribution strategy corresponding to the minimum value among the system execution durations of all the load distribution strategies is taken as the optimal CPU-GPU load distribution strategy, i.e. {OptNGPU, OptNCPU} = FindMin(T[]), where OptNGPU is the load of the GPU (the number of pipelines allocated to the GPU) under the optimal load distribution strategy, OptNCPU is the load of the CPU (the number of pipelines allocated to the CPU) under the optimal load distribution strategy, T[] is the array storing the system execution durations under all the load distribution strategies, and FindMin finds the minimum value among the system execution durations under all the load distribution strategies.
In a practical application scenario of the CPU-GPU based load balancing method provided in the present application, as shown in FIG. 4 and FIG. 5, a first query statement is used to query 80 million rows and 140 million rows of data respectively, the first query statement being: select avg(attr1) from tbl1 group by attr2. In FIG. 4 and FIG. 5 the left vertical axis represents the system execution duration T corresponding to the current load distribution strategy, the right vertical axis (GPU Pipeline Number) represents the number NGPU of pipelines allocated to the GPU, and the horizontal axis (CPU Pipeline Number) represents the number NCPU of pipelines allocated to the CPU; the total number of pipelines on the CPU and the GPU remains unchanged (i.e. NGPU + NCPU = 3 in FIG. 4 and NGPU + NCPU = 5 in FIG. 5), and Pipeline Workload Partition denotes the allocation strategy for the pipeline load. When the number of pipelines allocated to the CPU is 0, all of the load is executed on the GPU, which is the execution mode of a conventional CPU-GPU heterogeneous processing and analysis system. As can be seen from FIG. 4, when the CPU load equals 1 and the GPU load equals 2 (i.e. the number of pipelines allocated to the CPU is 1 and the number allocated to the GPU is 2), the system execution duration T corresponding to the current load distribution strategy is 587 milliseconds, which is the shortest (T = Tmin) among the system execution durations of all load distribution strategies for the 80-million-row query data, so a CPU load of 1 and a GPU load of 2 is the optimal load distribution strategy for querying 80 million rows of data with the first query statement. As can be seen from FIG. 5, when the CPU load is 2 and the GPU load is 3 (i.e. the number of pipelines allocated to the CPU is 2 and the number allocated to the GPU is 3), the system execution duration corresponding to the current load distribution strategy is 936 milliseconds, which is the shortest (T = Tmin) among the system execution durations of all load distribution strategies executing 140 million rows of data, so a CPU load of 2 and a GPU load of 3 is the optimal load distribution strategy for querying 140 million rows of data with the first query statement. Under the first query statement, the system execution durations for querying 80 million rows and 140 million rows of data with the GPU bearing the entire load are 881 milliseconds and 1265 milliseconds respectively. Compared with this conventional execution mode, the system performance after using the load distribution strategy is improved by about 33% and 26% for the two data volumes respectively, where 33% = (system execution duration with the GPU bearing the entire load for 80 million rows - system execution duration corresponding to the optimal load distribution strategy) / (system execution duration with the GPU bearing the entire load for 80 million rows) = (881 milliseconds - 587 milliseconds) / 881 milliseconds, and 26% = (system execution duration with the GPU bearing the entire load for 140 million rows - system execution duration corresponding to the optimal load distribution strategy) / (system execution duration with the GPU bearing the entire load for 140 million rows) = (1265 milliseconds - 936 milliseconds) / 1265 milliseconds.
In another practical application scenario of the CPU-GPU based load balancing method provided in the present application, as shown in FIG. 6 and FIG. 7, a second query statement is used to query 80 million rows and 140 million rows of data respectively, the second query statement being: select count(*) from (select tbl1.attr1 from tbl1 join tbl2 on tbl1.attr1 = tbl2.attr1); because the second query statement contains a join operation, the number of pipelines is greater than for the first query statement. As can be seen from FIG. 6, with 80 million rows of data the system execution duration reaches its minimum Tmin = 1361 milliseconds when the CPU load is 2 and the GPU load is 7. As can be seen from FIG. 7, with 140 million rows of data the system execution duration reaches its minimum Tmin = 3488 milliseconds when the CPU load is 7 and the GPU load is 18, and the system performance is then optimal. Compared with the 1750 milliseconds and 4845 milliseconds needed to query 80 million rows and 140 million rows of data with the GPU bearing the entire load under the second query statement, the system performance after using the load distribution strategy is improved by about 22% and 28% respectively, where 22% = (system execution duration with the GPU bearing the entire load for 80 million rows - system execution duration corresponding to the optimal load distribution strategy) / (system execution duration with the GPU bearing the entire load for 80 million rows) = (1750 milliseconds - 1361 milliseconds) / 1750 milliseconds, and 28% = (system execution duration with the GPU bearing the entire load for 140 million rows - system execution duration corresponding to the optimal load distribution strategy) / (system execution duration with the GPU bearing the entire load for 140 million rows) = (4845 milliseconds - 3488 milliseconds) / 4845 milliseconds.
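For reference, the improvement percentages reported in both scenarios follow directly from the measured durations; a quick check in Python (numbers taken from the text above):

    def improvement(gpu_only_ms, balanced_ms):
        # Relative gain of the optimal load distribution strategy over running
        # the entire load on the GPU.
        return (gpu_only_ms - balanced_ms) / gpu_only_ms

    print(round(improvement(881, 587), 2))    # ~0.33, first query, 80 million rows
    print(round(improvement(1265, 936), 2))   # ~0.26, first query, 140 million rows
    print(round(improvement(1750, 1361), 2))  # ~0.22, second query, 80 million rows
    print(round(improvement(4845, 3488), 2))  # ~0.28, second query, 140 million rows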
According to another aspect of the present application, there is also provided a non-volatile storage medium having stored thereon computer-readable instructions, which, when executed by a processor, cause the processor to implement the CPU-GPU based load balancing method as described above.
According to another aspect of the present application, there is also provided an apparatus, wherein it comprises:
one or more processors;
a non-volatile storage medium for storing one or more computer-readable instructions,
when executed by the one or more processors, cause the one or more processors to implement the CPU-GPU based load balancing method as described above.
Here, the detailed content of each embodiment of the device may specifically refer to a corresponding part of a method embodiment of a load balancing method based on a CPU-GPU, which is executed in the device, and is not described herein again.
In summary, the present application constructs a pipeline query execution model on a CPU-GPU heterogeneous database system so that the CPU-GPU heterogeneous data analysis system can support query analysis in big-data scenarios; determines the total number of pipelines to be executed; starts the pipeline query execution model to distribute that number of pipelines between the CPU and the GPU, and calculates the system execution duration corresponding to every load distribution strategy from the determined execution duration of a single pipeline on the CPU and on the GPU; and finally determines the load distribution strategy corresponding to the minimum system execution duration as the optimal CPU-GPU distribution strategy. The load balancing strategy based on the CPU-GPU heterogeneous data analysis system distributes the pipeline load sensibly across the different processors and makes full use of their computing resources, which not only improves system performance but also brings the system to its best overall performance.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions described above. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (10)

1. A load balancing method based on a CPU-GPU (Central Processing Unit-Graphics Processing Unit), wherein the method comprises the following steps:
constructing a model for query analysis of a pipeline on a CPU-GPU heterogeneous database system;
determining a total number of pipelines to be executed;
starting the model for query analysis of a pipeline, distributing the pipelines corresponding to the total number to the CPU and the GPU, and calculating system execution durations corresponding to all load distribution strategies according to the determined execution durations of a single pipeline on the CPU and the GPU respectively;
and determining the load distribution strategy corresponding to the minimum value in all the system execution time lengths as the optimal load distribution strategy.
2. The method of claim 1, wherein the determining a total number of pipelines to execute comprises:
acquiring a query statement, wherein the query statement comprises data to be queried;
dividing the data to be queried according to the size of a preset data fragment to obtain the data fragment of the data to be queried and the total number of the data fragment;
and respectively starting a corresponding pipeline for each data fragment in the data to be queried, wherein the total number of the pipelines to be executed is determined by the total number of the data fragments.
3. The method of claim 2, wherein the initiating the model for performing query analysis on pipelines allocates the pipelines corresponding to the total number to the CPU and the GPU, and calculates system execution durations corresponding to all load allocation policies according to the determined execution durations of the single pipeline on the CPU and the GPU, respectively, comprises:
step one, starting a model for carrying out query analysis on a pipeline, and setting an initial load distribution strategy: the number of pipelines allocated to the CPU is NCPU = 0 and the number of pipelines allocated to the GPU is NGPU = N, wherein N is the total number of the pipelines and is a positive integer greater than or equal to 1;
step two, executing each pipeline respectively distributed on the CPU and the GPU in parallel to obtain the CPU execution duration and the GPU execution duration corresponding to the current load distribution strategy;
step three, if the CPU execution duration is equal to the GPU execution duration, determining the CPU execution duration as a system execution duration corresponding to the current load distribution strategy; if the CPU execution duration and the GPU execution duration are not equal, determining the larger value of the CPU execution duration and the GPU execution duration as the system execution duration corresponding to the current load distribution strategy;
step four, updating the load distribution strategy: the number of pipelines allocated to the CPU becomes NCPU = NCPU + 1, and the number of pipelines allocated to the GPU becomes NGPU = NGPU - 1, where NCPU + NGPU = N;
and step five, repeating the step two to the step four until system execution time lengths corresponding to all the load distribution strategies are obtained.
4. The method of claim 3, wherein before calculating the system execution durations corresponding to all the load distribution policies according to the determined execution durations of the single pipeline on the CPU and the GPU, respectively, further comprises:
determining the execution duration of a single pipeline on the CPU according to the data input duration TIN _ C, the data execution duration TEXE _ C and the data output duration TOUT _ C of the data fragments on the CPU;
and determining the execution duration of a single pipeline on the GPU according to the data input duration TIN _ G, the data execution duration TEXE _ G and the data output duration TOUT _ G of the data fragments on the GPU.
5. The method according to claim 4, wherein the formulas for obtaining the CPU execution duration and the GPU execution duration corresponding to the current load distribution policy in the second step are respectively as follows:
TCPU = TIN_C + TEXE_C + TOUT_C + Max{TIN_C, TEXE_C, TOUT_C} × (NCPU - 1),
TGPU = TIN_G + TEXE_G + TOUT_G + Max{TIN_G, TEXE_G, TOUT_G} × (NGPU - 1),
wherein TCPU is the CPU execution time corresponding to the current load distribution strategy, and Max{TIN_C, TEXE_C, TOUT_C} is the maximum value of the data input time TIN_C, the data execution time TEXE_C and the data output time TOUT_C of the data fragment on the CPU;
TGPU is the GPU execution time corresponding to the current load distribution strategy, and Max{TIN_G, TEXE_G, TOUT_G} is the maximum value of the data input time TIN_G, the data execution time TEXE_G and the data output time TOUT_G of the data fragment on the GPU.
6. The method of claim 5, wherein the query statement further comprises a query condition, wherein the pipeline allocated to the CPU is a thread instance running on the CPU according to the query condition; and the pipeline distributed to the GPU is a kernel function instance running on the GPU according to the query condition.
7. The method according to claim 6, wherein, in the second step, after executing each pipeline respectively allocated to the CPU and the GPU in parallel to obtain a CPU execution duration and a GPU execution duration corresponding to the current load allocation policy, the method further comprises:
obtaining the execution result corresponding to each pipeline on the CPU and the GPU;
and obtaining a final query execution result corresponding to the query statement according to the execution result corresponding to each pipeline.
8. The method according to claim 7, wherein the data input duration TIN_C of the data fragment on the CPU is the duration used when the data fragment is copied in the memory of the CPU;
the data execution duration TEXE_C of the data fragment on the CPU is the duration used by the thread instance running on the CPU;
the data output duration TOUT_C of the data fragment on the CPU is the duration used for copying the execution result corresponding to the pipeline in the memory of the CPU;
the data input duration TIN_G of the data fragment on the GPU is the duration used for copying the data fragment from the memory of the CPU to the video memory of the GPU;
the data execution duration TEXE_G of the data fragment on the GPU is the duration used by the kernel function instance running on the GPU;
the data output duration TOUT_G of the data fragment on the GPU is the duration used for copying the execution result corresponding to the pipeline from the video memory of the GPU to the memory of the CPU.
9. A non-transitory storage medium having stored thereon computer readable instructions which, when executed by a processor, cause the processor to implement the method of any one of claims 1 to 8.
10. An apparatus, wherein it comprises:
one or more processors;
a non-volatile storage medium for storing one or more computer-readable instructions,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.
CN201811064037.5A 2018-09-12 2018-09-12 Load balancing method and device based on CPU-GPU Active CN109213601B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811064037.5A CN109213601B (en) 2018-09-12 2018-09-12 Load balancing method and device based on CPU-GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811064037.5A CN109213601B (en) 2018-09-12 2018-09-12 Load balancing method and device based on CPU-GPU

Publications (2)

Publication Number Publication Date
CN109213601A CN109213601A (en) 2019-01-15
CN109213601B true CN109213601B (en) 2021-01-01

Family

ID=64984143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811064037.5A Active CN109213601B (en) 2018-09-12 2018-09-12 Load balancing method and device based on CPU-GPU

Country Status (1)

Country Link
CN (1) CN109213601B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918141B (en) * 2019-03-15 2020-11-27 Oppo广东移动通信有限公司 Thread execution method, thread execution device, terminal and storage medium
CN110069527B (en) * 2019-04-22 2021-05-14 电子科技大学 Database-oriented GPU and CPU heterogeneous acceleration method
CN110096367A (en) * 2019-05-14 2019-08-06 宁夏融媒科技有限公司 A kind of panorama real-time video method for stream processing based on more GPU
CN110287212A (en) * 2019-06-27 2019-09-27 浪潮商用机器有限公司 A kind of data service handling method, system and associated component
CN110298437B (en) * 2019-06-28 2021-06-01 Oppo广东移动通信有限公司 Neural network segmentation calculation method and device, storage medium and mobile terminal
CN110490300B (en) * 2019-07-26 2022-03-15 苏州浪潮智能科技有限公司 Deep learning-based operation acceleration method, device and system
CN111062855B (en) * 2019-11-18 2023-09-05 中国航空工业集团公司西安航空计算技术研究所 Graphic pipeline performance analysis method
CN113051068A (en) * 2019-12-27 2021-06-29 中兴通讯股份有限公司 Database query method, device, equipment and storage medium
CN111240820B (en) * 2020-01-13 2020-11-24 星环信息科技(上海)有限公司 Concurrency quantity increasing speed multiplying determining method, equipment and medium
CN112181689A (en) * 2020-09-30 2021-01-05 华东师范大学 Runtime system for efficiently scheduling GPU kernel under cloud
TWI756974B (en) 2020-12-09 2022-03-01 財團法人工業技術研究院 Machine learning system and resource allocation method thereof
CN112989082B (en) * 2021-05-20 2021-07-23 南京甄视智能科技有限公司 CPU and GPU mixed self-adaptive face searching method and system
CN115437795B (en) * 2022-11-07 2023-03-24 东南大学 Video memory recalculation optimization method and system for heterogeneous GPU cluster load perception

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101091175A (en) * 2004-09-16 2007-12-19 辉达公司 Load balancing
CN101706741A (en) * 2009-12-11 2010-05-12 中国人民解放军国防科学技术大学 Method for partitioning dynamic tasks of CPU and GPU based on load balance
CN103329100A (en) * 2011-01-21 2013-09-25 英特尔公司 Load balancing in heterogeneous computing environments
US9311152B2 (en) * 2007-10-24 2016-04-12 Apple Inc. Methods and apparatuses for load balancing between multiple processing units

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101091175A (en) * 2004-09-16 2007-12-19 辉达公司 Load balancing
US9311152B2 (en) * 2007-10-24 2016-04-12 Apple Inc. Methods and apparatuses for load balancing between multiple processing units
CN101706741A (en) * 2009-12-11 2010-05-12 中国人民解放军国防科学技术大学 Method for partitioning dynamic tasks of CPU and GPU based on load balance
CN103329100A (en) * 2011-01-21 2013-09-25 英特尔公司 Load balancing in heterogeneous computing environments

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CPU-GPU异构高性能计算中的负载预测调度算法研究及应用 (Research on and Application of Load Prediction Scheduling Algorithms in CPU-GPU Heterogeneous High-Performance Computing); 沈文枫 (Shen Wenfeng); China Doctoral Dissertations Full-text Database, Information Science and Technology; 2017-02-15; p. I138-6 *

Also Published As

Publication number Publication date
CN109213601A (en) 2019-01-15

Similar Documents

Publication Publication Date Title
CN109213601B (en) Load balancing method and device based on CPU-GPU
CN110168516B (en) Dynamic computing node grouping method and system for large-scale parallel processing
Kumar et al. Data placement and replica selection for improving co-location in distributed environments
US11036608B2 (en) Identifying differences in resource usage across different versions of a software application
US9471630B2 (en) Efficient query processing on ordered views
US10496659B2 (en) Database grouping set query
Tan et al. Effectiveness assessment of solid-state drive used in big data services
US8938442B2 (en) Systems and methods for efficient paging of data
CN107704568B (en) A kind of method and device of test data addition
Premchaiswadi et al. Optimizing and tuning MapReduce jobs to improve the large‐scale data analysis process
KR101772333B1 (en) INTELLIGENT JOIN TECHNIQUE PROVIDING METHOD AND SYSTEM BETWEEN HETEROGENEOUS NoSQL DATABASES
CN107451142B (en) Method and apparatus for writing and querying data in database, management system and computer-readable storage medium thereof
Low et al. Scalability of database bulk insertion with multi-threading
Tran et al. Exploring means to enhance the efficiency of GPU bitmap index query processing
Lee et al. Performance analysis of big data ETL process over CPU-GPU heterogeneous architectures
CN108121733B (en) Data query method and device
CN110019448B (en) Data interaction method and device
WO2021061183A1 (en) Shuffle reduce tasks to reduce i/o overhead
Grambow et al. Dockerization impacts in database performance benchmarking
US11914637B2 (en) Image scaling cloud database
US20240061705A1 (en) System and method for executing data processing tasks
CN115964354A (en) Determination method, server and computer storage medium
CN117708136A (en) Spark SQL processing method, device, storage medium and system
Martins et al. Integrating Map-Reduce and Stream-Processing for Efficiency (MRSP)
US20160335321A1 (en) Database management system, computer, and database management method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant