CN111090508B - OpenCL-based dynamic task scheduling method between heterogeneous cooperative parallel computing devices - Google Patents


Info

Publication number
CN111090508B
CN111090508B (application CN201911203540.9A)
Authority
CN
China
Prior art keywords
workload
calculation
curr
block
computing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911203540.9A
Other languages
Chinese (zh)
Other versions
CN111090508A (en)
Inventor
朱正东
李少辉
李小轩
韩靖雯
王鹏博
李珍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN201911203540.9A priority Critical patent/CN111090508B/en
Publication of CN111090508A publication Critical patent/CN111090508A/en
Application granted granted Critical
Publication of CN111090508B publication Critical patent/CN111090508B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4812Task transfer initiation or dispatching by interrupt, e.g. masked
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an OpenCL-based dynamic task scheduling method among heterogeneous cooperative parallel computing devices, comprising the following steps: first, a portion of the total workload of a specified computation kernel is taken as the initial block size; during the first execution of the specified computation kernel, the task division ratio of each computing device is obtained from the theoretical peak performance of each device participating in the cooperative parallel computation; thereafter, the size of the next block and each device's task division ratio for the next computation are dynamically adjusted according to the computation speed fed back by each participating device during execution of the kernel. The method achieves feedback-driven dynamic task division and improves the overall performance of multi-device cooperative parallel computing. The invention completes the design details, implementation algorithm, and coding of these functions, and improves the resource utilization of multiple devices in parallel computing.

Description

OpenCL-based dynamic task scheduling method among heterogeneous cooperative parallel computing devices
Technical Field
The invention belongs to the technical field of computer application, and particularly relates to an OpenCL-based dynamic task scheduling method between heterogeneous cooperative parallel computing devices.
Background
With the rise of parallel programming languages such as OpenCL and CUDA, heterogeneous computing platforms composed of a host CPU and GPU-based acceleration devices have become mainstream computing architectures. Such platforms provide higher performance for computation-intensive applications. On a heterogeneous platform formed by a CPU and various acceleration devices, OpenCL offers portability and cross-platform execution, which has made it popular. However, this programming model lacks an efficient and mature task scheduling framework, so task scheduling between devices becomes especially important for fully utilizing the resources of a heterogeneous system. A static task scheduling strategy in multi-device cooperative parallel computing, given an accurately chosen task division ratio, can effectively achieve load balance among computing devices and incurs no scheduling overhead at runtime. However, obtaining the optimal task division ratio depends on time-consuming and labor-intensive offline training, and once the application program, the problem scale, the type and number of devices participating in the cooperative parallel computation, or the software and hardware configuration of the heterogeneous many-core system changes, the offline training must be performed again. With static task scheduling, a poor task division ratio may cause severe load imbalance among the devices, significantly reducing the overall performance of multi-device cooperative parallel computing.
Disclosure of Invention
In order to solve the problems in the prior art, the invention aims to provide a dynamic task scheduling method between devices in heterogeneous cooperative parallel computing based on OpenCL, and the method can improve the overall performance of multi-device cooperative parallel computing.
The invention adopts the following technical scheme:
a dynamic task scheduling method among heterogeneous cooperative parallel computing devices based on OpenCL comprises the following processes:
first, a portion of the total workload of a specified computation kernel is taken as the initial block size and executed; then, during the first execution of the specified computation kernel, the task division ratio of each computing device is obtained from the theoretical peak performance of each device participating in the cooperative parallel computation; thereafter, the size of the next block and each computing device's task division ratio for the next computation are dynamically adjusted according to the computation speed fed back by each participating device during execution of the kernel.
The OpenCL-based dynamic task scheduling method among devices in heterogeneous cooperative parallel computing specifically comprises the following steps:
S1, taking a portion of the total workload of the specified computation kernel as a first block, and cooperatively executing the first block using the computing devices;
S2, judging whether any remaining workload exists; if not, the specified computation kernel has finished executing; if so, cooperatively executing the second block using the computing devices;
S3, repeating S2 until the remaining workload is 0.
S1 comprises the following steps:
S1.1, according to the initial task division ratio R_i, distribute workload W_curr_i to computing device D_i, where W_curr_i = W_curr × R_i and W_curr = W/n, 1 ≤ i ≤ p; p is the total number of computing devices, n is a preset parameter, and W is the total workload of the specified computation kernel;
S1.2, execute on computing device D_i the workload W_curr_i assigned to it;
S1.3, when computing device D_i has completed the workload W_curr_i assigned to it, collect computing device D_i's current execution time T_curr_i and current execution speed V_curr_i;
S1.4, after all computing devices have completed their respective work, compute each device D_i's relative execution speed RV_i, where
RV_i = V_curr_i / (V_curr_1 + V_curr_2 + ... + V_curr_p);
the relative execution speed RV_i is used as the new task division ratio, i.e. the ratios are updated as R_i = RV_i, 1 ≤ i ≤ p;
S1.5, compute the current cooperative parallel execution speed V_curr, where V_curr = W_curr / T_curr and T_curr = max(T_curr_1, T_curr_2, ..., T_curr_p);
S1.6, update the completed total workload W_f and the remaining workload W_r, where W_f = W_f + W_curr and W_r = W - W_f.
The initial task division ratio R_i is either set manually or calculated from the proportional relation of the theoretical peak performance of the computing devices participating in the cooperative parallel computation.
The second block size is 2×W/n. S2 comprises the following steps:
S2.1, distribute the workload of the second block to each computing device participating in the cooperative parallel computation according to the updated task division ratios;
S2.2, execute on each computing device the workload assigned to it;
S2.3, after each computing device finishes executing its workload, collect the execution times, compute each device's relative execution speed, and update the task division ratios according to the obtained relative execution speeds;
S2.4, compute the current cooperative parallel execution speed;
S2.5, adjust the size of the next block according to the current cooperative parallel execution speed obtained in S2.4, i.e. determine the workload to be completed in the next step; by comparing the cooperative parallel execution speed of the previous block with that of the current block, and the size of the previous block with that of the current block, determine whether the next block should be doubled, halved, or kept the same size as the current block;
S2.6, update the completed total workload and the remaining workload; if the remaining workload is 0, the computation task is complete; if not, proceed to S3.
In S3:
in each iteration, if computing device D_i is an accelerator, then before D_i executes the current block, a portion of the current block's data is uploaded from the host to D_i according to the task division ratio, and after D_i finishes executing the current block, the corresponding portion of processed data is downloaded from D_i back to the host; after the current block has been processed, the size of the next block is adjusted according to the dynamic variation of the cooperative parallel execution speed and the workload, and the maximum size of the next block should not exceed the remaining workload.
In S3:
in each iteration, the difference between the size of the next block and the remaining workload after the current block is computed; if the difference is less than or equal to 0.5 times the remaining workload, the remaining workload is taken as the size of the next block; otherwise the size of the next block remains unchanged.
The invention has the following beneficial effects:
the invention discloses a heterogeneous collaborative parallel computing inter-device dynamic task scheduling method based on OpenCL. Therefore, the effect of feedback type dynamic task division is achieved, and meanwhile, the overall performance of multi-device collaborative parallel computing can be improved through the method.
Further, considering that a smaller block may result in underutilization of the computing power of the accelerator, in S3, a difference between the size of the next block and the remaining workload of the current block is calculated in each iteration, and if the difference is less than or equal to 0.5 times the remaining workload of the current block, the remaining workload of the current block is taken as the size of the next block; otherwise the size of the next block remains unchanged.
Drawings
FIG. 1 is a flowchart of a method for scheduling dynamic tasks among devices in OpenCL-based heterogeneous cooperative parallel computing.
FIG. 2 is a flowchart illustrating an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the figures and examples.
An extended OpenCL programming framework is designed, with which programmers can use OpenCL to write parallel applications that can be cooperatively executed in parallel by any type and any number of computing devices in a master-slave heterogeneous system. After a program is compiled and linked, its function must be packaged into a data structure called a kernel, i.e. the kernel is created. The runtime system is primarily responsible for dividing a computing task and distributing it fairly across multiple computing devices, then executing a device-specific kernel on each computing device to complete the sub-task assigned to it. Thus, in a master-slave heterogeneous multi-device system, a CPU and multiple accelerators can cooperatively execute data-level parallel applications, but the most critical issue is how to schedule tasks among the computing devices reasonably and efficiently. The OpenCL-based inter-device dynamic task scheduling method provided by the invention effectively solves this problem.
The research idea of the dynamic task scheduling method between devices in heterogeneous collaborative parallel computing based on OpenCL is to dynamically divide the whole iteration space of a computing kernel into a plurality of blocks with different sizes. Specifically, 1/n (namely W/n) of the total workload of a specified computational kernel is taken as an initial block size, wherein a parameter n can be manually set by a programmer, the preferred value range of the parameter n is 32-128, and the default value of the parameter n is 32; and then, dynamically adjusting the size of the next block according to the performance change of the multi-device cooperative parallel computing in the execution process of the specified computing kernel.
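As an illustration of the initial sizing described above, combined with the peak-proportional initial ratios used in step 1.1, the first partition can be sketched as follows. Python is used purely for illustration; the function name and return shape are hypothetical, since the patent specifies only the formulas.

```python
# Sketch of the initial partitioning (hypothetical helper; the patent
# gives formulas, not code).

def initial_partition(W, peaks, n=32):
    """W: total workload of the kernel; peaks: theoretical peak
    performance of each device (e.g. GFLOPS); n: block parameter,
    preferably in [32, 128], default 32.
    Returns (per-device shares of the first block, initial ratios R_i)."""
    assert n >= 1, "n must be a positive integer"
    total_peak = sum(peaks)
    ratios = [p / total_peak for p in peaks]   # R_i from peak proportions
    w_curr = W / n                             # initial block size W/n
    return [w_curr * r for r in ratios], ratios

shares, ratios = initial_partition(3200.0, [300.0, 100.0])
# first block is 100.0; shares split 75.0 / 25.0
```

With a device three times faster in theoretical peak, the first block of 100.0 work items is split 75/25, which the feedback loop then refines from measured speeds.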
Referring to fig. 1 and fig. 2, the method for scheduling dynamic tasks among devices in heterogeneous collaborative parallel computing based on OpenCL specifically includes the following steps:
step 1: cooperatively executing a first block using p computing devices, the first block having a size of W/n; the method specifically comprises the following steps:
1.1 According to the initial task division ratio R_i, distribute part of the first block's workload, W_curr_i, to device D_i, where W_curr_i = W_curr × R_i and W_curr = W/n, 1 ≤ i ≤ p. The initial task division ratio R_i can be set manually by the programmer or calculated from the proportional relation of the theoretical peak performance of the devices participating in the cooperative parallel computation.
1.2 Execute on computing device D_i the workload W_curr_i assigned to it.
1.3 After computing device D_i completes the workload assigned to it, collect computing device D_i's current execution time T_curr_i and current execution speed V_curr_i, where V_curr_i = W_curr_i / T_curr_i.
1.4 When all p computing devices have completed their respective jobs, compute each device D_i's relative execution speed RV_i, where
RV_i = V_curr_i / (V_curr_1 + V_curr_2 + ... + V_curr_p).
Here the relative execution speed RV_i is used as the new task division ratio; the ratios are updated as R_i = RV_i (1 ≤ i ≤ p).
1.5 Compute the current cooperative parallel execution speed V_curr, where V_curr = W_curr / T_curr and T_curr = max(T_curr_1, T_curr_2, ..., T_curr_p).
1.6 Update the completed total workload W_f and the remaining workload W_r, where W_f = W_f + W_curr and W_r = W - W_f.
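The feedback bookkeeping of steps 1.3 through 1.5 can be sketched as a small helper. Python is used for illustration only, and the function name is hypothetical; real code would time OpenCL command queues to obtain T_curr_i.

```python
# Sketch of steps 1.3-1.5: derive new task-division ratios and the
# cooperative execution speed from measured per-device times.

def update_ratios(workloads, times):
    """workloads[i] = W_curr_i, times[i] = T_curr_i.
    Returns (new ratios R_i = RV_i, cooperative speed V_curr)."""
    speeds = [w / t for w, t in zip(workloads, times)]  # V_curr_i = W_curr_i / T_curr_i
    total = sum(speeds)
    ratios = [v / total for v in speeds]                # RV_i = V_curr_i / sum_j V_curr_j
    w_curr = sum(workloads)
    t_curr = max(times)                                 # slowest device finishes last
    return ratios, w_curr / t_curr                      # V_curr = W_curr / T_curr

ratios, v_curr = update_ratios([50.0, 50.0], [0.25, 0.5])
# measured speeds 200 and 100 -> ratios 2/3 and 1/3; V_curr = 100 / 0.5 = 200
```

Because the ratios are normalized speeds, they sum to 1 and can be applied directly as the split for the next block.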
Step 2: judge whether any remaining workload exists; if not, the specified computation kernel has finished executing; if so, cooperatively execute a second block of size 2×W/n using the p computing devices, following the process of step 1.
The method specifically comprises the following steps:
2.1 Distribute the workload of the second block to the computing devices participating in the cooperative parallel computation according to the updated task division ratios R_i.
2.2 Execute on each computing device the workload assigned to it.
2.3 After each computing device finishes its workload, collect the execution times, compute each device's relative execution speed, and update the task division ratios according to the obtained relative execution speeds.
2.4 Compute the current cooperative parallel execution speed.
2.5 Adjust the size of the next block according to the current cooperative parallel execution speed obtained in step 2.4, i.e. determine the workload to be completed in the next step. By comparing the cooperative parallel execution speed V_prev obtained in step 1.5 with the current cooperative parallel execution speed V_curr, and the size W_prev of the previous block (i.e. the amount of work done in the previous step) with the size W_curr of the current block (i.e. the amount of work done in the current step), determine whether the size W_next of the next block should be double, half, or the same as the current block size W_curr.
2.6 Update the completed total workload and the remaining workload; if the remaining workload is 0, the computation task is complete; if not, proceed to step 3.
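The description states that W_next is doubled, halved, or kept relative to W_curr based on the two comparisons in step 2.5, but it does not spell out the exact decision table. One plausible hill-climbing reading is sketched below; this is an assumption for illustration, not the claimed rule.

```python
def next_block_size(w_prev, v_prev, w_curr, v_curr):
    """Hypothetical reading of step 2.5: if the last size change improved
    the cooperative speed V, keep moving the block size in the same
    direction; if it hurt, reverse direction; if V is unchanged, keep
    the current size."""
    if v_curr > v_prev:
        # last change helped: continue in the same direction
        return w_curr * 2 if w_curr >= w_prev else w_curr / 2
    if v_curr < v_prev:
        # last change hurt: reverse it
        return w_curr / 2 if w_curr >= w_prev else w_curr * 2
    return w_curr  # no measurable change: keep the size

# growing the block from 100 to 200 raised V from 150 to 180 -> double again
```

Under this reading the block size oscillates toward the size that maximizes the measured cooperative execution speed, which matches the feedback-driven intent of the method.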
Step 3: repeat step 2 until the remaining workload is 0. In each iteration, if computing device D_i is an accelerator, then before D_i executes the current block, a portion of the current block's data is uploaded from the host to D_i according to the task division ratio R_i, and after D_i finishes executing the current block, the corresponding portion of processed data is downloaded from D_i back to the host. After the current block has been processed, the size of the next block is adjusted according to the dynamic variation of the cooperative parallel execution speed and the workload, and the maximum size of the next block should not exceed W_r, i.e. W_next ≤ W_r. Furthermore, considering that smaller blocks may leave the computing power of the accelerator underutilized, the difference between W_next and W_r is computed in each iteration; if the difference is less than or equal to 0.5 × W_next, then W_next = W_r; otherwise W_next remains unchanged.
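The overall loop of steps 1 through 3 can be sketched end to end on simulated devices. This is a simplified illustration with invented fixed device speeds, so the cooperative speed never changes and the block size simply stays at W/n; the doubling/halving rule of step 2.5 is therefore elided, and only the ratio feedback and the tail-merge rule are exercised. A real implementation would launch OpenCL kernels and time them.

```python
# End-to-end sketch of the feedback scheduler on simulated fixed-speed
# devices (speeds are invented for illustration).

def schedule(W, device_speeds, n=32):
    p = len(device_speeds)
    ratios = [1.0 / p] * p                 # stand-in for peak-based R_i
    w_curr, w_done = W / n, 0.0            # initial block size W/n
    blocks = []
    while w_done < W:
        shares = [w_curr * r for r in ratios]
        times = [s / v for s, v in zip(shares, device_speeds)]  # T_curr_i
        speeds = [s / t for s, t in zip(shares, times)]         # V_curr_i
        total = sum(speeds)
        ratios = [v / total for v in speeds]                    # feedback: R_i = RV_i
        w_done += w_curr
        blocks.append(w_curr)
        w_rem = W - w_done
        if w_rem <= 0:
            break
        w_next = min(w_curr, w_rem)        # resize rule elided; never exceed W_r
        if abs(w_next - w_rem) <= 0.5 * w_next:
            w_next = w_rem                 # merge small tails into one block
        w_curr = w_next
    return blocks

blocks = schedule(3200.0, [300.0, 100.0])
# 32 equal blocks of 100.0 covering the whole workload
```

After the first block, the ratios converge to 0.75/0.25 for the 300/100 speed pair, so each subsequent block is split in that proportion.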
In the OpenCL-based inter-device dynamic task scheduling method, the setting of the parameter n affects the performance of the algorithm. The preferred value range of n is 32 to 128, with a default of 32; experimental results show that this default is reasonable but not necessarily optimal for every test program, and the programmer can tune n for different computation kernels. The performance of the scheduling algorithm is also related to the initial block size, but experiments show that the initial block size has little impact on performance as long as it is neither too large nor too small.
The embodiments of the invention are illustrated by a specific example: a computation task T (matrix-vector mul.cl) is tested on a system containing 1 CPU and 2 GPUs. Programmers use OpenCL to write a parallel application that can be cooperatively executed in parallel by any type and any number of computing devices in a master-slave heterogeneous system. After the program is compiled and linked, the function is packaged into a data structure called a kernel, i.e. the kernel is created. The feedback-driven dynamic task scheduling method dynamically divides the whole iteration space of the computation kernel (i.e. a data-level parallel for loop) into a number of blocks of different sizes, then dynamically adjusts the size of the next block according to performance changes of the multi-device cooperative parallel computation during execution of the specified kernel.
To evaluate the performance of the inter-device dynamic task scheduling method in OpenCL-based heterogeneous cooperative parallel computing, the test programs in Table 1 are selected, and each test program is implemented in the following ways: single-CPU-core serial execution; multi-CPU parallel execution; multi-NVIDIA-GPU parallel execution; multi-AMD-GPU parallel execution; CPU and NVIDIA GPU cooperative parallel execution; CPU and AMD GPU cooperative parallel execution; and CPU, NVIDIA GPU, and AMD GPU cooperative parallel execution. Here, multi-CPU parallel execution refers to implementing a specified test program with OpenMP and running it on an 8-core CPU; multi-NVIDIA-GPU parallel execution refers to implementing a specified test program with CUDA and running it on a specified NVIDIA GPU; multi-AMD-GPU parallel execution refers to implementing a specified test program with the OpenCL programming model, without the feedback-driven dynamic scheduling algorithm, and running it on one specified AMD GPU. The CPU plus NVIDIA GPU plus AMD GPU configuration is compared using static scheduling, the split scheduling policy, the quick scheduling policy, and the feedback-driven dynamic task scheduling policy proposed herein. In addition, considering that the initial block size strongly affects the performance of the quick scheduling policy and the number of blocks strongly affects the performance of the split scheduling policy, for fairness an appropriate initial block size is manually selected for the quick policy, and an appropriate number of blocks for the split policy, for each specified test program and problem scale.
TABLE 1
(Table 1, listing the selected test programs, is an image in the original publication and is not reproduced in this text version.)

Claims (5)

1. An OpenCL-based dynamic task scheduling method among heterogeneous cooperative parallel computing devices, characterized by comprising: firstly, taking a portion of the total workload of a specified computation kernel as the initial block size and executing it; then, during the first execution of the specified computation kernel, obtaining the task division ratio of each computing device from the theoretical peak performance of each device participating in the cooperative parallel computation; and then, during execution of the specified computation kernel, dynamically adjusting the size of the next block and each computing device's task division ratio for the next computation according to the computation speed fed back by each computing device participating in the cooperative parallel computation;
the method comprises the following steps:
S1, taking a portion of the total workload of the specified computation kernel as a first block, and cooperatively executing the first block using the computing devices;
S2, judging whether any remaining workload exists; if not, the specified computation kernel has finished executing; if so, cooperatively executing the second block using the computing devices;
S3, repeating step S2, cooperatively executing the next block using the computing devices, until the remaining workload is 0;
S1 comprises the following steps:
S1.1, according to the initial task division ratio R_i, distributing workload W_curr_i to computing device D_i, where W_curr_i = W_curr × R_i and W_curr = W/n, 1 ≤ i ≤ p; p is the total number of computing devices, n is a preset parameter, and W is the total workload of the specified computation kernel;
S1.2, executing on computing device D_i the workload W_curr_i assigned to it;
S1.3, when computing device D_i has completed the workload W_curr_i assigned to it, collecting computing device D_i's current execution time T_curr_i and current execution speed V_curr_i;
S1.4, after all computing devices have completed their respective work, computing each device D_i's relative execution speed RV_i, where
RV_i = V_curr_i / (V_curr_1 + V_curr_2 + ... + V_curr_p);
the relative execution speed RV_i being used as the new task division ratio, the task division ratios are updated as R_i = RV_i, 1 ≤ i ≤ p;
S1.5, calculating the current cooperative parallel execution speed V_curr, where V_curr = W_curr / T_curr and
T_curr = max(T_curr_1, T_curr_2, ..., T_curr_p);
S1.6, updating the completed total workload W_f and the remaining workload W_r, where W_f = W_f + W_curr and W_r = W - W_f.
2. The method for dynamic task scheduling among devices in OpenCL-based heterogeneous cooperative parallel computing according to claim 1, wherein the initial task division ratio R_i is set manually or calculated according to the proportional relation of the theoretical peak performance of the computing devices participating in the cooperative parallel computation.
3. The method for dynamic task scheduling among devices in OpenCL-based heterogeneous cooperative parallel computing according to claim 1, wherein the second block size is 2×W/n, and S2 comprises the following steps:
S2.1, distributing the workload of the second block to each computing device participating in the cooperative parallel computation according to the updated task division ratios;
S2.2, executing on each computing device the workload assigned to it;
S2.3, after each computing device finishes executing its workload, collecting the execution times, calculating each device's relative execution speed, and updating the task division ratios according to the obtained relative execution speeds;
S2.4, calculating the current cooperative parallel execution speed;
S2.5, adjusting the size of the next block according to the current cooperative parallel execution speed obtained in S2.4, and determining the workload to be completed in the next step; determining whether the size of the next block should be doubled, halved, or kept unchanged relative to the current block size by comparing the cooperative parallel execution speed of the previous block with that of the current block, and the size of the previous block with that of the current block;
S2.6, updating the completed total workload and the remaining workload; if the remaining workload is 0, the computation task is complete; if not, proceeding to S3.
4. The method for dynamically scheduling the tasks among the devices in the OpenCL-based heterogeneous cooperative parallel computing, according to claim 1, wherein in S3:
in each iteration, if computing device D_i is an accelerator, a portion of the current block's data is uploaded from the host to computing device D_i according to the task division ratio before D_i executes the current block, and after computing device D_i finishes executing the current block, the corresponding portion of processed data of the current block is downloaded from computing device D_i to the host; and after the current block has been processed, the size of the next block is adjusted according to the dynamic variation of the cooperative parallel execution speed and the workload, the maximum size of the next block not exceeding the remaining workload.
5. The method for dynamically scheduling the tasks among the devices in the heterogeneous collaborative parallel computing based on the OpenCL as claimed in claim 4, wherein in S3:
calculating a difference between the size of the next block and the remaining workload of the current block in each iteration, and if the difference is less than or equal to 0.5 times the remaining workload of the current block, taking the remaining workload of the current block as the size of the next block; otherwise the size of the next block remains unchanged.
CN201911203540.9A 2019-11-29 2019-11-29 OpenCL-based dynamic task scheduling method between heterogeneous cooperative parallel computing devices Active CN111090508B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911203540.9A CN111090508B (en) 2019-11-29 2019-11-29 OpenCL-based dynamic task scheduling method between heterogeneous cooperative parallel computing devices

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911203540.9A CN111090508B (en) 2019-11-29 2019-11-29 OpenCL-based dynamic task scheduling method between heterogeneous cooperative parallel computing devices

Publications (2)

Publication Number Publication Date
CN111090508A CN111090508A (en) 2020-05-01
CN111090508B true CN111090508B (en) 2023-04-14

Family

ID=70393336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911203540.9A Active CN111090508B (en) 2019-11-29 2019-11-29 OpenCL-based dynamic task scheduling method between heterogeneous cooperative parallel computing devices

Country Status (1)

Country Link
CN (1) CN111090508B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116048742B (en) * 2022-05-30 2023-11-07 荣耀终端有限公司 Data processing method and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062249A (en) * 2017-12-11 2018-05-22 成都博睿德科技有限公司 Cloud data allocation and scheduling method based on big data
DE102017109239A1 (en) * 2017-04-28 2018-10-31 Ilnumerics Gmbh Computer-implemented process, computer-readable media and heterogeneous computer system
CN109542596A (en) * 2018-10-22 2019-03-29 西安交通大学 A scheduling framework based on OpenCL kernel tasks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130339978A1 (en) * 2012-06-13 2013-12-19 Advanced Micro Devices, Inc. Load balancing for heterogeneous systems

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102017109239A1 (en) * 2017-04-28 2018-10-31 Ilnumerics Gmbh Computer-implemented process, computer-readable media and heterogeneous computer system
CN108062249A (en) * 2017-12-11 2018-05-22 成都博睿德科技有限公司 Cloud data allocation and scheduling method based on big data
CN109542596A (en) * 2018-10-22 2019-03-29 西安交通大学 A scheduling framework based on OpenCL kernel tasks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
孟祥宾; 隋志强; 王修银; 唐祥功; 段疾病. Research on a general framework for multi-core heterogeneous parallel computing in seismic processing. Oil and Gas Geophysics (油气地球物理), 2014(02), full text. *
贾海鹏; 张云泉; 徐建良. Research on OpenCL-based optimization of the image integral algorithm. Computer Science (计算机科学), 2013(02), full text. *

Also Published As

Publication number Publication date
CN111090508A (en) 2020-05-01

Similar Documents

Publication Publication Date Title
CN109814986B (en) Task parallel processing method, storage medium, computer equipment, device and system
Lastovetsky et al. Model-based optimization of EULAG kernel on Intel Xeon Phi through load imbalancing
CN105487838A (en) Task-level parallel scheduling method and system for dynamically reconfigurable processor
Menon et al. Automated load balancing invocation based on application characteristics
CN101639788B (en) Multi-core parallel method for continuous system simulation based on TBB threading building blocks
Shetti et al. Optimization of the HEFT algorithm for a CPU-GPU environment
Bosilca et al. Performance portability of a GPU enabled factorization with the DAGuE framework
Lai et al. Accelerating Strassen-Winograd's matrix multiplication algorithm on GPUs
Acosta et al. Towards the dynamic load balancing on heterogeneous multi-GPU systems
Clarke et al. Fupermod: A framework for optimal data partitioning for parallel scientific applications on dedicated heterogeneous hpc platforms
CN108470211B (en) Method and device for realizing convolution calculation and computer storage medium
Agullo et al. Bridging the gap between performance and bounds of cholesky factorization on heterogeneous platforms
CN111090508B (en) OpenCL-based dynamic task scheduling method between heterogeneous cooperative parallel computing devices
Alonso et al. Experimental study of six different implementations of parallel matrix multiplication on heterogeneous computational clusters of multicore processors
Posner et al. Transparent resource elasticity for task-based cluster environments with work stealing
Ciorba et al. Dynamic multi phase scheduling for heterogeneous clusters
CN114138440A (en) Operator execution device, operator scheduling device, method and chip
Ilic et al. Simultaneous multi-level divisible load balancing for heterogeneous desktop systems
Christou et al. Earth system modelling on system-level heterogeneous architectures: EMAC (version 2.42) on the Dynamical Exascale Entry Platform (DEEP)
Kunzman et al. Programming heterogeneous systems
Nesi et al. Communication-aware load balancing of the LU factorization over heterogeneous clusters
Gharajeh et al. Heuristic-based task-to-thread mapping in multi-core processors
Shao et al. Modeling the Cost of Redistribution in Scheduling.
Biswas et al. Portable parallel programming for the dynamic load balancing of unstructured grid applications
CN111221640B (en) GPU-CPU cooperative energy saving method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant