CN114356550B - Automatic computing resource allocation method and system for three-level parallel middleware - Google Patents

Automatic computing resource allocation method and system for three-level parallel middleware

Info

Publication number
CN114356550B
CN114356550B
Authority
CN
China
Prior art keywords
cpu
task
gpu
program
tasks
Prior art date
Legal status
Active
Application number
CN202111503888.7A
Other languages
Chinese (zh)
Other versions
CN114356550A (en)
Inventor
刘金硕
毛煜灵
王欣盛
付盼
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202111503888.7A
Publication of CN114356550A
Application granted
Publication of CN114356550B
Legal status: Active


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Multi Processors (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method and system for automatically allocating computing resources for three-level parallel middleware, guided by distributed technology, middleware, and computer architecture and aimed at large-scale computing tasks in a cluster environment. The model analyzes the computing tasks executed on the cluster, first assigning the tasks to each computing unit in the form of queues through message middleware, and then reassigning the tasks to the CPU and GPU within each computing unit. The invention can be applied to clusters, and also to a single computer when the task scale is small. The computing tasks on a single machine are allocated appropriately between the CPU and the GPU so that both can run simultaneously, improving the computation speed. Under the cluster mode, with large computing tasks, the invention further improves the computation speed.

Description

Automatic computing resource allocation method and system for three-level parallel middleware
Technical Field
The invention belongs to the technical fields of GPU (graphics processing unit) computing, cluster computing, multithreading, resource scheduling, and middleware in computer science. It relates to a method and system for automatically allocating computing resources, in particular to a method and system for reasonably allocating computing tasks in parallel computing on a heterogeneous system, and can effectively improve the speed of parallel computation of a program on a heterogeneous system.
Background
GPU computing is the use of GPUs (graphics processing units) as coprocessors that accelerate CPUs, speeding up scientific, analytics, engineering, consumer, and enterprise applications. The GPU accelerates applications running on the CPU by offloading some of the compute-intensive and time-consuming portions of the code; the rest of the application still runs on the CPU. From the user's perspective, the application runs faster because it uses the massive parallel processing power of the GPU to improve performance. This is known as "heterogeneous" or "hybrid" computing.
A CPU typically consists of 4 to 8 cores, while a GPU consists of hundreds of smaller cores. These cores work together to process the data in an application. This massively parallel architecture gives the GPU its high computational performance, and many GPU-accelerated applications provide a convenient path to High Performance Computing (HPC).
Resource scheduling is the allocation of the resources a job requires; here it refers mainly to computing resources such as threads, processes, data streams, or hardware resources. In this invention it refers to the allocation of tasks within a cluster and the allocation of computing tasks between the CPU and GPU on a single machine.
Disclosure of Invention
The invention aims to provide a method and system for automatically allocating computing resources for three-level parallel middleware, so as to improve computing speed.
The technical scheme adopted by the method of the invention is as follows: an automatic computing resource allocation method for three-level parallel middleware, comprising the following steps:
Step 1: predict the running time of the program using program instrumentation, comprising the following sub-steps:
Step 1.1: locate the loop statements and branch statements, instrument the loop structures and the branch statements, and obtain the loop counts and branch counts;
Step 1.2: locate the MPI communication functions and instrument them to obtain the communication data volume;
Step 1.3: take the instrumented version of the original program as the prediction object; after instrumentation is complete, run the newly generated program on its inputs to obtain the program's CPU running time, GPU running time, and overall running time as outputs, then feed these back as inputs and repeat the computation cyclically several times until the expected value is reached, i.e., the output values stabilize with an error of less than 0.001;
Step 2: partition the computing tasks and allocate them appropriately to the CPU and GPU of the task execution unit for computation, realizing resource scheduling and load balancing.
The system of the invention adopts the following technical scheme: an automatic computing resource allocation system for three-level parallel middleware, comprising the following modules:
Module 1 is used to predict the running time of a program using program instrumentation, and its implementation comprises the following sub-modules:
Module 1.1 is used to locate the loop statements and branch statements, instrument the loop structures and the branch statements, and obtain the loop counts and branch counts;
Module 1.2 is used to locate the MPI communication functions and instrument them to obtain the communication data volume;
Module 1.3 is used to take the instrumented version of the original program as the prediction object, run the newly generated program on its inputs after instrumentation to obtain the program's CPU running time, GPU running time, and overall running time as outputs, feed these back as inputs, and repeat cyclically several times until the expected value is reached;
Module 2 is used to partition the computing tasks and allocate them appropriately to the CPU and GPU of the task execution unit for computation, realizing resource scheduling and load balancing.
The automatic computing resource allocation system of the invention mainly comprises a task allocation module and a computing resource scheduling module. The task allocation module allocates tasks to each task computing unit in the form of queues through message middleware when the host communicates with the cluster. The computing resource scheduling module allocates the computing tasks on a single machine appropriately between the CPU and the GPU, so that both can run simultaneously and the computing speed is improved.
The computing resource scheduling module allocates tasks to the CPU and GPU on a single task execution unit according to the principle that the total time is shortest when the CPU and GPU finish executing simultaneously.
The automatic computing resource allocation system also predicts the running time of the target program; depending on the form of the task, the prediction can use program instrumentation or a skeleton-program-based method. When program instrumentation is used, the fitting model is chosen according to the specific situation.
Guided by distributed technology, middleware, and computer architecture, the invention provides an automatic computing resource allocation method oriented to three-level parallel middleware for large-scale computing tasks in a cluster environment. The method analyzes the computing tasks executed on the cluster: the tasks are first distributed to each computing unit in the form of queues through the message middleware, and then redistributed to the CPU and GPU within each computing unit. The invention can be applied to clusters, and also to a single computer when the task scale is small. The computing tasks on a single machine are allocated appropriately between the CPU and the GPU so that both can run simultaneously, improving the computation speed. Under the cluster mode, with large computing tasks, the invention further improves the computation speed.
Drawings
FIG. 1 is a flow chart of a program run-time prediction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a resource scheduling model of middleware on a single node according to an embodiment of the present invention;
FIG. 3 is a general flow chart of CPU/GPU computing for fine-grained parallel computing in accordance with an embodiment of the invention;
FIG. 4 is a diagram of the overall architecture of a CPU/GPU for coarse-grained parallel computing in accordance with an embodiment of the invention.
Detailed Description
To facilitate understanding and practice of the invention by those of ordinary skill in the art, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments described here are for illustration and explanation only and are not intended to limit the invention.
The automatic computing resource allocation method for three-level parallel middleware provided by the invention can distribute the computing tasks running on a cluster to the task execution units and re-allocate the computing resources appropriately on each single task execution unit, so that the CPU and GPU execute the tasks jointly, achieving the goal of improved computing speed.
The method comprises the following steps:
Step 1: predict the running time of the program using program instrumentation;
Referring to fig. 1, this embodiment first predicts the running time of the program, using program instrumentation for the prediction. Because the platform used in this work is relatively stable, only the following problem is considered: given that a parallel program's running time on a given platform is t1 under input a, predict the program's running time under other inputs.
Since loop statements are the most time-consuming structures in parallel computation (the more iterations, the longer the time required), the loop structures are instrumented first. Since different branches of a branch statement may differ greatly in computation time, branch statements are also instrumentation targets. As for MPI communication functions, if a program has relatively few loop iterations but communicates frequently, inter-process communication may well take longer than the computation itself, so the MPI communication functions are instrumented as well. The instrumented version of the original program is taken as the prediction object; after instrumentation, the newly generated program is run on a large number of inputs, and the output values of all features are used as model inputs. A suitable prediction model is then selected and fitted, the influence of each feature value on running time is evaluated, and features are retained or discarded accordingly to prevent overfitting, until the final prediction model is obtained.
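As a concrete illustration of what such instrumentation inserts, the following sketch hand-instruments a loop, a branch, and a (stubbed) communication call. This is a minimal example of our own, not the patent's code; the counter names and the stub are hypothetical.

```cpp
#include <cstdio>

// Counters an instrumentation pass would insert (names are illustrative).
static long loop_count   = 0; // total loop iterations executed
static long branch_count = 0; // times the instrumented branch was taken
static long mpi_bytes    = 0; // bytes moved by communication calls

// Stand-in for an MPI send: a real pass would wrap the MPI call itself.
void instrumented_send(const void* /*buf*/, int count, int elem_size) {
    mpi_bytes += static_cast<long>(count) * elem_size;
}

int main() {
    int data[1024] = {0};
    for (int i = 0; i < 1024; ++i) {   // instrumented loop
        ++loop_count;
        if (i % 3 == 0) {              // instrumented branch
            ++branch_count;
            data[i] = i;
        }
    }
    instrumented_send(data, 1024, sizeof(int));
    // These feature values feed the running-time prediction model.
    std::printf("loops=%ld branches=%ld comm_bytes=%ld\n",
                loop_count, branch_count, mpi_bytes);
}
```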
In the middleware setting, the parallel tasks on a cluster are usually characterized by large data volumes and high loop counts, so the program's loop count is taken as the main input. Because the running time of some programs depends on specific parameters in the program, such as the value of a certain parameter or a change of sign, feature values can be added, deleted, or modified according to the specific project.
After the final prediction model is obtained, the predicted running time of the current program can be produced by varying the feature parameters. If the original program has the property that its program logic remains unchanged on a small amount of data, skeleton-program-based prediction can be used instead, so that the running time of the original program can be estimated by running only the skeleton program; this approach is highly targeted but poorly generalizable.
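The patent does not fix a particular fitting model. As a minimal sketch of the fitting step, here is a one-feature linear model t = a*loops + b fitted by ordinary least squares; the model choice and the sample values are our own assumptions, not the patent's.

```cpp
#include <cstdio>
#include <vector>

// (loop count, measured running time) pairs from instrumented runs.
struct Sample { double loops, time; };

// Ordinary least squares for time = a*loops + b.
void fit(const std::vector<Sample>& s, double& a, double& b) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (const Sample& p : s) {
        sx += p.loops; sy += p.time;
        sxx += p.loops * p.loops; sxy += p.loops * p.time;
    }
    double n = static_cast<double>(s.size());
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    b = (sy - a * sx) / n;
}

int main() {
    std::vector<Sample> runs = {{1e4, 0.12}, {5e4, 0.51}, {1e5, 1.02}};
    double a, b;
    fit(runs, a, b);
    // Predict the running time for a new input with 2e5 loop iterations.
    std::printf("predicted t(2e5) = %.3f s\n", a * 2e5 + b);
}
```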
After the program running time has been predicted, the next part is the automatic allocation of the middleware's computing resources; the parameters required by the allocation method are given in Table 1.
TABLE 1
In a cluster environment, the tasks to be processed (mainly computing tasks) are sent from outside to the message middleware, which distributes them to the task execution units and monitors them. The results produced by the task execution units are stored in the back end. The middleware for automatic three-level parallel resource scheduling runs inside each task execution unit.
Step 2: dividing the calculation tasks, reasonably distributing the calculation tasks to the CPU and the GPU of the task execution unit for calculation, and realizing resource scheduling and load balancing.
For the problem of improving the computation rate in a heterogeneous system with CPU/GPU co-computing, this embodiment analyzes the influencing factors and uses the idea of blocking to allocate the computing tasks appropriately to the CPU and GPU, so that the computation times of the CPU and the GPU are essentially the same and the best computational effect is achieved.
The specific idea is to treat the hardware performance, thread overhead, and other factors that vary with the environment together with the data to be processed as a whole, without handling them separately. The whole data set is then partitioned using the idea of blocking, and the computation time of a single data block is used as the reference for measuring hardware performance.
The required results are the number of task blocks allocated to the CPU, the number allocated to the GPU, and the number of threads created by the CPU; a prediction of the target program's running time is also required.
Task allocation requires computing three undetermined variables: the number of blocks allocated to the CPU, the number of blocks allocated to the GPU, and the number of threads created on the CPU. The final goal is to use this task allocation strategy to reduce the total execution time as much as possible. The number of blocks allocated to the CPU is denoted b1, the number allocated to the GPU is denoted b2, and the number of threads created on the CPU is denoted Thread.
Referring to fig. 2, in this embodiment the total workload is divided by the distributed framework, with the total task number N as input, and dispatched through message middleware; the message middleware distributes the N tasks among the task execution units, monitors the running state of each unit, and stores the results once the task execution units finish running.
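A minimal sketch of this queue-based distribution follows. It is our own illustration: the patent names no specific middleware product, and round-robin is just one plausible dispatch policy.

```cpp
#include <cstdio>
#include <queue>
#include <vector>

int main() {
    const int N = 10;      // total task count received by the middleware
    const int units = 3;   // task execution units in the cluster
    // One queue per task execution unit, as the middleware maintains.
    std::vector<std::queue<int>> q(units);
    for (int task = 0; task < N; ++task)
        q[task % units].push(task);      // round-robin dispatch
    for (int u = 0; u < units; ++u)
        std::printf("unit %d holds %zu tasks\n", u, q[u].size());
}
```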
The message middleware of this embodiment takes the task count of a task execution unit as input and, based on the number of blocks allocated to the CPU, the number of blocks allocated to the GPU, and the number of threads created on the CPU, allocates the tasks through the resource scheduling model according to the principle that the total computation time is shortest when the CPU and GPU finish their tasks simultaneously, so that both compute at the same time;
Referring to fig. 4, for each arrival of new data, the resource scheduling model of this embodiment first checks whether the relevant parameters (the CPU block count, GPU block count, and CPU thread count mentioned above) are already configured in the database for the Worker to call. If not, a task is issued; the Worker invokes the instrumentation program to perform the fitting and obtains these parameters from the results (after the information collected by instrumentation has been analyzed and fitted). The parameters and the data are then sent to the task queue to wait for a slave node to pull and run the task. The innovation of this model is that a mathematical model pre-allocates the CPU block count, GPU block count, and thread count in advance, so that the program uses the CPU and GPU simultaneously and efficiently.
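The following sketch illustrates that look-up-then-fit flow. It is our own reading of the description: the database and task queue are stood in by in-memory containers, and fit_parameters is a hypothetical stand-in for the instrumentation-and-fitting step.

```cpp
#include <cstdio>
#include <map>
#include <queue>
#include <string>
#include <utility>

struct Params { int cpu_blocks, gpu_blocks, threads; };

std::map<std::string, Params> database;                 // stands in for the parameter DB
std::queue<std::pair<std::string, Params>> task_queue;  // slaves pull from here

// Stand-in for the Worker calling the instrumentation program and fitting.
Params fit_parameters(const std::string& /*job*/) { return {600, 400, 7}; }

void on_new_data(const std::string& job) {
    auto it = database.find(job);
    if (it == database.end()) {               // no configured parameters yet
        Params p = fit_parameters(job);       // instrument, analyze, fit
        it = database.emplace(job, p).first;  // cache for future arrivals
    }
    task_queue.push({job, it->second});       // wait for a slave to pull it
}

int main() {
    on_new_data("matmul");                    // first arrival: fit and cache
    on_new_data("matmul");                    // second arrival: cache hit
    std::printf("queued tasks: %zu\n", task_queue.size());
}
```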
Referring to fig. 3, the creation of threads in this embodiment comprises the following sub-steps (a sketch follows the list):
(1) Create a main process and run the sequential part of the program.
(2) Enter the parallelizable region of the program and divide the tasks.
(3) Assign threads to the divided tasks according to the thread count predicted by the instrumentation program.
(4) Create one more thread than the pre-assigned threads, to control the CUDA kernel.
(5) Execute the two parts of the program in parallel.
(6) Wait for all threads to complete their computing tasks.
(7) After completion, run the remaining sequential part of the program.
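A minimal std::thread sketch of sub-steps (1) through (7) follows. It is our own illustration; the GPU-control work is simulated by a placeholder function, since the patent's CUDA code is not reproduced here.

```cpp
#include <cstdio>
#include <thread>
#include <vector>

void cpu_task(int id) { std::printf("CPU thread %d running its blocks\n", id); }

// Placeholder for the thread that launches and synchronizes the CUDA kernel.
void gpu_control() { std::printf("control thread: launch kernel, wait\n"); }

int main() {
    std::printf("sequential part\n");          // (1) main thread, sequential code
    const int workers = 3;                     // (3) thread count predicted earlier
    std::vector<std::thread> pool;
    for (int i = 0; i < workers; ++i)          // CPU worker threads
        pool.emplace_back(cpu_task, i);
    pool.emplace_back(gpu_control);            // (4) one extra thread for the GPU
    for (std::thread& t : pool) t.join();      // (6) wait for all tasks to finish
    std::printf("sequential tail part\n");     // (7) resume the sequential code
}
```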
In this embodiment, the multi-core CPU creates an appropriate number of threads, one of which is responsible for GPU scheduling while the other threads execute the parallel tasks assigned to the CPU; meanwhile the GPU executes the tasks assigned to it. A simplified mapping example is provided below the step list. The specific implementation of the mapping comprises the following steps:
(1) The code blocks found to be parallelizable during analysis are marked with a #pragma parallel directive.
(2) Threads are created on the CPU according to the number determined in the analysis stage, plus one extra thread used to control the GPU's execution of its tasks.
(3) The CPU threads execute their assigned tasks in parallel.
(4.1) Under the control of the dedicated CPU thread, the CUDA kernel performs parallel computation on its tasks.
(4.2) After the tasks to be executed on the GPU are determined, the GPU kernel is assigned to the corresponding loop and threads are allocated.
(4.3) The dimensions and sizes of the grid and blocks are determined.
(4.4) The CUDA kernel runs the computing task.
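For step (4.3), here is a minimal host-side sketch of the usual grid/block sizing arithmetic. It is our own example, written as plain C++ so it runs without a GPU, with the actual CUDA launch left as a comment; the block size of 256 is a common choice, not a value from the patent.

```cpp
#include <cstdio>

int main() {
    const int n_ig = 100000;           // task blocks assigned to the GPU
    const int threads_per_block = 256; // a common CUDA block size
    // One grid dimension sized to cover all n_ig work items.
    int blocks = (n_ig + threads_per_block - 1) / threads_per_block;
    std::printf("grid=%d block=%d\n", blocks, threads_per_block);
    // With CUDA this would be: kernel<<<blocks, threads_per_block>>>(...);
}
```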
The automatic computing resource allocation system for three-level parallel middleware in this embodiment aims to improve the computation speed of CPU/GPU co-computing in a heterogeneous system; the invention is further described below through a specific example.
The overall task set N in the cluster is distributed to the task execution units by the message middleware; if there are i task execution units, then n_1 + n_2 + ... + n_i = N. Each task execution unit then allocates computing resources again for the tasks it is to execute. Taking the i-th task execution unit as an example, the number of task blocks to execute is n_i. When allocating computing resources for n_i, a CPU thread is allocated to each task block assigned to the CPU, one thread is reserved to control the GPU, and the remaining task blocks are computed on the GPU. As shown in Table 1, in task execution unit i the number of task blocks allocated to the CPU is n_ic and the number allocated to the GPU is n_ig; the number of threads created by the CPU is Thread.
The optimal computation time is obtained when the total CPU execution time and the total GPU execution time are the same or as close as possible, and the number of CPU threads allocated directly affects the CPU's memory access and computation time. There is thus an optimal division:
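The formula itself is not reproduced in this text. Under the equal-finish-time principle just stated, one plausible reconstruction (ours, not the patent's), writing $t_c$ for the measured time one CPU thread needs per task block and $t_g$ for the GPU's time per task block, is:

$$\frac{n_{ic}\, t_c}{\mathrm{Thread}} = n_{ig}\, t_g, \qquad n_{ic} + n_{ig} = n_i .$$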
Since the GPU is far more computationally capable than the CPU, when n_ic and n_ig have a non-integer solution, the fractional tasks are assigned to the GPU. As for the number of threads created by the CPU, experiments show that once the thread count exceeds the number of CPU cores, extra overhead is incurred and the computation time increases instead, so Thread must be smaller than the number of CPU cores. In addition, Thread is related to the number of blocks n_ic allocated to the CPU. The value of Thread is therefore:
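The closed form is likewise not reproduced in this text. A host-side sketch that combines the equal-finish-time split above with the stated constraints on Thread follows; both the split formula and the choice of min(cores - 1, n_ic) for the worker-thread count are our assumptions, not the patent's.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <thread>

struct Split {
    int n_ic;    // task blocks for the CPU
    int n_ig;    // task blocks for the GPU
    int threads; // CPU worker threads (one extra thread drives the GPU)
};

// t_c: measured time for one CPU thread to process one block
// t_g: measured time for the GPU to process one block
Split plan(int n_i, double t_c, double t_g, int cores) {
    // Tentative worker-thread count: leave one core to control the GPU.
    int workers = std::max(1, cores - 1);
    // Equal finish time: n_ic*t_c/workers == n_ig*t_g, with n_ic + n_ig == n_i.
    double ratio = t_g / (t_g + t_c / workers);
    int n_ic = static_cast<int>(std::floor(n_i * ratio)); // remainder -> GPU
    int n_ig = n_i - n_ic;
    // Never create more worker threads than there are CPU blocks.
    workers = std::min(workers, std::max(1, n_ic));
    return {n_ic, n_ig, workers};
}

int main() {
    unsigned cores = std::thread::hardware_concurrency();
    Split s = plan(/*n_i=*/1000, /*t_c=*/4.0, /*t_g=*/0.5, static_cast<int>(cores));
    std::printf("CPU blocks=%d GPU blocks=%d CPU threads=%d\n",
                s.n_ic, s.n_ig, s.threads);
}
```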
The invention can be applied to a cluster or a single machine, improving the computation speed through resource scheduling. By combining the parallelization framework with the cluster, the method can run on each task execution unit separately.
It should be understood that parts of the specification not specifically set forth herein are all prior art.
It should be understood that the foregoing description of the preferred embodiments is illustrative rather than limiting, the scope of the invention being defined by the appended claims; those skilled in the art may make substitutions or modifications without departing from the scope of the invention as defined by the claims.

Claims (7)

1. An automatic computing resource allocation method for three-level parallel middleware, characterized by comprising the following steps:
Step 1: predict the running time of the program using program instrumentation, comprising the following sub-steps:
Step 1.1: locate the loop statements and branch statements, instrument the loop structures and the branch statements, and obtain the loop statement counts and branch statement counts;
Step 1.2: locate the MPI communication functions and instrument them to obtain the MPI communication data volume;
Step 1.3: take the instrumented version of the original program as the prediction object; after instrumentation, run the newly generated program on its inputs to obtain the program's CPU running time, GPU running time, and total running time as outputs, feed these back as inputs, and repeat cyclically several times until the expected value is reached, i.e., the output values stabilize with an error smaller than a preset value;
Step 2: partition the computing tasks and allocate them appropriately to the CPU and GPU of the task execution unit for computation, realizing resource scheduling and load balancing.
2. The automatic computing resource allocation method for three-level parallel middleware according to claim 1, wherein: the expected value in step 1.3 is obtained by fitting the output values according to the influence of each feature's output value on running time.
3. The automatic computing resource allocation method for three-level parallel middleware according to claim 1, wherein: in step 2, the total computing tasks are divided through a distributed framework, with the total task number N as input and dispatched through message middleware; the message middleware distributes the total task number N among the task execution units, monitors the running state of each task execution unit, and stores the results once the task execution units finish running.
4. The automatic computing resource allocation method for three-level parallel middleware according to claim 3, wherein: the message middleware takes the task count of a task execution unit as input and, based on the number of blocks allocated to the CPU, the number of blocks allocated to the GPU, and the number of threads created on the CPU, allocates the tasks through a resource scheduling model according to the principle that the total computation time is shortest when the CPU and GPU finish their tasks simultaneously, so that the CPU and GPU compute at the same time;
for each arrival of new data, the resource scheduling model first checks whether the Worker to be called already has configured relevant parameters in the database, the relevant parameters comprising the number of blocks for the CPU, the number of blocks for the GPU, and the number of threads created on the CPU; if the relevant parameters do not exist, a task is issued, the Worker calls the instrumentation program to perform the fitting, and the relevant parameters are obtained from the results; the parameters and data are then sent to the task queue to wait for a slave node to pull and run the task.
5. The automatic computing resource allocation method for three-level parallel middleware according to any one of claims 1 to 4, wherein: the CPU is responsible for processing a portion of the data blocks and for the creation and distribution of threads, and the GPU is responsible for processing the remaining data blocks.
6. The automatic computing resource allocation method for three-level parallel middleware according to claim 5, wherein the creation of threads comprises the following steps:
(1) Create the main thread of the program, and first complete the sequential part of the program in the main thread;
(2) Divide the tasks of the parallelizable loops; the main job of the division is to determine how many loop iterations the CPU executes, i.e., how many CPU threads are needed, and then to determine the number of tasks executed by the GPU;
(3) Create threads to execute the allocated tasks, with some threads performing the computation tasks and one thread controlling the GPU's execution of the tasks allocated to it;
(4) Wait for the tasks to finish completely, and merge the task results;
(5) Continue executing the remainder of the code.
7. An automatic computing resource allocation system for three-level parallel middleware, characterized by comprising the following modules:
Module 1 is used to predict the running time of a program using program instrumentation, and its implementation comprises the following sub-modules:
Module 1.1 is used to locate the loop statements and branch statements, instrument the loop structures and the branch statements, and obtain the loop counts and branch counts;
Module 1.2 is used to locate the MPI communication functions and instrument them to obtain the communication data volume;
Module 1.3 is used to take the instrumented version of the original program as the prediction object, run the newly generated program on its inputs after instrumentation to obtain the program's CPU running time, GPU running time, and overall running time as outputs, feed these back as inputs, and repeat cyclically several times until the expected value is reached, i.e., the output values stabilize with an error of less than 0.001;
Module 2 is used to partition the computing tasks and allocate them appropriately to the CPU and GPU of the task execution unit for computation, realizing resource scheduling and load balancing.
CN202111503888.7A 2021-12-10 2021-12-10 Automatic computing resource allocation method and system for three-level parallel middleware Active CN114356550B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111503888.7A CN114356550B (en) 2021-12-10 2021-12-10 Automatic computing resource allocation method and system for three-level parallel middleware

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111503888.7A CN114356550B (en) 2021-12-10 2021-12-10 Automatic computing resource allocation method and system for three-level parallel middleware

Publications (2)

Publication Number Publication Date
CN114356550A (en) 2022-04-15
CN114356550B (en) 2024-09-24

Family

ID=81099584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111503888.7A Active CN114356550B (en) 2021-12-10 2021-12-10 Automatic computing resource allocation method and system for three-level parallel middleware

Country Status (1)

Country Link
CN (1) CN114356550B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117687779B (en) * 2023-11-30 2024-04-26 山东诚泉信息科技有限责任公司 Complex electric wave propagation prediction rapid calculation method based on heterogeneous multi-core calculation platform

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105022670A (en) * 2015-07-17 2015-11-04 中国海洋大学 Heterogeneous distributed task processing system and processing method in cloud computing platform
CN113157379A (en) * 2020-01-22 2021-07-23 株式会社日立制作所 Cluster node resource scheduling method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2011293350B2 (en) * 2010-08-24 2015-10-29 Solano Labs, Inc. Method and apparatus for clearing cloud compute demand
CN110119311B (en) * 2019-04-12 2022-01-04 华中科技大学 Distributed stream computing system acceleration method based on FPGA


Also Published As

Publication number Publication date
CN114356550A (en) 2022-04-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant