CN114201287B - Method for cooperatively processing data based on CPU + GPU heterogeneous platform

Method for cooperatively processing data based on CPU + GPU heterogeneous platform

Info

Publication number
CN114201287B
CN114201287B (application CN202210144694.0A)
Authority
CN
China
Prior art keywords
cpu
calculation
gpu
subtrees
sub
Prior art date
Legal status
Active
Application number
CN202210144694.0A
Other languages
Chinese (zh)
Other versions
CN114201287A (en)
Inventor
王宇杰 (Wang Yujie)
范长伟 (Fan Changwei)
Current Assignee
Hunan Maixi Software Co ltd
Original Assignee
Hunan Maixi Software Co ltd
Priority date
Filing date
Publication date
Application filed by Hunan Maixi Software Co ltd filed Critical Hunan Maixi Software Co ltd
Priority to CN202210144694.0A priority Critical patent/CN114201287B/en
Publication of CN114201287A publication Critical patent/CN114201287A/en
Application granted granted Critical
Publication of CN114201287B publication Critical patent/CN114201287B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to a method for cooperatively processing data based on a CPU + GPU heterogeneous platform, which specifically comprises the following steps: S1: decomposing the calculation task into a plurality of independent branch calculation tasks by using the elimination tree structure formed by the sparsity pattern of a sparse matrix, and representing the calculation tasks by the subtrees into which the branches of the elimination tree are divided; S2: evaluating the subtree decomposition calculation performance of a plurality of devices, the devices comprising a CPU (central processing unit) and a GPU (graphics processing unit), and establishing the allocation relationship between subtrees and devices; S3: calculating the corresponding subtrees on each device; S4: if the subtrees comprise a root subtree, calculating the root subtree on the corresponding devices. The invention exploits cooperative computation and mutual acceleration between the CPU and the GPU, thereby breaking through the bottleneck of CPU development and effectively addressing problems such as wasted energy consumption.

Description

Method for cooperatively processing data based on CPU + GPU heterogeneous platform
Technical Field
The invention relates to the field of data processing of heterogeneous platforms, in particular to a method for cooperatively processing data based on a CPU + GPU heterogeneous platform.
Background
Heterogeneous computing, which builds compute units from different types of instruction sets and architectures, is seen as a third era of computer processors following the single-core and multi-core eras. It matters especially for solving large-scale sparse linear systems, which is becoming a key computational step in many engineering problems such as implicit finite element simulation analysis, power network simulation, computational fluid dynamics, and weather prediction. Taking finite element simulation analysis as an example, solving the sparse linear equation system often occupies 80% of the whole computation time, so the solution efficiency of the linear system directly determines the computational performance of the entire finite element analysis. Moreover, with the rapid development of industry, the scale of finite element simulation grows ever larger and the corresponding computation increases explosively, so an efficient linear solver is crucial for improving simulation efficiency, shortening simulation time, and accelerating the iteration of industrial products.
The closest prior art, the MKL Pardiso sparse linear equation solver, contains solution algorithms for various types of equations and implements a shared-memory multithreaded parallel computing method. However, it only accelerates computation in parallel on the CPU and does not fully utilize the computing capacity of GPU devices in a CPU/GPU heterogeneous parallel computing method, so it can hardly meet the current demand for quickly solving large-scale sparse linear systems; devices are easily left idle and energy consumption is wasted.
Disclosure of Invention
The invention aims to provide a method that realizes cooperative computation and mutual acceleration between a CPU and a GPU, thereby breaking through the bottleneck of CPU development and effectively addressing problems such as energy consumption.
In order to achieve the purpose, the method for cooperatively processing data based on the CPU + GPU heterogeneous platform specifically comprises the following steps:
S1: decomposing the calculation task into a plurality of independent branch calculation tasks by using the elimination tree structure formed by the sparsity pattern of a sparse matrix, and representing the calculation tasks by the subtrees into which the branches of the elimination tree are divided;
S2: evaluating the subtree decomposition calculation performance of a plurality of devices, wherein the devices comprise a CPU (central processing unit) and a GPU (graphics processing unit), and establishing the allocation relationship between the subtrees and the devices;
S3: calculating the corresponding subtrees on each device;
S4: if the subtrees comprise a root subtree, calculating the root subtree on the corresponding devices.
As a further improvement of the method for cooperatively processing data based on the CPU + GPU heterogeneous platform, in S2.1: evaluating the actual floating point operations per second (flops) of the CPU and the GPU by using subtrees of different sizes as a reference, and generating coefficients according to the evaluation results to correct the theoretical flops:
flops_theory = c · np · d    (1)

flops = ε · flops_theory = ε · c · np · d

where c is the clock frequency, np is the number of processors, d represents the number of double-precision instructions per cycle, and ε is the correction coefficient;
s2.2: the subtrees representing the branch calculation tasks are assigned to the respective devices.
As a further improvement of the method for the cooperative processing of data based on the CPU + GPU heterogeneous platform, in S2.2:
Mapping between the branch calculation tasks and the devices is realized by using a greedy strategy;
the subtrees are sorted in descending order by floating point operation count (flop), and each branch computation task is then assigned to the device that currently has the least work.
As a further improvement of the method for the cooperative processing of data based on the CPU + GPU heterogeneous platform of the present invention,
in S2.2:
establishing a zero-one programming mathematical model:

min T
s.t. t_i = (1 / p_i) · Σ_{j=1}^{N} x_{ij} · w_j,  i = 1, …, M
     T = max{ t_1, …, t_M }
     Σ_{i=1}^{M} x_{ij} = 1,  j = 1, …, N
     x_{ij} ∈ {0, 1}    (2)

where M is the number of devices, N is the number of subtrees, p_i is the computing performance of device i, w_j is the computation amount of the j-th subtree, x_{ij} is a zero-one variable indicating whether subtree j is computed by device i, T represents the total computation time of the system, and t_i is the computation time of device i.
As a further improvement of the method for cooperative processing of data based on the CPU + GPU heterogeneous platform of the present invention, in S3:
S3.1: in GPU-based subtree computation, a batch kernel strategy is implemented so that a single kernel function performs the same dense algebraic operation on a plurality of sub-matrices; the nodes in the tree, called supernodes, are obtained by merging several columns, and a sub-matrix is a small dense matrix participating in the computation within a supernode.
S3.2: a threshold on sub-matrix size is set to divide all sub-matrices into two sets; the set whose sub-matrix sizes exceed the threshold is processed with a kernel function invoked multiple times in different CUDA streams; the other set is processed with a batch kernel function.
As a further improvement of the method for the cooperative processing of data based on the CPU + GPU heterogeneous platform, in S3:
S3.3: during computation, data transfer is divided into two types: one transfers decomposed node data from the GPU video memory to the host page-locked memory using the zero-copy memory method; the other copies data from the page-locked memory to conventional pageable memory, overlapping the asynchronous transfer with kernel execution.
As a further improvement of the method for cooperative processing of data based on the CPU + GPU heterogeneous platform of the present invention, in S3:
S3.4: in CPU-based subtree parallel computation, a sub-matrix size threshold is likewise set to divide all sub-matrices into two sets; the set whose sub-matrix sizes exceed the threshold computes each single matrix in parallel through a multi-threaded kernel function, while the other set realizes parallel computation between sub-matrices through serial kernel functions.
As a further improvement of the method for cooperative processing of data based on the CPU + GPU heterogeneous platform of the present invention, in S4:
S4.1: binding a corresponding number of CPU threads to each GPU, the combination being called a work group;
S4.2: after all the nodes in the root subtree are sorted by level, they are put into a task pool; each work group takes one node from the task pool at a time and preprocesses the node's descendants, and when this is finished all the descendants are sorted in descending order of size so that the GPU and the CPU cooperate to compute the Schur complements of the nodes in parallel;
S4.3: setting a size threshold and computing the sub-matrices larger than the threshold on the GPU; the GPU and the CPU then start computing from the head and the tail of the descendant list respectively, the CPU using a tree-parallel strategy in the initial stage and then switching to a node-parallel strategy.
As a further improvement of the method for cooperatively processing data based on the CPU + GPU heterogeneous platform, when the root subtree is calculated, the waiting mode is spin-wait.
As a further improvement of the method for cooperatively processing data based on the CPU + GPU heterogeneous platform, the Schur complement of a descendant is calculated only after its decomposition completes, and if the current descendant has not finished while the next descendant has, the positions of the two descendants are exchanged.
The invention not only realizes a sparse linear solving method based on CPU/GPU heterogeneous parallelism, but also provides a highly robust heterogeneous load balancing strategy that lets every device solve efficiently in parallel, fully utilizing the computing resources of multiple devices to solve linear equations rapidly.
Aiming at the demand in many fields for fast solution of large-scale sparse linear equations, the CPU/GPU heterogeneous hybrid computing method is realized on the basis of a task-parallel solving strategy over the elimination tree, and a highly robust task allocation model is established to achieve load balance among devices. This realizes efficient solution of large-scale sparse linear equations and meets the fast-solving requirements of sparse linear systems in fields such as finite element simulation analysis, power network simulation, computational fluid dynamics, and weather prediction.
Drawings
FIG. 1 is a diagram of the size distribution of the sub-matrices in the algebraic computation;
wherein (a) is a schematic diagram of the gemm-k size distribution and (b) is a schematic diagram of the syrk-n size distribution.
FIG. 2 is a schematic diagram of a GPU-based subtree computation flow.
FIG. 3 is a schematic diagram of a CPU-based subtree computation strategy.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in FIGS. 1-3, the invention uses the elimination tree structure formed by the sparsity pattern of a sparse matrix to decompose the task for parallel computation, dividing the branches of the complete elimination tree into multiple subtrees to guarantee computational independence between subtrees; optimal distribution of the computation tasks among all devices is ensured by establishing a task allocation planning model, realizing efficient heterogeneous computation; in GPU computation, merging kernel functions avoids a large amount of launch latency and improves computational efficiency, while overlapping computation with data transfer reduces the cost of data movement between devices; an efficient CPU parallel computing method is realized by combining tree-parallel and node-parallel strategies; and spin waiting avoids synchronization operations in the parallel computation of the root subtree, further improving computational efficiency.
The invention solves linear algebra problems in high-performance numerical computation and is a heterogeneous adaptive hybrid computing method for solving large-scale sparse linear systems of equations. Aiming at the demand in many fields for fast solution of large-scale sparse linear equations, it realizes a CPU/GPU heterogeneous hybrid computing method on the basis of a task-parallel solving strategy over the elimination tree and establishes a highly robust task allocation model to achieve load balance among devices, realizing efficient solution of large-scale sparse linear equations and meeting the fast-solving requirements of sparse linear systems in fields such as finite element simulation analysis, power network simulation, computational fluid dynamics, and weather prediction.
The invention mainly comprises four basic steps:
Step 1: load balancing among devices, with the following specific steps:
When mapping subtrees to devices, the invention considers the problem of load balancing, which is very important for the overall performance. If all devices were identical, task mapping based only on the floating point operation count of each subtree would achieve relative load balance. The problem becomes complicated, however, when the computing power of the devices differs, especially in mixed CPU and GPU computation (the CPU is also regarded as a device): the invention must accurately evaluate the subtree decomposition performance of the different devices, i.e., their floating point operations per second. A simple approach is to use a device's double-precision floating-point computation peak to represent its computing power, but this often leads to unbalanced task allocation, because different devices reach different percentages of their peak performance in actual computation. To overcome this problem, the invention uses subtrees of different sizes as benchmarks to evaluate the actual floating point operations per second (flops) of the CPU and GPU, and generates coefficients from the test results to correct the theoretical flops:
flops_theory = c · np · d    (1)

flops = ε · flops_theory = ε · c · np · d

where c is the clock frequency, np is the number of processors, d represents the number of double-precision instructions per cycle, and ε is the correction coefficient.

Thereafter, the present invention uses a greedy policy to establish the mapping between tasks and devices: the subtrees are sorted in descending order by flop count, and each task is then assigned to the device that currently has the least work.
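By way of illustration only, the corrected-flops weighting and the greedy mapping can be sketched in C++ as follows; the Device structure and all function and variable names here are hypothetical, not part of the patent:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Device {
    double flops;   // corrected flops = epsilon * c * np * d, per formula (1)
    double load;    // work already assigned, expressed in seconds
};

// Greedy mapping: sort subtrees by flop count in descending order, then
// always hand the next subtree to the device whose assigned work (in time
// units) is currently smallest. Returns assignment[j] = device of subtree j.
std::vector<std::size_t> greedyAssign(std::vector<Device>& devices,
                                      const std::vector<double>& subtreeFlop) {
    std::vector<std::size_t> order(subtreeFlop.size());
    for (std::size_t j = 0; j < order.size(); ++j) order[j] = j;
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return subtreeFlop[a] > subtreeFlop[b];   // descending by flop
    });

    std::vector<std::size_t> assignment(subtreeFlop.size());
    for (std::size_t j : order) {
        std::size_t best = 0;                     // least-loaded device
        for (std::size_t i = 1; i < devices.size(); ++i)
            if (devices[i].load < devices[best].load) best = i;
        devices[best].load += subtreeFlop[j] / devices[best].flops;
        assignment[j] = best;
    }
    return assignment;
}
```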
However, when the number of subtrees is small or the computing power of the devices differs greatly, this strategy is likely to cause load imbalance among the devices and further reduce the overall computing performance. Therefore, in order to achieve the highest computational efficiency under any circumstances, the invention establishes a zero-one programming mathematical model to achieve the goal:
min T
s.t. t_i = (1 / p_i) · Σ_{j=1}^{N} x_{ij} · w_j,  i = 1, …, M
     T = max{ t_1, …, t_M }
     Σ_{i=1}^{M} x_{ij} = 1,  j = 1, …, N
     x_{ij} ∈ {0, 1}    (2)

where M is the number of devices, N is the number of subtrees, p_i denotes the computing performance of device i, w_j is the computation amount of the j-th subtree, x_{ij} is a zero-one variable indicating whether subtree j is computed by device i, T represents the total computation time of the system, and t_i is the computation time of device i.
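The patent does not say how model (2) is solved; purely as a minimal illustration of the objective, the sketch below enumerates all M^N assignments exactly, which is only feasible for very small instances. The function name, the arrays p and w, and the base-M encoding are assumptions of this sketch:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Exhaustive solution of model (2) for tiny M (devices) and N (subtrees):
// minimize T = max_i t_i with t_i = (1 / p[i]) * sum_j x[i][j] * w[j],
// where each subtree j is assigned to exactly one device. Each of the
// M^N possible assignments is encoded as a base-M number.
double solveZeroOne(const std::vector<double>& p,   // device performance p_i
                    const std::vector<double>& w,   // subtree work w_j
                    std::vector<int>& bestOwner) {  // out: device of subtree j
    const std::size_t M = p.size(), N = w.size();
    unsigned long long total = 1;
    for (std::size_t j = 0; j < N; ++j) total *= M;   // M^N assignments

    double bestT = std::numeric_limits<double>::infinity();
    std::vector<int> owner(N, 0);
    for (unsigned long long code = 0; code < total; ++code) {
        unsigned long long c = code;
        for (std::size_t j = 0; j < N; ++j) {
            owner[j] = static_cast<int>(c % M);
            c /= M;
        }
        std::vector<double> t(M, 0.0);                // per-device time t_i
        for (std::size_t j = 0; j < N; ++j) t[owner[j]] += w[j] / p[owner[j]];
        const double T = *std::max_element(t.begin(), t.end());
        if (T < bestT) { bestT = T; bestOwner = owner; }
    }
    return bestT;   // minimal total computation time T
}
```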
Step 2: the GPU-based subtree calculation comprises the following specific steps:
A common method when using GPU computing is to use multiple streams to achieve computational overlap between kernels, but it has significant limitations when applied here. FIG. 1 shows the distribution of the dense sub-matrix sizes: most of them are not large enough, which would leave a large number of GPU threads idle. Furthermore, about 900,000 syrk (symmetric rank-k update) operations and 500,000 gemm (general matrix multiplication) operations shown in FIG. 1 need to be called, and launching such a large number of kernel functions also causes additional overhead, further reducing the computational efficiency of the GPU.
To overcome the above limitations, the invention implements a batch kernel strategy to save the time of multiple launches, i.e., one kernel is used to perform the same dense algebraic operation on many sub-matrices. Note, however, that the batched API provided by the CUDA (Compute Unified Device Architecture) general parallel computing architecture is only suitable for matrices of the same size, while the sub-matrices here differ in size, so the invention implements batched dense kernel functions for matrices of different sizes on the GPU. Furthermore, for some problem sizes it can be more advantageous to call an API multiple times in different CUDA streams than to use a batch kernel. The invention therefore sets a size threshold according to test results to divide all sub-matrices into two sets: the set whose sizes exceed the threshold uses a general kernel function invoked multiple times in different CUDA streams (a CUDA stream represents a GPU operation queue), and the other set uses a batch kernel function.
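A rough host-side sketch of this dispatch, under stated assumptions: sub-matrices at or above the threshold are routed to cuBLAS syrk calls rotated over several CUDA streams, while the rest are gathered into one batched launch whose blocks each handle a sub-matrix of its own size. The SubMat descriptor, the simplified kernel, and all names are illustrative; the patent's actual variable-size batched kernels are not disclosed:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// One sub-matrix update C -= A * A^T, column-major; C is n x n (lower
// triangle referenced), A is n x k, both with leading dimension n.
struct SubMat { const double* A; double* C; int n; int k; };

// Batched kernel for the small set: block b processes sub-matrix b with a
// 16x16 thread tile strided over the lower triangle, so matrices in one
// batch may all have different sizes.
__global__ void batchedSyrkKernel(const SubMat* mats, int count) {
    if (blockIdx.x >= count) return;
    const SubMat m = mats[blockIdx.x];
    for (int r = threadIdx.y; r < m.n; r += blockDim.y)
        for (int c = threadIdx.x; c <= r; c += blockDim.x) {
            double acc = 0.0;
            for (int t = 0; t < m.k; ++t)
                acc += m.A[r + t * m.n] * m.A[c + t * m.n];
            m.C[r + c * m.n] -= acc;
        }
}

// Dispatch: large sub-matrices go through cuBLAS syrk, one call per matrix
// rotated over several streams; small ones go into one batched launch.
// d_smallBatch is a preallocated device buffer for the small descriptors.
void dispatchSyrk(cublasHandle_t handle, const std::vector<SubMat>& all,
                  int threshold, const std::vector<cudaStream_t>& streams,
                  SubMat* d_smallBatch) {
    std::vector<SubMat> small;
    const double alpha = -1.0, beta = 1.0;   // C = beta*C + alpha*A*A^T
    int s = 0;
    for (const SubMat& m : all) {
        if (m.n >= threshold) {
            cublasSetStream(handle, streams[s++ % streams.size()]);
            cublasDsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                        m.n, m.k, &alpha, m.A, m.n, &beta, m.C, m.n);
        } else {
            small.push_back(m);
        }
    }
    if (!small.empty()) {
        cudaMemcpyAsync(d_smallBatch, small.data(),
                        small.size() * sizeof(SubMat),
                        cudaMemcpyHostToDevice, streams[0]);
        batchedSyrkKernel<<<(unsigned)small.size(), dim3(16, 16), 0,
                            streams[0]>>>(d_smallBatch, (int)small.size());
    }
}
```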
In FIG. 2, Blas_syrk, Blas_gemm, Cusolver_potrf, Blas_trsm, and Large_update respectively denote the conventional kernel functions for the symmetric rank-k update of a matrix, matrix multiplication, Cholesky decomposition, triangular solve, and descendant update, and the corresponding "Batch" prefix denotes the batch kernel function; CopyLx_D2P denotes a zero-copy operation from the GPU video memory to the host page-locked memory, and cudaMemsetAsync denotes an asynchronous initialization operation.
FIG. 2 illustrates the flow of batch computation of the nodes within a level; bars at different heights represent different CUDA streams, and the black dashed lines represent synchronization operations between streams. During computation, data transfer is divided into two types: one transfers decomposed node data from the GPU video memory to the host page-locked memory, for which the invention uses the zero-copy memory method; the other copies data from the page-locked memory to conventional pageable memory through asynchronous transfer overlapped with kernel execution.
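A minimal sketch of the two transfer types using the standard CUDA runtime API (all structure and function names here are assumptions; error handling is omitted): type 1 maps page-locked host memory into the device address space so a kernel can write decomposed node data straight to the host, and type 2 enqueues a host function on the stream so the page-locked buffer drains into pageable memory while kernels in other streams continue running:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstring>

// Payload for the host callback: copy one node's data out of page-locked
// memory into ordinary pageable memory.
struct CopyJob { const double* pinned; double* pageable; size_t count; };

void CUDART_CB drainToPageable(void* userData) {
    CopyJob* job = static_cast<CopyJob*>(userData);
    std::memcpy(job->pageable, job->pinned, job->count * sizeof(double));
    delete job;
}

// Type 1: zero-copy. Allocate mapped page-locked memory and fetch its device
// alias, so a kernel can write decomposed node data directly into host
// memory. (Some platforms require cudaSetDeviceFlags(cudaDeviceMapHost)
// before the CUDA context is created.)
double* allocZeroCopy(size_t count, double** devicePtr) {
    double* hostPinned = nullptr;
    cudaHostAlloc(&hostPinned, count * sizeof(double), cudaHostAllocMapped);
    cudaHostGetDevicePointer(devicePtr, hostPinned, 0);
    return hostPinned;
}

// Type 2: asynchronous drain. After the kernel filling `pinned` has been
// enqueued on `stream`, enqueue a host function that moves the result into
// pageable memory; kernels running in other streams overlap with this copy.
void enqueueDrain(cudaStream_t stream, const double* pinned,
                  double* pageable, size_t count) {
    cudaLaunchHostFunc(stream, drainToPageable,
                       new CopyJob{pinned, pageable, count});
}
```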
Step 3: CPU-based subtree parallel computation, with the following specific steps:
In many jobs that use GPUs for accelerated computing, the CPU only controls task scheduling and launches kernel functions without participating in the intensive computation, which wastes CPU computing resources. Therefore, to fully utilize the computing resources of all devices, the CPU also participates in the computation as a device when the subtree numerical decomposition is executed, reducing the overall computation time. Similar to the subtree parallel computing strategy on the GPU, the CPU also numerically decomposes the nodes within a level in batches to avoid excessive memory requirements, and the number of nodes actually processed simultaneously on the CPU is further limited to at most the maximum number of threads.
Considering the situation shown in FIG. 1, the invention adopts two different parallel strategies to obtain better performance. A threshold is again set according to the test results to divide each batch of sub-matrices into two sets: the set whose sizes exceed the threshold obtains parallelism through multi-threaded kernel functions, while the other set realizes parallel computation between sub-matrices through serial kernel functions.
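A compact sketch of the two CPU strategies with OpenMP; the Task abstraction and the run(nthreads) callback are stand-ins for the actual dense kernels, which the patent does not spell out:

```cpp
#include <omp.h>
#include <vector>

// One pending sub-matrix update: n is the matrix order; run(nthreads)
// stands in for the actual dense kernel (syrk, gemm, ...).
struct Task {
    int n;
    void (*run)(int nthreads);
};

// Two CPU-side strategies for one batch of sub-matrix updates:
//  - large set: one matrix at a time, parallelism *inside* the kernel;
//  - small set: matrices processed concurrently, each with a serial kernel.
void runBatchOnCpu(const std::vector<Task>& batch, int threshold) {
    std::vector<Task> large, small;
    for (const Task& t : batch)
        (t.n >= threshold ? large : small).push_back(t);

    for (const Task& t : large)            // multi-threaded kernel per matrix
        t.run(omp_get_max_threads());

    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < (long)small.size(); ++i)
        small[i].run(1);                   // serial kernel, many at once
}
```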
Step 4: hybrid computation of the root subtree, with the following specific steps:
Due to the limitation of device memory size, the nodes contained in the root subtree cannot be processed independently on the GPU, so another parallel strategy must be adopted to realize efficient hybrid computation on the CPU and the GPU. First, since the complete decomposition of a node requires simultaneous computation on the CPU and the GPU, the invention binds a certain number of CPU threads to each GPU according to the number of devices, and calls this combination a work group. Next, the parallelism between nodes must be determined; a simple approach is to continue using the level-parallel mode from the subtree phase, but it requires synchronization between levels, which degrades performance when multiple GPUs compute.
To overcome this limitation, the present invention implements a per-node parallel strategy to pipeline the root subtree decomposition. After the nodes of the root subtree are sorted by level, they are put into a task pool; each work group takes one node from the task pool at a time and preprocesses the node's descendants. When this is complete, all the descendants are sorted in descending order of size, which makes it convenient for the GPU and the CPU to cooperate in computing the Schur complements of the descendants in parallel: since only sufficiently large sub-matrices can be accelerated on the GPU, a size threshold is set to judge whether a matrix is suitable for GPU computation; the GPU and the CPU then start computing from the head and the tail of the descendant list respectively, the CPU using a tree-parallel strategy in the initial stage and afterwards switching to a node-parallel strategy, which further reduces the total computation time.
The algorithm of the invention uses spin waiting instead of synchronization between levels to guarantee the correctness of the computation: the Schur complement of a descendant can only be computed after its decomposition completes. To avoid overly long waits, the invention adds a swap strategy: if the current descendant has not finished but the next descendant has, their positions are exchanged. In addition, to reduce the communication overhead between the GPU and the host, multiple cache buffers are allocated to hide the data copy from conventional host memory to page-locked memory, and the data transfer from page-locked memory to device video memory is overlapped through multiple CUDA streams.
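The spin-wait and swap strategy can be sketched with C++ atomics as follows; the per-descendant flag array and the schurUpdate callback are assumptions of this illustration, not the patent's interface:

```cpp
#include <atomic>
#include <cstddef>
#include <utility>
#include <vector>

// Spin-wait with a swap strategy: a work group walks its list of descendants
// in order; each descendant's Schur-complement update may start only after
// that descendant's decomposition has completed (flag set by another group).
// Instead of a blocking barrier, the group spins, and if the *next*
// descendant finishes first, the two are swapped to avoid a long wait.
void processDescendants(std::vector<int>& order,
                        std::vector<std::atomic<bool>>& decomposed,
                        void (*schurUpdate)(int node)) {
    for (std::size_t i = 0; i < order.size(); ++i) {
        while (!decomposed[order[i]].load(std::memory_order_acquire)) {
            if (i + 1 < order.size() &&
                decomposed[order[i + 1]].load(std::memory_order_acquire)) {
                // The next descendant is ready first: swap it to the front.
                // The loop condition is re-checked and now succeeds.
                std::swap(order[i], order[i + 1]);
            }
        }
        schurUpdate(order[i]);   // safe: order[i] is fully decomposed
    }
}
```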
The foregoing is a further detailed description of the invention in connection with specific preferred embodiments, and it is not intended to limit the invention to the specific embodiments described. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications can be made without departing from the spirit of the invention, and all such substitutions and modifications shall be considered within the protection scope of the invention.

Claims (5)

1. A method for cooperatively processing data based on a CPU + GPU heterogeneous platform is characterized by comprising the following steps:
S1: decomposing the calculation task into a plurality of independent branch calculation tasks by using an elimination tree structure formed by a sparse matrix pattern, and representing the calculation tasks by using a plurality of subtrees into which the branches of the elimination tree are divided;
s2: evaluating the subtree decomposition calculation performance of a plurality of devices, wherein the devices comprise a CPU (central processing unit) and a GPU (graphics processing unit), and establishing the allocation relationship between the subtrees and the devices;
s3: calculating the corresponding subtrees on each device;
s4: if the subtrees comprise a root subtree, calculating the root subtree on the corresponding devices;
in S2:
s2.1: evaluating the actual floating point operations per second (flops) of the CPU and the GPU by using subtrees of different sizes as a reference, and generating coefficients according to the evaluation results to correct the theoretical flops:
flops_theory = c · np · d    (1)

flops = ε · flops_theory = ε · c · np · d

where c is the clock frequency, np is the number of processors, d represents the number of double-precision instructions per cycle, and ε is a correction coefficient;
s2.2: assigning a sub-tree representing a branch computation task to a corresponding device;
in S2.2:
mapping between the branch calculation tasks and the devices is realized by using a greedy strategy;
sorting the subtrees in descending order according to floating point operation count (flop), and then each time assigning a branch calculation task to the device that currently has the least work;
or:
establishing a zero-one programming mathematical model;
min T
s.t. t_i = (1 / p_i) · Σ_{j=1}^{N} x_{ij} · w_j,  i = 1, …, M
     T = max{ t_1, …, t_M }
     Σ_{i=1}^{M} x_{ij} = 1,  j = 1, …, N
     x_{ij} ∈ {0, 1}    (2)

where M is the number of devices, N is the number of subtrees, p_i denotes the computing performance of device i, w_j is the computation amount of the j-th subtree, x_{ij} is a zero-one variable indicating whether subtree j is computed by device i, T represents the total computation time of the system, and t_i is the computation time of device i;
in S3:
s3.1: in the processing of GPU-based subtree computation, implementing a batch kernel strategy so that a single kernel function performs the same dense algebraic operation on a plurality of sub-matrices;
s3.2: setting a threshold on sub-matrix size to divide all sub-matrices into two sets; the set whose sub-matrix sizes exceed the threshold is processed using a kernel function invoked multiple times in different CUDA streams; the other set is processed using a batch kernel function;
in S4:
s4.1: binding a corresponding number of CPU threads to each GPU, the combination being called a work group;
s4.2: after all the nodes in the root subtree are sorted by level, putting them into a task pool; each work group takes one node from the task pool at a time and preprocesses the node's descendants, and when this is finished all the descendants are sorted in descending order of size so that the GPU and the CPU cooperate to compute the Schur complements of the nodes in parallel;
s4.3: setting a size threshold and computing the sub-matrices larger than the threshold on the GPU; the GPU and the CPU then start computing from the head and the tail of the descendant list respectively, the CPU using a tree-parallel strategy in the initial stage and then switching to a node-parallel strategy.
2. The method for cooperative data processing based on the CPU + GPU heterogeneous platform of claim 1, wherein in S3:
s3.3: in the calculation process, data transfer is divided into two types: one transfers decomposed node data from the GPU video memory to the host page-locked memory using the zero-copy memory method; the other copies data from the page-locked memory to conventional pageable memory through asynchronous transfer overlapped with kernel execution.
3. The method for cooperative data processing based on the CPU + GPU heterogeneous platform of claim 2, wherein in S3:
s3.4: in the processing of CPU-based subtree parallel computation, a sub-matrix size threshold is likewise set to divide all sub-matrices into two sets; the set whose sub-matrix sizes exceed the threshold computes each single matrix in parallel through a multi-threaded kernel function, while the other set realizes parallel computation between sub-matrices through serial kernel functions.
4. The method for cooperative data processing based on the CPU + GPU heterogeneous platform as claimed in claim 1, wherein when the root subtree is calculated, the waiting mode is spin-wait.
5. The method for cooperative data processing based on the CPU + GPU heterogeneous platform of claim 4, wherein the Schur complement of a descendant is computed only after its decomposition completes, and if the current descendant has not finished while the next descendant has, the positions of the two descendants are exchanged.
CN202210144694.0A 2022-02-17 2022-02-17 Method for cooperatively processing data based on CPU + GPU heterogeneous platform Active CN114201287B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210144694.0A CN114201287B (en) 2022-02-17 2022-02-17 Method for cooperatively processing data based on CPU + GPU heterogeneous platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210144694.0A CN114201287B (en) 2022-02-17 2022-02-17 Method for cooperatively processing data based on CPU + GPU heterogeneous platform

Publications (2)

Publication Number Publication Date
CN114201287A CN114201287A (en) 2022-03-18
CN114201287B (en) 2022-05-03

Family

ID=80645585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210144694.0A Active CN114201287B (en) 2022-02-17 2022-02-17 Method for cooperatively processing data based on CPU + GPU heterogeneous platform

Country Status (1)

Country Link
CN (1) CN114201287B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115080913B * 2022-05-11 2024-06-21 Nuclear Power Institute of China Burnup sparse matrix solving method, system, equipment and storage medium
CN117032999B * 2023-10-09 2024-01-30 Zhejiang Lab CPU-GPU cooperative scheduling method and device based on asynchronous running
CN117311948B * 2023-11-27 2024-03-19 Hunan Maixi Software Co., Ltd. Automatic multiple substructure data processing method for heterogeneous parallelism of CPU and GPU

Citations (4)

Publication number Priority date Publication date Assignee Title
CN105576648A (en) * 2015-11-23 2016-05-11 中国电力科学研究院 Static security analysis double-layer parallel method based on GPU-CUP heterogeneous computing platform
EP3343392A1 (en) * 2016-12-31 2018-07-04 INTEL Corporation Hardware accelerator architecture and template for web-scale k-means clustering
US10127499B1 (en) * 2014-08-11 2018-11-13 Rigetti & Co, Inc. Operating a quantum processor in a heterogeneous computing architecture
WO2021155329A1 (en) * 2020-01-31 2021-08-05 Cytel Inc. Trial design platform

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US9600277B2 (en) * 2014-02-21 2017-03-21 International Business Machines Corporation Asynchronous cleanup after a peer-to-peer remote copy (PPRC) terminate relationship operation
US9900378B2 (en) * 2016-02-01 2018-02-20 Sas Institute Inc. Node device function and cache aware task assignment

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
US10127499B1 (en) * 2014-08-11 2018-11-13 Rigetti & Co, Inc. Operating a quantum processor in a heterogeneous computing architecture
CN105576648A (en) * 2015-11-23 2016-05-11 中国电力科学研究院 Static security analysis double-layer parallel method based on GPU-CUP heterogeneous computing platform
EP3343392A1 (en) * 2016-12-31 2018-07-04 INTEL Corporation Hardware accelerator architecture and template for web-scale k-means clustering
WO2021155329A1 (en) * 2020-01-31 2021-08-05 Cytel Inc. Trial design platform

Non-Patent Citations (3)

Title
Wang Qing (王庆), "Progress in the time-domain full waveform inversion", Progress in Geophysics, 2015-12-31, full text *
Zhang Chengeng (张宸赓), "Research on batch computing methods for GPU-based power grid static security analysis and sensitivity analysis", China Masters' Theses Full-text Database (Engineering Science and Technology II), 2021-03-31, full text *
Cao Ronghui (曹嵘晖), "Key technologies and applications of distributed parallel computing for machine learning", CAAI Transactions on Intelligent Systems, 2021-09-30, full text *

Also Published As

Publication number Publication date
CN114201287A (en) 2022-03-18

Similar Documents

Publication Publication Date Title
CN114201287B (en) Method for cooperatively processing data based on CPU + GPU heterogeneous platform
Xiao et al. Caspmv: A customized and accelerative spmv framework for the sunway taihulight
Cevahir et al. High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning
CN104714850B (en) A kind of isomery based on OPENCL calculates equalization methods jointly
CN107239823A (en) A kind of apparatus and method for realizing sparse neural network
Hadri et al. Tile QR factorization with parallel panel processing for multicore architectures
Jo et al. Accelerating LINPACK with MPI-OpenCL on clusters of multi-GPU nodes
CN107451097B (en) High-performance implementation method of multi-dimensional FFT on domestic Shenwei 26010 multi-core processor
Behrens et al. Efficient SIMD Vectorization for Hashing in OpenCL.
Ezzatti et al. High performance matrix inversion on a multi-core platform with several GPUs
Dzafic et al. High performance power flow algorithm for symmetrical distribution networks with unbalanced loading
CN111428192A (en) Method and system for optimizing high performance computational architecture sparse matrix vector multiplication
Mukunoki et al. Implementation and evaluation of triple precision BLAS subroutines on GPUs
Gupta A shared-and distributed-memory parallel general sparse direct solver
CN104615516A (en) Method for achieving large-scale high-performance Linpack testing benchmark for GPDSP
CN107256203A (en) The implementation method and device of a kind of matrix-vector multiplication
Wan et al. A novel cooperative accelerated parallel two-list algorithm for solving the subset-sum problem on a hybrid CPU–GPU cluster
Ltaief et al. Hybrid multicore cholesky factorization with multiple gpu accelerators
Stricker Supporting the hypercube programming model on mesh architectures: (a fast sorter for iWarp tori)
Aghapour Integrated ARM big
Tichy Parallel matrix multiplication on the connection machine
KR20220002284A (en) Apparatus and method for dynamically optimizing parallel computations
Karypis et al. Efficient parallel mappings of a dynamic programming algorithm: a summary of results
Takahashi et al. Performance of the block Jacobi method for the symmetric eigenvalue problem on a modern massively parallel computer
Wang et al. Fine-grained heterogeneous parallel direct solver for finite element problems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant