CN114201287B - Method for cooperatively processing data based on CPU + GPU heterogeneous platform

Method for cooperatively processing data based on CPU + GPU heterogeneous platform

Info

Publication number
CN114201287B
CN114201287B (application CN202210144694.0A)
Authority
CN
China
Prior art keywords
cpu
calculation
gpu
subtrees
sub
Prior art date
Legal status
Active
Application number
CN202210144694.0A
Other languages
Chinese (zh)
Other versions
CN114201287A (en)
Inventor
王宇杰 (Wang Yujie)
范长伟 (Fan Changwei)
Current Assignee
Hunan Maixi Software Co ltd
Original Assignee
Hunan Maixi Software Co ltd
Priority date
Filing date
Publication date
Application filed by Hunan Maixi Software Co ltd filed Critical Hunan Maixi Software Co ltd
Priority to CN202210144694.0A priority Critical patent/CN114201287B/en
Publication of CN114201287A publication Critical patent/CN114201287A/en
Application granted granted Critical
Publication of CN114201287B publication Critical patent/CN114201287B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to a method for cooperatively processing data based on a CPU + GPU heterogeneous platform, which specifically comprises the following steps: S1: decomposing the calculation task into a plurality of independent branch calculation tasks by using the elimination tree structure formed by the sparsity pattern of a sparse matrix, and representing the calculation tasks by the subtrees into which the branches of the elimination tree are divided; S2: evaluating the subtree decomposition calculation performance of a plurality of devices, the devices comprising a CPU (central processing unit) and a GPU (graphics processing unit), and establishing the allocation relationship between subtrees and devices; S3: calculating the corresponding subtrees on each device; S4: if the subtrees comprise a root subtree, calculating the root subtree on the corresponding devices. The invention exploits cooperative computation and mutual acceleration between the CPU and the GPU, thereby breaking through the bottleneck of CPU development and effectively addressing problems such as wasted energy consumption.

Description

Method for cooperatively processing data based on CPU + GPU heterogeneous platform
Technical Field
The invention relates to the field of data processing of heterogeneous platforms, in particular to a method for cooperatively processing data based on a CPU + GPU heterogeneous platform.
Background
Heterogeneous computing, which builds compute units from different types of instruction sets and architectures, is seen as a third era of computer processors following the single-core and multi-core eras. It matters especially for solving large-scale sparse linear systems, which is becoming a key computational step in many engineering problems such as implicit finite element simulation analysis, power network simulation, computational fluid dynamics, and weather prediction. Taking finite element simulation analysis as an example, solving the sparse linear equation system often occupies 80% of the whole computation time, so the solution efficiency of the linear system directly determines the computational performance of the entire finite element analysis. Moreover, with the rapid development of industry, the scale of finite element simulation grows ever larger and the corresponding computation increases explosively, so an efficient linear solver is crucial for improving simulation efficiency, shortening simulation time, and accelerating the iteration of industrial products.
The closest prior art, the MKL Pardiso sparse linear equation solver, contains solution algorithms for various types of equations and implements a shared-memory multithreaded parallel computing method. However, it only accelerates computation in parallel on the CPU and does not fully utilize the computing capacity of GPU devices in a CPU/GPU heterogeneous parallel computing method, so it can hardly meet the current demand for quickly solving large-scale sparse linear systems; devices are easily left idle and energy consumption is wasted.
Disclosure of Invention
The invention aims to provide a method that realizes cooperative computation and mutual acceleration between a CPU and a GPU, thereby breaking through the bottleneck of CPU development and effectively addressing problems such as energy consumption.
In order to achieve the purpose, the method for cooperatively processing data based on the CPU + GPU heterogeneous platform specifically comprises the following steps:
S1: decomposing the calculation task into a plurality of independent branch calculation tasks by using the elimination tree structure formed by the sparsity pattern of a sparse matrix, and representing the calculation tasks by the subtrees into which the branches of the elimination tree are divided;
S2: evaluating the subtree decomposition calculation performance of a plurality of devices, wherein the devices comprise a CPU (central processing unit) and a GPU (graphics processing unit), and establishing the allocation relationship between the subtrees and the devices;
S3: calculating the corresponding subtrees on each device;
S4: if the subtrees comprise a root subtree, calculating the root subtree on the corresponding devices.
As a further improvement of the method for cooperatively processing data based on the CPU + GPU heterogeneous platform, in S2.1: evaluating the actual floating point operations per second (flops) of the CPU and the GPU by using subtrees of different sizes as a reference, and generating coefficients according to the evaluation results to correct the theoretical flops:
flops_theory = c · np · d    (1)

flops = ε · flops_theory = ε · c · np · d

where c is the clock frequency, np is the number of processors, d represents the number of double-precision instructions per cycle, and ε is the correction coefficient;
s2.2: the subtrees representing the branch calculation tasks are assigned to the respective devices.
As a further improvement of the method for the cooperative processing of data based on the CPU + GPU heterogeneous platform, in S2.2:
Mapping between the branch calculation tasks and the devices is realized by using a greedy strategy;
the subtrees are sorted in descending order by floating point operation count (flop), and each branch computation task is then assigned to the device that currently has the least work.
As a further improvement of the method for the cooperative processing of data based on the CPU + GPU heterogeneous platform of the present invention,
in S2.2:
establishing a zero-one programming mathematical model:

min T
s.t. t_i = (1 / p_i) · Σ_{j=1}^{N} x_{ij} · w_j,  i = 1, …, M
     T = max{ t_1, …, t_M }
     Σ_{i=1}^{M} x_{ij} = 1,  j = 1, …, N
     x_{ij} ∈ {0, 1}    (2)

where M is the number of devices, N is the number of subtrees, p_i is the computing performance of device i, w_j is the computation amount of the j-th subtree, x_{ij} is a zero-one variable indicating whether subtree j is computed by device i, T represents the total computation time of the system, and t_i is the computation time of device i.
As a further improvement of the method for cooperative processing of data based on the CPU + GPU heterogeneous platform of the present invention, in S3:
S3.1: in GPU-based subtree computation, a batch kernel strategy is implemented so that a single kernel function performs the same dense algebraic operation on a plurality of sub-matrices; the nodes in the tree, called supernodes, are obtained by merging several columns, and a sub-matrix is a small dense matrix participating in the computation within a supernode.
S3.2: a threshold on sub-matrix size is set to divide all sub-matrices into two sets; the set whose sub-matrix sizes exceed the threshold is processed with a kernel function invoked multiple times in different CUDA streams; the other set is processed with a batch kernel function.
As a further improvement of the method for the cooperative processing of data based on the CPU + GPU heterogeneous platform, in S3:
S3.3: during computation, data transfer is divided into two types: one transfers decomposed node data from the GPU video memory to the host page-locked memory using the zero-copy memory method; the other copies data from the page-locked memory to conventional pageable memory, overlapping the asynchronous transfer with kernel execution.
As a further improvement of the method for cooperative processing of data based on the CPU + GPU heterogeneous platform of the present invention, in S3:
S3.4: in CPU-based subtree parallel computation, a sub-matrix size threshold is likewise set to divide all sub-matrices into two sets; the set whose sub-matrix sizes exceed the threshold computes each single matrix in parallel through a multi-threaded kernel function, while the other set realizes parallel computation between sub-matrices through serial kernel functions.
As a further improvement of the method for cooperative processing of data based on the CPU + GPU heterogeneous platform of the present invention, in S4:
S4.1: binding a corresponding number of CPU threads to each GPU, the combination being called a work group;
S4.2: after all the nodes in the root subtree are sorted by level, they are put into a task pool; each work group takes one node from the task pool at a time and preprocesses the node's descendants, and when this is finished all the descendants are sorted in descending order of size so that the GPU and the CPU cooperate to compute the Schur complements of the nodes in parallel;
S4.3: setting a size threshold and computing the sub-matrices larger than the threshold on the GPU; the GPU and the CPU then start computing from the head and the tail of the descendant list respectively, the CPU using a tree-parallel strategy in the initial stage and then switching to a node-parallel strategy.
As a further improvement of the method for cooperatively processing data based on the CPU + GPU heterogeneous platform, when the root subtree is calculated, the waiting mode is spin-wait.
As a further improvement of the method for cooperatively processing data based on the CPU + GPU heterogeneous platform, the Schur complement of a descendant is calculated only after its decomposition completes, and if the current descendant has not finished while the next descendant has, the positions of the two descendants are exchanged.
The invention not only realizes a sparse linear solving method based on CPU/GPU heterogeneous parallelism, but also provides a highly robust heterogeneous load balancing strategy that lets every device solve efficiently in parallel, fully utilizing the computing resources of multiple devices to solve linear equations rapidly.
Aiming at the demand in many fields for fast solution of large-scale sparse linear equations, the CPU/GPU heterogeneous hybrid computing method is realized on the basis of a task-parallel solving strategy over the elimination tree, and a highly robust task allocation model is established to achieve load balance among devices. This realizes efficient solution of large-scale sparse linear equations and meets the fast-solving requirements of sparse linear systems in fields such as finite element simulation analysis, power network simulation, computational fluid dynamics, and weather prediction.
Drawings
FIG. 1 is a diagram of the size distribution of the sub-matrices in the algebraic computation;
wherein (a) is a schematic diagram of the gemm-k size distribution and (b) is a schematic diagram of the syrk-n size distribution.
FIG. 2 is a schematic diagram of a GPU-based subtree computation flow.
FIG. 3 is a schematic diagram of a CPU-based subtree computation strategy.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in FIGS. 1-3, the invention uses the elimination tree structure formed by the sparsity pattern of a sparse matrix to decompose the task for parallel computation, dividing the branches of the complete elimination tree into multiple subtrees to guarantee computational independence between subtrees; optimal distribution of the computation tasks among all devices is ensured by establishing a task allocation planning model, realizing efficient heterogeneous computation; in GPU computation, merging kernel functions avoids a large amount of launch latency and improves computational efficiency, while overlapping computation with data transfer reduces the cost of data movement between devices; an efficient CPU parallel computing method is realized by combining tree-parallel and node-parallel strategies; and spin waiting avoids synchronization operations in the parallel computation of the root subtree, further improving computational efficiency.
The invention solves linear algebra problems in high-performance numerical computation and is a heterogeneous adaptive hybrid computing method for solving large-scale sparse linear systems of equations. Aiming at the demand in many fields for fast solution of large-scale sparse linear equations, it realizes a CPU/GPU heterogeneous hybrid computing method on the basis of a task-parallel solving strategy over the elimination tree and establishes a highly robust task allocation model to achieve load balance among devices, realizing efficient solution of large-scale sparse linear equations and meeting the fast-solving requirements of sparse linear systems in fields such as finite element simulation analysis, power network simulation, computational fluid dynamics, and weather prediction.
The invention mainly comprises four basic steps:
Step 1: load balancing among devices, with the following specific steps:
When mapping subtrees to devices, the invention considers the problem of load balancing, which is very important for the overall performance. If all devices were identical, task mapping based only on the floating point operation count of each subtree would achieve relative load balance. The problem becomes complicated, however, when the computing power of the devices differs, especially in mixed CPU and GPU computation (the CPU is also regarded as a device): the invention must accurately evaluate the subtree decomposition performance of the different devices, i.e., their floating point operations per second. A simple approach is to use a device's double-precision floating-point computation peak to represent its computing power, but this often leads to unbalanced task allocation, because different devices reach different percentages of their peak performance in actual computation. To overcome this problem, the invention uses subtrees of different sizes as benchmarks to evaluate the actual floating point operations per second (flops) of the CPU and GPU, and generates coefficients from the test results to correct the theoretical flops:
flops_theory = c · np · d    (1)

flops = ε · flops_theory = ε · c · np · d

where c is the clock frequency, np is the number of processors, d represents the number of double-precision instructions per cycle, and ε is the correction coefficient.

Thereafter, the present invention uses a greedy policy to establish the mapping between tasks and devices: the subtrees are sorted in descending order by flop count, and each task is then assigned to the device that currently has the least work.
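By way of illustration only, the corrected-flops weighting and the greedy mapping can be sketched in C++ as follows; the Device structure and all function and variable names here are hypothetical, not part of the patent:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Device {
    double flops;   // corrected flops = epsilon * c * np * d, per formula (1)
    double load;    // work already assigned, expressed in seconds
};

// Greedy mapping: sort subtrees by flop count in descending order, then
// always hand the next subtree to the device whose assigned work (in time
// units) is currently smallest. Returns assignment[j] = device of subtree j.
std::vector<std::size_t> greedyAssign(std::vector<Device>& devices,
                                      const std::vector<double>& subtreeFlop) {
    std::vector<std::size_t> order(subtreeFlop.size());
    for (std::size_t j = 0; j < order.size(); ++j) order[j] = j;
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return subtreeFlop[a] > subtreeFlop[b];   // descending by flop
    });

    std::vector<std::size_t> assignment(subtreeFlop.size());
    for (std::size_t j : order) {
        std::size_t best = 0;                     // least-loaded device
        for (std::size_t i = 1; i < devices.size(); ++i)
            if (devices[i].load < devices[best].load) best = i;
        devices[best].load += subtreeFlop[j] / devices[best].flops;
        assignment[j] = best;
    }
    return assignment;
}
```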
However, when the number of subtrees is small or the computing power of the devices differs greatly, this strategy is likely to cause load imbalance among the devices and further reduce the overall computing performance. Therefore, in order to achieve the highest computational efficiency under any circumstances, the invention establishes a zero-one programming mathematical model to achieve the goal:
min T
s.t. t_i = (1 / p_i) · Σ_{j=1}^{N} x_{ij} · w_j,  i = 1, …, M
     T = max{ t_1, …, t_M }
     Σ_{i=1}^{M} x_{ij} = 1,  j = 1, …, N
     x_{ij} ∈ {0, 1}    (2)

where M is the number of devices, N is the number of subtrees, p_i denotes the computing performance of device i, w_j is the computation amount of the j-th subtree, x_{ij} is a zero-one variable indicating whether subtree j is computed by device i, T represents the total computation time of the system, and t_i is the computation time of device i.
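The patent does not say how model (2) is solved; purely as a minimal illustration of the objective, the sketch below enumerates all M^N assignments exactly, which is only feasible for very small instances. The function name, the arrays p and w, and the base-M encoding are assumptions of this sketch:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Exhaustive solution of model (2) for tiny M (devices) and N (subtrees):
// minimize T = max_i t_i with t_i = (1 / p[i]) * sum_j x[i][j] * w[j],
// where each subtree j is assigned to exactly one device. Each of the
// M^N possible assignments is encoded as a base-M number.
double solveZeroOne(const std::vector<double>& p,   // device performance p_i
                    const std::vector<double>& w,   // subtree work w_j
                    std::vector<int>& bestOwner) {  // out: device of subtree j
    const std::size_t M = p.size(), N = w.size();
    unsigned long long total = 1;
    for (std::size_t j = 0; j < N; ++j) total *= M;   // M^N assignments

    double bestT = std::numeric_limits<double>::infinity();
    std::vector<int> owner(N, 0);
    for (unsigned long long code = 0; code < total; ++code) {
        unsigned long long c = code;
        for (std::size_t j = 0; j < N; ++j) {
            owner[j] = static_cast<int>(c % M);
            c /= M;
        }
        std::vector<double> t(M, 0.0);                // per-device time t_i
        for (std::size_t j = 0; j < N; ++j) t[owner[j]] += w[j] / p[owner[j]];
        const double T = *std::max_element(t.begin(), t.end());
        if (T < bestT) { bestT = T; bestOwner = owner; }
    }
    return bestT;   // minimal total computation time T
}
```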
Step 2: the GPU-based subtree calculation comprises the following specific steps:
A common method when using GPU computing is to use multiple streams to achieve computational overlap between kernels, but it has significant limitations when applied here. FIG. 1 shows the distribution of the dense sub-matrix sizes: most of them are not large enough, which would leave a large number of GPU threads idle. Furthermore, about 900,000 syrk (symmetric rank-k update) operations and 500,000 gemm (general matrix multiplication) operations shown in FIG. 1 need to be called, and launching such a large number of kernel functions also causes additional overhead, further reducing the computational efficiency of the GPU.
To overcome the above limitations, the invention implements a batch kernel strategy to save the time of multiple launches, i.e., one kernel is used to perform the same dense algebraic operation on many sub-matrices. Note, however, that the batched API provided by the CUDA (Compute Unified Device Architecture) general parallel computing architecture is only suitable for matrices of the same size, while the sub-matrices here differ in size, so the invention implements batched dense kernel functions for matrices of different sizes on the GPU. Furthermore, for some problem sizes it can be more advantageous to call an API multiple times in different CUDA streams than to use a batch kernel. The invention therefore sets a size threshold according to test results to divide all sub-matrices into two sets: the set whose sizes exceed the threshold uses a general kernel function invoked multiple times in different CUDA streams (a CUDA stream represents a GPU operation queue), and the other set uses a batch kernel function.
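A rough host-side sketch of this dispatch, under stated assumptions: sub-matrices at or above the threshold are routed to cuBLAS syrk calls rotated over several CUDA streams, while the rest are gathered into one batched launch whose blocks each handle a sub-matrix of its own size. The SubMat descriptor, the simplified kernel, and all names are illustrative; the patent's actual variable-size batched kernels are not disclosed:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// One sub-matrix update C -= A * A^T, column-major; C is n x n (lower
// triangle referenced), A is n x k, both with leading dimension n.
struct SubMat { const double* A; double* C; int n; int k; };

// Batched kernel for the small set: block b processes sub-matrix b with a
// 16x16 thread tile strided over the lower triangle, so matrices in one
// batch may all have different sizes.
__global__ void batchedSyrkKernel(const SubMat* mats, int count) {
    if (blockIdx.x >= count) return;
    const SubMat m = mats[blockIdx.x];
    for (int r = threadIdx.y; r < m.n; r += blockDim.y)
        for (int c = threadIdx.x; c <= r; c += blockDim.x) {
            double acc = 0.0;
            for (int t = 0; t < m.k; ++t)
                acc += m.A[r + t * m.n] * m.A[c + t * m.n];
            m.C[r + c * m.n] -= acc;
        }
}

// Dispatch: large sub-matrices go through cuBLAS syrk, one call per matrix
// rotated over several streams; small ones go into one batched launch.
// d_smallBatch is a preallocated device buffer for the small descriptors.
void dispatchSyrk(cublasHandle_t handle, const std::vector<SubMat>& all,
                  int threshold, const std::vector<cudaStream_t>& streams,
                  SubMat* d_smallBatch) {
    std::vector<SubMat> small;
    const double alpha = -1.0, beta = 1.0;   // C = beta*C + alpha*A*A^T
    int s = 0;
    for (const SubMat& m : all) {
        if (m.n >= threshold) {
            cublasSetStream(handle, streams[s++ % streams.size()]);
            cublasDsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                        m.n, m.k, &alpha, m.A, m.n, &beta, m.C, m.n);
        } else {
            small.push_back(m);
        }
    }
    if (!small.empty()) {
        cudaMemcpyAsync(d_smallBatch, small.data(),
                        small.size() * sizeof(SubMat),
                        cudaMemcpyHostToDevice, streams[0]);
        batchedSyrkKernel<<<(unsigned)small.size(), dim3(16, 16), 0,
                            streams[0]>>>(d_smallBatch, (int)small.size());
    }
}
```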
In FIG. 2, Blas_syrk, Blas_gemm, Cusolver_potrf, Blas_trsm, and Large_update respectively denote the conventional kernel functions for the symmetric rank-k update of a matrix, matrix multiplication, Cholesky decomposition, triangular solve, and descendant update, and the corresponding "Batch" prefix denotes the batch kernel function; CopyLx_D2P denotes a zero-copy operation from the GPU video memory to the host page-locked memory, and cudaMemsetAsync denotes an asynchronous initialization operation.
FIG. 2 illustrates the flow of batch computation of the nodes within a level; bars at different heights represent different CUDA streams, and the black dashed lines represent synchronization operations between streams. During computation, data transfer is divided into two types: one transfers decomposed node data from the GPU video memory to the host page-locked memory, for which the invention uses the zero-copy memory method; the other copies data from the page-locked memory to conventional pageable memory through asynchronous transfer overlapped with kernel execution.
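A minimal sketch of the two transfer types using the standard CUDA runtime API (all structure and function names here are assumptions; error handling is omitted): type 1 maps page-locked host memory into the device address space so a kernel can write decomposed node data straight to the host, and type 2 enqueues a host function on the stream so the page-locked buffer drains into pageable memory while kernels in other streams continue running:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstring>

// Payload for the host callback: copy one node's data out of page-locked
// memory into ordinary pageable memory.
struct CopyJob { const double* pinned; double* pageable; size_t count; };

void CUDART_CB drainToPageable(void* userData) {
    CopyJob* job = static_cast<CopyJob*>(userData);
    std::memcpy(job->pageable, job->pinned, job->count * sizeof(double));
    delete job;
}

// Type 1: zero-copy. Allocate mapped page-locked memory and fetch its device
// alias, so a kernel can write decomposed node data directly into host
// memory. (Some platforms require cudaSetDeviceFlags(cudaDeviceMapHost)
// before the CUDA context is created.)
double* allocZeroCopy(size_t count, double** devicePtr) {
    double* hostPinned = nullptr;
    cudaHostAlloc(&hostPinned, count * sizeof(double), cudaHostAllocMapped);
    cudaHostGetDevicePointer(devicePtr, hostPinned, 0);
    return hostPinned;
}

// Type 2: asynchronous drain. After the kernel filling `pinned` has been
// enqueued on `stream`, enqueue a host function that moves the result into
// pageable memory; kernels running in other streams overlap with this copy.
void enqueueDrain(cudaStream_t stream, const double* pinned,
                  double* pageable, size_t count) {
    cudaLaunchHostFunc(stream, drainToPageable,
                       new CopyJob{pinned, pageable, count});
}
```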
Step 3: CPU-based subtree parallel computation, with the following specific steps:
In many jobs that use GPUs for accelerated computing, the CPU only controls task scheduling and launches kernel functions without participating in the intensive computation, which wastes CPU computing resources. Therefore, to fully utilize the computing resources of all devices, the CPU also participates in the computation as a device when the subtree numerical decomposition is executed, reducing the overall computation time. Similar to the subtree parallel computing strategy on the GPU, the CPU also numerically decomposes the nodes within a level in batches to avoid excessive memory requirements, and the number of nodes actually processed simultaneously on the CPU is further limited to at most the maximum number of threads.
Considering the situation shown in FIG. 1, the invention adopts two different parallel strategies to obtain better performance. A threshold is again set according to the test results to divide each batch of sub-matrices into two sets: the set whose sizes exceed the threshold obtains parallelism through multi-threaded kernel functions, while the other set realizes parallel computation between sub-matrices through serial kernel functions.
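A compact sketch of the two CPU strategies with OpenMP; the Task abstraction and the run(nthreads) callback are stand-ins for the actual dense kernels, which the patent does not spell out:

```cpp
#include <omp.h>
#include <vector>

// One pending sub-matrix update: n is the matrix order; run(nthreads)
// stands in for the actual dense kernel (syrk, gemm, ...).
struct Task {
    int n;
    void (*run)(int nthreads);
};

// Two CPU-side strategies for one batch of sub-matrix updates:
//  - large set: one matrix at a time, parallelism *inside* the kernel;
//  - small set: matrices processed concurrently, each with a serial kernel.
void runBatchOnCpu(const std::vector<Task>& batch, int threshold) {
    std::vector<Task> large, small;
    for (const Task& t : batch)
        (t.n >= threshold ? large : small).push_back(t);

    for (const Task& t : large)            // multi-threaded kernel per matrix
        t.run(omp_get_max_threads());

    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < (long)small.size(); ++i)
        small[i].run(1);                   // serial kernel, many at once
}
```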
Step 4: hybrid computation of the root subtree, with the following specific steps:
Due to the limitation of device memory size, the nodes contained in the root subtree cannot be processed independently on the GPU, so another parallel strategy must be adopted to realize efficient hybrid computation on the CPU and the GPU. First, since the complete decomposition of a node requires simultaneous computation on the CPU and the GPU, the invention binds a certain number of CPU threads to each GPU according to the number of devices, and calls this combination a work group. Next, the parallelism between nodes must be determined; a simple approach is to continue using the level-parallel mode from the subtree phase, but it requires synchronization between levels, which degrades performance when multiple GPUs compute.
To overcome this limitation, the present invention implements a per-node parallel strategy to pipeline the root subtree decomposition. After the nodes of the root subtree are sorted by level, they are put into a task pool; each work group takes one node from the task pool at a time and preprocesses the node's descendants. When this is complete, all the descendants are sorted in descending order of size, which makes it convenient for the GPU and the CPU to cooperate in computing the Schur complements of the descendants in parallel: since only sufficiently large sub-matrices can be accelerated on the GPU, a size threshold is set to judge whether a matrix is suitable for GPU computation; the GPU and the CPU then start computing from the head and the tail of the descendant list respectively, the CPU using a tree-parallel strategy in the initial stage and afterwards switching to a node-parallel strategy, which further reduces the total computation time.
The algorithm of the invention uses spin waiting instead of synchronization between levels to guarantee the correctness of the computation: the Schur complement of a descendant can only be computed after its decomposition completes. To avoid overly long waits, the invention adds a swap strategy: if the current descendant has not finished but the next descendant has, their positions are exchanged. In addition, to reduce the communication overhead between the GPU and the host, multiple cache buffers are allocated to hide the data copy from conventional host memory to page-locked memory, and the data transfer from page-locked memory to device video memory is overlapped through multiple CUDA streams.
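The spin-wait and swap strategy can be sketched with C++ atomics as follows; the per-descendant flag array and the schurUpdate callback are assumptions of this illustration, not the patent's interface:

```cpp
#include <atomic>
#include <cstddef>
#include <utility>
#include <vector>

// Spin-wait with a swap strategy: a work group walks its list of descendants
// in order; each descendant's Schur-complement update may start only after
// that descendant's decomposition has completed (flag set by another group).
// Instead of a blocking barrier, the group spins, and if the *next*
// descendant finishes first, the two are swapped to avoid a long wait.
void processDescendants(std::vector<int>& order,
                        std::vector<std::atomic<bool>>& decomposed,
                        void (*schurUpdate)(int node)) {
    for (std::size_t i = 0; i < order.size(); ++i) {
        while (!decomposed[order[i]].load(std::memory_order_acquire)) {
            if (i + 1 < order.size() &&
                decomposed[order[i + 1]].load(std::memory_order_acquire)) {
                // The next descendant is ready first: swap it to the front.
                // The loop condition is re-checked and now succeeds.
                std::swap(order[i], order[i + 1]);
            }
        }
        schurUpdate(order[i]);   // safe: order[i] is fully decomposed
    }
}
```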
The foregoing is a further detailed description of the invention in connection with specific preferred embodiments, and it is not intended to limit the invention to the specific embodiments described. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications can be made without departing from the spirit of the invention, and all such substitutions and modifications shall be considered within the protection scope of the invention.

Claims (5)

1. A method for cooperatively processing data based on a CPU + GPU heterogeneous platform is characterized by comprising the following steps:
S1: decomposing the calculation task into a plurality of independent branch calculation tasks by using an elimination tree structure formed by a sparse matrix pattern, and representing the calculation tasks by using a plurality of subtrees into which the branches of the elimination tree are divided;
s2: evaluating the subtree decomposition calculation performance of a plurality of devices, wherein the devices comprise a CPU (central processing unit) and a GPU (graphics processing unit), and establishing the allocation relationship between the subtrees and the devices;
s3: calculating the corresponding subtrees on each device;
s4: if the subtrees comprise a root subtree, calculating the root subtree on the corresponding devices;
in S2:
s2.1: evaluating the actual floating point operations per second (flops) of the CPU and the GPU by using subtrees of different sizes as a reference, and generating coefficients according to the evaluation results to correct the theoretical flops:
flops_theory = c · np · d    (1)

flops = ε · flops_theory = ε · c · np · d

where c is the clock frequency, np is the number of processors, d represents the number of double-precision instructions per cycle, and ε is a correction coefficient;
s2.2: assigning a sub-tree representing a branch computation task to a corresponding device;
in S2.2:
mapping between the branch calculation tasks and the devices is realized by using a greedy strategy;
sorting the subtrees in descending order according to floating point operation count (flop), and then each time assigning a branch calculation task to the device that currently has the least work;
or:
establishing a zero-one programming mathematical model;
min T
s.t. t_i = (1 / p_i) · Σ_{j=1}^{N} x_{ij} · w_j,  i = 1, …, M
     T = max{ t_1, …, t_M }
     Σ_{i=1}^{M} x_{ij} = 1,  j = 1, …, N
     x_{ij} ∈ {0, 1}    (2)

where M is the number of devices, N is the number of subtrees, p_i denotes the computing performance of device i, w_j is the computation amount of the j-th subtree, x_{ij} is a zero-one variable indicating whether subtree j is computed by device i, T represents the total computation time of the system, and t_i is the computation time of device i;
in S3:
s3.1: in the processing of GPU-based subtree computation, implementing a batch kernel strategy so that a single kernel function performs the same dense algebraic operation on a plurality of sub-matrices;
s3.2: setting a threshold on sub-matrix size to divide all sub-matrices into two sets; the set whose sub-matrix sizes exceed the threshold is processed using a kernel function invoked multiple times in different CUDA streams; the other set is processed using a batch kernel function;
in S4:
s4.1: binding a corresponding number of CPU threads to each GPU, the combination being called a work group;
s4.2: after all the nodes in the root subtree are sorted by level, putting them into a task pool; each work group takes one node from the task pool at a time and preprocesses the node's descendants, and when this is finished all the descendants are sorted in descending order of size so that the GPU and the CPU cooperate to compute the Schur complements of the nodes in parallel;
s4.3: setting a size threshold and computing the sub-matrices larger than the threshold on the GPU; the GPU and the CPU then start computing from the head and the tail of the descendant list respectively, the CPU using a tree-parallel strategy in the initial stage and then switching to a node-parallel strategy.
2. The method for cooperative data processing based on the CPU + GPU heterogeneous platform of claim 1, wherein in S3:
s3.3: in the calculation process, data transfer is divided into two types: one transfers decomposed node data from the GPU video memory to the host page-locked memory using the zero-copy memory method; the other copies data from the page-locked memory to conventional pageable memory through asynchronous transfer overlapped with kernel execution.
3. The method for cooperative data processing based on the CPU + GPU heterogeneous platform of claim 2, wherein in S3:
s3.4: in the processing of CPU-based subtree parallel computation, a sub-matrix size threshold is likewise set to divide all sub-matrices into two sets; the set whose sub-matrix sizes exceed the threshold computes each single matrix in parallel through a multi-threaded kernel function, while the other set realizes parallel computation between sub-matrices through serial kernel functions.
4. The method for cooperative data processing based on the CPU + GPU heterogeneous platform as claimed in claim 1, wherein when the root subtree is calculated, the waiting mode is spin-wait.
5. The method for cooperative data processing based on the CPU + GPU heterogeneous platform of claim 4, wherein the Schur complement of a descendant is computed only after its decomposition completes, and if the current descendant has not finished while the next descendant has, the positions of the two descendants are exchanged.
CN202210144694.0A 2022-02-17 2022-02-17 Method for cooperatively processing data based on CPU + GPU heterogeneous platform Active CN114201287B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210144694.0A CN114201287B (en) 2022-02-17 2022-02-17 Method for cooperatively processing data based on CPU + GPU heterogeneous platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210144694.0A CN114201287B (en) 2022-02-17 2022-02-17 Method for cooperatively processing data based on CPU + GPU heterogeneous platform

Publications (2)

Publication Number Publication Date
CN114201287A CN114201287A (en) 2022-03-18
CN114201287B (en) 2022-05-03

Family

ID=80645585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210144694.0A Active CN114201287B (en) 2022-02-17 2022-02-17 Method for cooperatively processing data based on CPU + GPU heterogeneous platform

Country Status (1)

Country Link
CN (1) CN114201287B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115080913B * 2022-05-11 2024-06-21 Nuclear Power Institute of China Burnup sparse matrix solving method, system, equipment and storage medium
CN117032999B * 2023-10-09 2024-01-30 Zhejiang Lab CPU-GPU cooperative scheduling method and device based on asynchronous running
CN117311948B * 2023-11-27 2024-03-19 Hunan Maixi Software Co., Ltd. Automatic multiple substructure data processing method for heterogeneous parallelism of CPU and GPU

Citations (4)

Publication number Priority date Publication date Assignee Title
CN105576648A (en) * 2015-11-23 2016-05-11 中国电力科学研究院 Static security analysis double-layer parallel method based on GPU-CUP heterogeneous computing platform
EP3343392A1 (en) * 2016-12-31 2018-07-04 INTEL Corporation Hardware accelerator architecture and template for web-scale k-means clustering
US10127499B1 (en) * 2014-08-11 2018-11-13 Rigetti & Co, Inc. Operating a quantum processor in a heterogeneous computing architecture
WO2021155329A1 (en) * 2020-01-31 2021-08-05 Cytel Inc. Trial design platform

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US9600277B2 (en) * 2014-02-21 2017-03-21 International Business Machines Corporation Asynchronous cleanup after a peer-to-peer remote copy (PPRC) terminate relationship operation
US9900378B2 (en) * 2016-02-01 2018-02-20 Sas Institute Inc. Node device function and cache aware task assignment

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
US10127499B1 (en) * 2014-08-11 2018-11-13 Rigetti & Co, Inc. Operating a quantum processor in a heterogeneous computing architecture
CN105576648A (en) * 2015-11-23 2016-05-11 中国电力科学研究院 Static security analysis double-layer parallel method based on GPU-CUP heterogeneous computing platform
EP3343392A1 (en) * 2016-12-31 2018-07-04 INTEL Corporation Hardware accelerator architecture and template for web-scale k-means clustering
WO2021155329A1 (en) * 2020-01-31 2021-08-05 Cytel Inc. Trial design platform

Non-Patent Citations (3)

Title
Wang Qing (王庆), "Progress in the time-domain full waveform inversion", Progress in Geophysics, 2015-12-31, full text *
Zhang Chengeng (张宸赓), "Research on batch computing methods for GPU-based power grid static security analysis and sensitivity analysis", China Masters' Theses Full-text Database (Engineering Science and Technology II), 2021-03-31, full text *
Cao Ronghui (曹嵘晖), "Key technologies and applications of distributed parallel computing for machine learning", CAAI Transactions on Intelligent Systems, 2021-09-30, full text *

Also Published As

Publication number Publication date
CN114201287A (en) 2022-03-18

Similar Documents

Publication Publication Date Title
CN114201287B (en) Method for cooperatively processing data based on CPU + GPU heterogeneous platform
Xiao et al. Caspmv: A customized and accelerative spmv framework for the sunway taihulight
Cevahir et al. High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning
CN104714850B (en) A kind of isomery based on OPENCL calculates equalization methods jointly
CN107239823A (en) A kind of apparatus and method for realizing sparse neural network
Hadri et al. Tile QR factorization with parallel panel processing for multicore architectures
Jo et al. Accelerating LINPACK with MPI-OpenCL on clusters of multi-GPU nodes
CN107451097B (en) High-performance implementation method of multi-dimensional FFT on domestic Shenwei 26010 multi-core processor
Behrens et al. Efficient SIMD Vectorization for Hashing in OpenCL.
Ezzatti et al. High performance matrix inversion on a multi-core platform with several GPUs
Dzafic et al. High performance power flow algorithm for symmetrical distribution networks with unbalanced loading
CN111428192A (en) Method and system for optimizing high performance computational architecture sparse matrix vector multiplication
Mukunoki et al. Implementation and evaluation of triple precision BLAS subroutines on GPUs
Gupta A shared-and distributed-memory parallel general sparse direct solver
CN104615516A (en) Method for achieving large-scale high-performance Linpack testing benchmark for GPDSP
CN107256203A (en) The implementation method and device of a kind of matrix-vector multiplication
Wan et al. A novel cooperative accelerated parallel two-list algorithm for solving the subset-sum problem on a hybrid CPU–GPU cluster
Ltaief et al. Hybrid multicore cholesky factorization with multiple gpu accelerators
Stricker Supporting the hypercube programming model on mesh architectures: (a fast sorter for iWarp tori)
Aghapour Integrated ARM big
Tichy Parallel matrix multiplication on the connection machine
KR20220002284A (en) Apparatus and method for dynamically optimizing parallel computations
Karypis et al. Efficient parallel mappings of a dynamic programming algorithm: a summary of results
Takahashi et al. Performance of the block Jacobi method for the symmetric eigenvalue problem on a modern massively parallel computer
Wang et al. Fine-grained heterogeneous parallel direct solver for finite element problems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant