CN117311948B - Automatic multiple substructure data processing method for heterogeneous parallelism of CPU and GPU - Google Patents


Info

Publication number
CN117311948B
CN117311948B (application CN202311585574.5A)
Authority
CN
China
Prior art keywords
node
matrix
cpu
gpu
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311585574.5A
Other languages
Chinese (zh)
Other versions
CN117311948A (en)
Inventor
王桂冬 (Wang Guidong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Maixi Software Co., Ltd.
Original Assignee
Hunan Maixi Software Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Maixi Software Co., Ltd.
Priority to CN202311585574.5A
Publication of CN117311948A
Application granted
Publication of CN117311948B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/48: Program initiating; program switching, e.g. by interrupt
    • G06F 9/4806: Task transfer initiation or dispatching
    • G06F 9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/5083: Techniques for rebalancing the load in a distributed system
    • G06F 9/52: Program synchronisation; mutual exclusion, e.g. by means of semaphores
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an automatic multiple substructure data processing method for heterogeneous parallelism of a CPU and a GPU, comprising: division of the demarcation line and parallel CPU subtrees; hybrid parallel node computation on the CPU and GPU; parallel computation of the reduced eigenvalue problem on the GPU; and a parallel strategy for the back-substitution conversion. The method determines the demarcation line of the elimination tree from the node distribution of the substructure elimination tree produced by the sparse matrix reordering algorithm and from the number of CPU threads. Nodes above the line are computed in a multi-node hybrid CPU and GPU parallel mode, ensuring effective utilization of computing resources as far as possible; independent subtree tasks below the line are computed in parallel on the CPU cores, with a parallel strategy designed to reduce data synchronization.

Description

Automatic multiple substructure data processing method for heterogeneous parallelism of CPU and GPU
Technical Field
The invention relates to the technical field of high-performance numerical computation in engineering, and in particular to an automatic multiple substructure data processing method for heterogeneous parallelism of a CPU and a GPU.
Background
The automatic multiple substructure method is widely applied in engineering fields such as structural dynamics analysis, structural optimization design, vibration control, electromagnetics and fluid-structure interaction, where it provides efficient and accurate dynamics analysis and optimization. Taking structural simulation analysis as an example: as industry has advanced, finite element models have grown steadily in scale, and the automatic multiple substructure method has become the primary choice for computing the eigenmodes of a model. It typically occupies more than half of the mode-solving time and therefore directly determines the computational efficiency of the whole simulation analysis. With the continued development of computer technology, using the GPU (Graphics Processing Unit) for massively parallel acceleration has become an effective option for high-performance numerical computation. At present, however, the automatic multiple substructure method mainly relies on CPU multi-core parallelism in a shared-memory environment to improve efficiency, which leaves GPU computing resources largely idle.
The SLEPc sparse eigenvalue solver, the prior art closest to the invention, provides solution algorithms for many types of eigenvalue equations, but does not provide an automatic multiple substructure algorithm for solving linear generalized eigenvalue equations. The time required to solve a finite element eigenvalue equation with SLEPc's default Krylov subspace algorithm is several times, or even tens of times, that of the automatic multiple substructure algorithm, making it difficult to meet the practical demands of large-scale model simulation.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an automatic multiple substructure data processing method for heterogeneous parallelism of a CPU and a GPU, which reasonably assigns the computational tasks of the automatic multiple substructure method to the GPU, provides a highly robust heterogeneous parallel strategy, and makes full use of GPU computing resources to accelerate the solution of generalized eigenvalue equations.
The invention provides an automatic multiple substructure data processing method for heterogeneous parallelism of a CPU and a GPU, which comprises the following steps:
step one, determining the demarcation level between subtree parallelism and node parallelism, dividing the nodes below the demarcation into mutually independent subtrees according to the elimination tree distribution, treating the conversion of each subtree as a computation task that can be solved in parallel, and letting each CPU thread obtain a preset number of subtree tasks through the dynamic thread scheduling of OpenMP;
step two, taking the computed nodes of each layer as data synchronization points; when computation starts, extracting and merging the corresponding update matrices from the left and right child nodes in multithreaded parallel at the CPU side, assembling the stiffness matrix and mass matrix of the current node, and copying the data into GPU memory using the asynchronous streams provided by CUDA; then computing the node eigenvalue equation at the CPU side with an iterative method and updating the corresponding mass matrices for the substructures that have solutions in the interval;
step three, exploiting the facts that the stiffness matrix is a diagonal matrix and the mass matrix is a sparse symmetric block matrix with unit diagonal, obtaining the CSR representation of the sparse matrix and storing it on the GPU; implementing the Arnoldi method on the GPU, using the numerical libraries provided by CUDA to compute sparse matrix-vector products and vector re-orthogonalization;
and step four, starting from the root node of the elimination tree, increasing the number of nodes from top to bottom until the required node memory exceeds the device video memory, transmitting the data to the device side in one pass through page-locked memory, and completing the entire back-substitution solving process of these nodes on the GPU.
Further, in step one, the demarcation level between subtree parallelism and node parallelism is determined by the following formula,
where Lc is the number of the decomposition layer, Nt is the number of CPU threads, nk is the number of nodes at the k-th layer of the elimination tree, k is the layer index of the elimination tree, Z is the set of integers, Vj is the number of degrees of freedom of the substructure of the j-th node at the k-th layer, and Vj of a leaf node is taken as 0.
Further, in step one, during the substructure matrix transformation, the solution of all constraint matrices and the multiplication of the connection matrices of the current node with its ancestor nodes compute only the rows and columns containing non-zero elements, and the matrix update data of all ancestor nodes are temporarily stored in the current child node object, eliminating the race problem of different threads updating the data of the same node; the stiffness and mass matrices of leaf nodes are stored and computed in the CSR sparse matrix format.
Further, in step two, the device side binds one CUDA data stream to each node, exploiting the mutual independence of same-level substructures, so as to hide the execution time of memory copies within the computation time of the kernel functions. If, while computing one layer of the elimination tree, the data of all its nodes cannot be copied to GPU memory at once, the maximum number of substructures that can be copied to GPU memory in one pass is determined and their matrix computations are started; after each node finishes, its occupied memory is released, and a suitably sized node is selected from the uncomputed nodes, its data stream bound, and its computation started.
Further, in step three, the equation Kx = λMx is solved for the eigenpairs in the interval (0, ω_max²), where the stiffness matrix and the mass matrix obtained from the conversion have the following form: the stiffness matrix is K = diag(λ1, …, λi, …, λn), and the mass matrix M is a sparse symmetric block matrix whose diagonal blocks are identity matrices and whose off-diagonal blocks are the coupling mass matrices Mij.

Here K is the stiffness matrix and M is the mass matrix; λ1, λi and λn are the 1st, i-th and n-th order eigenvalues computed during the substructure matrix conversion; n is the total number of eigenpairs computed during the conversion; Mij is the reduced coupling mass matrix of the i-th and j-th substructures; λ denotes an eigenvalue of the reduced problem and x denotes the corresponding eigenvector; ω_max denotes the upper frequency limit set by the analysis of the actual working conditions.

Because the stiffness matrix is a diagonal matrix and the mass matrix M is a sparse symmetric block matrix with unit diagonal, the CSR representation of the sparse matrix M is obtained and stored on the GPU, and the original problem is converted into computing the eigenpairs of the equation K⁻¹Mx = δx in the interval (1/ω_max², +∞), where δ = 1/λ is the eigenvalue of the transformed problem.
Further, in step four, the set of nodes computed in this way is called SetA, and the set of all remaining substructure nodes is SetB. The computation task of a node in SetB can be divided between the CPU and the GPU: the CPU processes the ancestors of the current node that lie in SetB, the GPU processes the ancestors in SetA, and the final result is obtained with a single data addition. After each node is computed, if the current node is a non-leaf node, the computation tasks of its left and right child nodes can be created synchronously, and efficient thread scheduling and switching are realized through the OpenMP task mechanism.
The invention has the following beneficial effects. Aiming at the need for efficient approximate solution of large-scale sparse generalized eigenvalue equations in the CAE simulation field, the method divides the elimination tree conversion stage of the automatic multiple substructure solution into two stages, subtree parallelism and heterogeneous CPU and GPU node-task parallelism, according to the substructure sizes and the GPU video memory, realizing a strongly robust, load-balanced heterogeneous computing scheme. In addition, in the reduced eigenvalue problem and the back-substitution conversion stage, assigning suitable computation tasks to the GPU significantly improves the computing speed. The result is an efficient CPU and GPU heterogeneous automatic multiple substructure algorithm for large-scale sparse generalized eigenvalue equations, which can effectively meet the demand for fast approximate solution of sparse generalized eigenvalue equations in finite element simulation analysis, computational fluid dynamics, electromagnetics and other general fields.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of a parallel elimination tree;
FIG. 2 is a CPU parallel sub-tree flow diagram;
FIG. 3 is a flow chart of heterogeneous hybrid computation of CPU and GPU.
Detailed Description
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments. It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
The invention provides an automatic multiple substructure data processing method with heterogeneous CPU and GPU parallelism. The demarcation line of the elimination tree is determined from the node distribution of the substructure elimination tree produced by the sparse matrix reordering algorithm and from the number of CPU threads. Nodes above the line are computed in a multi-node hybrid CPU and GPU parallel mode, ensuring effective utilization of computing resources as far as possible; independent subtree tasks below the line are computed in parallel by the CPU cores, with a parallel strategy designed to reduce data synchronization.
The invention mainly comprises four steps:
step one, dividing lines are parallel to CPU subtrees:
the matrix conversion process of the whole elimination tree is the most time-consuming step of solving the whole characteristic equation, wherein nodes have strict calculation sequence requirements, each node can be started after all child nodes of the node are calculated, and the calculation tasks of a single node can be roughly divided into a plurality of calculation tasks such as small-scale dense matrix characteristic value problem calculation, constraint matrix calculation, node quality matrix update, ancestor node quality matrix update, master ancestor node rigidity matrix update, child node quality matrix update and the like, and the time consumption of the tasks is different and a large number of parallel opportunities exist.
FIG. 1 is a schematic diagram of the elimination tree generated by the sparse matrix reordering algorithm. When accelerating the computation with a GPU, video memory constraints mean that not all data can be copied to device memory at once for computation. Nodes in the lower layers of the elimination tree are numerous but carry an extremely small computational load, so their computation is kept on the CPU. Based on this idea, the invention determines the demarcation level between subtree parallelism and node parallelism by the following formula,
where Lc is the number of the decomposition layer, Nt is the number of CPU threads, nk is the number of nodes at the k-th layer of the elimination tree, k is the layer index, Z is the set of integers, and Vj is the number of degrees of freedom of the substructure of the j-th node at the k-th layer; Vj of a leaf node is taken as 0, because leaf nodes use sparse matrix computation and their computational load is small.
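The demarcation formula itself survives only through the symbol definitions above (the formula proper was an image in the original publication). As a hedged, minimal stand-in consistent with those definitions, the sketch below picks the shallowest elimination-tree layer whose node count nk reaches the CPU thread count Nt, so that every thread can own at least one independent subtree; the patent's actual criterion additionally weighs the substructure degrees of freedom Vj.

```python
def demarcation_level(nodes_per_layer, n_threads):
    """Return the index Lc of the shallowest layer whose node count
    reaches the thread count, so each CPU thread gets at least one
    subtree. nodes_per_layer[k] is nk, the node count of layer k of
    the elimination tree (layer 0 = root). A plausible stand-in for
    the patent's exact formula, not a reconstruction of it."""
    for k, nk in enumerate(nodes_per_layer):
        if nk >= n_threads:
            return k
    # Fewer nodes than threads on every layer: fall back to the deepest one.
    return len(nodes_per_layer) - 1
```

With 8 threads and layer node counts [1, 2, 4, 8, 16], the boundary falls at layer 3, the first layer offering one subtree per thread.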
Nodes below the demarcation line are divided into mutually independent subtrees according to the elimination tree distribution, the conversion of each subtree is treated as a computation task that can be solved in parallel, and each CPU thread obtains a certain number of subtree tasks through the dynamic thread scheduling of OpenMP (Open Multi-Processing, a shared-memory parallel programming model). The computation flow of each independent subtree is shown in FIG. 2. During the substructure matrix transformation, the solution of all constraint matrices and the multiplication of the connection matrices of the current node with its ancestor nodes compute only the rows and columns with non-zero elements, and the matrix update data of all ancestor nodes are temporarily stored in the current child node object, which eliminates the race problem of different threads simultaneously updating the data of the same node. The stiffness and mass matrices of leaf nodes are irregularly sparse, so they are stored and computed in the CSR sparse matrix format.
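The subtree stage above can be sketched in a few lines of pure Python. This is a hedged illustration, not the patent's implementation: `children`, `work` and the scalar "update" stand in for the elimination-tree topology and the real update matrices; `ThreadPoolExecutor` stands in for OpenMP dynamic scheduling.

```python
from concurrent.futures import ThreadPoolExecutor

def transform_subtree(children, work, root):
    # Post-order walk: a node runs only after all of its children.
    # Each child's ancestor-update (a plain number here, standing in
    # for the update matrix) is buffered and merged by its parent,
    # so threads working on different subtrees share no state.
    def visit(node):
        merged = sum(visit(c) for c in children.get(node, []))
        return work[node] + merged
    return visit(root)

def parallel_subtrees(children, work, roots, n_threads=4):
    # Idle threads pick up remaining subtree tasks, mirroring the
    # dynamic thread scheduling of OpenMP over independent subtrees.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = pool.map(lambda r: transform_subtree(children, work, r), roots)
        return dict(zip(roots, results))
```

Because every subtree below the demarcation line is independent, no synchronization is needed beyond joining the pool, which is the point of the reduced-data-synchronization strategy.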
Step two, node hybrid parallel computing based on CPU and GPU:
the size of the nodes above the boundary line is larger, and in the previous CPU parallel calculation process, constraint matrix solving and update matrix calculation on the father node occupy more than 80% of calculation time of each node. In heterogeneous hybrid algorithms, consider transferring this portion of the computing task into the GPU.
The overall computation scheme is shown in FIG. 3, where blue arrows denote data transfers over the PCIe bus. The nodes of each layer, once computed, serve as data synchronization points. When computation starts, the CPU extracts and merges the corresponding update matrices from the left and right child nodes in parallel and assembles the stiffness matrix and mass matrix of the current node; meanwhile, the data are copied into GPU memory using the asynchronous streams provided by CUDA (Compute Unified Device Architecture). The CPU then computes the node eigenvalue equation with an iterative method and updates the corresponding mass matrices for the substructures that have solutions in the interval. Because the proportion of substructures with solutions in the interval is small, and the number of such solutions rarely exceeds 10, these tasks are completed simply and efficiently using the CPU's strength in logic processing.
The device side binds one CUDA data stream to each node, exploiting the mutual independence of same-level substructures, so that the execution time of memory copies is hidden as much as possible within the computation time of the kernel functions, achieving higher parallelism and throughput. If, while computing one layer of the elimination tree, the data of all its nodes cannot be copied to GPU memory at once, the maximum number of substructures that can be copied to GPU memory in one pass is determined and their matrix computations are started; after each node finishes, its occupied memory is released, and a suitably sized node is selected from the uncomputed nodes, its data stream bound, and its computation started.
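The memory-elastic batching just described can be sketched as a greedy scheduler. This is a hedged model of the policy only: node sizes are plain byte counts, and the CUDA copy/compute/free cycle is abstracted into forming one batch per pass.

```python
def schedule_layer(node_mem, gpu_mem):
    # Greedy batching for one elimination-tree layer: copy as many
    # node matrices as fit into free GPU memory, compute the batch,
    # release its memory, then refill from the remaining nodes
    # (largest-first, echoing "select a suitably sized node").
    pending = dict(node_mem)          # node id -> bytes required
    batches = []
    while pending:
        free, batch = gpu_mem, []
        for nid, mem in sorted(pending.items(), key=lambda kv: -kv[1]):
            if mem <= free:
                batch.append(nid)
                free -= mem
        if not batch:
            raise MemoryError("a single node exceeds device memory")
        for nid in batch:
            del pending[nid]          # memory released after compute
        batches.append(batch)
    return batches
```

Each returned batch corresponds to one round of stream-bound copies and kernel launches; the next round starts only as memory is freed, which keeps the scheme compatible with GPUs of different memory sizes.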
Step three, parallel computation of the reduced eigenvalue problem on the GPU:

The matrix conversion of the substructures reduces the scale of the original sparse eigenvalue equation by more than two orders of magnitude. Consider solving the equation Kx = λMx for the eigenpairs in the interval (0, ω_max²), where the stiffness matrix and the mass matrix obtained from the conversion have the following form: the stiffness matrix is K = diag(λ1, …, λi, …, λn), and the mass matrix M is a sparse symmetric block matrix whose diagonal blocks are identity matrices and whose off-diagonal blocks are the coupling mass matrices Mij.

Here K is the stiffness matrix and M is the mass matrix; λ1, λi and λn are the 1st, i-th and n-th order eigenvalues computed during the substructure matrix conversion; n is the total number of eigenpairs computed during the conversion; Mij is the reduced coupling mass matrix of the i-th and j-th substructures; λ denotes an eigenvalue of the reduced problem and x denotes the corresponding eigenvector; ω_max denotes the upper frequency limit set by the analysis of the actual working conditions.

Because the stiffness matrix is a diagonal matrix and the mass matrix M is a sparse symmetric block matrix with unit diagonal, the CSR representation of the sparse matrix M is obtained and stored on the GPU, and the original problem is converted into computing the eigenpairs of the equation K⁻¹Mx = δx in the interval (1/ω_max², +∞), where δ = 1/λ is the eigenvalue of the transformed problem. By implementing the Arnoldi method on the GPU, operations such as sparse matrix-vector multiplication (SpMV) and vector re-orthogonalization (GEMM) can be computed rapidly with the numerical libraries provided by CUDA, significantly improving the efficiency of computing the extreme eigenvalues of the matrix.
Step four, parallel strategy of back-substitution conversion:
because the size of the equipment memory is limited, the nodes in the whole back-substitution conversion stage cannot be independently carried out on the GPU, so that the number of the nodes is sequentially increased from top to bottom from the root node of the elimination tree until the required node memory is larger than the equipment memory, data are transmitted to the equipment terminal at one time through page locking memory, the whole back-substitution solving process of the nodes is completed on the GPU, the node set calculated in the form is called set A, and the rest of all the substructure node sets are SetB. The node calculation with larger size is placed on the GPU, and the CPU end is only responsible for small-scale calculation tasks, so that the data quantity required to be transmitted is obviously reduced.
Because the back-substitution conversion of a node must traverse the data of all of its ancestor nodes but imposes no ordering requirement, the computation task of a node in SetB can be divided between the CPU and the GPU: the CPU processes the ancestors of the current node that lie in SetB, the GPU processes the ancestors in SetA, and the final result is obtained with a single data addition. After each node is computed, if the current node is a non-leaf node, the computation tasks of its left and right child nodes can be created synchronously, and efficient thread scheduling and switching are realized through the OpenMP task mechanism.
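The SetA/SetB split of one node's ancestor traversal can be sketched as follows. This is a hedged stand-in: `contrib` replaces the per-ancestor matrix product, `parent` encodes the elimination tree, and the "GPU" and "CPU" sums are ordinary host arithmetic here.

```python
def back_substitute(node, parent, contrib, in_setA):
    # Walk the ancestor chain of one SetB node, accumulating the
    # SetA ancestors (resident in device memory) and the SetB
    # ancestors (still on the host) separately, then merge the two
    # partial results with a single addition, as in step four.
    gpu_part = cpu_part = 0.0
    a = parent.get(node)
    while a is not None:
        if in_setA[a]:
            gpu_part += contrib[a]   # would run as a CUDA kernel
        else:
            cpu_part += contrib[a]   # remains on the CPU
        a = parent.get(a)
    return gpu_part + cpu_part       # the one-time data addition
```

Since the traversal order is irrelevant, the two partial sums can proceed concurrently on their respective devices before the final merge.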
In the automatic multiple substructure data processing method with heterogeneous CPU and GPU parallelism provided by the invention, subtree-parallel and node-parallel computation strategies are divided according to the GPU video memory and the node distribution of the elimination tree, and a computation scheme is designed that avoids the non-deterministic multithreaded races that would arise when different child nodes simultaneously update the same parent node. In the matrix conversion stage, computation formats tailored to the different sparsity of leaf and non-leaf node matrices effectively reduce unnecessary operations and memory requirements and improve solving efficiency. Heterogeneous hybrid GPU and CPU computation of a single node is realized: the massive parallel resources of the GPU accelerate dense matrix operations while the CPU handles small-scale computation tasks, avoiding idle computing resources to the greatest extent. The GPU is also used to accelerate the reduced generalized eigenvalue problem and the back-substitution conversion. The parallel strategy adapts elastically to the GPU video memory, ensuring compatibility with different GPU devices, and data transfer is overlapped with computation through GPU streams in the multi-node parallel process.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be capable of being practiced otherwise than as specifically illustrated and described herein.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. An automatic multiple substructure data processing method for heterogeneous parallelism of a CPU and a GPU is characterized by comprising the following steps:
step one, determining the demarcation level between subtree parallelism and node parallelism, dividing the nodes below the demarcation into mutually independent subtrees according to the elimination tree distribution, treating the conversion of each subtree as a computation task that can be solved in parallel, and letting each CPU thread obtain a preset number of subtree tasks through the dynamic thread scheduling of OpenMP;
step two, taking the computed nodes of each layer as data synchronization points; when computation starts, extracting and merging the corresponding update matrices from the left and right child nodes in multithreaded parallel at the CPU side, assembling the stiffness matrix and mass matrix of the current node, and copying the data into GPU memory using the asynchronous streams provided by CUDA; then computing the node eigenvalue equation at the CPU side with an iterative method and updating the corresponding mass matrices for the substructures that have solutions in the interval;
in step three, solving the equation Kx = λMx for the eigenpairs in the interval (0, ω_max²), where the stiffness matrix and the mass matrix obtained from the conversion have the following form: the stiffness matrix is K = diag(λ1, …, λi, …, λn), and the mass matrix M is a sparse symmetric block matrix whose diagonal blocks are identity matrices and whose off-diagonal blocks are the coupling mass matrices Mij; λ1, λi and λn are the 1st, i-th and n-th order eigenvalues computed during the substructure matrix conversion; n is the total number of eigenpairs computed during the conversion; Mij is the reduced coupling mass matrix of the i-th and j-th substructures; λ denotes an eigenvalue of the reduced problem and x denotes the corresponding eigenvector; ω_max denotes the upper frequency limit set by the analysis of the actual working conditions; because the stiffness matrix is a diagonal matrix and the mass matrix M is a sparse symmetric block matrix with unit diagonal, the CSR representation of the sparse matrix M is obtained and stored on the GPU, and the original problem is converted into computing the eigenpairs of the equation K⁻¹Mx = δx in the interval (1/ω_max², +∞), where δ = 1/λ is the eigenvalue of the transformed problem;
and step four, starting from the root node of the elimination tree, increasing the number of selected nodes layer by layer from top to bottom until the memory required by the selected nodes would exceed the device video memory, transferring the data to the device side in one pass through page-locked memory, and completing the entire back-substitution solving process of these nodes on the GPU.
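The top-down selection under a video-memory budget can be sketched as a breadth-first walk from the root. This is illustrative only; `Node::bytes` and the `budget` parameter stand in for the real per-node memory requirements and device capacity:

```cpp
#include <cstddef>
#include <queue>
#include <vector>

struct Node { std::size_t bytes; int left; int right; }; // -1 = no child

// Collect nodes from the root downward (breadth-first, i.e. layer by layer)
// until admitting the next node would exceed the device memory budget; the
// selected nodes are then transferred once and back-substituted on the GPU.
std::vector<int> select_for_gpu(const std::vector<Node>& tree, int root,
                                std::size_t budget) {
    std::vector<int> picked;
    std::size_t used = 0;
    std::queue<int> q;
    q.push(root);
    while (!q.empty()) {
        int id = q.front(); q.pop();
        if (used + tree[id].bytes > budget) break;  // video memory exhausted
        used += tree[id].bytes;
        picked.push_back(id);
        if (tree[id].left  >= 0) q.push(tree[id].left);
        if (tree[id].right >= 0) q.push(tree[id].right);
    }
    return picked;
}
```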
2. The automatic multiple sub-structure data processing method for heterogeneous parallelism of a CPU and a GPU according to claim 1, wherein in step one, the boundary level between sub-tree parallelism and node parallelism is determined by the following formula:
where Lc is the number of decomposed layers, Nt is the number of CPU threads, Nk denotes the number of nodes at the k-th layer of the elimination tree, k denotes the layer number of the elimination tree, Z denotes the set of integers, Vj denotes the number of degrees of freedom of the sub-structure of the j-th node at the k-th layer, and Vj of a leaf node is taken as 0.
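The claimed formula is rendered as an image and is not reproduced in this text. As a hedged illustration only, a common heuristic in this setting (an assumption, not the formula of the claim) picks the shallowest layer whose node count reaches the CPU thread count, so that every thread owns at least one independent subtree:

```cpp
#include <vector>

// Heuristic stand-in for the claimed boundary-level formula: return the
// first layer k whose node count Nk is at least the thread count Nt.
int boundary_layer(const std::vector<int>& nodes_per_layer, int nt) {
    for (int k = 0; k < static_cast<int>(nodes_per_layer.size()); ++k)
        if (nodes_per_layer[k] >= nt) return k;
    return static_cast<int>(nodes_per_layer.size()) - 1; // fall back to deepest
}
```

The actual claim additionally weighs the per-node degrees of freedom Vj, which this sketch ignores.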
3. The automatic multiple sub-structure data processing method for heterogeneous parallelism of a CPU and a GPU according to claim 1, wherein in step one, during the sub-structure matrix transformation, the solving of all constraint matrices and the multiplication with the connection matrices of the current node and its ancestor nodes operate only on the rows and columns containing non-zero elements, and the matrix update data of all ancestor nodes are temporarily stored in the current child node object, which avoids the race condition of different threads updating the data of the same node; the stiffness and mass matrices of the leaf nodes are stored and computed in CSR sparse matrix format.
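The race-avoidance idea of claim 3 (each child buffers the updates destined for its ancestors, and a single owner merges them afterwards) can be sketched as follows; the `Update` and `ChildNode` types are illustrative assumptions, with a scalar standing in for a matrix block:

```cpp
#include <vector>

struct Update { int ancestor; double value; };

// Each child node stages its ancestor updates in a private buffer instead
// of writing into the shared ancestor matrices, so parallel children that
// share an ancestor never race on it.
struct ChildNode {
    std::vector<Update> staged;
    void stage(int ancestor, double v) { staged.push_back({ancestor, v}); }
};

// Later, one owner thread merges all staged updates into the ancestors;
// no lock is needed because only the single merger writes.
void merge(std::vector<double>& ancestors, const ChildNode& child) {
    for (const Update& u : child.staged)
        ancestors[u.ancestor] += u.value;
}
```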
4. The automatic multiple sub-structure data processing method for heterogeneous parallelism of a CPU and a GPU according to claim 1, wherein in step two, exploiting the mutual independence of same-level sub-structures, the device side binds one CUDA data stream to each node, hiding the execution time of memory copies within the computation time of the kernel functions; if the computation data of all nodes of a given elimination-tree layer cannot be copied to GPU memory at once, the maximum number of sub-structures that can be copied to GPU memory in one pass is determined and the matrix computation of those nodes is started; after each node finishes, its occupied memory space is released and a not-yet-computed node of suitable size is bound to a data stream to start its computation.
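Abstracting away the CUDA stream API, the memory-bounded admission step of claim 4 can be sketched as below. This is illustrative only: real code would bind each admitted node to a `cudaStream_t`, and the release-and-refill step that follows each completed node is omitted here:

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Determine how many sub-structures of one elimination-tree layer fit in
// the GPU memory budget at once: admit pending nodes while their combined
// data size stays within the budget, then stop.
int max_concurrent(std::vector<std::size_t> sizes, std::size_t budget) {
    std::deque<std::size_t> pending(sizes.begin(), sizes.end());
    std::size_t used = 0;
    int admitted = 0;
    while (!pending.empty() && used + pending.front() <= budget) {
        used += pending.front();   // node data copied in, stream bound
        pending.pop_front();
        ++admitted;
    }
    return admitted;               // nodes computed concurrently on the GPU
}
```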
5. The automatic multiple sub-structure data processing method for heterogeneous parallelism of a CPU and a GPU according to claim 1, wherein in step four,
the set of nodes already calculated is called SetA, and the set of all remaining sub-structure nodes is SetB; the computing task of a node in SetB is divided into a CPU part and a GPU part: the CPU processes the ancestor nodes of the current node that lie in SetB while the GPU processes the ancestor nodes in SetA, and the final result is obtained after one addition of the two partial results; after each node is calculated, if the current node is a non-leaf node, the computing tasks of its left and right child nodes are created immediately, and efficient thread scheduling and switching are realized through the task mechanism of OpenMP.
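The child-task creation pattern of claim 5 maps naturally onto OpenMP tasks. The following is an illustrative sketch of the pattern, not the patented code; the per-node solve work is replaced by recording the node id:

```cpp
#include <vector>

struct TreeNode { int left = -1, right = -1; };   // -1 = no child

// After a node is solved, spawn tasks for its children; OpenMP's task
// scheduler handles the thread scheduling and switching.
void solve(const std::vector<TreeNode>& tree, int id, std::vector<int>& order) {
    #pragma omp critical
    order.push_back(id);                          // stand-in for "node solved"
    if (tree[id].left >= 0) {
        #pragma omp task shared(tree, order)
        solve(tree, tree[id].left, order);
    }
    if (tree[id].right >= 0) {
        #pragma omp task shared(tree, order)
        solve(tree, tree[id].right, order);
    }
    #pragma omp taskwait                          // children done before return
}
```

In practice `solve` would be launched on the root from inside an `#pragma omp parallel` region under `#pragma omp single`, so the spawned tasks are distributed across the thread team.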
CN202311585574.5A 2023-11-27 2023-11-27 Automatic multiple substructure data processing method for heterogeneous parallelism of CPU and GPU Active CN117311948B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311585574.5A CN117311948B (en) 2023-11-27 2023-11-27 Automatic multiple substructure data processing method for heterogeneous parallelism of CPU and GPU

Publications (2)

Publication Number Publication Date
CN117311948A CN117311948A (en) 2023-12-29
CN117311948B true CN117311948B (en) 2024-03-19

Family

ID=89273875

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311585574.5A Active CN117311948B (en) 2023-11-27 2023-11-27 Automatic multiple substructure data processing method for heterogeneous parallelism of CPU and GPU

Country Status (1)

Country Link
CN (1) CN117311948B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103399841A (en) * 2013-07-31 2013-11-20 清华大学 Sparse matrix LU decomposition method based on GPU
CN103617150A (en) * 2013-11-19 2014-03-05 国家电网公司 GPU (graphic processing unit) based parallel power flow calculation system and method for large-scale power system
CN103984527A (en) * 2014-04-01 2014-08-13 杭州电子科技大学 Method optimizing sparse matrix vector multiplication to improve incompressible pipe flow simulation efficiency
CN104461467A (en) * 2013-09-25 2015-03-25 广州中国科学院软件应用技术研究所 Method for increasing calculation speed of SMP cluster system through MPI and OpenMP in hybrid parallel mode
CN104461466A (en) * 2013-09-25 2015-03-25 广州中国科学院软件应用技术研究所 Method for increasing computing speed through parallel computing based on MPI and OpenMP hybrid programming model
CN105068787A (en) * 2015-08-28 2015-11-18 华南理工大学 Heterogeneous parallel computing method for sparse matrix-vector multiplication
CN105808829A (en) * 2016-03-02 2016-07-27 西安交通大学 CPU+GPU heterogeneous parallel computing based natural frequency characteristic analysis method for turbomachinery blade
CN106407158A (en) * 2016-09-12 2017-02-15 东南大学 GPU accelerated method for performing batch processing of isomorphic sparse matrixes multiplied by full vectors
CN110162736A (en) * 2018-01-10 2019-08-23 成都信息工程大学 Large Scale Sparse symmetrical linear equation group method for parallel processing based on elimination-tree
CN110598174A (en) * 2019-09-11 2019-12-20 北京华大九天软件有限公司 Back-substitution solving method of sparse matrix based on GPU architecture
CN111651208A (en) * 2020-05-08 2020-09-11 上海交通大学 Modal parallel computing method and system for heterogeneous many-core parallel computer
CN114201287A (en) * 2022-02-17 2022-03-18 湖南迈曦软件有限责任公司 Method for cooperatively processing data based on CPU + GPU heterogeneous platform
CN114329327A (en) * 2021-12-14 2022-04-12 清华大学 Sparse matrix parallel solving method and device based on upper and lower triangular decomposition

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8364739B2 (en) * 2009-09-30 2013-01-29 International Business Machines Corporation Sparse matrix-vector multiplication on graphics processor units
CN104036451B (en) * 2014-06-20 2018-12-11 深圳市腾讯计算机系统有限公司 Model method for parallel processing and device based on multi-graphics processor
US10346507B2 (en) * 2016-11-01 2019-07-09 Nvidia Corporation Symmetric block sparse matrix-vector multiplication

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Research on CUDA-based parallel assembly algorithms for finite element matrices; Hu Binxing; Li Xinguo; Sun Peng; Chinese Journal of Computational Mechanics (No. 3); full text *
Implementation and optimization of HYB-format sparse matrix-vector multiplication on a CPU+GPU heterogeneous system; Yang Wangdong; Li Kenli; Computer Engineering and Science (No. 2); full text *
Discussion on acceleration methods for finite element computation of electromagnetic fields based on general-purpose graphics processors; Xu Xiaoyu; Liu Guoqiang; E-Science Technology & Application (No. 4); full text *

Similar Documents

Publication Publication Date Title
WO2021057746A1 (en) Neural network processing method and apparatus, computer device and storage medium
CN104714850B (en) A kind of isomery based on OPENCL calculates equalization methods jointly
CN111459877A (en) FPGA acceleration-based Winograd YOLOv2 target detection model method
WO2022068663A1 (en) Memory allocation method, related device, and computer readable storage medium
CN110516316B (en) GPU acceleration method for solving Euler equation by interrupted Galerkin method
CN111368484B (en) Cosmic N-body numerical simulation optimization method and system based on Shenwei architecture
Du et al. Model parallelism optimization for distributed inference via decoupled CNN structure
CN112947870B (en) G-code parallel generation method of 3D printing model
CN111539526A (en) Neural network convolution method and device
CN110750265A (en) High-level synthesis method and system for graph calculation
CN105528243A (en) A priority packet scheduling method and system utilizing data topological information
CN101639788A (en) Multi-core parallel method for continuous system simulation based on TBB threading building blocks
Augonnet et al. A hierarchical fast direct solver for distributed memory machines with manycore nodes
CN109753682B (en) Finite element stiffness matrix simulation method based on GPU (graphics processing Unit) end
Shu et al. Design of deep learning accelerated algorithm for online recognition of industrial products defects
Huang et al. EvoX: A Distributed GPU-accelerated Framework for Scalable Evolutionary Computation
CN117311948B (en) Automatic multiple substructure data processing method for heterogeneous parallelism of CPU and GPU
Davis et al. Paradigmatic shifts for exascale supercomputing
Abdollahi-Kalkhoran et al. TEA-SEA: Tiling and scheduling of non-uniform two-level perfectly nested loops using an evolutionary approach
CN110415162B (en) Adaptive graph partitioning method facing heterogeneous fusion processor in big data
Wang et al. A novel parallel algorithm for sparse tensor matrix chain multiplication via tcu-acceleration
Bożejko A new class of parallel scheduling algorithms
US20230259780A1 (en) Neural network sparsification apparatus and method and related product
Garba et al. Asymptotic peak utilisation in heterogeneous parallel CPU/GPU pipelines: a decentralised queue monitoring strategy
CN116185377A (en) Optimization method and device for calculation graph and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant