CN117311948B - Automatic multiple substructure data processing method for heterogeneous parallelism of CPU and GPU - Google Patents


Info

Publication number
CN117311948B
CN117311948B (application CN202311585574.5A)
Authority
CN
China
Prior art keywords
node
matrix
cpu
gpu
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311585574.5A
Other languages
Chinese (zh)
Other versions
CN117311948A (en)
Inventor
王桂冬 (Wang Guidong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Maixi Software Co., Ltd.
Original Assignee
Hunan Maixi Software Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Maixi Software Co., Ltd.
Priority to CN202311585574.5A
Publication of CN117311948A
Application granted
Publication of CN117311948B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/48: Program initiating; program switching, e.g. by interrupt
    • G06F 9/4806: Task transfer initiation or dispatching
    • G06F 9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/5083: Techniques for rebalancing the load in a distributed system
    • G06F 9/52: Program synchronisation; mutual exclusion, e.g. by means of semaphores
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an automatic multiple substructure data processing method for heterogeneous parallelism of a CPU and a GPU, comprising: division of the demarcation line and parallel CPU subtrees; hybrid parallel node computation on the CPU and GPU; parallel computation of the reduced eigenvalue problem on the GPU; and a parallel strategy for the back-substitution conversion. The method determines the demarcation line of the elimination tree from the node distribution of the substructure elimination tree produced by the sparse matrix reordering algorithm and from the number of CPU threads. Nodes above the line are computed in a multi-node hybrid CPU and GPU parallel mode, ensuring effective utilization of computing resources as far as possible; independent subtree tasks below the line are computed in parallel on the CPU cores, with a parallel strategy designed to reduce data synchronization.

Description

Automatic multiple substructure data processing method for heterogeneous parallelism of CPU and GPU
Technical Field
The invention relates to the technical field of high-performance numerical computation in engineering, and in particular to an automatic multiple substructure data processing method for heterogeneous parallelism of a CPU and a GPU.
Background
The automatic multiple substructure method is widely applied in engineering fields such as structural dynamics analysis, structural optimization design, vibration control, electromagnetics and fluid-structure interaction, where it provides efficient and accurate dynamics analysis and optimization. Taking structural simulation analysis as an example: as industry has advanced, finite element models have grown steadily in scale, and the automatic multiple substructure method has become the primary choice for computing the eigenmodes of a model. It typically occupies more than half of the mode-solving time and therefore directly determines the computational efficiency of the whole simulation analysis. With the continued development of computer technology, using the GPU (Graphics Processing Unit) for massively parallel acceleration has become an effective option for high-performance numerical computation. At present, however, the automatic multiple substructure method mainly relies on CPU multi-core parallelism in a shared-memory environment to improve efficiency, which leaves GPU computing resources largely idle.
The SLEPc sparse eigenvalue solver, the prior art closest to the invention, provides solution algorithms for many types of eigenvalue equations, but does not provide an automatic multiple substructure algorithm for solving linear generalized eigenvalue equations. The time required to solve a finite element eigenvalue equation with SLEPc's default Krylov subspace algorithm is several times, or even tens of times, that of the automatic multiple substructure algorithm, making it difficult to meet the practical demands of large-scale model simulation.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an automatic multiple substructure data processing method for heterogeneous parallelism of a CPU and a GPU, which reasonably assigns the computational tasks of the automatic multiple substructure method to the GPU, provides a highly robust heterogeneous parallel strategy, and makes full use of GPU computing resources to accelerate the solution of generalized eigenvalue equations.
The invention provides an automatic multiple substructure data processing method for heterogeneous parallelism of a CPU and a GPU, which comprises the following steps:
step one, determining the demarcation level between subtree parallelism and node parallelism, dividing the nodes below the demarcation into mutually independent subtrees according to the elimination tree distribution, treating the conversion of each subtree as a computation task that can be solved in parallel, and letting each CPU thread obtain a preset number of subtree tasks through the dynamic thread scheduling of OpenMP;
step two, taking the computed nodes of each layer as data synchronization points; when computation starts, extracting and merging the corresponding update matrices from the left and right child nodes in multithreaded parallel at the CPU side, assembling the stiffness matrix and mass matrix of the current node, and copying the data into GPU memory using the asynchronous streams provided by CUDA; then computing the node eigenvalue equation at the CPU side with an iterative method and updating the corresponding mass matrices for the substructures that have solutions in the interval;
step three, exploiting the facts that the stiffness matrix is a diagonal matrix and the mass matrix is a sparse symmetric block matrix with unit diagonal, obtaining the CSR representation of the sparse matrix and storing it on the GPU; implementing the Arnoldi method on the GPU, using the numerical libraries provided by CUDA to compute sparse matrix-vector products and vector re-orthogonalization;
and step four, starting from the root node of the elimination tree, increasing the number of nodes from top to bottom until the required node memory exceeds the device video memory, transmitting the data to the device side in one pass through page-locked memory, and completing the entire back-substitution solving process of these nodes on the GPU.
Further, in step one, the demarcation level between subtree parallelism and node parallelism is determined by the following formula,
where Lc is the number of the decomposition layer, Nt is the number of CPU threads, nk is the number of nodes at the k-th layer of the elimination tree, k is the layer index of the elimination tree, Z is the set of integers, Vj is the number of degrees of freedom of the substructure of the j-th node at the k-th layer, and Vj of a leaf node is taken as 0.
Further, in step one, during the substructure matrix transformation, the solution of all constraint matrices and the multiplication of the connection matrices of the current node with its ancestor nodes compute only the rows and columns containing non-zero elements, and the matrix update data of all ancestor nodes are temporarily stored in the current child node object, eliminating the race problem of different threads updating the data of the same node; the stiffness and mass matrices of leaf nodes are stored and computed in the CSR sparse matrix format.
Further, in step two, the device side binds one CUDA data stream to each node, exploiting the mutual independence of same-level substructures, so as to hide the execution time of memory copies within the computation time of the kernel functions. If, while computing one layer of the elimination tree, the data of all its nodes cannot be copied to GPU memory at once, the maximum number of substructures that can be copied to GPU memory in one pass is determined and their matrix computations are started; after each node finishes, its occupied memory is released, and a suitably sized node is selected from the uncomputed nodes, its data stream bound, and its computation started.
Further, in step three, the equation Kx = λMx is solved for the eigenpairs in the interval (0, ω_max²), where the stiffness matrix and the mass matrix obtained from the conversion have the following form: the stiffness matrix is K = diag(λ1, …, λi, …, λn), and the mass matrix M is a sparse symmetric block matrix whose diagonal blocks are identity matrices and whose off-diagonal blocks are the coupling mass matrices Mij.

Here K is the stiffness matrix and M is the mass matrix; λ1, λi and λn are the 1st, i-th and n-th order eigenvalues computed during the substructure matrix conversion; n is the total number of eigenpairs computed during the conversion; Mij is the reduced coupling mass matrix of the i-th and j-th substructures; λ denotes an eigenvalue of the reduced problem and x denotes the corresponding eigenvector; ω_max denotes the upper frequency limit set by the analysis of the actual working conditions.

Because the stiffness matrix is a diagonal matrix and the mass matrix M is a sparse symmetric block matrix with unit diagonal, the CSR representation of the sparse matrix M is obtained and stored on the GPU, and the original problem is converted into computing the eigenpairs of the equation K⁻¹Mx = δx in the interval (1/ω_max², +∞), where δ = 1/λ is the eigenvalue of the transformed problem.
Further, in step four, the set of nodes computed in this way is called SetA, and the set of all remaining substructure nodes is SetB. The computation task of a node in SetB can be divided between the CPU and the GPU: the CPU processes the ancestors of the current node that lie in SetB, the GPU processes the ancestors in SetA, and the final result is obtained with a single data addition. After each node is computed, if the current node is a non-leaf node, the computation tasks of its left and right child nodes can be created synchronously, and efficient thread scheduling and switching are realized through the OpenMP task mechanism.
The invention has the following beneficial effects. Aiming at the need for efficient approximate solution of large-scale sparse generalized eigenvalue equations in the CAE simulation field, the method divides the elimination tree conversion stage of the automatic multiple substructure solution into two stages, subtree parallelism and heterogeneous CPU and GPU node-task parallelism, according to the substructure sizes and the GPU video memory, realizing a strongly robust, load-balanced heterogeneous computing scheme. In addition, in the reduced eigenvalue problem and the back-substitution conversion stage, assigning suitable computation tasks to the GPU significantly improves the computing speed. The result is an efficient CPU and GPU heterogeneous automatic multiple substructure algorithm for large-scale sparse generalized eigenvalue equations, which can effectively meet the demand for fast approximate solution of sparse generalized eigenvalue equations in finite element simulation analysis, computational fluid dynamics, electromagnetics and other general fields.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of a parallel elimination tree;
FIG. 2 is a CPU parallel sub-tree flow diagram;
FIG. 3 is a flow chart of heterogeneous hybrid computation of CPU and GPU.
Detailed Description
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments. It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
The invention provides an automatic multiple substructure data processing method with heterogeneous CPU and GPU parallelism. The demarcation line of the elimination tree is determined from the node distribution of the substructure elimination tree produced by the sparse matrix reordering algorithm and from the number of CPU threads. Nodes above the line are computed in a multi-node hybrid CPU and GPU parallel mode, ensuring effective utilization of computing resources as far as possible; independent subtree tasks below the line are computed in parallel by the CPU cores, with a parallel strategy designed to reduce data synchronization.
The invention mainly comprises four steps:
step one, dividing lines are parallel to CPU subtrees:
the matrix conversion process of the whole elimination tree is the most time-consuming step of solving the whole characteristic equation, wherein nodes have strict calculation sequence requirements, each node can be started after all child nodes of the node are calculated, and the calculation tasks of a single node can be roughly divided into a plurality of calculation tasks such as small-scale dense matrix characteristic value problem calculation, constraint matrix calculation, node quality matrix update, ancestor node quality matrix update, master ancestor node rigidity matrix update, child node quality matrix update and the like, and the time consumption of the tasks is different and a large number of parallel opportunities exist.
FIG. 1 is a schematic diagram of the elimination tree generated by the sparse matrix reordering algorithm. When accelerating the computation with a GPU, video memory constraints mean that not all data can be copied to device memory at once for computation. Nodes in the lower layers of the elimination tree are numerous but carry an extremely small computational load, so their computation is kept on the CPU. Based on this idea, the invention determines the demarcation level between subtree parallelism and node parallelism by the following formula,
where Lc is the number of the decomposition layer, Nt is the number of CPU threads, nk is the number of nodes at the k-th layer of the elimination tree, k is the layer index, Z is the set of integers, and Vj is the number of degrees of freedom of the substructure of the j-th node at the k-th layer; Vj of a leaf node is taken as 0, because leaf nodes use sparse matrix computation and their computational load is small.
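The demarcation formula itself survives only through the symbol definitions above (the formula proper was an image in the original publication). As a hedged, minimal stand-in consistent with those definitions, the sketch below picks the shallowest elimination-tree layer whose node count nk reaches the CPU thread count Nt, so that every thread can own at least one independent subtree; the patent's actual criterion additionally weighs the substructure degrees of freedom Vj.

```python
def demarcation_level(nodes_per_layer, n_threads):
    """Return the index Lc of the shallowest layer whose node count
    reaches the thread count, so each CPU thread gets at least one
    subtree. nodes_per_layer[k] is nk, the node count of layer k of
    the elimination tree (layer 0 = root). A plausible stand-in for
    the patent's exact formula, not a reconstruction of it."""
    for k, nk in enumerate(nodes_per_layer):
        if nk >= n_threads:
            return k
    # Fewer nodes than threads on every layer: fall back to the deepest one.
    return len(nodes_per_layer) - 1
```

With 8 threads and layer node counts [1, 2, 4, 8, 16], the boundary falls at layer 3, the first layer offering one subtree per thread.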
Nodes below the demarcation line are divided into mutually independent subtrees according to the elimination tree distribution, the conversion of each subtree is treated as a computation task that can be solved in parallel, and each CPU thread obtains a certain number of subtree tasks through the dynamic thread scheduling of OpenMP (Open Multi-Processing, a shared-memory parallel programming model). The computation flow of each independent subtree is shown in FIG. 2. During the substructure matrix transformation, the solution of all constraint matrices and the multiplication of the connection matrices of the current node with its ancestor nodes compute only the rows and columns with non-zero elements, and the matrix update data of all ancestor nodes are temporarily stored in the current child node object, which eliminates the race problem of different threads simultaneously updating the data of the same node. The stiffness and mass matrices of leaf nodes are irregularly sparse, so they are stored and computed in the CSR sparse matrix format.
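The subtree stage above can be sketched in a few lines of pure Python. This is a hedged illustration, not the patent's implementation: `children`, `work` and the scalar "update" stand in for the elimination-tree topology and the real update matrices; `ThreadPoolExecutor` stands in for OpenMP dynamic scheduling.

```python
from concurrent.futures import ThreadPoolExecutor

def transform_subtree(children, work, root):
    # Post-order walk: a node runs only after all of its children.
    # Each child's ancestor-update (a plain number here, standing in
    # for the update matrix) is buffered and merged by its parent,
    # so threads working on different subtrees share no state.
    def visit(node):
        merged = sum(visit(c) for c in children.get(node, []))
        return work[node] + merged
    return visit(root)

def parallel_subtrees(children, work, roots, n_threads=4):
    # Idle threads pick up remaining subtree tasks, mirroring the
    # dynamic thread scheduling of OpenMP over independent subtrees.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = pool.map(lambda r: transform_subtree(children, work, r), roots)
        return dict(zip(roots, results))
```

Because every subtree below the demarcation line is independent, no synchronization is needed beyond joining the pool, which is the point of the reduced-data-synchronization strategy.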
Step two, node hybrid parallel computing based on CPU and GPU:
the size of the nodes above the boundary line is larger, and in the previous CPU parallel calculation process, constraint matrix solving and update matrix calculation on the father node occupy more than 80% of calculation time of each node. In heterogeneous hybrid algorithms, consider transferring this portion of the computing task into the GPU.
The overall computation scheme is shown in FIG. 3, where blue arrows denote data transfers over the PCIe bus. The nodes of each layer, once computed, serve as data synchronization points. When computation starts, the CPU extracts and merges the corresponding update matrices from the left and right child nodes in parallel and assembles the stiffness matrix and mass matrix of the current node; meanwhile, the data are copied into GPU memory using the asynchronous streams provided by CUDA (Compute Unified Device Architecture). The CPU then computes the node eigenvalue equation with an iterative method and updates the corresponding mass matrices for the substructures that have solutions in the interval. Because the proportion of substructures with solutions in the interval is small, and the number of such solutions rarely exceeds 10, these tasks are completed simply and efficiently using the CPU's strength in logic processing.
The device side binds one CUDA data stream to each node, exploiting the mutual independence of same-level substructures, so that the execution time of memory copies is hidden as much as possible within the computation time of the kernel functions, achieving higher parallelism and throughput. If, while computing one layer of the elimination tree, the data of all its nodes cannot be copied to GPU memory at once, the maximum number of substructures that can be copied to GPU memory in one pass is determined and their matrix computations are started; after each node finishes, its occupied memory is released, and a suitably sized node is selected from the uncomputed nodes, its data stream bound, and its computation started.
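The memory-elastic batching just described can be sketched as a greedy scheduler. This is a hedged model of the policy only: node sizes are plain byte counts, and the CUDA copy/compute/free cycle is abstracted into forming one batch per pass.

```python
def schedule_layer(node_mem, gpu_mem):
    # Greedy batching for one elimination-tree layer: copy as many
    # node matrices as fit into free GPU memory, compute the batch,
    # release its memory, then refill from the remaining nodes
    # (largest-first, echoing "select a suitably sized node").
    pending = dict(node_mem)          # node id -> bytes required
    batches = []
    while pending:
        free, batch = gpu_mem, []
        for nid, mem in sorted(pending.items(), key=lambda kv: -kv[1]):
            if mem <= free:
                batch.append(nid)
                free -= mem
        if not batch:
            raise MemoryError("a single node exceeds device memory")
        for nid in batch:
            del pending[nid]          # memory released after compute
        batches.append(batch)
    return batches
```

Each returned batch corresponds to one round of stream-bound copies and kernel launches; the next round starts only as memory is freed, which keeps the scheme compatible with GPUs of different memory sizes.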
Step three, parallel computation of the reduced eigenvalue problem on the GPU:

The matrix conversion of the substructures reduces the scale of the original sparse eigenvalue equation by more than two orders of magnitude. Consider solving the equation Kx = λMx for the eigenpairs in the interval (0, ω_max²), where the stiffness matrix and the mass matrix obtained from the conversion have the following form: the stiffness matrix is K = diag(λ1, …, λi, …, λn), and the mass matrix M is a sparse symmetric block matrix whose diagonal blocks are identity matrices and whose off-diagonal blocks are the coupling mass matrices Mij.

Here K is the stiffness matrix and M is the mass matrix; λ1, λi and λn are the 1st, i-th and n-th order eigenvalues computed during the substructure matrix conversion; n is the total number of eigenpairs computed during the conversion; Mij is the reduced coupling mass matrix of the i-th and j-th substructures; λ denotes an eigenvalue of the reduced problem and x denotes the corresponding eigenvector; ω_max denotes the upper frequency limit set by the analysis of the actual working conditions.

Because the stiffness matrix is a diagonal matrix and the mass matrix M is a sparse symmetric block matrix with unit diagonal, the CSR representation of the sparse matrix M is obtained and stored on the GPU, and the original problem is converted into computing the eigenpairs of the equation K⁻¹Mx = δx in the interval (1/ω_max², +∞), where δ = 1/λ is the eigenvalue of the transformed problem. By implementing the Arnoldi method on the GPU, operations such as sparse matrix-vector multiplication (SpMV) and vector re-orthogonalization (GEMM) can be computed rapidly with the numerical libraries provided by CUDA, significantly improving the efficiency of computing the extreme eigenvalues of the matrix.
Step four, parallel strategy of back-substitution conversion:
because the size of the equipment memory is limited, the nodes in the whole back-substitution conversion stage cannot be independently carried out on the GPU, so that the number of the nodes is sequentially increased from top to bottom from the root node of the elimination tree until the required node memory is larger than the equipment memory, data are transmitted to the equipment terminal at one time through page locking memory, the whole back-substitution solving process of the nodes is completed on the GPU, the node set calculated in the form is called set A, and the rest of all the substructure node sets are SetB. The node calculation with larger size is placed on the GPU, and the CPU end is only responsible for small-scale calculation tasks, so that the data quantity required to be transmitted is obviously reduced.
Because the back-substitution conversion of a node must traverse the data of all of its ancestor nodes but imposes no ordering requirement, the computation task of a node in SetB can be divided between the CPU and the GPU: the CPU processes the ancestors of the current node that lie in SetB, the GPU processes the ancestors in SetA, and the final result is obtained with a single data addition. After each node is computed, if the current node is a non-leaf node, the computation tasks of its left and right child nodes can be created synchronously, and efficient thread scheduling and switching are realized through the OpenMP task mechanism.
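The SetA/SetB split of one node's ancestor traversal can be sketched as follows. This is a hedged stand-in: `contrib` replaces the per-ancestor matrix product, `parent` encodes the elimination tree, and the "GPU" and "CPU" sums are ordinary host arithmetic here.

```python
def back_substitute(node, parent, contrib, in_setA):
    # Walk the ancestor chain of one SetB node, accumulating the
    # SetA ancestors (resident in device memory) and the SetB
    # ancestors (still on the host) separately, then merge the two
    # partial results with a single addition, as in step four.
    gpu_part = cpu_part = 0.0
    a = parent.get(node)
    while a is not None:
        if in_setA[a]:
            gpu_part += contrib[a]   # would run as a CUDA kernel
        else:
            cpu_part += contrib[a]   # remains on the CPU
        a = parent.get(a)
    return gpu_part + cpu_part       # the one-time data addition
```

Since the traversal order is irrelevant, the two partial sums can proceed concurrently on their respective devices before the final merge.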
In the automatic multiple substructure data processing method with heterogeneous CPU and GPU parallelism provided by the invention, subtree-parallel and node-parallel computation strategies are divided according to the GPU video memory and the node distribution of the elimination tree, and a computation scheme is designed that avoids the non-deterministic multithreaded races that would arise when different child nodes simultaneously update the same parent node. In the matrix conversion stage, computation formats tailored to the different sparsity of leaf and non-leaf node matrices effectively reduce unnecessary operations and memory requirements and improve solving efficiency. Heterogeneous hybrid GPU and CPU computation of a single node is realized: the massive parallel resources of the GPU accelerate dense matrix operations while the CPU handles small-scale computation tasks, avoiding idle computing resources to the greatest extent. The GPU is also used to accelerate the reduced generalized eigenvalue problem and the back-substitution conversion. The parallel strategy adapts elastically to the GPU video memory, ensuring compatibility with different GPU devices, and data transfer is overlapped with computation through GPU streams in the multi-node parallel process.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be capable of being practiced otherwise than as specifically illustrated and described herein.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. An automatic multiple substructure data processing method for heterogeneous parallelism of a CPU and a GPU is characterized by comprising the following steps:
step one, determining the demarcation level between subtree parallelism and node parallelism, dividing the nodes below the demarcation into mutually independent subtrees according to the elimination tree distribution, treating the conversion of each subtree as a computation task that can be solved in parallel, and letting each CPU thread obtain a preset number of subtree tasks through the dynamic thread scheduling of OpenMP;
step two, taking the computed nodes of each layer as data synchronization points; when computation starts, extracting and merging the corresponding update matrices from the left and right child nodes in multithreaded parallel at the CPU side, assembling the stiffness matrix and mass matrix of the current node, and copying the data into GPU memory using the asynchronous streams provided by CUDA; then computing the node eigenvalue equation at the CPU side with an iterative method and updating the corresponding mass matrices for the substructures that have solutions in the interval;
in step three, solving the equation Kx = λMx for the eigenpairs in the interval (0, ω_max²), where the stiffness matrix and the mass matrix obtained from the conversion have the following form: the stiffness matrix is K = diag(λ1, …, λi, …, λn), and the mass matrix M is a sparse symmetric block matrix whose diagonal blocks are identity matrices and whose off-diagonal blocks are the coupling mass matrices Mij; λ1, λi and λn are the 1st, i-th and n-th order eigenvalues computed during the substructure matrix conversion; n is the total number of eigenpairs computed during the conversion; Mij is the reduced coupling mass matrix of the i-th and j-th substructures; λ denotes an eigenvalue of the reduced problem and x denotes the corresponding eigenvector; ω_max denotes the upper frequency limit set by the analysis of the actual working conditions; because the stiffness matrix is a diagonal matrix and the mass matrix M is a sparse symmetric block matrix with unit diagonal, the CSR representation of the sparse matrix M is obtained and stored on the GPU, and the original problem is converted into computing the eigenpairs of the equation K⁻¹Mx = δx in the interval (1/ω_max², +∞), where δ = 1/λ is the eigenvalue of the transformed problem;
and step four, starting from the root node of the elimination tree, increasing the number of selected nodes layer by layer from top to bottom until the memory required by the selected nodes would exceed the device video memory, transferring the data to the device side in one pass through page-locked memory, and completing the entire back-substitution solving process of these nodes on the GPU.
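The top-down selection under a video-memory budget can be sketched as a breadth-first walk from the root. This is illustrative only; `Node::bytes` and the `budget` parameter stand in for the real per-node memory requirements and device capacity:

```cpp
#include <cstddef>
#include <queue>
#include <vector>

struct Node { std::size_t bytes; int left; int right; }; // -1 = no child

// Collect nodes from the root downward (breadth-first, i.e. layer by layer)
// until admitting the next node would exceed the device memory budget; the
// selected nodes are then transferred once and back-substituted on the GPU.
std::vector<int> select_for_gpu(const std::vector<Node>& tree, int root,
                                std::size_t budget) {
    std::vector<int> picked;
    std::size_t used = 0;
    std::queue<int> q;
    q.push(root);
    while (!q.empty()) {
        int id = q.front(); q.pop();
        if (used + tree[id].bytes > budget) break;  // video memory exhausted
        used += tree[id].bytes;
        picked.push_back(id);
        if (tree[id].left  >= 0) q.push(tree[id].left);
        if (tree[id].right >= 0) q.push(tree[id].right);
    }
    return picked;
}
```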
2. The automatic multiple sub-structure data processing method for heterogeneous parallelism of a CPU and a GPU according to claim 1, wherein in step one, the boundary level between sub-tree parallelism and node parallelism is determined by the following formula:
where Lc is the number of decomposed layers, Nt is the number of CPU threads, Nk denotes the number of nodes at the k-th layer of the elimination tree, k denotes the layer number of the elimination tree, Z denotes the set of integers, Vj denotes the number of degrees of freedom of the sub-structure of the j-th node at the k-th layer, and Vj of a leaf node is taken as 0.
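The claimed formula is rendered as an image and is not reproduced in this text. As a hedged illustration only, a common heuristic in this setting (an assumption, not the formula of the claim) picks the shallowest layer whose node count reaches the CPU thread count, so that every thread owns at least one independent subtree:

```cpp
#include <vector>

// Heuristic stand-in for the claimed boundary-level formula: return the
// first layer k whose node count Nk is at least the thread count Nt.
int boundary_layer(const std::vector<int>& nodes_per_layer, int nt) {
    for (int k = 0; k < static_cast<int>(nodes_per_layer.size()); ++k)
        if (nodes_per_layer[k] >= nt) return k;
    return static_cast<int>(nodes_per_layer.size()) - 1; // fall back to deepest
}
```

The actual claim additionally weighs the per-node degrees of freedom Vj, which this sketch ignores.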
3. The automatic multiple sub-structure data processing method for heterogeneous parallelism of a CPU and a GPU according to claim 1, wherein in step one, during the sub-structure matrix transformation, the solving of all constraint matrices and the multiplication with the connection matrices of the current node and its ancestor nodes operate only on the rows and columns containing non-zero elements, and the matrix update data of all ancestor nodes are temporarily stored in the current child node object, which avoids the race condition of different threads updating the data of the same node; the stiffness and mass matrices of the leaf nodes are stored and computed in CSR sparse matrix format.
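The race-avoidance idea of claim 3 (each child buffers the updates destined for its ancestors, and a single owner merges them afterwards) can be sketched as follows; the `Update` and `ChildNode` types are illustrative assumptions, with a scalar standing in for a matrix block:

```cpp
#include <vector>

struct Update { int ancestor; double value; };

// Each child node stages its ancestor updates in a private buffer instead
// of writing into the shared ancestor matrices, so parallel children that
// share an ancestor never race on it.
struct ChildNode {
    std::vector<Update> staged;
    void stage(int ancestor, double v) { staged.push_back({ancestor, v}); }
};

// Later, one owner thread merges all staged updates into the ancestors;
// no lock is needed because only the single merger writes.
void merge(std::vector<double>& ancestors, const ChildNode& child) {
    for (const Update& u : child.staged)
        ancestors[u.ancestor] += u.value;
}
```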
4. The automatic multiple sub-structure data processing method for heterogeneous parallelism of a CPU and a GPU according to claim 1, wherein in step two, exploiting the mutual independence of same-level sub-structures, the device side binds one CUDA data stream to each node, hiding the execution time of memory copies within the computation time of the kernel functions; if the computation data of all nodes of a given elimination-tree layer cannot be copied to GPU memory at once, the maximum number of sub-structures that can be copied to GPU memory in one pass is determined and the matrix computation of those nodes is started; after each node finishes, its occupied memory space is released and a not-yet-computed node of suitable size is bound to a data stream to start its computation.
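Abstracting away the CUDA stream API, the memory-bounded admission step of claim 4 can be sketched as below. This is illustrative only: real code would bind each admitted node to a `cudaStream_t`, and the release-and-refill step that follows each completed node is omitted here:

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Determine how many sub-structures of one elimination-tree layer fit in
// the GPU memory budget at once: admit pending nodes while their combined
// data size stays within the budget, then stop.
int max_concurrent(std::vector<std::size_t> sizes, std::size_t budget) {
    std::deque<std::size_t> pending(sizes.begin(), sizes.end());
    std::size_t used = 0;
    int admitted = 0;
    while (!pending.empty() && used + pending.front() <= budget) {
        used += pending.front();   // node data copied in, stream bound
        pending.pop_front();
        ++admitted;
    }
    return admitted;               // nodes computed concurrently on the GPU
}
```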
5. The automatic multiple sub-structure data processing method for heterogeneous parallelism of a CPU and a GPU according to claim 1, wherein in step four,
the set of nodes already calculated is called SetA, and the set of all remaining sub-structure nodes is SetB; the computing task of a node in SetB is divided into a CPU part and a GPU part: the CPU processes the ancestor nodes of the current node that lie in SetB while the GPU processes the ancestor nodes in SetA, and the final result is obtained after one addition of the two partial results; after each node is calculated, if the current node is a non-leaf node, the computing tasks of its left and right child nodes are created immediately, and efficient thread scheduling and switching are realized through the task mechanism of OpenMP.
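The child-task creation pattern of claim 5 maps naturally onto OpenMP tasks. The following is an illustrative sketch of the pattern, not the patented code; the per-node solve work is replaced by recording the node id:

```cpp
#include <vector>

struct TreeNode { int left = -1, right = -1; };   // -1 = no child

// After a node is solved, spawn tasks for its children; OpenMP's task
// scheduler handles the thread scheduling and switching.
void solve(const std::vector<TreeNode>& tree, int id, std::vector<int>& order) {
    #pragma omp critical
    order.push_back(id);                          // stand-in for "node solved"
    if (tree[id].left >= 0) {
        #pragma omp task shared(tree, order)
        solve(tree, tree[id].left, order);
    }
    if (tree[id].right >= 0) {
        #pragma omp task shared(tree, order)
        solve(tree, tree[id].right, order);
    }
    #pragma omp taskwait                          // children done before return
}
```

In practice `solve` would be launched on the root from inside an `#pragma omp parallel` region under `#pragma omp single`, so the spawned tasks are distributed across the thread team.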
CN202311585574.5A 2023-11-27 2023-11-27 Automatic multiple substructure data processing method for heterogeneous parallelism of CPU and GPU Active CN117311948B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311585574.5A CN117311948B (en) 2023-11-27 2023-11-27 Automatic multiple substructure data processing method for heterogeneous parallelism of CPU and GPU

Publications (2)

Publication Number Publication Date
CN117311948A CN117311948A (en) 2023-12-29
CN117311948B true CN117311948B (en) 2024-03-19

Family

ID=89273875

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311585574.5A Active CN117311948B (en) 2023-11-27 2023-11-27 Automatic multiple substructure data processing method for heterogeneous parallelism of CPU and GPU

Country Status (1)

Country Link
CN (1) CN117311948B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103399841A (en) * 2013-07-31 2013-11-20 清华大学 Sparse matrix LU decomposition method based on GPU
CN103617150A (en) * 2013-11-19 2014-03-05 国家电网公司 GPU (graphic processing unit) based parallel power flow calculation system and method for large-scale power system
CN103984527A (en) * 2014-04-01 2014-08-13 杭州电子科技大学 Method optimizing sparse matrix vector multiplication to improve incompressible pipe flow simulation efficiency
CN104461467A (en) * 2013-09-25 2015-03-25 广州中国科学院软件应用技术研究所 Method for increasing calculation speed of SMP cluster system through MPI and OpenMP in hybrid parallel mode
CN104461466A (en) * 2013-09-25 2015-03-25 广州中国科学院软件应用技术研究所 Method for increasing computing speed through parallel computing based on MPI and OpenMP hybrid programming model
CN105068787A (en) * 2015-08-28 2015-11-18 华南理工大学 Heterogeneous parallel computing method for sparse matrix-vector multiplication
CN105808829A (en) * 2016-03-02 2016-07-27 西安交通大学 CPU+GPU heterogeneous parallel computing based natural frequency characteristic analysis method for turbomachinery blade
CN106407158A (en) * 2016-09-12 2017-02-15 东南大学 GPU accelerated method for performing batch processing of isomorphic sparse matrixes multiplied by full vectors
CN110162736A (en) * 2018-01-10 2019-08-23 成都信息工程大学 Large Scale Sparse symmetrical linear equation group method for parallel processing based on elimination-tree
CN110598174A (en) * 2019-09-11 2019-12-20 北京华大九天软件有限公司 Back-substitution solving method of sparse matrix based on GPU architecture
CN111651208A (en) * 2020-05-08 2020-09-11 上海交通大学 Modal parallel computing method and system for heterogeneous many-core parallel computer
CN114201287A (en) * 2022-02-17 2022-03-18 湖南迈曦软件有限责任公司 Method for cooperatively processing data based on CPU + GPU heterogeneous platform
CN114329327A (en) * 2021-12-14 2022-04-12 清华大学 Sparse matrix parallel solving method and device based on upper and lower triangular decomposition

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8364739B2 (en) * 2009-09-30 2013-01-29 International Business Machines Corporation Sparse matrix-vector multiplication on graphics processor units
CN104036451B (en) * 2014-06-20 2018-12-11 深圳市腾讯计算机系统有限公司 Model method for parallel processing and device based on multi-graphics processor
US10346507B2 (en) * 2016-11-01 2019-07-09 Nvidia Corporation Symmetric block sparse matrix-vector multiplication

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Research on CUDA-based parallel assembly algorithms for finite element matrices; Hu Binxing; Li Xinguo; Sun Peng; Chinese Journal of Computational Mechanics (No. 3); full text *
Implementation and optimization of HYB-format sparse matrix-vector multiplication on a CPU+GPU heterogeneous system; Yang Wangdong; Li Kenli; Computer Engineering and Science (No. 2); full text *
Discussion on acceleration methods for finite element computation of electromagnetic fields based on general-purpose graphics processors; Xu Xiaoyu; Liu Guoqiang; E-Science Technology & Application (No. 4); full text *

Similar Documents

Publication Publication Date Title
WO2021057746A1 (en) Neural network processing method and apparatus, computer device and storage medium
CN104714850B (en) A kind of isomery based on OPENCL calculates equalization methods jointly
CN111459877A (en) FPGA acceleration-based Winograd YOLOv2 target detection model method
WO2022068663A1 (en) Memory allocation method, related device, and computer readable storage medium
CN110516316B (en) GPU acceleration method for solving Euler equation by interrupted Galerkin method
CN111368484B (en) Cosmic N-body numerical simulation optimization method and system based on Shenwei architecture
Du et al. Model parallelism optimization for distributed inference via decoupled CNN structure
CN112947870B (en) G-code parallel generation method of 3D printing model
CN111539526A (en) Neural network convolution method and device
CN110750265A (en) High-level synthesis method and system for graph calculation
CN105528243A (en) A priority packet scheduling method and system utilizing data topological information
CN101639788A (en) Multi-core parallel method for continuous system simulation based on TBB threading building blocks
Augonnet et al. A hierarchical fast direct solver for distributed memory machines with manycore nodes
CN109753682B (en) Finite element stiffness matrix simulation method based on GPU (graphics processing Unit) end
Shu et al. Design of deep learning accelerated algorithm for online recognition of industrial products defects
Huang et al. EvoX: A Distributed GPU-accelerated Framework for Scalable Evolutionary Computation
CN117311948B (en) Automatic multiple substructure data processing method for heterogeneous parallelism of CPU and GPU
Davis et al. Paradigmatic shifts for exascale supercomputing
Abdollahi-Kalkhoran et al. TEA-SEA: Tiling and scheduling of non-uniform two-level perfectly nested loops using an evolutionary approach
CN110415162B (en) Adaptive graph partitioning method facing heterogeneous fusion processor in big data
Wang et al. A novel parallel algorithm for sparse tensor matrix chain multiplication via tcu-acceleration
Bożejko A new class of parallel scheduling algorithms
US20230259780A1 (en) Neural network sparsification apparatus and method and related product
Garba et al. Asymptotic peak utilisation in heterogeneous parallel CPU/GPU pipelines: a decentralised queue monitoring strategy
CN116185377A (en) Optimization method and device for calculation graph and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant