CN114911619A - Batch parallel LU decomposition method of small and medium-sized dense matrix based on GPU for simulation system - Google Patents
Classifications
- G06F9/5027 — Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, servers, terminals
- G06F12/0877 — Cache access modes (addressing in hierarchically structured memory systems)
- G06T1/20 — Processor architectures; processor configuration, e.g. pipelining
- G06F2209/5018 — Indexing scheme relating to G06F9/50: thread allocation
Abstract
The invention provides a high-performance GPU-based batch parallel LU decomposition method for the small and medium-sized dense matrices of a simulation system, comprising the following steps: S1, at each time step, calculate the Jacobian matrix of each model unit of the simulation system, take these matrices as sub-matrices, and splice the sub-matrices into a large matrix H; S2, partition the interior of each sub-matrix and allocate a GPU thread group to each sub-matrix according to the partition size; S3, in diagonal order from top left to bottom right, each GPU thread group successively completes, for each diagonal block, the pivot selection for all columns in the block, the update of the blocks to the right in the block's row, the exchange of the block's internal columns with the pivot columns, and the update of all block data below the block's column and to its lower right, thereby completing the LU decomposition of all sub-matrices in the large matrix H. The performance of the invention grows in an approximately linear trend as the batch size increases; when implicitly solving models containing large batches of small ordinary differential equation systems, it accelerates the solution and effectively shortens simulation time.
Description
Technical Field
The invention belongs to the cross-disciplinary field of modeling and simulation and parallel computing, and in particular relates to a GPU (Graphics Processing Unit)-based batch parallel LU decomposition (lower-upper decomposition) method for small and medium-sized dense matrices in complex system simulation.
Background
LU decomposition is a core computational step in many engineering and scientific computing problems, with wide application in system simulation, machine learning, image processing, information encryption, high-speed data transmission, and other fields. With the continuous development of computer science, the volume of data to be processed grows exponentially, and processing such experimental data efficiently and quickly on a computer has become a difficult problem. In many key applications the task is the parallel solution of a large number of small matrices, i.e., completing the LU decomposition of a large batch of matrices under one set of operation logic, rather than serially solving a few large-scale linear systems.
A complex system usually contains a large number of small and medium-sized nonlinear simulation model units that require approximate solution in linear form; the most time-consuming step is solving the linear equation systems, which requires LU decomposition of each model unit's Jacobian matrix. Current GPU-based LU decomposition methods are mainly oriented to large matrices: they adopt a blocking scheme and distribute the processing tasks over many threads, i.e., they divide the computational task inside one large model. This approach is unsuitable for small and medium-sized models: because the Jacobian matrix is small, the time spent reading and writing matrix elements far exceeds the computation time, block-based matrix operations cannot raise the ratio of effective computation time, and global memory access latency becomes the main performance bottleneck. The smaller the matrix, the more serious the problem.
Disclosure of Invention
In order to solve the above technical problems, the present invention provides a GPU-based batch parallel LU decomposition method for the small and medium-sized dense matrices of a simulation system, which effectively reduces the time consumed by the decomposition process and improves operational efficiency.
The GPU-based batch LU decomposition method for the small and medium-sized dense matrices of a simulation system provided by the invention comprises the following steps:
S1, at each time step, calculating the Jacobian matrix of each model unit of the simulation system, taking these matrices as sub-matrices, and splicing the sub-matrices into a large matrix H;
S2, partitioning the interior of each sub-matrix, and allocating a GPU thread group to each sub-matrix according to the partition size;
S3, in diagonal order from top left to bottom right, the GPU thread groups successively completing, for each diagonal block, the pivot selection for all columns in the block, the update of the blocks to the right in the block's row, the exchange of the block's internal columns with the pivot columns, and the update of all block data below the block's column and to its lower right, thereby completing the LU decomposition of all sub-matrices in the large matrix H.
Further, in step S1, the simulation system has n model units, and each Jacobian matrix is a small or medium-sized dense matrix H_i of order m, with i in the range [0, n-1]. Here m is obtained as follows: the dimension of each model unit currently to be solved is extended upward to an integer power of two, m (i.e., the Jacobian matrix is extended to order m); in the added rows and columns, the diagonal elements are set to 1 and all other elements are filled with 0.
Further, in step S2, the method includes:
S21, dividing the sub-matrix H_i into several sub-blocks, so that in subsequent processing a thread group can read the data of a whole sub-block from global memory at one time, reducing the number of global memory accesses;
S22, according to the block size, allocating a corresponding number of GPU threads to the sub-matrix H_i to form a thread group Group_i.
Further, the method for dividing the sub-matrix H_i into several sub-blocks is:
if m is larger than or equal to 16, setting the block size to be 8 or 16;
if m is smaller than 16, the entire sub-matrix is treated as one block, in which case blocksize = m.
Further, in step S3, the specific steps include:
S31, each thread group Group_i selects the diagonal block B_j,j with index j inside its sub-matrix H_i, determines the pivot columns corresponding to all columns in B_j,j, and updates all blocks to the right of B_j,j in its row; j is initialized to 0;
S32, each obtained pivot column is swapped with the corresponding column inside the diagonal block B_j,j;
S33, all blocks below the column of the diagonal block B_j,j are updated;
S34, all blocks to the lower right of the diagonal block B_j,j are updated;
S35, if the diagonal block B_j,j is the lower-right corner block of the sub-matrix H_i, the batch parallel LU decomposition ends and the large matrix H is output, in which every sub-matrix H_i is factored into a lower triangular matrix L_i and an upper triangular matrix U_i, used for the subsequent fast solution of linear equation systems; otherwise, set j = j + 1, go to step S31, and continue processing the next block.
Further, step S31 includes:
S311: initialize a one-dimensional array To_i of length m with To_i[l] = m, 0 ≤ l < m, where the array index l corresponds to the column index of sub-matrix H_i and To_i[l] records, for the column with index l, the index of its final pivot column;
S312: the thread group starts from the current block B_j,j and reads in parallel all elements of row p inside the block, storing them into thread-local caches; here p is a row index computed with respect to the sub-matrix H_i and is initialized to j × blocksize, with j × blocksize ≤ p < (j+1) × blocksize; in this step, because the row elements of block B_j,j are laid out contiguously in global memory, the GPU can read all the data for the thread group by issuing a single memory instruction, achieving coalesced memory access;
S313: the thread group continues to read in parallel all data of row p in the next block adjacent to the right of B_j,j, compares it with the data in the cache, and keeps the element with the larger absolute value together with its column index in the internal cache; this process iterates until the last block of the sub-matrix has been read;
S314: all elements cached in the thread group are compared, the column q containing the element of maximum absolute value is taken as the pivot column for column p, and To_i[p] = q is recorded;
S315: submatrix H i From the column index j × block size, traverse all subsequent elements in the rowIf To i [l]>q, then perform the following operations:
s316: for each element of the P1 regionIf To i [y]>q, then perform the following argument operation:
wherein the P1 region is located in the sub-matrix H i (p +1), (j +1) xblocksize-1]And [ j × blocksize, m-1 ] of rows]A matrix between the columns; p +1 is less than or equal to x<(j+1)×blocksize,j×blocksize≤y<m;
S317: if p is block B j,j Last line in, and step S31 ends, at which time B j,j Including the LU decomposition result L in the current block j,j And U j,j (ii) a Otherwise, let p be p +1, go to step S312, and continue processing the next line.
Further, in step S32, for each column p in B_j,j, the p-th thread in the thread group exchanges the elements between column p and column To_i[p].
Further, in step S33, let the P2 region be the matrix of sub-matrix H_i between rows [(j+1) × blocksize, m-1] and columns [j × blocksize, (j+1) × blocksize - 1]; for each element h_x,y in it, perform the following update:
h_x,y = h_x,y - Σ (l = j × blocksize .. y-1) h_x,l × h_l,y (3)
where (j+1) × blocksize ≤ x < m and j × blocksize ≤ y < (j+1) × blocksize.
Further, in step S34, let the P3 region be the matrix of sub-matrix H_i between rows [(j+1) × blocksize, m-1] and columns [(j+1) × blocksize, m-1]; the P1 and P3 regions are further divided into several sub-regions according to blocksize, and the thread group Group_i processes each pair of sub-regions P1_t, P3_t in turn, 0 ≤ t < k - j, as follows:
each GPU thread T_y in thread group Group_i, 0 ≤ y < blocksize, performs the following operations:
read the column vector A with index y in sub-region P1_t into the thread-local cache;
compute each element of the column vector D with index y in sub-region P3_t:
D[x] = D[x] - C_x · A (4)
where C_x denotes row x of the P2 region and (j+1) × blocksize ≤ x < m.
Further, in step S35, judge whether the currently processed block B_j,j is the lower-right corner block of sub-matrix H_i; if it is, the batch parallel LU decomposition ends; otherwise, set j = j + 1, go to step S31, and continue processing the next block.
Compared with the prior art, the GPU-based batch parallel LU decomposition method for small and medium-sized dense matrices of the invention has the following beneficial effects:
(1) the invention provides high-performance batch parallel LU decomposition of small and medium-sized dense matrices. Performance grows in an approximately linear trend as the batch size increases, with a peak close to 450 Gflop/s. Compared with the cuBLAS library functions provided by NVIDIA, the maximum speedup exceeds 10×;
(2) the method is of great significance for improving the simulation efficiency of complex systems. When implicitly solving models containing large batches of small ordinary differential equation systems, it accelerates the solution and effectively shortens simulation time.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present disclosure, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a batch parallel LU decomposition method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a matrix H and its internal sub-matrices according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the internal blocking of a sub-matrix according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of pivot selection in steps S312, S313 and S314 according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the P1, P2 and P3 regions processed during the calculations of steps S31, S33 and S34 according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of exchanging pivot columns in step S32 according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of processing the elements inside the P2 region in step S33 according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of processing the elements inside the P3 region in step S34 according to an embodiment of the present invention;
fig. 9 is a schematic diagram of experimental test effect.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In order to improve the batch parallel LU decomposition efficiency for small and medium-sized dense matrices, the invention provides a GPU-based batch parallel LU decomposition method for the small and medium-sized dense matrices of a simulation system. The basic idea of the invention is to use coalesced memory access and thread-local variables to cache parts of each sub-matrix, to update certain key data using out-of-order execution, and to defer vector outer-product updates, converting them into dense matrix-matrix operations; this effectively reduces the number of global memory accesses, raises the ratio of effective computation time, and greatly improves decomposition efficiency.
The batch LU decomposition method of the small and medium dense matrix based on the GPU for the simulation system, as shown in FIG. 1, comprises the following steps:
S1, at each time step, calculating the Jacobian matrix of each model unit of the simulation system, taking these matrices as sub-matrices, and splicing the sub-matrices into a large matrix H;
S2, partitioning the interior of each sub-matrix, and allocating a corresponding number of GPU threads to each sub-matrix according to the partition size, forming thread groups;
S3, in diagonal order from top left to bottom right, the GPU thread groups successively completing, for each diagonal block, the pivot selection for all columns in the block, the update of the blocks to the right in the block's row, the exchange of the block's internal columns with the pivot columns, and the update of all block data below the block's column and to its lower right.
In step S1, the model can be described directly in a programming language, for example as C expressions, and the Jacobian matrix of each model unit at every time step can be computed automatically by symbolic differentiation. For complex-system simulation, the solution process applies Newton's method to the n model units composing the system: at each time step the Jacobian matrix of every unit is computed, the sub-matrices are spliced into a two-dimensional large matrix H, and the subsequent steps perform batch parallel LU decomposition on H.
The Jacobian matrix of each unit is a small or medium-sized dense matrix of order m; the m-order matrices of the n model units are taken as sub-matrices H_i and spliced into the large matrix H as input, with i in the range [0, n-1].
Specifically, m is obtained as follows: the dimension of each sub-model currently to be solved is extended upward to an integer power of two, m (i.e., the Jacobian matrix is extended to order m). In the added rows and columns, the diagonal elements are set to 1 and all other elements are filled with 0. All the extended sub-matrices are arranged side by side, finally forming a two-dimensional large matrix H of dimensions m × N, where N = m × n, as shown in fig. 2.
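The padding and splicing of step S1 can be sketched in NumPy; this is an illustrative sketch, not the patented implementation, and the function names are ours:

```python
import numpy as np

def pad_to_pow2(J):
    """Pad a Jacobian to the next power-of-two order m: the added
    rows/columns are zero except for 1 on the diagonal."""
    d = J.shape[0]
    m = 1 << (d - 1).bit_length()   # smallest power of two >= d (d >= 1)
    H = np.eye(m)
    H[:d, :d] = J
    return H

def splice_batch(jacobians):
    """Arrange the n padded sub-matrices side by side into the
    m x N large matrix H, with N = m * n."""
    return np.hstack([pad_to_pow2(J) for J in jacobians])
```

Padding with an identity block keeps each extended matrix non-singular, so the added rows and columns pass through the LU decomposition trivially.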
In step S2, each sub-matrix H_i is partitioned internally, and according to the block size each sub-matrix H_i is allocated a corresponding number of GPU threads, forming a thread group Group_i.
When the order m of each sub-matrix is at least 16, each sub-matrix is partitioned internally; the block size blocksize can be chosen as either 8 or 16 and can be set by a parameter passed in at solution time. If the dimension of the sub-matrix is less than 16, the whole sub-matrix is treated as one block, in which case blocksize = m. After division, each sub-matrix contains k × k sub-blocks, where k = m ÷ blocksize; the blocking result is shown in fig. 3. By partitioning the interior of the sub-matrices, it is guaranteed that in subsequent processing a thread group Group_i can read the data of a whole block from global memory at one time, effectively reducing the number of global memory accesses.
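The blocking rule above can be sketched as follows; the parameter name `preferred` is an illustrative stand-in for the runtime parameter mentioned in the text:

```python
def choose_blocksize(m, preferred=8):
    """Blocking rule from the text: for m >= 16 use 8 or 16
    (selectable via a parameter); for m < 16 the whole
    sub-matrix is a single block, so blocksize = m."""
    return preferred if m >= 16 else m

def grid_size(m, blocksize):
    """Each sub-matrix contains k x k sub-blocks, k = m / blocksize
    (exact, since m is a power of two and blocksize is 8, 16, or m)."""
    return m // blocksize
```

For example, an order-64 sub-matrix blocked at 16 yields a 4 × 4 grid of sub-blocks.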
In step S3, LU decomposition is performed on all the sub-matrices in the large matrix H; the specific steps are:
S31, each thread group Group_i selects the diagonal block B_j,j with index j inside its sub-matrix H_i, determines the pivot columns corresponding to all columns in B_j,j, and updates all blocks to the right of B_j,j in its row; j is initialized to 0;
S32, each obtained pivot column is swapped with the corresponding column inside the diagonal block B_j,j;
S33, all blocks below the column of the diagonal block B_j,j are updated;
S34, all blocks to the lower right of the diagonal block B_j,j are updated;
S35, if the diagonal block B_j,j is the lower-right corner block of the sub-matrix H_i, the batch parallel LU decomposition ends and the large matrix H is output, in which every sub-matrix H_i is factored into a lower triangular matrix L_i and an upper triangular matrix U_i, used for the subsequent fast solution of linear equation systems; otherwise, set j = j + 1, go to step S31, and continue processing the next block.
In step S31, for the diagonal block B_j,j currently being processed, 0 ≤ j < k, run through all diagonal elements b_p,p of B_j,j (j × blocksize ≤ p < (j+1) × blocksize) and determine the index of the corresponding pivot column, specifically:
S311: initialize a one-dimensional array To_i of length m with To_i[l] = m, 0 ≤ l < m, where the array index l corresponds to the column index of sub-matrix H_i and To_i[l] records, for the column with index l, the index of its final pivot column;
S312: the thread group starts from the current block B_j,j and reads in parallel all elements of row p inside the block, storing them into thread-local caches; here p is a row index computed with respect to the sub-matrix H_i and is initialized to j × blocksize; in this step, because the row elements of block B_j,j are laid out contiguously in global memory, the GPU can read all the data for the thread group by issuing a single memory instruction, achieving coalesced memory access;
S313: the thread group continues to read in parallel all data of row p in the next block adjacent to the right of B_j,j, compares it with the data in the cache, and keeps the element with the larger absolute value together with its column index in the internal cache. This process iterates until the last block of the sub-matrix has been read; this step likewise achieves coalesced memory access;
S314: all elements cached in the thread group are compared, the column q containing the element of maximum absolute value is taken as the pivot column for column p, and To_i[p] = q is recorded;
The process of steps S312, S313 and S314 is shown in fig. 4;
S315: in row p of sub-matrix H_i, starting from column index j × blocksize, traverse all subsequent elements h_p,l; if To_i[l] > q, perform the following normalization:
h_p,l = h_p,l / h_p,q (1)
This step updates the elements of row p for the subsequent elimination process;
S316: elimination is performed on the P1 region in fig. 5;
specifically, the P1 region is the matrix of sub-matrix H_i between rows [p+1, (j+1) × blocksize - 1] and columns [j × blocksize, m-1]; for each element h_x,y in it, if To_i[y] > q, perform the following elimination:
h_x,y = h_x,y - h_x,q × h_p,y (2)
where p+1 ≤ x < (j+1) × blocksize and j × blocksize ≤ y < m.
S317: if p is the last row in block B_j,j, step S31 ends, at which point B_j,j contains the LU decomposition result L_j,j and U_j,j of the current block; otherwise, set p = p + 1, go to step S312, and continue processing the next row.
In step S32, the thread group Group_i simultaneously swaps all columns of the current diagonal block B_j,j with their corresponding pivot columns. As shown in fig. 6, for each column p in B_j,j, the p-th thread in the thread group exchanges the elements between column p and column To_i[p].
Because all columns in the diagonal block B_j,j are exchanged simultaneously, all elements in the column interval [j × blocksize, (j+1) × blocksize - 1] can be read and written with coalesced memory accesses, effectively reducing the number of GPU memory access instructions issued.
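The simultaneous column exchange of step S32 can be sketched as follows; the To values below are made-up illustrative pivot records, and each loop iteration stands in for one GPU thread of the group:

```python
import numpy as np

rng = np.random.default_rng(1)
m, blocksize, j = 16, 8, 0
H_i = rng.standard_normal((m, m))
H_ref = H_i.copy()   # kept only to verify the swaps below

# hypothetical pivot-column records for the first diagonal block:
# column p is to be exchanged with column To[p]
To = np.arange(m)
To[2], To[5] = 10, 13

# one "thread" per column of the diagonal block
for p in range(j * blocksize, (j + 1) * blocksize):
    if To[p] != p:
        H_i[:, [p, To[p]]] = H_i[:, [To[p], p]]
```

On the GPU the per-column swaps run concurrently; the sequential loop here gives the same result because the swap pairs are disjoint.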
in step S33, a tail update is performed on the P2 area in fig. 5, specifically:
the P2 region is located in the sub-matrix H i [ (j +1) x blocksize, m-1)]And [ j × Block size, (j +1) × Block size-1]Matrix between columns for each element thereinAs shown in fig. 7, the following operations are performed:
wherein (j +1) xblocksize is less than or equal to x < m, and j xblocksize is less than or equal to y < (j +1) xblocksize
As can be seen from equation (3), there is a causal relationship between columns in P2, and the update of the latter column depends on the results of all the columns in the front, so that it cannot be completely parallelized, and needs to be updated separately, and thus, it is decoupled from the P3 region in the back, so that the update of the P3 region can be completely parallelized;
in step S34, a tail update is performed on the P3 area in fig. 5, specifically:
the P3 region is located in the sub-matrix H i [ (j +1) x blocksize, m-1)]And row [ (j +1) xblocksize, m-1]A matrix between the columns; the P1 and P3 regions are further divided into a plurality of sub-regions according to blocksize, as shown in fig. 8; thread Group i These sub-regions are processed sequentially, for any of them P1 t ,P3 t ,0≤t<k-j, the following treatment is carried out:
thread Group i Each GPU thread T in (1) y ,0≤y<blocksize, performing the following operations:
read-in sub-region P1 t The y column vector A with the middle sequence number is cached in the thread;
compute sub-region P3 t Sequence number y each element in the vector D:
D[x]=D[x]-C x A (4)
wherein (j +1) xblocksize is less than or equal to x < m;
as can be seen from the formula (4), the vector a is used for calculating each element in the column D, and since the data of the vector a is already read into the local cache of the thread, multiple reading is effectively avoided, and the calculation time ratio is improved; meanwhile, since all the data in the previous P1 and P2 areas are processed, the matrix multiplication and addition operations can be completed in a block form completely and parallelly during the execution of the current step, and the calculation efficiency can be greatly improved.
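The equivalence that step S34 exploits (accumulated rank-1 outer-product updates equal a single dense matrix multiply) can be checked numerically; the region sizes below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
blocksize, trailing = 8, 24
P2 = rng.standard_normal((trailing, blocksize))   # multipliers below the diagonal block
P1 = rng.standard_normal((blocksize, trailing))   # normalized rows to the right
P3 = rng.standard_normal((trailing, trailing))    # trailing sub-matrix

# one rank-1 (vector outer-product) update per pivot column
P3_rank1 = P3.copy()
for p in range(blocksize):
    P3_rank1 -= np.outer(P2[:, p], P1[p, :])

# the same result as a single dense matrix multiply, as used in step S34
P3_gemm = P3 - P2 @ P1
```

Deferring the outer products and fusing them into one dense multiply is what lets the P3 update run as a fully parallel block operation.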
In step S35, it is judged whether the currently processed block B_j,j is the lower-right corner block of sub-matrix H_i. If it is, the batch parallel LU decomposition ends and the sub-matrix H_i contains the decomposition result L_i and U_i; otherwise, set j = j + 1, go to step S31, and continue processing the next block.
When the method finishes, the large matrix H is output, in which every sub-matrix H_i is composed of the triangular factors L_i and U_i, used for the subsequent fast batch solution of linear equation systems; the state variable values of all simulation units are then computed by Newton iteration, ensuring that the simulation advances quickly.
According to the method of the present invention, experimental tests of batch parallel LU decomposition of dense matrices were performed, as shown in fig. 9: the abscissa is the dimension of the large matrix, the ordinate is the data processing throughput, and the different sub-matrix dimensions, marked at the lower right, are represented by curves of corresponding colors. It can be seen that, whether the partitioning is 8 × 8 or 16 × 16, the LU decomposition throughput grows essentially linearly with the number of sub-matrices, which means that the GPU hardware is fully utilized during the execution of the method; it can also be seen that when the dimension of the large matrix reaches 8192 × 8192, the peak floating-point throughput reaches about 450 Gflop/s. Table 1 shows partial experimental comparisons against the execution efficiency of the cuBLAS library functions provided by NVIDIA: when the sub-matrix dimension is 64, the advantage of the method grows with the number of sub-matrices, and the maximum speedup reaches about 13×. Similar speedups are obtained for other sub-matrix dimensions.
TABLE 1. Partial comparison of processing speeds between the method of the invention and cuBLAS
While achieving a high degree of parallelism, the invention fully exploits the hardware features of coalesced memory access and local-variable caching, effectively hiding memory-access latency and increasing the fraction of time spent on computation. The method therefore has high computational efficiency, effectively reduces LU decomposition time, and can support implicit real-time parallel solving of simulation models containing millions of small units.
Those of ordinary skill in the art will understand that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.
Claims (10)
1. A batch LU decomposition method for small and medium-sized dense matrices, based on a GPU, for a simulation system, characterized by comprising the following steps:
S1, calculating the Jacobian matrix of each model unit of the simulation system at every time step, taking these matrices as sub-matrices, and splicing the sub-matrices into a large matrix H;
S2, partitioning the interior of each sub-matrix into blocks, and allocating a GPU thread group to each sub-matrix according to the block size;
S3, the GPU thread groups, proceeding in diagonal order from the top left to the bottom right, successively complete for each diagonal block: the pivot selection for all of its columns, the updating of the blocks to its right in the same row, the exchanging of the pivot columns with the corresponding columns inside the diagonal block, and the updating of all blocks below it and to its lower right, thereby completing the LU decomposition of all sub-matrices in the large matrix H.
2. The batch LU decomposition method according to claim 1, wherein in step S1, the simulation system has n model units, and the Jacobian matrix of each unit is a small dense matrix H_i of order m, with i in the range [0, n-1]. The order m is obtained as follows: the dimensions of all model units to be solved are extended up to m, the nearest integer power of 2 (i.e., each Jacobian matrix is extended to order m); in the added rows and columns, the diagonal elements are set to 1 and all other elements are filled with 0.
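A minimal NumPy sketch of the padding rule in this claim, shown for a single Jacobian (in the method, one common m is chosen so as to cover all model units; the function name is illustrative):

```python
import numpy as np

def pad_to_pow2(J):
    """Extend a d-order Jacobian to order m, the smallest power of two >= d;
    the added rows/columns have 1 on the diagonal and 0 elsewhere."""
    d = J.shape[0]
    m = 1 << max(d - 1, 0).bit_length()   # smallest power of two >= d
    H = np.eye(m, dtype=float)            # identity fill for the new rows/columns
    H[:d, :d] = J
    return H
```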
3. The batch LU decomposition method according to claim 1, wherein step S2 comprises:
S21, dividing the sub-matrix H_i into a number of blocks, so that in subsequent processing a thread group can read an entire block from global memory at once, reducing the number of global memory accesses;
S22, allocating a corresponding number of GPU threads to the sub-matrix H_i according to the block size, forming a thread group Group_i.
4. The batch LU decomposition method according to claim 3, wherein the method for dividing the sub-matrix H_i into blocks is as follows:
if m ≥ 16, the block size is set to 8 or 16;
if m < 16, the entire sub-matrix is treated as one block, and the block size is m.
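The block-size rule of this claim reduces to a one-line helper (the name is illustrative; 16 is picked arbitrarily here where the claim allows either 8 or 16):

```python
def choose_blocksize(m):
    """Claim 4: for m >= 16 the block size is 8 or 16; for m < 16 the whole
    sub-matrix is a single block of size m."""
    return 16 if m >= 16 else m
```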
5. The batch LU decomposition method according to claim 2, wherein step S3 specifically comprises:
S31, each thread group Group_i locates the diagonal block B_{j,j} with internal index j of its sub-matrix H_i, determines the pivot columns corresponding to all columns in B_{j,j}, and updates all blocks to the right of B_{j,j} in its row; where j is initialized to 0;
S32, exchanging the positions of all obtained pivot columns with the corresponding columns inside the diagonal block B_{j,j};
S33, updating all blocks below the diagonal block B_{j,j} in its column;
S34, updating all blocks to the lower right of the diagonal block B_{j,j};
S35, if the diagonal block B_{j,j} is the bottom-right block of the sub-matrix H_i, the batch parallel LU decomposition is finished and the large matrix H is output, in which each sub-matrix H_i consists of its factors L_i and U_i (the lower triangular and unit upper triangular factors), used for subsequent fast solving of the linear systems; otherwise, let j = j + 1, go to step S31, and continue processing the next block.
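The sweep S31–S35 can be mimicked sequentially in NumPy as a blocked, column-pivoted LU. This is a hedged reference sketch rather than the claimed GPU kernel: the To_i bookkeeping of claim 6 is replaced by immediate physical column swaps, m is assumed to be a multiple of blocksize, and no singularity handling is done.

```python
import numpy as np

def blocked_lu_colpiv(A, bs):
    """Blocked LU with column pivoting. For each diagonal block: panel
    factorization with pivot search and column exchange (S31/S32, including
    the P1 update), update of the block column below (S33, region P2), then
    the trailing-matrix update (S34, region P3). On return, A_in[:, perm]
    equals L @ U with L lower triangular and U unit upper triangular."""
    A = A.astype(float).copy()
    m = A.shape[0]
    perm = np.arange(m)
    for j in range(m // bs):
        c0, c1 = j * bs, (j + 1) * bs
        for p in range(c0, c1):                    # S31/S32: panel rows
            q = p + int(np.argmax(np.abs(A[p, p:])))
            A[:, [p, q]] = A[:, [q, p]]            # column exchange
            perm[[p, q]] = perm[[q, p]]
            A[p, p + 1:] /= A[p, p]                # row of the unit upper factor
            # P1: remaining rows of the diagonal block's row of blocks
            A[p + 1:c1, p + 1:] -= np.outer(A[p + 1:c1, p], A[p, p + 1:])
        for p in range(c0, c1):                    # S33: region P2 below the block
            A[c1:, p + 1:c1] -= np.outer(A[c1:, p], A[p, p + 1:c1])
        # S34: region P3, one rank-bs update (cf. formula (4))
        A[c1:, c1:] -= A[c1:, c0:c1] @ A[c0:c1, c1:]
    return A, perm
```

With bs = m the block loops degenerate and the routine reduces to the unblocked scalar algorithm.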
6. The batch LU decomposition method according to claim 5, wherein step S31 comprises:
S311: initializing a one-dimensional array To_i of dimension m, and letting To_i[l] = m for 0 ≤ l < m, where the array index l corresponds to a column number of the sub-matrix H_i, and To_i[l] records the index of the pivot column finally chosen for column l;
S312: the thread group, starting from the current block B_{j,j}, reads all elements of row p within the block in parallel and stores them in per-thread caches; where p is a row index within the sub-matrix H_i, initialized to j×blocksize; in this step, since the row elements of block B_{j,j} are arranged contiguously in global memory, the GPU needs only a single memory transaction to read all the data for the thread group, achieving the effect of coalesced memory access; where j×blocksize ≤ p < (j+1)×blocksize;
S313: the thread group continues to read in parallel all data of row p in the block adjacent to the right of B_{j,j}, compares them with the data in the cache, and stores the element with the larger absolute value, together with the index of its column, in the internal cache; this process iterates until the last block of the sub-matrix row has been read;
S314: comparing all elements cached in the thread group, taking the column q corresponding to the element with the maximum absolute value as the pivot column for column p, and updating To_i[p] = q;
S315: for the current row p of the matrix H_i, starting from column index j×blocksize, traversing all subsequent elements of the row; if To_i[l] > q, performing the following operation:
S316: for each element of the P1 region, if To_i[y] > q, performing the following elimination operation:
wherein the P1 region is the matrix of the sub-matrix H_i located between rows [p+1, (j+1)×blocksize-1] and columns [j×blocksize, m-1]; p+1 ≤ x < (j+1)×blocksize, j×blocksize ≤ y < m;
S317: if p is the last row in block B_{j,j}, step S31 ends, and B_{j,j} contains the LU decomposition results L_{j,j} and U_{j,j} of the current block; otherwise, let p = p + 1, go to step S312, and continue processing the next row.
8. The batch LU decomposition method according to claim 5, wherein in step S33, the P2 region is the matrix of the sub-matrix H_i located between rows [(j+1)×blocksize, m-1] and columns [j×blocksize, (j+1)×blocksize-1]; for each element therein, the following operation is performed:
wherein (j+1)×blocksize ≤ x < m and j×blocksize ≤ y < (j+1)×blocksize.
9. The batch LU decomposition method according to claim 5, wherein in step S34, the P3 region is the matrix of the sub-matrix H_i located between rows [(j+1)×blocksize, m-1] and columns [(j+1)×blocksize, m-1]; the P1 and P3 regions are further divided into a number of sub-regions according to blocksize, and the thread group Group_i processes each pair of sub-regions P1_t, P3_t, 0 ≤ t < k-j, in turn as follows:
each GPU thread T_y in the thread group Group_i, 0 ≤ y < blocksize, performs the following operations:
reading the column vector A with index y in the sub-region P1_t into the thread cache;
computing each element of the column vector D with index y in the sub-region P3_t:
D[x] = D[x] - C_x·A (4)
wherein (j+1)×blocksize ≤ x < m.
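Applying formula (4) over all sub-region pairs amounts to one matrix product D ← D − C·U_row. The sketch below mirrors the per-thread structure (one column index y per thread T_y), under the assumptions that m is a multiple of blocksize, the sub-regions P1_t here are the columns of the already-updated block row of B_{j,j}, and C is the block column of region P2; the function name is illustrative.

```python
import numpy as np

def update_p3(H, j, blocksize, m):
    """Trailing update of step S34: for every sub-region P3_t and every
    column index y, D[:, y] -= C @ A, where A is column y of P1_t (block
    row of the upper factor) and C is the block column below the diagonal
    block (region P2)."""
    r0 = (j + 1) * blocksize
    C = H[r0:m, j * blocksize:r0]                  # region P2 (not modified below)
    for t0 in range(r0, m, blocksize):             # sub-regions P3_t
        for y in range(blocksize):                 # one "thread" T_y per column
            A = H[j * blocksize:r0, t0 + y]        # column vector A of P1_t
            H[r0:m, t0 + y] -= C @ A               # D[x] = D[x] - C_x * A  (4)
    return H
```

The explicit loops are for clarity only; the whole update equals H[r0:, r0:] -= C @ H[j*blocksize:r0, r0:].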
10. The batch LU decomposition method according to claim 5, wherein in step S35, it is judged whether the currently processed block B_{j,j} is the bottom-right block of the sub-matrix H_i; if so, the batch parallel LU decomposition is finished; otherwise, let j = j + 1, go to step S31, and continue processing the next block.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210522339.2A CN114911619A (en) | 2022-05-13 | 2022-05-13 | Batch parallel LU decomposition method of small and medium-sized dense matrix based on GPU for simulation system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114911619A true CN114911619A (en) | 2022-08-16 |
Family
ID=82766173
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210522339.2A Pending CN114911619A (en) | 2022-05-13 | 2022-05-13 | Batch parallel LU decomposition method of small and medium-sized dense matrix based on GPU for simulation system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114911619A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115396065A (en) * | 2022-10-26 | 2022-11-25 | 南京邮电大学 | Low-delay decoding method for sparse random linear network coding |
CN117725348A (en) * | 2024-02-07 | 2024-03-19 | 蓝象智联(杭州)科技有限公司 | Thread management method and system in GPU computing large-scale array summation process |
CN117725348B (en) * | 2024-02-07 | 2024-05-10 | 蓝象智联(杭州)科技有限公司 | Thread management method and system in GPU computing large-scale array summation process |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||