CN114911619A - Batch parallel LU decomposition method of small and medium-sized dense matrix based on GPU for simulation system - Google Patents


Info

Publication number
CN114911619A
Authority
CN
China
Prior art keywords
matrix
block
sub
column
blocksize
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210522339.2A
Other languages
Chinese (zh)
Inventor
张雪松 (Zhang Xuesong)
闫昭 (Yan Zhao)
赫枫龄 (He Fengling)
雷新丽 (Lei Xinli)
马琳 (Ma Lin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University
Priority to CN202210522339.2A
Publication of CN114911619A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0877Cache access modes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5018Thread allocation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention realizes a high-performance GPU-based batch parallel LU decomposition method for the small and medium dense matrices of a simulation system, comprising the following steps: S1, calculating the Jacobian matrix of each model unit of the simulation system at every time step, taking these matrices as sub-matrices and splicing them into a large matrix H; S2, partitioning the interior of each sub-matrix into blocks, and allocating a GPU thread group to each sub-matrix according to the block size; S3, proceeding along the diagonal from top left to bottom right, each GPU thread group successively completes, for every diagonal block, the pivot selection for all of its columns, the update of the blocks to its right in the same block row, the exchange of its columns with the selected pivot columns, and the update of all block data below the block's column and below-right of the block, thereby completing the LU decomposition of all sub-matrices in the large matrix H. The performance of the invention grows in an approximately linear trend as the batch size increases; when implicitly solving models containing large batches of small ordinary differential equations, it accelerates the solution and effectively shortens simulation time.

Description

Batch parallel LU decomposition method of small and medium-sized dense matrix based on GPU for simulation system
Technical Field
The invention belongs to the cross-disciplinary field of modeling/simulation and parallel computing, and particularly relates to a GPU (Graphics Processing Unit)-based batch parallel LU decomposition (Lower-Upper decomposition) method for small and medium-sized dense matrices in complex system simulation.
Background
LU decomposition is among the most important computational steps in many engineering and scientific computing problems, with wide application in system simulation, machine learning, image processing, information encryption, high-speed information transmission, and other fields. With the continuous development of computer science, the volume of data to be processed grows exponentially, and processing experimental data efficiently and quickly on a computer has become a difficult problem. In many critical applications, what must be solved is the parallel solution of a large number of small matrices, that is, LU decomposition of a large batch of matrices completed under a single set of operation logic, rather than the serial solution of a few large-scale linear systems.
A complex system usually comprises a large number of small and medium nonlinear simulation model units that must be solved approximately in linear form; the most time-consuming step is the solution of linear equation systems, namely the LU decomposition of the model units' Jacobian matrices. Current GPU-based LU decomposition methods are mainly oriented to large matrices: they adopt a blocking scheme and scatter the processing tasks across many threads, i.e., they divide the computing task inside one large model. This approach is not suitable for small and medium models: because the Jacobian matrix is small, the time spent reading and writing matrix elements far exceeds the computation time, block-wise matrix operations cannot raise the ratio of effective computation time, and global-memory access latency becomes the main performance bottleneck. The smaller the matrix, the more serious the problem.
Disclosure of Invention
In order to solve the above technical problems, the present invention provides a GPU-based batch parallel LU decomposition method for the small and medium dense matrices of a simulation system, which can effectively reduce the time consumed by the decomposition process and improve operational efficiency.
The invention provides a GPU-based batch LU decomposition method for small and medium-sized dense matrices of a simulation system, comprising the following steps:
S1, calculating the Jacobian matrix of each model unit of the simulation system at every time step, taking these matrices as sub-matrices and splicing them into a large matrix H;
S2, partitioning the interior of each sub-matrix into blocks, and allocating a GPU thread group to each sub-matrix according to the block size;
S3, proceeding along the diagonal from top left to bottom right, each GPU thread group successively completes, for every diagonal block, the pivot selection for all of its columns, the update of the blocks to its right in the same block row, the exchange of its columns with the selected pivot columns, and the update of all block data below the block's column and below-right of the block, thereby completing the LU decomposition of all sub-matrices in the large matrix H.
Further, in step S1, the simulation system has n model units, and each Jacobian matrix is a small or medium dense matrix H_i of order m, with i in the range [0, n-1]. The order m is obtained as follows: the dimensions of all model units to be solved are extended upward to the nearest integer power of 2, m (i.e., the Jacobian matrix is padded to order m); in the added rows and columns, the diagonal elements are set to 1 and all other entries are filled with 0.
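A minimal NumPy sketch of this padding rule (the helper name and the use of NumPy are illustrative assumptions; the patent operates on GPU memory):

```python
import numpy as np

def pad_to_pow2(J):
    """Pad a d x d Jacobian up to the next power-of-two order m.
    Per step S1: added rows/columns carry 1 on the diagonal and 0
    elsewhere, so the padded matrix remains nonsingular."""
    d = J.shape[0]
    m = 1
    while m < d:          # smallest power of two >= d
        m *= 2
    H_i = np.zeros((m, m))
    H_i[:d, :d] = J
    for l in range(d, m):
        H_i[l, l] = 1.0   # identity padding
    return H_i

J = np.array([[2.0, 1.0, 0.0],
              [4.0, 3.0, 1.0],
              [0.0, 1.0, 5.0]])
H_i = pad_to_pow2(J)      # order 3 is padded to m = 4
```

Padding every sub-matrix to a common power-of-two order lets all thread groups run the same blocked kernel without per-matrix size logic.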
Further, step S2 comprises:
S21, dividing the sub-matrix H_i into a number of sub-blocks, so that in subsequent processing the thread group can read the data of an entire block from global memory in one pass, reducing the number of global-memory accesses;
S22, allocating a corresponding number of GPU threads to the sub-matrix H_i according to the block size, forming a thread group Group_i.
Further, the method for dividing the sub-matrix H_i into a number of sub-blocks is:
if m is greater than or equal to 16, the block size (blocksize) is set to 8 or 16;
if m is smaller than 16, the entire sub-matrix is treated as one block, in which case blocksize = m.
Further, step S3 specifically comprises:
S31, each thread group Group_i selects the diagonal block B_{j,j} with index j inside sub-matrix H_i, determines the pivot columns corresponding to all columns in B_{j,j}, and updates all blocks to the right in the row of B_{j,j}; j is initialized to 0;
S32, exchanging all the obtained pivot columns with the corresponding columns inside the diagonal block B_{j,j};
S33, updating all blocks below the column of diagonal block B_{j,j};
S34, updating all blocks below-right of diagonal block B_{j,j};
S35, if diagonal block B_{j,j} is the lower-right corner block of sub-matrix H_i, the batch parallel LU decomposition is finished and the large matrix H is output, in which every sub-matrix H_i is composed of a lower triangular matrix L_i and a unit upper triangular matrix U_i for the subsequent fast solution of linear equation systems; otherwise, set j = j + 1, go to step S31, and continue processing the next block.
Further, step S31 comprises:
S311: initialize a one-dimensional array To_i of dimension m, setting To_i[l] = m for 0 ≤ l < m, where the array index l corresponds to a column number of sub-matrix H_i and To_i[l] records the sequence number of the final pivot column for the column with sequence number l;
S312: starting from the current block B_{j,j}, the thread group reads all elements of row p inside the block in parallel and stores them in per-thread caches; here p is a row index within sub-matrix H_i, initialized to j×blocksize, with j×blocksize ≤ p < (j+1)×blocksize; in this step, because the row elements of block B_{j,j} are laid out contiguously in global memory, the GPU can read all the data for the thread group by issuing a single memory instruction, achieving coalesced memory access;
S313: the thread group continues to read, in parallel, all the data of row p in the next block to the right of B_{j,j}, compares them with the cached data, and keeps the element with the larger absolute value together with the sequence number of its column in the internal cache; this process iterates until the last block of the sub-matrix has been read;
S314: compare all elements cached in the thread group, take the column q corresponding to the element with the maximum absolute value as the pivot column for column p, and update To_i[p] = q;
S315: in row p of sub-matrix H_i, starting from column index j×blocksize, traverse all subsequent elements h_{p,l} (j×blocksize ≤ l < m); if To_i[l] > q, perform the following operation:

h_{p,l} = h_{p,l} / h_{p,q}  (1)

where h_{p,q} is the true pivot corresponding to element h_{p,l};
S316: for each element h_{x,y} of the P1 region, if To_i[y] > q, perform the following elimination operation:

h_{x,y} = h_{x,y} - h_{x,q} × h_{p,y}  (2)

where the P1 region is the matrix in sub-matrix H_i between rows [p+1, (j+1)×blocksize-1] and columns [j×blocksize, m-1]; p+1 ≤ x < (j+1)×blocksize, j×blocksize ≤ y < m;
S317: if p is the last row in block B_{j,j}, step S31 ends, and B_{j,j} now contains the LU decomposition result L_{j,j} and U_{j,j} of the current block; otherwise, set p = p + 1, go to step S312, and continue processing the next row.
Further, in step S32, for any column p in B_{j,j}, the p-th thread in the thread group exchanges the elements between column col_p and column col_{To_i[p]}.
Further, in step S33, let the P2 region be the matrix in sub-matrix H_i between rows [(j+1)×blocksize, m-1] and columns [j×blocksize, (j+1)×blocksize-1]; for each element h_{x,y} therein, perform the following operation:

h_{x,y} = h_{x,y} - Σ_{t=j×blocksize}^{y-1} h_{x,t} × h_{t,y}  (3)

where (j+1)×blocksize ≤ x < m and j×blocksize ≤ y < (j+1)×blocksize.
Further, in step S34, let the P3 region be the matrix in sub-matrix H_i between rows [(j+1)×blocksize, m-1] and columns [(j+1)×blocksize, m-1]; the P1 and P3 regions are further divided into a number of sub-regions according to blocksize, and the thread group Group_i processes each pair of sub-regions P1_t, P3_t (0 ≤ t < k-j) in turn as follows:
each GPU thread T_y (0 ≤ y < blocksize) in thread group Group_i performs the following operations:
read the column vector A with sequence number y in sub-region P1_t into the thread's cache;
compute every element of the column vector D with sequence number y in sub-region P3_t:

D[x] = D[x] - C_x · A  (4)

where (j+1)×blocksize ≤ x < m and C_x denotes the x-th row of the P2 region.
Further, in step S35, it is judged whether the currently processed block B_{j,j} is the lower-right corner block of sub-matrix H_i; if so, the batch parallel LU decomposition ends; otherwise, set j = j + 1, go to step S31, and continue processing the next block.
Compared with the prior art, the GPU-based batch parallel LU decomposition method for small and medium dense matrices of the invention has the following beneficial effects:
(1) the invention provides high-performance batch parallel LU decomposition of small and medium dense matrices. Its performance grows in an approximately linear trend as the batch size increases, with a peak close to 450 Gflops/s; compared with the CUBLAS library functions provided by NVIDIA, the maximum speedup exceeds 10×;
(2) the method is of great significance for improving the simulation efficiency of complex systems: when implicitly solving models containing large batches of small ordinary differential equations, it accelerates the solution and effectively shortens simulation time.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a batch parallel LU decomposition method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a matrix H and its internal sub-matrices according to an embodiment of the present invention;
FIG. 3 is a block diagram of an internal sub-matrix according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating the selection of pivot elements in steps S312, S313 and S314 according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the P1, P2 and P3 regions processed in the calculations of steps S31, S33 and S34 according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating the exchange of pivot columns in step S32 according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating the processing of the elements inside the P2 region in step S33 according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating the processing of the elements inside the P3 region in step S34 according to an embodiment of the present invention;
fig. 9 is a schematic diagram of experimental test effect.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In order to improve the batch parallel LU decomposition efficiency for small and medium dense matrices, the invention provides a GPU-based batch parallel LU decomposition method for the small and medium dense matrices of a simulation system. The basic idea of the invention is to use coalesced memory access and local variables to cache partial data of a sub-matrix, to use out-of-order execution to update a small amount of key data, and to defer the vector-based outer-product operations and convert them into dense matrix-based multiplications, thereby effectively reducing the number of global-memory accesses, raising the ratio of effective computation time, and greatly improving decomposition efficiency.
The batch LU decomposition method of the small and medium dense matrix based on the GPU for the simulation system, as shown in FIG. 1, comprises the following steps:
s1, calculating Jacobian matrixes of model units of the simulation system at each time step, taking the matrixes as sub-matrixes, and splicing the sub-matrixes into a large matrix H;
s2, partitioning the interior of each sub-matrix, and distributing a corresponding number of GPU threads to each sub-matrix according to the size of the partitions to form thread groups;
s3, the GPU thread groups finish the selection of the main elements of all columns in each diagonal block, the updating of the block on the right side of the row where the diagonal block is located, the exchange of the columns in the diagonal block and the main element columns, and the updating of all block data below the column where the diagonal block is located and all block data below the right side of the diagonal block in turn from top left to bottom right according to the diagonal sequence.
In step S1, the model can be described directly in a language, for example as C expressions, and the Jacobian matrix of a model unit at each time step can be computed automatically by symbolic differentiation. For the simulation of a complex system, during the solution process Newton's method is applied to the n model units that form the system: at each time step the Jacobian matrix of every unit is calculated, the sub-matrices are spliced into a two-dimensional large matrix H, and H is subjected to batch parallel LU decomposition in the following steps.
The Jacobian matrix of each unit is a small or medium dense matrix of order m; the m-order matrices of the n model units are used as sub-matrices H_i and spliced into a large matrix H as input, with i in the range [0, n-1].
Specifically, m is obtained as follows: the dimensions of all sub-simulation models to be solved are extended upward to the nearest integer power of 2, m (i.e., the Jacobian matrix is padded to order m); in the added rows and columns, the diagonal elements are set to 1 and all other entries are filled with 0. All the padded sub-matrices are arranged one after another, finally forming a two-dimensional large matrix H of dimensions m × N, where N = m × n, as shown in fig. 2.
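The splicing just described can be sketched in NumPy as follows (function name illustrative; the actual layout in GPU memory may differ, this only shows the m × N side-by-side arrangement):

```python
import numpy as np

def build_batch_matrix(subs):
    """Splice n padded m x m sub-matrices H_i side by side into the
    two-dimensional batch matrix H of shape m x N, with N = m * n."""
    m = subs[0].shape[0]
    n = len(subs)
    H = np.empty((m, m * n))
    for i, Hi in enumerate(subs):
        H[:, i * m:(i + 1) * m] = Hi   # sub-matrix i occupies columns [i*m, (i+1)*m)
    return H

subs = [np.eye(4) * (i + 1) for i in range(3)]   # three toy 4x4 sub-matrices
H = build_batch_matrix(subs)                      # shape (4, 12)
```

Because all sub-matrices share the order m, each thread group can locate its sub-matrix by a fixed column offset i*m.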
In step S2, each sub-matrix H_i is partitioned internally, and a corresponding number of GPU threads is allocated to each sub-matrix H_i according to the block size, forming a thread group Group_i.
When the order m of each sub-matrix is greater than or equal to 16, the interior of each sub-matrix is partitioned; the block size (blocksize) can be chosen as either 8 or 16, determined by a parameter passed in at solution time. If the dimension of the sub-matrix is less than 16, the whole sub-matrix is treated as one block, in which case blocksize = m. After division, each sub-matrix contains k × k sub-blocks, where k = m ÷ blocksize; the blocking result is shown in fig. 3. By blocking the interior of the sub-matrices, it is guaranteed that in subsequent processing the thread group Group_i can read the data of an entire block from global memory in one pass, effectively reducing the number of global-memory accesses.
In step S3, LU decomposition is performed on all the sub-matrices in the large matrix H; the specific steps are:
S31, each thread group Group_i selects the diagonal block B_{j,j} with index j inside sub-matrix H_i, determines the pivot columns corresponding to all columns in B_{j,j}, and updates all blocks to the right in the row of B_{j,j}; j is initialized to 0;
S32, exchanging all the obtained pivot columns with the corresponding columns inside the diagonal block B_{j,j};
S33, updating all blocks below the column of diagonal block B_{j,j};
S34, updating all blocks below-right of diagonal block B_{j,j};
S35, if diagonal block B_{j,j} is the lower-right corner block of sub-matrix H_i, the batch parallel LU decomposition is finished and the large matrix H is output, in which every sub-matrix H_i is composed of a lower triangular matrix L_i and a unit upper triangular matrix U_i for the subsequent fast solution of linear equation systems; otherwise, set j = j + 1, go to step S31, and continue processing the next block.
In step S31, for the diagonal block B_{j,j} currently to be processed (0 ≤ j < k), traverse all diagonal elements b_{p,p} in B_{j,j} (j×blocksize ≤ p < (j+1)×blocksize) and determine the sequence number of the corresponding pivot, specifically:
S311: initialize a one-dimensional array To_i of dimension m, setting To_i[l] = m for 0 ≤ l < m, where the array index l corresponds to a column number of sub-matrix H_i and To_i[l] records the sequence number of the final pivot column for the column with sequence number l;
S312: starting from the current block B_{j,j}, the thread group reads all elements of row p inside the block in parallel and stores them in per-thread caches; here p is a row index within sub-matrix H_i, initialized to j×blocksize; in this step, because the row elements of block B_{j,j} are laid out contiguously in global memory, the GPU can read all the data for the thread group by issuing a single memory instruction, achieving coalesced memory access;
S313: the thread group continues to read, in parallel, all the data of row p in the next block to the right of B_{j,j}, compares them with the cached data, and keeps the element with the larger absolute value together with the sequence number of its column in the internal cache; this process iterates until the last block of the sub-matrix has been read; this step likewise achieves coalesced memory access;
S314: compare all elements cached in the thread group, take the column q corresponding to the element with the maximum absolute value as the pivot column for column p, and update To_i[p] = q;
The process of steps S312, S313 and S314 is shown in fig. 4;
S315: in row p of sub-matrix H_i, starting from column index j×blocksize, traverse all subsequent elements h_{p,l} (j×blocksize ≤ l < m); if To_i[l] > q, perform the following operation:

h_{p,l} = h_{p,l} / h_{p,q}  (1)

where h_{p,q} is the true pivot corresponding to element h_{p,l}, as shown in fig. 4; this step updates the elements of row p for the subsequent elimination process;
S316: elimination is carried out on the P1 region in fig. 5; specifically, the P1 region is the matrix in sub-matrix H_i between rows [p+1, (j+1)×blocksize-1] and columns [j×blocksize, m-1]; for each element h_{x,y} therein, if To_i[y] > q, perform the following elimination operation:

h_{x,y} = h_{x,y} - h_{x,q} × h_{p,y}  (2)

where p+1 ≤ x < (j+1)×blocksize and j×blocksize ≤ y < m;
S317: if p is the last row in block B_{j,j}, step S31 ends, and B_{j,j} now contains the LU decomposition result L_{j,j} and U_{j,j} of the current block; otherwise, set p = p + 1, go to step S312, and continue processing the next row.
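The pivot-column scan of steps S312 to S314 can be sketched serially as follows (an illustrative NumPy stand-in: the per-thread caches and coalesced block reads of the GPU version are reduced to block-wise segment scans, and helper names are assumptions):

```python
import numpy as np

def select_pivot_column(Hi, To, p, j, blocksize):
    """Scan row p of sub-matrix Hi over columns [j*blocksize, m),
    block by block as in S312/S313, and record the column q of the
    max-magnitude element in To[p] (step S314)."""
    m = Hi.shape[0]
    best_val, best_q = -1.0, -1
    for c0 in range(j * blocksize, m, blocksize):   # one coalesced block read each
        seg = Hi[p, c0:c0 + blocksize]
        k = int(np.argmax(np.abs(seg)))
        if abs(seg[k]) > best_val:
            best_val, best_q = abs(seg[k]), c0 + k
    To[p] = best_q
    return best_q

Hi = np.array([[1.0, -7.0, 2.0, 0.5],
               [0.0,  3.0, 1.0, 0.0],
               [5.0,  0.0, 4.0, 1.0],
               [2.0,  1.0, 0.0, 6.0]])
To = np.full(4, 4)                                  # To_i initialized to m (S311)
q = select_pivot_column(Hi, To, p=0, j=0, blocksize=2)
```

In the GPU version each thread holds one candidate from each block segment, so the final reduction in S314 happens entirely in registers.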
In step S32, the thread group Group_i simultaneously swaps all columns inside the current diagonal block B_{j,j} with their corresponding pivot columns. As shown in fig. 6, for any column p in B_{j,j}, the p-th thread in the thread group exchanges the elements between column col_p and column col_{To_i[p]}.
Because all columns of diagonal block B_{j,j} are exchanged simultaneously, all elements in the column interval [j×blocksize, (j+1)×blocksize-1] can be read and written with coalesced memory accesses, effectively reducing the number of GPU memory access instructions issued.
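A serial sketch of this column exchange (in the parallel version one GPU thread handles one column p; here the swaps are simply applied in order):

```python
import numpy as np

def swap_pivot_columns(Hi, To, j, blocksize):
    """Step S32 sketch: exchange each column p of diagonal block B_{j,j}
    with its recorded pivot column To[p]."""
    for p in range(j * blocksize, (j + 1) * blocksize):
        q = To[p]
        if q != p:
            Hi[:, [p, q]] = Hi[:, [q, p]]   # swap full columns p and q

Hi = np.array([[1.0, -7.0, 2.0],
               [0.0,  3.0, 1.0],
               [5.0,  0.0, 4.0]])
To = np.array([1, 1, 3])        # column 0's pivot column is column 1
swap_pivot_columns(Hi, To, j=0, blocksize=1)
```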
In step S33, a trailing update is performed on the P2 region in fig. 5, specifically:
the P2 region is the matrix in sub-matrix H_i between rows [(j+1)×blocksize, m-1] and columns [j×blocksize, (j+1)×blocksize-1]; for each element h_{x,y} therein, as shown in fig. 7, perform the following operation:

h_{x,y} = h_{x,y} - Σ_{t=j×blocksize}^{y-1} h_{x,t} × h_{t,y}  (3)

where (j+1)×blocksize ≤ x < m and j×blocksize ≤ y < (j+1)×blocksize.
As can be seen from equation (3), there is a causal dependency between the columns of P2: the update of a later column depends on the results of all earlier columns, so this update cannot be fully parallelized and must be performed separately; in turn, this decouples it from the subsequent P3 region, so that the update of the P3 region can be fully parallelized.
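Under the assumption that the P2 update is the standard panel solve of blocked LU (each column corrected by all earlier columns of the block, matching the dependency just described; region boundaries taken from the text), a serial sketch:

```python
import numpy as np

def update_P2(Hi, j, blocksize):
    """Step S33 sketch (assumed standard blocked-LU panel solve):
    update the panel below diagonal block B_{j,j} column by column.
    Column y depends on all earlier columns of the block, so the
    outer loop is strictly ordered."""
    m = Hi.shape[0]
    r0 = (j + 1) * blocksize            # first row of P2
    c0 = j * blocksize                  # first column of P2
    for y in range(c0, c0 + blocksize):
        for t in range(c0, y):          # causal dependency on earlier columns
            Hi[r0:m, y] -= Hi[r0:m, t] * Hi[t, y]

Hi = np.array([[1.0, 2.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0],
               [3.0, 4.0, 0.0, 0.0],
               [5.0, 6.0, 0.0, 0.0]])
update_P2(Hi, j=0, blocksize=2)
```

Only the inner axpy over rows is parallel here; the outer loop order is what the text calls the causal relationship between columns.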
In step S34, a trailing update is performed on the P3 region in fig. 5, specifically:
the P3 region is the matrix in sub-matrix H_i between rows [(j+1)×blocksize, m-1] and columns [(j+1)×blocksize, m-1]; the P1 and P3 regions are further divided into a number of sub-regions according to blocksize, as shown in fig. 8; the thread group Group_i processes these sub-regions in turn, and for any pair P1_t, P3_t (0 ≤ t < k-j) proceeds as follows:
each GPU thread T_y (0 ≤ y < blocksize) in thread group Group_i performs the following operations:
read the column vector A with sequence number y in sub-region P1_t into the thread's cache;
compute every element of the column vector D with sequence number y in sub-region P3_t:

D[x] = D[x] - C_x · A  (4)

where (j+1)×blocksize ≤ x < m and C_x denotes the x-th row of the P2 region;
As can be seen from equation (4), the vector A is used in the calculation of every element of column D; since the data of vector A have already been read into the thread's local cache, repeated reads are effectively avoided and the ratio of computation time is improved. Meanwhile, since all data in the preceding P1 and P2 regions have been processed, the matrix multiply-add operations of the current step can be completed fully in parallel in block form, greatly improving computational efficiency.
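Equation (4), applied over all columns of a sub-region, amounts to a dense matrix multiply-subtract; a NumPy sketch of the whole P3 update (region boundaries taken from the text, written as one multiply rather than per-thread columns):

```python
import numpy as np

def update_P3(Hi, j, blocksize):
    """Step S34 sketch: trailing update of the P3 region.
    Per equation (4), D[x] = D[x] - C_x . A with A a cached column of
    P1 and C_x a row of P2; over all columns this is P3 -= P2 @ P1."""
    m = Hi.shape[0]
    r0 = (j + 1) * blocksize
    c0 = j * blocksize
    P2 = Hi[r0:m, c0:c0 + blocksize]    # rows below the block, block columns
    P1 = Hi[c0:c0 + blocksize, r0:m]    # block rows, columns to the right
    Hi[r0:m, r0:m] -= P2 @ P1

Hi = np.array([[1.0, 0.0,  1.0,  2.0],
               [0.0, 1.0,  3.0,  4.0],
               [1.0, 1.0, 10.0, 10.0],
               [2.0, 0.0, 10.0, 10.0]])
update_P3(Hi, j=0, blocksize=2)
```

Expressing the update as one dense multiply is exactly why deferring it until the whole block column is ready pays off on the GPU.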
In step S35, it is judged whether the currently processed block B_{j,j} is the lower-right corner block of sub-matrix H_i; if so, the batch parallel LU decomposition ends, and sub-matrix H_i contains the decomposition results L_i and U_i; otherwise, set j = j + 1, go to step S31, and continue processing the next block.
When the method finishes, the large matrix H is output, in which every sub-matrix H_i is composed of L_i and U_i, a lower triangular matrix and a unit upper triangular matrix, for the subsequent fast batch solution of linear equation systems; the state-variable values of all simulation units are then calculated by Newton iteration, ensuring that the simulation advances quickly.
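For reference, the factorization each thread group ultimately produces can be checked against a plain serial LU with column pivoting; the sketch below is illustrative, not the patent's blocked GPU kernel (eager column swaps replace the deferred To_i bookkeeping, and the conventional unit-lower-L form is used), and it verifies that the column-permuted matrix reconstructs from its factors:

```python
import numpy as np

def lu_column_pivot(A):
    """Serial reference: factor A so that A[:, perm] = L @ U, with
    column (not row) pivoting as in the patent. L is unit lower
    triangular, U upper triangular; perm records the column order."""
    m = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(m)
    perm = np.arange(m)
    for p in range(m):
        q = p + int(np.argmax(np.abs(U[p, p:])))   # pivot column in row p
        if q != p:
            U[:, [p, q]] = U[:, [q, p]]
            perm[[p, q]] = perm[[q, p]]
        for x in range(p + 1, m):                  # eliminate below row p
            L[x, p] = U[x, p] / U[p, p]
            U[x, :] -= L[x, p] * U[p, :]
    return L, np.triu(U), perm

A = np.array([[0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0],
              [4.0, 1.0, 3.0]])
L, U, perm = lu_column_pivot(A)
```

Row pivoting would work equally well mathematically; the text's choice of column pivoting keeps each pivot search inside a single contiguous row, which is what enables the coalesced block reads of steps S312 and S313.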
According to the method of the present invention, experimental tests of batch parallel LU decomposition of dense matrices were performed, as shown in fig. 9: the abscissa is the dimension of the large matrix, the ordinate is the data throughput, and the legend in the lower-right corner marks the different sub-matrix dimensions, represented by the correspondingly colored curves. It can be seen that, whether blocking is performed as 8 × 8 or 16 × 16, the LU decomposition throughput increases essentially linearly with the number of sub-matrices, meaning that the GPU hardware is fully utilized during execution of the method; it can also be seen that when the dimension of the large matrix reaches 8192 × 8192, the peak floating-point throughput reaches about 450 Gflops/s. Table 1 shows partial comparison results against the execution efficiency of the CUBLAS library functions provided by NVIDIA: when the sub-matrix dimension is 64, the advantage of the method grows with the number of sub-matrices, with a maximum speedup of about 13×. Similar speedup results are obtained for other sub-matrix dimensions.
Table 1: Comparison of partial processing-speed results between the method of the invention and cuBLAS
While achieving a high degree of parallelism, the invention makes full use of the hardware features of coalesced memory access and local-variable caching, effectively hiding memory access latency and improving the ratio of computation to memory access. The method therefore has high computational efficiency, can effectively reduce LU decomposition time, and provides the capability of supporting real-time parallel implicit solution of million-scale small simulation models.
Those of ordinary skill in the art will understand that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims (10)

1. A batch LU decomposition method of a small and medium-sized dense matrix based on a GPU for a simulation system is characterized by comprising the following steps:
S1, calculating the Jacobian matrix of each model unit of the simulation system at each time step, taking these matrices as sub-matrices, and splicing the sub-matrices into a large matrix H;
S2, partitioning the interior of each sub-matrix, and allocating a GPU thread group to each sub-matrix according to the partition size;
S3, in diagonal order from the top left to the bottom right, the GPU thread groups successively complete, for each diagonal block, the selection of principal elements for all of its columns, the update of the blocks to the right in its row, the exchange of its internal columns with the principal element columns, and the update of all blocks below it and to its lower right, thereby completing the LU decomposition of all sub-matrices in the large matrix H.
2. The batch LU decomposition method according to claim 1, wherein in step S1, the simulation system has n model units, and each Jacobian matrix is an m-th order small dense matrix H_i, with i in the range [0, n-1], where m is obtained as follows: the dimension of each model unit currently to be solved is extended upward to m, the next integer power of 2 (i.e. the Jacobian matrix is extended to order m); in the added rows and columns, the diagonal elements are set to 1 and all other elements are filled with 0.
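The power-of-two padding of claim 2 can be sketched as follows (a hypothetical helper, not code from the patent; identity padding on the new diagonal keeps the extended matrix nonsingular and leaves the solution in the original variables unchanged):

```python
import numpy as np

def pad_jacobian(J):
    """Extend a d-by-d Jacobian to the next power-of-two order m.

    Added rows/columns are zero except for a 1 on the diagonal,
    as described in claim 2. Helper name is illustrative.
    """
    d = J.shape[0]
    m = 1
    while m < d:        # smallest power of two not less than d
        m *= 2
    H = np.eye(m)       # identity supplies the 1s on the new diagonal
    H[:d, :d] = J       # original Jacobian in the top-left corner
    return H
```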
3. The batch LU decomposition method according to claim 1, wherein step S2 comprises:
S21, dividing the sub-matrix H_i into a number of sub-blocks, so that in subsequent processing a thread group can read the data of an entire sub-block from global memory at one time, reducing the number of global memory accesses;
S22, according to the block size, allocating a corresponding number of GPU threads to the sub-matrix H_i to form a thread group Group_i.
4. The batch LU decomposition method of claim 3, wherein the sub-matrix H_i is divided into sub-blocks as follows:
if m is greater than or equal to 16, the block size blocksize is set to 8 or 16;
if m is smaller than 16, the entire sub-matrix is treated as one block, and the block size is m.
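A minimal sketch of this block-size rule (the choice between 8 and 16 is a tuning decision; 16 is picked here arbitrarily):

```python
def choose_blocksize(m):
    """Block-size rule of claim 4: 8 or 16 when m >= 16 (16 chosen
    here as an illustrative default), otherwise the whole sub-matrix
    is one block of size m."""
    return 16 if m >= 16 else m
```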
5. The batch LU decomposition method according to claim 2, wherein step S3 specifically comprises:
S31, each thread group Group_i locates in the sub-matrix H_i the diagonal block B_{j,j} with internal index j, determines the principal element columns corresponding to all the columns in B_{j,j}, and updates all the blocks in the row to the right of B_{j,j}, where j is initialized to 0;
S32, exchanging the positions of all the obtained principal element columns with the corresponding columns inside the diagonal block B_{j,j};
S33, updating all the blocks below the column of the diagonal block B_{j,j};
S34, updating all the blocks to the lower right of the diagonal block B_{j,j};
S35, if the diagonal block B_{j,j} is the bottom-right block of the sub-matrix H_i, the batch parallel LU decomposition is finished and the large matrix H is output, in which each sub-matrix H_i is composed of L_i and U_i, the unit lower triangular matrix and the unit upper triangular matrix, used for the subsequent fast solution of linear equation systems; otherwise, let j = j + 1, go to step S31, and continue processing the next block.
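Ignoring blocking, batching, and the GPU thread organisation, steps S31–S35 can be sketched as an unblocked CPU reference in NumPy. The Crout-style layout (L keeps the pivots, U gets a unit diagonal, matching the division of the pivot row by the pivot) and the row-wise pivot search are an interpretation of the published text, not a verbatim implementation:

```python
import numpy as np

def lu_column_pivot(H):
    """Unblocked CPU analogue of steps S31-S35: at each step the pivot
    is the largest |element| in the current row, the pivot column is
    swapped in, the pivot row is divided by the pivot (so U gets a
    unit diagonal, Crout-style), and the trailing part is eliminated.
    The blocked, batched GPU organisation of the patent is not modelled."""
    A = np.array(H, dtype=float)
    m = A.shape[0]
    perm = list(range(m))
    for p in range(m):
        q = p + int(np.argmax(np.abs(A[p, p:])))   # principal-element column
        A[:, [p, q]] = A[:, [q, p]]                # column exchange
        perm[p], perm[q] = perm[q], perm[p]
        A[p, p + 1:] /= A[p, p]                    # divide row by the pivot
        # eliminate the trailing submatrix
        A[p + 1:, p + 1:] -= np.outer(A[p + 1:, p], A[p, p + 1:])
    return A, perm  # A holds L (lower, with pivots) and unit-upper U in place
```

With factors extracted as L = tril(A) and U = triu(A, 1) + I, the identity H[:, perm] = L·U holds.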
6. The batch LU decomposition method of claim 5, wherein step S31 comprises:
S311: initialize a one-dimensional array To_i of dimension m and let To_i[l] = m for 0 ≤ l < m, where the array index l corresponds to a column number of the sub-matrix H_i, and To_i[l] records the sequence number of the final principal element column for the column with sequence number l;
S312: the thread group reads in parallel all elements of row p of the current block B_{j,j} and stores them in an in-thread cache, where p is a row index computed with respect to the sub-matrix H_i, initialized to j × blocksize, and satisfying j × blocksize ≤ p < (j+1) × blocksize; in this step, because the row elements of block B_{j,j} are stored contiguously in global memory, the GPU can read all the data needed by the thread group by issuing a single instruction, achieving coalesced memory access;
S313: the thread group continues to read in parallel all the data of row p in the next block adjacent to the right of B_{j,j}, compares it with the data in the cache, and stores the element with the larger absolute value, together with the sequence number of its column, in the in-thread cache; this process iterates until the last block of the sub-matrix has been read;
S314: compare all the elements cached in the thread group, take the column q corresponding to the element with the maximum absolute value as the principal element column corresponding to column p, and update To_i[p] = q;
S315: for the current row p of the matrix H_i, traverse all subsequent elements h^i_{p,l} in the row starting from column index j × blocksize; if To_i[l] > q, perform the following operation:
h^i_{p,l} = h^i_{p,l} / h^i_{p,q}    (1)
where h^i_{p,q} is the true principal element corresponding to the element h^i_{p,l};
S316: for each element h^i_{x,y} of the P1 region, if To_i[y] > q, perform the following elimination operation:
h^i_{x,y} = h^i_{x,y} − h^i_{x,q} × h^i_{p,y}    (2)
where the P1 region is the part of the sub-matrix H_i between rows [p+1, (j+1) × blocksize − 1] and columns [j × blocksize, m − 1], with p+1 ≤ x < (j+1) × blocksize and j × blocksize ≤ y < m;
S317: if p is the last row in block B_{j,j}, step S31 ends, and B_{j,j} now contains the LU decomposition results L_{j,j} and U_{j,j} of the current block; otherwise, let p = p + 1, go to step S312, and continue processing the next row.
7. The batch LU decomposition method according to claim 5, wherein in step S32, for any column p in B_{j,j}, the p-th thread in the thread group performs the element exchange between column col_p and column col_{To_i[p]}.
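A sketch of the column exchange of step S32 (serial; in the patent one thread handles one column in parallel, and the To_i targets are assumed pairwise distinct):

```python
import numpy as np

def exchange_pivot_columns(A, To, j, blocksize):
    """Sketch of step S32: for each column p of the diagonal block
    B_{j,j}, swap it with its recorded principal-element column To[p].
    Serial stand-in for the one-thread-per-column exchange."""
    for p in range(j * blocksize, (j + 1) * blocksize):
        q = To[p]
        if q != p:
            A[:, [p, q]] = A[:, [q, p]]
    return A
```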
8. The batch LU decomposition method of claim 5, wherein in step S33, the P2 region is the part of the sub-matrix H_i between rows [(j+1) × blocksize, m − 1] and columns [j × blocksize, (j+1) × blocksize − 1]; for each element h^i_{x,y} therein, the following operation is performed:
h^i_{x,y} = h^i_{x,y} − Σ_{l = j × blocksize}^{y − 1} h^i_{x,l} × h^i_{l,y}    (3)
where (j+1) × blocksize ≤ x < m and j × blocksize ≤ y < (j+1) × blocksize.
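Step S33 computes the L-panel below the diagonal block. The sketch below assumes a Crout layout, with the diagonal block already holding L_{j,j} and the unit-upper U_{j,j} in place, and processes the panel columns left to right, since each column uses the panel columns computed before it; this is an interpretation of the published text, not a verbatim implementation:

```python
import numpy as np

def update_below_panel(A, j, blocksize, m):
    """Sketch of step S33: compute the L-panel (region P2) below the
    diagonal block, assuming the block already holds L_{j,j} and
    unit-upper U_{j,j} in place. Columns are processed left to right."""
    c0, c1 = j * blocksize, (j + 1) * blocksize
    for y in range(c0, c1):
        # h[x, y] -= sum over l < y of h[x, l] * h[l, y], for all x in P2
        A[c1:m, y] -= A[c1:m, c0:y] @ A[c0:y, y]
    return A
```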
9. The batch LU decomposition method of claim 5, wherein in step S34, the P3 region is the part of the sub-matrix H_i between rows [(j+1) × blocksize, m − 1] and columns [(j+1) × blocksize, m − 1]; the P1 and P3 regions are further divided into a number of sub-regions according to blocksize, and the thread group Group_i processes in turn each pair of sub-regions P1_t, P3_t, 0 ≤ t < k − j, as follows:
each GPU thread T_y in Group_i, 0 ≤ y < blocksize, performs the following operations:
read the column vector A with sequence number y in sub-region P1_t into the in-thread cache;
calculate each element of the vector D with sequence number y in sub-region P3_t:
D[x] = D[x] − C_x · A    (4)
where (j+1) × blocksize ≤ x < m and C_x is the x-th row vector of the P2 region.
10. The batch LU decomposition method according to claim 5, wherein in step S35, it is judged whether the currently processed block B_{j,j} is the bottom-right block of the sub-matrix H_i; if so, the batch parallel LU decomposition is finished; otherwise, let j = j + 1, go to step S31, and continue processing the next block.
CN202210522339.2A 2022-05-13 2022-05-13 Batch parallel LU decomposition method of small and medium-sized dense matrix based on GPU for simulation system Pending CN114911619A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210522339.2A CN114911619A (en) 2022-05-13 2022-05-13 Batch parallel LU decomposition method of small and medium-sized dense matrix based on GPU for simulation system


Publications (1)

Publication Number Publication Date
CN114911619A true CN114911619A (en) 2022-08-16

Family

ID=82766173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210522339.2A Pending CN114911619A (en) 2022-05-13 2022-05-13 Batch parallel LU decomposition method of small and medium-sized dense matrix based on GPU for simulation system

Country Status (1)

Country Link
CN (1) CN114911619A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115396065A (en) * 2022-10-26 2022-11-25 南京邮电大学 Low-delay decoding method for sparse random linear network coding
CN117725348A (en) * 2024-02-07 2024-03-19 蓝象智联(杭州)科技有限公司 Thread management method and system in GPU computing large-scale array summation process
CN117725348B (en) * 2024-02-07 2024-05-10 蓝象智联(杭州)科技有限公司 Thread management method and system in GPU computing large-scale array summation process


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination