CN114911619A - Batch parallel LU decomposition method of small and medium-sized dense matrix based on GPU for simulation system - Google Patents


Info

Publication number
CN114911619A
Authority
CN
China
Prior art keywords
matrix
block
sub
column
blocksize
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210522339.2A
Other languages
Chinese (zh)
Inventor
张雪松 (Zhang Xuesong)
闫昭 (Yan Zhao)
赫枫龄 (He Fengling)
雷新丽 (Lei Xinli)
马琳 (Ma Lin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University
Priority to CN202210522339.2A
Publication of CN114911619A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0877Cache access modes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5018Thread allocation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention realizes a high-performance GPU-based batch parallel LU decomposition method for the small and medium dense matrices of a simulation system, comprising the following steps: S1, calculating the Jacobian matrix of each model unit of the simulation system at every time step, taking these matrices as sub-matrices and splicing them into a large matrix H; S2, partitioning the interior of each sub-matrix into blocks, and allocating a GPU thread group to each sub-matrix according to the block size; S3, proceeding along the diagonal from top left to bottom right, each GPU thread group successively completes, for every diagonal block, the pivot selection for all of its columns, the update of the blocks to its right in the same block row, the exchange of its columns with the selected pivot columns, and the update of all block data below the block's column and below-right of the block, thereby completing the LU decomposition of all sub-matrices in the large matrix H. The performance of the invention grows in an approximately linear trend as the batch size increases; when implicitly solving models containing large batches of small ordinary differential equations, it accelerates the solution and effectively shortens simulation time.

Description

Batch parallel LU decomposition method of small and medium-sized dense matrix based on GPU for simulation system
Technical Field
The invention belongs to the cross-disciplinary field of modeling/simulation and parallel computing, and particularly relates to a GPU (Graphics Processing Unit)-based batch parallel LU decomposition (Lower-Upper decomposition) method for small and medium-sized dense matrices in complex system simulation.
Background
LU decomposition is among the most important computational steps in many engineering and scientific computing problems, with wide application in system simulation, machine learning, image processing, information encryption, high-speed information transmission, and other fields. With the continuous development of computer science, the volume of data to be processed grows exponentially, and processing experimental data efficiently and quickly on a computer has become a difficult problem. In many critical applications, what must be solved is the parallel solution of a large number of small matrices, that is, LU decomposition of a large batch of matrices completed under a single set of operation logic, rather than the serial solution of a few large-scale linear systems.
A complex system usually comprises a large number of small and medium nonlinear simulation model units that must be solved approximately in linear form; the most time-consuming step is the solution of linear equation systems, namely the LU decomposition of the model units' Jacobian matrices. Current GPU-based LU decomposition methods are mainly oriented to large matrices: they adopt a blocking scheme and scatter the processing tasks across many threads, i.e., they divide the computing task inside one large model. This approach is not suitable for small and medium models: because the Jacobian matrix is small, the time spent reading and writing matrix elements far exceeds the computation time, block-wise matrix operations cannot raise the ratio of effective computation time, and global-memory access latency becomes the main performance bottleneck. The smaller the matrix, the more serious the problem.
Disclosure of Invention
In order to solve the above technical problems, the present invention provides a GPU-based batch parallel LU decomposition method for the small and medium dense matrices of a simulation system, which can effectively reduce the time consumed by the decomposition process and improve operational efficiency.
The invention provides a GPU-based batch LU decomposition method for small and medium-sized dense matrices of a simulation system, comprising the following steps:
S1, calculating the Jacobian matrix of each model unit of the simulation system at every time step, taking these matrices as sub-matrices and splicing them into a large matrix H;
S2, partitioning the interior of each sub-matrix into blocks, and allocating a GPU thread group to each sub-matrix according to the block size;
S3, proceeding along the diagonal from top left to bottom right, each GPU thread group successively completes, for every diagonal block, the pivot selection for all of its columns, the update of the blocks to its right in the same block row, the exchange of its columns with the selected pivot columns, and the update of all block data below the block's column and below-right of the block, thereby completing the LU decomposition of all sub-matrices in the large matrix H.
Further, in step S1, the simulation system has n model units, and each Jacobian matrix is a small or medium dense matrix H_i of order m, with i in the range [0, n-1]. The order m is obtained as follows: the dimensions of all model units to be solved are extended upward to the nearest integer power of 2, m (i.e., the Jacobian matrix is padded to order m); in the added rows and columns, the diagonal elements are set to 1 and all other entries are filled with 0.
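A minimal NumPy sketch of this padding rule (the helper name and the use of NumPy are illustrative assumptions; the patent operates on GPU memory):

```python
import numpy as np

def pad_to_pow2(J):
    """Pad a d x d Jacobian up to the next power-of-two order m.
    Per step S1: added rows/columns carry 1 on the diagonal and 0
    elsewhere, so the padded matrix remains nonsingular."""
    d = J.shape[0]
    m = 1
    while m < d:          # smallest power of two >= d
        m *= 2
    H_i = np.zeros((m, m))
    H_i[:d, :d] = J
    for l in range(d, m):
        H_i[l, l] = 1.0   # identity padding
    return H_i

J = np.array([[2.0, 1.0, 0.0],
              [4.0, 3.0, 1.0],
              [0.0, 1.0, 5.0]])
H_i = pad_to_pow2(J)      # order 3 is padded to m = 4
```

Padding every sub-matrix to a common power-of-two order lets all thread groups run the same blocked kernel without per-matrix size logic.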
Further, step S2 comprises:
S21, dividing the sub-matrix H_i into a number of sub-blocks, so that in subsequent processing the thread group can read the data of an entire block from global memory in one pass, reducing the number of global-memory accesses;
S22, allocating a corresponding number of GPU threads to the sub-matrix H_i according to the block size, forming a thread group Group_i.
Further, the method for dividing the sub-matrix H_i into a number of sub-blocks is:
if m is greater than or equal to 16, the block size (blocksize) is set to 8 or 16;
if m is smaller than 16, the entire sub-matrix is treated as one block, in which case blocksize = m.
Further, step S3 specifically comprises:
S31, each thread group Group_i selects the diagonal block B_{j,j} with index j inside sub-matrix H_i, determines the pivot columns corresponding to all columns in B_{j,j}, and updates all blocks to the right in the row of B_{j,j}; j is initialized to 0;
S32, exchanging all the obtained pivot columns with the corresponding columns inside the diagonal block B_{j,j};
S33, updating all blocks below the column of diagonal block B_{j,j};
S34, updating all blocks below-right of diagonal block B_{j,j};
S35, if diagonal block B_{j,j} is the lower-right corner block of sub-matrix H_i, the batch parallel LU decomposition is finished and the large matrix H is output, in which every sub-matrix H_i is composed of a lower triangular matrix L_i and a unit upper triangular matrix U_i for the subsequent fast solution of linear equation systems; otherwise, set j = j + 1, go to step S31, and continue processing the next block.
Further, step S31 comprises:
S311: initialize a one-dimensional array To_i of dimension m, setting To_i[l] = m for 0 ≤ l < m, where the array index l corresponds to a column number of sub-matrix H_i and To_i[l] records the sequence number of the final pivot column for the column with sequence number l;
S312: starting from the current block B_{j,j}, the thread group reads all elements of row p inside the block in parallel and stores them in per-thread caches; here p is a row index within sub-matrix H_i, initialized to j×blocksize, with j×blocksize ≤ p < (j+1)×blocksize; in this step, because the row elements of block B_{j,j} are laid out contiguously in global memory, the GPU can read all the data for the thread group by issuing a single memory instruction, achieving coalesced memory access;
S313: the thread group continues to read, in parallel, all the data of row p in the next block to the right of B_{j,j}, compares them with the cached data, and keeps the element with the larger absolute value together with the sequence number of its column in the internal cache; this process iterates until the last block of the sub-matrix has been read;
S314: compare all elements cached in the thread group, take the column q corresponding to the element with the maximum absolute value as the pivot column for column p, and update To_i[p] = q;
S315: in row p of sub-matrix H_i, starting from column index j×blocksize, traverse all subsequent elements h_{p,l} (j×blocksize ≤ l < m); if To_i[l] > q, perform the following operation:

h_{p,l} = h_{p,l} / h_{p,q}  (1)

where h_{p,q} is the true pivot corresponding to element h_{p,l};
S316: for each element h_{x,y} of the P1 region, if To_i[y] > q, perform the following elimination operation:

h_{x,y} = h_{x,y} - h_{x,q} × h_{p,y}  (2)

where the P1 region is the matrix in sub-matrix H_i between rows [p+1, (j+1)×blocksize-1] and columns [j×blocksize, m-1]; p+1 ≤ x < (j+1)×blocksize, j×blocksize ≤ y < m;
S317: if p is the last row in block B_{j,j}, step S31 ends, and B_{j,j} now contains the LU decomposition result L_{j,j} and U_{j,j} of the current block; otherwise, set p = p + 1, go to step S312, and continue processing the next row.
Further, in step S32, for any column p in B_{j,j}, the p-th thread in the thread group exchanges the elements between column col_p and column col_{To_i[p]}.
Further, in step S33, let the P2 region be the matrix in sub-matrix H_i between rows [(j+1)×blocksize, m-1] and columns [j×blocksize, (j+1)×blocksize-1]; for each element h_{x,y} therein, perform the following operation:

h_{x,y} = h_{x,y} - Σ_{t=j×blocksize}^{y-1} h_{x,t} × h_{t,y}  (3)

where (j+1)×blocksize ≤ x < m and j×blocksize ≤ y < (j+1)×blocksize.
Further, in step S34, let the P3 region be the matrix in sub-matrix H_i between rows [(j+1)×blocksize, m-1] and columns [(j+1)×blocksize, m-1]; the P1 and P3 regions are further divided into a number of sub-regions according to blocksize, and the thread group Group_i processes each pair of sub-regions P1_t, P3_t (0 ≤ t < k-j) in turn as follows:
each GPU thread T_y (0 ≤ y < blocksize) in thread group Group_i performs the following operations:
read the column vector A with sequence number y in sub-region P1_t into the thread's cache;
compute every element of the column vector D with sequence number y in sub-region P3_t:

D[x] = D[x] - C_x · A  (4)

where (j+1)×blocksize ≤ x < m and C_x denotes the x-th row of the P2 region.
Further, in step S35, it is judged whether the currently processed block B_{j,j} is the lower-right corner block of sub-matrix H_i; if so, the batch parallel LU decomposition ends; otherwise, set j = j + 1, go to step S31, and continue processing the next block.
Compared with the prior art, the GPU-based batch parallel LU decomposition method for small and medium dense matrices of the invention has the following beneficial effects:
(1) the invention provides high-performance batch parallel LU decomposition of small and medium dense matrices. Its performance grows in an approximately linear trend as the batch size increases, with a peak close to 450 Gflops/s; compared with the CUBLAS library functions provided by NVIDIA, the maximum speedup exceeds 10×;
(2) the method is of great significance for improving the simulation efficiency of complex systems: when implicitly solving models containing large batches of small ordinary differential equations, it accelerates the solution and effectively shortens simulation time.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a batch parallel LU decomposition method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a matrix H and its internal sub-matrices according to an embodiment of the present invention;
FIG. 3 is a block diagram of an internal sub-matrix according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating the selection of pivot elements in steps S312, S313 and S314 according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the P1, P2 and P3 regions processed in the calculations of steps S31, S33 and S34 according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating the exchange of pivot columns in step S32 according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating the processing of the elements inside the P2 region in step S33 according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating the processing of the elements inside the P3 region in step S34 according to an embodiment of the present invention;
fig. 9 is a schematic diagram of experimental test effect.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In order to improve the batch parallel LU decomposition efficiency for small and medium dense matrices, the invention provides a GPU-based batch parallel LU decomposition method for the small and medium dense matrices of a simulation system. The basic idea of the invention is to use coalesced memory access and local variables to cache partial data of a sub-matrix, to use out-of-order execution to update a small amount of key data, and to defer the vector-based outer-product operations and convert them into dense matrix-based multiplications, thereby effectively reducing the number of global-memory accesses, raising the ratio of effective computation time, and greatly improving decomposition efficiency.
The batch LU decomposition method of the small and medium dense matrix based on the GPU for the simulation system, as shown in FIG. 1, comprises the following steps:
s1, calculating Jacobian matrixes of model units of the simulation system at each time step, taking the matrixes as sub-matrixes, and splicing the sub-matrixes into a large matrix H;
s2, partitioning the interior of each sub-matrix, and distributing a corresponding number of GPU threads to each sub-matrix according to the size of the partitions to form thread groups;
s3, the GPU thread groups finish the selection of the main elements of all columns in each diagonal block, the updating of the block on the right side of the row where the diagonal block is located, the exchange of the columns in the diagonal block and the main element columns, and the updating of all block data below the column where the diagonal block is located and all block data below the right side of the diagonal block in turn from top left to bottom right according to the diagonal sequence.
In step S1, the model can be described directly in a language, for example as C expressions, and the Jacobian matrix of a model unit at each time step can be computed automatically by symbolic differentiation. For the simulation of a complex system, during the solution process Newton's method is applied to the n model units that form the system: at each time step the Jacobian matrix of every unit is calculated, the sub-matrices are spliced into a two-dimensional large matrix H, and H is subjected to batch parallel LU decomposition in the following steps.
The Jacobian matrix of each unit is a small or medium dense matrix of order m; the m-order matrices of the n model units are used as sub-matrices H_i and spliced into a large matrix H as input, with i in the range [0, n-1].
Specifically, m is obtained as follows: the dimensions of all sub-simulation models to be solved are extended upward to the nearest integer power of 2, m (i.e., the Jacobian matrix is padded to order m); in the added rows and columns, the diagonal elements are set to 1 and all other entries are filled with 0. All the padded sub-matrices are arranged one after another, finally forming a two-dimensional large matrix H of dimensions m × N, where N = m × n, as shown in fig. 2.
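The splicing just described can be sketched in NumPy as follows (function name illustrative; the actual layout in GPU memory may differ, this only shows the m × N side-by-side arrangement):

```python
import numpy as np

def build_batch_matrix(subs):
    """Splice n padded m x m sub-matrices H_i side by side into the
    two-dimensional batch matrix H of shape m x N, with N = m * n."""
    m = subs[0].shape[0]
    n = len(subs)
    H = np.empty((m, m * n))
    for i, Hi in enumerate(subs):
        H[:, i * m:(i + 1) * m] = Hi   # sub-matrix i occupies columns [i*m, (i+1)*m)
    return H

subs = [np.eye(4) * (i + 1) for i in range(3)]   # three toy 4x4 sub-matrices
H = build_batch_matrix(subs)                      # shape (4, 12)
```

Because all sub-matrices share the order m, each thread group can locate its sub-matrix by a fixed column offset i*m.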
In step S2, each sub-matrix H_i is partitioned internally, and a corresponding number of GPU threads is allocated to each sub-matrix H_i according to the block size, forming a thread group Group_i.
When the order m of each sub-matrix is greater than or equal to 16, the interior of each sub-matrix is partitioned; the block size (blocksize) can be chosen as either 8 or 16, determined by a parameter passed in at solution time. If the dimension of the sub-matrix is less than 16, the whole sub-matrix is treated as one block, in which case blocksize = m. After division, each sub-matrix contains k × k sub-blocks, where k = m ÷ blocksize; the blocking result is shown in fig. 3. By blocking the interior of the sub-matrices, it is guaranteed that in subsequent processing the thread group Group_i can read the data of an entire block from global memory in one pass, effectively reducing the number of global-memory accesses.
In step S3, LU decomposition is performed on all the sub-matrices in the large matrix H; the specific steps are:
S31, each thread group Group_i selects the diagonal block B_{j,j} with index j inside sub-matrix H_i, determines the pivot columns corresponding to all columns in B_{j,j}, and updates all blocks to the right in the row of B_{j,j}; j is initialized to 0;
S32, exchanging all the obtained pivot columns with the corresponding columns inside the diagonal block B_{j,j};
S33, updating all blocks below the column of diagonal block B_{j,j};
S34, updating all blocks below-right of diagonal block B_{j,j};
S35, if diagonal block B_{j,j} is the lower-right corner block of sub-matrix H_i, the batch parallel LU decomposition is finished and the large matrix H is output, in which every sub-matrix H_i is composed of a lower triangular matrix L_i and a unit upper triangular matrix U_i for the subsequent fast solution of linear equation systems; otherwise, set j = j + 1, go to step S31, and continue processing the next block.
In step S31, for the diagonal block B_{j,j} currently to be processed (0 ≤ j < k), traverse all diagonal elements b_{p,p} in B_{j,j} (j×blocksize ≤ p < (j+1)×blocksize) and determine the sequence number of the corresponding pivot, specifically:
S311: initialize a one-dimensional array To_i of dimension m, setting To_i[l] = m for 0 ≤ l < m, where the array index l corresponds to a column number of sub-matrix H_i and To_i[l] records the sequence number of the final pivot column for the column with sequence number l;
S312: starting from the current block B_{j,j}, the thread group reads all elements of row p inside the block in parallel and stores them in per-thread caches; here p is a row index within sub-matrix H_i, initialized to j×blocksize; in this step, because the row elements of block B_{j,j} are laid out contiguously in global memory, the GPU can read all the data for the thread group by issuing a single memory instruction, achieving coalesced memory access;
S313: the thread group continues to read, in parallel, all the data of row p in the next block to the right of B_{j,j}, compares them with the cached data, and keeps the element with the larger absolute value together with the sequence number of its column in the internal cache; this process iterates until the last block of the sub-matrix has been read; this step likewise achieves coalesced memory access;
S314: compare all elements cached in the thread group, take the column q corresponding to the element with the maximum absolute value as the pivot column for column p, and update To_i[p] = q;
The process of steps S312, S313 and S314 is shown in fig. 4;
S315: in row p of sub-matrix H_i, starting from column index j×blocksize, traverse all subsequent elements h_{p,l} (j×blocksize ≤ l < m); if To_i[l] > q, perform the following operation:

h_{p,l} = h_{p,l} / h_{p,q}  (1)

where h_{p,q} is the true pivot corresponding to element h_{p,l}, as shown in fig. 4; this step updates the elements of row p for the subsequent elimination process;
S316: elimination is carried out on the P1 region in fig. 5; specifically, the P1 region is the matrix in sub-matrix H_i between rows [p+1, (j+1)×blocksize-1] and columns [j×blocksize, m-1]; for each element h_{x,y} therein, if To_i[y] > q, perform the following elimination operation:

h_{x,y} = h_{x,y} - h_{x,q} × h_{p,y}  (2)

where p+1 ≤ x < (j+1)×blocksize and j×blocksize ≤ y < m;
S317: if p is the last row in block B_{j,j}, step S31 ends, and B_{j,j} now contains the LU decomposition result L_{j,j} and U_{j,j} of the current block; otherwise, set p = p + 1, go to step S312, and continue processing the next row.
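The pivot-column scan of steps S312 to S314 can be sketched serially as follows (an illustrative NumPy stand-in: the per-thread caches and coalesced block reads of the GPU version are reduced to block-wise segment scans, and helper names are assumptions):

```python
import numpy as np

def select_pivot_column(Hi, To, p, j, blocksize):
    """Scan row p of sub-matrix Hi over columns [j*blocksize, m),
    block by block as in S312/S313, and record the column q of the
    max-magnitude element in To[p] (step S314)."""
    m = Hi.shape[0]
    best_val, best_q = -1.0, -1
    for c0 in range(j * blocksize, m, blocksize):   # one coalesced block read each
        seg = Hi[p, c0:c0 + blocksize]
        k = int(np.argmax(np.abs(seg)))
        if abs(seg[k]) > best_val:
            best_val, best_q = abs(seg[k]), c0 + k
    To[p] = best_q
    return best_q

Hi = np.array([[1.0, -7.0, 2.0, 0.5],
               [0.0,  3.0, 1.0, 0.0],
               [5.0,  0.0, 4.0, 1.0],
               [2.0,  1.0, 0.0, 6.0]])
To = np.full(4, 4)                                  # To_i initialized to m (S311)
q = select_pivot_column(Hi, To, p=0, j=0, blocksize=2)
```

In the GPU version each thread holds one candidate from each block segment, so the final reduction in S314 happens entirely in registers.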
In step S32, the thread group Group_i simultaneously swaps all columns inside the current diagonal block B_{j,j} with their corresponding pivot columns. As shown in fig. 6, for any column p in B_{j,j}, the p-th thread in the thread group exchanges the elements between column col_p and column col_{To_i[p]}.
Because all columns of diagonal block B_{j,j} are exchanged simultaneously, all elements in the column interval [j×blocksize, (j+1)×blocksize-1] can be read and written with coalesced memory accesses, effectively reducing the number of GPU memory access instructions issued.
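A serial sketch of this column exchange (in the parallel version one GPU thread handles one column p; here the swaps are simply applied in order):

```python
import numpy as np

def swap_pivot_columns(Hi, To, j, blocksize):
    """Step S32 sketch: exchange each column p of diagonal block B_{j,j}
    with its recorded pivot column To[p]."""
    for p in range(j * blocksize, (j + 1) * blocksize):
        q = To[p]
        if q != p:
            Hi[:, [p, q]] = Hi[:, [q, p]]   # swap full columns p and q

Hi = np.array([[1.0, -7.0, 2.0],
               [0.0,  3.0, 1.0],
               [5.0,  0.0, 4.0]])
To = np.array([1, 1, 3])        # column 0's pivot column is column 1
swap_pivot_columns(Hi, To, j=0, blocksize=1)
```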
In step S33, a trailing update is performed on the P2 region in fig. 5, specifically:
the P2 region is the matrix in sub-matrix H_i between rows [(j+1)×blocksize, m-1] and columns [j×blocksize, (j+1)×blocksize-1]; for each element h_{x,y} therein, as shown in fig. 7, perform the following operation:

h_{x,y} = h_{x,y} - Σ_{t=j×blocksize}^{y-1} h_{x,t} × h_{t,y}  (3)

where (j+1)×blocksize ≤ x < m and j×blocksize ≤ y < (j+1)×blocksize.
As can be seen from equation (3), there is a causal dependency between the columns of P2: the update of a later column depends on the results of all earlier columns, so this update cannot be fully parallelized and must be performed separately; in turn, this decouples it from the subsequent P3 region, so that the update of the P3 region can be fully parallelized.
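Under the assumption that the P2 update is the standard panel solve of blocked LU (each column corrected by all earlier columns of the block, matching the dependency just described; region boundaries taken from the text), a serial sketch:

```python
import numpy as np

def update_P2(Hi, j, blocksize):
    """Step S33 sketch (assumed standard blocked-LU panel solve):
    update the panel below diagonal block B_{j,j} column by column.
    Column y depends on all earlier columns of the block, so the
    outer loop is strictly ordered."""
    m = Hi.shape[0]
    r0 = (j + 1) * blocksize            # first row of P2
    c0 = j * blocksize                  # first column of P2
    for y in range(c0, c0 + blocksize):
        for t in range(c0, y):          # causal dependency on earlier columns
            Hi[r0:m, y] -= Hi[r0:m, t] * Hi[t, y]

Hi = np.array([[1.0, 2.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0],
               [3.0, 4.0, 0.0, 0.0],
               [5.0, 6.0, 0.0, 0.0]])
update_P2(Hi, j=0, blocksize=2)
```

Only the inner axpy over rows is parallel here; the outer loop order is what the text calls the causal relationship between columns.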
In step S34, a trailing update is performed on the P3 region in fig. 5, specifically:
the P3 region is the matrix in sub-matrix H_i between rows [(j+1)×blocksize, m-1] and columns [(j+1)×blocksize, m-1]; the P1 and P3 regions are further divided into a number of sub-regions according to blocksize, as shown in fig. 8; the thread group Group_i processes these sub-regions in turn, and for any pair P1_t, P3_t (0 ≤ t < k-j) proceeds as follows:
each GPU thread T_y (0 ≤ y < blocksize) in thread group Group_i performs the following operations:
read the column vector A with sequence number y in sub-region P1_t into the thread's cache;
compute every element of the column vector D with sequence number y in sub-region P3_t:

D[x] = D[x] - C_x · A  (4)

where (j+1)×blocksize ≤ x < m and C_x denotes the x-th row of the P2 region;
As can be seen from equation (4), the vector A is used in the calculation of every element of column D; since the data of vector A have already been read into the thread's local cache, repeated reads are effectively avoided and the ratio of computation time is improved. Meanwhile, since all data in the preceding P1 and P2 regions have been processed, the matrix multiply-add operations of the current step can be completed fully in parallel in block form, greatly improving computational efficiency.
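Equation (4), applied over all columns of a sub-region, amounts to a dense matrix multiply-subtract; a NumPy sketch of the whole P3 update (region boundaries taken from the text, written as one multiply rather than per-thread columns):

```python
import numpy as np

def update_P3(Hi, j, blocksize):
    """Step S34 sketch: trailing update of the P3 region.
    Per equation (4), D[x] = D[x] - C_x . A with A a cached column of
    P1 and C_x a row of P2; over all columns this is P3 -= P2 @ P1."""
    m = Hi.shape[0]
    r0 = (j + 1) * blocksize
    c0 = j * blocksize
    P2 = Hi[r0:m, c0:c0 + blocksize]    # rows below the block, block columns
    P1 = Hi[c0:c0 + blocksize, r0:m]    # block rows, columns to the right
    Hi[r0:m, r0:m] -= P2 @ P1

Hi = np.array([[1.0, 0.0,  1.0,  2.0],
               [0.0, 1.0,  3.0,  4.0],
               [1.0, 1.0, 10.0, 10.0],
               [2.0, 0.0, 10.0, 10.0]])
update_P3(Hi, j=0, blocksize=2)
```

Expressing the update as one dense multiply is exactly why deferring it until the whole block column is ready pays off on the GPU.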
In step S35, it is judged whether the currently processed block B_{j,j} is the lower-right corner block of sub-matrix H_i; if so, the batch parallel LU decomposition ends, and sub-matrix H_i contains the decomposition results L_i and U_i; otherwise, set j = j + 1, go to step S31, and continue processing the next block.
When the method finishes, the large matrix H is output, in which every sub-matrix H_i is composed of L_i and U_i, a lower triangular matrix and a unit upper triangular matrix, for the subsequent fast batch solution of linear equation systems; the state-variable values of all simulation units are then calculated by Newton iteration, ensuring that the simulation advances quickly.
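For reference, the factorization each thread group ultimately produces can be checked against a plain serial LU with column pivoting; the sketch below is illustrative, not the patent's blocked GPU kernel (eager column swaps replace the deferred To_i bookkeeping, and the conventional unit-lower-L form is used), and it verifies that the column-permuted matrix reconstructs from its factors:

```python
import numpy as np

def lu_column_pivot(A):
    """Serial reference: factor A so that A[:, perm] = L @ U, with
    column (not row) pivoting as in the patent. L is unit lower
    triangular, U upper triangular; perm records the column order."""
    m = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(m)
    perm = np.arange(m)
    for p in range(m):
        q = p + int(np.argmax(np.abs(U[p, p:])))   # pivot column in row p
        if q != p:
            U[:, [p, q]] = U[:, [q, p]]
            perm[[p, q]] = perm[[q, p]]
        for x in range(p + 1, m):                  # eliminate below row p
            L[x, p] = U[x, p] / U[p, p]
            U[x, :] -= L[x, p] * U[p, :]
    return L, np.triu(U), perm

A = np.array([[0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0],
              [4.0, 1.0, 3.0]])
L, U, perm = lu_column_pivot(A)
```

Row pivoting would work equally well mathematically; the text's choice of column pivoting keeps each pivot search inside a single contiguous row, which is what enables the coalesced block reads of steps S312 and S313.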
According to the method of the present invention, experimental tests of batch parallel LU decomposition of dense matrices were performed, as shown in fig. 9: the abscissa is the dimension of the large matrix, the ordinate is the data throughput, and the legend in the lower-right corner marks the different sub-matrix dimensions, represented by the correspondingly colored curves. It can be seen that, whether blocking is performed as 8 × 8 or 16 × 16, the LU decomposition throughput increases essentially linearly with the number of sub-matrices, meaning that the GPU hardware is fully utilized during execution of the method; it can also be seen that when the dimension of the large matrix reaches 8192 × 8192, the peak floating-point throughput reaches about 450 Gflops/s. Table 1 shows partial comparison results against the execution efficiency of the CUBLAS library functions provided by NVIDIA: when the sub-matrix dimension is 64, the advantage of the method grows with the number of sub-matrices, with a maximum speedup of about 13×. Similar speedup results are obtained for other sub-matrix dimensions.
Table 1: Comparison of partial processing-speed results between the method of the invention and cuBLAS
While achieving a high degree of parallelism, the invention makes full use of the hardware features of coalesced memory access and local-variable caching, effectively hiding memory access latency and improving the ratio of computation to memory access. The method therefore has high computational efficiency, can effectively reduce LU decomposition time, and provides the capability of supporting real-time parallel implicit solution of million-scale small simulation models.
Those of ordinary skill in the art will understand that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims (10)

1. A batch LU decomposition method of a small and medium-sized dense matrix based on a GPU for a simulation system is characterized by comprising the following steps:
S1, calculating the Jacobian matrix of each model unit of the simulation system at each time step, taking these matrices as sub-matrices, and splicing the sub-matrices into a large matrix H;
S2, partitioning the interior of each sub-matrix, and allocating a GPU thread group to each sub-matrix according to the partition size;
S3, in diagonal order from the top left to the bottom right, the GPU thread groups successively complete, for each diagonal block, the selection of principal elements for all of its columns, the update of the blocks to the right in its row, the exchange of its internal columns with the principal element columns, and the update of all blocks below it and to its lower right, thereby completing the LU decomposition of all sub-matrices in the large matrix H.
2. The batch LU decomposition method according to claim 1, wherein in step S1, the simulation system has n model units, and each Jacobian matrix is an m-th order small dense matrix H_i, with i in the range [0, n-1], where m is obtained as follows: the dimension of each model unit currently to be solved is extended upward to m, the next integer power of 2 (i.e. the Jacobian matrix is extended to order m); in the added rows and columns, the diagonal elements are set to 1 and all other elements are filled with 0.
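The power-of-two padding of claim 2 can be sketched as follows (a hypothetical helper, not code from the patent; identity padding on the new diagonal keeps the extended matrix nonsingular and leaves the solution in the original variables unchanged):

```python
import numpy as np

def pad_jacobian(J):
    """Extend a d-by-d Jacobian to the next power-of-two order m.

    Added rows/columns are zero except for a 1 on the diagonal,
    as described in claim 2. Helper name is illustrative.
    """
    d = J.shape[0]
    m = 1
    while m < d:        # smallest power of two not less than d
        m *= 2
    H = np.eye(m)       # identity supplies the 1s on the new diagonal
    H[:d, :d] = J       # original Jacobian in the top-left corner
    return H
```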
3. The batch LU decomposition method according to claim 1, wherein step S2 comprises:
S21, dividing the sub-matrix H_i into a number of sub-blocks, so that in subsequent processing a thread group can read the data of an entire sub-block from global memory at one time, reducing the number of global memory accesses;
S22, according to the block size, allocating a corresponding number of GPU threads to the sub-matrix H_i to form a thread group Group_i.
4. The batch LU decomposition method of claim 3, wherein the sub-matrix H_i is divided into sub-blocks as follows:
if m is greater than or equal to 16, the block size blocksize is set to 8 or 16;
if m is smaller than 16, the entire sub-matrix is treated as one block, and the block size is m.
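A minimal sketch of this block-size rule (the choice between 8 and 16 is a tuning decision; 16 is picked here arbitrarily):

```python
def choose_blocksize(m):
    """Block-size rule of claim 4: 8 or 16 when m >= 16 (16 chosen
    here as an illustrative default), otherwise the whole sub-matrix
    is one block of size m."""
    return 16 if m >= 16 else m
```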
5. The batch LU decomposition method according to claim 2, wherein step S3 specifically comprises:
S31, each thread group Group_i locates in the sub-matrix H_i the diagonal block B_{j,j} with internal index j, determines the principal element columns corresponding to all the columns in B_{j,j}, and updates all the blocks in the row to the right of B_{j,j}, where j is initialized to 0;
S32, exchanging the positions of all the obtained principal element columns with the corresponding columns inside the diagonal block B_{j,j};
S33, updating all the blocks below the column of the diagonal block B_{j,j};
S34, updating all the blocks to the lower right of the diagonal block B_{j,j};
S35, if the diagonal block B_{j,j} is the bottom-right block of the sub-matrix H_i, the batch parallel LU decomposition is finished and the large matrix H is output, in which each sub-matrix H_i is composed of L_i and U_i, the unit lower triangular matrix and the unit upper triangular matrix, used for the subsequent fast solution of linear equation systems; otherwise, let j = j + 1, go to step S31, and continue processing the next block.
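Ignoring blocking, batching, and the GPU thread organisation, steps S31–S35 can be sketched as an unblocked CPU reference in NumPy. The Crout-style layout (L keeps the pivots, U gets a unit diagonal, matching the division of the pivot row by the pivot) and the row-wise pivot search are an interpretation of the published text, not a verbatim implementation:

```python
import numpy as np

def lu_column_pivot(H):
    """Unblocked CPU analogue of steps S31-S35: at each step the pivot
    is the largest |element| in the current row, the pivot column is
    swapped in, the pivot row is divided by the pivot (so U gets a
    unit diagonal, Crout-style), and the trailing part is eliminated.
    The blocked, batched GPU organisation of the patent is not modelled."""
    A = np.array(H, dtype=float)
    m = A.shape[0]
    perm = list(range(m))
    for p in range(m):
        q = p + int(np.argmax(np.abs(A[p, p:])))   # principal-element column
        A[:, [p, q]] = A[:, [q, p]]                # column exchange
        perm[p], perm[q] = perm[q], perm[p]
        A[p, p + 1:] /= A[p, p]                    # divide row by the pivot
        # eliminate the trailing submatrix
        A[p + 1:, p + 1:] -= np.outer(A[p + 1:, p], A[p, p + 1:])
    return A, perm  # A holds L (lower, with pivots) and unit-upper U in place
```

With factors extracted as L = tril(A) and U = triu(A, 1) + I, the identity H[:, perm] = L·U holds.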
6. The batch LU decomposition method of claim 5, wherein step S31 comprises:
S311: initialize a one-dimensional array To_i of dimension m and let To_i[l] = m for 0 ≤ l < m, where the array index l corresponds to a column number of the sub-matrix H_i, and To_i[l] records the sequence number of the final principal element column for the column with sequence number l;
S312: the thread group reads in parallel all elements of row p of the current block B_{j,j} and stores them in an in-thread cache, where p is a row index computed with respect to the sub-matrix H_i, initialized to j × blocksize, and satisfying j × blocksize ≤ p < (j+1) × blocksize; in this step, because the row elements of block B_{j,j} are stored contiguously in global memory, the GPU can read all the data needed by the thread group by issuing a single instruction, achieving coalesced memory access;
S313: the thread group continues to read in parallel all the data of row p in the next block adjacent to the right of B_{j,j}, compares it with the data in the cache, and stores the element with the larger absolute value, together with the sequence number of its column, in the in-thread cache; this process iterates until the last block of the sub-matrix has been read;
S314: compare all the elements cached in the thread group, take the column q corresponding to the element with the maximum absolute value as the principal element column corresponding to column p, and update To_i[p] = q;
S315: for the current row p of the matrix H_i, traverse all subsequent elements h^i_{p,l} in the row starting from column index j × blocksize; if To_i[l] > q, perform the following operation:
h^i_{p,l} = h^i_{p,l} / h^i_{p,q}    (1)
where h^i_{p,q} is the true principal element corresponding to the element h^i_{p,l};
S316: for each element h^i_{x,y} of the P1 region, if To_i[y] > q, perform the following elimination operation:
h^i_{x,y} = h^i_{x,y} − h^i_{x,q} × h^i_{p,y}    (2)
where the P1 region is the part of the sub-matrix H_i between rows [p+1, (j+1) × blocksize − 1] and columns [j × blocksize, m − 1], with p+1 ≤ x < (j+1) × blocksize and j × blocksize ≤ y < m;
S317: if p is the last row in block B_{j,j}, step S31 ends, and B_{j,j} now contains the LU decomposition results L_{j,j} and U_{j,j} of the current block; otherwise, let p = p + 1, go to step S312, and continue processing the next row.
7. The batch LU decomposition method according to claim 5, wherein in step S32, for any column p in B_{j,j}, the p-th thread in the thread group performs the element exchange between column col_p and column col_{To_i[p]}.
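A sketch of the column exchange of step S32 (serial; in the patent one thread handles one column in parallel, and the To_i targets are assumed pairwise distinct):

```python
import numpy as np

def exchange_pivot_columns(A, To, j, blocksize):
    """Sketch of step S32: for each column p of the diagonal block
    B_{j,j}, swap it with its recorded principal-element column To[p].
    Serial stand-in for the one-thread-per-column exchange."""
    for p in range(j * blocksize, (j + 1) * blocksize):
        q = To[p]
        if q != p:
            A[:, [p, q]] = A[:, [q, p]]
    return A
```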
8. The batch LU decomposition method of claim 5, wherein in step S33, the P2 region is the part of the sub-matrix H_i between rows [(j+1) × blocksize, m − 1] and columns [j × blocksize, (j+1) × blocksize − 1]; for each element h^i_{x,y} therein, the following operation is performed:
h^i_{x,y} = h^i_{x,y} − Σ_{l = j × blocksize}^{y − 1} h^i_{x,l} × h^i_{l,y}    (3)
where (j+1) × blocksize ≤ x < m and j × blocksize ≤ y < (j+1) × blocksize.
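Step S33 computes the L-panel below the diagonal block. The sketch below assumes a Crout layout, with the diagonal block already holding L_{j,j} and the unit-upper U_{j,j} in place, and processes the panel columns left to right, since each column uses the panel columns computed before it; this is an interpretation of the published text, not a verbatim implementation:

```python
import numpy as np

def update_below_panel(A, j, blocksize, m):
    """Sketch of step S33: compute the L-panel (region P2) below the
    diagonal block, assuming the block already holds L_{j,j} and
    unit-upper U_{j,j} in place. Columns are processed left to right."""
    c0, c1 = j * blocksize, (j + 1) * blocksize
    for y in range(c0, c1):
        # h[x, y] -= sum over l < y of h[x, l] * h[l, y], for all x in P2
        A[c1:m, y] -= A[c1:m, c0:y] @ A[c0:y, y]
    return A
```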
9. The batch LU decomposition method of claim 5, wherein in step S34, the P3 region is the part of the sub-matrix H_i between rows [(j+1) × blocksize, m − 1] and columns [(j+1) × blocksize, m − 1]; the P1 and P3 regions are further divided into a number of sub-regions according to blocksize, and the thread group Group_i processes in turn each pair of sub-regions P1_t, P3_t, 0 ≤ t < k − j, as follows:
each GPU thread T_y in Group_i, 0 ≤ y < blocksize, performs the following operations:
read the column vector A with sequence number y in sub-region P1_t into the in-thread cache;
calculate each element of the vector D with sequence number y in sub-region P3_t:
D[x] = D[x] − C_x · A    (4)
where (j+1) × blocksize ≤ x < m and C_x is the x-th row vector of the P2 region.
10. The batch LU decomposition method according to claim 5, wherein in step S35, it is judged whether the currently processed block B_{j,j} is the bottom-right block of the sub-matrix H_i; if so, the batch parallel LU decomposition is finished; otherwise, let j = j + 1, go to step S31, and continue processing the next block.
CN202210522339.2A 2022-05-13 2022-05-13 Batch parallel LU decomposition method of small and medium-sized dense matrix based on GPU for simulation system Pending CN114911619A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210522339.2A CN114911619A (en) 2022-05-13 2022-05-13 Batch parallel LU decomposition method of small and medium-sized dense matrix based on GPU for simulation system


Publications (1)

Publication Number Publication Date
CN114911619A true CN114911619A (en) 2022-08-16

Family

ID=82766173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210522339.2A Pending CN114911619A (en) 2022-05-13 2022-05-13 Batch parallel LU decomposition method of small and medium-sized dense matrix based on GPU for simulation system

Country Status (1)

Country Link
CN (1) CN114911619A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115396065A (en) * 2022-10-26 2022-11-25 南京邮电大学 Low-delay decoding method for sparse random linear network coding
CN117725348A (en) * 2024-02-07 2024-03-19 蓝象智联(杭州)科技有限公司 Thread management method and system in GPU computing large-scale array summation process
CN117725348B (en) * 2024-02-07 2024-05-10 蓝象智联(杭州)科技有限公司 Thread management method and system in GPU computing large-scale array summation process


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination