US9170836B2 - System and method for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor - Google Patents


Info

Publication number
US9170836B2
Authority
US
United States
Prior art keywords
matrix
operable
recited
factorization
factorizer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US13/737,287
Other versions
US20140196043A1 (en
Inventor
Maxim Naumov
Sharanyan Chetlur
Lung Sheng Chien
Robert STRZODKA
Philippe Vandermersch
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nvidia Corp
Original Assignee
Nvidia Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nvidia Corp filed Critical Nvidia Corp
Priority to US13/737,287 priority Critical patent/US9170836B2/en
Assigned to NVIDIA CORPORATION reassignment NVIDIA CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: STRZODKA, ROBERT, VANDERMERSCH, PHILIPPE, CHETLUR, SHARANYAN, CHIEN, LUNG SHENG, NAUMOV, MAXIM
Priority to DE102013020608.1A priority patent/DE102013020608A1/en
Priority to TW102146864A priority patent/TWI518592B/en
Priority to CN201310746800.3A priority patent/CN103914433B/en
Publication of US20140196043A1 publication Critical patent/US20140196043A1/en
Application granted granted Critical
Publication of US9170836B2 publication Critical patent/US9170836B2/en
Active legal-status Critical Current
Adjusted expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization



Abstract

A system and method for re-factorizing a square input matrix on a parallel processor. In one embodiment, the system includes: (1) a matrix generator operable to generate an intermediate matrix by embedding a permuted form of the input matrix in a zeroed-out sparsity pattern of a combination of lower and upper triangular matrices resulting from a prior LU factorization of a previous matrix having a same sparsity pattern, reordering to minimize fill-in and pivoting strategy as the input matrix and (2) a re-factorizer associated with the matrix generator and operable to use parallel threads to apply an incomplete-LU factorization with zero fill-in on the intermediate matrix.

Description

TECHNICAL FIELD
This application is directed, in general, to parallel processors (i.e., computers having at least two processors capable of cooperating to carry out parallel processing) and, more specifically, to a system and method for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor.
BACKGROUND
Assuming they can be effectively programmed, parallel processors such as graphics processing units (GPUs) have the potential to be remarkably adept at processing numerical algorithms, and particularly algorithms for directly solving large sparse linear systems.
Sparse linear systems are systems of linear equations with sparse coefficient matrices. These systems arise in the context of computational mechanics, geophysics, biology, circuit simulation and many other contexts in the fields of computational science and engineering.
The most common general and direct technique to solve a sparse linear system is to decompose its coefficient matrix into the product of a lower triangular matrix, L, and an upper triangular matrix, U, a process called “factorization.” Then, conventional forward and backward substitution techniques can be used to solve the linear systems with L and U triangular matrices and thereby obtain the solution of the sparse linear system.
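As a minimal illustration of this two-stage solve (an editorial sketch, not part of the patented technique), the following Python code performs forward substitution with L followed by backward substitution with U on small dense matrices; the function names and list-of-lists representation are hypothetical:

```python
# Illustrative sketch: once A = L*U is known, solve A x = f in two stages.
def forward_substitution(L, f):
    # Solve L y = f, proceeding from the first row downward.
    n = len(f)
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (f[i] - s) / L[i][i]
    return y

def backward_substitution(U, y):
    # Solve U x = y, proceeding from the last row upward.
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x
```

For example, with L = [[2, 0], [1, 1]] and U = [[1, 3], [0, 2]] (so A = L*U = [[2, 6], [1, 5]]) and f = [8, 6], forward substitution yields y = [4, 2] and backward substitution yields the solution x = [1, 1].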
SUMMARY
One aspect provides a system for re-factorizing a square input matrix on a parallel processor. In one embodiment, the system includes: (1) a matrix generator operable to generate an intermediate matrix by embedding a permuted form of the input matrix in a zeroed-out sparsity pattern of a combination of lower and upper triangular matrices resulting from an LU factorization of a previous matrix having a same sparsity pattern, reordering to minimize fill-in and pivoting strategy as the input matrix and (2) a re-factorizer associated with the matrix generator and operable to use parallel threads to apply an incomplete-LU factorization with zero fill-in (ILU0) on the intermediate matrix.
Another aspect provides a method of re-factorizing a square input matrix on a parallel processor. In one embodiment, the method includes: (1) generating an intermediate matrix by embedding a permuted form of the input matrix in a zeroed-out sparsity pattern of a combination of lower and upper triangular matrices resulting from an LU factorization of a previous matrix having a same sparsity pattern, reordering to minimize fill-in and pivoting strategy as the input matrix and (2) using parallel threads to apply an ILU0 on the intermediate matrix.
Yet another aspect provides a SIMD processor. In one embodiment, the SIMD processor includes: (1) lanes operable to process parallel threads, (2) a pipeline control unit system operable to control thread processing in the lanes and (3) a system for employing the lanes and the pipeline control unit to re-factorize a square input matrix, having: (3a) a matrix generator operable to generate an intermediate matrix by embedding a permuted form of the input matrix in a zeroed-out sparsity pattern of a combination of lower and upper triangular matrices resulting from an LU factorization of a previous matrix having a same sparsity pattern, reordering to minimize fill-in and pivoting strategy as the input matrix and (3b) a re-factorizer associated with the matrix generator and operable to transmit parallel threads to the pipeline control unit to apply an ILU0 on the intermediate matrix.
BRIEF DESCRIPTION
Reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a block diagram of a SIMD processor operable to contain or carry out a system or method for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor;
FIG. 2 is a block diagram of one embodiment of a system for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor; and
FIG. 3 is a flow diagram of one embodiment of a method of re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor.
DETAILED DESCRIPTION
As stated above, the most common general technique to solve a sparse linear system is to factorize its coefficient matrix into the product of lower and upper triangular matrices, L and U, respectively. It is realized herein that existing techniques are not well-suited to take advantage of the architecture of a parallel processor, such as a GPU.
Accordingly, introduced herein are various embodiments of a system and method for re-factorizing a square matrix into lower and upper triangular matrices. In general, the various embodiments are operable to accelerate applications for solving a set of sparse linear systems taking the form:
Ai xi = fi for i = 1, . . . , k,
where the coefficient matrices Ai ∈ ℝn×n, the right-hand sides fi ∈ ℝn, and the solutions xi ∈ ℝn.
Certain embodiments of a system and method are applicable when (1) the sparsity pattern of the coefficient matrices Ai, (2) the reordering to minimize fill-in and (3) the pivoting strategy used during the factorization remain the same across all the linear systems. In this case the sparsity pattern of the resulting lower (Li) and upper (Ui) triangular factors for each of the linear systems also remains the same. These conditions frequently arise in the simulation of integrated circuits using the well-known Simulation Program with Integrated Circuit Emphasis (SPICE).
In these certain embodiments, LU factorization needs to be performed only the first time (for i=1). Subsequently, (for i=2, . . . , k) only LU re-factorization needs to be performed. Since applying the LU factorization on the sparsity pattern of Li and Ui factors generates no additional fill-in, the memory needed for the re-factorization can be statically allocated. Therefore, LU re-factorization is in general much faster than the LU factorization. In some embodiments, re-factorization can be carried out in tens of seconds or a few minutes on a modern SIMD processor, whereas factorization may require many hours.
Thus, it is realized herein that for i=2, . . . , k the coefficient matrix Ai can be embedded in the sparsity pattern of the zeroed-out lower and upper triangular factors resulting from the first (i=1) factorization, viz.:
Mi = L1(z) + U1(z) + Ai,
where L1(z) is L1 and U1(z) is U1 with their values filled in with zeros. Since the application of LU factorization on the matrices Li and Ui does not generate additional fill-in, it is realized herein that ILU0 can then be applied to this newly generated intermediate matrix Mi to produce the LU re-factorization of the coefficient matrix Ai.
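The construction of the intermediate matrix Mi can be sketched as follows, a minimal illustration using a hypothetical dictionary-of-coordinates representation rather than a production sparse format; the function name and its arguments are illustrative only:

```python
# Illustrative sketch: embed A_i's entries into the zeroed-out union
# sparsity pattern of L1 and U1, giving M_i = L1(z) + U1(z) + A_i.
def build_intermediate(pattern, A_entries):
    # pattern: set of (row, col) positions present in L1 + U1
    # A_entries: dict {(row, col): value} for the coefficient matrix A_i,
    # whose positions must be a subset of the pattern.
    M = {pos: 0.0 for pos in pattern}   # zeroed-out L1(z) + U1(z)
    for pos, v in A_entries.items():
        assert pos in M, "A_i must share the factorization's sparsity pattern"
        M[pos] = v
    return M
```

Positions of the pattern that A_i does not occupy remain explicit zeros, which is what later allows ILU0 on Mi to reproduce the fill-in that an exact LU factorization of A_i would generate.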
In certain embodiments, the reordering to minimize fill-in and pivoting can be accounted for by embedding the permuted matrix PT*Ai*Q, instead of the coefficient matrix Ai, so that:
Mi = L1(z) + U1(z) + PT*Ai*Q,
where P and Q are the permutation matrices corresponding to the reordering to minimize fill-in and the pivoting used in the first (i=1) LU factorization, and PT denotes the transpose of P.
The problem of LU re-factorization of the coefficient matrix Ai has therefore been recast as that of an ILU0 of the intermediate matrix Mi, for i=2, . . . , k. It is realized that the latter can be effectively carried out on a parallel processor, such as a GPU. Before describing the novel system and method in greater detail, a representative computing system containing a GPU will now be described.
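A minimal sequential sketch of ILU0 on such an intermediate matrix may clarify the recast problem. This is an editorial illustration only: the patent applies ILU0 with parallel threads, and the dictionary-of-coordinates representation and function name here are assumptions:

```python
# Illustrative sketch of incomplete LU with zero fill-in (ILU0), in place:
# only positions already present in M are ever updated, and the multipliers
# overwrite the strictly lower triangular entries.
def ilu0(M, n):
    # M: dict {(i, j): value} holding the intermediate matrix's nonzeros.
    for k in range(n):
        for i in range(k + 1, n):
            if (i, k) not in M:
                continue
            m = M[(i, k)] / M[(k, k)]   # multiplier, stored in place
            M[(i, k)] = m
            for j in range(k + 1, n):
                if (i, j) in M and (k, j) in M:
                    M[(i, j)] -= m * M[(k, j)]
    return M
```

Because Mi carries the full sparsity pattern of L1 + U1, no entries are dropped, so ILU0 applied to Mi yields the exact LU factors of the permuted Ai. For example, with M = {(0,0): 2, (0,1): 6, (1,0): 1, (1,1): 5} the sketch stores the multiplier 0.5 at (1,0) and updates (1,1) to 5 − 0.5·6 = 2.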
FIG. 1 is a block diagram of a SIMD processor 100 operable to contain or carry out a system or method for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor. SIMD processor 100 includes multiple thread processors, or cores 106, organized into thread groups 104, or “warps.” SIMD processor 100 contains J thread groups 104-1 through 104-J, each having K cores 106-1 through 106-K. In certain embodiments, thread groups 104-1 through 104-J may be further organized into one or more thread blocks 102. One specific embodiment has thirty-two cores 106 per thread group 104. Other embodiments may include as few as four cores in a thread group and as many as several tens of thousands. Certain embodiments organize cores 106 into a single thread group 104, while other embodiments may have hundreds or even thousands of thread groups 104. Alternate embodiments of SIMD processor 100 may organize cores 106 into thread groups 104 only, omitting the thread block organization level.
SIMD processor 100 further includes a pipeline control unit 108, shared memory 110 and an array of local memory 112-1 through 112-J associated with thread groups 104-1 through 104-J. The pipeline control unit 108 distributes tasks to the various thread groups 104-1 through 104-J over a data bus 114. Cores 106 within a thread group execute in parallel with each other. Thread groups 104-1 through 104-J communicate with shared memory 110 over a memory bus 116. Thread groups 104-1 through 104-J respectively communicate with local memory 112-1 through 112-J over local buses 118-1 through 118-J. For example, a thread group 104-J utilizes local memory 112-J by communicating over a local bus 118-J. Certain embodiments of SIMD processor 100 allocate a shared portion of shared memory 110 to each thread block 102 and allow access to shared portions of shared memory 110 by all thread groups 104 within a thread block 102. Certain embodiments include thread groups 104 that use only local memory 112. Many other embodiments include thread groups 104 that balance use of local memory 112 and shared memory 110.
Having described a representative computing system containing a GPU, various embodiments of the novel system and method will now be described in greater detail. Various embodiments of the system and method employ various novel techniques.
FIG. 2 is a block diagram of one embodiment of a system for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor. A factorizer 210, which is not part of many of the embodiments of the system, is operable to receive the square matrix A1 as an input matrix and perform an LU factorization of the matrix to yield a lower triangular matrix, L1, and an upper triangular matrix, U1, as well as the reordering to minimize fill-in and pivoting permutations P and Q used in the factorization.
A matrix generator 220 is associated with the factorizer 210 and is operable, for i=2, . . . , k, to generate the intermediate matrix Mi by embedding the permuted input matrix PT*Ai*Q in a zeroed-out sparsity pattern of a combination of lower and upper triangular matrices L1(z) + U1(z) resulting from the first LU factorization. A re-factorizer 230 is associated with the matrix generator 220 and is operable, for i=2, . . . , k, to use parallel threads to apply an ILU0 on the generated intermediate matrix Mi.
It is realized herein that row update is essentially a step of Gaussian elimination, in which a reference (pivot) row is scaled by a multiplier and added to the current row to create zeros below the main diagonal. The multipliers are computed and stored in place of the zeros created by this update. In some embodiments of the novel system and method, to exploit the available parallelism, columns are grouped into levels instead of rows. It is realized herein that the conventional technique of grouping rows into levels is not only more intuitive, but also makes the process of finding data dependencies more straightforward. However, it is realized herein that grouping rows into levels substantially hinders processing speed. It is specifically realized that rows near the bottom of the matrix tend to have far more nonzero elements than other rows, making updates of those rows relatively slow. In contrast, grouping columns into levels often achieves a better load balancing of work by more evenly distributing the computation associated with the bottom rows across multiple columns. Grouping columns into levels can result in a significant improvement in processing speed, perhaps by several orders of magnitude.
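The idea of grouping columns into levels can be sketched with a simplified dependency model. This is an editorial illustration under stated assumptions: the deps map and function name are hypothetical, a real implementation would derive the dependencies from the sparsity pattern of Mi, and columns are assumed to be topologically ordered (each column depends only on lower-numbered columns):

```python
# Illustrative sketch of level scheduling by columns: a column's level is one
# more than the deepest level among the columns that must be eliminated
# before it, and columns sharing a level are independent of one another.
def column_levels(deps, n):
    # deps[j] = set of columns that must be processed before column j
    level = [0] * n
    for j in range(n):                  # columns assumed topologically ordered
        for d in deps[j]:
            level[j] = max(level[j], level[d] + 1)
    # group columns sharing a level, so each group can be handled
    # by parallel threads in a single pass
    groups = {}
    for j, lv in enumerate(level):
        groups.setdefault(lv, []).append(j)
    return groups
```

For instance, if column 1 depends on column 0 and column 3 depends on columns 1 and 2, the sketch places columns 0 and 2 in level 0, column 1 in level 1 and column 3 in level 2; all columns within one level may be updated concurrently.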
Grouping columns into levels also yields another potential advantage. The bottom right corner of the matrix may be packaged into dense storage and dense LU factorization (without pivoting) may be employed to process it. In one embodiment, the packaging happens only if the bottom right corner contains enough elements per row to justify the extra computation in the dense format. However, it is realized herein that this is often the case.
In other embodiments of the system and method, the matrix is kept in a merged Compressed Sparse Row/Compressed Sparse Column (CSR-CSC) format. The merged CSR-CSC format is composed of both standard CSR and CSC formats. The CSR format is conventional and contains an array of matrix values, among other things. The CSC format is conventional except that, instead of the actual matrix values, it contains pointers to the values contained in the CSR format. The merged CSR-CSC format allows the update that takes place during the ILU0 to occur in place with respect to the single array of values in the CSR format. The CSC format pointers require no updating.
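The merged CSR-CSC idea can be sketched as follows, an editorial illustration assuming standard 0-based CSR arrays; the function and variable names are hypothetical:

```python
# Illustrative sketch: the values live once, in the CSR value array; the CSC
# side stores indices into that array instead of copies, so an update made
# through either view is immediately visible through the other.
def build_csc_pointers(csr_row_ptr, csr_col_ind, n_cols):
    # For each column, collect the positions (indices into the CSR value
    # array) of that column's entries, scanning rows in order so the
    # per-column position lists come out sorted by row.
    col_positions = [[] for _ in range(n_cols)]
    for row in range(len(csr_row_ptr) - 1):
        for p in range(csr_row_ptr[row], csr_row_ptr[row + 1]):
            col_positions[csr_col_ind[p]].append(p)
    return col_positions
```

For the matrix [[1, 2], [0, 3]] stored as row_ptr = [0, 2, 3], col_ind = [0, 1, 1], vals = [1, 2, 3], the sketch yields column position lists [[0], [1, 2]]; writing through a CSC pointer modifies the single CSR value array, so no second copy of the values ever needs updating.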
In yet other embodiments of the system and method, the analysis phase computes individual grid and block launch parameters for every level, because the number of elements per row differs greatly between rows early and late in the matrix Mi.
In still other embodiments of the system and method, the analysis may schedule two kernel launches per level instead of using a single kernel launch to process a level (with a single thread block in x-dimension per row). If two kernels are launched, the first kernel may update the multipliers (with a single thread block in x-dimension per row), and the second kernel may update the remaining elements in a row (with multiple thread blocks in x-dimension per row).
In certain embodiments, during the row update in the LU re-factorization, a search for the elements of the reference (pivot) row in the current row, rather than searching for elements of the current row in the reference (pivot) row, is undertaken. Searching in this manner yields two potential advantages. First, if the number of elements in the reference (pivot) row is “n” and in the current row is “m”, then m≧n given the construction of Mi. Thus, the former and the latter approaches involve O(m*log(n)) and O(n*log(m)) steps, respectively. Therefore, the latter approach involves fewer steps and is less computationally expensive. Also, by the construction of Mi, the elements of the reference (pivot) row are known always to be present in the current row, which minimizes thread divergence.
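The preferred search direction can be sketched as follows, an editorial illustration assuming each row's column indices are kept sorted; the function name is hypothetical:

```python
import bisect

# Illustrative sketch: during a row update, binary-search for each of the n
# pivot-row column indices inside the (longer, m-element) current row,
# giving O(n*log(m)) work rather than the O(m*log(n)) of the reverse search.
def locate_pivot_cols(pivot_cols, current_cols):
    # pivot_cols: sorted column indices of the reference (pivot) row
    # current_cols: sorted column indices of the current row, m >= n
    positions = []
    for c in pivot_cols:
        p = bisect.bisect_left(current_cols, c)
        # by the construction of M_i, every pivot-row column is guaranteed
        # to be present in the current row, so the search never misses
        assert p < len(current_cols) and current_cols[p] == c
        positions.append(p)
    return positions
```

The guaranteed hit noted in the comment is what minimizes thread divergence: every parallel thread performing such a search follows the same "found" path.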
FIG. 3 is a flow diagram of one embodiment of a method of re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor. The method begins in a start step 310. In a step 320, the input matrix A1 is LU factorized to yield a lower triangular matrix, L1, and an upper triangular matrix, U1, as well as the reordering to minimize fill-in and pivoting permutations P and Q used in the factorization. In a step 330, for i=2, . . . , k, an intermediate matrix Mi is generated by embedding a permuted form of the input matrix PT*Ai*Q in a zeroed-out sparsity pattern of a combination of lower and upper triangular matrices L1(z) + U1(z) resulting from the prior LU factorization. In a step 340, parallel threads are used to apply an ILU0 on the generated intermediate matrix Mi. The method ends in an end step 350.
Those skilled in the art to which this application relates will appreciate that other and further additions, deletions, substitutions and modifications may be made to the described embodiments.

Claims (20)

What is claimed is:
1. A system for re-factorizing a square input matrix on a parallel processor, comprising:
a matrix generator operable to generate an intermediate matrix by embedding a permuted form of said input matrix in a zeroed-out sparsity pattern of a combination of lower and upper triangular matrices resulting from an LU factorization of a previous matrix having a same sparsity pattern, reordering to minimize fill-in and pivoting strategy as said input matrix; and
a re-factorizer associated with said matrix generator and operable to use parallel threads to apply an incomplete-LU factorization with zero fill-in on said intermediate matrix.
2. The system as recited in claim 1 wherein said matrix generator is operable to embed said permuted form in said sparsity pattern.
3. The system as recited in claim 1 wherein said re-factorizer is further operable to perform a symbolic analysis on said input matrix to group independent nodes representing columns thereof into levels representing data dependencies between said nodes.
4. The system as recited in claim 1 wherein said re-factorizer is further operable to employ a merged compressed sparse row-compressed sparse column format in carrying out said incomplete-LU factorization with zero fill-in.
5. The system as recited in claim 1 wherein said re-factorizer is further operable to compute grid and block launch parameters particular to each level in defining a configuration of said parallel threads.
6. The system as recited in claim 1 wherein said re-factorizer is further operable to launch said parallel threads as two concurrent kernels per level of said intermediate matrix, one of said kernels operable to update multipliers in said level and another of said kernels operable to update remaining elements in said level.
7. The system as recited in claim 6 wherein said re-factorizer is further operable with a SIMD processor to launch said parallel threads.
8. A method of re-factorizing a square input matrix on a parallel processor, comprising:
generating an intermediate matrix by embedding a permuted form of said input matrix in a zeroed-out sparsity pattern of a combination of lower and upper triangular matrices resulting from an LU factorization of a previous matrix having a same sparsity pattern, reordering to minimize fill-in and pivoting strategy as said input matrix; and
using parallel threads to apply an incomplete-LU factorization with zero fill-in on said intermediate matrix.
9. The method as recited in claim 8 wherein said embedding comprises embedding said permuted form in said sparsity pattern.
10. The method as recited in claim 8 wherein said using comprises performing a symbolic analysis on said input matrix to group independent nodes representing columns thereof into levels representing data dependencies between said nodes.
11. The method as recited in claim 8 wherein said using comprises employing a merged compressed sparse row-compressed sparse column format in carrying out said incomplete-LU factorization with zero fill-in.
12. The method as recited in claim 8 wherein said using comprises computing grid and block launch parameters particular to each level in defining a configuration of said parallel threads.
13. The method as recited in claim 8 wherein said using comprises launching said parallel threads as two concurrent kernels per level of said intermediate matrix, one of said kernels operable to update multipliers in said level and another of said kernels operable to update remaining elements in said level.
14. The method as recited in claim 13 wherein said using comprises employing a SIMD processor to launch said parallel threads.
15. A SIMD processor, comprising:
lanes operable to process parallel threads;
a pipeline control unit system operable to control thread processing in said lanes; and
a system for employing said lanes and said pipeline control unit to re-factorize a square input matrix, including:
a matrix generator operable to generate an intermediate matrix by embedding a permuted form of said input matrix in a zeroed-out sparsity pattern of a combination of lower and upper triangular matrices resulting from an LU factorization of a previous matrix having a same sparsity pattern, reordering to minimize fill-in and pivoting strategy as said input matrix, and
a re-factorizer associated with said matrix generator and operable to transmit parallel threads to said pipeline control unit to apply an incomplete-LU factorization with zero fill-in on said intermediate matrix.
16. The processor as recited in claim 15 wherein said matrix generator is operable to embed said permuted form in said sparsity pattern.
17. The processor as recited in claim 15 wherein said re-factorizer is further operable to perform a symbolic analysis on said square input matrix to group independent nodes representing columns thereof into levels representing data dependencies between said nodes.
18. The processor as recited in claim 15 wherein said re-factorizer is further operable to employ a merged compressed sparse row-compressed sparse column format in carrying out said incomplete-LU factorization with zero fill-in.
19. The processor as recited in claim 15 wherein said re-factorizer is further operable to compute grid and block launch parameters particular to each level in defining a configuration of said parallel threads.
20. The processor as recited in claim 15 wherein said re-factorizer is further operable to launch said parallel threads as two concurrent kernels per level of said intermediate matrix, one of said kernels operable to update multipliers in said level and another of said kernels operable to update remaining elements in said level.
US13/737,287 2013-01-09 2013-01-09 System and method for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor Active 2034-05-25 US9170836B2 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US13/737,287 US9170836B2 (en) 2013-01-09 2013-01-09 System and method for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor
DE102013020608.1A DE102013020608A1 (en) 2013-01-09 2013-12-15 System and method for re-factoring a square matrix into lower and upper triangular matrices in a parallel processor
TW102146864A TWI518592B (en) 2013-01-09 2013-12-18 System and processor for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor
CN201310746800.3A CN103914433B (en) 2013-01-09 2013-12-30 For the system and method for factorization square matrix again on parallel processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US13/737,287 US9170836B2 (en) 2013-01-09 2013-01-09 System and method for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor

Publications (2)

Publication Number Publication Date
US20140196043A1 US20140196043A1 (en) 2014-07-10
US9170836B2 true US9170836B2 (en) 2015-10-27

Family

ID=51019128

Family Applications (1)

Application Number Title Priority Date Filing Date
US13/737,287 Active 2034-05-25 US9170836B2 (en) 2013-01-09 2013-01-09 System and method for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor

Country Status (4)

Country Link
US (1) US9170836B2 (en)
CN (1) CN103914433B (en)
DE (1) DE102013020608A1 (en)
TW (1) TWI518592B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9805001B2 (en) 2016-02-05 2017-10-31 Google Inc. Matrix processing apparatus
US10394930B2 (en) * 2016-10-01 2019-08-27 Intel Corporation Binary vector factorization
CN109359247B (en) * 2018-12-07 2021-07-06 广州市百果园信息技术有限公司 Content pushing method, storage medium and computer equipment
CN111338695B (en) * 2018-12-19 2022-05-17 中科寒武纪科技股份有限公司 Data processing method based on pipeline technology and related product
CN110704023B (en) * 2019-09-26 2021-10-22 北京华大九天科技股份有限公司 Matrix block division method and device based on topological sorting
CN113704674B (en) * 2021-08-20 2022-07-12 广东省水利水电第三工程局有限公司 Pipeline anticorrosion method and anticorrosion material
CN115396065B (en) * 2022-10-26 2023-04-28 南京邮电大学 Low-delay decoding method for sparse random linear network coding

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060265445A1 (en) * 2005-05-20 2006-11-23 International Business Machines Corporation Method and structure for improving processing efficiency in parallel processing machines for rectangular and triangular matrix routines
US7181384B1 (en) 2004-08-16 2007-02-20 Altera Corporation Method and apparatus for simulating a hybrid system with registered and concurrent nodes
US7783465B2 (en) 2005-12-19 2010-08-24 Synopsys, Inc. Parallel multi-rate circuit simulation
US7822590B2 (en) 2002-12-17 2010-10-26 Cadence Design Systems, Inc. Method and system for implementing, controlling, and interfacing with circuit simulators
US20110257955A1 (en) 2010-04-20 2011-10-20 The Regents Of The University Of Michigan Gate-Level Logic Simulator Using Multiple Processor Architectures
US8156457B2 (en) 2009-09-24 2012-04-10 Synopsys, Inc. Concurrent simulation of hardware designs with behavioral characteristics
US20130226535A1 (en) * 2012-02-24 2013-08-29 Jeh-Fu Tuan Concurrent simulation system using graphic processing units (gpu) and method thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010039325A1 (en) * 2008-09-30 2010-04-08 Exxonmobil Upstream Reseach Company Method for solving reservoir simulation matrix equation using parallel multi-level incomplete factorizations
GB2501829B (en) * 2011-02-24 2019-07-17 Chevron Usa Inc System and method for performing reservoir simulation using preconditioning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Ren, L., et al. "Sparse LU Factorization for Parallel Circuit Simulation on GPU"; Department of Electronic Engineering; Tsinghua National Laboratory for Information Science and Technology; Tsinghua University; Beijing, China; Design Automation Conference; Jun. 2-6, 2012; pp. 1125-1130.
Ren, L., et al. "Sparse LU Factorization for Parallel Circuit Simulation on GPU"; Power Point Slide Show; Department of Electronic Engineering; Tsinghua National Laboratory for Information Science and Technology; Tsinghua University; Beijing, China; Design Automation Conference; Jun. 2-6, 2012; 38 pages.

Also Published As

Publication number Publication date
CN103914433B (en) 2017-07-21
US20140196043A1 (en) 2014-07-10
TWI518592B (en) 2016-01-21
DE102013020608A1 (en) 2014-07-10
TW201443780A (en) 2014-11-16
CN103914433A (en) 2014-07-09

Similar Documents

Publication Publication Date Title
US9170836B2 (en) System and method for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor
US11687763B2 (en) Method, apparatus and computer program to carry out a training procedure in a convolutional neural network
US8676874B2 (en) Data structure for tiling and packetizing a sparse matrix
US8762655B2 (en) Optimizing output vector data generation using a formatted matrix data structure
US7937567B1 (en) Methods for scalably exploiting parallelism in a parallel processing system
CN110415157B (en) Matrix multiplication calculation method and device
US9146902B2 (en) Parallel computation of matrix problems
Peng et al. GLU3. 0: Fast GPU-based parallel sparse LU factorization for circuit simulation
JP2007317179A (en) Matrix multiplication with reduced bandwidth requirements
CN103885893A (en) Technique For Accessing Content-Addressable Memory
US10120717B2 (en) Method for optimizing the size of a data subset of a processing space for improved execution performance
CN114579929B (en) Accelerator execution method and electronic equipment
US9513923B2 (en) System and method for context migration across CPU threads
Zubair et al. An optimized multicolor point-implicit solver for unstructured grid applications on graphics processing units
US20100318758A1 (en) Efficient transfer of matrices for matrix based operations
US20090326880A1 (en) Parallel physics solver
US9600446B2 (en) Parallel multicolor incomplete LU factorization preconditioning processor and method of use thereof
Zou et al. Supernodal sparse Cholesky factorization on graphics processing units
US11409840B2 (en) Dynamically adaptable arrays for vector and matrix operations
Sakai et al. Towards automating multi-dimensional data decomposition for executing a single-GPU code on a multi-GPU system
DE112020007283T5 (en) Docking board for a multi-format graphics processing unit
US20240086719A1 (en) Sparse encoding and decoding at mixture-of-experts layer
CN117851742B (en) Data storage method, data processing method, data memory and data processor
US20230289398A1 (en) Efficient Matrix Multiply and Add with a Group of Warps
Islam et al. A Hierarchical Jacobi Iteration for Structured Matrices on GPUs using Shared Memory

Legal Events

Date Code Title Description
AS Assignment

Owner name: NVIDIA CORPORATION, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:NAUMOV, MAXIM;CHETLUR, SHARANYAN;CHIEN, LUNG SHENG;AND OTHERS;SIGNING DATES FROM 20121217 TO 20121226;REEL/FRAME:029595/0669

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 8