US9170836B2 - System and method for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor - Google Patents
- Publication number
- US9170836B2 (application numbers US13/737,287, US201313737287A)
- Authority
- US
- United States
- Prior art keywords
- matrix
- operable
- recited
- factorization
- factorizer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
Definitions
- This application is directed, in general, to parallel processors (i.e., computers having at least two processors capable of cooperating to carry out parallel processing) and, more specifically, to a system and method for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor.
- parallel processors: computers having at least two processors capable of cooperating to carry out parallel processing
- GPUs: graphics processing units
- Sparse linear systems are systems of linear equations with sparse coefficient matrices. These systems arise in the context of computational mechanics, geophysics, biology, circuit simulation and many other contexts in the fields of computational science and engineering.
- The most common general and direct technique for solving a sparse linear system is to decompose its coefficient matrix into the product of a lower triangular matrix, L, and an upper triangular matrix, U, a process called "factorization." Conventional forward and backward substitution techniques can then be used to solve the triangular systems involving L and U and thereby obtain the solution of the sparse linear system.
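For reference, the factor-then-substitute flow described above can be sketched in dense NumPy (an illustrative toy with invented function names, not the patent's sparse implementation; real solvers also add pivoting and sparse storage):

```python
import numpy as np

def lu_factorize(A):
    """Doolittle LU factorization without pivoting: A = L @ U,
    with L unit lower triangular and U upper triangular."""
    n = A.shape[0]
    L = np.eye(n)
    U = A.astype(float)
    for k in range(n):
        for i in range(k + 1, n):
            m = U[i, k] / U[k, k]      # multiplier
            L[i, k] = m
            U[i, k:] -= m * U[k, k:]   # eliminate entries below the diagonal
    return L, U

def solve_lu(L, U, f):
    """Solve L @ U @ x = f by forward then backward substitution."""
    n = len(f)
    y = np.empty(n)
    for i in range(n):                 # forward: L @ y = f (unit diagonal)
        y[i] = f[i] - L[i, :i] @ y[:i]
    x = np.empty(n)
    for i in reversed(range(n)):       # backward: U @ x = y
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x
```

Once L and U are in hand, each new right-hand side costs only the two triangular solves, which is the economy the patent's re-factorization scheme aims to preserve across a sequence of systems.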
- The system includes: (1) a matrix generator operable to generate an intermediate matrix by embedding a permuted form of the input matrix in a zeroed-out sparsity pattern of a combination of lower and upper triangular matrices resulting from an LU factorization of a previous matrix having the same sparsity pattern, fill-in-minimizing reordering, and pivoting strategy as the input matrix and (2) a re-factorizer associated with the matrix generator and operable to use parallel threads to apply an incomplete-LU factorization with zero fill-in (ILU0) on the intermediate matrix.
- a matrix generator operable to generate an intermediate matrix by embedding a permuted form of the input matrix in a zeroed-out sparsity pattern of a combination of lower and upper triangular matrices resulting from an LU factorization of a previous matrix having the same sparsity pattern, fill-in-minimizing reordering, and pivoting strategy as the input matrix
- a re-factorizer associated with the matrix generator and operable to use parallel threads to apply an ILU0 on the intermediate matrix
- Another aspect provides a method of re-factorizing a square input matrix on a parallel processor.
- The method includes: (1) generating an intermediate matrix by embedding a permuted form of the input matrix in a zeroed-out sparsity pattern of a combination of lower and upper triangular matrices resulting from an LU factorization of a previous matrix having the same sparsity pattern, fill-in-minimizing reordering, and pivoting strategy as the input matrix and (2) using parallel threads to apply an ILU0 on the intermediate matrix.
- The SIMD processor includes: (1) lanes operable to process parallel threads, (2) a pipeline control unit system operable to control thread processing in the lanes and (3) a system for employing the lanes and the pipeline control unit to re-factorize a square input matrix, having: (3a) a matrix generator operable to generate an intermediate matrix by embedding a permuted form of the input matrix in a zeroed-out sparsity pattern of a combination of lower and upper triangular matrices resulting from an LU factorization of a previous matrix having the same sparsity pattern, fill-in-minimizing reordering, and pivoting strategy as the input matrix and (3b) a re-factorizer associated with the matrix generator and operable to transmit parallel threads to the pipeline control unit to apply an ILU0 on the intermediate matrix.
- FIG. 1 is a block diagram of a SIMD processor operable to contain or carry out a system or method for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor;
- FIG. 2 is a block diagram of one embodiment of a system for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor;
- FIG. 3 is a flow diagram of one embodiment of a method of re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor.
- The most common general technique for solving a sparse linear system is to factorize its coefficient matrix into the product of lower and upper triangular matrices, L and U, respectively. It is realized herein that existing techniques are not well-suited to take advantage of the architecture of a parallel processor, such as a GPU.
- Certain embodiments of a system and method are applicable when (1) the sparsity pattern of the coefficient matrices A_i, (2) the fill-in-minimizing reordering and (3) the pivoting strategy used during the factorization remain the same across all the linear systems. In this case the sparsity pattern of the resulting lower (L_i) and upper (U_i) triangular factors for each of the linear systems also remains the same. These conditions frequently arise in the simulation of integrated circuits using the well-known Simulation Program with Integrated Circuit Emphasis (SPICE).
- SPICE: Simulation Program with Integrated Circuit Emphasis
- FIG. 1 is a block diagram of a SIMD processor 100 operable to contain or carry out a system or method for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor.
- SIMD processor 100 includes multiple thread processors, or cores 106, organized into thread groups 104, or "warps."
- SIMD processor 100 contains J thread groups 104-1 through 104-J, each having K cores 106-1 through 106-K.
- Thread groups 104-1 through 104-J may be further organized into one or more thread blocks 102.
- One specific embodiment has thirty-two cores 106 per thread group 104 .
- Embodiments may include as few as four cores in a thread group and as many as several tens of thousands. Certain embodiments organize cores 106 into a single thread group 104, while other embodiments may have hundreds or even thousands of thread groups 104. Alternate embodiments of SIMD processor 100 may organize cores 106 into thread groups 104 only, omitting the thread block organization level.
- SIMD processor 100 further includes a pipeline control unit 108, shared memory 110 and an array of local memory 112-1 through 112-J associated with thread groups 104-1 through 104-J.
- The pipeline control unit 108 distributes tasks to the various thread groups 104-1 through 104-J over a data bus 114.
- Cores 106 within a thread group execute in parallel with each other.
- Thread groups 104-1 through 104-J communicate with shared memory 110 over a memory bus 116.
- Thread groups 104-1 through 104-J respectively communicate with local memory 112-1 through 112-J over local buses 118-1 through 118-J.
- For example, a thread group 104-J utilizes local memory 112-J by communicating over a local bus 118-J.
- Certain embodiments of SIMD processor 100 allocate a shared portion of shared memory 110 to each thread block 102 and allow access to shared portions of shared memory 110 by all thread groups 104 within a thread block 102 .
- Certain embodiments include thread groups 104 that use only local memory 112 .
- Many other embodiments include thread groups 104 that balance use of local memory 112 and shared memory 110 .
- FIG. 2 is a block diagram of one embodiment of a system for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor.
- A factorizer 210, which is not part of many embodiments of the system, is operable to receive the square matrix A_1 as an input matrix and perform an LU factorization of the matrix to yield a lower triangular matrix, L_1, and an upper triangular matrix, U_1, as well as the fill-in-minimizing reordering and pivoting permutations P and Q used in the factorization.
- Row update is essentially a step of Gaussian elimination, in which a reference (pivot) row is scaled by a multiplier and added to the current row in order to create zeros below the main diagonal.
- The multipliers are computed and stored in place of the zeros created by this update.
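The in-place storage of multipliers can be illustrated with a small dense sketch (the helper name is invented here; a real implementation operates on sparse storage):

```python
import numpy as np

def lu_in_place(M):
    """In-place Gaussian elimination: after the loop, the strict lower
    triangle of M holds the multipliers and the upper triangle
    (including the diagonal) holds U."""
    n = M.shape[0]
    for k in range(n):
        for i in range(k + 1, n):
            m = M[i, k] / M[k, k]           # multiplier for this row update
            M[i, k + 1:] -= m * M[k, k + 1:]  # scaled pivot row added to current row
            M[i, k] = m                     # stored in place of the created zero
    return M
```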
- Columns are grouped into levels instead of rows.
- The conventional technique of grouping rows into levels is not only more intuitive but also makes the process of finding data dependencies more straightforward.
- However, grouping rows into levels substantially hinders processing speed. It is specifically realized that rows near the bottom of the matrix tend to have far more nonzero elements than other rows, making updates of those rows relatively slow.
- Grouping columns into levels often achieves better load balancing by more evenly distributing the computation associated with the bottom rows across multiple columns. Grouping columns into levels can result in a significant improvement in processing speed, perhaps by several orders of magnitude.
- The bottom right corner of the matrix may be packaged into dense storage, and dense LU factorization (without pivoting) may be employed to process it.
- The packaging happens only if the bottom right corner contains enough elements per row to justify the extra computation in the dense format. However, it is realized herein that this is often the case.
- The matrix is kept in a merged Compressed Sparse Row/Compressed Sparse Column (CSR-CSC) format.
- CSR-CSC: Compressed Sparse Row/Compressed Sparse Column
- The merged CSR-CSC format is composed of both standard CSR and CSC formats.
- The CSR format is conventional and contains an array of matrix values, among other things.
- The CSC format is conventional except that, instead of the actual matrix values, it contains pointers to the values contained in the CSR format.
- The merged CSR-CSC format allows the update that takes place during the ILU0 to occur in place with respect to the single array of values in the CSR format.
- The CSC format pointers require no updating.
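A toy construction of such a merged structure might look as follows (dense input and invented names; a real implementation builds the arrays directly from sparse input). Because the CSC side stores indices into the single CSR value array, writing through `vals` updates the matrix for both traversal orders:

```python
import numpy as np

def to_merged_csr_csc(A):
    """Build a merged CSR-CSC structure (illustrative sketch).
    CSR holds the single array of values; the CSC side stores indices
    into that array instead of duplicating the values."""
    n = A.shape[0]
    # CSR part: the values live here.
    row_ptr, col_ind, vals = [0], [], []
    for i in range(n):
        for j in range(n):
            if A[i, j] != 0:
                col_ind.append(j)
                vals.append(A[i, j])
        row_ptr.append(len(col_ind))
    # CSC part: row indices plus pointers into `vals`.
    col_ptr, row_ind, val_ptr = [0], [], []
    for j in range(n):
        for i in range(n):
            if A[i, j] != 0:
                # position of A[i, j] inside the CSR value array
                pos = row_ptr[i] + col_ind[row_ptr[i]:row_ptr[i + 1]].index(j)
                row_ind.append(i)
                val_ptr.append(pos)
        col_ptr.append(len(row_ind))
    return (row_ptr, col_ind, vals), (col_ptr, row_ind, val_ptr)
```

An in-place ILU0 update written through `vals` is then immediately visible to column-order traversals via `val_ptr`, which is why the pointers themselves never need updating.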
- The analysis phase computes individual grid and block launch parameters for every level because of the large difference in the number of elements per row early and late in the matrix M_i.
- The analysis may schedule two kernel launches per level instead of using a single kernel launch to process a level (with a single thread block in the x-dimension per row). If two kernels are launched, the first kernel may update the multipliers (with a single thread block in the x-dimension per row), and the second kernel may update the remaining elements in a row (with multiple thread blocks in the x-dimension per row).
- A search for the elements of the reference (pivot) row in the current row is undertaken. Searching in this manner yields two potential advantages. First, if the number of elements in the reference (pivot) row is "n" and in the current row is "m", then m≧n given the construction of M_i. Searching the current row's elements in the pivot row would involve O(m*log(n)) steps, whereas searching the pivot row's elements in the current row involves O(n*log(m)) steps; the latter approach involves fewer steps and is less computationally expensive. Also, by the construction of M_i, the elements of the reference (pivot) row are known always to be present in the current row, which minimizes thread divergence.
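A minimal sketch of this search direction, assuming sorted column indices and the guarantee that every pivot-row column is present in the current row (all names here are illustrative):

```python
from bisect import bisect_left

def update_row(cur_cols, cur_vals, piv_cols, piv_vals, mult, diag_col):
    """Subtract mult * (pivot row) from the current row by searching each
    pivot-row column inside the sorted current row: O(n * log m) with
    n = len(piv_cols) <= m = len(cur_cols). By construction of the
    intermediate matrix, every search is guaranteed to hit."""
    for c, v in zip(piv_cols, piv_vals):
        if c <= diag_col:
            continue  # only update entries to the right of the pivot column
        pos = bisect_left(cur_cols, c)  # binary search, always present
        cur_vals[pos] -= mult * v
    return cur_vals
```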
- FIG. 3 is a flow diagram of one embodiment of a method of re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor.
- The method begins in a start step 310.
- The input matrix A_1 is LU factorized to yield a lower triangular matrix, L_1, and an upper triangular matrix, U_1, as well as the fill-in-minimizing reordering and pivoting permutations P and Q used in the factorization.
- An intermediate matrix M_i is generated by embedding a permuted form of the input matrix P^T*A_i*Q in a zeroed-out sparsity pattern of a combination of lower and upper triangular matrices L_1^(z)+U_1^(z) resulting from the prior LU factorization.
- Parallel threads are used to apply an ILU0 on the generated intermediate matrix M_i.
- The method ends in an end step 350.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Mathematical Optimization (AREA)
- Mathematical Analysis (AREA)
- Pure & Applied Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computational Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Computing Systems (AREA)
- Complex Calculations (AREA)
- Advance Control (AREA)
Abstract
Description
A_i x_i = f_i for i = 1, . . . , K,
where the coefficient matrices A_i ∈ ℝ^(n×n), the right-hand sides f_i ∈ ℝ^n, and the solutions x_i ∈ ℝ^n.
M_i = L_1^(z) + U_1^(z) + A_i,
where L_1^(z) = L_1 and U_1^(z) = U_1 filled in with zeros. Since the application of LU factorization on the matrices L_i and U_i does not generate additional fill-in, it is realized herein that ILU0 can then be applied to this newly generated intermediate matrix M_i to produce the LU re-factorization of the coefficient matrix A_i.
M_i = L_1^(z) + U_1^(z) + P^T*A_i*Q,
where P^T and Q are the permutation matrices corresponding to the fill-in-minimizing reordering and pivoting used in the first (i=1) LU factorization.
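Under these definitions, the re-factorization can be sketched as a pattern-restricted (ILU0-style) elimination of the permuted input embedded in the prior factors' sparsity pattern. This is a toy dense sketch with invented names; the patent's implementation is sparse and parallel:

```python
import numpy as np

def ilu0(M, pattern):
    """ILU0 on dense storage (illustrative): standard in-place LU steps,
    but updates are applied only where `pattern` is True, so no fill-in
    is created outside the prescribed sparsity pattern."""
    n = M.shape[0]
    for k in range(n):
        for i in range(k + 1, n):
            if not pattern[i, k]:
                continue
            M[i, k] /= M[k, k]  # multiplier stored in place
            for j in range(k + 1, n):
                if pattern[i, j] and pattern[k, j]:
                    M[i, j] -= M[i, k] * M[k, j]
    return M

def refactorize(A_i, P, Q, pattern):
    """Embed the permuted input in the zeroed-out L1+U1 pattern, then
    apply ILU0. Entries of the pattern absent from P^T*A_i*Q start at
    zero, matching M_i = L_1^(z) + U_1^(z) + P^T*A_i*Q."""
    M = np.where(pattern, P.T @ A_i @ Q, 0.0)
    return ilu0(M, pattern)
```

When the pattern covers the full fill-in of the first factorization, the ILU0 of M_i reproduces the exact LU factors of A_i, which is the key observation behind the scheme.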
Claims (20)
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/737,287 US9170836B2 (en) | 2013-01-09 | 2013-01-09 | System and method for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor |
DE102013020608.1A DE102013020608A1 (en) | 2013-01-09 | 2013-12-15 | System and method for re-factoring a square matrix into lower and upper triangular matrices in a parallel processor |
TW102146864A TWI518592B (en) | 2013-01-09 | 2013-12-18 | System and processor for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor |
CN201310746800.3A CN103914433B (en) | 2013-01-09 | 2013-12-30 | For the system and method for factorization square matrix again on parallel processor |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/737,287 US9170836B2 (en) | 2013-01-09 | 2013-01-09 | System and method for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor |
Publications (2)
Publication Number | Publication Date |
---|---|
US20140196043A1 US20140196043A1 (en) | 2014-07-10 |
US9170836B2 true US9170836B2 (en) | 2015-10-27 |
Family
ID=51019128
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/737,287 Active 2034-05-25 US9170836B2 (en) | 2013-01-09 | 2013-01-09 | System and method for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor |
Country Status (4)
Country | Link |
---|---|
US (1) | US9170836B2 (en) |
CN (1) | CN103914433B (en) |
DE (1) | DE102013020608A1 (en) |
TW (1) | TWI518592B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9805001B2 (en) | 2016-02-05 | 2017-10-31 | Google Inc. | Matrix processing apparatus |
US10394930B2 (en) * | 2016-10-01 | 2019-08-27 | Intel Corporation | Binary vector factorization |
CN109359247B (en) * | 2018-12-07 | 2021-07-06 | 广州市百果园信息技术有限公司 | Content pushing method, storage medium and computer equipment |
CN111338695B (en) * | 2018-12-19 | 2022-05-17 | 中科寒武纪科技股份有限公司 | Data processing method based on pipeline technology and related product |
CN110704023B (en) * | 2019-09-26 | 2021-10-22 | 北京华大九天科技股份有限公司 | Matrix block division method and device based on topological sorting |
CN113704674B (en) * | 2021-08-20 | 2022-07-12 | 广东省水利水电第三工程局有限公司 | Pipeline anticorrosion method and anticorrosion material |
CN115396065B (en) * | 2022-10-26 | 2023-04-28 | 南京邮电大学 | Low-delay decoding method for sparse random linear network coding |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060265445A1 (en) * | 2005-05-20 | 2006-11-23 | International Business Machines Corporation | Method and structure for improving processing efficiency in parallel processing machines for rectangular and triangular matrix routines |
US7181384B1 (en) | 2004-08-16 | 2007-02-20 | Altera Corporation | Method and apparatus for simulating a hybrid system with registered and concurrent nodes |
US7783465B2 (en) | 2005-12-19 | 2010-08-24 | Synopsys, Inc. | Parallel multi-rate circuit simulation |
US7822590B2 (en) | 2002-12-17 | 2010-10-26 | Cadence Design Systems, Inc. | Method and system for implementing, controlling, and interfacing with circuit simulators |
US20110257955A1 (en) | 2010-04-20 | 2011-10-20 | The Regents Of The University Of Michigan | Gate-Level Logic Simulator Using Multiple Processor Architectures |
US8156457B2 (en) | 2009-09-24 | 2012-04-10 | Synopsys, Inc. | Concurrent simulation of hardware designs with behavioral characteristics |
US20130226535A1 (en) * | 2012-02-24 | 2013-08-29 | Jeh-Fu Tuan | Concurrent simulation system using graphic processing units (gpu) and method thereof |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010039325A1 (en) * | 2008-09-30 | 2010-04-08 | Exxonmobil Upstream Research Company | Method for solving reservoir simulation matrix equation using parallel multi-level incomplete factorizations |
GB2501829B (en) * | 2011-02-24 | 2019-07-17 | Chevron Usa Inc | System and method for performing reservoir simulation using preconditioning |
2013
- 2013-01-09 US US13/737,287 patent/US9170836B2/en active Active
- 2013-12-15 DE DE102013020608.1A patent/DE102013020608A1/en not_active Withdrawn
- 2013-12-18 TW TW102146864A patent/TWI518592B/en active
- 2013-12-30 CN CN201310746800.3A patent/CN103914433B/en active Active
Non-Patent Citations (2)
Title |
---|
Ren, L., et al. "Sparse LU Factorization for Parallel Circuit Simulation on GPU"; Department of Electronic Engineering; Tsinghua National Laboratory for Information Science and Technology; Tsinghua University; Beijing, China; Design Automation Conference; Jun. 2-6, 2012; pp. 1125-1130. |
Ren, L., et al. "Sparse LU Factorization for Parallel Circuit Simulation on GPU"; Power Point Slide Show; Department of Electronic Engineering; Tsinghua National Laboratory for Information Science and Technology; Tsinghua University; Beijing, China; Design Automation Conference; Jun. 2-6, 2012; 38 pages. |
Also Published As
Publication number | Publication date |
---|---|
CN103914433B (en) | 2017-07-21 |
US20140196043A1 (en) | 2014-07-10 |
TWI518592B (en) | 2016-01-21 |
DE102013020608A1 (en) | 2014-07-10 |
TW201443780A (en) | 2014-11-16 |
CN103914433A (en) | 2014-07-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9170836B2 (en) | System and method for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor | |
US11687763B2 (en) | Method, apparatus and computer program to carry out a training procedure in a convolutional neural network | |
US8676874B2 (en) | Data structure for tiling and packetizing a sparse matrix | |
US8762655B2 (en) | Optimizing output vector data generation using a formatted matrix data structure | |
US7937567B1 (en) | Methods for scalably exploiting parallelism in a parallel processing system | |
CN110415157B (en) | Matrix multiplication calculation method and device | |
US9146902B2 (en) | Parallel computation of matrix problems | |
Peng et al. | GLU3. 0: Fast GPU-based parallel sparse LU factorization for circuit simulation | |
JP2007317179A (en) | Matrix multiplication with reduced bandwidth requirements | |
CN103885893A (en) | Technique For Accessing Content-Addressable Memory | |
US10120717B2 (en) | Method for optimizing the size of a data subset of a processing space for improved execution performance | |
CN114579929B (en) | Accelerator execution method and electronic equipment | |
US9513923B2 (en) | System and method for context migration across CPU threads | |
Zubair et al. | An optimized multicolor point-implicit solver for unstructured grid applications on graphics processing units | |
US20100318758A1 (en) | Efficient transfer of matrices for matrix based operations | |
US20090326880A1 (en) | Parallel physics solver | |
US9600446B2 (en) | Parallel multicolor incomplete LU factorization preconditioning processor and method of use thereof | |
Zou et al. | Supernodal sparse Cholesky factorization on graphics processing units | |
US11409840B2 (en) | Dynamically adaptable arrays for vector and matrix operations | |
Sakai et al. | Towards automating multi-dimensional data decomposition for executing a single-GPU code on a multi-GPU system | |
DE112020007283T5 (en) | Docking board for a multi-format graphics processing unit | |
US20240086719A1 (en) | Sparse encoding and decoding at mixture-of-experts layer | |
CN117851742B (en) | Data storage method, data processing method, data memory and data processor | |
US20230289398A1 (en) | Efficient Matrix Multiply and Add with a Group of Warps | |
Islam et al. | A Hierarchical Jacobi Iteration for Structured Matrices on GPUs using Shared Memory |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: NVIDIA CORPORATION, CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:NAUMOV, MAXIM;CHETLUR, SHARANYAN;CHIEN, LUNG SHENG;AND OTHERS;SIGNING DATES FROM 20121217 TO 20121226;REEL/FRAME:029595/0669 |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 4 |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 8 |