WO2009037684A2 - Sparse matrix by vector multiplication - Google Patents

Sparse matrix by vector multiplication Download PDF

Info

Publication number
WO2009037684A2
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
vector
buffer
data
words
Prior art date
Application number
PCT/IE2008/000089
Other languages
French (fr)
Other versions
WO2009037684A3 (en)
Inventor
Thomas Dermot Geraghty
David Gregg
Bartley McElroy
Fergal Connor
Ciarán McElroy
Original Assignee
Provost Fellows And Scholars Of The College Of The Holy And Undivided Trinity Of Queen Elizabeth Near Dublin
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Provost Fellows And Scholars Of The College Of The Holy And Undivided Trinity Of Queen Elizabeth Near Dublin filed Critical Provost Fellows And Scholars Of The College Of The Holy And Undivided Trinity Of Queen Elizabeth Near Dublin
Publication of WO2009037684A2 publication Critical patent/WO2009037684A2/en
Publication of WO2009037684A3 publication Critical patent/WO2009037684A3/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization



Abstract

The invention involves pre-processing the matrix according to an encoding scheme whereby the non-zero data (in any numerical format), blocking information, the row and column offset indices within a block, and state machine control words are combined in a single data stream. Thus, a single vector may be used to store all of the matrix information required to compute a sparse matrix by vector multiplication. Therefore, the system can be used effectively with a single memory channel. Also, it can be used in parallel with multiple independent memory channels. This method of matrix-by-vector multiplication allows very high FPU utilization to be achieved for low bandwidth matrices such as those from finite element calculations. Also, it allows the local memory buffers to be simple, so there is no need for a complex cache architecture.

Description

"Sparse matrix by vector multiplication"
INTRODUCTION
Field of the Invention
The invention relates to data processing and particularly sparse matrix by vector multiplication.
Prior Art Discussion
Matrix by vector multiplication data processing is required for finite element analysis and scientific and engineering computing.
The matrices that occur in many applications such as finite element analysis are often very large and sparse. Since many of the entries in the matrix are zero, a variety of sparse matrix compression methods are used when storing them. An overview of sparse matrix storage methods is given in [1]. They usually involve storing just the non-zero entries in the matrix along with the coordinates of the entries within the matrix. This has some performance implications. A matrix entry can no longer be accessed with a single memory read, as the coordinates of the entry within the matrix must be read before the entry itself. Also, consecutive entries in the matrix data structure probably do not correspond to consecutive entries in the vector. The net effect of these issues is very poor performance when using general purpose processors. It is common for a processor to achieve only 10% of its peak performance [2].
A method for matrix-by-vector multiplication, termed "SPAR", is described in [4]. This method uses column-based multiplication operations.
One of the main reasons that general purpose processors do not perform well in sparse matrix by vector calculations is the poor data locality of the matrix in memory. This results in an excessive cache miss rate. When the main memory is commodity memory (such as DDR or DDR2) the penalty for a cache miss is very severe and the fraction of peak floating point performance which can be achieved is dramatically reduced.
The invention is directed towards providing improved efficiency in operation of data processing hardware.
References
[1] I. S. Duff, A. M. Erisman and J. K. Reid, Direct Methods for Sparse Matrices, Oxford University Press, London, 1986.
[2] Eun-Jin Im, K. A. Yelick, R. Vuduc, "SPARSITY: An Optimization Framework for Sparse Matrix Kernels", International Journal of High Performance Computing Applications, 18(1), pp. 135-158, February 2004.
[3] V. E. Taylor, A. Ranade, D. G. Messerschmitt, "SPAR: a new architecture for large finite element computations", IEEE Transactions on Computers, Vol. 44, No. 4, April 1995, pp. 531-545.
[4] Valerie E. Taylor, "Method and apparatus for optimized processing of sparse matrices", US Patent 5,206,822, April 27, 1993.
[5] L. Zhuo and V. K. Prasanna, "Sparse Matrix-Vector Multiplication on FPGAs", in Proceedings of the 13th International Symposium on Field-Programmable Gate Arrays, 2005.
[6] Ling Zhuo, Viktor K. Prasanna, "High-Performance and Area-Efficient Reduction Circuits on FPGAs", in Proceedings of the 17th International Symposium on Computer Architecture and High Performance Computing, October 2005, Rio de Janeiro, Brazil.
[7] M. deLorimier and A. DeHon, "Floating-point sparse matrix-vector multiply for FPGAs", in Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays (Monterey, California, USA, February 20-22, 2005), FPGA '05, ACM Press, New York, NY, pp. 75-85.
SUMMARY OF THE INVENTION
According to the invention, there is provided a method performed by a data processor for Ax = y sparse matrix-by-vector multiplication, the method comprising the steps of an arithmetic data path multiplying each of a plurality of blocks of the matrix A at a time with a fragment of the x vector to provide a fragment of the y vector, wherein the matrix is encoded with matrix words and the words are processed as a stream for multiplication.
In one embodiment, the blocks are arranged in rows. In one embodiment, all blocks of a row are processed before processing blocks of a next row. Preferably, there is an x-buffer for the x vector and a y-buffer for the y vector. Preferably, the y-buffer is initialized before the start of a row.
In one embodiment, the matrix words are streamed in from a memory without data processor storage.
In one embodiment, the y-buffer is initialized and a write-out from the y-buffer is performed in a single cycle using two buffer ports.
In another embodiment, relevant entries of the x-vector are written from external memory to the x-buffer in burst fashion.
In one embodiment, the multiplication is controlled by a state machine. Preferably, state machine commands are embedded in the matrix words. In one embodiment, the matrix block structure is represented by the matrix words. Preferably, at least some matrix words have a command, an index, and a payload.
In one embodiment, matrix words include data, row, and column block coordinate information.
In a further embodiment, the method comprises the further step of pre-processing the matrix according to an encoding scheme to generate the matrix words.
In one embodiment, the matrix words are encoded to avoid Read after Write (RAW) data hazards by any available data interleaving method such as modulo arithmetic.
In one embodiment, the matrix elements within a tile are not stored in a row or column-oriented format, the x- and y-coordinate within the tile of each matrix element being specified explicitly by the matrix words, so that the elements of a tile can be reordered arbitrarily when encoding the matrix.
In one embodiment, floating point units are never forced to stall.
In one embodiment, the method comprises the steps of exploiting symmetry in matrices to reduce processing. Preferably, the symmetry is about the main diagonal at the block level.
In one embodiment, the method comprises implementing a triangular solve where the diagonal elements in the matrix are replaced by their reciprocals in a pre-processing step.
In another aspect, the invention provides a data processing system adapted to perform any method as defined above. In one embodiment, the system is adapted to perform the sparse matrix by vector multiplication when performing finite element analysis.
In a further aspect, the invention provides a computer program product comprising software code for performing steps of any method defined above when executing on a digital processor.
DETAILED DESCRIPTION OF THE INVENTION
Brief Description of the Invention
The invention will be more clearly understood from the following description of some embodiments thereof, given by way of example only with reference to the accompanying drawings in which:
Fig. 1 is a diagram illustrating a mechanism of the invention for multiplying a matrix by a vector;
Fig. 2 is a diagram illustrating the hardware architecture; and
Fig. 3 shows a sample (10 x 10) matrix.
Description of the Embodiments
The invention provides a data processing system and method (called Software Controlled Arbitrary Reordering ("SCAR")) for multiplying a matrix by a vector (Ax = y). Referring to Fig. 1 the method may be described at a high level as follows (a software model of this traversal is sketched after the list):
1. An NxM matrix, A, is divided into a regular grid composed of rectangular KxL blocks. These blocks are usually square but they do not have to be.
2. The blocks are processed one at a time. Each block is related to a fragment of the x and y vectors. The hardware has local memory buffers to act as a cache for these fragments.
3. All blocks in a horizontal row of the grid are processed before moving vertically in the grid.
4. At the start of a horizontal grid row the y-buffer is zeroed.
5. At the beginning of a block all the x values that will be used in the block are loaded into the local x-buffer in a single burst from external memory.
6. The matrix values are then processed. Each matrix entry streams in from external memory. There is no need for a local store for the matrix data.
7. When all the matrix entries in the block are processed, repeat the procedure for the next block starting at step 5.
8. At the end of a horizontal row of blocks in the grid the fragment of the y vector that has been computed is written into the external vector memory, the y-buffer is zeroed and processing of the next horizontal grid row can start. Since the y-buffer is dual port the zeroing does not incur a time penalty. Blocks within the same horizontal grid row can be processed in any order. Likewise horizontal rows can be processed in any order but a row must be completed before a block from another row is processed.
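The traversal above can be modelled in ordinary software. The following sketch is illustrative only (a dense list-of-lists stands in for the streamed matrix, and the helper name is hypothetical); in the invention the loop is implemented in hardware:

```python
def blocked_spmv(A, x, K, L):
    """Reference model of the SCAR traversal: sweep an N x M matrix as a
    grid of K x L blocks, one horizontal row of blocks at a time."""
    N, M = len(A), len(A[0])
    y = [0.0] * N
    for row0 in range(0, N, K):                  # one horizontal row of blocks
        y_buf = [0.0] * min(K, N - row0)         # step 4: zero the y-buffer
        for col0 in range(0, M, L):              # blocks across the row
            x_buf = x[col0:col0 + L]             # step 5: burst-load the x fragment
            for i in range(len(y_buf)):          # step 6: stream the block entries
                for j in range(len(x_buf)):
                    a = A[row0 + i][col0 + j]
                    if a != 0.0:
                        y_buf[i] += a * x_buf[j]
        y[row0:row0 + len(y_buf)] = y_buf        # step 8: write the y fragment out
    return y
```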
In general terms, the invention involves pre-processing the matrix according to a "SCAR" scheme whereby the non-zero data (in any numerical format), blocking information, the row and column offset indices within a block and state machine control words are combined in a single data stream. Thus, a single vector is used to store all of the matrix information required to compute a sparse matrix by vector multiplication. Therefore, the system can be used effectively with a single memory channel. Also, it can be used in parallel with multiple independent memory channels.
This method of matrix-by-vector multiplication has the following advantages:
- Very high FPU utilization can be achieved for low bandwidth matrices such as those from finite element calculations.
- The local memory buffers are simple, there is no need for a complex cache architecture.
A high level illustration of the SCAR hardware architecture is shown in Fig. 2. The architecture is composed of four main sections, viz. the y-buffer, x-buffer, arithmetic data path and a controlling state machine.
The data structure represents compressed sparse matrix data in a single stream where data, row and column coordinates and control words are in a single stream rather than in separate data structures.
The commands for the state machine controlling the sparse matrix by vector multiplication are embedded in the data structure. Also, a data stream is explicitly constructed to avoid Read after Write (RAW) data hazards by any available data interleaving method such as modulo arithmetic. Other advantageous aspects of the invention are: the floating point units are never forced to stall; the sparse matrix is pre-processed to generate the data structure; the method exploits symmetry in matrices to reduce processing; and it exploits symmetry about the main diagonal at the block level to reduce processing.
The y-buffer
As its name suggests the y-buffer stores the fragment of the product vector that is currently being computed. It is simply a dual port memory and every entry in the memory is initialized to zero at startup. As the matrix is being processed the current matrix entry is multiplied by the appropriate entry in the x-vector and the result is added to the appropriate entry in the y-vector.
When every y-vector entry for a row of blocks is computed the entire contents of the y-buffer are written into the external vector memory and the contents of the buffer are re-initialized to zero to be ready to compute the next section of the y-vector. Since the y-buffer has two ports the re-initialization of the buffer incurs no time penalty.
The x-buffer
The x-buffer is a cache for the x-vector. At the start of a new block of the matrix all the relevant entries in the x-vector are loaded from the external vector memory into the x-buffer. This allows burst operations, which tend to be the most efficient, from the external memory.
The arithmetic data path
The arithmetic data path consists of a floating-point multiplier and adder. The matrix entry is multiplied by the appropriate x-vector entry and added to the appropriate y-vector entry, i.e.

$$y_i^{(k+1)} = y_i^{(k)} + a_{ij}\,x_j, \qquad y_i^{(0)} = 0,$$

where $a_{ij}$ is the current matrix entry and $x_j$ the corresponding x-vector entry.
It is highly unlikely that the multiplier and adder will have a delay of just one clock cycle, so the "Y index" and "Valid operation" delay lines are used to synchronize the reading and writing of the y-buffer. The matrix elements within a tile are not stored in a row or column-oriented format. Instead, the x- and y-coordinate within the tile of each matrix element is specified explicitly. This means that the elements of a tile can be re-ordered arbitrarily when constructing the matrix, giving rise to the name software controlled arbitrary re-ordering (SCAR). This ability to re-order the matrix has two advantages. First, compared to row or column-oriented formats, there is no need to have additional control words within the tile to identify the end of a row or column. This is especially important if tiles are small. Secondly, where there is a delay in the adder pipeline, the matrix entries within a tile can be re-ordered to make sure that true data dependencies (RAW hazards) in the computation of intermediate values of the y vector are respected. In many processors data dependencies are detected and handled by additional hardware. In the SCAR architecture, no such additional hardware is required.
Since the latency of the feedback path between the y-buffer and the adder is more than one clock cycle there is a possibility of a Read-After-Write (RAW) hazard occurring with the y vector entry. This occurs when a particular y value is requested from the y-buffer but its true value has not yet exited the adder pipeline. The value read from the y-buffer is out of date and so the result will be incorrect. There is no hardware in the arithmetic data path to detect this scenario so it is up to the software that constructs the matrix data structure to ensure that this does not occur. It is a relatively simple task to do this since a data value from another row in the matrix can be started while the first row is in the pipeline or a NOP can be inserted in the matrix data stream.
While a block of the matrix is being processed the data-path never stalls. A new matrix entry or a NOP enters the data-path at every clock cycle. This is possible because all the relevant vector entries are already in the x and y buffers.
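Since the encoder, not the hardware, is responsible for avoiding these hazards, a pre-processing tool can verify a candidate stream before it is used. A minimal sketch of such a check (the function name and stream representation are assumptions; None stands for a NOP):

```python
def check_no_raw_hazards(stream_rows, latency):
    """Return True if any two stream words that update the same y row are
    at least `latency` positions apart, so the first result has left the
    adder pipeline before the second word reads it."""
    last_seen = {}
    for pos, row in enumerate(stream_rows):
        if row is None:                  # NOP: touches no y entry
            continue
        if row in last_seen and pos - last_seen[row] < latency:
            return False                 # RAW hazard: same row too close
        last_seen[row] = pos
    return True
```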
Controlling State Machine
The state machine coordinates the operation of the hardware. It is controlled by embedded commands in the matrix data structure. The data structure and the embedded commands are explained below.
The data structure
A single entry in the matrix stream contains more than just the data point in the matrix. It also contains the location for the entry and embedded commands. The embedded commands tell the controlling state machine what to do and also mark the data as valid or a NOP.
The matrix word is split into 3 sub-words, viz. a command, an index and a 64-bit payload. The payload is assumed to be 64-bits to allow for IEEE double-precision numbers but it does not have to be.
Descriptions of the command sub-word are shown in Table 1.
[Table rendered as an image in the original; it defines the command encodings named in the text: block update (BU), valid matrix entry (VAL) and NOP.]
Table 1: The 2-bit command sub-word.
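In software, such a word might be assembled as below. The 2-bit command and 64-bit payload are as stated in the text; the command encodings and the 16-bit index sub-fields are assumptions, since they are not fixed here:

```python
import struct

CMD_NOP, CMD_VAL, CMD_BU = 0, 1, 2   # assumed 2-bit command encodings

def encode_val(row_in_block, col_in_block, value):
    """Pack one valid matrix entry: the index carries the entry's (row,
    column) coordinates inside the block, and the payload is the IEEE
    double reinterpreted as a 64-bit integer."""
    index = (row_in_block << 16) | col_in_block
    payload = struct.unpack("<Q", struct.pack("<d", value))[0]
    return (CMD_VAL, index, payload)
```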
The first word in the matrix stream must be a block update (BU). The word is structured as follows:
[Diagram rendered as an image in the original: the layout of the block update word.]
The role of the block update command is to tell the SCAR unit that a new matrix block is about to begin. It gives the information that is required to update the vector buffers. Instead of a 64-bit double the payload now contains two 32-bit integers. They give the coordinates in the matrix of the top left hand corner of the block. This index is also split into two. This is information for the x buffer and specifies the range of x vector elements that will be used in processing the block. The limits are addressed relative to the block itself. To get an absolute index they need to be added to the column coordinate from the payload. For example, if we are about to process a block that has an upper left hand corner on row 2048 and column 1024, with the first non-zero entry in column 1056 and last non-zero entry in column 2040, then the command word is:
[Diagram rendered as an image in the original: the block update word for this example.]
Apart from the restriction that the first entry in the matrix data stream must be a block update command the block update can occur at any point in the stream. When it is encountered the controlling state machine stops reading in the matrix data and updates the buffers. If the row coordinate has changed then the new block is from a different horizontal row in the matrix, so the y-buffer is flushed into the external vector memory. Next the required range of x-vector entries is loaded into the x-buffer and when this process is completed the matrix processing resumes.
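Under the same assumed field layout, the block update word from the example above could be assembled as follows; the asserted values (32 and 1016) are the block-relative x limits named in the text:

```python
def encode_bu(block_row, block_col, first_col, last_col):
    """Block update word: the payload carries the block's top-left matrix
    coordinates as two 32-bit integers (row in the high half is an
    assumption); the index carries the block-relative x-vector range."""
    index = ((first_col - block_col) << 16) | (last_col - block_col)
    payload = (block_row << 32) | block_col
    return (CMD_BU, index, payload)

# Worked example from the text: block top-left at row 2048, column 1024,
# first non-zero in column 1056, last non-zero in column 2040.
cmd, index, payload = encode_bu(2048, 1024, 1056, 2040)
assert index >> 16 == 32          # lower x limit, relative to the block
assert index & 0xFFFF == 1016     # upper x limit, relative to the block
assert payload == (2048 << 32) | 1024
```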
If the matrix word is a valid matrix entry (VAL) then the index word specifies the coordinates of the entry in the block (i.e. is the address in the x and y buffers) and the payload is the entry itself.
If the matrix word is a NOP then both the index word and payload can be anything. The data will pass through the data path but nothing will be written into the y-buffer. However, choosing a value for the index word and payload that minimizes the Hamming distance between the NOP word and the previous and following words may reduce the dynamic power consumed by the SCAR unit when fetching and executing the NOP.
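One simple way to exploit this freedom, sketched below under the same assumed word layout, is to copy the previous word's index and payload into the NOP so that those fields toggle no bits at all when the NOP is fetched (the exact choice is left open here):

```python
def make_nop(prev_word):
    """Build a NOP whose index and payload repeat the previous word's,
    so only the command field changes on the matrix bus."""
    _, prev_index, prev_payload = prev_word
    return (CMD_NOP, prev_index, prev_payload)
```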
Since the matrix data within each block comes with sufficient information to determine its exact location within the block, the processing is not confined to following rows or columns. The data can be organized to maximize FPU utilization. The only ordering constraint is to prevent RAW hazards.
State Machine Operation
Once the SCAR unit has been given the start signal it loads the first entry in the data structure. This will tell the system which range of x-vector elements to load into the x-buffer. When they are in the x-buffer the matrix-by-vector multiplication can begin. Each matrix entry is read and if it is a valid entry it is multiplied by the appropriate x entry and added to the y entry. If it is a NOP it passes through the data-path but the result will not be written in the y-buffer.
When a block is finished the next x-vector range is loaded into the x-buffer and the process repeats.
When all the matrix blocks making up a horizontal section of the matrix have been processed the y-buffer contains the final result for that section of the result vector and is written into the external vector memory. The y-vector is reinitialized during this process. Once this is complete the process repeats for the next matrix block.
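Putting the pieces together, the state machine's behaviour can be modelled in software. This is a behavioural sketch only, reusing the assumed word layout from the encoding sketches above; x_mem and y_mem stand in for the external vector memory and K is the block height:

```python
import struct

def scar_execute(stream, x_mem, y_mem, K):
    """Consume the single matrix stream, dispatching on each word's
    embedded command: BU updates the buffers, VAL performs one
    multiply-accumulate, NOP passes through untouched."""
    x_buf, y_buf = [], [0.0] * K
    cur_row, block_col, x_base = None, 0, 0
    for cmd, index, payload in stream:
        if cmd == CMD_BU:
            block_row, block_col = payload >> 32, payload & 0xFFFFFFFF
            if cur_row is not None and block_row != cur_row:
                y_mem[cur_row:cur_row + K] = y_buf    # flush finished y fragment
                y_buf = [0.0] * K                     # re-initialize the y-buffer
            cur_row = block_row
            x_base = block_col + (index >> 16)        # absolute lower x limit
            x_hi = block_col + (index & 0xFFFF)       # absolute upper x limit
            x_buf = x_mem[x_base:x_hi + 1]            # burst load into the x-buffer
        elif cmd == CMD_VAL:
            r, c = index >> 16, index & 0xFFFF        # coordinates within the block
            a = struct.unpack("<d", struct.pack("<Q", payload))[0]
            y_buf[r] += a * x_buf[block_col + c - x_base]
        # CMD_NOP: the word passes through; nothing is written to the y-buffer
    if cur_row is not None:
        y_mem[cur_row:cur_row + K] = y_buf            # final flush
```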
Example
A sample SCAR data stream for the matrix shown in Fig. 3 is shown in Table 2. The adder feedback path is assumed to be 3 clock cycles in this example. This means that there must be at least 3 data words between entries from the same row. These "padding" data words can be entries from other rows or NOPs if there are no other entries available.
One simple way of ensuring this ordering is to allocate rows to each pipeline stage using modulo arithmetic. So for a 3 stage pipeline and a 10x10 block we can construct 3 data streams. Stream 0 contains rows {0,3,6,9}, stream 1 contains rows {1,4,7} and stream 2 contains rows {2,5,8}. NOPs are then appended to the end of these data streams to ensure that they are all the same length. The streams can then be interleaved to construct the full matrix stream. This is the method that was used to construct the matrix stream in Table 2. Other techniques for constructing the matrix data stream are equally valid. In some cases the number of NOPs inserted in the stream can be reduced by using a more sophisticated scheduling strategy (e.g. list scheduling) rather than restricting the method to modulo(k) row interleaving where k is the adder pipeline depth.
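A sketch of this modulo-k construction for one block (names assumed; None stands for a NOP). Consecutive entries from the same row end up k positions apart in the interleaved stream, matching the adder feedback latency:

```python
def schedule_block(entries, k):
    """Modulo-k row interleaving: deal the block's (row, col, value)
    entries into k sub-streams by row index mod k, pad the sub-streams
    to equal length with NOPs, then interleave them round-robin."""
    streams = [[] for _ in range(k)]
    for entry in entries:
        row = entry[0]
        streams[row % k].append(entry)
    longest = max(len(s) for s in streams)
    for s in streams:
        s.extend([None] * (longest - len(s)))    # NOP padding
    return [s[i] for i in range(longest) for s in streams]
```

For the 3-stage example above, entries of rows 0, 3, 6 and 9 land in sub-stream 0, and so on.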
[Table rendered as an image in the original.]
Table 2: A typical SCAR data stream for a 10x10 block from Fig. 3 with a latency of 3 cycles in the adder/y buffer path.
Performance measurements
A proof of concept system has been implemented on an FPGA and tested using a range of square finite element matrices. Matrices from different application spaces may have different performance results. The system used in the test has a 64-bit vector bus and a 96-bit matrix bus. The buffer size is 2048 words for both the x and y buffers. The latency of the adder/y buffer feedback path is 14 clock cycles. The results are shown in Table 3. The median FPU utilization is 85.81%, which equates to 172 MFLOPS @ 100 MHz. The median vector bus utilization is 8.20%, so if there are multiple independent matrix memories then several SCAR units could share a single vector bus.
[Table rendered across three images in the original.]
Table 3: SCAR performance.
The following outlines some possible extensions/improvements to the invention.
Early restart
In the above system all of the relevant entries in the x vector are loaded into the x-buffer at the start of a block. While this process is underway the arithmetic units are idle. It would be more efficient to have the arithmetic units only wait until the x vector element that is needed by the first matrix entry is loaded into the buffer. This is relatively simple to implement. It does require the inclusion of stall hardware in the arithmetic units to deal with the case where x entries are missing, and miss detection hardware in the x-buffer.
In terms of overall performance early restart will not have a significant impact on matrices that consist of a relatively small number of non-zero blocks which are relatively dense. This type of matrix is common in finite element calculations. However, matrices with a large number of relatively sparse blocks will benefit significantly from early restart.
Single memory interface
The system that was implemented used two independent interfaces to external memory, one for the matrix and one for the vector. However, in the SCAR unit without early restart the two memory interfaces are never active at the same time. This means that a single memory interface could be used. The only performance penalty with this approach is an additional external memory startup delay at the beginning of each block as the matrix interface is activated. If this delay is short relative to the number of non-zero matrix entries in the block then its impact on overall performance will be minimal.
Multiple arithmetic units
The description above assumes that there is only one multiplier and adder in the system. This does not have to be the case and it is a relatively simple matter to build a system with two multipliers and adders. The approach to constructing the matrix data structure is exactly the same as before.
The main complication with this approach is that the number of ports in the y buffer increases. Two adders need a 4-port y buffer. This can be implemented by running the 2-port buffer at twice the clock rate as the rest of the system or the y buffer could be divided into multiple parallel banks. It is an easy matter for the software to take account of this. In fact many of the algorithms to construct the matrix data stream will naturally implement this split.
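As a sketch of one such natural split (the banking scheme is an assumption; the text only notes that the stream-construction algorithms implement a split of this kind), block entries can be dealt by row parity so each adder owns one y-buffer bank:

```python
def split_for_two_adders(entries):
    """Deal (row, col, value) entries into two streams by row parity so
    that one adder updates even y rows and the other updates odd ones,
    and the two pipelines never write the same bank in the same cycle."""
    even, odd = [], []
    for entry in entries:
        (even if entry[0] % 2 == 0 else odd).append(entry)
    return even, odd
```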
Symmetry support
Many algorithms produce symmetric matrices. It is possible to take advantage of this in the matrix-by-vector multiplication. Symmetry support can take two forms, viz. full symmetry and block symmetry.
Full Symmetry
Support for symmetric matrices has several significant advantages. It approximately halves the memory bandwidth required to load the matrix and also halves the amount of memory required to store the matrix. Also since multiple arithmetic units can operate in parallel it has the potential to nearly double the performance of the hardware. The basic SCAR has two caches (x and y). The region of values in the y cache stays constant across an entire row of blocks. The region of values in the x cache changes every time we move to the next block across a row.
The symmetric SCAR has four caches, viz. horizontal x, horizontal y, vertical x and vertical y. The range of values in the horizontal caches stays constant, whereas the range of values in the vertical caches changes as we move across a row of blocks.
In the basic SCAR a set of x values is read in at the start of every block. In the symmetric SCAR a set of x values must also be read into the vertical x cache at the start of a block. The vertical y cache must also write out a set of y values and then read in a new set of y values. This will produce approximately three times as much memory traffic on the vector memory bus as in the basic SCAR. The horizontal y cache must also be updated at the end of every row of blocks but this traffic should be insignificant compared with the traffic produced by the vertical caches.
The operation of the symmetric SCAR is simple for most blocks. A non-zero from the matrix is read in. Two x values (one from each x cache) and two y values (one from each y cache) are fed into two parallel arithmetic pipelines along with the matrix entry. The results are then written back to the y caches.
As with the basic SCAR, it is up to the software to eliminate RAW hazards. However, given that two y values are being updated, the software must schedule the operations to avoid RAW hazards on both values. This is a more difficult problem than in the basic SCAR and will probably require more padding with NOPs.
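The per-entry work can be summarised in a few lines; this sketch assumes each stored non-zero is strictly below the diagonal, so its mirrored entry is distinct:

```python
def symmetric_update(a, r, c, x_h, x_v, y_h, y_v):
    """One stored non-zero a of a symmetric block feeds both parallel
    pipelines: the entry itself updates the horizontal y fragment, and
    its mirror updates the vertical y fragment."""
    y_h[r] += a * x_v[c]   # the stored entry's contribution
    y_v[c] += a * x_h[r]   # the mirrored entry's contribution
```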
Block Symmetry
Full symmetric matrix support requires a more complicated caching arrangement than in the basic SCAR architecture. However, many real-world symmetric matrices have a narrow bandwidth where most of the non-zero entries are clustered around the main diagonal. This means that the blocks in the SCAR data stream around the diagonal are themselves symmetric. A large proportion of the total non-zeros will be contained in these blocks.
The symmetry of these blocks can be exploited while the rest of the matrix is processed as if it was not symmetric.
Backward/forward substitution
Backward/forward substitution or a triangular solve is another common but time consuming operation.
Consider:
$$L\,y = x$$

where L is an NxN lower triangular matrix, x is a known vector and y is an unknown vector. The elements of y are determined as follows:

$$y_i = \frac{1}{L_{ii}}\left(x_i - \sum_{j=0}^{i-1} L_{ij}\,y_j\right), \qquad i = 0, 1, \ldots, N-1.$$
The divisions can be eliminated by replacing the diagonal entries of the matrix with their reciprocals on the host when the matrix is being constructed. This is a very similar operation to matrix-by-vector multiplication. The main difference is that the addition is now a subtraction.
A similar argument can be applied to an upper triangular system.
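A software model of the forward substitution with the pre-computed reciprocals (the sparse row representation and names are assumptions):

```python
def forward_substitute(L_rows, recip_diag, x):
    """Solve L y = x by forward substitution. L_rows[i] holds the
    strictly-lower non-zeros of row i as (j, L[i][j]) pairs, and
    recip_diag[i] = 1 / L[i][i] was computed on the host, so the
    division becomes a multiplication."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        acc = x[i]
        for j, l_ij in L_rows[i]:
            acc -= l_ij * y[j]           # the addition becomes a subtraction
        y[i] = acc * recip_diag[i]       # multiply by the reciprocal
    return y
```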
The invention is not limited to the embodiments described but may be varied in construction and detail. For example, when reading the x vector data for a tile there is no particular reason why the x vector data needs to be read in a single burst. In fact it is a viable option to read in separate parts of the x-vector and process them together. The method can be implemented either in hardware or in software.

Claims

Claims
1. A method performed by a data processor for Ax = y sparse matrix-by-vector multiplication, the method comprising the steps of an arithmetic data path multiplying each of a plurality of blocks of the matrix A at a time with a fragment of the x vector to provide a fragment of the y vector, wherein the matrix is encoded with matrix words and the words are processed as a stream for multiplication.
2. A method as claimed in claim 1, wherein the blocks are arranged in rows.
3. A method as claimed in claim 2, wherein all blocks of a row are processed before processing blocks of a next row.
4. A method as claimed in any preceding claim, wherein there is an x-buffer for the x vector and a y-buffer for the y vector.
5. A method as claimed in claim 4, wherein the y-buffer is initialized before the start of a row.
6. A method as claimed in any preceding claim, wherein the matrix words are streamed in from a memory without data processor storage.
7. A method as claimed in any of claims 4 to 6, wherein the y-buffer is initialized and a write-out from the y-buffer is performed in a single cycle using two buffer ports.
8. A method as claimed in any of claims 4 to 7, wherein relevant entries of the x-vector are written from external memory to the x-buffer in burst fashion.
9. A method as claimed in any preceding claim, wherein the multiplication is controlled by a state machine.
10. A method as claimed in claim 9, wherein state machine commands are embedded in the matrix words.
11. A method as claimed in any preceding claim, wherein the matrix block structure is represented by the matrix words.
12. A method as claimed in claim 11, wherein at least some matrix words have a command, an index, and a payload.
13. A method as claimed in either of claims 11 or 12, wherein matrix words include data, row, and column block coordinate information.
14. A method as claimed in any preceding claim, comprising the further step of pre-processing the matrix according to an encoding scheme to generate the matrix words.
15. A method as claimed in claim 14, wherein the matrix words are encoded to avoid Read after Write (RAW) data hazards by any available data interleaving method such as modulo arithmetic.
16. A method as claimed in either of claims 14 or 15, wherein the matrix elements within a tile are not stored in a row or column-oriented format, the x- and y-coordinate within the tile of each matrix element being specified explicitly by the matrix words, so that the elements of a tile can be re-ordered arbitrarily when encoding the matrix.
17. A method as claimed in any of claims 14 to 16, wherein floating point units are never forced to stall.
18. A method as claimed in any preceding claim, comprising the steps of exploiting symmetry in matrices to reduce processing.
19. A method as claimed in claim 18, wherein the symmetry is about the main diagonal at the block level.
20. A method as claimed in claim 19, comprising implementing a triangular solve where the diagonal elements in the matrix are replaced by their reciprocals in a pre-processing step.
21. A data processing system comprising means for performing a method as claimed in any preceding claim.
22. A data processing system as claimed in claim 21, adapted to perform the sparse matrix by vector multiplication when performing finite element analysis.
23. A computer program product comprising software code for performing steps of a method of any of claims 1 to 20 when executing on a digital processor.
PCT/IE2008/000089 2007-09-19 2008-09-19 Sparse matrix by vector multiplication WO2009037684A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US96017607P 2007-09-19 2007-09-19
US60/960,176 2007-09-19

Publications (2)

Publication Number Publication Date
WO2009037684A2 2009-03-26
WO2009037684A3 2010-05-06

Family

ID=40468549

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IE2008/000089 WO2009037684A2 (en) 2007-09-19 2008-09-19 Sparse matrix by vector multiplication

Country Status (1)

Country Link
WO (1) WO2009037684A2 (en)



Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006120664A2 (en) * 2005-05-13 2006-11-16 Provost Fellows And Scholars Of The College Of The Holy And Undivided Trinity Of Queen Elizabeth Near Dublin A data processing system and method

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
DELORIMIER M ET AL: "Floating-point sparse matrix-vector multiply for FPGAs" PROCEEDINGS OF THE 2005 ACM/SIGDA 13TH INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE GATE ARRAYS (FPGA'05), 20-22 FEBRUARY 2005, MONTEREY, CALIFORNIA, USA, 2005, XP002571772 cited in the application *
EUN-JIN IM ET AL: "SPARSITY: Optimization framework for sparse matrix kernels" INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, vol. 18, no. 1, 2004, pages 135-158, XP002571764 cited in the application *
GREGG D ET AL: "FPGA based sparse matrix vector multiplication using commodity DRAM memory" PROCEEDINGS OF INTERNATIONAL CONFERENCE ON FIELD PROGRAMMABLE LOGIC AND APPLICATIONS 2007 (FPL 2007), 27-29 AUGUST 2007, AMSTERDAM, NETHERLANDS, 27 August 2007 (2007-08-27), pages 786-791, XP031159191 ISBN: 978-1-4244-1059-0 *
LEE B C ET AL: "Performance models for evaluation and automatic tuning of symmetric sparse matrix-vector multiply" INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING 2004 (ICPP 2004), 15-18 AUGUST 2004, MONTREAL, QC, CANADA, 15 August 2004 (2004-08-15), pages 169-176, XP010718617 ISBN: 978-0-7695-2197-8 *
MCGETTRICK S ET AL: "An FPGA architecture for the Pagerank eigenvector problem" PROCEEDINGS OF INTERNATIONAL CONFERENCE ON FIELD PROGRAMMABLE LOGIC AND APPLICATIONS 2008 (FPL 2008), 8-10 SEPTEMBER 2008, HEIDELBERG, GERMANY, 8 September 2008 (2008-09-08), pages 523-526, XP031324414 ISBN: 978-1-4244-1960-9 *
MOLONEY D ET AL: "Streaming sparse matrix compression/decompression" PROCEEDINGS OF THE FIRST INTERNATIONAL CONFERENCE ON HIGH-PERFORMANCE EMBEDDED ARCHITECTURES AND COMPILERS (HIPEAC 2005), 17-18 NOVEMBER 2005, BARCELONA, SPAIN, LECTURE NOTES IN COMPUTER SCIENCE, vol. 3793, November 2005 (2005-11), pages 116-129, XP019024259 ISBN: 978-3-540-30317-6 *
SMAILBEGOVIC F ET AL: "Sparse matrix storage format" PROCEEDINGS OF THE 16TH ANNUAL WORKSHOP ON CIRCUITS, SYSTEMS AND SIGNAL PROCESSING, NOVEMBER 2005, VELDHOVEN, NETHERLANDS, November 2005 (2005-11), pages 445-448, XP002571766 *
SUN J ET AL: "Mapping sparse matrix-vector multiplication on FPGAs" PROCEEDINGS OF THE THIRD ANNUAL RECONFIGURABLE SYSTEMS SUMMER INSTITUTE (RSSI'07), 17-20 JULY 2007, URBANA, ILLINOIS, USA, [Online] 17 July 2007 (2007-07-17), XP002571763 Retrieved from the Internet: URL:http://rssi.ncsa.illinois.edu/proceedings/papers/rssi07_12_paper.pdf> [retrieved on 2010-03-05] *
ZHUO L ET AL: "Sparse matrix-vector multiplication on FPGAs" PROCEEDINGS OF THE 2005 ACM/SIGDA 13TH INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE GATE ARRAYS (FPGA'05), 20-22 FEBRUARY 2005, MONTEREY, CALIFORNIA, USA, 2005, XP002571765 cited in the application *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012076379A3 (en) * 2010-12-06 2013-01-10 International Business Machines Corporation Data structure for tiling and packetizing a sparse matrix
WO2012076377A3 (en) * 2010-12-06 2013-01-17 International Business Machines Corporation Optimizing output vector data generation using a formatted matrix data structure
US8676874B2 (en) 2010-12-06 2014-03-18 International Business Machines Corporation Data structure for tiling and packetizing a sparse matrix
US8762655B2 (en) 2010-12-06 2014-06-24 International Business Machines Corporation Optimizing output vector data generation using a formatted matrix data structure
US8769216B2 (en) 2010-12-06 2014-07-01 International Business Machines Corporation Optimizing output vector data generation using a formatted matrix data structure
US8959135B2 (en) 2010-12-06 2015-02-17 International Business Machines Corporation Data structure for tiling and packetizing a sparse matrix
US9367519B2 (en) 2013-08-30 2016-06-14 Microsoft Technology Licensing, Llc Sparse matrix data structure
CN108733348A (en) * 2017-04-21 2018-11-02 上海寒武纪信息科技有限公司 The method for merging vector multiplier and carrying out operation using it
US10055383B1 (en) 2017-04-28 2018-08-21 Hewlett Packard Enterprise Development Lp Matrix circuits
CN112257372A (en) * 2020-12-21 2021-01-22 北京智芯仿真科技有限公司 Method and system for extracting impedance network model of integrated circuit
CN112991142A (en) * 2021-03-31 2021-06-18 腾讯科技(深圳)有限公司 Matrix operation method, device, equipment and storage medium of image data
CN112991142B (en) * 2021-03-31 2023-06-16 腾讯科技(深圳)有限公司 Matrix operation method, device, equipment and storage medium for image data

Also Published As

Publication number Publication date
WO2009037684A3 (en) 2010-05-06


Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 08807996; Country of ref document: EP; Kind code of ref document: A2)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 08807996; Country of ref document: EP; Kind code of ref document: A2)