WO2009037684A2 - Sparse matrix by vector multiplication - Google Patents

Sparse matrix by vector multiplication Download PDF

Info

Publication number
WO2009037684A2
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
vector
buffer
data
words
Prior art date
Application number
PCT/IE2008/000089
Other languages
French (fr)
Other versions
WO2009037684A3 (en)
Inventor
Thomas Dermot Geraghty
David Gregg
Bartley McElroy
Fergal Connor
Ciarán McElroy
Original Assignee
Provost Fellows And Scholars Of The College Of The Holy And Undivided Trinity Of Queen Elizabeth Near Dublin
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Provost Fellows And Scholars Of The College Of The Holy And Undivided Trinity Of Queen Elizabeth Near Dublin filed Critical Provost Fellows And Scholars Of The College Of The Holy And Undivided Trinity Of Queen Elizabeth Near Dublin
Publication of WO2009037684A2 publication Critical patent/WO2009037684A2/en
Publication of WO2009037684A3 publication Critical patent/WO2009037684A3/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization



Abstract

The invention involves pre-processing the matrix according to an encoding scheme whereby the non-zero data (in any numerical format), blocking information, the row and column offset indices within a block, and state machine control words are combined in a single data stream. Thus, a single vector may be used to store all of the matrix information required to compute a sparse matrix by vector multiplication. Therefore, the system can be used effectively with a single memory channel. Also, it can be used in parallel with multiple independent memory channels. This method of matrix-by-vector multiplication allows very high FPU utilization to be achieved for low bandwidth matrices such as those from finite element calculations. Also, it allows the local memory buffers to be simple, so there is no need for a complex cache architecture.

Description

"Sparse matrix by vector multiplication"
INTRODUCTION
Field of the Invention
The invention relates to data processing and particularly sparse matrix by vector multiplication.
Prior Art Discussion
Matrix by vector multiplication data processing is required for finite element analysis and scientific and engineering computing.
The matrices that occur in many applications such as finite element analysis are often very large and sparse. Since many of the entries in the matrix are zero, a variety of sparse matrix compression methods are used when storing them. An overview of sparse matrix storage methods is given in [1]. They usually involve storing just the non-zero entries in the matrix along with the coordinates of the entries within the matrix. This has some performance implications. A matrix entry can no longer be accessed with a single memory read, as the coordinates of the entry within the matrix must be read before the entry itself. Also, consecutive entries in the matrix data structure probably do not correspond to consecutive entries in the vector. The net effect of these issues is very poor performance when using general purpose processors. It is common for a processor to achieve only 10% of its peak performance [2].
A method for matrix-by-vector multiplication, termed "SPAR", is described in [4]. This method uses column-based multiplication operations.
One of the main reasons that general purpose processors do not perform well in sparse matrix by vector calculations is the poor data locality of the matrix in memory. This results in an excessive cache miss rate. When the main memory is commodity memory (such as DDR or DDR2) the penalty for a cache miss is very severe and the fraction of peak floating point performance which can be achieved is dramatically reduced.
The invention is directed towards providing improved efficiency in operation of data processing hardware.
References
[1] I. S. Duff, A. M. Erisman and J. K. Reid, Direct Methods for Sparse Matrices, Oxford University Press, London, 1986.
[2] Eun-Jin Im, K. A. Yelick, R. Vuduc, "SPARSITY: An Optimization Framework for Sparse Matrix Kernels", International Journal of High Performance Computing Applications, 18(1), pp. 135-158, February 2004.
[3] V. E. Taylor, A. Ranade, D. G. Messerschmitt, "SPAR: a new architecture for large finite element computations", IEEE Transactions on Computers, Vol. 44, No. 4, April 1995, pp. 531-545.
[4] Valerie E. Taylor, "Method and apparatus for optimized processing of sparse matrices", US Patent 5,206,822, April 27, 1993.
[5] L. Zhuo and V. K. Prasanna, "Sparse Matrix-Vector Multiplication on FPGAs", in Proceedings of the 13th International Symposium on Field-Programmable Gate Arrays, 2005.
[6] Ling Zhuo, Viktor K. Prasanna, "High-Performance and Area-Efficient Reduction Circuits on FPGAs", in Proceedings of the 17th International Symposium on Computer Architecture and High Performance Computing, October 2005, Rio de Janeiro, Brazil.
[7] M. deLorimier and A. DeHon, "Floating-point sparse matrix-vector multiply for FPGAs", in Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays (Monterey, California, USA, February 20-22, 2005), FPGA '05, ACM Press, New York, NY, pp. 75-85.
SUMMARY OF THE INVENTION
According to the invention, there is provided a method performed by a data processor for Ax = y sparse matrix-by-vector multiplication, the method comprising the steps of an arithmetic data path multiplying each of a plurality of blocks of the matrix A at a time with a fragment of the x vector to provide a fragment of the y vector, wherein the matrix is encoded with matrix words and the words are processed as a stream for multiplication.
In one embodiment, the blocks are arranged in rows. In one embodiment, all blocks of a row are processed before processing blocks of a next row. Preferably, there is an x-buffer for the x vector and a y-buffer for the y vector. Preferably, the y-buffer is initialized before the start of a row.
In one embodiment, the matrix words are streamed in from a memory without data processor storage.
In one embodiment, the y-buffer is initialized and a write-out from the y-buffer is performed in a single cycle using two buffer ports.
In another embodiment, relevant entries of the x-vector are written from external memory to the x-buffer in burst fashion.
In one embodiment, the multiplication is controlled by a state machine. Preferably, state machine commands are embedded in the matrix words. In one embodiment, the matrix block structure is represented by the matrix words. Preferably, at least some matrix words have a command, an index, and a payload.
In one embodiment, matrix words include data, row, and column block coordinate information.
In a further embodiment, the method comprises the further step of pre-processing the matrix according to an encoding scheme to generate the matrix words.
In one embodiment, the matrix words are encoded to avoid Read after Write (RAW) data hazards by any available data interleaving method such as modulo arithmetic.
In one embodiment, the matrix elements within a tile are not stored in a row or column-oriented format, the x- and y-coordinate within the tile of each matrix element being specified explicitly by the matrix words, so that the elements of a tile can be reordered arbitrarily when encoding the matrix.
In one embodiment, floating point units are never forced to stall.
In one embodiment, the method comprises the steps of exploiting symmetry in matrices to reduce processing. Preferably, the symmetry is about the main diagonal at the block level.
In one embodiment, the method comprises implementing a triangular solve where the diagonal elements in the matrix are replaced by their reciprocals in a pre-processing step.
In another aspect, the invention provides a data processing system adapted to perform any method as defined above. In one embodiment, the system is adapted to perform the sparse matrix by vector multiplication when performing finite element analysis.
In a further aspect, the invention provides a computer program product comprising software code for performing steps of any method defined above when executing on a digital processor.
DETAILED DESCRIPTION OF THE INVENTION
Brief Description of the Invention
The invention will be more clearly understood from the following description of some embodiments thereof, given by way of example only with reference to the accompanying drawings in which:
Fig. 1 is a diagram illustrating a mechanism of the invention for multiplying a matrix by a vector;
Fig. 2 is a diagram illustrating the hardware architecture; and
Fig. 3 shows a sample (10 x 10) matrix.
Description of the Embodiments
The invention provides a data processing system and method (called Software Controlled Arbitrary Reordering ("SCAR")) for multiplying a matrix by a vector (Ax = y). Referring to Fig. 1 the method may be described at a high level as follows (a software model of this traversal is sketched after the list):
1. An NxM matrix, A, is divided into a regular grid composed of rectangular KxL blocks. These blocks are usually square but they do not have to be.
2. The blocks are processed one at a time. Each block is related to a fragment of the x and y vectors. The hardware has local memory buffers to act as a cache for these fragments.
3. All blocks in a horizontal row of the grid are processed before moving vertically in the grid.
4. At the start of a horizontal grid row the y-buffer is zeroed.
5. At the beginning of a block all the x values that will be used in the block are loaded into the local x-buffer in a single burst from external memory.
6. The matrix values are then processed. Each matrix entry streams in from external memory. There is no need for a local store for the matrix data.
7. When all the matrix entries in the block are processed, repeat the procedure for the next block starting at step 5.
8. At the end of a horizontal row of blocks in the grid the fragment of the y vector that has been computed is written into the external vector memory, the y-buffer is zeroed and processing of the next horizontal grid row can start. Since the y-buffer is dual port the zeroing does not incur a time penalty. Blocks within the same horizontal grid row can be processed in any order. Likewise horizontal rows can be processed in any order but a row must be completed before a block from another row is processed.
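The traversal above can be modelled in ordinary software. The following sketch is illustrative only (a dense list-of-lists stands in for the streamed matrix, and the helper name is hypothetical); in the invention the loop is implemented in hardware:

```python
def blocked_spmv(A, x, K, L):
    """Reference model of the SCAR traversal: sweep an N x M matrix as a
    grid of K x L blocks, one horizontal row of blocks at a time."""
    N, M = len(A), len(A[0])
    y = [0.0] * N
    for row0 in range(0, N, K):                  # one horizontal row of blocks
        y_buf = [0.0] * min(K, N - row0)         # step 4: zero the y-buffer
        for col0 in range(0, M, L):              # blocks across the row
            x_buf = x[col0:col0 + L]             # step 5: burst-load the x fragment
            for i in range(len(y_buf)):          # step 6: stream the block entries
                for j in range(len(x_buf)):
                    a = A[row0 + i][col0 + j]
                    if a != 0.0:
                        y_buf[i] += a * x_buf[j]
        y[row0:row0 + len(y_buf)] = y_buf        # step 8: write the y fragment out
    return y
```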
In general terms, the invention involves pre-processing the matrix according to a "SCAR" scheme whereby the non-zero data (in any numerical format), blocking information, the row and column offset indices within a block and state machine control words are combined in a single data stream. Thus, a single vector is used to store all of the matrix information required to compute a sparse matrix by vector multiplication. Therefore, the system can be used effectively with a single memory channel. Also, it can be used in parallel with multiple independent memory channels.
This method of matrix-by-vector multiplication has the following advantages:
- Very high FPU utilization can be achieved for low bandwidth matrices such as those from finite element calculations.
- The local memory buffers are simple, there is no need for a complex cache architecture.
A high level illustration of the SCAR hardware architecture is shown in Fig. 2. The architecture is composed of four main sections, viz. the y-buffer, x-buffer, arithmetic data path and a controlling state machine.
The data structure represents compressed sparse matrix data in a single stream where data, row and column coordinates and control words are in a single stream rather than in separate data structures.
The commands for the state machine controlling the sparse matrix by vector multiplication are embedded in the data structure. Also, a data stream is explicitly constructed to avoid Read after Write (RAW) data hazards by any available data interleaving method such as modulo arithmetic. Other advantageous aspects of the invention are: the floating point units are never forced to stall; the sparse matrix is pre-processed to generate the data structure; the method exploits symmetry in matrices to reduce processing; and it exploits symmetry about the main diagonal at the block level to reduce processing.
The y-buffer
As its name suggests the y-buffer stores the fragment of the product vector that is currently being computed. It is simply a dual port memory and every entry in the memory is initialized to zero at startup. As the matrix is being processed the current matrix entry is multiplied by the appropriate entry in the x-vector and the result is added to the appropriate entry in the y-vector.
When every y-vector entry for a row of blocks is computed the entire contents of the y-buffer are written into the external vector memory and the contents of the buffer are re-initialized to zero to be ready to compute the next section of the y-vector. Since the y-buffer has two ports the re-initialization of the buffer incurs no time penalty.
The x-buffer
The x-buffer is a cache for the x-vector. At the start of a new block of the matrix all the relevant entries in the x-vector are loaded from the external vector memory into the x-buffer. This allows burst operations, which tend to be the most efficient, from the external memory.
The arithmetic data path
The arithmetic data path consists of a floating-point multiplier and adder. The matrix entry is multiplied by the appropriate x-vector entry and added to the appropriate y-vector entry, i.e.

$$y_i^{(k+1)} = y_i^{(k)} + a_{ij}\,x_j, \qquad y_i^{(0)} = 0,$$

where $a_{ij}$ is the current matrix entry and $x_j$ the corresponding x-vector entry.
It is highly unlikely that the multiplier and adder will have a delay of just one clock cycle, so the "Y index" and "Valid operation" delay lines are used to synchronize the reading and writing of the y-buffer. The matrix elements within a tile are not stored in a row or column-oriented format. Instead, the x- and y-coordinate within the tile of each matrix element is specified explicitly. This means that the elements of a tile can be re-ordered arbitrarily when constructing the matrix, giving rise to the name software controlled arbitrary re-ordering (SCAR). This ability to re-order the matrix has two advantages. First, compared to row or column-oriented formats, there is no need to have additional control words within the tile to identify the end of a row or column. This is especially important if tiles are small. Secondly, where there is a delay in the adder pipeline, the matrix entries within a tile can be re-ordered to make sure that true data dependencies (RAW hazards) in the computation of intermediate values of the y vector are respected. In many processors data dependencies are detected and handled by additional hardware. In the SCAR architecture, no such additional hardware is required.
Since the latency of the feedback path between the y-buffer and the adder is more than one clock cycle there is a possibility of a Read-After-Write (RAW) hazard occurring with the y vector entry. This occurs when a particular y value is requested from the y-buffer but its true value has not yet exited the adder pipeline. The value read from the y-buffer is out of date and so the result will be incorrect. There is no hardware in the arithmetic data path to detect this scenario so it is up to the software that constructs the matrix data structure to ensure that this does not occur. It is a relatively simple task to do this since a data value from another row in the matrix can be started while the first row is in the pipeline or a NOP can be inserted in the matrix data stream.
While a block of the matrix is being processed the data-path never stalls. A new matrix entry or a NOP enters the data-path at every clock cycle. This is possible because all the relevant vector entries are already in the x and y buffers.
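Since the encoder, not the hardware, is responsible for avoiding these hazards, a pre-processing tool can verify a candidate stream before it is used. A minimal sketch of such a check (the function name and stream representation are assumptions; None stands for a NOP):

```python
def check_no_raw_hazards(stream_rows, latency):
    """Return True if any two stream words that update the same y row are
    at least `latency` positions apart, so the first result has left the
    adder pipeline before the second word reads it."""
    last_seen = {}
    for pos, row in enumerate(stream_rows):
        if row is None:                  # NOP: touches no y entry
            continue
        if row in last_seen and pos - last_seen[row] < latency:
            return False                 # RAW hazard: same row too close
        last_seen[row] = pos
    return True
```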
Controlling State Machine
The state machine coordinates the operation of the hardware. It is controlled by embedded commands in the matrix data structure. The data structure and the embedded commands are explained below.
The data structure
A single entry in the matrix stream contains more than just the data point in the matrix. It also contains the location for the entry and embedded commands. The embedded commands tell the controlling state machine what to do and also mark the data as valid or a NOP.
The matrix word is split into 3 sub-words, viz. a command, an index and a 64-bit payload. The payload is assumed to be 64-bits to allow for IEEE double-precision numbers but it does not have to be.
Descriptions of the command sub-word are shown in Table 1.
[Table rendered as an image in the original; it defines the command encodings named in the text: block update (BU), valid matrix entry (VAL) and NOP.]
Table 1: The 2-bit command sub-word.
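In software, such a word might be assembled as below. The 2-bit command and 64-bit payload are as stated in the text; the command encodings and the 16-bit index sub-fields are assumptions, since they are not fixed here:

```python
import struct

CMD_NOP, CMD_VAL, CMD_BU = 0, 1, 2   # assumed 2-bit command encodings

def encode_val(row_in_block, col_in_block, value):
    """Pack one valid matrix entry: the index carries the entry's (row,
    column) coordinates inside the block, and the payload is the IEEE
    double reinterpreted as a 64-bit integer."""
    index = (row_in_block << 16) | col_in_block
    payload = struct.unpack("<Q", struct.pack("<d", value))[0]
    return (CMD_VAL, index, payload)
```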
The first word in the matrix stream must be a block update (BU). The word is structured as follows:
[Diagram rendered as an image in the original: the layout of the block update word.]
The role of the block update command is to tell the SCAR unit that a new matrix block is about to begin. It gives the information that is required to update the vector buffers. Instead of a 64-bit double the payload now contains two 32-bit integers. They give the coordinates in the matrix of the top left hand corner of the block. This index is also split into two. This is information for the x buffer and specifies the range of x vector elements that will be used in processing the block. The limits are addressed relative to the block itself. To get an absolute index they need to be added to the column coordinate from the payload. For example, if we are about to process a block that has an upper left hand corner on row 2048 and column 1024, with the first non-zero entry in column 1056 and last non-zero entry in column 2040, then the command word is:
[Diagram rendered as an image in the original: the block update word for this example.]
Apart from the restriction that the first entry in the matrix data stream must be a block update command the block update can occur at any point in the stream. When it is encountered the controlling state machine stops reading in the matrix data and updates the buffers. If the row coordinate has changed then the new block is from a different horizontal row in the matrix, so the y-buffer is flushed into the external vector memory. Next the required range of x-vector entries is loaded into the x-buffer and when this process is completed the matrix processing resumes.
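Under the same assumed field layout, the block update word from the example above could be assembled as follows; the asserted values (32 and 1016) are the block-relative x limits named in the text:

```python
def encode_bu(block_row, block_col, first_col, last_col):
    """Block update word: the payload carries the block's top-left matrix
    coordinates as two 32-bit integers (row in the high half is an
    assumption); the index carries the block-relative x-vector range."""
    index = ((first_col - block_col) << 16) | (last_col - block_col)
    payload = (block_row << 32) | block_col
    return (CMD_BU, index, payload)

# Worked example from the text: block top-left at row 2048, column 1024,
# first non-zero in column 1056, last non-zero in column 2040.
cmd, index, payload = encode_bu(2048, 1024, 1056, 2040)
assert index >> 16 == 32          # lower x limit, relative to the block
assert index & 0xFFFF == 1016     # upper x limit, relative to the block
assert payload == (2048 << 32) | 1024
```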
If the matrix word is a valid matrix entry (VAL) then the index word specifies the coordinates of the entry in the block (i.e. is the address in the x and y buffers) and the payload is the entry itself.
If the matrix word is a NOP then both the index word and payload can be anything. The data will pass through the data path but nothing will be written into the y-buffer. However, choosing a value for the index word and payload that minimizes the Hamming distance between the NOP word and the previous and following words may reduce the dynamic power consumed by the SCAR unit when fetching and executing the NOP.
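One simple way to exploit this freedom, sketched below under the same assumed word layout, is to copy the previous word's index and payload into the NOP so that those fields toggle no bits at all when the NOP is fetched (the exact choice is left open here):

```python
def make_nop(prev_word):
    """Build a NOP whose index and payload repeat the previous word's,
    so only the command field changes on the matrix bus."""
    _, prev_index, prev_payload = prev_word
    return (CMD_NOP, prev_index, prev_payload)
```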
Since the matrix data within each block comes with sufficient information to determine its exact location within the block, the processing is not confined to following rows or columns. The data can be organized to maximize FPU utilization. The only ordering constraint is to prevent RAW hazards.
State Machine Operation
Once the SCAR unit has been given the start signal it loads the first entry in the data structure. This will tell the system which range of x-vector elements to load into the x-buffer. When they are in the x-buffer the matrix-by-vector multiplication can begin. Each matrix entry is read and if it is a valid entry it is multiplied by the appropriate x entry and added to the y entry. If it is a NOP it passes through the data-path but the result will not be written in the y-buffer.
When a block is finished the next x-vector range is loaded into the x-buffer and the process repeats.
When all the matrix blocks making up a horizontal section of the matrix have been processed the y-buffer contains the final result for that section of the result vector and is written into the external vector memory. The y-vector is reinitialized during this process. Once this is complete the process repeats for the next matrix block.
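Putting the pieces together, the state machine's behaviour can be modelled in software. This is a behavioural sketch only, reusing the assumed word layout from the encoding sketches above; x_mem and y_mem stand in for the external vector memory and K is the block height:

```python
import struct

def scar_execute(stream, x_mem, y_mem, K):
    """Consume the single matrix stream, dispatching on each word's
    embedded command: BU updates the buffers, VAL performs one
    multiply-accumulate, NOP passes through untouched."""
    x_buf, y_buf = [], [0.0] * K
    cur_row, block_col, x_base = None, 0, 0
    for cmd, index, payload in stream:
        if cmd == CMD_BU:
            block_row, block_col = payload >> 32, payload & 0xFFFFFFFF
            if cur_row is not None and block_row != cur_row:
                y_mem[cur_row:cur_row + K] = y_buf    # flush finished y fragment
                y_buf = [0.0] * K                     # re-initialize the y-buffer
            cur_row = block_row
            x_base = block_col + (index >> 16)        # absolute lower x limit
            x_hi = block_col + (index & 0xFFFF)       # absolute upper x limit
            x_buf = x_mem[x_base:x_hi + 1]            # burst load into the x-buffer
        elif cmd == CMD_VAL:
            r, c = index >> 16, index & 0xFFFF        # coordinates within the block
            a = struct.unpack("<d", struct.pack("<Q", payload))[0]
            y_buf[r] += a * x_buf[block_col + c - x_base]
        # CMD_NOP: the word passes through; nothing is written to the y-buffer
    if cur_row is not None:
        y_mem[cur_row:cur_row + K] = y_buf            # final flush
```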
Example
A sample SCAR data stream for the matrix shown in Fig. 3 is shown in Table 2. The adder feedback path is assumed to be 3 clock cycles in this example. This means that there must be at least 3 data words between entries from the same row. These "padding" data words can be entries from other rows or NOPs if there are no other entries available.
One simple way of ensuring this ordering is to allocate rows to each pipeline stage using modulo arithmetic. So for a 3 stage pipeline and a 10x10 block we can construct 3 data streams. Stream 0 contains rows {0,3,6,9}, stream 1 contains rows {1,4,7} and stream 2 contains rows {2,5,8}. NOPs are then appended to the end of these data streams to ensure that they are all the same length. The streams can then be interleaved to construct the full matrix stream. This is the method that was used to construct the matrix stream in Table 2. Other techniques for constructing the matrix data stream are equally valid. In some cases the number of NOPs inserted in the stream can be reduced by using a more sophisticated scheduling strategy (e.g. list scheduling) rather than restricting the method to modulo(k) row interleaving where k is the adder pipeline depth.
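A sketch of this modulo-k construction for one block (names assumed; None stands for a NOP). Consecutive entries from the same row end up k positions apart in the interleaved stream, matching the adder feedback latency:

```python
def schedule_block(entries, k):
    """Modulo-k row interleaving: deal the block's (row, col, value)
    entries into k sub-streams by row index mod k, pad the sub-streams
    to equal length with NOPs, then interleave them round-robin."""
    streams = [[] for _ in range(k)]
    for entry in entries:
        row = entry[0]
        streams[row % k].append(entry)
    longest = max(len(s) for s in streams)
    for s in streams:
        s.extend([None] * (longest - len(s)))    # NOP padding
    return [s[i] for i in range(longest) for s in streams]
```

For the 3-stage example above, entries of rows 0, 3, 6 and 9 land in sub-stream 0, and so on.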
[Table rendered as an image in the original.]
Table 2: A typical SCAR data stream for a 10x10 block from Fig. 3 with a latency of 3 cycles in the adder/y buffer path.
Performance measurements
A proof of concept system has been implemented on an FPGA and tested using a range of square finite element matrices. Matrices from different application spaces may have different performance results. The system used in the test has a 64-bit vector bus and a 96-bit matrix bus. The buffer size is 2048 words for both the x and y buffers. The latency of the adder/y buffer feedback path is 14 clock cycles. The results are shown in Table 3. The median FPU utilization is 85.81%, which equates to 172 MFLOPS @ 100 MHz. The median vector bus utilization is 8.20%, so if there are multiple independent matrix memories then several SCAR units could share a single vector bus.
[Table rendered across three images in the original.]
Table 3: SCAR performance.
The following outlines some possible extensions/improvements to the invention.
Early restart
In the above system all of the relevant entries in the x vector are loaded into the x-buffer at the start of a block. While this process is underway the arithmetic units are idle. It would be more efficient to have the arithmetic units only wait until the x vector element that is needed by the first matrix entry is loaded into the buffer. This is relatively simple to implement. It does require the inclusion of stall hardware in the arithmetic units to deal with the case where x entries are missing, and miss detection hardware in the x-buffer.
In terms of overall performance early restart will not have a significant impact on matrices that consist of a relatively small number of non-zero blocks which are relatively dense. This type of matrix is common in finite element calculations. However, matrices with a large number of relatively sparse blocks will benefit significantly from early restart.
Single memory interface
The system that was implemented used two independent interfaces to external memory, one for the matrix and one for the vector. However, in the SCAR unit without early restart the two memory interfaces are never active at the same time. This means that a single memory interface could be used. The only performance penalty with this approach is an additional external memory startup delay at the beginning of each block as the matrix interface is activated. If this delay is short relative to the number of non-zero matrix entries in the block then its impact on overall performance will be minimal.
Multiple arithmetic units
The description above assumes that there is only one multiplier and adder in the system. This does not have to be the case and it is a relatively simple matter to build a system with two multipliers and adders. The approach to constructing the matrix data structure is exactly the same as before.
The main complication with this approach is that the number of ports in the y buffer increases. Two adders need a 4-port y buffer. This can be implemented by running the 2-port buffer at twice the clock rate as the rest of the system or the y buffer could be divided into multiple parallel banks. It is an easy matter for the software to take account of this. In fact many of the algorithms to construct the matrix data stream will naturally implement this split.
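As a sketch of one such natural split (the banking scheme is an assumption; the text only notes that the stream-construction algorithms implement a split of this kind), block entries can be dealt by row parity so each adder owns one y-buffer bank:

```python
def split_for_two_adders(entries):
    """Deal (row, col, value) entries into two streams by row parity so
    that one adder updates even y rows and the other updates odd ones,
    and the two pipelines never write the same bank in the same cycle."""
    even, odd = [], []
    for entry in entries:
        (even if entry[0] % 2 == 0 else odd).append(entry)
    return even, odd
```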
Symmetry support
Many algorithms produce symmetric matrices. It is possible to take advantage of this in the matrix-by-vector multiplication. Symmetry support can take two forms, viz. full symmetry and block symmetry.
Full Symmetry
Support for symmetric matrices has several significant advantages. It approximately halves the memory bandwidth required to load the matrix and also halves the amount of memory required to store the matrix. Also since multiple arithmetic units can operate in parallel it has the potential to nearly double the performance of the hardware. The basic SCAR has two caches (x and y). The region of values in the y cache stays constant across an entire row of blocks. The region of values in the x cache changes every time we move to the next block across a row.
The symmetric SCAR has four caches, viz. horizontal x, horizontal y, vertical x and vertical y. The range of values in the horizontal caches stays constant, whereas the range of values in the vertical caches changes as we move across a row of blocks.
In the basic SCAR a set of x values is read in at the start of every block. In the symmetric SCAR a set of x values must also be read into the vertical x cache at the start of a block. The vertical y cache must also write out a set of y values and then read in a new set of y values. This will produce approximately three times as much memory traffic on the vector memory bus as in the basic SCAR. The horizontal y cache must also be updated at the end of every row of blocks but this traffic should be insignificant compared with the traffic produced by the vertical caches.
The operation of the symmetric SCAR is simple for most blocks. A non-zero from the matrix is read in. Two x values (one from each x cache) and two y values (one from each y cache) are fed into two parallel arithmetic pipelines along with the matrix entry. The results are then written back to the y caches.
As with the basic SCAR, it is up to the software to eliminate RAW hazards. However, given that two y values are being updated, the software must schedule the operations to avoid RAW hazards on both values. This is a more difficult problem than in the basic SCAR and will probably require more padding with NOPs.
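The per-entry work can be summarised in a few lines; this sketch assumes each stored non-zero is strictly below the diagonal, so its mirrored entry is distinct:

```python
def symmetric_update(a, r, c, x_h, x_v, y_h, y_v):
    """One stored non-zero a of a symmetric block feeds both parallel
    pipelines: the entry itself updates the horizontal y fragment, and
    its mirror updates the vertical y fragment."""
    y_h[r] += a * x_v[c]   # the stored entry's contribution
    y_v[c] += a * x_h[r]   # the mirrored entry's contribution
```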
Block Symmetry
Full symmetric matrix support requires a more complicated caching arrangement than in the basic SCAR architecture. However, many real-world symmetric matrices have a narrow bandwidth where most of the non-zero entries are clustered around the main diagonal. This means that the blocks in the SCAR data stream around the diagonal are themselves symmetric. A large proportion of the total non-zeros will be contained in these blocks.
The symmetry of these blocks can be exploited while the rest of the matrix is processed as if it was not symmetric.
Backward/forward substitution
Backward/forward substitution or a triangular solve is another common but time consuming operation.
Consider:
$$L\,y = x$$

where L is an NxN lower triangular matrix, x is a known vector and y is an unknown vector. The elements of y are determined as follows:

$$y_i = \frac{1}{L_{ii}}\left(x_i - \sum_{j=0}^{i-1} L_{ij}\,y_j\right), \qquad i = 0, 1, \ldots, N-1.$$
The divisions can be eliminated by replacing the diagonal entries of the matrix with their reciprocals on the host when the matrix is being constructed. This is a very similar operation to matrix-by-vector multiplication. The main difference is that the addition is now a subtraction.
A similar argument can be applied to an upper triangular system.
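A software model of the forward substitution with the pre-computed reciprocals (the sparse row representation and names are assumptions):

```python
def forward_substitute(L_rows, recip_diag, x):
    """Solve L y = x by forward substitution. L_rows[i] holds the
    strictly-lower non-zeros of row i as (j, L[i][j]) pairs, and
    recip_diag[i] = 1 / L[i][i] was computed on the host, so the
    division becomes a multiplication."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        acc = x[i]
        for j, l_ij in L_rows[i]:
            acc -= l_ij * y[j]           # the addition becomes a subtraction
        y[i] = acc * recip_diag[i]       # multiply by the reciprocal
    return y
```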
The invention is not limited to the embodiments described but may be varied in construction and detail. For example, when reading the x vector data for a tile there is no particular reason why the x vector data needs to be read in a single burst. In fact it is a viable option to read in separate parts of the x-vector and process them together. The method can be implemented either in hardware or in software.

Claims

Claims
1. A method performed by a data processor for Ax = y sparse matrix-by-vector multiplication, the method comprising the steps of an arithmetic data path multiplying each of a plurality of blocks of the matrix A at a time with a fragment of the x vector to provide a fragment of the y vector, wherein the matrix is encoded with matrix words and the words are processed as a stream for multiplication.
2. A method as claimed in claim 1, wherein the blocks are arranged in rows.
3. A method as claimed in claim 2, wherein all blocks of a row are processed before processing blocks of a next row.
4. A method as claimed in any preceding claim, wherein there is an x-buffer for the x vector and a y-buffer for the y vector.
5. A method as claimed in claim 4, wherein the y-buffer is initialized before the start of a row.
6. A method as claimed in any preceding claim, wherein the matrix words are streamed in from a memory without data processor storage.
7. A method as claimed in any of claims 4 to 6, wherein the y-buffer is initialized and a write-out from the y-buffer is performed in a single cycle using two buffer ports.
8. A method as claimed in any of claims 4 to 7, wherein relevant entries of the x-vector are written from external memory to the x-buffer in burst fashion.
9. A method as claimed in any preceding claim, wherein the multiplication is controlled by a state machine.
10. A method as claimed in claim 9, wherein state machine commands are embedded in the matrix words.
11. A method as claimed in any preceding claim, wherein the matrix block structure is represented by the matrix words.
12. A method as claimed in claim 11, wherein at least some matrix words have a command, an index, and a payload.
13. A method as claimed in either of claims 11 or 12, wherein matrix words include data, row, and column block coordinate information.
14. A method as claimed in any preceding claim, comprising the further step of pre-processing the matrix according to an encoding scheme to generate the matrix words.
15. A method as claimed in claim 14, wherein the matrix words are encoded to avoid Read after Write (RAW) data hazards by any available data interleaving method such as modulo arithmetic.
16. A method as claimed in either of claims 14 or 15, wherein the matrix elements within a tile are not stored in a row or column-oriented format, the x- and y-coordinate within the tile of each matrix element being specified explicitly by the matrix words, so that the elements of a tile can be re-ordered arbitrarily when encoding the matrix.
17. A method as claimed in any of claims 14 to 16, wherein floating point units are never forced to stall.
18. A method as claimed in any preceding claim, comprising the steps of exploiting symmetry in matrices to reduce processing.
19. A method as claimed in claim 18, wherein the symmetry is about the main diagonal at the block level.
20. A method as claimed in claim 19, comprising implementing a triangular solve where the diagonal elements in the matrix are replaced by their reciprocals in a pre-processing step.
21. A data processing system comprising means for performing a method as claimed in any preceding claim.
22. A data processing system as claimed in claim 21, adapted to perform the sparse matrix by vector multiplication when performing finite element analysis.
23. A computer program product comprising software code for performing steps of a method of any of claims 1 to 20 when executing on a digital processor.
PCT/IE2008/000089 2007-09-19 2008-09-19 Sparse matrix by vector multiplication WO2009037684A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US96017607P 2007-09-19 2007-09-19
US60/960,176 2007-09-19

Publications (2)

Publication Number Publication Date
WO2009037684A2 2009-03-26
WO2009037684A3 2010-05-06

Family

ID=40468549

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IE2008/000089 WO2009037684A2 (en) 2007-09-19 2008-09-19 Sparse matrix by vector multiplication

Country Status (1)

Country Link
WO (1) WO2009037684A2 (en)



Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006120664A2 (en) * 2005-05-13 2006-11-16 Provost Fellows And Scholars Of The College Of The Holy And Undivided Trinity Of Queen Elizabeth Near Dublin A data processing system and method

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
DELORIMIER M ET AL: "Floating-point sparse matrix-vector multiply for FPGAs" PROCEEDINGS OF THE 2005 ACM/SIGDA 13TH INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE GATE ARRAYS (FPGA'05), 20-22 FEBRUARY 2005, MONTEREY, CALIFORNIA, USA, 2005, XP002571772 cited in the application *
EUN-JIN IM ET AL: "SPARSITY: Optimization framework for sparse matrix kernels" INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, vol. 18, no. 1, 2004, pages 135-158, XP002571764 cited in the application *
GREGG D ET AL: "FPGA based sparse matrix vector multiplication using commodity DRAM memory" PROCEEDINGS OF INTERNATIONAL CONFERENCE ON FIELD PROGRAMMABLE LOGIC AND APPLICATIONS 2007 (FPL 2007), 27-29 AUGUST 2007, AMSTERDAM, NETHERLANDS, 27 August 2007 (2007-08-27), pages 786-791, XP031159191 ISBN: 978-1-4244-1059-0 *
LEE B C ET AL: "Performance models for evaluation and automatic tuning of symmetric sparse matrix-vector multiply" INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING 2004 (ICPP 2004), 15-18 AUGUST 2004, MONTREAL, QC, CANADA, 15 August 2004 (2004-08-15), pages 169-176, XP010718617 ISBN: 978-0-7695-2197-8 *
MCGETTRICK S ET AL: "An FPGA architecture for the Pagerank eigenvector problem" PROCEEDINGS OF INTERNATIONAL CONFERENCE ON FIELD PROGRAMMABLE LOGIC AND APPLICATIONS 2008 (FPL 2008), 8-10 SEPTEMBER 2008, HEIDELBERG, GERMANY, 8 September 2008 (2008-09-08), pages 523-526, XP031324414 ISBN: 978-1-4244-1960-9 *
MOLONEY D ET AL: "Streaming sparse matrix compression/decompression" PROCEEDINGS OF THE FIRST INTERNATIONAL CONFERENCE ON HIGH-PERFORMANCE EMBEDDED ARCHITECTURES AND COMPILERS (HIPEAC 2005), 17-18 NOVEMBER 2005, BARCELONA, SPAIN, LECTURE NOTES IN COMPUTER SCIENCE, vol. 3793, November 2005 (2005-11), pages 116-129, XP019024259 ISBN: 978-3-540-30317-6 *
SMAILBEGOVIC F ET AL: "Sparse matrix storage format" PROCEEDINGS OF THE 16TH ANNUAL WORKSHOP ON CIRCUITS, SYSTEMS AND SIGNAL PROCESSING, NOVEMBER 2005, VELDHOVEN, NETHERLANDS, November 2005 (2005-11), pages 445-448, XP002571766 *
SUN J ET AL: "Mapping sparse matrix-vector multiplication on FPGAs" PROCEEDINGS OF THE THIRD ANNUAL RECONFIGURABLE SYSTEMS SUMMER INSTITUTE (RSSI'07), 17-20 JULY 2007, URBANA, ILLINOIS, USA, [Online] 17 July 2007 (2007-07-17), XP002571763 Retrieved from the Internet: URL:http://rssi.ncsa.illinois.edu/proceedings/papers/rssi07_12_paper.pdf> [retrieved on 2010-03-05] *
ZHUO L ET AL: "Sparse matrix-vector multiplication on FPGAs" PROCEEDINGS OF THE 2005 ACM/SIGDA 13TH INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE GATE ARRAYS (FPGA'05), 20-22 FEBRUARY 2005, MONTEREY, CALIFORNIA, USA, 2005, XP002571765 cited in the application *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012076379A3 (en) * 2010-12-06 2013-01-10 International Business Machines Corporation Data structure for tiling and packetizing a sparse matrix
WO2012076377A3 (en) * 2010-12-06 2013-01-17 International Business Machines Corporation Optimizing output vector data generation using a formatted matrix data structure
US8676874B2 (en) 2010-12-06 2014-03-18 International Business Machines Corporation Data structure for tiling and packetizing a sparse matrix
US8762655B2 (en) 2010-12-06 2014-06-24 International Business Machines Corporation Optimizing output vector data generation using a formatted matrix data structure
US8769216B2 (en) 2010-12-06 2014-07-01 International Business Machines Corporation Optimizing output vector data generation using a formatted matrix data structure
US8959135B2 (en) 2010-12-06 2015-02-17 International Business Machines Corporation Data structure for tiling and packetizing a sparse matrix
US9367519B2 (en) 2013-08-30 2016-06-14 Microsoft Technology Licensing, Llc Sparse matrix data structure
CN108733348A (en) * 2017-04-21 2018-11-02 上海寒武纪信息科技有限公司 The method for merging vector multiplier and carrying out operation using it
US10055383B1 (en) 2017-04-28 2018-08-21 Hewlett Packard Enterprise Development Lp Matrix circuits
CN112257372A (en) * 2020-12-21 2021-01-22 北京智芯仿真科技有限公司 Method and system for extracting impedance network model of integrated circuit
CN112991142A (en) * 2021-03-31 2021-06-18 腾讯科技(深圳)有限公司 Matrix operation method, device, equipment and storage medium of image data
CN112991142B (en) * 2021-03-31 2023-06-16 腾讯科技(深圳)有限公司 Matrix operation method, device, equipment and storage medium for image data

Also Published As

Publication number Publication date
WO2009037684A3 (en) 2010-05-06


Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 08807996; Country of ref document: EP; Kind code of ref document: A2)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 08807996; Country of ref document: EP; Kind code of ref document: A2)