WO2023272917A1

WO2023272917A1 - Sparse matrix storage and computation system and method

Info

Publication number: WO2023272917A1
Application number: PCT/CN2021/115335
Authority: WO
Inventors: 李祎; 杨岭; 缪向水
Original assignee: 华中科技大学
Priority date: 2021-06-28
Filing date: 2021-08-30
Publication date: 2023-01-05
Also published as: CN113506589B; CN113506589A

Abstract

A sparse matrix storage and computation system and method, belonging to the field of microelectronic devices. The system comprises: a first storage array, used for storing a coordinate index table of the non-zero elements of a sparse matrix; a second storage array, used for storing the elements of the sparse matrix and also serving as an in-situ computing core for sparse matrix multiplication; a partitioning and storage scheduling unit, used for partitioning the sparse matrix into several sub-matrices, storing the sub-matrices in the second storage array in different compression formats, and establishing an index table corresponding to the sparse matrix; and a second peripheral circuit, used for converting a vector into voltage signals, applying the voltage signals to the bit lines or word lines corresponding to each sub-matrix of the sparse matrix, and completing the multiplication of the sparse matrix and the vector.

Description

A sparse matrix storage and calculation system and method

【Technical field】

The invention belongs to the field of microelectronic devices, and more specifically relates to a sparse matrix storage and calculation system and method.

【Background art】

Sparse matrices are common in scientific and engineering computation. Because zero elements make up most of a sparse matrix and contribute nothing to matrix computations, sparse matrices are stored and computed with low efficiency.

Storing sparse matrices and performing matrix-vector multiplication on them have long been a major challenge in computing and microelectronics, especially for in-memory computing. Because in-memory computing operates in situ and with high parallelism, it imposes very strict alignment requirements on where matrix elements are stored. Consequently, under fully parallel operation, the zero elements cannot be removed unless the sparse matrix is mathematically transformed. Moreover, in in-memory computing a zero element is usually not stored as a true 0: it is generally represented by a device in its high-resistance state, and the resistance used to store 0 differs from device to device. At the same time, no semiconductor memory has a conductance of exactly 0. Therefore, the zero elements not only waste storage space but also introduce computation errors and add unnecessary energy consumption and latency. At present, no patent or publication defines a dedicated storage and operation format for sparse matrices on in-memory computing architectures.

【Content of invention】

In view of the above defects of the prior art, the purpose of the present invention is to provide a sparse matrix storage and calculation system and method, aiming to solve the problem that existing sparse matrix storage and matrix-vector multiplication schemes cannot remove zero elements, which waste storage space; as a result, sparse matrix storage and matrix-vector multiplication suffer from a large storage footprint and low computational efficiency.

To achieve the above object, the present invention provides a sparse matrix storage and calculation system, including a first storage array, a second storage array, a first peripheral circuit, a second peripheral circuit, a main processor, an on-chip cache and a block storage scheduling unit;

The first storage array is used to store the coordinate index table of the non-zero elements of the sparse matrix; the second storage array is used to store the elements of the sparse matrix and also serves as the in-situ computing core for sparse matrix multiplication;

The on-chip cache is used to load the index table of the sparse matrix when sparse matrix multiplication is performed, and to transmit the decoding addresses and the strobe switch position selections contained in the index table to the first peripheral circuit and the second peripheral circuit, respectively; it also stores the intermediate operation results and, after all calculation tasks are completed, returns all of them to the main processor;

The block storage scheduling unit is used to partition the sparse matrix into several sub-matrices, store each sub-matrix in the second storage array according to different compression formats, establish index tables corresponding to the remaining sub-matrices, and store them in the first storage array;

The first peripheral circuit is used to decode the received address, read and write the index table in the first storage array, and transmit the index table read for the sparse matrix to the on-chip cache;

The second peripheral circuit is used to convert the vector into voltage signals and to open the corresponding switches according to the strobe switch position selection; the voltage signals are applied through the opened switches to the bit lines or word lines corresponding to the sub-matrix of the sparse matrix, and the intermediate operation results are read out through the word lines or bit lines and stored in the on-chip cache;

The main processor is used for analyzing the type of the sparse matrix; receiving the intermediate operation result; and transferring the received vector to the second peripheral circuit.
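The in-situ multiplication carried out by the second peripheral circuit and the second storage array can be pictured with an idealized software model. This is a sketch under assumptions not stated in the source: each matrix value maps linearly to a conductance (sign handling, device nonlinearity and ADC quantization are ignored), the vector entries are applied as voltages on one set of lines, and the current on each line of the other set is the conductance-weighted sum of those voltages (Ohm's law plus Kirchhoff's current law). The function name crossbar_mvm and the parameter g_off, standing for the residual conductance of a high-resistance cell used to store a zero, are hypothetical.

```python
def crossbar_mvm(matrix, x, g_off=0.0):
    """Idealized crossbar read-out of y = A x (illustrative model only).

    Assumed mapping: a non-zero value a becomes conductance G = a; x[j] is the
    voltage on input line j; output i collects sum_j G[i][j] * x[j].
    A stored zero is modelled as a leaky high-resistance cell of conductance g_off.
    """
    currents = []
    for row in matrix:
        i_out = 0.0
        for a, v in zip(row, x):
            g = a if a != 0 else g_off  # zero element -> residual conductance
            i_out += g * v              # current summation on the output line
        currents.append(i_out)
    return currents


A = [[2, 0], [0, 3]]
x = [1.0, 1.0]
ideal = crossbar_mvm(A, x)             # g_off = 0: exact product
leaky = crossbar_mvm(A, x, g_off=0.1)  # each stored zero adds 0.1 * voltage
```

With g_off = 0.1, each output picks up an extra 0.1 from its single stored zero, giving approximately [2.1, 3.1] instead of [2.0, 3.0]. This is the kind of error the background attributes to stored zeros, and it disappears when zero cells are not placed in the array at all.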

Preferably, the sub-matrices are stored according to different compression formats as follows:

Sub-matrices consisting entirely of zeros are eliminated; in each remaining sub-matrix, the all-zero rows or columns at its beginning and end are also eliminated, so that only the rows or columns containing non-zero elements are stored.
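The elimination rule above can be sketched for a sub-matrix held as a list of rows. The function names trim_submatrix and trim_rows are hypothetical, and the 0-based bounds they return are an assumed stand-in for the entries the block storage scheduling unit writes to the index table.

```python
def trim_submatrix(sub):
    """Column-wise elimination for one sub-matrix (list of equal-length rows).

    Returns None if the sub-matrix is all zeros (it is not stored at all);
    otherwise (first_col, last_col, trimmed) with 0-based inclusive bounds.
    Interior all-zero columns are kept: only leading/trailing ones are removed.
    """
    ncols = len(sub[0]) if sub else 0
    nonzero_cols = [j for j in range(ncols) if any(row[j] != 0 for row in sub)]
    if not nonzero_cols:
        return None
    first, last = nonzero_cols[0], nonzero_cols[-1]
    trimmed = [row[first:last + 1] for row in sub]
    return first, last, trimmed


def trim_rows(sub):
    """Row-wise variant: transpose, trim, transpose back."""
    columns = [list(c) for c in zip(*sub)]
    result = trim_submatrix(columns)
    if result is None:
        return None
    first, last, trimmed = result
    return first, last, [list(r) for r in zip(*trimmed)]
```

For example, trimming [[0, 1, 2, 0], [0, 0, 3, 0]] keeps only columns 1 to 2 (0-based), while an all-zero sub-matrix yields None and is never written to the second storage array.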

Preferably, when a sub-matrix is compressed, the indented row storage format is called directly: the non-zero elements are shifted to the left so that all of them are compressed within their own row for storage.
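A minimal sketch of the left shift, assuming the matrix is a list of rows. The name indent_rows, and the choice to keep a separate per-row list of original column numbers (needed later to pair each packed value with the right vector entry), are illustrative assumptions rather than the layout specified by the patent.

```python
def indent_rows(matrix):
    """Indented row storage: left-shift each row's non-zeros to the row start.

    Returns (packed, col_index, width):
      packed[i]    - non-zeros of row i, in original order, packed to the left
      col_index[i] - original 0-based column number of each packed value
      width        - length of the longest packed row (array columns needed)
    """
    packed, col_index = [], []
    for row in matrix:
        cols = [j for j, a in enumerate(row) if a != 0]
        packed.append([row[j] for j in cols])
        col_index.append(cols)
    width = max((len(p) for p in packed), default=0)
    return packed, col_index, width
```

For [[0, 5, 0, 7], [3, 0, 0, 0], [0, 0, 0, 0]] this gives packed rows [[5, 7], [3], []], column lists [[1, 3], [0], []] and a width of 2, so two array columns suffice instead of four.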

Preferably, the first peripheral circuit includes a read-write circuit, a drive circuit, a digital-to-analog converter, an analog-to-digital converter, and an address decoder;

The second peripheral circuit includes a read-write circuit, a drive circuit, a digital-to-analog converter, an analog-to-digital converter and a strobe switch.

Preferably, the first storage array and the second storage array have a crossbar structure, a transistor-memristor cascade structure, or a single-transistor-multi-memristor cascade structure.

Preferably, the memory devices in the first storage array and the second storage array are memristors, resistive random access memory, phase change memory, spin-transfer torque magnetic random access memory, NOR Flash devices or NAND Flash devices.

In another aspect, the present invention provides a sparse matrix storage and calculation method, comprising the following steps:

Identifying the type of the sparse matrix, partitioning the sparse matrix accordingly, storing the sub-matrices according to different compression formats, and establishing an index table corresponding to each sub-matrix;

Converting the vector into electrical signals when performing sparse matrix-vector multiplication;

Taking each sub-matrix as a unit, decoding in turn according to the addresses in the index table corresponding to each sub-matrix, loading the electrical signals into the sub-matrix, completing the multiply-accumulate operation between the current sub-matrix and the vector, and storing the current intermediate operation result.
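The first step, identifying the matrix type, is not specified in detail in the source. The sketch below shows one plausible rule set, assuming a square matrix stored as a list of rows; the function name classify_sparse_matrix, the max_band threshold and the STRATEGY table are all hypothetical.

```python
def classify_sparse_matrix(matrix, max_band=1):
    """Hypothetical type check for a square matrix given as a list of rows.

    max_band is an assumed threshold: if every non-zero lies within max_band
    of the main diagonal, the matrix is treated as 'diagonal' (banded).
    """
    nz = [(i, j) for i, row in enumerate(matrix) for j, a in enumerate(row) if a != 0]
    if not nz:
        return "zero"
    if all(abs(i - j) <= max_band for i, j in nz):
        return "diagonal"
    if all(j <= i for i, j in nz):
        return "lower_triangular"
    if all(j >= i for i, j in nz):
        return "upper_triangular"
    return "random"


# Assumed mapping from matrix type to the storage approach of the embodiments.
STRATEGY = {
    "diagonal": "row blocks, trim leading/trailing zero columns",
    "lower_triangular": "row blocks, trim leading/trailing zero columns",
    "upper_triangular": "row blocks, trim leading/trailing zero columns",
    "random": "indented row format, row-by-row computation",
    "zero": "nothing stored",
}
```

A diagonal matrix also passes both triangular tests, so the band test is checked first; the threshold would in practice depend on the array size and the block parameters chosen.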

Preferably, the sub-matrix supports the indented row storage format called directly, in which the non-zero elements are shifted to the left so that all of them are compressed within their own row for storage.

Generally speaking, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:

The storage array in the sparse matrix storage and calculation system provided by the present invention consists of two parts, the first storage array and the second storage array. The first storage array stores the coordinate index table of the non-zero elements of the sparse matrix; the second storage array stores the elements of the sparse matrix and also serves as the in-situ computing core for sparse matrix multiplication. This storage arrangement can effectively improve the storage efficiency of sparse matrix-vector multiplication in in-memory computing while ensuring the reliability of the computation.

In the present invention, the block storage scheduling unit partitions the sparse matrix into several sub-matrices, removes the zero elements in the sub-matrices, stores each sub-matrix in the second storage array according to different compression formats, and establishes the corresponding index tables and stores them in the first storage array. A sparse matrix contains many zero elements, which not only waste storage space but also add unnecessary energy consumption and latency during computation. By removing these zeros, the block storage scheduling unit improves storage efficiency while retaining the parallelism of in-memory matrix-vector multiplication. The improvement in compression efficiency is particularly pronounced for diagonal and triangular matrices.

【Description of drawings】

Fig. 1 is a schematic structural diagram of a sparse matrix storage and calculation system provided by an embodiment of the present invention;

Fig. 2 is a schematic diagram of the storage and operation format of the diagonal sparse matrix provided by Embodiment 1 of the present invention;

Fig. 3 is a schematic diagram of the storage and operation format of the triangular sparse matrix provided by Embodiment 2 of the present invention;

Fig. 4 is a schematic diagram of the storage and operation format of the random sparse matrix provided by Embodiment 3 of the present invention.

【Detailed description】

In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

In one aspect, as shown in Fig. 1, the present invention provides a sparse matrix storage and calculation system, including a first storage array 3-1, a second storage array 3-3, a first peripheral circuit 3-2, a second peripheral circuit 3-4, a main processor 1, an on-chip cache 4 and a block storage scheduling unit 2;

The first storage array 3-1 is used to store the coordinate index table of the non-zero elements of the sparse matrix; the second storage array 3-3 is used to store the elements of the sparse matrix and also serves as the in-situ computing core for sparse matrix multiplication;

The on-chip cache 4 is used to load the index table of the sparse matrix when sparse matrix multiplication is performed, and to transmit the decoding addresses and the strobe switch position selections contained in the index table to the first peripheral circuit 3-2 and the second peripheral circuit 3-4, respectively; it also stores the intermediate calculation results and, after all calculation tasks are completed, returns all of them to the main processor;

The block storage scheduling unit 2 is used to partition the sparse matrix into several sub-matrices, store each sub-matrix in the second storage array according to different compression formats, establish index tables corresponding to the remaining sub-matrices, and store them in the first storage array;

The first peripheral circuit 3-2 is used to decode the received address, read and write the index table in the first storage array, and transmit the index table read for the sparse matrix to the on-chip cache;

The second peripheral circuit 3-4 is used to convert the vector into voltage signals and to open the corresponding switches according to the strobe switch position selection; the voltage signals are applied through the opened switches to the bit lines or word lines corresponding to the sub-matrix of the sparse matrix, and the intermediate operation results are read out through the word lines or bit lines and stored in the on-chip cache;

The main processor 1 is used for analyzing the type of the sparse matrix; receiving intermediate operation results; and transferring the received vectors to the second peripheral circuit.

Preferably, the first storage array 3-1 and the second storage array 3-3 have a crossbar structure, a transistor-memristor cascade structure, or a single-transistor-multi-memristor cascade structure.

Preferably, the memory devices in the first storage array 3-1 and the second storage array 3-3 are memristors, resistive random access memory, phase change memory, spin-transfer torque magnetic random access memory, NOR Flash devices or NAND Flash devices.

Preferably, the sub-matrices obtained support the indented row storage format called directly, in which the non-zero elements are shifted to the left so that all of them are compressed within their own row for storage.

Example 1

As shown in Figure 2, when the sparse matrix to be processed is an n×n diagonal matrix 6, the block parameters are first determined according to actual needs. Assuming the computation is divided into two blocks, the block algorithm 7 for diagonal matrices is called to divide the matrix into upper and lower sub-matrices;

All-zero columns are eliminated and only the columns containing non-zero elements are kept, yielding the first sub-matrix 7-1 and the second sub-matrix 7-2;

The first sub-matrix 7-1 and the second sub-matrix 7-2 are stored in the second storage array 3-3, and the corresponding indexes are established and stored in the first storage array. Specifically, in this embodiment the first sub-matrix 7-1 occupies columns 1 to n/2+1 and the second sub-matrix 7-2 occupies columns n/2 to n, and the column information is stored in the second storage array 3-3;

When the sparse matrix needs to be multiplied by a vector, the vector is sent from the main processor to the second peripheral circuit 3-4 and converted into voltage signals;

Load the index table from the first storage array to the on-chip cache 4;

In the first cycle, the addresses corresponding to the first sub-matrix 7-1, that is, the addresses of columns 1 to n/2+1, are first transferred from the on-chip cache to the block storage scheduling unit 2;

According to the address information, the switches corresponding to the first sub-matrix 7-1 in the second peripheral circuit 3-4 are opened; the first part 9-1 of the vector voltage signals enters the second storage array and completes the first matrix-vector multiplication, yielding part 10-1 of the intermediate result vector Y, which is stored in the on-chip cache 4;

The second matrix-vector multiplication is then performed. Because the index table has already been loaded into the on-chip cache 4, the addresses corresponding to the second sub-matrix 7-2, that is, the addresses of columns n/2 to n, are sent to the second peripheral circuit 3-4; the switches in the second peripheral circuit 3-4 connect to the second sub-matrix 7-2, and the other part 9-2 of the vector voltage signals enters the second storage array to complete the second matrix-vector multiplication, yielding the other part 10-2 of the intermediate result vector Y, which is stored in the on-chip cache 4;

The part 10-1 and the other part 10-2 of the intermediate result vector Y are returned together to the main processor, completing one round of sparse matrix-vector multiplication.
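The two-cycle flow of Example 1 can be mimicked in plain Python, with exact integer arithmetic standing in for the analog array. Assumptions: the "diagonal matrix" is taken to be tridiagonal (non-zeros where |i - j| <= 1), which is consistent with the column spans 1 to n/2+1 and n/2 to n given above; the matrix values and the function names tridiagonal, build_blocks and blocked_mvm are hypothetical; and the y list plays the role of the intermediate results held in the on-chip cache 4.

```python
def tridiagonal(n):
    """Hypothetical n x n banded test matrix with non-zeros where |i - j| <= 1."""
    return [[i + j + 1 if abs(i - j) <= 1 else 0 for j in range(n)] for i in range(n)]


def build_blocks(matrix, num_blocks):
    """Split rows into num_blocks bands and keep each band's non-zero column span.

    Index entries are (row_start, row_end, col_start, col_end), 0-based and
    end-exclusive; all-zero bands are skipped (not stored).
    """
    n = len(matrix)
    bounds = [n * b // num_blocks for b in range(num_blocks + 1)]
    index_table, blocks = [], []
    for r0, r1 in zip(bounds, bounds[1:]):
        band = matrix[r0:r1]
        cols = [j for j in range(len(matrix[0])) if any(row[j] != 0 for row in band)]
        if not cols:
            continue
        c0, c1 = cols[0], cols[-1] + 1
        index_table.append((r0, r1, c0, c1))
        blocks.append([row[c0:c1] for row in band])
    return index_table, blocks


def blocked_mvm(index_table, blocks, x, n_rows):
    """One cycle per block: drive x[c0:c1] onto the block, keep partial y."""
    y = [0] * n_rows  # stands in for intermediate results in the on-chip cache
    for (r0, r1, c0, c1), block in zip(index_table, blocks):
        x_part = x[c0:c1]  # vector segment matching the block's stored columns
        for k, row in enumerate(block):
            y[r0 + k] += sum(a * v for a, v in zip(row, x_part))
    return y


n = 8
A = tridiagonal(n)
index_table, blocks = build_blocks(A, 2)
x = list(range(1, n + 1))
y = blocked_mvm(index_table, blocks, x, n)
```

For n = 8 the index table is [(0, 4, 0, 5), (4, 8, 3, 8)], i.e. columns 1 to 5 (n/2+1) and 4 (n/2) to 8 in the 1-based numbering of the text, and y equals the dense product A x.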

In the same manner, the sparse matrix can be divided into finer blocks, for example into four blocks (8-1, 8-2, 8-3 and 8-4), with the vector divided correspondingly into 9-3, 9-4, 9-5 and 9-6; four operations are then performed, but fewer zero elements are stored.
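The statement that finer blocking stores fewer zeros can be checked numerically. This sketch assumes the same tridiagonal reading of the "diagonal matrix", equal row bands, and a simple cost model of one array cell per kept element; stored_cells is a hypothetical name.

```python
def stored_cells(matrix, num_blocks):
    """Cells occupied after splitting rows into num_blocks equal bands and
    keeping only each band's first-to-last non-zero column span
    (assumed cost model: one array cell per kept element)."""
    n = len(matrix)
    bounds = [n * b // num_blocks for b in range(num_blocks + 1)]
    total = 0
    for r0, r1 in zip(bounds, bounds[1:]):
        band = matrix[r0:r1]
        cols = [j for j in range(len(matrix[0])) if any(row[j] != 0 for row in band)]
        if cols:
            total += (r1 - r0) * (cols[-1] - cols[0] + 1)
    return total


n = 8
A = [[1 if abs(i - j) <= 1 else 0 for j in range(n)] for i in range(n)]  # tridiagonal
nnz = sum(a != 0 for row in A for a in row)
stored_zeros = {k: stored_cells(A, k) - nnz for k in (1, 2, 4)}
print(stored_zeros)  # {1: 42, 2: 18, 4: 6}
```

Of the 64 cells of the unblocked 8×8 matrix, 42 hold zeros; two blocks store 18 zeros and four blocks store 6, at the cost of more compute cycles.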

Example 2

As shown in Figure 3, when the sparse matrix to be processed is an n×n triangular matrix 11, the block parameters are first determined according to actual needs. Assuming the computation is divided into two blocks, the block algorithm 12 for triangular matrices is called to divide the matrix into upper and lower sub-matrices;

All-zero columns are eliminated and only the columns containing non-zero elements are kept, yielding the first sub-matrix 12-1 and the second sub-matrix 12-2;

The first sub-matrix 12-1 and the second sub-matrix 12-2 are stored in the second storage array 3-3, and the corresponding indexes are established and stored in the first storage array. Specifically, in this embodiment the first sub-matrix 12-1 occupies columns 1 to n/2 and the second sub-matrix 12-2 occupies columns 1 to n, and the column information is stored in the second storage array 3-3;

Load the index table from the first storage array to the on-chip cache 4;

In the first cycle, the addresses corresponding to the first sub-matrix 12-1, that is, the addresses of columns 1 to n/2, are first transferred from the on-chip cache to the block storage scheduling unit 2;

According to the address information, the switches corresponding to the first sub-matrix 12-1 in the second peripheral circuit 3-4 are opened; the first part 9-1 of the vector voltage signals enters the second storage array and completes the first matrix-vector multiplication, yielding part 10-1 of the intermediate result vector Y, which is stored in the on-chip cache 4;

The second matrix-vector multiplication is then performed. Because the index table has already been loaded into the on-chip cache 4, the addresses corresponding to the second sub-matrix 12-2, that is, the addresses of columns 1 to n, are sent to the second peripheral circuit 3-4; the switches in the second peripheral circuit 3-4 connect to the second sub-matrix 12-2, and the other part 9-2 of the vector voltage signals enters the second storage array to complete the second matrix-vector multiplication, yielding the other part 10-2 of the intermediate result vector Y, which is stored in the on-chip cache 4;
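Example 2 follows the same flow. The sketch below assumes a lower triangular matrix: the figure is not reproduced here, but the column spans 1 to n/2 and 1 to n given for the two sub-matrices fit a lower triangular matrix split into upper and lower halves. The names lower_triangular and blocked_spmv are hypothetical, and the matrix values are arbitrary.

```python
def lower_triangular(n):
    """Hypothetical n x n lower triangular test matrix (non-zeros where j <= i)."""
    return [[i + j + 1 if j <= i else 0 for j in range(n)] for i in range(n)]


def blocked_spmv(matrix, x, num_blocks):
    """Row-band blocking with leading/trailing zero-column trimming, then one
    multiply-accumulate pass per stored block. Returns (index_table, y), with
    0-based, end-exclusive entries (row_start, row_end, col_start, col_end)."""
    n = len(matrix)
    bounds = [n * b // num_blocks for b in range(num_blocks + 1)]
    index_table, y = [], [0] * n
    for r0, r1 in zip(bounds, bounds[1:]):
        band = matrix[r0:r1]
        cols = [j for j in range(len(matrix[0])) if any(row[j] != 0 for row in band)]
        if not cols:
            continue  # all-zero band: nothing stored, nothing computed
        c0, c1 = cols[0], cols[-1] + 1
        index_table.append((r0, r1, c0, c1))
        for k, row in enumerate(band):
            # only the stored span row[c0:c1] meets the vector segment x[c0:c1]
            y[r0 + k] += sum(a * v for a, v in zip(row[c0:c1], x[c0:c1]))
    return index_table, y


n = 8
A = lower_triangular(n)
x = list(range(1, n + 1))
index_table, y = blocked_spmv(A, x, 2)
```

For n = 8 the index table is [(0, 4, 0, 4), (4, 8, 0, 8)], i.e. columns 1 to 4 (n/2) and 1 to 8 (n) in 1-based numbering, and y matches the dense product.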

In the same manner, the sparse matrix can be divided into finer blocks, for example into four blocks (13-1, 13-2, 13-3 and 13-4); four operations are then performed, but fewer zero elements are stored.
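For a triangular matrix the saving from finer blocking can be worked out in closed form. Assume an n×n lower triangular matrix whose entries on and below the diagonal are all non-zero (an assumption; the embodiment does not fix the values), split into k equal row bands with k dividing n, where each band keeps the column span from its first to its last non-zero column. Band b (b = 1, ..., k) then keeps n/k rows by bn/k columns, so the number of stored zeros Z(k) is:

```latex
Z(k) = \sum_{b=1}^{k} \frac{n}{k}\cdot\frac{bn}{k} - \frac{n(n+1)}{2}
     = \frac{n^2 (k+1)}{2k} - \frac{n(n+1)}{2}
     = \frac{n(n-k)}{2k}
```

For n = 8 this gives Z(1) = 28, Z(2) = 12 and Z(4) = 4. When n is much larger than k, doubling the number of blocks roughly halves the stored zeros while doubling the number of operations.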

Example 3

As shown in Figure 4, when the matrix to be processed is an n×n random sparse matrix 15, the conventional indented row storage format is applied first, concentrating all the non-zero elements of each row at the beginning of that row, as shown in 15-1;

An index table 16 is built and stored in the storage area of the first storage array;

When it is necessary to perform matrix-vector multiplication, the vector is sent from the main processor to the second peripheral circuit, and the second peripheral circuit converts the vector into a voltage signal;

The index table is loaded from the storage area into the on-chip cache 4. Because the elements of each row are not aligned by column after indentation, the computation must be performed row by row. The header of the index table is the row number, and the column numbers of that row's elements are stored as the elements of a linked list. During computation, the linked lists of the index table are therefore loaded one at a time and converted into addresses in the indented sparse matrix 15-1; the corresponding switches are turned on, the vector multiplication for that row is performed, and the result of each operation is stored in the on-chip cache 4. When a complete matrix-vector multiplication has finished, the result is returned to the main processor.
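The row-by-row computation of Example 3 can be sketched as follows. Assumptions: a Python list stands in for the linked list attached to each row header of index table 16, the list of packed rows stands in for the indented matrix 15-1, exact arithmetic replaces the analog array, and the function names build_index_table, indent and row_by_row_mvm are hypothetical.

```python
def build_index_table(matrix):
    """Row number -> column numbers of that row's non-zeros.

    A Python list stands in for the linked list under each row header.
    """
    return {i: [j for j, a in enumerate(row) if a != 0] for i, row in enumerate(matrix)}


def indent(matrix):
    """Indented rows: each row's non-zeros packed at the start of the row."""
    return [[a for a in row if a != 0] for row in matrix]


def row_by_row_mvm(packed, index_table, x):
    """Rows are not column-aligned after indenting, so compute one row at a time."""
    y = [0] * len(packed)  # stands in for results kept in the on-chip cache
    for i, cols in index_table.items():
        # pair row i's packed values with the vector entries at their original columns
        y[i] = sum(a * x[j] for a, j in zip(packed[i], cols))
    return y


A = [[0, 2, 0, 1],
     [3, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 4, 5, 0]]
x = [1, 2, 3, 4]
y = row_by_row_mvm(indent(A), build_index_table(A), x)  # [8, 3, 0, 23]
```

The empty third row contributes nothing and occupies no array cells; its linked list is simply empty.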

In summary, the present invention has the following advantages:

Those skilled in the art will readily understand that the above descriptions are only preferred embodiments of the present invention and are not intended to limit it. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A sparse matrix storage and calculation system, characterized in that it includes a first storage array, a second storage array, a first peripheral circuit, a second peripheral circuit, a main processor, an on-chip cache and a block storage scheduling unit, which are pairwise connected;

The first storage array is used to store the coordinate index table of the non-zero elements of the sparse matrix; the second storage array is used to store the elements of the sparse matrix and also serves as an in-situ computing core for sparse matrix multiplication;

The on-chip cache is used to load the index table of the sparse matrix when sparse matrix multiplication is performed, and to transmit the decoding addresses and the strobe switch position selections contained in the index table to the first peripheral circuit and the second peripheral circuit, respectively; it also stores the intermediate calculation results and, after all calculation tasks are completed, returns all of them to the main processor;

The block storage scheduling unit is used to partition the sparse matrix into several sub-matrices, store each sub-matrix in the second storage array according to different compression formats, establish index tables corresponding to the remaining sub-matrices, and store them in the first storage array;

The first peripheral circuit is used to decode the received address, read and write the index table in the first storage array, and transmit the index table read for the sparse matrix to the on-chip cache;

The second peripheral circuit is used to convert the vector into voltage signals and to open the corresponding switches according to the strobe switch position selection; the voltage signals are applied through the opened switches to the bit lines or word lines corresponding to the sub-matrix of the sparse matrix, and the intermediate operation results are read out through the word lines or bit lines and stored in the on-chip cache;

The main processor is used for analyzing the type of the sparse matrix; receiving the intermediate operation result; and transferring the received vector to the second peripheral circuit.

2. The sparse matrix storage and calculation system according to claim 1, wherein the method for storing the sub-matrices according to different compression formats is as follows:

Sub-matrices consisting entirely of zeros are eliminated; in each remaining sub-matrix, the all-zero rows or columns at its beginning and end are also eliminated, so that only the rows or columns containing non-zero elements are stored.

3. The sparse matrix storage and calculation system according to claim 1 or 2, wherein, when a sub-matrix is compressed, the indented row storage format is called directly: the non-zero elements are shifted to the left so that all of them are compressed within their own row for storage.

4. The sparse matrix storage and calculation system according to claim 1, wherein the first storage array and the second storage array have a crossbar structure, a transistor-memristor cascade structure, or a single-transistor-multi-memristor cascade structure.

5. The sparse matrix storage and calculation system according to claim 1 or 4, wherein the memory devices in the first storage array and the second storage array are memristors, resistive random access memory, phase change memory, spin-transfer torque magnetic random access memory, NOR Flash devices or NAND Flash devices.

6. The sparse matrix storage and calculation system according to claim 5, wherein the first peripheral circuit includes a read-write circuit, a drive circuit, a digital-to-analog converter, an analog-to-digital converter and an address decoder;

The second peripheral circuit includes a read-write circuit, a drive circuit, a digital-to-analog converter, an analog-to-digital converter and a strobe switch.

7. A sparse matrix storage and calculation method, characterized in that it comprises the following steps:

Identifying the type of the sparse matrix, partitioning the sparse matrix accordingly, storing the sub-matrices according to different compression formats, and establishing an index table corresponding to each sub-matrix;

Converting the vector into electrical signals when performing sparse matrix-vector multiplication;

Taking each sub-matrix as a unit, decoding in turn according to the addresses in the index table corresponding to each sub-matrix, loading the electrical signals into the sub-matrix, completing the multiply-accumulate operation between the current sub-matrix and the vector, and storing the current intermediate operation result.

8. The sparse matrix storage and calculation method according to claim 7, wherein the method for storing the sub-matrices according to different compression formats is as follows:

Sub-matrices consisting entirely of zeros are eliminated; in each remaining sub-matrix, the all-zero rows or columns at its beginning and end are also eliminated, so that only the rows or columns containing non-zero elements are stored.

9. The sparse matrix storage and calculation method according to claim 7 or 8, characterized in that the sub-matrix supports the indented row storage format called directly, in which the non-zero elements are shifted to the left so that all of them are compressed within their own row for storage.