CN116257209A - Compressed storage of sparse matrix and parallel processing method of vector multiplication thereof - Google Patents


Info

Publication number
CN116257209A
Authority
CN
China
Prior art keywords
row
parallel
matrix
array
sparse matrix
Prior art date
Legal status
Pending
Application number
CN202111509788.5A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee
Muxi Integrated Circuit Shanghai Co ltd
Original Assignee
Muxi Integrated Circuit Shanghai Co ltd
Priority date
Filing date
Publication date
Application filed by Muxi Integrated Circuit Shanghai Co., Ltd.
Priority to CN202111509788.5A
Publication of CN116257209A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/52 Multiplying; Dividing
    • G06F 7/523 Multiplying only
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources to service a request
    • G06F 9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/20 Processor architectures; Processor configuration, e.g. pipelining
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application discloses a compressed storage method for a sparse matrix: the non-zero elements in each row of the sparse matrix are shifted leftwards in turn, in left-to-right order, so that the non-zero elements of each row are packed contiguously at the left, forming a first intermediate matrix; the rows of the first intermediate matrix are sorted in descending order of their non-zero counts to form a second intermediate matrix; and the non-zero portion of the second intermediate matrix is extracted to obtain the compressed matrix of the sparse matrix. Building on this compressed matrix, the application also provides a parallel processing method for sparse matrix vector multiplication, which adaptively groups the rows of the compressed matrix into computation-task groups and generates at least one parallel computing task to multiply each row of the at least one row vector group with the target vector, reducing the latency of computing the rows of the compressed matrix in parallel and improving the parallel processing performance of sparse matrix vector multiplication.

Description

Compressed storage of sparse matrix and parallel processing method of vector multiplication thereof
Technical Field
The application relates to the field of high-performance computing, and in particular to compressed storage of sparse matrices and a parallel processing method for their vector multiplication.
Background
A sparse matrix is a matrix in which zero elements far outnumber non-zero elements and the non-zero elements are irregularly distributed. Sparse matrix vector multiplication (SpMV) is widely used in high-performance computing and is one of the most time-consuming operations in various sparse matrix iterative solvers, so optimizing its performance is of practical demand and significance. SpMV is typically performed on top of a compressed storage format of the sparse matrix, and its computational performance is affected both by the data storage format chosen for the matrix and by the manner of parallel processing in the processor.
Common compressed storage formats for sparse matrices are the coordinate format (COO), compressed sparse row (CSR), ELL, diagonal (DIA), and the hybrid format (HYB) combining COO and ELL. In these formats, memory accesses to the matrix's non-zero elements and their indexes are not coalesced, so access efficiency is low; or a large number of zero elements must be padded so that the per-row non-zero counts after compression are balanced, wasting considerable computation and storage resources. These problems constrain the parallel processing performance of operations such as sparse matrix vector multiplication on many-core processors such as GPUs.
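For contrast, the row-major access pattern of the CSR format mentioned above can be sketched in a few lines of Python (array names are illustrative). The baseline makes the uncoalesced-access problem concrete: adjacent GPU threads assigned to adjacent rows would read non-adjacent stretches of the value and index arrays.

```python
def spmv_csr(vals, col, row_ptr, x):
    """Baseline CSR sparse matrix-vector multiply.

    vals/col hold the non-zeros and their column indexes row by row;
    row_ptr[r] is the offset where row r begins. Note that row r and
    row r+1 occupy disjoint, non-interleaved ranges of vals and col,
    which is why one-thread-per-row GPU mappings access memory in an
    uncoalesced way for this format.
    """
    y = []
    for r in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            s += vals[k] * x[col[k]]
        y.append(s)
    return y
```

For the 2x3 matrix [[1, 0, 2], [0, 3, 0]], the CSR arrays are vals = [1, 2, 3], col = [0, 2, 1], row_ptr = [0, 2, 3].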
Disclosure of Invention
To solve these problems, the application provides a compressed storage method for sparse matrices and a parallel processing method for their vector multiplication, which can significantly improve the efficiency of compressed storage of sparse matrices, save computation and storage resources, and improve the parallel processing performance of sparse matrix vector multiplication.
In a first aspect, an embodiment of the present application provides a compressed storage method for a sparse matrix, including:
shifting the non-zero elements in each row of the sparse matrix leftwards in turn, in left-to-right order, so that the non-zero elements of each row are packed contiguously at the left, to obtain a first intermediate matrix;
sorting the rows of the first intermediate matrix in descending order of their non-zero element counts to obtain a second intermediate matrix, and extracting the non-zero portion of the second intermediate matrix to obtain the compressed matrix of the sparse matrix;
storing each column of non-zero elements of the compressed matrix, top to bottom, into a first array in left-to-right column order, correspondingly storing the column indexes of those non-zero elements in the sparse matrix into a second array, and storing the number of non-zero elements in each column into a third array; and storing the number of non-zero elements in each row of the compressed matrix contiguously into a fourth array, in top-to-bottom row order, and storing the row index in the sparse matrix of each row of the compressed matrix into a fifth array.
In an alternative embodiment, the first, second, third, fourth and fifth arrays form a five-tuple characterizing the storage characteristics of the compressed matrix.
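The three steps of the first aspect might be modeled in plain Python as sketched below. The function name and implementation details are illustrative; the five returned arrays correspond to the A[], COL[], CNT[], RNNZ[] and IPERM[] arrays described in the embodiments.

```python
def compress(A):
    """Build the five-array compressed format from a dense matrix A
    (list of equal-length rows). Returns (vals, col, cnt, rnnz, iperm)."""
    m, n = len(A), len(A[0])
    # Step 1: collect each row's non-zeros left-to-right (the left shift)
    rows = [[(A[i][j], j) for j in range(n) if A[i][j] != 0] for i in range(m)]
    # Step 2: sort rows by descending non-zero count, remembering the
    # original row index of each compressed row (the fifth array)
    order = sorted(range(m), key=lambda i: -len(rows[i]))
    iperm = order
    rnnz = [len(rows[i]) for i in order]          # per-row non-zero counts
    # Step 3: store column by column (left to right), top to bottom
    vals, col, cnt = [], [], []
    width = rnnz[0] if rnnz else 0                # widest row sets the width
    for c in range(width):
        count = 0
        for i in order:
            if c < len(rows[i]):                  # row i has a column-c entry
                v, j = rows[i][c]
                vals.append(v)                    # non-zero element
                col.append(j)                     # its column in the original
                count += 1
        cnt.append(count)                         # per-column non-zero count
    return vals, col, cnt, rnnz, iperm
```

For example, compress([[1, 0, 2], [0, 0, 3], [4, 5, 6]]) places the three-element row first and yields vals = [4, 1, 3, 5, 2, 6] with iperm = [2, 0, 1].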
In a second aspect, an embodiment of the present application further proposes a parallel processing method for sparse matrix vector multiplication, including:
step S710, compressing and storing a sparse matrix according to the compressed storage method described in the foregoing embodiment, to obtain a compressed matrix of the sparse matrix and the first, second, third, fourth and fifth arrays that characterize the storage characteristics of the compressed matrix;
step S720, adaptively grouping the rows of the compressed matrix into computation-task groups, in row order, to obtain at least one row vector group, and obtaining the number of per-row parallel threads, the total number of parallel threads and the row start address index required for each row vector group to execute parallel computation;
step S730, generating at least one parallel computing task, according to the number of per-row parallel threads, the total number of parallel threads and the row start address index required for executing parallel computation, to perform the multiplication of each row of the at least one row vector group with the target vector.
In an alternative embodiment, the step S720 includes:
step S7210, pre-grouping the rows of the compressed matrix, in row order and by a preset number of rows per group, to obtain at least one row group;
step S7220, calculating the maximum and minimum per-row non-zero element counts of each of the at least one row group, and determining the number of per-row parallel threads required for each row group to execute parallel computation according to the ratio of the maximum to the minimum count;
step S7230, merging row groups that require the same number of per-row parallel threads into one row vector group, to obtain the at least one row vector group.
In an alternative embodiment, determining the number of per-row parallel threads required for each row group to execute parallel computation according to the ratio of the maximum to the minimum per-row non-zero element count includes:
determining the number of per-row parallel threads required for each row group according to whether the ratio of the maximum to the minimum per-row non-zero element count falls within a preset ratio range.
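A minimal sketch of the adaptive grouping of steps S7210 to S7230, in Python. The group size, ratio threshold and thread counts below are illustrative assumptions; the application leaves these values as presets.

```python
def group_rows(rnnz, group_size=4, ratio_threshold=2.0):
    """Pre-group rows of the compressed matrix, pick a threads-per-row
    count for each group from its max/min non-zero ratio, then merge
    consecutive groups with equal thread counts into row vector groups.

    rnnz is the fourth array: per-row non-zero counts of the compressed
    matrix (already sorted in descending order by the storage method).
    """
    # Step S7210: fixed-size pre-groups in row order
    groups = [rnnz[i:i + group_size] for i in range(0, len(rnnz), group_size)]
    tagged = []
    for g in groups:
        hi, lo = max(g), min(g)
        # Step S7220: balanced groups get one thread per row; skewed
        # groups get more threads per row (values are illustrative)
        threads = 1 if lo and hi / lo <= ratio_threshold else 4
        tagged.append((threads, g))
    # Step S7230: merge consecutive groups with the same thread count
    merged = []
    for threads, g in tagged:
        if merged and merged[-1][0] == threads:
            merged[-1][1].extend(g)
        else:
            merged.append((threads, list(g)))
    return merged
```

With rnnz = [9, 8, 8, 7, 8, 1, 2, 1], the first pre-group is balanced (ratio about 1.3) and keeps one thread per row, while the second is skewed (ratio 8) and gets four, so two row vector groups result.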
In an alternative embodiment, the step S730 includes:
step S7310, in each parallel computing task, allocating at least one parallel thread to each row of each row vector group, according to the number of per-row parallel threads, the total number of parallel threads and the row start address index required for that row vector group to execute parallel computation, and determining, for each of the at least one parallel thread, its first column index within the row and the row address index of its row;
step S7320, in one cycle of each parallel thread, reading a non-zero element from the first array, and its column index in the sparse matrix (the second column index) from the second array, according to the row address index, the first column index and the third array, and multiplying the non-zero element by the multiplier at the position of the target vector corresponding to the second column index;
step S7330, in the next cycle of each parallel thread, shifting each parallel thread rightwards within its row by a predetermined number of columns to obtain its new first column index, and repeating step S7320 until the first column index exceeds the number of non-zero elements in the row;
step S7340, in one of the at least one parallel thread, summing the completed multiplication results of the at least one parallel thread.
In an alternative embodiment, the step S7340 further includes: restoring the summed results to the original result order of the sparse matrix vector multiplication, according to the row indexes in the sparse matrix of the rows of the compressed matrix, as stored in the fifth array.
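Restoring the original row order with the fifth array amounts to a scatter, sketched below in Python with illustrative names.

```python
def restore_order(y_compressed, iperm):
    """Scatter per-row results computed in compressed-row order back to
    the original row order of the sparse matrix. iperm is the fifth
    array: iperm[r] is the original row index of compressed row r."""
    y = [0.0] * len(iperm)
    for r, orig_row in enumerate(iperm):
        y[orig_row] = y_compressed[r]
    return y
```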
In an alternative embodiment, the step S7330 includes: when the first column index of a parallel thread exceeds the number of non-zero elements in its row, storing the multiplication results completed by that parallel thread into a shared memory.
In an alternative embodiment, the step S7340 further includes: in one of the at least one parallel thread, reading the completed multiplication results of each parallel thread from the shared memory, summing them, and writing the sum to memory.
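The strided per-row computation of steps S7310 to S7340 can be simulated sequentially in Python. The helper below stands in for the parallel threads assigned to one row (the thread count, argument names and the col_start computation are illustrative assumptions); it relies on the rows of the compressed matrix being sorted by descending non-zero count, so the element of the row at slot r in column c sits at offset r within that column's segment of the first array.

```python
def spmv_row(vals, col, cnt, x, row_slot, row_nnz, threads=2):
    """Sequentially simulate the threads assigned to one compressed row.

    Each of `threads` pseudo-threads starts at a different first column
    index and strides right by `threads` columns, accumulating partial
    products that are then summed (the shared-memory reduction).
    row_slot is the row's top-to-bottom position in the compressed
    matrix; row_nnz is its non-zero count (from the fourth array).
    """
    # Offset where column c's segment starts in vals: prefix sums of cnt
    col_start = [0]
    for c in cnt[:-1]:
        col_start.append(col_start[-1] + c)
    partial = [0.0] * threads             # per-thread accumulators
    for t in range(threads):
        c = t                             # this thread's first column index
        while c < row_nnz:                # stop once the index leaves the row
            k = col_start[c] + row_slot   # this row's element in column c
            partial[t] += vals[k] * x[col[k]]
            c += threads                  # shift right by the stride
    return sum(partial)                   # reduction by one thread
```

With the arrays vals = [4, 1, 3, 5, 2, 6], col = [0, 0, 2, 1, 2, 2], cnt = [3, 2, 1] (a 3-row compressed matrix) and x = [1, 2, 3], the row at slot 0 evaluates to 4*1 + 5*2 + 6*3 = 32.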
In an alternative embodiment, generating at least one parallel computing task in step S730 includes:
determining the number of thread blocks to be launched by at least one GPU task, and the number of parallel threads launched by each thread block, according to the number of per-row parallel threads and the total number of parallel threads required for each row vector group to execute parallel computation;
and generating the at least one GPU task according to the number of thread blocks to be launched by the at least one GPU task and the number of parallel threads launched by each thread block.
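The thread-block count for a GPU task might be derived by ceiling division over the total thread count, as sketched below; the block size of 256 is an assumption for illustration, not a value fixed by the application.

```python
def launch_config(total_threads, threads_per_block=256):
    """Return (blocks, threads_per_block) for a task needing
    total_threads parallel threads, rounding the block count up so
    every thread is covered."""
    blocks = (total_threads + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block
```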
Compared with the prior art, the application has the following beneficial effects. On the one hand, the compressed storage method for the sparse matrix enables fully coalesced memory access to the stored element array and column index array when computations such as sparse matrix vector multiplication are executed, improving the data-processing efficiency of sparse matrix vector operations on many-core processors such as GPUs; moreover, storing the non-zero elements column by column requires no padding with large numbers of zero elements to equalize rows or columns with unbalanced non-zero counts, which saves memory resources. On the other hand, building on the advantages of this compressed storage method, adaptively allocating the number of parallel threads executed for each row of the compressed matrix makes full use of the thread-parallel computing capability of many-core processors such as GPUs when executing sparse matrix vector computations, improves the performance of thread-parallel computation over rows with unbalanced non-zero counts, and markedly reduces the latency of computing the rows in parallel.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the following description will briefly explain the drawings that are required to be used in the embodiments of the present application. It is appreciated that the following drawings depict only certain embodiments of the application and are not to be considered limiting of its scope.
FIG. 1 is a flow chart of a compressed storage method of sparse matrix according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an exemplary sparse matrix according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an exemplary sparse matrix transformed intermediate matrix according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an exemplary sparse matrix transformed compressed matrix according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a compressed storage format of an exemplary sparse matrix according to an embodiment of the present application;
FIG. 6 is another schematic diagram of a compressed storage format of an exemplary sparse matrix according to an embodiment of the present application;
FIG. 7 is a flow diagram of a parallel processing method of sparse matrix vector multiplication according to an embodiment of the present application;
FIG. 8 is a flow diagram of a parallel processing method of sparse matrix vector multiplication according to another embodiment of the present application;
FIG. 9 is a flow diagram of a parallel processing method of sparse matrix vector multiplication according to another embodiment of the present application;
FIG. 10 is a schematic diagram of parallel thread processing for sparse matrix vector multiplication according to one embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clear, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present application. It should be understood, however, that the described embodiments are only some, but not all, of the exemplary embodiments of the present application and, therefore, the following detailed description of the embodiments of the present application is not intended to limit the scope of the application as claimed. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and in the claims of this application are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order, and are not to be construed as indicating or implying relative importance.
As described above, existing compressed storage formats for sparse matrices often fail to meet the parallel-processing performance requirements that high-performance computing places on sparse matrix vector multiplication. The application therefore first provides a compressed storage method for sparse matrices and, building on it, a parallel processing method for sparse matrix vector multiplication, improving the parallel processing performance of sparse matrix vector multiplication on many-core processors and meeting the demands of high-performance computing.
Example 1
Fig. 1 is a flow chart of a compressed storage method of a sparse matrix according to an embodiment of the present application. As shown in fig. 1, the compressed storage method of the sparse matrix provided by the application includes the following steps:
step S110, shifting the non-zero elements in each row of the sparse matrix leftwards in turn, in left-to-right order, so that the non-zero elements of each row are packed contiguously at the left, obtaining a first intermediate matrix.
As shown in Fig. 2, assume the original sparse matrix A_s is an m×n two-dimensional matrix, with m and n each an integer greater than 1, in which the numbers of non-zero and zero elements differ from row to row. For a more visual description of the scheme described herein, Fig. 2 schematically presents an exemplary m×13 sparse matrix A_s with a limited number of columns (n = 13), in which row 0 contains 7 non-zero elements a1~a7, row 1 contains 9 non-zero elements b1~b9, and so on, and row m-1 contains 8 non-zero elements s1~s8; the remaining elements are zero.
In this step, the non-zero elements in each row of the matrix A_s are first shifted leftwards in turn, in left-to-right order, so that the non-zero elements of each row are packed contiguously at the left; that is, non-zero elements scattered over the middle and right positions of the matrix are gathered to its left, forming the first intermediate matrix A_1 shown in Fig. 3.
Step S120, sorting the rows of the first intermediate matrix in descending order of their non-zero element counts to obtain a second intermediate matrix, and extracting the non-zero portion of the second intermediate matrix to obtain the compressed matrix of the sparse matrix.
As shown in Fig. 3, the rows of the first intermediate matrix A_1 contain unequal numbers of non-zero elements: row 1, with non-zero elements b1~b9, has the most, and row m-3, with non-zero elements k1~k2, has the fewest. In this step, the rows of A_1 are sorted in descending order of their non-zero counts to form the second intermediate matrix A_2 shown in Fig. 4. After sorting, row 1 of A_1, with the most non-zero elements, becomes row 0 of A_2, and row m-3 of A_1, with the fewest, becomes row m-1 of A_2. Viewed by the distribution of its non-zero elements, the non-zero portion of A_2 is shaped like an inverted ladder; truncating that non-zero portion yields the compressed matrix A_c shown in Fig. 5.
Step S130, storing each column of non-zero elements of the compressed matrix, top to bottom, into a first array in left-to-right column order, correspondingly storing the column indexes of those non-zero elements in the sparse matrix into a second array, and storing the number of non-zero elements in each column into a third array; and storing the number of non-zero elements in each row of the compressed matrix contiguously into a fourth array, in top-to-bottom row order, and storing the row index in the sparse matrix of each row of the compressed matrix into a fifth array.
In this step, after the compressed matrix A_c of the original sparse matrix A_s has been obtained, the non-zero elements of A_c are first stored contiguously into the first array. As shown in Fig. 5, the columns of A_c are taken in left-to-right order, and the non-zero elements of each column are stored top to bottom into the first array A[]. In one embodiment, the first array A[] may be a one-dimensional array whose length equals the number of non-zero elements in the original sparse matrix. For example, storing the non-zero elements of each column of A_c contiguously in this order yields the first array A[] shown at the lower left of Fig. 5.
In this step, to support operations such as vector multiplication (SpMV) on the non-zero elements of the original sparse matrix, the column index in the original sparse matrix of each column of non-zero elements of the compressed matrix A_c must be stored contiguously into the second array, in the same order as the non-zero elements in the first array. As shown in Fig. 5, for each column of A_c, the indexes are stored top to bottom into the second array COL[]. In one embodiment, the second array COL[] may be a one-dimensional array whose length equals the number of non-zero elements in the original sparse matrix. After the column indexes of each column of non-zero elements of A_c are stored in this order, the resulting second array COL[] is shown at the lower right of Fig. 5. Taking the compressed matrix A_c as an example, the non-zero elements b1, s1, a1, g1, c1, f1, d1, …, e1, l1, k1 have column indexes 0, 0, 1, 0, 1, 0, 0, …, 2, 1, 2, respectively, in the original sparse matrix A_s.
In this step, to enable parallel thread computation and adaptive task scheduling when performing operations such as vector multiplication on the non-zero elements of the original sparse matrix, the number of non-zero elements in each column of the compressed matrix A_c must also be stored contiguously into a third array, in left-to-right column order. As shown in Fig. 6, the per-column non-zero counts of A_c are stored in turn into the third array CNT[]. In one embodiment, the third array CNT[] may be a one-dimensional array whose length equals the number of non-zero elements in the row of the original sparse matrix A_s containing the most non-zeros; since row 0 of A_c corresponds to the row of A_s with the most non-zero elements, the length of CNT[] equals the number of non-zero elements in row 0 of A_c. In the example shown in Fig. 6, row 0 of A_c has 9 non-zero elements, so CNT[] has length 9, and the per-column counts of A_c are stored, in left-to-right order, into CNT[0]~CNT[8].
In this step, likewise to enable parallel thread computation and adaptive task scheduling in operations such as vector multiplication on the non-zero elements of the original sparse matrix, the number of non-zero elements in each row of the compressed matrix A_c must be stored contiguously into the fourth array, in top-to-bottom row order. As shown in Fig. 6, the per-row non-zero counts of A_c may be stored in turn into the fourth array RNNZ[]. In one embodiment, the fourth array RNNZ[] may be a one-dimensional array whose length equals the number of rows of the original sparse matrix A_s. In the example shown in Fig. 6, A_s has m rows, so RNNZ[] has length m, and the per-row counts of A_c are stored, in top-to-bottom order, into RNNZ[0]~RNNZ[m-1].
In this step, to restore the row order of the original sparse matrix after performing operations such as vector multiplication on its non-zero elements, the row index in the original sparse matrix of each row of the compressed matrix A_c must be stored contiguously into a fifth array, in top-to-bottom row order. As shown in Fig. 6, these row indexes may be stored in turn into the fifth array IPERM[]. In one embodiment, the fifth array IPERM[] may be a one-dimensional array whose length equals the number of rows of the original sparse matrix A_s. In the example shown in Fig. 6, A_s has m rows, so IPERM[] has length m, and the row indexes are stored, in top-to-bottom order, into IPERM[0]~IPERM[m-1].
In some embodiments, the first array A[], second array COL[], third array CNT[], fourth array RNNZ[] and fifth array IPERM[] characterize the storage of the compressed matrix and may be represented by the five-tuple {A[], COL[], CNT[], RNNZ[], IPERM[]}.
According to the compressed storage method for the sparse matrix provided above, the non-zero elements of each row of the original sparse matrix are first shifted leftwards so that they are packed contiguously at the left, and the rows are then sorted in descending order of their non-zero counts to obtain a second intermediate matrix. Each column of non-zero elements of the second intermediate matrix is then compressed and stored, in left-to-right column order, into a first array; the column indexes of those non-zero elements in the original sparse matrix are stored into a second array in the same order; the per-column non-zero counts are stored contiguously into a third array; and the per-row non-zero counts are stored contiguously into a fourth array. On the one hand, this storage layout permits fully coalesced memory access to the stored element array and column index array when computations such as sparse matrix vector multiplication are executed, improving the data-processing efficiency of sparse matrix vector operations on many-core processors such as GPUs.
On the other hand, because the per-column non-zero counts are kept in the third array and the per-row counts in the fourth array, storing the non-zero elements column by column requires no padding with large numbers of zero elements to equalize rows or columns with unbalanced non-zero counts. At the same time, when sparse matrix vector operations are executed on a many-core processor such as a GPU, the thread-parallel computing capability can be fully exploited, improving the performance of thread-parallel computation over the rows of the compressed matrix.
Embodiment Two
Based on the compressed storage format of the original sparse matrix obtained by the above compressed storage method, sparse matrix vector operations executed in a multi-core processor such as a GPU can fully exploit the thread-parallel computing capability. Accordingly, the embodiments of the present application further provide a parallel processing method for sparse matrix vector multiplication: through adaptive thread task scheduling, parallel accelerated computation of the non-zero elements of each row of the compressed matrix can be realized, improving the parallel processing performance of sparse matrix vector multiplication while minimizing the execution-delay imbalance of the per-row parallel operations.
Fig. 7 is a flow chart of a parallel processing method of sparse matrix vector multiplication according to an embodiment of the present application. As shown in fig. 7, the parallel processing method of sparse matrix vector multiplication according to the embodiment of the present application includes the following steps:
step S710, according to the method for compressing and storing a sparse matrix in the first embodiment of the present application, the sparse matrix is compressed and stored, so as to obtain a compressed matrix of the sparse matrix and a first array, a second array, a third array, a fourth array and a fifth array that characterize storage characteristics of the compressed matrix.
In this step, the original sparse matrix A_s is compressed and stored to obtain the compressed matrix A_c of the original sparse matrix, together with a first array A[], a second array COL[], a third array CNT[], a fourth array RNNZ[] and a fifth array IPERM[] that characterize the compressed matrix. The first array A[] stores each column of non-zero elements of the compressed matrix; the second array COL[] stores the column index, in the original sparse matrix, of each column of non-zero elements of the compressed matrix; the third array CNT[] stores the number of non-zero elements in each column of the compressed matrix; the fourth array RNNZ[] stores the number of non-zero elements in each row of the compressed matrix; and the fifth array IPERM[] stores the row index, in the original sparse matrix, of each row of the compressed matrix.
In some embodiments, the first array A [ ], the second array COL [ ], the third array CNT [ ], the fourth array RNNZ [ ], and the fifth array IPERM [ ] used to characterize the compression matrix feature may be represented by five-tuple { A [ ], COL [ ], CNT [ ], RNNZ [ ], IPERM [ ] }.
Step S720, according to the row order of the compressed matrix, performing adaptive computing-task grouping on the rows of the compressed matrix to obtain at least one row vector group, and obtaining the number of row-parallel threads, the total number of parallel threads and the row start address index required by each row vector group to execute parallel computation.
In this embodiment, the compressed matrix A_c is multiplied with the target vector, and each row vector of A_c can be understood as a unit on which parallel threads perform the vector multiplication. Because the number of non-zero elements per row of A_c decreases from top to bottom, rows with very different non-zero counts incur very different execution delays when each row vector is multiplied with the target vector, which degrades the parallel computing performance of sparse matrix vector multiplication. This is especially so when the ratio between the maximum and minimum per-row non-zero counts in the compressed matrix is large, or when that ratio is close to 1 but the absolute values of the two counts differ greatly: the delays of the per-row parallel computations then diverge significantly, affecting the parallel computing performance of sparse matrix vector multiplication.
In this step, adaptive computing-task grouping is performed on the row vectors of the compressed matrix A_c according to its row order, yielding at least one row vector group. The rows of the compressed matrix can be divided into different row vector groups according to the distribution of per-row non-zero counts, and different row vector groups may require different numbers of row-parallel threads for parallel computation. The number of row-parallel threads required by a row vector group is the number of parallel threads required by each row of the group to execute parallel computation; the total number of parallel threads required by a row vector group is the number required by all rows of the group, and is generally equal to the product of the group's total row count and its row-parallel thread count. The row start address index is the row address index, in the compressed matrix, of the first row of each row vector group when parallel task computation is performed on that group; it may also be called the start offset address of the row vector group.
In this way, the number of row-parallel threads required by each row vector group of the compressed matrix A_c can be matched to the distribution of per-row non-zero counts within the group, so that the execution delays of the rows computed in parallel do not differ greatly overall, improving the overall parallel computing performance of sparse matrix vector multiplication.
In some embodiments, as shown in fig. 8, step S720 may further include the steps of:
step S7210, pre-grouping each row of the compression matrix according to the row sequence of the compression matrix and a predetermined grouping row number, to obtain at least one row group.
In this step, the predetermined grouping row count may be chosen according to the row size of the compressed matrix A_c and the characteristics of the multi-core processor such as a GPU. As an example, a grouping row count of 256, 128 or 64 rows may be selected to pre-group the rows of A_c into a plurality of row groups.
Step S7220, calculating the maximum row non-zero element count and the minimum row non-zero element count of each row group in the at least one row group, and determining the number of row-parallel threads required for each row of the row group to execute parallel computation according to the ratio of the maximum to the minimum row non-zero element count.
In this step, assume that the previous step has pre-grouped the rows of the compressed matrix A_c with a grouping row count of 256, yielding a number of row groups. The maximum per-row non-zero count max_RNNZ and the minimum per-row non-zero count min_RNNZ of each row group can then be obtained from the fourth array RNNZ[] of A_c, after which the number of row-parallel threads required for parallel computation within each row group can be determined from the ratio max_RNNZ/min_RNNZ.
In some embodiments, the level of the number of row-parallel threads required for each row of the row group to execute parallel computation may be determined according to which predetermined ratio range the ratio max_RNNZ/min_RNNZ falls into.
Step S7230, merging the line groups with the same number of parallel threads for executing the parallel computation into one line vector group, to obtain the at least one line vector group.
In this step, after the number of parallel threads of a line required for performing parallel computation on each line in each line group has been determined according to the ratio of max_rnnz/min_rnnz, the line groups with the same number of parallel threads of a line required for performing parallel computation on each line may be combined, that is, the line groups with the same number of parallel threads of a line may be combined into one line vector group, and then the parallel computation may be performed on the line vector group by using the same computation task.
As an example, the following ratio of max_rnnz/min_rnnz is divided into five ratio ranges to determine which level the number of parallel threads of a row required for performing parallel computation for each row in the row group is, and the specific rule is described as follows:
1) If the minimum row non-zero count min_RNNZ of the row group is smaller than 4, set min_RNNZ = 4;
2) If max_RNNZ/min_RNNZ > 64, each row in the row group needs 256 parallel threads to execute parallel computation, and the total number of parallel threads for all rows is 256×256;
3) If max_RNNZ/min_RNNZ > 16 and max_RNNZ/min_RNNZ <= 64, each row in the row group needs 64 parallel threads, and the total number of parallel threads for all rows is 64×256;
4) If max_RNNZ/min_RNNZ > 4 and max_RNNZ/min_RNNZ <= 16, each row in the row group needs 16 parallel threads, and the total number of parallel threads for all rows is 16×256;
5) If max_RNNZ/min_RNNZ > 1 and max_RNNZ/min_RNNZ <= 4, each row in the row group needs 4 parallel threads, and the total number of parallel threads for all rows is 4×256;
6) If max_RNNZ/min_RNNZ <= 1, each row in the row group needs 1 parallel thread, and the total number of parallel threads for all rows is 1×256.
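The example rules above, together with the merging of step S7230, can be sketched as follows (a hypothetical Python illustration; the thresholds and the clamping of min_RNNZ follow rules 1) to 6), while the dictionary-based merge and the `chunk` parameter are assumptions of this sketch):

```python
def threads_per_row(max_rnnz, min_rnnz):
    """Map the max/min per-row non-zero ratio to a row-parallel thread count."""
    min_rnnz = max(min_rnnz, 4)        # rule 1: clamp small minima to 4
    ratio = max_rnnz / min_rnnz
    if ratio > 64:
        return 256                     # rule 2
    if ratio > 16:
        return 64                      # rule 3
    if ratio > 4:
        return 16                      # rule 4
    if ratio > 1:
        return 4                       # rule 5
    return 1                           # rule 6


def group_rows(rnnz, chunk=256):
    """Pre-group rows into fixed chunks, rate each chunk, merge equal ratings.

    Returns a mapping: row-parallel thread count -> list of (start row, row count),
    i.e. the row vector groups with their start address indexes and sizes.
    """
    groups = {}
    for start in range(0, len(rnnz), chunk):
        part = rnnz[start:start + chunk]
        t = threads_per_row(max(part), min(part))
        groups.setdefault(t, []).append((start, len(part)))
    return groups
```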
Subsequently, according to the above rules, the rows of the compressed matrix A_c are divided into row groups whose row-parallel thread counts are 256, 64, 16, 4 and 1, and the row groups with the same row-parallel thread count are merged. Five row vector groups, row vector group 1 to row vector group 5, can thus be generated, whose row-parallel thread counts, total parallel thread counts and row start address indexes are shown in Table 1 below. Rows_pt256, Rows_pt64, Rows_pt16, Rows_pt4 and Rows_pt1 denote the total row counts of row vector group 1 to row vector group 5, respectively; row_Start_256, row_Start_64, row_Start_16, row_Start_4 and row_Start_1 denote the row start address indexes of row vector group 1 to row vector group 5, respectively.
TABLE 1

Row vector group | Row-parallel thread count | Total row count | Row start address index | Total parallel thread count
Row vector group 1 | 256 | Rows_pt256 | row_Start_256 | 256 × Rows_pt256
Row vector group 2 | 64 | Rows_pt64 | row_Start_64 | 64 × Rows_pt64
Row vector group 3 | 16 | Rows_pt16 | row_Start_16 | 16 × Rows_pt16
Row vector group 4 | 4 | Rows_pt4 | row_Start_4 | 4 × Rows_pt4
Row vector group 5 | 1 | Rows_pt1 | row_Start_1 | 1 × Rows_pt1
In step S730, at least one parallel computing task is generated to perform multiplication operation of each of the at least one row vector group and the target vector according to the number of parallel threads, the total number of parallel threads and the row start address index required for performing parallel computation.
In this step, after the adaptive computing-task grouping of the compressed matrix A_c, at least one parallel computing task can be used to perform the multiplication of each row vector group with the target vector. In one embodiment, a plurality of parallel computing tasks may be launched in a multi-core processor such as a GPU to perform parallel accelerated computation for each row vector group. Each parallel computing task may determine suitable kernel function (kernel) operating parameters according to the row-parallel thread count, the total parallel thread count and the row start address index of its row vector group, and execute the parallel computation of each row vector group by running different kernel functions.
In one embodiment, the operating parameters include at least the number of thread blocks (blocks) that the GPU task needs to launch and the number of parallel threads launched per thread block. In one embodiment, step S730 may include: determining, according to the row-parallel thread count and the total parallel thread count of each row vector group, the number of thread blocks each GPU task needs to launch and the number of parallel threads launched per thread block; and generating the at least one GPU task accordingly.
As an example, a corresponding GPU computing task may be generated for each row-vector group, with the operating parameters of each computing task characterized as shown in table 2 below:
TABLE 2
(Table 2: for each GPU computing task, the number of thread blocks to be started and the number of parallel threads started per thread block)
The number of thread blocks in the table is the number of parallel thread blocks (blocks) required to be started for executing the GPU task, and the number of Block threads is the number of parallel threads started by each thread Block.
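A possible derivation of these kernel operating parameters is sketched below (illustrative only; the default of 64 threads per block follows the example of fig. 10 and is an assumption of this sketch, as is the ceiling-division launch policy):

```python
def launch_config(total_rows, threads_per_row, block_size=64):
    """Thread blocks to launch and threads per block for one row vector group."""
    total_threads = total_rows * threads_per_row
    # Ceiling division: enough blocks to cover all parallel threads of the group
    n_blocks = (total_threads + block_size - 1) // block_size
    return n_blocks, block_size
```

For the example of fig. 10 (16 rows, 4 threads per row), this yields one thread block of 64 threads.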
In some embodiments, as shown in fig. 9, step S730 may further include the steps of:
in step S7310, in each parallel computing task, at least one parallel thread is allocated to each row in each row vector group according to the number of parallel threads, the total number of parallel threads and the row start address index required for executing parallel computing in each row vector group, and a first column index in each row and a row address index in each row corresponding to each parallel thread in the at least one parallel thread are determined.
In this step, for each parallel computing task, parallel execution threads are first allocated to each row of its row vector group from the launched parallel thread blocks, according to the group's row-parallel thread count, total parallel thread count and row start address index. By way of example, fig. 10 schematically depicts the parallel computation of the vector multiplication for a row vector group of the compressed matrix with total row count Rows_pt4 = 16, taking a parallel computing task with a row-parallel thread count of 4 as an example.
As shown in fig. 10, in this parallel computing task the number of parallel threads launched per thread block is 64, i.e. threads Thread0 to Thread63; the total row count of the row vector group is Rows_pt4 = 16; and the row start address index, i.e. the row address index of row 0 in the compressed matrix, is row_Start_4. Each row of the row vector group executes 4 parallel threads, so the total number of parallel threads executed is 16×4 = 64, and launching one thread block satisfies the group's parallel computing requirement. Since each row needs 4 parallel threads, 4 parallel threads are initially allocated to each row: from the thread block, threads Thread0 to Thread3 are allocated in sequence to the element positions with column indexes 0 to 3 of row 0 of the group, whose row address index is row_Start_4; threads Thread4 to Thread7 are allocated to the element positions with column indexes 0 to 3 of row 1, whose row address index is row_Start_4+1; and so on, until threads Thread60 to Thread63 are allocated to the element positions with column indexes 0 to 3 of row 15, whose row address index is row_Start_4+15.
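The allocation just described can be summarized as a simple index computation (an illustrative Python sketch; thread ids are assumed to be numbered consecutively within the thread block):

```python
def thread_assignment(tid, threads_per_row, row_start):
    """Map a thread id to its row address index and starting column position."""
    row = row_start + tid // threads_per_row   # which row of the group
    first_col = tid % threads_per_row          # starting column within that row
    return row, first_col
```

With 4 threads per row and row_Start_4 = 0, Thread0 starts at column 0 of row 0, Thread7 at column 3 of row 1, and Thread63 at column 3 of row 15, matching fig. 10.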
Step S7320, in one loop of each parallel thread, reads a non-zero element and a second column index of the non-zero element in a sparse matrix from the first array and the second array according to the row address index, the first column index, and the third array, and performs a multiplication operation of the non-zero element and a multiplier of the target vector corresponding to a position of the second column index.
In this step, for one cycle of each parallel thread, according to the row address index, the column index in each row corresponding to the parallel thread, and the number of non-zero elements in each column in the compression matrix stored in the third array CNT [ ], the non-zero elements to be executed corresponding to each parallel thread and the column index (column number) in the sparse matrix corresponding to each parallel thread may be read out from the first array a [ ] and the second array COL [ ] of the compression matrix, respectively. As shown in fig. 10, for Thread0 to Thread3, the non-zero elements b1, b2, b3, b4 are read from the first array a [ ], and the column indexes 0,2,3,5 in the sparse matrix corresponding to the non-zero elements b1, b2, b3, b4 are read from the second array COL [ ], respectively (see fig. 5); for threads Thread 4-Thread 7, respectively reading non-zero elements s1, s2, s3, s4 from the first array A [ ], and reading column indexes 0,1,3,4 (see FIG. 5) in the sparse matrix corresponding to the non-zero elements s1, s2, s3, s4 from the second array COL [ ]; and the other threads and so on.
Then, for each parallel thread, the non-zero element is multiplied by the multiplier of the target vector X at the position given by the element's column index in the sparse matrix. For example, in one cycle of Thread0, the operation b1×x0 is performed; in one cycle of Thread1, b2×x2; in one cycle of Thread2, b3×x3; in one cycle of Thread3, b4×x5; and so on.
Step S7330, in the next cycle of each parallel thread, shifting each parallel thread rightward by a predetermined column number in each row to obtain a new first column index corresponding to each parallel thread, and repeating step S7320 until the first column index exceeds the number range of non-zero elements in each row.
In this step, when each parallel thread completes one cycle, in the next cycle, the parallel thread needs to be shifted to the right by a predetermined number of columns in the current row. As shown in fig. 10, in this example, each parallel Thread is shifted by 4 columns, for example, for threads Thread0 to Thread3, the next loop is shifted from the positions of the non-zero elements b1, b2, b3, b4 to the positions of the non-zero elements b5, b6, b7, b8, respectively, the non-zero elements b5, b6, b7, b8 are continuously read from the first array a [ ] according to the previous steps, and the column indexes 6,7,8,10 in the sparse matrix corresponding to the non-zero elements b5, b6, b7, b8 are read from the second array COL [ ]. After the operation of this cycle is performed, thread0 continues to shift 4 columns to the right, shifting from the position of non-zero element b5 to the position of non-zero element b9, and the other 3 threads Thread 1-Thread 3 terminate the cycle because the shifted corresponding column index exceeds the number range of non-zero elements in this row. The execution of parallel threads for the other rows of the row vector set and so on.
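Steps S7320 and S7330 together amount to a strided loop that each thread runs over its row. The following sequential Python sketch illustrates this (the prefix-sum indexing into A[] and COL[] is an assumption of this sketch about how column start offsets are located; array names follow the five-tuple):

```python
def thread_partial_sum(vals, cols, cnt, x, row, first_col, stride):
    """Partial sum of one parallel thread over compressed-matrix row `row`.

    The thread starts at column position `first_col` of its row and shifts
    right by `stride` columns each cycle, until the column position exceeds
    the row's non-zero count.
    """
    # Prefix sums locate where each compressed-matrix column starts in vals/cols
    col_start = [0]
    for c in cnt:
        col_start.append(col_start[-1] + c)
    acc = 0.0
    c = first_col
    # Rows are sorted by length, so row `row` still has an element in
    # compressed column c exactly when cnt[c] > row.
    while c < len(cnt) and cnt[c] > row:
        idx = col_start[c] + row           # slot of row `row` within column c
        acc += vals[idx] * x[cols[idx]]
        c += stride                        # shift right by the thread stride
    return acc
```

With the arrays of the 3×3 example (A[] = [1,5,4,2,3], COL[] = [0,0,1,1,2], CNT[] = [3,1,1]) and x = [1,2,3], two threads with stride 2 on row 0 produce the partial sums 10.0 and 4.0, which add up to the row result 14.0.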
Step S7340, in the execution of one of the at least one parallel thread, sums the multiplication results of the execution completion of the at least one parallel thread.
In this step, for a plurality of parallel threads executed in each line, one of the parallel threads is selected, and after the loop operation is completed, the multiplication results of a group of parallel threads allocated in each line are summed. In one embodiment, the step S7340 further includes performing a matrix reduction operation on the summed result according to the fifth array in the execution of one of the at least one parallel threads, that is, reducing the summed result to an original result sequence of the multiplication operation of the sparse matrix according to a row index of each row in the sparse matrix in the compressed matrix stored in the fifth array.
In one embodiment, step S7340 saves the multiplication result of each parallel thread to the shared memory when the first column index corresponding to each parallel thread exceeds the number range of non-zero elements in each row.
In one embodiment, in the execution of one of the at least one parallel threads, the step S7340 reads the result of the multiplication performed by each of the parallel threads from the shared memory and sums the result, and outputs the result to the memory.
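Putting the steps together, the whole multiplication, including the reduction back to the original row order via the fifth array IPERM[] in step S7340, can be simulated sequentially (an illustrative sketch of the data flow only, not the parallel GPU implementation):

```python
def spmv_compressed(vals, cols, cnt, rnnz, iperm, x):
    """Sequential simulation of the parallel sparse matrix-vector multiply."""
    # Prefix sums: where each compressed-matrix column starts in vals/cols
    col_start = [0]
    for c in cnt:
        col_start.append(col_start[-1] + c)
    y = [0.0] * len(rnnz)
    for r in range(len(rnnz)):             # each compressed-matrix row
        acc = 0.0
        for c in range(rnnz[r]):           # each non-zero position of the row
            idx = col_start[c] + r         # row r occupies slot r of column c
            acc += vals[idx] * x[cols[idx]]
        y[iperm[r]] = acc                  # scatter back to original row order
    return y
```

For the 3×3 example arrays and x = [1,2,3], the result is [5.0, 14.0, 8.0], matching the direct product of the original matrix with x.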
According to the parallel processing method for sparse matrix vector multiplication of the embodiments of the present application, on the basis of the advantages of the above compressed storage method, the first array and the second array of the compressed matrix can be accessed with fully coalesced memory accesses, improving the data processing efficiency of sparse matrix vector operations executed in a multi-core processor such as a GPU. On the other hand, by adaptively allocating the number of parallel threads executed per row of the compressed matrix, the thread-parallel computing capability of a multi-core processor such as a GPU can be fully exploited when executing sparse matrix vector operations, the performance of thread-parallel computation over rows with uneven non-zero counts is improved, and the execution-delay imbalance of the per-row parallel operations is significantly reduced.
The above describes exemplary embodiments of the present application, it should be understood that the above-described exemplary embodiments are not limiting, but rather illustrative, and the scope of the present application is not limited thereto. It will be appreciated that modifications and variations to the embodiments of the present application may be made by those skilled in the art without departing from the spirit and scope of the present application, and such modifications and variations are intended to be within the scope of the present application.

Claims (10)

1. A compressed storage method of a sparse matrix, comprising:
shifting the non-zero elements in each row of the sparse matrix leftwards in sequence from left to right to enable the non-zero elements in each row to be arranged continuously towards the left, so as to obtain a first intermediate matrix;
the rows of the first intermediate matrix are inversely sequenced according to the sequence from the large number to the small number of the non-zero elements of each row to obtain a second intermediate matrix, and the non-zero element parts of the second intermediate matrix are intercepted to obtain a compression matrix of the sparse matrix;
storing each column of non-zero elements in the compression matrix into a first array from top to bottom according to a column sequence from left to right, correspondingly storing column indexes of each column of non-zero elements in the sparse matrix into a second array, and storing the number of each column of non-zero elements into a third array; and continuously storing the number of non-zero elements of each row in the compression matrix to a fourth array according to the row sequence from top to bottom, and storing the row index of each row in the compression matrix in the sparse matrix to a fifth array.
2. The compressed memory method of claim 1, wherein the first, second, third, fourth, and fifth arrays form five-tuple characterizing a memory characteristic of the compressed matrix.
3. A parallel processing method for sparse matrix vector multiplication, comprising:
step S710, performing compression storage on a sparse matrix according to the compression storage method of claim 1 or 2, to obtain a compressed matrix of the sparse matrix and a first array, a second array, a third array, a fourth array and a fifth array characterizing storage characteristics of the compressed matrix;
step S720, according to the line sequence of the compression matrix, carrying out self-adaptive calculation task grouping on the lines of the compression matrix to obtain at least one line vector group, and obtaining the number of line parallel threads, the total number of parallel threads and the line start address index required by each line vector group to execute parallel calculation;
in step S730, at least one parallel computing task is generated to perform multiplication operation of each of the at least one row vector group and the target vector according to the number of parallel threads, the total number of parallel threads and the row start address index required for performing parallel computation.
4. A parallel processing method according to claim 3, wherein the step S720 includes:
step S7210, pre-grouping each row of the compression matrix according to the row sequence of the compression matrix and a preset grouping row number to obtain at least one row group;
Step S7220, calculating the maximum line non-zero element number and the minimum line non-zero element number of each line group in the at least one line group, and determining the line parallel thread number required by each line group to execute parallel calculation according to the ratio of the maximum line non-zero element number to the minimum line non-zero element number;
step S7230, merging the line groups with the same number of parallel threads for executing the parallel computation into one line vector group, to obtain the at least one line vector group.
5. The parallel processing method of sparse matrix vector multiplication according to claim 4, wherein said determining the number of parallel threads of a row required for performing parallel computation for each of said row groups based on the ratio of said maximum number of non-zero elements of a row to said minimum number of non-zero elements of a row comprises:
and determining the number of line parallel threads required by each line group to execute parallel calculation according to whether the ratio of the maximum line non-zero element number to the minimum line non-zero element number is in a preset ratio range.
6. The parallel processing method of sparse matrix vector multiplication according to any one of claims 3-5, wherein step S730 comprises:
step S7310, in each parallel computing task, allocating at least one parallel thread to each row in each row vector group according to the number of parallel threads, the total number of parallel threads and the row start address index required for executing parallel computing in each row vector group, and determining a first column index in each row and a row address index of each row corresponding to each parallel thread in the at least one parallel thread;
Step S7320, in one cycle of each parallel thread, reading a non-zero element and a second column index of the non-zero element in a sparse matrix from the first array and the second array according to the row address index, the first column index and the third array, respectively, and performing a multiplication operation of the non-zero element and a multiplier of a position corresponding to the second column index in the target vector;
step S7330, in the next cycle of each parallel thread, shifting each parallel thread rightward by a predetermined column number in each row to obtain a new first column index corresponding to each parallel thread, and repeating step S7320 until the first column index exceeds the number range of non-zero elements in each row;
step S7340, in the execution of one of the at least one parallel thread, sums the multiplication results of the execution completion of the at least one parallel thread.
7. The parallel processing method of sparse matrix vector multiplication according to claim 6, wherein said step S7340 further comprises: and restoring the summation result to the original result sequence of the multiplication operation of the sparse matrix according to the row index of each row in the sparse matrix in the compressed matrix stored in the fifth array.
8. The parallel processing method of sparse matrix vector multiplication according to claim 7, wherein said step S7330 comprises: and when the first column index corresponding to each parallel thread exceeds the number range of non-zero elements of each row, storing the multiplication operation result which is completed by executing each parallel thread into a shared memory.
9. The parallel processing method of sparse matrix vector multiplication according to claim 8, wherein said step S7340 further comprises: in the execution of one of the at least one parallel thread, reading and summing the multiplication result of each parallel thread execution completion from the shared memory, and outputting the summation result to the memory.
10. The parallel processing method of sparse matrix vector multiplication according to claim 9, wherein generating at least one parallel computing task in step S730 comprises:
determining the number of thread blocks required to be started by at least one GPU task and the number of parallel threads started by each thread block according to the number of parallel threads and the total number of parallel threads required by the parallel computation executed by each row vector group;
and generating the at least one GPU task according to the number of thread blocks required to be started by the at least one GPU task and the number of parallel threads started by each thread block.
CN202111509788.5A 2021-12-10 2021-12-10 Compressed storage of sparse matrix and parallel processing method of vector multiplication thereof Pending CN116257209A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111509788.5A CN116257209A (en) 2021-12-10 2021-12-10 Compressed storage of sparse matrix and parallel processing method of vector multiplication thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111509788.5A CN116257209A (en) 2021-12-10 2021-12-10 Compressed storage of sparse matrix and parallel processing method of vector multiplication thereof

Publications (1)

Publication Number Publication Date
CN116257209A true CN116257209A (en) 2023-06-13

Family

ID=86678028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111509788.5A Pending CN116257209A (en) 2021-12-10 2021-12-10 Compressed storage of sparse matrix and parallel processing method of vector multiplication thereof

Country Status (1)

Country Link
CN (1) CN116257209A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116304750A (en) * 2023-05-19 2023-06-23 北京算能科技有限公司 Data processing method and device, electronic equipment and storage medium
CN116304750B (en) * 2023-05-19 2023-08-18 北京算能科技有限公司 Data processing method and device, electronic equipment and storage medium
CN117579225A (en) * 2023-11-21 2024-02-20 四川新视创伟超高清科技有限公司 Sparse matrix coding and data storage method for unstructured regular distribution
CN117579225B (en) * 2023-11-21 2024-05-10 四川新视创伟超高清科技有限公司 Sparse matrix coding and data storage method for unstructured regular distribution

Similar Documents

Publication Publication Date Title
US11321423B2 (en) Operation accelerator
CN109919311B (en) Method for generating instruction sequence, method and device for executing neural network operation
CN112840356B (en) Operation accelerator, processing method and related equipment
CN109948774B (en) Neural network accelerator based on network layer binding operation and implementation method thereof
CN111079917B (en) Tensor data block access method and device
US8570336B2 (en) Texture unit for general purpose computing
US20230026006A1 (en) Convolution computation engine, artificial intelligence chip, and data processing method
TW202123093A (en) Method and system for performing convolution operation
EP3093757B1 (en) Multi-dimensional sliding window operation for a vector processor
CN112668708B (en) Convolution operation device for improving data utilization rate
CN111639701B (en) Method, system and equipment for extracting image features and readable storage medium
CN116257209A (en) Compressed storage of sparse matrix and parallel processing method of vector multiplication thereof
CN110377875A (en) Matrix inversion technique, device, equipment and computer readable storage medium
CN114491402A (en) Calculation method for sparse matrix vector multiplication access optimization
CN111310115A (en) Data processing method, device and chip, electronic equipment and storage medium
CN111667052A (en) Standard and nonstandard volume consistency transformation method for special neural network accelerator
CN116227599A (en) Inference model optimization method and device, electronic equipment and storage medium
CN109614149A (en) The upper triangular portions storage device of symmetrical matrix and parallel read method
CN114912596A (en) Sparse convolution neural network-oriented multi-chip system and method thereof
CN114758209A (en) Convolution result obtaining method and device, computer equipment and storage medium
CN114662647A (en) Processing data for layers of a neural network
CN113592075A (en) Convolution operation device, method and chip
CN113469333B (en) Artificial intelligence processor, method and related products for executing neural network model
Liguori A MAC-less Neural Inference Processor Supporting Compressed, Variable Precision Weights
CN117785480B (en) Processor, reduction calculation method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination