CN114491402A

CN114491402A - Calculation method for sparse matrix vector multiplication access optimization

Info

Publication number: CN114491402A
Application number: CN202210066814.XA
Authority: CN
Inventors: 胡长军; 储根深; 何远杰; 丁哲昭; 董玲玉; 戴潮虎
Original assignee: University of Science and Technology Beijing USTB
Current assignee: University of Science and Technology Beijing USTB
Priority date: 2022-01-20
Filing date: 2022-01-20
Publication date: 2022-05-13

Abstract

The invention discloses a calculation method for sparse matrix vector multiplication access optimization, which is suitable for DCU (distributed computing Unit) and GPU (graphics processing Unit) architectures, and comprises the following steps: dividing an original sparse matrix into a plurality of blocks according to a fixed row number, wherein each block is independently calculated by one thread block, and a space with a fixed size is opened up for each thread block in an LDS (laser direct structuring); calculating the number of turns required to be calculated by each thread block; in one round of calculation, all threads in each thread block finish a plurality of times of non-zero element calculation and write the result into an LDS; one or more threads in each thread block sum the multiplication results of the LDS and store the results to a register; and after all rounds of calculation are finished, further calculating the result in the register, and writing the result back to the video memory. The invention is based on the original CSR format, does not need preprocessing, can fully utilize the access merging characteristic and realizes higher access bandwidth utilization.

Description

Calculation method for sparse matrix vector multiplication access optimization

Technical Field

The invention relates to the technical field of high-performance calculation and algorithm, in particular to a calculation method for sparse matrix vector multiplication access optimization suitable for DCU (distributed computing Unit) and general GPU (graphics processing Unit) architecture.

Background

Sparse Matrix-Vector Multiplication (SpMV) is widely used in the fields of scientific computation, graphic analysis, signal processing, and the like. The matrix involved in the calculation is sparse, i.e. most elements in the matrix are zero. Due to the efficient compression rate, the Compressed Sparse Row compression Format (CSR) is the most common storage form for Sparse matrices. In sparse matrix vector multiplication, the performance bottleneck is memory access, and how to improve the memory access performance becomes a key.

Graphics Processing Units (GPUs) can provide powerful parallel computing power and huge data throughput, are excellent in handling large-scale numerical computing problems, and provide a solid computing platform for high-performance computing systems. A DCU (Deep Computing Unit) is one type of GPU accelerator that provides high memory access bandwidth. This makes it possible to implement efficient SpMV computation on DCU accelerators and GPU platforms.

The existing SpMV algorithms (such as CSR-Scale, CSR-Vector and other methods) often face the problem of discontinuous access and storage, and cannot fully utilize the access and storage combination characteristics of a GPU, so that the access and storage bandwidth utilization is not high, and the overall performance loss is large. Other schemes (such as the CSR-Stream algorithm, the CSR-Adaptive algorithm, the HOLA algorithm, and the CSR5) that can efficiently use the memory bandwidth often need to perform preprocessing or convert the CSR format into other formats, and thus, the overhead is large.

Disclosure of Invention

The invention provides a calculation method and a calculation device for sparse matrix vector multiplication memory access optimization, which aim to solve the technical problems of low memory access bandwidth utilization, large overall performance loss and large extra cost in the prior art.

In order to solve the technical problems, the invention provides the following technical scheme:

on one hand, the invention provides a calculation method for sparse matrix vector multiplication access optimization, which is suitable for DCU and general GPU architecture, and comprises the following steps:

dividing an original sparse matrix into a plurality of blocks according to a fixed number of rows, wherein each block is independently calculated by one thread block, and respectively opening up a space with a fixed size for each thread block in an LDS (shared memory);

calculating the number of nonzero elements of each thread block in charge of calculation, and calculating the number of turns required to be calculated by each thread block according to the calculation result by combining the preset number of nonzero elements of each thread block which are calculated at most in each turn;

in one round of calculation, all threads in each thread block finish a plurality of times of calculation of products of non-zero elements and corresponding elements in corresponding vectors, and a multiplication result is written into an LDS space corresponding to the current thread block;

dividing all threads in each thread block into a group of single or multiple threads, summing the multiplication results of the LDS by each group of threads, and storing the summation results in a register;

and when all rounds of calculation are finished, obtaining a final calculation result based on the summation result stored in the register, and writing the final calculation result back to the video memory.

Further, the original sparse matrix is stored using CSR format.

Further, when the original sparse matrix is divided into a plurality of blocks according to the fixed number of rows, except for the last block, the number of rows of each block is equal and the number of rows is N, and the number of rows of the last block is less than or equal to N.

Furthermore, the number of non-zero elements calculated at most in each round of each thread block is R multiplied by M; wherein M is the number of threads contained in each thread block, and R is a positive integer;

in a single calculation turn, each thread in each thread block completes the calculation of the product of at most R non-zero elements in the sparse matrix block for which the current thread block is responsible and the corresponding elements in the corresponding vector.

Further, when all threads in each thread block are divided into a group of single or multiple threads, the number of threads in each group is a non-negative integer power of 2;

and the total number of each thread block is greater than or equal to the matrix row number processed by the thread block.

The technical scheme provided by the invention has the beneficial effects that at least:

1. the access bandwidth is high. Because all threads in the thread block calculate the nonzero elements participating in multiple lines together, the reading of the nonzero elements of the sparse matrix among the threads is continuous, the locality of data access is improved, and the access bandwidth can be effectively improved.

2. The acceleration effect is good. Because all threads participate in the calculation in a single round of calculation together, only part of the threads in one round are in an idle state under the worst condition, the utilization rate of calculation resources is improved, and the calculation efficiency can be effectively improved.

3. Saving LDS space. By adopting batch calculation, only the LDS space with a fixed size needs to be distributed for each thread block, and the space is repeatedly utilized in the calculation process without distributing the space for the calculation of each batch.

4. No pre-processing operation is required. The current preprocessing-based algorithms are relatively expensive, and even the preprocessing time is longer than the calculation time. The method of the invention does not need pretreatment operation and has higher performance.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a schematic diagram of a calculation method for sparse matrix vector multiplication access optimization according to an embodiment of the present invention;

fig. 2 is a flowchart of a calculation method for sparse matrix vector multiplication access optimization according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

First embodiment

In order to achieve a better effect of the SpMV in the DCU and GPU architectures and to solve the problem of discontinuous access of non-zero elements in each row of the sparse matrix between threads, the embodiment provides a sparse matrix vector multiplication access optimization calculation method suitable for the DCU and general GPU architectures. In each calculation round circulation of the block, data are loaded from a video memory to carry out multiplication calculation, results are written into an LDS, multiplication results are read from the LDS to carry out specification summation, and the results are written into a register. After the calculation round circulation of the block is completed, data is required to be read from the register to calculate a final result, and the result is written back to the video memory.

The implementation principle of the sparse matrix vector multiplication memory access optimization calculation method suitable for DCU and general GPU architectures in this embodiment is shown in fig. 1, where in fig. 1, italic characters are memory access or LDS operations, and bold represents a storage space in a video memory or a storage space in LDS. Each block is distributed with the same number of rows, the internal thread of the block continuously accesses and stores the non-zero element value of the matrix, calculates the product of the non-zero element and the corresponding x vector element value, and stores the multiplication result in the LDS. After all threads in the block finish the writing of the LDS, one or more threads are used for reading the values in the LDS for stipulation and summation, and finally, the result is written back to the memory. In fig. 1, row _ range and reduce _ range respectively represent the index range of the matrix row for which the thread is responsible and the range of the specification for which the thread is responsible in the current round of calculation. Taking thread 1 as an example, thread 1 is responsible for calculating row 1, the number range of row 1 in the value array is [3,11), and thread 1 only calculates the element with the number of 10 in the second round of calculation, so row _ range of thread 1 is [3,11), and reduce _ range is [10, 11).

Based on the above, the execution flow of the sparse matrix vector multiplication access optimization calculation method applicable to DCU and general GPU architecture of the present embodiment is shown in fig. 2, and includes the following steps:

and S1, dividing the calculation task and initializing.

The original sparse matrix is divided into a plurality of blocks according to the fixed line number, each block is independently calculated by one thread block, and a space with a fixed size is respectively opened up for each thread block in the LDS.

And S2, counting the calculation turns.

And calculating the number of nonzero elements of each thread block in charge of calculation, and calculating the number of turns required to be calculated by each thread block according to the calculation result and the preset number of nonzero elements of each thread block which are calculated at most in each turn.

And S3, calculating a multiplication result.

In one round of calculation, all threads in each thread block finish a plurality of times of calculation of products of non-zero elements and corresponding elements in corresponding vectors, and the multiplication results are written into the LDS space corresponding to the current thread block.

S4, sum reduction.

All threads in each thread block are divided into a group of single threads or multiple threads, each group of threads sums the multiplication results of the LDS respectively, and the summation results are stored in a register.

S5, write back the final result.

And after finishing all rounds of calculation, further calculating according to needs based on the summation result stored in the register to obtain a final calculation result, and writing the final calculation result back to the video memory.

Multiplication by sparse matrix vector

For example, the method of the present embodiment is further described. Wherein A is a sparse matrix with m rows, n columns and s non-zero elements,

is a dense vector of length t (where t is n),

is a vector of length m, and both alpha and beta are constants. The matrix a is stored using the CSR format, i.e. comprising three arrays: value [ s ]]，row_ptr[m+1]And col _ index [ s ]]. The value of each non-zero element in the value array storage matrix A, the position of the first non-zero element of each row of the row _ ptr array storage matrix A in the value array, and the column sequence number of each non-zero element in the col _ index array storage matrix A. Vector quantity

Using arrays x [ t ] respectively]And an array y [ m ]]And (5) storing.

Based on the above, the specific implementation of the calculation method of the embodiment is as follows:

dividing a computing task and initializing:

(1) defining the number of rows of a matrix processed by each block to be N, dividing the matrix A into a plurality of blocks with equal rows, namely, except the last block, the number of rows of the rest blocks is equal and the number of rows is N, and the number of rows of the last block is less than or equal to N. The available matrix A can be divided into

The block, for the block numbered i, consists of all the rows in matrix a between the starting row number N × i and the ending row number min (N × (i +1), m).

(2) According to the block information of the matrix A, each block is allocated with one block, the number of the blocks is equal to the number of the blocks, and the number of the blocks is

Wherein the content of the first and second substances,

is a round-up operation. I.e. each block processes a fixed number of matrix rows.

(3) And according to the block information of the matrix A, each block corresponds to one block. This results in the starting and ending row numbers for each block responsible row. For block i, the line numbers of its start and end are block _ row _ begin ═ N × i, and block _ row _ end ═ min (N × (i +1), m), respectively.

(4) For each block, by using a row _ ptr array, the number of nonzero elements contained in the block corresponding to the block, block _ nnz _ nums, can be obtained. Specifically, the index of the first non-zero element of the start row of the block corresponding to the block in the value array can be obtained through the row _ ptr array, and similarly, the index of the first non-zero element of the stop row in the value array can be obtained, so that block _ nnz _ nums is row _ ptr [ block _ row _ end ] -row _ ptr [ block _ row _ begin ]. And block _ row _ inx _ begin is written as block _ ptr [ block _ row _ begin ], block _ row _ inx _ end is written as block _ row _ end ].

(5) Allocating an LDS array shared _ val: setting the number of threads in each block as M, and allocating an array shared _ val [ R × M ] with a fixed size in the LDS for each block]To save the non-zero element and of the block corresponding to the block

The product of the corresponding elements in the vector. Wherein R is a positive integer, but shared _ val [ R × M]Is no longer than the available LDS space size.

(6) Each thread in the block initializes a variable sum to 0 for subsequent specification summing.

Counting and calculating turns:

and counting the number of times of each block batch calculation, and ensuring that the shared _ val array can store the calculation result of each round. At most R × M nonzero elements can be calculated in each round (only the last round may be less than R × M elements, and the rest rounds are R × M elements), so that the number of times rounds needed by each block to be batched is obtained

Wherein the content of the first and second substances,

indicating rounding up.

And calculating a multiplication result:

(1) and calculating the index ranges of the corresponding value array and the col _ index array calculated by the round in the block. Let the current round number be k (0 ≦ k < rounds, i.e., k is numbered from 0), the start index of these two arrays is block _ round _ inx _ start ═ block _ row _ inx _ begin + kxrxm, and the end index is block _ round _ inx _ end ═ min (block _ round _ inx _ start + rxxm, block _ round _ inx _ end). The start index and the end index are represented in interval form: round _ range [ block _ round _ inx _ start, block _ round _ inx _ end).

(2) In a single round of computation, each thread in the block completes at most R non-zero elements and vectors in the block for which the block is responsible

And (4) calculating the product of the corresponding element in the array, and writing the product into a shared _ val array. For integer i, if block _ round _ inx _ start is satisfied, i is less than or equal to<block _ round _ inx _ end, i is the index number at the round. For index number i meeting the above condition, multiply calculate value [ i]×x[col_index[i]]Will be processed by a thread with sequence number mod (i-block _ row _ inx _ begin, M) in block, and the multiplication result will be written into shared _ val [ mod (i-row _ ptr [ block _ row _ begin ] of LDS]，R×M)]Location. Here, mod (a, b) is a modulo operation, representing the remainder of a/b.

(3) Waiting for all threads in the block to complete the computation and ensuring that all results have been written to the LDS.

And (3) sum of specifications:

(1) dividing all threads in each block into a group of single or multiple threads, wherein the number of threads in each group is a non-negative integer power of 2, and the total number of threads of each block is greater than or equal to the number of matrix rows processed by the block. Let the number of threads in each group be 2^KK is a non-negative integer and satisfies M/2^KThe number of the thread group with the number of i in each block is more than or equal to N, and the number of the thread group with the number of i in each block is obtained to be i/2^KThe integer part of (2).

(2) And according to the grouping information of the threads in the block, allocating a row to each group of threads with the group number smaller than N, wherein all the threads in the group of threads are responsible for the summation of the specifications of the row and writing back the final result to the video memory. The thread group numbered i in each block (where i < N) is responsible for the row number r of one row in the matrix block _ row _ begin + i.

(3) The start and end indices of the array of non-zero element values corresponding to the row for which thread group i is responsible are calculated (the row does not include the non-zero element corresponding to the end index), as shown in the example of fig. 1. The start index and the end index are expressed in interval form: row _ range [ reduce _ row _ idx _ begin, reduce _ row _ idx _ end ]). Wherein, the start index reduce _ row _ idx _ begin is row _ ptr [ r ], and the end index is reduce _ row _ idx _ end is row _ ptr [ r +1 ]. In addition, since the start and end indexes are fixed in each calculation round, the step can be performed in advance in the calculation task division.

(4) Before the thread group i is subjected to reduction, the start index and the end index of a shared _ val array in the LDS to be traversed are calculated. As shown in the example of fig. 1. The start index and the end index are expressed in interval form: reduce _ range [ reduce _ start, reduce _ end). The reduce _ range interval is the intersection of the range row _ range and the range round _ range. The value of "reduce _ start" is max (block _ round _ inx _ start, reduce _ row _ idx _ begin), and the value of "reduce _ end" is min (block _ round _ inx _ end, reduce _ row _ idx _ end).

(5) If the intersection in the previous step is not empty, that is, reduce _ start < reduce _ end, that is, reduce _ row _ idx _ begin < block _ round _ inx _ end and reduce _ row _ idx _ end > block _ round _ inx _ start, the thread group i executes the reduction operation in the next step, otherwise, the reduction operation in the next step is skipped.

(6) For each set of threads i meeting the condition in the block (wherein the condition is i)<N) that together read the multiplication result of the row (i.e., the non-zero element and vector corresponding to the row) from the shared _ val array of the LDS

The product of corresponding elements) and adding the read value to the variable sum of the thread with the number of 0 in the thread group by using shfl _ down and a parallel reduction mode. Specifically, the thread group i reads the elements (excluding the reduce _ end) indexed from the reduce _ start to the reduce _ end in parallel from the LDS, and accumulates them, and the accumulated result is denoted as sum 1. If a thread group comprises a plurality of threads, shfl _ down operation is needed to be utilized for inter-thread communication, and the sum1 specification of the accumulation result is added into the thread group to be compiledAbove the variable sum for the thread with number 0. If the thread group contains only one thread, the accumulation result sum1 in the thread is directly added to sum.

(7) And waiting for all thread operations in the block to complete.

(8) And if the block completes all rounds, namely the multiplication calculation of all the non-zero elements in the matrix block and the sum of the specifications of all the rows are completed, performing a step of writing back a final result. Otherwise, jumping to the step of calculating multiplication results, and executing the next round.

Write back the final result:

and (3) each group of threads in the block is numbered with 0, the variable sum is a protocol summation result, if the group of threads is responsible for the calculation of the r-th row in the matrix A, calculating alpha multiplied sum + beta multiplied by y [ r ], obtaining a final calculation result, and writing the final calculation result back to y [ r ].

To sum up, the calculation method of the embodiment divides the sparse matrix according to the number of rows of the matrix, all threads in each thread block complete the calculation of the same number of rows together, each thread is responsible for the calculation of a plurality of elements in a plurality of rows, and the storage space is allocated in advance in the LDS. The computing method can fully utilize the architecture characteristics of the DCU and the GPU, and utilize the access and memory merging characteristics of hardware as much as possible, so that the access and memory bandwidth can be greatly improved.

It should also be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

Finally, it should be noted that while the above describes a preferred embodiment of the invention, it will be appreciated by those skilled in the art that, once the basic inventive concepts have been learned, numerous changes and modifications may be made without departing from the principles of the invention, which shall be deemed to be within the scope of the invention. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Claims

1. A sparse matrix vector multiplication memory access optimization computing method is suitable for DCU and GPU architectures, and is characterized by comprising the following steps:

dividing an original sparse matrix into a plurality of blocks according to a fixed number of rows, wherein each block is independently calculated by one thread block, and a space with a fixed size is respectively opened up for each thread block in an LDS (laser direct structuring) mode;

2. The sparse matrix vector multiplication access optimization calculation method of claim 1, wherein the original sparse matrix is stored using a CSR format.

3. The sparse matrix vector multiplication access optimization calculation method of claim 1, wherein when the original sparse matrix is divided into a plurality of blocks according to fixed number of rows, except for the last block, the number of rows of each block is equal and the number of rows is N, and the number of rows of the last block is less than or equal to N.

4. The sparse matrix vector multiplication access optimization calculation method of claim 1, wherein the number of non-zero elements calculated at most per round of each thread block is R x M; wherein M is the number of threads contained in each thread block, and R is a positive integer;

5. The sparse matrix vector multiply access optimization computing method of claim 1, wherein when all threads in each thread block are divided in groups of single or multiple threads, the number of threads in each group is a non-negative integer power of 2;