CN110580675A - Matrix storage and calculation method suitable for GPU hardware

Info

Publication number: CN110580675A
Application number: CN201910859641.5A
Authority: CN
Inventors: 邵雪; 王晓光; 周振亚
Original assignee: Beijing CEC Huada Electronic Design Co Ltd
Current assignee: Huada Empyrean Software Co Ltd; Beijing CEC Huada Electronic Design Co Ltd
Priority date: 2019-09-11
Filing date: 2019-09-11
Publication date: 2019-12-17

Abstract

A matrix storage and calculation method suitable for GPU hardware comprises the following steps: 1) storing the number of rows, the number of columns, the non-zero elements, a flag indicating whether each element is non-zero, and the position of the starting non-zero element of each row or column; 2) accessing matrix elements through a GPU (graphics processing unit), determining whether a matrix element is non-zero, reading the values of non-zero elements, and setting the values of non-zero elements of the matrix; 3) performing matrix operations using the GPU. The method enables multithreaded high-speed access to any element of the matrix on GPU hardware, thereby greatly improving the speed of matrix calculation on the GPU.

Description

Matrix storage and calculation method suitable for GPU hardware

Technical Field

The invention relates to the field of high-performance calculation of GPU (graphics processing Unit) hardware, in particular to the technical field of high-performance calculation of matrix multiplication and LU decomposition by the GPU hardware, and particularly relates to a matrix storage and calculation method suitable for the GPU hardware.

Background

In recent years, the scale of matrix operations in high-performance computing has grown steadily, while the traditional CPU architecture is limited by the power-consumption bottleneck, is difficult to improve further in performance, and cannot meet the computing demand. Compared with the CPU, the GPU has abundant computing resources and high data access bandwidth, and under ideal conditions can achieve a speedup of tens of times over the CPU. However, matrix decomposition has strong data dependencies, which makes algorithm optimization difficult, so progress in applying GPUs has been slow.

Disclosure of Invention

In order to solve the defects in the prior art, the invention aims to provide a matrix storage and calculation method suitable for GPU hardware, which makes full use of the characteristics of the GPU hardware and realizes high-performance calculation of a matrix.

In order to achieve the above object, the matrix storage and calculation method suitable for GPU hardware provided by the present invention comprises the following steps:

1) Storing the number of rows, the number of columns, the non-zero elements, a flag indicating whether each element is non-zero, and the position of the starting non-zero element of each row or column;

2) Accessing matrix elements through a GPU (graphics processing unit), determining whether a matrix element is non-zero, reading the values of non-zero elements, and setting the values of non-zero elements of the matrix;

3) Performing matrix operations using the GPU.

Further, the step 1) further comprises:

Storing the number of rows and columns of the matrix;

Sequentially storing non-zero elements of the matrix to a first array according to the sequence of rows or columns of the matrix;

Storing a flag indicating whether each element in each row or column is non-zero in a second array;

Storing the position of the starting non-zero element of each row or column in the first array to a third array.

Further, the step of sequentially storing the non-zero elements of the matrix to the first array in the order of the rows or columns of the matrix further includes determining the size of the first array according to the total number of the non-zero elements of the matrix.

Further, the step of storing the flag indicating whether each element in each row or column is non-zero into the second array further includes determining the size of the second array according to the number of rows or columns of the matrix, and storing the non-zero flags of the matrix elements of each row or column consecutively in the second array from the low bit to the high bit, wherein each bit corresponds to one matrix element, and a bit value of 1 indicates that the corresponding matrix element is non-zero.

Further, the step 2) further comprises the following steps:

Acquiring a non-zero zone bit according to the position information of the data to be read;

Reading the position of the first non-zero element of the row or column of the data to be read, and recording the position as a first position;

Calculating the position difference between the data to be read and the first non-zero element of the row or column where the data to be read is located, and recording the position difference as a second position;

Calculating the position of the data to be read in the first array, and recording the position as a third position, wherein the third position is the first position plus the second position;

and reading the value of the data to be read in the first array according to the third position.

Further, the step 3) includes matrix addition, matrix subtraction, matrix multiplication and matrix LU decomposition algorithm.

Further, the matrix addition comprises the steps of:

Determining whether the two corresponding elements of the two matrices participating in the addition operation are non-zero;

If the two corresponding elements are both zero, the result of adding or subtracting them is zero;

If only one of the two corresponding elements is non-zero, the result of adding or subtracting them is the positive or negative value of the non-zero element;

If the two corresponding elements are both non-zero, the result is obtained by adding or subtracting the two element values.

Further, the matrix multiplication comprises the following steps:

First testing c_ij: if c_ij is a non-zero element, ending; otherwise, continuing;

Initializing the variable v = 0;

Traversing the elements a_ik of row i of matrix A, with k from 1 to C_A, together with the elements b_kj of column j of matrix B, with k from 1 to R_B; if a_ik and b_kj are both non-zero elements, then v = v + a_ik × b_kj;

Obtaining the result c_ij = v.

Further, the LU decomposition algorithm of the matrix includes the following steps:

Reading the compressed matrix data into a dense matrix D, wherein d_ij = a_ij if a_ij is a non-zero element and d_ij = 0 otherwise, and then synchronizing all threads of the GPU to ensure the consistency of the data in the D matrix;

Traversing the kth row and the kth column of the D matrix being decomposed, with k from 1 to R (or C), and repeating the following steps:

Taking the elements of row k of the D matrix from column k to column C as the kth row of the U matrix, wherein if u_kj is a non-zero element then u_kj = d_kj, with j ranging from k+1 to C;

Dividing the elements of column k of the D matrix from row k+1 to row R by d_kk to obtain the kth column of the L matrix, wherein if l_ik is a non-zero element then l_ik = d_ik / d_kk, with i ranging from k+1 to R;

Updating all elements to the lower right of row k, column k of the D matrix, wherein if l_ik and u_kj are both non-zero elements then d_ij = d_ij - l_ik × u_kj, with i ranging from k+1 to R and j ranging from k+1 to C;

Synchronizing all threads of the GPU to ensure the consistency of all data in the D matrix.

To achieve the above object, the present invention further provides a computer readable storage medium, on which computer instructions are stored, and the computer instructions execute the steps of the above matrix storage and calculation method suitable for GPU hardware when executed.

Advantageous effects: according to the matrix storage and calculation method suitable for GPU hardware of the invention, the positions of the non-zero elements of the matrix in the storage space are calculated using the non-zero flags, so that any element of the matrix can be accessed at high speed by multiple threads on GPU hardware. This improves the efficiency with which the GPU hardware accesses matrix elements and thereby greatly increases the speed of matrix calculation on GPU hardware.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a flow chart of a matrix storage and computation method for GPU hardware according to the present invention;

FIG. 2 is a schematic diagram of the matrix storage data according to an embodiment of the invention;

FIG. 3 is a flow chart of matrix LU decomposition by the calculation method according to an embodiment of the invention.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.

Fig. 1 is a flowchart of a matrix storage and calculation method suitable for GPU hardware according to the present invention, and the following describes in detail the matrix storage and calculation method suitable for GPU hardware according to the present invention with reference to fig. 1.

At step 11, the data is stored.

In this step, storing the data further comprises the steps of:

(111) Storing the number of rows and columns of the matrix;

(112) Sequentially storing non-zero elements of the matrix according to the sequence of rows (columns) of the matrix;

(113) Storing a flag for whether each element in each row (column) is non-zero;

(114) Storing the position of the starting non-zero element of each row (column).

FIG. 2 is a schematic diagram of the matrix storage data according to an embodiment of the present invention, and steps (111) to (114) are described below with reference to FIG. 2.

First, in step (111), the number of rows and columns of the matrix are stored.

In the embodiment shown in FIG. 2, the matrix has 5 rows and 6 columns, and these two values are stored.

At step (112), the non-zero elements of the matrix are stored sequentially in the order of the matrix rows (columns).

In this step, the storage order of the matrix data is chosen as needed; the following description takes the row-first scheme as an example. A storage space M[S] for matrix elements is prepared according to the total number S of non-zero elements of the matrix (shown as 101 in FIG. 2). The non-zero elements of the matrix are then stored sequentially in the order of the matrix rows (columns).

Specifically, referring to FIG. 2, the total number S of non-zero elements of the matrix is 12, so the storage space to be prepared for the matrix elements is M[12]. The non-zero elements of the 5 × 6 matrix are then stored in order in M[12].
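
Step (112) can be illustrated with a minimal Python sketch (Python is used here only for readability; the method itself targets GPU code). The 5 × 6 matrix below is a hypothetical example with 12 non-zero elements; it is not the matrix drawn in FIG. 2, whose values are not given in the text.

```python
# Hypothetical 5 x 6 matrix with 12 non-zero elements (not the matrix of FIG. 2).
A = [
    [1.0, 0.0, 2.0, 0.0, 0.0, 3.0],
    [0.0, 4.0, 0.0, 0.0, 5.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [6.0, 7.0, 0.0, 8.0, 0.0, 9.0],
    [0.0, 0.0, 10.0, 11.0, 0.0, 12.0],
]

# Row-first order: walk the rows top to bottom and keep only the non-zero values.
M = [x for row in A for x in row if x != 0.0]
S = len(M)  # total number of non-zero elements, i.e. the length of M
```

With this input, M holds the twelve non-zero values in row order and S is 12, matching the M[12] sizing described above.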

At step (113), a flag is stored for whether each element in each row (column) is non-zero.

In this step, an array F[R] (shown as 102 in FIG. 2) is prepared to store the non-zero flags according to the number of rows (columns) R of the matrix, where F[i-1] stores the non-zero flags of all matrix elements in the ith row (column) consecutively from the low bit to the high bit; each bit corresponds to one matrix element, and a bit value of 1 indicates that the matrix element is non-zero.

For a matrix with no more than 32 columns (rows), each F[i] requires one 32-bit integer variable; for a matrix with no more than 64 columns (rows), each F[i] requires one 64-bit integer variable; for a matrix with more than 64 columns (rows), each F[i] requires ceil(number of matrix columns / 64) 64-bit integer variables, where ceil is the round-up operation. That is, when the number of matrix columns (rows) exceeds 64, the flags of each row (column) are stored consecutively in several 64-bit integer variables.
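
The bit packing described above can be sketched as follows. This is an illustrative Python version under the assumption of row-wise storage with 64-bit flag words; the function name `build_flags` and its parameter `word_bits` are hypothetical and do not come from the patent.

```python
def build_flags(A, word_bits=64):
    """Pack one non-zero flag bit per element, low bit first, row by row."""
    C = len(A[0])
    words_per_row = -(-C // word_bits)  # integer form of ceil(C / word_bits)
    F = []
    for row in A:
        words = [0] * words_per_row
        for j, x in enumerate(row):  # j is 0-based here, i.e. the text's j-1
            if x != 0:
                words[j // word_bits] |= 1 << (j % word_bits)
        F.extend(words)
    return F

# One row, non-zeros in columns 1, 3 and 6 -> bits 0, 2 and 5 are set.
narrow = build_flags([[1, 0, 2, 0, 0, 3]])

# A 70-column row needs ceil(70 / 64) = 2 words; column 70 lands in bit 5 of word 2.
wide_row = [0] * 70
wide_row[0] = 1
wide_row[69] = 1
wide = build_flags([wide_row])
```

Keeping the flags of a row in consecutive words means the word index and bit index of any element follow directly from its column number.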

At step (114), the position of the starting non-zero element of each row (column) is stored.

In this step, the position stored is the ordinal position, within the contiguous storage space M[S] of step (112), of the first non-zero element of each row (column). One extra entry is appended at the end to store the total number of non-zero elements of the whole matrix, which equals the ordinal position in M[S] of the last non-zero element of the last row (column) plus one.

Specifically, in this step, an array P[R+1] (shown as 103 in FIG. 2) is prepared to store the row (column) start positions according to the number of rows R of the matrix, where P[0] = 0, i.e., the start of M, P[i] = P[i-1] + the number of non-zero elements in the ith row (column), and P[R] = the total number of non-zero elements of the matrix.
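
The recurrence for P is a running count of non-zero elements per row. A minimal Python sketch, assuming row-wise storage and using a hypothetical helper name `build_row_starts` and a hypothetical input matrix:

```python
def build_row_starts(A):
    """Return P of length R + 1: P[0] = 0 and P[i] = P[i-1] + non-zeros in row i."""
    P = [0]
    for row in A:
        P.append(P[-1] + sum(1 for x in row if x != 0))
    return P

# Hypothetical 5 x 6 matrix with 3, 2, 0, 4 and 3 non-zeros per row.
A = [
    [1, 0, 2, 0, 0, 3],
    [0, 4, 0, 0, 5, 0],
    [0, 0, 0, 0, 0, 0],
    [6, 7, 0, 8, 0, 9],
    [0, 0, 10, 11, 0, 12],
]
P = build_row_starts(A)
```

An empty row simply repeats the previous entry, and the final entry equals the total number of non-zero elements.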

Therefore, when data is stored in rows, one matrix a can be expressed as a set of the following data:

The number of rows R, the number of columns C, the total number of non-zero elements S of the matrix

Non-zero element data array M, length S

Non-zero flag array F, length R × ceil (C/64)

Head of line position array P, length R +1

If the number of matrix columns is not greater than 32, the length of the non-zero-element flag array is R.
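
As a quick check of the array lengths listed above, the following Python sketch computes them for row-wise storage with 64-bit flag words. The function name `storage_lengths` and the second set of dimensions are hypothetical.

```python
import math

def storage_lengths(R, C, S):
    """Lengths of the three arrays for row-wise storage with 64-bit flag words."""
    return {"M": S, "F": R * math.ceil(C / 64), "P": R + 1}

small = storage_lengths(5, 6, 12)        # the 5 x 6 example with 12 non-zeros
large = storage_lengths(100, 200, 1500)  # hypothetical wider matrix
```

For the 5 × 6 example the flag array needs one word per row, while a 200-column matrix needs ceil(200/64) = 4 words per row.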

At step 12, the data is read.

In this step, data is read on the basis of the data storage of step 11. Specifically, this step further comprises the following steps:

(121) Reading whether the jth bit of the ith entry of the array F of the matrix storage data is 1, to determine whether the element in the ith row (column) and jth column (row) of the matrix is non-zero;

(122) Reading the ith entry of the array P of the matrix storage data to obtain the position P_i, in the array M of the matrix storage data, of the first non-zero element of the ith row (column);

(123) Summing the bits from the 1st bit to the (j-1)th bit of the ith entry of the array F of the matrix storage data to obtain the relative position Q_i, in the array M, of the element in the ith row (column) and jth column (row) with respect to the first non-zero element of that row (column);

(124) Adding P_i and Q_i to obtain the position of the element in the ith row (column) and jth column (row) in the array M of the matrix storage data, and then performing the read or assignment operation.

Specifically, in step (121), it is determined whether the matrix element a_ij is non-zero:

Take the entry f at index (i-1) × ceil(C/64) + (j-1)/64 of the non-zero flag array F, where the division (j-1)/64 is rounded down;

Calculate the result r of j-1 modulo 64;

Take the rth bit of f: if it is 1, a_ij is a non-zero element; if it is 0, a_ij is zero.

For a matrix with no more than 32 columns, this simplifies to:

Take the (i-1)th entry f of the non-zero flag array F;

Take the (j-1)th bit of f: if it is 1, a_ij is a non-zero element; if it is 0, a_ij is zero.
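
The index arithmetic of step (121) can be sketched in Python as below, assuming row-wise storage, 64-bit flag words, and the 1-based indices i and j used in the text. The function name `is_nonzero` and the sample flag words are hypothetical.

```python
def is_nonzero(F, C, i, j):
    """Test the flag bit of a_ij (i and j are 1-based, as in the text)."""
    words_per_row = -(-C // 64)                      # ceil(C / 64)
    f = F[(i - 1) * words_per_row + (j - 1) // 64]   # flag word holding column j
    r = (j - 1) % 64                                 # bit index inside that word
    return ((f >> r) & 1) == 1

# Hypothetical flag words for a 5 x 6 matrix, one word per row.
F = [0b100101, 0b010010, 0b000000, 0b101011, 0b101100]
```

For example, row 1 has bits 0, 2 and 5 set, so columns 1, 3 and 6 are reported as non-zero and column 2 as zero.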

At step (122), the position p of the first non-zero element of the ith row of the matrix is read: p = P[i-1].

At step (123), the offset of the matrix element a_ij with respect to the position of the first non-zero element of the ith row is calculated:

Initialize the position offset q = 0;

Traverse the elements a_ik of the ith row of the matrix, with k from 1 to j-1; if a_ik is a non-zero element, then q = q + 1.

When j reaches 64, the first 64 non-zero flag bits can be fetched at once and summed using a GPU instruction to increase speed.
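
The offset q is the number of set flag bits below bit j-1 in row i, so it can be computed as a population count on whole words plus a masked count on the last word. The Python sketch below assumes row-wise storage with 64-bit flag words; the name `offset_in_row` is hypothetical. On a GPU, the per-word count would typically map to a hardware population-count instruction (in CUDA, for instance, the `__popcll` intrinsic), although the text does not name a specific instruction.

```python
def offset_in_row(F, C, i, j):
    """Count non-zero elements of row i in columns 1 .. j-1 (i, j are 1-based)."""
    words_per_row = -(-C // 64)
    base = (i - 1) * words_per_row
    full_words, r = divmod(j - 1, 64)
    # Whole 64-bit words that lie entirely before column j.
    q = sum(bin(F[base + w]).count("1") for w in range(full_words))
    # Remaining low r bits of the word that contains column j.
    q += bin(F[base + full_words] & ((1 << r) - 1)).count("1")
    return q
```

For a row whose flag word is 0b100101 (non-zeros in columns 1, 3 and 6), the offsets of columns 1, 3 and 6 are 0, 1 and 2.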

At step (124), the matrix element is read as a_ij = M[p+q], or written as M[p+q] = a_ij; the assignment is valid only when a_ij is a non-zero element.
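
Steps (121) to (124) combine into a single lookup. The following Python sketch assumes row-wise storage, 64-bit flag words, and 1-based indices; the names `locate`, `read_element` and `write_element` are hypothetical, as is the choice to raise an error when writing to a position flagged as zero.

```python
def locate(F, P, C, i, j):
    """Return the index of a_ij in M, or None if its flag bit is 0."""
    words_per_row = -(-C // 64)
    base = (i - 1) * words_per_row
    w, r = divmod(j - 1, 64)
    if ((F[base + w] >> r) & 1) == 0:                      # step (121)
        return None
    p = P[i - 1]                                           # step (122)
    q = sum(bin(F[base + t]).count("1") for t in range(w))
    q += bin(F[base + w] & ((1 << r) - 1)).count("1")      # step (123)
    return p + q                                           # step (124)

def read_element(M, F, P, C, i, j):
    pos = locate(F, P, C, i, j)
    return 0.0 if pos is None else M[pos]

def write_element(M, F, P, C, i, j, value):
    pos = locate(F, P, C, i, j)
    if pos is None:
        # Assumption of this sketch: positions flagged as zero have no slot in M.
        raise ValueError("a_ij is not a stored non-zero element")
    M[pos] = value
```

Because every lookup reads only F and P and touches one slot of M, different threads can access different elements without coordinating with each other.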

In step 13, data calculations are performed.

In this step, the addition (subtraction), multiplication, and LU decomposition operations of the matrix may be implemented based on the data storage and data reading methods of step 11 and step 12.

For the addition (subtraction) of matrices, assume that the two matrices participating in the operation are matrix A and matrix B. For the ith row and jth column, determine whether the elements a_ij of matrix A and b_ij of matrix B are non-zero, giving the flags f_A and f_B:

(1) If f_A and f_B are both 1, read the element values of matrices A and B at the ith row and jth column and add (subtract) them to obtain the element value at the ith row and jth column of the result matrix C;

(2) If f_A is 1 and f_B is 0, take the element value of matrix A at the ith row and jth column as the element value at the ith row and jth column of the result matrix C;

(3) If f_A is 0 and f_B is 1, take the positive (negative) element value of matrix B at the ith row and jth column as the element value at the ith row and jth column of the result matrix C;

(4) If f_A and f_B are both 0, the element value at the ith row and jth column of the result matrix C is also zero.
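
The four cases above reduce to a small per-element function, which is what each GPU thread would evaluate for its own (i, j). A minimal Python sketch; the name `add_element` and the `subtract` switch are hypothetical, and the values passed for a flagged-zero operand are ignored.

```python
def add_element(f_a, a, f_b, b, subtract=False):
    """Combine one pair of elements given their non-zero flags f_a and f_b."""
    sign = -1.0 if subtract else 1.0
    if f_a and f_b:        # case (1): both non-zero
        return a + sign * b
    if f_a:                # case (2): only A is non-zero
        return a
    if f_b:                # case (3): only B is non-zero, sign depends on add/subtract
        return sign * b
    return 0.0             # case (4): both zero
```

Only in case (1) are both stored values actually read, so the flags save memory accesses whenever at least one operand is zero.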

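For the multiplication of matrices A and B, the number of columns of matrix A must equal the number of rows of matrix B, i.e., C_A and R_B are equal. Each element c_ij of the result matrix can be obtained by the following algorithm:

Initialize the variable v = 0;

Traverse the elements a_ik of row i of matrix A and the elements b_kj of column j of matrix B, with k from 1 to C_A; if a_ik and b_kj are both non-zero elements, then v = v + a_ik × b_kj;

Obtain the result c_ij = v.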
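
The per-element product can be sketched in Python as below. The class `BitmapMatrix` is a hypothetical, simplified container that assumes row-wise storage and at most 64 columns (one flag word per row), and it uses 0-based indices for brevity. The sketch covers only the accumulation loop for one c_ij, not the preliminary test of c_ij mentioned elsewhere in the text.

```python
class BitmapMatrix:
    """Row-wise bitmap storage; assumes at most 64 columns (one flag word per row)."""

    def __init__(self, dense):
        self.R, self.C = len(dense), len(dense[0])
        self.M, self.F, self.P = [], [], [0]
        for row in dense:
            word = 0
            for j, x in enumerate(row):
                if x != 0:
                    word |= 1 << j
                    self.M.append(x)
            self.F.append(word)
            self.P.append(len(self.M))

    def flag(self, i, j):          # 0-based indices
        return (self.F[i] >> j) & 1

    def value(self, i, j):         # only meaningful when flag(i, j) is 1
        q = bin(self.F[i] & ((1 << j) - 1)).count("1")
        return self.M[self.P[i] + q]


def multiply_element(A, B, i, j):
    """Compute c_ij; on a GPU one thread would run this for its own (i, j)."""
    v = 0.0
    for k in range(A.C):
        if A.flag(i, k) and B.flag(k, j):   # skip products with a zero factor
            v += A.value(i, k) * B.value(k, j)
    return v


A = BitmapMatrix([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]])
B = BitmapMatrix([[0.0, 4.0], [5.0, 0.0], [6.0, 0.0]])
C = [[multiply_element(A, B, i, j) for j in range(B.C)] for i in range(A.R)]
```

Testing both flags before reading any value means that M is accessed only for products that actually contribute to the sum.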

In matrix addition, subtraction, and multiplication, the element at each position is computed independently and is therefore thread-safe, so a multithreaded parallel algorithm can be used on the GPU, with each thread computing the result at one position.

The LU decomposition of the matrix requires shared memory and thread synchronization techniques for data access between threads.

FIG. 3 is a flowchart of matrix LU decomposition by the calculation method according to an embodiment of the present invention, which is described in detail below with reference to FIG. 3:

(1) Read the data of the compressed matrix described by the invention (the numbers of rows and columns of the matrix should be equal) into a complete, contiguous dense matrix D. The dense matrix is stored in shared memory so that any thread of the GPU can access it, and its size is R × C:

If a_ij is a non-zero element, d_ij = a_ij; otherwise d_ij = 0;

Synchronize all threads of the GPU to ensure the consistency of the data in the D matrix;

(2) Traverse the kth row and the kth column of the D matrix being decomposed, with k from 1 to R (C), repeating the following steps (3) to (6);

(3) Take the elements of row k of the D matrix from column k to column C as the kth row of the U matrix:

If u_kj is a non-zero element, then u_kj = d_kj, with j ranging from k+1 to C;

(4) Divide the elements of column k of the D matrix from row k+1 to row R by d_kk to obtain the kth column of the L matrix:

If l_ik is a non-zero element, then l_ik = d_ik / d_kk, with i ranging from k+1 to R;

(5) Update all elements to the lower right of row k, column k of the D matrix:

If l_ik and u_kj are both non-zero elements, then d_ij = d_ij - l_ik × u_kj, with i ranging from k+1 to R and j ranging from k+1 to C;

(6) Synchronize all threads of the GPU to ensure the consistency of all data in the D matrix.

In the decomposition process, the calculation process of L, U, D matrix elements under the condition of the same k value can use the multithreading technology of the GPU for parallel calculation, and each thread processes different (i, j) positions to obtain better parallel efficiency. Different values of k need to be calculated in order and the data in the shared memory is synchronized by using thread synchronization operation.
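
The control flow of the decomposition can be sketched sequentially in Python. This is an illustration under several assumptions that the text does not state: the input is already the dense square matrix D, no pivoting is performed, L is given a unit diagonal, and row k of U includes the diagonal entry d_kk. The function name `lu_decompose` is hypothetical. The loops over i and j inside one value of k stand in for the GPU threads, and the comments mark where a GPU version would synchronize (in CUDA, for example, with `__syncthreads()` for threads of one block).

```python
def lu_decompose(A):
    """Return (L, U) with A = L * U; no pivoting, unit diagonal on L (assumptions)."""
    n = len(A)
    D = [row[:] for row in A]  # working copy, stands in for shared memory
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):                       # values of k must run in order
        for j in range(k, n):                # row k of U (parallel over j)
            U[k][j] = D[k][j]
        for i in range(k + 1, n):            # column k of L (parallel over i)
            L[i][k] = D[i][k] / D[k][k]
        for i in range(k + 1, n):            # trailing update (parallel over i, j)
            for j in range(k + 1, n):
                if L[i][k] != 0 and U[k][j] != 0:
                    D[i][j] -= L[i][k] * U[k][j]
        # A GPU version would synchronize all threads here before the next k.
    return L, U


L, U = lu_decompose([[2.0, 0.0, 1.0], [4.0, 1.0, 0.0], [0.0, 3.0, 5.0]])
```

The zero tests in the trailing update mirror the "both non-zero" condition of step (5): an update is skipped whenever either factor is zero.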

The invention further provides a computer-readable storage medium, on which computer instructions are stored, and the computer instructions execute the steps of the matrix storage and calculation method suitable for the GPU hardware when running, and the matrix storage and calculation method suitable for the GPU hardware is described in the foregoing section and is not described again.

Those of ordinary skill in the art will understand that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A matrix storage and calculation method suitable for GPU hardware, comprising the following steps:

3) Performing matrix operations using the GPU.

2. The matrix storage and calculation method suitable for GPU hardware of claim 1, wherein the step 1) further comprises:

Storing the number of rows and columns of the matrix;

3. A method for matrix storage and computation suitable for GPU hardware as claimed in claim 2, wherein the step of sequentially storing non-zero elements of the matrix into the first array in the order of rows or columns of the matrix further comprises determining the size of the first array according to the total number of non-zero elements of the matrix.

4. The method of claim 2, wherein the step of storing the flag indicating whether each element in each row or column is non-zero in the second array further comprises determining the size of the second array according to the number of rows or columns of the matrix, and storing the non-zero flags of the matrix elements in the rows or columns respectively and continuously in the second array from low to high, wherein each bit corresponds to one matrix element, and a bit value of 1 indicates that the corresponding matrix element is non-zero.

5. The method for storing and computing matrices adapted for use in GPU hardware of claim 1, wherein said step 2) further comprises the steps of:

Calculating the position of the data to be read in the first array, and recording the position as a third position, wherein the third position = the first position + the second position;

6. A matrix storage and calculation method suitable for GPU hardware as claimed in claim 1, wherein said step 3) comprises matrix addition, matrix subtraction, matrix multiplication and matrix LU decomposition algorithm.

7. A method for matrix storage and computation for GPU hardware as in claim 6, where the matrix addition comprises the steps of:

8. A matrix storage and calculation method suitable for GPU hardware according to claim 6, wherein the matrix multiplication comprises the following steps:

Initializing the variable v = 0;

Traversing the elements a_ik of row i of matrix A, with k from 1 to C_A, together with the elements b_kj of column j of matrix B, with k from 1 to R_B; if a_ik and b_kj are both non-zero elements, then v = v + a_ik × b_kj;

Obtaining the result c_ij = v.

9. A matrix storage and calculation method suitable for GPU hardware according to claim 6, wherein the LU decomposition algorithm of the matrix comprises the following steps:

Reading the compressed matrix data into a dense matrix D, wherein d_ij = a_ij if a_ij is a non-zero element and d_ij = 0 otherwise, and then synchronizing all threads of the GPU to ensure the consistency of the data in the D matrix;

Taking the elements of row k of the D matrix from column k to column C as the kth row of the U matrix, wherein if u_kj is a non-zero element then u_kj = d_kj, with j ranging from k+1 to C;

Dividing the elements of column k of the D matrix from row k+1 to row R by d_kk to obtain the kth column of the L matrix, wherein if l_ik is a non-zero element then l_ik = d_ik / d_kk, with i ranging from k+1 to R;

Updating all elements to the lower right of row k, column k of the D matrix, wherein if l_ik and u_kj are both non-zero elements then d_ij = d_ij - l_ik × u_kj, with i ranging from k+1 to R and j ranging from k+1 to C;

10. A computer-readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed, perform the steps of the matrix storage and calculation method suitable for GPU hardware of any one of claims 1 to 9.