CN112560356A - Sparse matrix vector multiply many-core optimization method for many-core architecture - Google Patents


Info

Publication number
CN112560356A
Authority
CN
China
Prior art keywords
vector, block, sparse matrix, small, many
Prior art date
Legal status
Withdrawn
Application number
CN201910919675.9A
Other languages
Chinese (zh)
Inventor
郭恒
陈鑫
刘鑫
陈德训
李芳
徐金秀
孙唯哲
Current Assignee
Wuxi Jiangnan Computing Technology Institute
Original Assignee
Wuxi Jiangnan Computing Technology Institute
Priority date
Filing date
Publication date
Application filed by Wuxi Jiangnan Computing Technology Institute
Priority to CN201910919675.9A
Publication of CN112560356A
Legal status: Withdrawn

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a many-core-architecture-oriented sparse matrix-vector multiplication many-core optimization method. Given a sparse matrix A with m rows and n columns and a vector x of length n, the method solves for the vector y of length m, where y = Ax is the product of the sparse matrix A and the vector x. It comprises the following steps: S1, define the x vector block size blk_x_size and partition the x vector by blocking its elements according to their subscripts; S2, using the x vector blocking information from S1 (i.e., the number of the x vector block in which each x vector element falls), determine, for every non-zero element of each row of the original sparse matrix A, the x vector block number corresponding to its column index, thereby collecting for each row of the sparse matrix the numbers of the x vector blocks it needs during sparse matrix-vector multiplication. The invention improves overall many-core acceleration performance, improves the locality of data access, and yields a marked optimization effect for unstructured-grid CFD applications.

Description

Sparse matrix vector multiply many-core optimization method for many-core architecture
Technical Field
The invention belongs to the technical field of sparse matrix-vector multiplication, and in particular relates to a sparse matrix-vector multiplication many-core optimization method for many-core architectures.
Background
As research on unstructured-grid CFD applications deepens and supercomputer technology develops rapidly, many-core acceleration of sparse matrix-vector multiplication has become one of the focal points of CFD application optimization research.
Because the non-zero elements of the sparse matrices generated by unstructured-grid CFD applications are loosely distributed and their indices span wide ranges, sparse matrix-vector multiplication (SpMV), one of the program's computational kernels, suffers from a pronounced discrete memory access problem, which also makes many-core optimization of unstructured-grid CFD applications difficult.
As the mesh scale grows, the discrete memory access problem in sparse matrix-vector multiplication becomes still more pronounced, causing a large performance loss for CFD applications and sometimes even becoming the main bottleneck of the entire application. To solve the many-core optimization problem for sparse matrix-vector multiplication, the properties of the sparse matrices arising in unstructured-grid CFD applications must be fully exploited and a more targeted, more detailed many-core optimization algorithm devised, so as to improve overall program performance.
For the many-core optimization of sparse matrix-vector multiplication, the traditional approach is to partition the original sparse matrix into row-column blocks of a suitable size, with each acceleration core responsible for computing part of the block data. This method can effectively reduce discrete memory access in sparse matrix-vector multiplication, but the sparse matrices generated in unstructured-grid CFD applications are often extremely sparse (each row typically holds only a few non-zero elements), and the distribution of their non-zero elements follows certain statistical regularities. Because the matrix is extremely sparse, the traditional fixed-width row-column blocking method suffers a severe load-balancing problem and cannot effectively use the data transmission bandwidth between the CPU and the many-core coprocessor, so its SpMV many-core optimization effect for CFD applications is poor.
Disclosure of Invention
The invention aims to provide a many-core-architecture-oriented sparse matrix-vector multiplication many-core optimization method that greatly improves the locality of data access, effectively uses the data transmission bandwidth between the CPU and the many-core coprocessor, and yields a marked optimization effect for unstructured-grid CFD applications.
To achieve this aim, the invention adopts the following technical scheme: a many-core-architecture-oriented sparse matrix-vector multiplication many-core optimization method comprising the following steps:
Given: a sparse matrix A with m rows and n columns and a vector x of length n; solve for the vector y of length m, where y = Ax is the product of the sparse matrix A and the vector x;
S1, define the x vector block size blk_x_size and partition the x vector by blocking its elements according to their subscripts;
S2, according to the x vector blocking information from S1 (i.e., the number of the x vector block in which each x vector element falls), determine, for every non-zero element of each row of the original sparse matrix A, the x vector block number corresponding to its column index, thereby collecting for each row of the sparse matrix the numbers of the x vector blocks it needs during sparse matrix-vector multiplication;
S3, merge all matrix rows that need exactly the same set of x vector blocks during sparse matrix-vector multiplication, forming the small_blocks;
S4, using value as the performance evaluation function for SpMV (defined in S41 below), merge some of the small_blocks into big_blocks under the constraint that the number of x vector blocks a merged block depends on does not exceed a threshold;
S5, build a sparse matrix A' from the big_block information, where each big_block records a number of rows of the sparse matrix A together with the x vector block information those rows need during sparse matrix-vector multiplication; map the rows of A to A' in big_block array order and write them into A' in that order;
S6, perform SpMV many-core acceleration on the matrix information produced by the blocking of preprocessing stages S1-S5: in the sparse matrix A' the non-zero elements are arranged in big_block array order; in the SpMV many-core acceleration algorithm the unit of computation is the big_block; within one core, the non-zero element data of a big_block is loaded in, the x vector block data the big_block needs is loaded in, and the computed results are then written back to main memory;
a big_block is the basic data unit processed by each compute core of the many-core processor.
Further refinements of the above technical scheme are as follows:
1. In the above scheme, in step S1 the x vector block size blk_x_size is 256 or 512; the element at subscript ind_x of the x vector falls into x block number ind_x / blk_x_size, rounded down (i.e., ⌊ind_x / blk_x_size⌋).
2. In the foregoing solution, the statistics in step S2 are computed as follows: if a non-zero element of the sparse matrix A lies in the column with column number c, then the number of the x vector block that non-zero element needs during sparse matrix-vector multiplication is ⌊c / blk_x_size⌋. The x vector block numbers corresponding to the non-zero elements of each row of matrix A can therefore be computed; within one row of A, if several non-zero elements yield the same x vector block number, that number is recorded only once in the list of x vector blocks the row requires.
3. In the foregoing scheme, in step S3 the small_blocks are built from the per-row x vector block information computed in S2: a small_block records the row numbers of several rows of matrix A, and the rows of A recorded in one small_block all require exactly the same set of x vector blocks during sparse matrix-vector multiplication.
4. In the foregoing solution, the operation in step S4 is specifically:
S41, for a small_block, suppose it contains k rows of the sparse matrix A, those k rows of A contain sum non-zero elements in total, and the rows in the small_block need n_blk_x x vector blocks during sparse matrix-vector multiplication; for this small_block, define the performance evaluation function value = sum / (n_blk_x × blk_x_size), and set the threshold on the number of x vector blocks to 10, i.e., if n_blk_x > 10 then value is 0;
S42, traverse the original small_blocks; on reaching the i-th small_block, compute the value obtained by merging it with each of the remaining small_blocks that have not yet taken part in a merge, select the j-th small_block that maximizes this value, and merge the i-th and j-th small_blocks into a new small_block;
S43, one traversal of the small_block array forms a new small_block array; traverse again on the basis of the new array, repeating 3-10 times; the small_block array finally formed is renamed the big block array, i.e., the big_block array.
5. In the above scheme, merging two small_blocks means: the rows in the two small_blocks are merged into a new small_block, and the x vector block information the rows of the new small_block need during sparse matrix-vector multiplication is recomputed.
6. In the above scheme, a big_block contains multiple rows of non-zero elements of the original sparse matrix, and the number of x vector blocks it needs during SpMV computation is bounded by the threshold, at most 10.
7. In the above scheme, the threshold is set to 8, 10 or 12.
Owing to the above technical scheme, the invention offers the following advantages over the prior art:
The many-core-architecture-oriented sparse matrix-vector multiplication many-core optimization method fully exploits the structural characteristics of the sparse matrices in unstructured-grid CFD applications; through irregular blocking it makes full use of the data transmission bandwidth between the CPU and the many-core coprocessor, improves overall many-core acceleration performance, improves the locality of data access, and yields a marked optimization effect for unstructured-grid CFD applications. The method applies well to unstructured-grid CFD applications and holds good promise for other scientific applications whose unstructured meshes generate matrices with similar properties.
Drawings
FIG. 1 is a schematic diagram of the irregular partitioning of the sparse matrix according to the present invention.
Detailed Description
The invention is further described below with reference to the following examples:
example (b): a sparse matrix vector multiply many-core optimization method for many-core architecture comprises the following steps:
s1, known: a sparse matrix A with m rows and n columns and a vector x with the length of n; solving a vector y with the length of m, wherein the y is the dot product of the sparse matrix A and the vector x;
s1, defining the size blk _ x _ size of the x vector block, and blocking the x vector elements according to the subscripts of the x vector elements to block the x vector;
s2, counting the number of an original sparse matrix, namely the number of an x vector block corresponding to the column number of each row of non-zero elements in the sparse matrix A according to the blocking information of the x vector, namely the number information of the x vector block where the x vector element obtained by solving in S1 is located, and thus counting the number information of the x vector block needed by each row of the sparse matrix when the sparse matrix is multiplied by the vector;
s3, combining all matrix rows with the same number of x vector blocks required in sparse matrix vector multiplication to form all small _ blocks;
s4, taking block as a performance evaluation function when Spmv, and combining part small _ block to form big _ block on the premise that the number of the dependent x vector blocks does not exceed a threshold value;
s5, establishing a sparse matrix A ' according to big _ block information, wherein the sparse matrix A comprises a plurality of lines in the sparse matrix A and x vector block information required in sparse matrix vector multiplication calculation, mapping the lines in the A to the A ' according to big _ block array sequence, and writing the lines in the A to the A ' according to big _ block array sequence;
s6, carrying out Spmv many-core acceleration according to matrix information after the block division of the preprocessing stages S1-S5, namely in a sparse matrix A', non-zero elements are arranged according to a big _ block array sequence, in a Spmv many-core acceleration algorithm, a computing unit is big _ block, in one core, non-zero element data in big _ block is imported, x vector block data required in big _ block is imported, and then a computing result is written back to a main memory;
the big _ block is a basic data unit processed by each computing core in the many cores.
Further refinements of the above technical scheme are as follows:
In step S1 above, the x vector block size blk_x_size is 256 or 512; the element at subscript ind_x of the x vector falls into x block number ind_x / blk_x_size, rounded down (i.e., ⌊ind_x / blk_x_size⌋).
The statistics in step S2 are computed as follows: if a non-zero element of the sparse matrix A lies in the column with column number c, then the number of the x vector block that non-zero element needs during sparse matrix-vector multiplication is ⌊c / blk_x_size⌋. The x vector block numbers corresponding to the non-zero elements of each row of matrix A can therefore be computed; within one row of A, if several non-zero elements yield the same x vector block number, that number is recorded only once in the list of x vector blocks the row requires.
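A minimal sketch of these S2 statistics, assuming A is stored in CSR form; the function name and container choices are illustrative assumptions:

```cpp
// S2: for each row of A, collect the set of x vector block numbers its
// non-zero columns fall into, recording each block number only once per row.
#include <set>
#include <vector>

std::vector<std::vector<int>> required_x_blocks(
        const std::vector<int>& row_ptr,  // CSR row pointers of A
        const std::vector<int>& col,      // column index of each non-zero
        int blk_x_size) {
    int m = static_cast<int>(row_ptr.size()) - 1;
    std::vector<std::vector<int>> req(m);
    for (int r = 0; r < m; ++r) {
        std::set<int> blocks;  // deduplicates block numbers within the row
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k)
            blocks.insert(col[k] / blk_x_size);  // = floor(c / blk_x_size)
        req[r].assign(blocks.begin(), blocks.end());  // sorted, unique
    }
    return req;
}
```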
In step S3 the small_blocks are built from the per-row x vector block information computed in S2: a small_block records the row numbers of several rows of matrix A, and the rows of A recorded in one small_block all require exactly the same set of x vector blocks during sparse matrix-vector multiplication.
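The S3 grouping can then be sketched as a single pass over the per-row lists from the S2 sketch above; because those lists are sorted and duplicate-free, they can key a map directly. The struct and function names are illustrative:

```cpp
// S3: rows whose required x-block lists are identical form one small_block.
#include <map>
#include <vector>

struct SmallBlock {
    std::vector<int> rows;      // row numbers of A in this small_block
    std::vector<int> x_blocks;  // the common set of required x vector blocks
};

std::vector<SmallBlock> build_small_blocks(
        const std::vector<std::vector<int>>& req) {  // from required_x_blocks()
    std::map<std::vector<int>, std::vector<int>> groups;  // block list -> rows
    for (int r = 0; r < static_cast<int>(req.size()); ++r)
        groups[req[r]].push_back(r);
    std::vector<SmallBlock> out;
    for (const auto& [blocks, rows] : groups)
        out.push_back({rows, blocks});
    return out;
}
```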
The operation in step S4 is specifically:
S41, for a small_block, suppose it contains k rows of the sparse matrix A, those k rows of A contain sum non-zero elements in total, and the rows in the small_block need n_blk_x x vector blocks during sparse matrix-vector multiplication; for this small_block, define the performance evaluation function value = sum / (n_blk_x × blk_x_size), and set the threshold on the number of x vector blocks to 10, i.e., if n_blk_x > 10 then value is 0;
S42, traverse the original small_blocks; on reaching the i-th small_block, compute the value obtained by merging it with each of the remaining small_blocks that have not yet taken part in a merge, select the j-th small_block that maximizes this value, and merge the i-th and j-th small_blocks into a new small_block;
S43, one traversal of the small_block array forms a new small_block array; traverse again on the basis of the new array, repeating 3-10 times; the small_block array finally formed is renamed the big block array, i.e., the big_block array.
Merging two small_blocks means: the rows in the two small_blocks are merged into a new small_block, and the x vector block information the rows of the new small_block need during sparse matrix-vector multiplication is recomputed. A sketch of this merge together with the S41-S43 greedy pass follows.
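The sketch below shows one S42 pass together with the S41 value function and the merge operation, reusing SmallBlock from the S3 sketch. The S41 threshold is hard-coded to 10 (the text also mentions 8 and 12); nnz_per_row and the other names are illustrative assumptions, and the O(n²) pairwise scan per pass is tolerable because, as noted below, the preprocessing runs only once.

```cpp
// S41-S43: greedy merging of small_blocks into big_blocks.
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// S41: value = sum / (n_blk_x * blk_x_size); 0 if more than 10 x blocks needed.
static double merged_value(const SmallBlock& a, const SmallBlock& b,
                           const std::vector<int>& nnz_per_row, int blk_x_size) {
    std::vector<int> u;  // union of the two required x-block sets
    std::set_union(a.x_blocks.begin(), a.x_blocks.end(),
                   b.x_blocks.begin(), b.x_blocks.end(), std::back_inserter(u));
    if (static_cast<int>(u.size()) > 10) return 0.0;  // S41 threshold
    long long sum = 0;  // total non-zeros of the merged rows
    for (int r : a.rows) sum += nnz_per_row[r];
    for (int r : b.rows) sum += nnz_per_row[r];
    return static_cast<double>(sum) / (static_cast<double>(u.size()) * blk_x_size);
}

// S42: each unmerged small_block is paired with the partner maximizing value.
// S43: calling this 3-10 times yields the final big_block array.
std::vector<SmallBlock> merge_pass(const std::vector<SmallBlock>& in,
                                   const std::vector<int>& nnz_per_row,
                                   int blk_x_size) {
    std::vector<SmallBlock> out;
    std::vector<bool> used(in.size(), false);
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (used[i]) continue;
        used[i] = true;
        double best = 0.0;
        std::size_t best_j = i;
        for (std::size_t j = i + 1; j < in.size(); ++j) {
            if (used[j]) continue;
            double v = merged_value(in[i], in[j], nnz_per_row, blk_x_size);
            if (v > best) { best = v; best_j = j; }
        }
        SmallBlock merged = in[i];
        if (best_j != i) {  // merge rows and recompute the x-block union
            used[best_j] = true;
            merged.rows.insert(merged.rows.end(),
                               in[best_j].rows.begin(), in[best_j].rows.end());
            std::vector<int> u;
            std::set_union(in[i].x_blocks.begin(), in[i].x_blocks.end(),
                           in[best_j].x_blocks.begin(), in[best_j].x_blocks.end(),
                           std::back_inserter(u));
            merged.x_blocks = u;
        }
        out.push_back(merged);
    }
    return out;
}
```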
A big_block contains multiple rows of non-zero elements of the original sparse matrix, and the number of x vector blocks it needs during SpMV computation is bounded by the threshold, at most 10.
The threshold is set to 8, 10 or 12.
The above-mentioned aspects of the invention are further explained as follows:
at present, the solution of the large-scale sparse matrix linear equation set mainly adopts iteration methods such as PCG (pulse code generator) and the like. Spmv is a relatively time-consuming part of a single iteration of the PCG. In the PCG, the value of the sparse matrix is kept unchanged, so that the preprocessing part only needs to do once when the PCG solves a large-scale sparse matrix linear equation set, and the time can be ignored compared with the time consumed by iteration. In addition, for the mapping operation of the row elements of the original sparse matrix, the PCG is essentially to solve the equation set Ax ═ b, and simultaneously exchange the row elements at corresponding positions in a, x and b, and the equation is still true, that is, the mapping operation of the row of the sparse matrix only needs to be performed once in the preprocessing stage and the PCG iteration ending stage, and the loss caused by the mapping operation can be ignored compared with the whole iteration time.
After the optimization, the sparse matrix structure characteristic applied by the non-structural grid CFD can be fully utilized in the whole Spmv many-core acceleration process, the data transmission bandwidth between the CPU and the many-core coprocessor is fully utilized by utilizing an irregular blocking mode, and the whole many-core acceleration performance is improved. The experimental tests show that the many-core acceleration strategy provided by the invention can effectively improve the Spmv performance by 12-14 times compared with a main core within 10 ten thousand grids.
When the sparse matrix vector multiply-many-core optimization method facing the many-core architecture is adopted, the property of a sparse matrix in the application of the non-structural grid CFD is fully utilized, the data transmission bandwidth between a CPU and a many-core coprocessor is fully utilized in a sparse matrix row mapping and irregular blocking mode, and the Spmv many-core performance is improved.
To facilitate a better understanding of the invention, the terms used herein will be briefly explained as follows:
discrete memory access: the English language is discrete-time storage, and due to the data structure characteristics of the unstructured grid, stored data are discrete and irregular, so that the Cache hit rate in the calculation is low, and a CPU (central processing unit) often needs to frequently access a memory. In application programs, discrete access problems widely occur in problems such as flux calculation, large-scale linear equation set solution, sparse matrix vector multiplication and the like. This is a common phenomenon for scientific computing type applications.
Spmv: sparse Matrix-Vector Multiplication (Sparse Matrix-Vector Multiplication) is one of the common computational cores in the scientific computational problem, and the solved problem is in the form of y ═ Ax, wherein a and x are known Sparse moments and vectors respectively, y is a Vector to be solved, and the solving method is Matrix Vector Multiplication of the Sparse Matrix a and the known Vector x. Because the non-zero elements in the sparse matrix are irregularly distributed, the Spmv faces a more serious scattered memory access problem, and the point of reducing or avoiding the scattered memory access is the optimization of the Spmv.
Ax ═ b: a large scale sparse matrix linear system of equations. Knowing the sparse matrix a and the vector b, the x vector satisfying the condition is solved.
PCG: conjugate gradient method with preconditioner. The method is a common effective method for solving a large-scale sparse matrix equation set (Ax ═ b). As an iterative method, each step of iteration of the PCG has at least one Spmv operation, and the Spmv operation occupies a larger time proportion in a single iteration and is one of the key points of the optimization of the PCG.
CFD: and calculating fluid mechanics. The fluid flow problem is solved numerically.
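For reference, a textbook CSR implementation of SpMV (a generic sketch, not the patent's optimized kernel): the indirect read x[col[k]] is precisely the discrete memory access that the blocking scheme above is designed to tame.

```cpp
// Baseline CSR SpMV: y = A x.
#include <cstddef>
#include <vector>

void spmv_csr(const std::vector<int>& row_ptr, const std::vector<int>& col,
              const std::vector<double>& val, const std::vector<double>& x,
              std::vector<double>& y) {
    for (std::size_t r = 0; r + 1 < row_ptr.size(); ++r) {
        double acc = 0.0;
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k)
            acc += val[k] * x[col[k]];  // irregular, cache-unfriendly read
        y[r] = acc;
    }
}
```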
The above embodiments are merely illustrative of the technical concepts and features of the present invention, and the purpose of the embodiments is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims (8)

1. A many-core-architecture-oriented sparse matrix-vector multiplication many-core optimization method, characterized in that: a sparse matrix A with m rows and n columns and a vector x of length n are known; a vector y of length m is solved for, where y = Ax is the product of the sparse matrix A and the vector x;
the sparse matrix-vector multiplication many-core optimization method comprises the following steps:
S1, define the x vector block size blk_x_size and block the x vector elements according to their subscripts;
S2, according to the x vector blocking information from S1 (i.e., the number of the x vector block in which each x vector element falls), determine, for every non-zero element of each row of the original sparse matrix A, the x vector block number corresponding to its column index, thereby collecting for each row of the sparse matrix the numbers of the x vector blocks it needs during sparse matrix-vector multiplication;
S3, merge all matrix rows that need exactly the same set of x vector blocks during sparse matrix-vector multiplication, forming the small_blocks;
S4, using value as the performance evaluation function for SpMV, merge some of the small_blocks into big_blocks under the constraint that the number of x vector blocks a merged block depends on does not exceed a threshold;
S5, build a sparse matrix A' from the big_block information, where each big_block records a number of rows of the sparse matrix A together with the x vector block information those rows need during sparse matrix-vector multiplication; map the rows of A to A' in big_block array order and write them into A' in that order;
S6, perform SpMV many-core acceleration on the matrix information produced by the blocking of preprocessing stages S1-S5: in the sparse matrix A' the non-zero elements are arranged in big_block array order; in the SpMV many-core acceleration algorithm the unit of computation is the big_block; within one core, the non-zero element data of a big_block is loaded in, the x vector block data the big_block needs is loaded in, and the computed results are then written back to main memory;
a big_block is the basic data unit processed by each compute core of the many-core processor.
2. The many-core-architecture-oriented sparse matrix-vector multiplication many-core optimization method of claim 1, wherein: in step S1, the x vector block size blk_x_size is 256 or 512; the element at subscript ind_x of the x vector falls into x block number ind_x / blk_x_size, rounded down (i.e., ⌊ind_x / blk_x_size⌋).
3. The many-core-architecture-oriented sparse matrix-vector multiplication many-core optimization method of claim 1, wherein: the statistics in step S2 are computed as follows: if a non-zero element of the sparse matrix A lies in the column with column number c, then the number of the x vector block that non-zero element needs during sparse matrix-vector multiplication is ⌊c / blk_x_size⌋; the x vector block numbers corresponding to the non-zero elements of each row of matrix A can therefore be computed, and within one row of A, if several non-zero elements yield the same x vector block number, that number is recorded only once in the list of x vector blocks the row requires.
4. The many-core-architecture-oriented sparse matrix-vector multiplication many-core optimization method of claim 1, wherein: in step S3 the small_blocks are built from the per-row x vector block information computed in S2: a small_block records the row numbers of several rows of matrix A, and the rows of A recorded in one small_block all require exactly the same set of x vector blocks during sparse matrix-vector multiplication.
5. The many-core architecture-oriented sparse matrix vector multiply many-core optimization method of claim 1, wherein: the specific operation in step S4 is:
S41, for a small_block, suppose it contains k rows of the sparse matrix A, those k rows of A contain sum non-zero elements in total, and the rows in the small_block need n_blk_x x vector blocks during sparse matrix-vector multiplication; for this small_block, define the performance evaluation function value = sum / (n_blk_x × blk_x_size), and set the threshold on the number of x vector blocks to 10, i.e., if n_blk_x > 10 then value is 0;
S42, traverse the original small_blocks; on reaching the i-th small_block, compute the value obtained by merging it with each of the remaining small_blocks that have not yet taken part in a merge, select the j-th small_block that maximizes this value, and merge the i-th and j-th small_blocks into a new small_block;
S43, one traversal of the small_block array forms a new small_block array; traverse again on the basis of the new array, repeating 3-10 times; the small_block array finally formed is renamed the big block array, i.e., the big_block array.
6. The many-core-architecture-oriented sparse matrix-vector multiplication many-core optimization method of claim 5, wherein: merging two small_blocks means: the rows in the two small_blocks are merged into a new small_block, and the x vector block information the rows of the new small_block need during sparse matrix-vector multiplication is recomputed.
7. The many-core-architecture-oriented sparse matrix-vector multiplication many-core optimization method of claim 1, wherein: a big_block contains multiple rows of non-zero elements of the original sparse matrix, and the number of x vector blocks it needs during SpMV computation is bounded by the threshold, at most 10.
8. The many-core architecture-oriented sparse matrix vector multiply many-core optimization method of claim 1, wherein: the threshold is set to 8, 10 or 12.
CN201910919675.9A 2019-09-26 2019-09-26 Sparse matrix vector multiply many-core optimization method for many-core architecture Withdrawn CN112560356A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910919675.9A CN112560356A (en) 2019-09-26 2019-09-26 Sparse matrix vector multiply many-core optimization method for many-core architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910919675.9A CN112560356A (en) 2019-09-26 2019-09-26 Sparse matrix vector multiply many-core optimization method for many-core architecture

Publications (1)

Publication Number Publication Date
CN112560356A (en) 2021-03-26

Family

ID=75030138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910919675.9A Withdrawn CN112560356A (en) 2019-09-26 2019-09-26 Sparse matrix vector multiply many-core optimization method for many-core architecture

Country Status (1)

Country Link
CN (1) CN112560356A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114925320B (en) * 2021-09-28 2023-10-20 华为技术有限公司 Data processing method and related device
CN114329315A (en) * 2022-01-04 2022-04-12 中国空气动力研究与发展中心计算空气动力研究所 Static aeroelastic rapid solving method based on dynamic modal decomposition technology
CN114329315B (en) * 2022-01-04 2023-03-31 中国空气动力研究与发展中心计算空气动力研究所 Static aeroelastic rapid solving method based on dynamic modal decomposition technology
WO2024007652A1 (en) * 2022-07-06 2024-01-11 芯和半导体科技(上海)股份有限公司 Accelerated solving method for large sparse matrix, system, and storage medium

Similar Documents

Publication Publication Date Title
CN112560356A (en) Sparse matrix vector multiply many-core optimization method for many-core architecture
CN104915322B (en) A kind of hardware-accelerated method of convolutional neural networks
CN103336758B (en) The sparse matrix storage means of a kind of employing with the sparse row of compression of local information and the SpMV implementation method based on the method
CN108805272A (en) A kind of general convolutional neural networks accelerator based on FPGA
CN102110079B (en) Tuning calculation method of distributed conjugate gradient method based on MPI
Chen et al. A two-layered parallel static security assessment for large-scale grids based on GPU
CN107451097B (en) High-performance implementation method of multi-dimensional FFT on domestic Shenwei 26010 multi-core processor
CN116384312B (en) Circuit yield analysis method based on parallel heterogeneous computation
Yang et al. A Winograd-based CNN accelerator with a fine-grained regular sparsity pattern
CN102841881A (en) Multiple integral computing method based on many-core processor
CN106021188A (en) Parallel hardware architecture and parallel computing method for floating point matrix inversion
CN112434451A (en) Finite element analysis method based on block parallel computation
CN107256203A (en) The implementation method and device of a kind of matrix-vector multiplication
Li et al. An experimental study on deep learning based on different hardware configurations
CN116303219A (en) Grid file acquisition method and device and electronic equipment
CN106547722A (en) A kind of big data parallel optimization method
Razumchik et al. Some Ergodicity And Truncation Bounds For A Small Scale Markovian Supercomputer Model.
CN104657108B (en) A kind of management method and system of the event queue of the software simulator of microprocessor
Li et al. Memory saving method for enhanced convolution of deep neural network
CN106919536A (en) A kind of accelerated method and its accelerator for being applied to triangular matrix and matrix multiplication
CN114969857A (en) Structural design optimization method, system, computer equipment and storage medium
CN112446004A (en) Unstructured grid DILU preconditioned child-many-core parallel optimization algorithm
CN109992860A (en) Electro-magnetic transient parallel simulation method and system based on GPU
Bleile et al. Thin-Threads: An Approach for History-Based Monte Carlo on GPUs
CN110021059B (en) High-efficiency Marking Cubes isosurface extraction method and system without redundant computation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210326