CN112540718A - Sparse matrix storage method for Shenwei many-core architecture - Google Patents
- Publication number
- CN112540718A (application CN201910898286.2A)
- Authority
- CN
- China
- Prior art keywords
- core
- sparse matrix
- many
- column
- computing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a sparse matrix storage method for the Shenwei many-core architecture. The many-core processor consists of 4 heterogeneous groups; each heterogeneous group comprises a master core, a slave-core cluster of 64 slave cores, a heterogeneous-group interface, and a memory controller, giving the whole chip 260 computing cores. The sparse matrix storage format comprises the steps of: S1, grouping the sparse matrix by rows over the core-group array of the Shenwei many-core processor, assigning several rows to each slave core so that the 64 slave cores hold 64 groups; and S2, compressing and storing the non-zero elements of the sparse matrix within each group by columns, compressing the column coordinates by storing the number of non-zero elements contained in each column, and storing the column offsets, row coordinates, and non-zero values. The invention provides a uniform data organization for the whole solution process of application software based on the many-core processor, thereby improving the adaptability of the problem to the Shenwei many-core architecture.
Description
Technical Field
The invention belongs to the technical field of scientific computing, and particularly relates to a sparse matrix storage method for the Shenwei many-core architecture.
Background
Sparse matrices are key data structures, and performance bottlenecks, in numerical simulation across many fields of natural and social science. Since non-zero elements in a sparse matrix are extremely rare, it is wasteful to store one in a two-dimensional array the way a dense matrix is stored. Considerable work exists on sparse matrix storage; the most basic formats are COO (coordinate list), CSR (Compressed Sparse Row), CSC (Compressed Sparse Column), ELLPACK, and so on.
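As a toy illustration of one of these basic formats (a hypothetical 3×3 matrix, not taken from the patent), CSR keeps three arrays:

```python
# CSR (Compressed Sparse Row) sketch for a small hypothetical matrix:
# A = [[3, 0, 1],
#      [0, 2, 0],
#      [4, 0, 5]]
# row_ptr[i] gives the index in col_idx/values where row i starts.
row_ptr = [0, 2, 3, 5]        # row i spans values[row_ptr[i]:row_ptr[i+1]]
col_idx = [0, 2, 1, 0, 2]     # column of each non-zero
values  = [3, 1, 2, 4, 5]     # the non-zero values themselves

def csr_to_dense(n, row_ptr, col_idx, values):
    """Expand CSR arrays back into an n x n dense list of lists."""
    dense = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            dense[i][col_idx[k]] = values[k]
    return dense

print(csr_to_dense(3, row_ptr, col_idx, values))
# [[3, 0, 1], [0, 2, 0], [4, 0, 5]]
```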
In recent years, heterogeneous high-performance computing platforms based on many-core processors have gradually become mainstream: processor core counts have grown, vector lengths have increased, and cache hierarchies have deepened in both levels and capacity. On these new architectures, traditional sparse matrix code written for multi-core CPUs cannot scale with the improvement in hardware performance, so new sparse matrix storage formats and corresponding parallel algorithms must be developed for the Shenwei many-core processor.
On GPUs, new sparse matrix storage formats and algorithms have gradually appeared, such as the HYB hybrid format, the ELL-R method, the ELL method, the CSMR format, and the BCCOO format, but these are not suitable for the Shenwei many-core architecture. To improve the performance of sparse matrix algorithms on the Shenwei many-core processor, a new sparse matrix storage format must be studied.
When a large sparse matrix stored in a traditional format is used for matrix operations on a many-core processor, problems such as scattered memory access, write conflicts, poor data reuse, and load imbalance arise. Some application software for the Shenwei many-core architecture locally optimizes particular sparse matrix operations for specific sparsity patterns and converts back to the original matrix format afterward; this neither avoids the cost of format conversion nor preserves the integrity of the application software.
Disclosure of Invention
The invention aims to provide a sparse matrix storage method for the Shenwei many-core architecture. The storage format provides a uniform data organization for the whole solution process of application software based on the many-core processor, thereby improving the adaptability of the problem to the Shenwei many-core architecture.
To this end, the invention adopts the following technical scheme: a sparse matrix storage method for the Shenwei many-core architecture, wherein the many-core processor is composed of 4 heterogeneous groups; each heterogeneous group comprises a control core, a slave-core cluster of 64 computing cores, a heterogeneous-group interface, and a memory controller; the whole many-core processor has 260 computing cores.
the sparse matrix storage format comprises the steps of:
S1, group the sparse matrix by rows over the core-group array of the Shenwei many-core processor: let the sparse matrix have N rows and the core-group array have 64 computing cores; each computing core is assigned N/64 rows, distributed sequentially;
S2, compress and store the non-zero elements of the sparse matrix within each group by columns, compressing the column coordinates by storing the number of non-zero elements contained in each column, and storing the column offsets Col_p, row coordinates Row_i, and non-zero values Value;
S21, record each non-zero element of the matrix in column order:
S211, record the column offset Col_p of the first non-zero element of each column, and append the total number of non-zero elements of the matrix at the end of the offset array;
S212, record the row coordinate Row_i of each non-zero element;
S213, record the value Value of each non-zero element.
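Steps S1-S2 can be sketched in Python (a minimal sketch; the function names and the dense-list input representation are illustrative assumptions, not from the patent):

```python
# Sketch of CSGC construction per steps S1-S2 (names are illustrative,
# not from the patent). The matrix is given as a dense list of lists;
# each slave core receives a contiguous block of rows (S1), and that
# block is compressed by columns (S2).

def csgc_compress_group(rows, n_cols):
    """Column-compress one group of rows (S21-S213):
    Col_p[j] is the index of the first non-zero of column j, and the
    total non-zero count is appended at the end (S211)."""
    col_p, row_i, value = [], [], []
    for j in range(n_cols):                 # scan column by column
        col_p.append(len(value))            # offset of column j's first nnz
        for i, row in enumerate(rows):
            if row[j] != 0:
                row_i.append(i)             # local row coordinate (S212)
                value.append(row[j])        # non-zero value (S213)
    col_p.append(len(value))                # append total nnz (S211)
    return col_p, row_i, value

def csgc_store(matrix, n_cores=64):
    """S1: split the N rows into n_cores sequential groups of N // n_cores
    rows each, then compress each group by columns (S2)."""
    per_core = len(matrix) // n_cores
    groups = [matrix[c * per_core:(c + 1) * per_core] for c in range(n_cores)]
    return [csgc_compress_group(g, len(matrix[0])) for g in groups]
```

For example, `csgc_store([[1, 0], [0, 2], [3, 0], [0, 0]], n_cores=2)` splits the four rows into two groups of two and compresses each group by columns.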
A further refinement of the above scheme is as follows:
1. In the scheme, the computing cores are mainly responsible for fine-grained parallel computing tasks. A computing core can access main memory directly (discretely) or in batches via DMA, and can communicate efficiently within the many-core array via register communication.
Due to the application of the technical scheme, compared with the prior art, the invention has the following advantages:
the sparse matrix storage method facing the Shenwei many-core architecture adopts a sparse matrix group inner column compression storage format, can avoid the problem of write conflict and the problem of discrete access to a main memory, can improve the sparse matrix calculation efficiency, reduces the format conversion overhead during multiple matrix operations in application software, can provide a uniform data organization form for sparse matrix operations, and obtains higher solving efficiency on a many-core processor, thereby improving the many-core architecture adaptability of the application problems; the format has good adaptability of the Schmitt-Kernel framework to any sparse matrix type (asymmetric, non-centralized and irregular).
Drawings
FIG. 1 is a schematic diagram of a many-core processor architecture of the present invention;
FIG. 2 is a schematic diagram of the sparse matrix grouping of the present invention.
Detailed Description
The invention is further described below with reference to the following examples:
example (b): a sparse matrix storage method facing a Shenwei many-core architecture is characterized in that a many-core processor is composed of 4 heterogeneous groups, each heterogeneous group comprises a control core, a slave core cluster composed of 64 computing cores, heterogeneous group interfaces and a storage controller, and the whole many-core processor is provided with 260 computing cores; the control core has the functions of a conventional multi-core CPU such as calculation, communication, I/O and the like, and is also responsible for control operations such as loading and recovery of slave core tasks;
the sparse matrix storage format comprises the steps of:
S1, group the sparse matrix by rows over the core-group array of the Shenwei many-core processor: let the sparse matrix have N rows and the core-group array have 64 computing cores; each computing core is assigned N/64 rows, distributed sequentially;
S2, compress and store the non-zero elements of the sparse matrix within each group by columns, compressing the column coordinates by storing the number of non-zero elements contained in each column, and storing the column offsets Col_p, row coordinates Row_i, and non-zero values Value;
S21, record each non-zero element of the matrix in column order:
S211, record the column offset Col_p of the first non-zero element of each column, and append the total number of non-zero elements of the matrix at the end of the offset array;
S212, record the row coordinate Row_i of each non-zero element;
S213, record the value Value of each non-zero element.
The computing cores are mainly responsible for fine-grained parallel computing tasks. A computing core can access main memory directly (discretely) or in batches via DMA, and can communicate efficiently within the many-core array via register communication.
The above-mentioned aspects of the invention are further explained as follows:
sparse matrix: english is sparse matrix, and in the matrix, if the number of elements with the numerical value of zero is far more than the number of non-zero elements and the distribution of the non-zero elements is not regular, the matrix is called sparse matrix.
Many cores: english is many core, and the processor integrates a plurality of computing cores and is oriented to the field of high-performance computing.
For the architecture of the Shenwei many-core processor, a sparse matrix storage format named CSGC ('Sparse Group Column compression') is proposed. First, the sparse matrix is grouped by rows over the core-group array of the Shenwei many-core processor: each slave core is assigned several rows as one group, so the 64 slave cores hold 64 groups. Second, the non-zero elements of the sparse matrix within each group are compressed and stored by columns: the column coordinates are compressed by storing the number of non-zero elements contained in each column, which requires storing the column offsets (Col_p), the row coordinates (Row_i), and the non-zero values (Value).
The sparse matrix shown in FIG. 2 is distributed by rows over the 64 slave cores; slave cores No. 0 to No. 3 each receive three rows of the sparse matrix, and the other slave cores are handled similarly.
for slave core No. 0, CSGC:
Col_p= [0,1,2,4,4,5,6,6,7]
Row_i = [0,1,2,0,1,2,0]
Value= [3,4,6,5,-1,2,7]
for slave core No. 1, CSGC:
Col_p= [0,2,2,3,4,5,6,7,7]
Row_i = [0,1,2,0,1,2,0]
Value= [-2,4,5,6,7,8,15]
for slave core No. 2, CSGC:
Col_p= [0,0,1,1,2,3,4,5,6]
Row_i = [1,0,1,2,0,1]
Value= [7,-3,12,9,9,10]
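The arrays above can be sanity-checked by expanding them back into (local row, column, value) triples. A minimal sketch (the decoder function is illustrative, not part of the patent), applied to slave core No. 0:

```python
def csgc_expand(col_p, row_i, value):
    """Expand CSGC arrays into (local_row, col, value) triples.
    Column j holds entries value[col_p[j]:col_p[j+1]]; the last entry
    of col_p is the total non-zero count."""
    triples = []
    for j in range(len(col_p) - 1):
        for k in range(col_p[j], col_p[j + 1]):
            triples.append((row_i[k], j, value[k]))
    return triples

# Slave core No. 0 from the example above:
print(csgc_expand([0, 1, 2, 4, 4, 5, 6, 6, 7],
                  [0, 1, 2, 0, 1, 2, 0],
                  [3, 4, 6, 5, -1, 2, 7]))
# [(0, 0, 3), (1, 1, 4), (2, 2, 6), (0, 2, 5), (1, 4, -1), (2, 5, 2), (0, 7, 7)]
```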
Storing the sparse matrix with the CSGC method and performing sparse matrix-vector multiplication on the Shenwei many-core processor avoids both write conflicts and scattered main-memory access. Sparse matrix-vector multiplication (SpMV for short) is taken as the example to introduce the many-core optimization algorithm. SpMV is a very important kernel in numerical computation; it computes b = A·x, where A is the sparse matrix, x is the known vector, and b is the result vector. A simple example for a sparse matrix of size n × n is given below.
First, following the CSGC storage method, the sparse matrix is grouped by rows over the core-group array, and the result vector b is grouped by rows correspondingly; all computations involving a given element of b execute sequentially within one slave core, so there is no write conflict.
Second, the sparse matrix non-zero elements needed by each computing core are stored compactly and contiguously under the CSGC method, so they can be fetched in batches via DMA.
Finally, if the LDM (local data memory) cannot hold the entire x vector during the computation, x is fetched in multiple batches via DMA. The scattered-access problem of the sparse matrix is thereby resolved.
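The per-core computation described in these three paragraphs can be sketched as follows (a serial, illustrative stand-in for one slave core; names are assumptions, and DMA/LDM staging is omitted):

```python
# Per-core SpMV sketch under CSGC (illustrative only): each core
# computes its own slice of b = A*x from its group's column-compressed
# arrays, so writes to b never conflict across cores.

def spmv_group(col_p, row_i, value, x, n_local_rows):
    """Multiply one CSGC group by the vector x, producing that core's
    local slice of the result vector b."""
    b_local = [0.0] * n_local_rows
    for j in range(len(col_p) - 1):              # walk the columns
        for k in range(col_p[j], col_p[j + 1]):  # non-zeros of column j
            b_local[row_i[k]] += value[k] * x[j]
    return b_local

# Slave core No. 0 from the example above, with x = (1, 1, ..., 1):
print(spmv_group([0, 1, 2, 4, 4, 5, 6, 6, 7],
                 [0, 1, 2, 0, 1, 2, 0],
                 [3, 4, 6, 5, -1, 2, 7],
                 [1] * 8, 3))
# [15.0, 3.0, 8.0]
```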
The data structure design of the whole application involves sparse matrix operations such as sparse coefficient matrix construction, sparse linear system solving, and sparse matrix-vector multiplication; adopting the CSGC (in-group column compression) format improves sparse matrix computing efficiency and reduces matrix format conversion overhead.
With the sparse matrix storage method for the Shenwei many-core architecture, sparse matrix computing efficiency is improved and the cost of format conversion across the multiple matrix operations in application software is reduced; the format adapts well to the Shenwei many-core architecture for any sparse matrix type (asymmetric, non-concentrated, irregular).
The above embodiments are merely illustrative of the technical ideas and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.
Claims (2)
1. A sparse matrix storage method for the Shenwei many-core architecture, characterized in that: the many-core processor is composed of 4 heterogeneous groups; each heterogeneous group comprises a control core, a slave-core cluster of 64 computing cores, a heterogeneous-group interface, and a memory controller; the whole many-core processor has 260 computing cores;
the sparse matrix storage format comprises the steps of:
S1, group the sparse matrix by rows over the core-group array of the Shenwei many-core processor: let the sparse matrix have N rows and the core-group array have 64 computing cores; each computing core is assigned N/64 rows, distributed sequentially;
S2, compress and store the non-zero elements of the sparse matrix within each group by columns, compressing the column coordinates by storing the number of non-zero elements contained in each column, and storing the column offsets Col_p, row coordinates Row_i, and non-zero values Value;
S21, record each non-zero element of the matrix in column order:
S211, record the column offset Col_p of the first non-zero element of each column, and append the total number of non-zero elements of the matrix at the end of the offset array;
S212, record the row coordinate Row_i of each non-zero element;
S213, record the value Value of each non-zero element.
2. The sparse matrix storage method for the Shenwei many-core architecture according to claim 1, characterized in that: the computing cores are mainly responsible for fine-grained parallel computing tasks; a computing core can access main memory directly (discretely) or in batches via DMA, and can communicate efficiently within the many-core array via register communication.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910898286.2A CN112540718A (en) | 2019-09-23 | 2019-09-23 | Sparse matrix storage method for Shenwei many-core architecture
Publications (1)
Publication Number | Publication Date |
---|---|
CN112540718A true CN112540718A (en) | 2021-03-23 |
Family
ID=75013010
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910898286.2A Withdrawn CN112540718A (en) | 2019-09-23 | 2019-09-23 | Sparse matrix storage method for Shenwei many-core architecture
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112540718A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113505766A (en) * | 2021-09-09 | 2021-10-15 | 北京智源人工智能研究院 | Image target detection method and device, electronic equipment and storage medium |
CN114385972A (en) * | 2021-12-20 | 2022-04-22 | 北京科技大学 | Parallel computing method for directly solving structured triangular sparse linear equation set |
WO2023097970A1 (en) * | 2021-12-01 | 2023-06-08 | 北京微电子技术研究所 | Many-core definable distributed shared storage structure |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150067009A1 (en) * | 2013-08-30 | 2015-03-05 | Microsoft Corporation | Sparse matrix data structure |
CN104636273A (en) * | 2015-02-28 | 2015-05-20 | 中国科学技术大学 | Storage method of sparse matrix on SIMD multi-core processor with multi-level cache |
CN106775594A (en) * | 2017-01-13 | 2017-05-31 | 中国科学院软件研究所 | A kind of Sparse Matrix-Vector based on the domestic processor of Shen prestige 26010 multiplies isomery many-core implementation method |
Worldwide Applications (1)
- 2019-09-23: CN application CN201910898286.2A, patent CN112540718A (en), not active — withdrawn
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20210323 |