CN115048215A - Method for realizing diagonal matrix SPMV (sparse matrix-vector multiplication) on a GPU (graphics processing unit) based on a hybrid compression format - Google Patents


Info

Publication number
CN115048215A
Authority
CN
China
Prior art keywords
matrix
dia
zero elements
array
spmv
Prior art date
Legal status
Pending
Application number
CN202210569070.3A
Other languages
Chinese (zh)
Inventor
徐悦竹
崔环宇
韩启龙
王念滨
宋洪涛
王宇华
刘成刚
Current Assignee
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202210569070.3A priority Critical patent/CN115048215A/en
Publication of CN115048215A publication Critical patent/CN115048215A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2237Vectors, bitmaps or matrices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608Saving storage space on storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Computation (AREA)
  • Geometry (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Instructional Devices (AREA)

Abstract

The invention belongs to the field of ship marine navigation simulation, and particularly relates to a method for realizing diagonal matrix SPMV (sparse matrix-vector multiplication) on a GPU (graphics processing unit) based on a hybrid compression format. A ship marine navigation simulation matrix data file in COO format is input and converted into conventional matrix form; the matrix is divided into a DIA matrix and a diagonal offset array based on a strategy that minimizes the standard deviation of the number of non-zero elements; the remaining data of the converted matrix are stored in CSR form using a blocking strategy; the DIA matrix data and the CSR-related data are transferred from the host side to the device side, and the GPU-parallel SPMV operation is performed with each thread processing one row; finally, the calculation results of the two stages are transferred from the device side back to the host side and integrated there to realize the ship marine navigation simulation. The method improves the computational efficiency of the sparse matrix algorithms used in ship marine navigation simulation.

Description

Method for realizing diagonal matrix SPMV (sparse matrix-vector multiplication) on GPU (graphics processing unit) based on hybrid compression format
Technical Field
The invention belongs to the field of ship marine navigation simulation, and particularly relates to a method for realizing diagonal matrix SPMV (sparse matrix-vector multiplication) on a GPU (graphics processing unit) based on a hybrid compression format.
Background
Sparse matrix-vector multiplication (SPMV) is one of the basic operations for solving linear systems. In many scientific and engineering applications, such as ship marine navigation simulation, the resulting matrices can be both large and sparse. Furthermore, these sparse matrices exhibit varied sparsity properties: irregularity, symmetry, diagonal structure, and so on. Implementing and optimizing SPMV with a suitable matrix compression algorithm in a given parallel computing environment is a challenging problem.
Due to the complexity of ship marine navigation simulation data grids and the irregularity of computational domain boundaries, the sparsity patterns of the resulting linear systems are highly varied. When such a sparse matrix is stored and computed naively, the large number of zero elements not only occupies extra storage space but also causes redundant computation. Designing a suitable sparse matrix storage format therefore effectively improves the parallel performance of SPMV. For tri-diagonal and multi-diagonal matrices, the Diagonal (DIA) storage format achieves better parallel computing efficiency than the Compressed Sparse Row (CSR), Coordinate (COO), and Hybrid (HYB) formats. The DIA format stores the matrix by its non-zero diagonals, but in order to maintain the diagonal structure it must be padded with a large number of zero elements. A matrix may be called a far matrix when a diagonal contains a long run of zeros, when multiple diagonals lie far from the main diagonal, or when the matrix contains many discrete scattered points. Compressing a far matrix requires extensive zero padding, which degrades GPU parallel computing performance.
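The padding cost described above can be made concrete with a small illustrative sketch (not taken from the patent; the helper name and the example matrix are invented for illustration). A single scattered entry far from the main diagonal forces DIA to store an entire extra diagonal, almost all of it zero padding:

```python
def dia_compress(A):
    """Store a square matrix in DIA form: one zero-padded array per
    non-empty diagonal, plus the list of diagonal offsets."""
    n = len(A)
    offsets = sorted({j - i for i in range(n) for j in range(n) if A[i][j] != 0})
    data = []
    for off in offsets:
        # pad with zeros wherever the diagonal leaves the matrix
        data.append([A[i][i + off] if 0 <= i + off < n else 0 for i in range(n)])
    return data, offsets

# one scattered entry (the 9) far below the main diagonal forces an
# entire extra diagonal, almost all of it zero padding
A = [[1, 2, 0, 0],
     [0, 3, 4, 0],
     [0, 0, 5, 6],
     [9, 0, 0, 7]]
data, offsets = dia_compress(A)
zeros_filled = sum(row.count(0) for row in data)   # 4 padded zeros for 8 non-zeros
```

Dropping such sparse diagonals from DIA and handing their entries to CSR is exactly the trade-off the hybrid format exploits.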
The DIA method has a problem of load imbalance between blocks in addition to the problem of zero elements. Load imbalance may affect the performance of the parallel SPMV, possibly causing part of the computing resources to be idle, reducing the utilization of the GPU. Therefore, the research of the load balancing strategy is also an important direction in the fields of ship marine navigation simulation and high-performance calculation.
In recent years, a number of SPMV methods have been proposed to optimize diagonal matrices. The basic idea of the DIA-CSR Hybrid (HDC) approach is to store in the DIA compression matrix only those diagonals with a sufficient number of non-zero elements (the threshold is typically half the number of main-diagonal elements) and to store the other non-zero elements in CSR format. HDC works well when the number of non-zero elements in most diagonals is close to the number of main-diagonal elements. However, when the number of non-zero elements in most diagonals is close to the threshold, a large number of zero elements must be filled into the DIA compression matrix, which degrades computational performance. The row-block-based diagonal compression format (BRCSD) introduces the concept of slice points and completes a parallel partitioning strategy based on block cutting, improving the compression ratio of the DIA algorithm, but it suffers from load imbalance between blocks. The improved HDC (M-HDC) method can quickly and effectively extract partial diagonal structure; however, determining the distribution of the parallel blocks requires extensive experimentation, and the preprocessing time is long.
Disclosure of Invention
The invention provides a method for realizing diagonal matrix SPMV (sparse matrix-vector multiplication) based on a hybrid compression format on a GPU (graphics processing unit), which improves the parallel SPMV efficiency of ship marine navigation simulation.
The invention provides an electronic device.
The invention provides a non-transitory computer readable storage medium.
The invention is realized by the following technical scheme:
a method for implementing a diagonal matrix SPMV (sparse matrix) based on a hybrid compression format on a GPU (graphics processing unit), comprising the following steps of:
step 1: inputting a COO format marine navigation simulation matrix data file of the ship, and converting the COO format marine navigation simulation matrix data file into a traditional matrix form;
step 2: dividing the matrix converted in the step 1 into a DIA matrix and a diagonal offset array based on a standard deviation minimum strategy of the number of non-zero elements;
step 3: storing the remaining data of the matrix converted in step 1 in CSR (compressed sparse row) form based on a blocking strategy;
step 4: transferring the DIA matrix data from step 2 and the CSR-related data from step 3 from the host side to the device side, and performing the GPU-parallel SPMV operation with each thread processing one row;
step 5: transferring the calculation results of the two stages of step 4 from the device side to the host side, and integrating them at the host side to improve the efficiency of the ship marine navigation simulation.
A method for implementing a diagonal matrix SPMV based on a hybrid compression format on a GPU (graphics processing unit), wherein the step 1 specifically comprises the following steps:
step 1.1: acquiring a matrix file in a COO format in a sparse matrix data set for ship marine navigation simulation, and taking the matrix file as an input matrix;
step 1.2: reading the parameter values of the rows, the columns and the non-zero elements in the matrix file in the step 1.1;
step 1.3: and setting a dynamic two-dimensional pointer, reading the matrix data in the COO format into the pointer, and generating a traditional sparse matrix.
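Steps 1.1-1.3 can be sketched as follows; this is a minimal illustration assuming the COO file has already been parsed into row/column/value triples (the function name and the example data are hypothetical):

```python
def coo_to_dense(rows, cols, vals, n_rows, n_cols):
    """Expand COO triples into conventional (dense) matrix form,
    mirroring the dynamic two-dimensional pointer of step 1.3."""
    M = [[0.0] * n_cols for _ in range(n_rows)]
    for r, c, v in zip(rows, cols, vals):
        M[r][c] = v
    return M

# hypothetical triples for a 3x3 example
rows, cols, vals = [0, 1, 2], [0, 2, 1], [1.0, 2.0, 3.0]
M = coo_to_dense(rows, cols, vals, 3, 3)
```

The expanded form is only an intermediate: the following steps immediately re-compress it into the DIA and CSR parts.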
A method for implementing a diagonal matrix SPMV based on a hybrid compression format on a GPU, wherein the step 2 specifically comprises the following steps:
step 2.1: based on the matrix converted in the step 1, obtaining the number of non-zero elements in each diagonal line, and calculating the total number of the non-zero elements;
step 2.2: obtaining the average value of the number of non-zero elements in each diagonal line;
step 2.3: taking the average value of the number of the non-zero elements as a threshold value, and obtaining a DIA matrix part and an offset array in an input matrix;
step 2.4: obtaining the number of non-zero elements of each column from the DIA matrix, wherein the number of the non-zero elements corresponds to the number of the non-zero elements of each row in the input matrix; storing the number of non-zero elements of each column by using an array A, and calculating the average value of the array on each block;
step 2.5: taking the average value obtained for each block in step 2.4 as a threshold, sequentially inserting the values from array A into a new array B_i until the sum of B_i is greater than or equal to the average value, then saving B_i as a block; calculating the standard deviation between the blocks, and taking the block count with the minimum standard deviation as the optimal block number;
step 2.6: non-zero elements in the DIA matrix are stored in a diagonal manner while obtaining a diagonal offset.
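Steps 2.1-2.3 can be sketched as below. This is a hedged reconstruction: the helper name is invented, and for simplicity it keeps diagonals whose non-zero count is at least the floating-point average, whereas the patent's worked example uses an integer average and a strict comparison:

```python
def split_by_diagonal_threshold(A):
    """Count non-zeros per diagonal (step 2.1), take their mean as the
    threshold (step 2.2), and keep diagonals at or above it for the DIA
    part; everything else is left for the CSR stage."""
    n = len(A)
    counts = {}
    for i in range(n):
        for j in range(n):
            if A[i][j] != 0:
                counts[j - i] = counts.get(j - i, 0) + 1
    avg = sum(counts.values()) / len(counts)
    kept = sorted(off for off, c in counts.items() if c >= avg)
    remainder = [[A[i][j] if (j - i) not in kept else 0 for j in range(n)]
                 for i in range(n)]
    return kept, remainder

# illustrative 4x4 matrix: main and first super-diagonal are dense,
# one scattered entry at offset -3 falls to the CSR remainder
A = [[1, 2, 0, 0],
     [0, 3, 4, 0],
     [0, 0, 5, 6],
     [9, 0, 0, 7]]
kept, remainder = split_by_diagonal_threshold(A)
```

Here `kept` corresponds to the diagonal offset array and `remainder` to the data handed to step 3.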
A method for implementing a diagonal matrix SPMV based on a hybrid compression format on a GPU, wherein the step 3 specifically comprises the following steps:
step 3.1: after the DIA matrix is removed from the input matrix, the number of non-zero elements in each row is obtained and stored into an array C;
step 3.2: setting the block count i, and storing the first i values of array C from step 3.1, one per row, into a new two-dimensional pointer D_i; then determining the row of D_i whose sum of non-zero element counts is smallest, and recording that row number;
step 3.3: inserting the remaining data of array C sequentially into the two-dimensional pointer D_i based on the row number determined in step 3.2;
step 3.4: obtaining the two-dimensional pointers D_i for multiple block counts and calculating the standard deviation of the non-zero element counts in each. Comparing the standard deviations between block counts, if D_i has the minimum standard deviation, then i is the optimal block number.
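Steps 3.1-3.4 amount to a greedy balancing of per-row non-zero counts followed by a standard-deviation comparison over candidate block counts. The sketch below is an illustrative reconstruction under stated assumptions (function names and example counts are invented), not the patent's exact procedure:

```python
import math

def balance_rows(C, n_blocks):
    """Seed one row per block (step 3.2), then give each remaining row
    to the block whose running non-zero total is smallest (step 3.3)."""
    blocks = [[c] for c in C[:n_blocks]]
    for c in C[n_blocks:]:
        min(blocks, key=sum).append(c)
    return blocks

def block_std(blocks):
    """Standard deviation of per-block non-zero totals (step 3.4)."""
    sums = [sum(b) for b in blocks]
    mean = sum(sums) / len(sums)
    return math.sqrt(sum((s - mean) ** 2 for s in sums) / len(sums))

def best_block_count(C, lo, hi):
    """Choose the candidate block count with the smallest deviation."""
    return min(range(lo, hi + 1), key=lambda k: block_std(balance_rows(C, k)))

# hypothetical per-row non-zero counts for the CSR remainder:
# one heavy row plus five light ones balances best across two blocks
C = [5, 1, 1, 1, 1, 1]
best = best_block_count(C, 2, 3)
```

Greedily appending to the lightest block keeps the per-block totals close, which is exactly the load-balancing condition the standard deviation measures.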
the method for implementing the diagonal matrix SPMV based on the hybrid compression format on the GPU comprises the following steps of 4,
step 4.1: transferring the DIA matrix and the diagonal offsets from the host side to the device side, then starting the parallel DIA_SPMV operation;
step 4.2: transferring the CSR value array, the column indices, and the non-zero element offsets from the host side to the device side, then starting the parallel CSR_SPMV operation.
A method for implementing a diagonal matrix SPMV based on a hybrid compression format on a GPU, wherein the step 5 comprises the following steps:
step 5.1: setting a CPU and GPU synchronization function;
step 5.2: based on the synchronization function of step 5.1, the result after parallel DIA computation is transferred from the device side to the host side;
step 5.3: based on the synchronization function of step 5.1, transferring the result of the parallel CSR calculation from the device side to the host side;
step 5.4: the results of step 5.2 and step 5.3 are integrated.
A method for realizing the diagonal matrix SPMV based on the hybrid compression format on a GPU, wherein in step 5.2 the parallel DIA calculation specifically comprises: setting the calculation starting position of each thread; setting the position of the y value calculated by each thread; setting the maximum number of non-zero elements calculated by each thread; and performing the SPMV calculation of each thread, where each thread accumulates over its assigned elements in one iteration to obtain its value of y.
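A serial Python sketch may clarify the DIA-stage thread mapping: the body of the outer loop is what each GPU thread would execute for its assigned row. The storage layout (one zero-padded array per stored diagonal, indexed by row) and the example data are assumptions of this sketch, not taken from the patent:

```python
def dia_spmv(data, offsets, x):
    """Serial sketch of the DIA-stage SPMV: accumulate, for each row,
    the products contributed by every stored diagonal."""
    n = len(x)
    y = [0.0] * n
    for row in range(n):            # one GPU thread per row
        acc = 0.0
        for d, off in enumerate(offsets):
            col = row + off
            if 0 <= col < n:        # skip positions outside the matrix
                acc += data[d][row] * x[col]
        y[row] = acc
    return y

# hypothetical DIA data: diagonals at offsets -3, 0, +1 of a 4x4 matrix,
# each stored as a length-4 array indexed by row (zero-padded)
data = [[0, 0, 0, 9], [1, 3, 5, 7], [2, 4, 6, 0]]
offsets = [-3, 0, 1]
y = dia_spmv(data, offsets, [1.0, 1.0, 1.0, 1.0])
```

Because every thread touches the same number of diagonals, the per-thread work in this stage is naturally uniform.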
In step 5.3, the parallel CSR calculation specifically comprises: setting, for each thread, the range of the stored non-zero element array that it may access; calculating the length of the array accessed by each thread based on the maximum data length per block from step 3.3; starting the parallel SPMV operation, locating the correct position in x according to the column coordinate of each non-zero element and performing the multiplication; each thread iterating the product operation and performing the reduction, until every non-zero element in the CSR parallel stage has completed its SPMV calculation and the y values are saved on the GPU.
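The CSR-stage computation can likewise be sketched serially; each iteration of the outer loop corresponds to the work of one GPU thread in the one-thread-per-row scheme (array names and example data are illustrative assumptions):

```python
def csr_spmv(vals, col_idx, row_ptr, x):
    """Serial sketch of the CSR-stage SPMV in standard CSR layout."""
    y = []
    for r in range(len(row_ptr) - 1):        # one GPU thread per row
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]   # locate x via the column coordinate
        y.append(acc)
    return y

# remainder of a hypothetical example: a single entry 9 at row 3, column 0
vals, col_idx, row_ptr = [9.0], [0], [0, 0, 0, 0, 1]
y_csr = csr_spmv(vals, col_idx, row_ptr, [2.0, 0.0, 0.0, 0.0])
```

On the host side, the y vectors of the two stages are then summed element-wise to give the final result.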
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the method as described in any of the above.
A non-transitory computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.
The invention has the beneficial effects that:
the invention only stores the non-zero elements and the corresponding index addresses, thereby reducing unnecessary storage and calculation expenses.
The invention improves the prior art from multiple aspects such as a compression structure and an algorithm of a sparse diagonal matrix by combining the characteristics of a ship marine navigation simulation data structure.
Drawings
Fig. 1 is a schematic structural view of the present invention.
FIG. 2 is a diagram of DIA compression matrix partitioning in the PHCM storage format of the present invention.
Fig. 3 is a diagram of an irregular matrix according to the present invention.
Fig. 4 is a diagram illustrating a partitioning scheme of the number of non-zero elements in the matrix according to the present invention.
Figure 5 is a flow chart of the optimized DIA parallel algorithm of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Due to the adoption of a compression storage mode, the requirement of a storage space is reduced, and simultaneously, a large amount of discrete memory access and inter-address memory access operations are introduced, so that the memory access overhead is increased.
The sparse matrix algorithm is a typical irregular algorithm with limited access and memory, and the calculation access and memory ratio is low. The compressed storage destroys the time and space locality in the calculation process, so that the calculation efficiency of the sparse matrix algorithm on the traditional cache-based general processor is very low.
Due to the existence of a large number of zero elements in the sparse matrix, the storage and calculation method adopting the traditional dense matrix will cause the waste of storage and calculation resources, and possibly even cause the insufficient storage space of the system.
(I) CPU preprocessing stage
The present invention requires an efficient compression process of the matrix prior to parallel computation. The compressed matrix can reduce zero element filling on a large scale, thereby reducing calculation redundancy and storage redundancy. In order to more efficiently process multi/tri-diagonal matrices and far matrices with each diagonal being further away from the main diagonal, a DIA compression method needs to be employed. However, the DIA may have a problem of excessive zero element filling, and therefore, the present invention introduces the setting of the threshold value on the basis of the DIA-based compression method, and combines the advantages of the CSR method, so as to effectively improve the matrix compression degree.
A method for realizing a diagonal matrix SPMV based on a mixed compression format on a GPU is mainly divided into the following parts in the CPU preprocessing stage:
step 1: before matrix compression operation, a sparse matrix data set of marine navigation simulation of a ship needs to be read into a program to serve as an input matrix. The data set is presented in the format of COO, which first needs to be converted into the conventional matrix form.
Step 2: aiming at the sum of the diagonal number and the number of the nonzero elements, the parameter avg of the average number of the nonzero elements in each block can be obtained. The input matrix is divided into DIA compression matrix and CSR matrix based on the value of avg. In this step, the DIA compression matrix is reasonably partitioned and efficient storage is accomplished.
Step 3: the matrix data remaining after the DIA-stage division is stored in CSR form. Since the non-zero elements are not uniformly distributed across rows, the CSR storage method must be optimized so that the CSR-based SPMV satisfies the load balancing condition. When the numbers of non-zero elements in the blocks are as similar as possible, parallel computing efficiency is effectively improved.
Step 4: data transmission between the CPU and the GPU is performed.
Further, the step 1 comprises the following steps:
step 1.1: and analyzing and obtaining a data file in a COO format of a diagonal matrix and an irregular matrix in a data set of ship marine navigation simulation.
Step 1.2: and restoring the data file in the COO format into a traditional sparse matrix form which can be accommodated by the memory.
Further, the step 2 comprises the following steps:
step 2.1: and acquiring the number of non-zero elements on each diagonal line in the input matrix, and storing the number into a specified array.
Step 2.2: and setting a threshold value, storing diagonal data which is larger than the threshold value in the input matrix by using a DIA compression matrix, and storing the rest matrix data by using a CSR matrix.
Step 2.3: the upper limit and the lower limit of the number of blocks divided in the DIA matrix are analyzed, and the number of non-zero elements of each column in the DIA compression matrix is obtained.
Step 2.4: and judging the size between the value of each column of elements and the average value, dividing the elements into the same block if the value is larger than or equal to the average value, and generating a plurality of groups according to the strategy.
Step 2.5: and calculating the standard deviation among a plurality of blocks in each group, and recording the group number of the minimum standard deviation, so that the optimal block number participating in the SPMV parallel calculation and the data distribution can be obtained.
Further, the step 3 comprises the following steps:
step 3.1: and compressing the data left after the DIA stage by adopting a CSR storage mode.
Step 3.2: and acquiring the number of non-zero elements in each row in the residual matrix, and rearranging. And setting a new matrix, and putting the number of the non-zero elements of the previous block number in the array into the new matrix as a first column according to different block numbers.
Step 3.3: and based on the principle that the total number of the non-zero elements in each row in the new matrix is minimum, the residual data is effectively divided, and the almost consistent number of the non-zero elements between the blocks is realized.
Further, the step 4 comprises the following steps:
step 4.1: transferring the DIA compressed data from the host side to the device side;
step 4.2: transferring the CSR compressed data from the host side to the device side.
(II) GPU parallel computing stage
A method for realizing a diagonal matrix SPMV based on a mixed compression format on a GPU is mainly divided into the following parts in a GPU parallel computing stage:
and 5: the invention provides a remote matrix parallel optimization algorithm based on a PHCM (hybrid compressed physical memory) compression method. Load balancing between blocks is guaranteed. The algorithm is divided into two stages, a DIA parallel computing stage and a CSR parallel computing stage. The y-value of the DIA part is first obtained and then the y-value of the CSR part is calculated.
Step 6: and returning the calculation result in the GPU to the CPU, and combining the y values generated in the two stages for the simulation process of the marine navigation of the ship.
Further, the step 5 comprises the following steps:
step 5.1: in the DIA parallel computing stage, based on a reasonable partition strategy and a thread allocation scheme, the design and implementation of a parallel algorithm are completed aiming at data in a diagonal form.
Step 5.2: in the CSR parallel computing stage, aiming at the distribution of non-zero elements in an irregular matrix, a parallel computing mode of processing one row by one thread is adopted, and simultaneously, a blocking strategy is combined to ensure the load balance among blocks.
Further, the step 6 comprises the following steps:
step 6.1: and the device end transmits the calculated result y values to the host end respectively.
Step 6.2: after the host receives, the y values generated by the two stages are merged.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the method as described in any of the above.
A non-transitory computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.
In the following detailed description, reference is made to the accompanying drawings, which form a part hereof. As shown in fig. 1, the implementation method of the diagonal matrix SPMV based on the hybrid compression format on the GPU mainly includes two stages: CPU preprocessing phase and GPU parallel computing phase. The CPU preprocessing stage mainly comprises DIA matrix compression method optimization and CSR matrix compression method optimization. On the basis of keeping the diagonal form, the quasi-diagonal matrix is stored by using the DIA after optimization as much as possible, and the scattered point or irregular matrix is stored by using the CSR method after optimization. And the GPU parallel computing stage is used for carrying out parallel SPMV computing based on the optimized compression method PHCM.
The CPU preprocessing stage mainly comprises four steps: a data read-in phase, a DIA optimization phase, a CSR optimization phase, and a data transfer phase between the CPU and the GPU.
CPU preprocessing stage
Step 1: during the process of performing the SPMV, the real data needs to be tested and analyzed. The data is matrix data extracted from an engineering application. It is read into a matrix compression algorithm.
The step 1 specifically comprises the following steps:
step 1.1: acquiring, from a sparse matrix data set for ship marine navigation simulation, a data file presented in COO format (a .mtx file) as the input matrix.
Step 1.2: the file in COO format is read in and converted into the form of a conventional sparse matrix, i.e. an input matrix containing zero element padding.
Step 2: the DIA optimization stage;
the step 2 specifically comprises the following steps:
step 2.1: based on the matrix data of step 1, the number of non-zero elements on each diagonal is obtained first and stored in the array dia_num. To ensure the continuity of the diagonal matrix, as many diagonals satisfying the condition as possible must be stored in the DIA matrix. Taking matrix A as an example (as shown in equation (1)), and ordering from the lowest-left diagonal containing non-zero elements to the highest-right one, dia_num = {1, 2, 4, 3, 3} is obtained.
[Equation (1): matrix A, shown as an image in the original document]
Step 2.2: the sum of the arrays of dia _ num is calculated, resulting in the average value k _ Diagram of dia _ num. The matrix A is divided into DIA matrix and CSR matrix according to k _ Diagram. As shown in matrix a, k _ Diagonal is 2. And storing the diagonal lines with the number of non-zero elements larger than 2 in each diagonal line in the matrix A into the DIA compression matrix, and storing other diagonal lines containing non-zero elements into the CSR array in a row-major order. The DIA compression Matrix (DIA-Matrix) can be obtained by dividing according to the Matrix a, as shown in formula (2):
[Equation (2): the DIA compression matrix, shown as an image in the original document]
a total of three diagonals are placed in the DIA-Matrix. Wherein each column represents the distribution of each row of data in matrix a. In addition, it can be seen from the DIA-Matrix that 8 zero elements still need to be filled at this time. Further optimization of the DIA-Matrix is required.
Step 2.3: the number of non-zero elements in each column of the DIA-Matrix is stored in the array Array_DIA. According to equation (2), Array_DIA = {3, 3, 2, 2}. A threshold for the number of blocks is set, i.e. the number of blocks into which the DIA-Matrix needs to be partitioned on the GPU. The minimum number of Blocks is 2 and the maximum is the number of columns of the DIA-Matrix containing non-zero elements; for this DIA-Matrix, Blocks ranges from a minimum of 2 to a maximum of 4.
Step 2.4: the calculation formulas for the optimal number of blocks are given:
Array_DIA = {x_0, x_1, x_2, ..., x_k}    (3)
avg = (x_0 + x_1 + x_2 + ... + x_k) / Blocks    (4)
According to equations (3) and (4), the elements of the Array_DIA array are compared with avg in order from left to right. If x_0 ≥ avg, then x_0 is placed in a block by itself; otherwise x_1 is added as well, and so on, until the number of non-zero elements within the block is greater than or equal to avg. In this way, block partitions for several different candidate values of Blocks can be obtained.
Step 2.5: according to step 2.4, the intra-group standard deviation of each candidate partition is calculated, and the partition with the smallest standard deviation is selected: its blocks hold nearly equal total numbers of non-zero elements, satisfying the load-balancing condition. One partition contains several blocks, and the standard deviation formula is:
s_deviation = √( ((b_1 − avg)² + (b_2 − avg)² + ... + (b_Blocks − avg)²) / Blocks )    (5), where b_i is the total number of non-zero elements in the i-th block.
Taking the DIA-Matrix as an example, when Blocks = 2, avg = 10/2 = 5 and the resulting partition is {[3, 3], [2, 2]}. When Blocks = 3, avg = 10/3 ≈ 3 (taking the integer part) and the resulting partition is {[3], [3], [2, 2]}. According to equation (5), for Blocks = 2, s_deviation = √(((6−5)² + (4−5)²)/2) = 1; for Blocks = 3, s_deviation = √(((3−3)² + (3−3)² + (4−3)²)/3) = √(1/3) ≈ 0.58. Therefore three blocks are selected for the division of this matrix, which gives the more balanced load. The distribution of data in the blocks after the DIA-Matrix is partitioned is shown in fig. 2.
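Steps 2.3 through 2.5 can be sketched as follows. The integer average and the left-to-right greedy grouping follow the worked example above; the function and variable names are our own.

```python
from math import sqrt

def greedy_blocks(array_dia, n_blocks):
    """Steps 2.3-2.4: greedily group consecutive columns until each block
    holds at least avg = total // n_blocks non-zeros (integer average,
    matching the worked example)."""
    avg = sum(array_dia) // n_blocks
    blocks, current = [], []
    for x in array_dia:
        current.append(x)
        if sum(current) >= avg and len(blocks) < n_blocks - 1:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks, avg

def s_deviation(blocks, avg):
    """Equation (5): standard deviation of the per-block non-zero totals."""
    sums = [sum(b) for b in blocks]
    return sqrt(sum((s - avg) ** 2 for s in sums) / len(sums))

# Step 2.5: try every candidate block count and keep the most balanced one
array_dia = [3, 3, 2, 2]               # from the DIA-Matrix of equation (2)
devs = {}
for n in range(2, 5):                  # Blocks ranges from 2 to 4
    blocks, avg = greedy_blocks(array_dia, n)
    devs[n] = s_deviation(blocks, avg)
best = min(devs, key=devs.get)         # selects 3 blocks, as in the text
```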
As can be seen from fig. 2, only 2 zero elements in the DIA compression matrix of the PHCM method need to be filled. Although the number of fills is the same as in the HDC method, the PHCM method preserves the continuity of the DIA diagonals and of consecutive accesses to x during storage, improving the efficiency of sparse matrix access. Compared with the BRCSD method, the PHCM method fills fewer zero elements and balances the load among blocks better.
Step 3: the CSR optimization stage;
the step 3 specifically comprises the following steps:
step 3.1: after the input matrix is compressed in the DIA stage, the remaining data needs to be stored efficiently using the CSR algorithm. For the matrix of equation (1), the data remaining after the DIA optimization stage are 11, 12, and 13. This example is too small to exhibit the problems of CSR: when the DIA stage ends, all remaining data is stored in CSR, and when the matrix is large, load imbalance easily arises. To solve this problem, a processing method based on optimal blocking is adopted. An example matrix is shown in fig. 3.
Step 3.2: as can be seen from fig. 3, the matrix in the figure is an irregular matrix. This matrix is assumed to be the remaining non-zero element distribution of matrix a after the DIA stage is processed. First, an Array Row _ Array _ Old of non-zero elements is obtained for each Row of the matrix, as shown in equation (6).
Row_Array_Old={5,3,7,1,3,1,1} (6)
The array Row_Array_Old is rearranged in descending order, as shown in equation (7):
Row_Array={7,5,3,3,1,1,1} (7)
Next, the CSR part of the matrix is divided effectively into several blocks. To decide whether each division is optimal, we first explain how to partition the array using 3 blocks as an example. The first three values of the array Row_Array are placed one per row, as shown in fig. 4(a). A new matrix Matrix_Segment is created to store the information of each block.
Step 3.3: since 3 is the minimum of "7", "5", and "3" in fig. 4(a), the next "3" of Row_Array is appended to the "3" row of fig. 4(a), forming fig. 4(b). The goal is to keep the total number of non-zero elements in each row of Matrix_Segment as equal as possible; each row of Matrix_Segment corresponds to the number of non-zero elements of one block of the original matrix. In fig. 4(b), the row sums of the three blocks are 7, 5, and 6. The remaining values of the array Row_Array are inserted in sequence into the row of Matrix_Segment with the smallest total of non-zero elements. Since "5" is the minimum of "7", "5", and "6", the next value of Row_Array, "1", is inserted into the row of fig. 4(b) containing "5", producing fig. 4(c). This strategy is followed until every element of Row_Array has been inserted into Matrix_Segment, as shown in fig. 4(e). After the matrix is partitioned, the array Row_Array is updated with the values of Matrix_Segment in row-major order.
Based on this strategy, as shown in fig. 4(e), the final block distribution can be obtained as shown in equation (8):
Block_Array={[7],[5,1,1],[3,3,1]} (8)
The number of blocks needs to be determined. The average number of non-zero elements per block is calculated using formulas (9) and (10):
Row_Array = [a_1, a_2, a_3, ..., a_n]    (9)
Row_Array_Mean = (a_1 + a_2 + ... + a_n) / Block_i    (10)
Combining the standard deviation formula (5), if the standard deviation s_deviation for the candidate count Block_i is the minimum, then Block_i is the optimal number of blocks. For the matrix in fig. 4, the standard deviation is smallest when Block = 3, which achieves load balancing between blocks.
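The Matrix_Segment insertion of steps 3.2–3.3 and the block-count selection of formulas (9)–(10) can be sketched as follows. Ties for the smallest row are broken by lowest block index, which reproduces the worked example of fig. 4; the function names are our own.

```python
from math import sqrt

def partition_rows(row_counts, n_blocks):
    """Steps 3.2-3.3: sort the per-row non-zero counts in descending order,
    then repeatedly place the next value into the block with the smallest
    running total (ties broken by lowest block index)."""
    segments = [[] for _ in range(n_blocks)]
    for v in sorted(row_counts, reverse=True):
        target = min(range(n_blocks), key=lambda i: sum(segments[i]))
        segments[target].append(v)
    return segments

def best_block_count(row_counts, max_blocks):
    """Formulas (9)-(10) plus equation (5): pick the block count whose
    per-block totals have the smallest standard deviation."""
    best_k, best_dev = None, float("inf")
    for k in range(2, max_blocks + 1):
        segments = partition_rows(row_counts, k)
        mean = sum(row_counts) / k
        dev = sqrt(sum((sum(s) - mean) ** 2 for s in segments) / k)
        if dev < best_dev:
            best_k, best_dev = k, dev
    return best_k
```

With the row counts of equation (6), `partition_rows([5, 3, 7, 1, 3, 1, 1], 3)` reproduces the Block_Array of equation (8), and `best_block_count` selects 3 blocks.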
In order to store the matrix more conveniently using the CSR method, the original matrix is rearranged according to the division recorded in the array Row_Array. It is important to note that if few non-zero elements remain after the DIA stage, it is recommended to use CSR directly; experimental analysis suggests using CSR directly when fewer than 100 non-zero elements remain. For such small numbers of non-zero elements our approach has no significant advantage over CSR and may only increase preprocessing time.
Step 4: the data transmission stage between the CPU and the GPU.
The step 4 specifically comprises the following steps:
step 4.1: transmit the arrays related to the parallel DIA from the host side to the device side; the transfer function is cudaMemcpy and the synchronization function is cudaDeviceSynchronize();
step 4.2: transmit the arrays related to the parallel CSR from the host side to the device side; the transfer function is cudaMemcpy and the synchronization function is cudaDeviceSynchronize();
(II) GPU parallel computing stage
Step 5: the invention provides a diagonal matrix parallel optimization algorithm based on the PHCM hybrid compression method. The algorithm uses CUDA parallel optimization and ensures load balance among blocks while reducing the zero-element filling of the diagonal matrix. The algorithm is divided into two stages: a DIA parallel computing stage (the PHCM-DIA parallel algorithm) and a CSR parallel computing stage (the PHCM-CSR parallel algorithm). The y values of the DIA part are computed first, then the y values of the CSR part. After both stages finish, the y values are merged.
The step 5 specifically comprises the following steps:
step 5.1: DIA parallel computing process
Besides the array dev_Array_DIA storing the data, the parameters x, dev_Block_Ele_offset, dev_Block_Row_offset, dev_Array_Index_DIA, dev_num_diag, dev_Block_Row, and dev_NNZ_K are passed to the device side. dev_Block_Ele_offset is the offset of the number of data elements in each block. dev_Block_Row_offset is the number of matrix rows contained in each block. dev_Array_Index_DIA holds the column index coordinates of the data relative to the original matrix. dev_num_diag is the number of diagonals stored in the DIA compression matrix. dev_Block_Row is the offset of the number of matrix rows contained in each block. dev_NNZ_K is the number of diagonals in each block.
In fig. 5, the data of the DIA matrix is divided into three blocks according to the partitioning policy, with one thread in each block processing one row of matrix A. Each column of data in fig. 5 corresponds to one row of data in matrix A. Thus, when y = Ax is calculated on the GPU, each thread iteratively performs multiply-and-accumulate operations to obtain its value of y. For fig. 5, dev_Block_Ele_offset = {0, 3, 6, 12}, dev_Block_Row_offset = {1, 1, 2}, dev_Block_Row = {0, 4}, dev_Array_Index_DIA = {0, 1, 2, 1, 2, 3, 3, 4, 4, 5}, and dev_NNZ_K = {3, 3, 4}. The specific implementation is shown as algorithm PHCM-DIA:
(Algorithm PHCM-DIA — pseudocode figure not reproduced in the source.)
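The PHCM-DIA pseudocode is only available as an image in the source. As a rough stand-in, the following Python sketch shows a plain (non-blocked) DIA SpMV in which each outer-loop iteration plays the role of one GPU thread handling one row; the example matrix is hypothetical, not the patent's matrix A.

```python
def dia_spmv(offsets, data, x, n):
    """Plain DIA-format y = A @ x: data[k][i] holds A[i][i + offsets[k]]
    (0.0 where the diagonal falls outside the matrix)."""
    y = [0.0] * n
    for i in range(n):                  # one "thread" per row of A
        acc = 0.0
        for k, d in enumerate(offsets):
            j = i + d                   # column index on this diagonal
            if 0 <= j < n:
                acc += data[k][i] * x[j]
        y[i] = acc
    return y

# Hypothetical 4x4 tridiagonal matrix: sub-diagonal 1s, main diagonal 2s,
# super-diagonal 3s
offsets = [-1, 0, 1]
data = [[0.0, 1.0, 1.0, 1.0],
        [2.0, 2.0, 2.0, 2.0],
        [3.0, 3.0, 3.0, 0.0]]
y = dia_spmv(offsets, data, [1.0, 1.0, 1.0, 1.0], 4)
```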
step 5.2: CSR parallel computing process
CSR parallel SPMV calculation is performed according to the partition strategy of step 3.3 and the arrays transferred to the GPU in step 4.2. The invention provides the PHCM-CSR parallel algorithm, whose main steps are:
1. Each thread is given access to the corresponding range of the array storing the non-zero elements.
2. Based on the maximum data length within each block from step 3.3, the length of the array accessed by each thread is calculated.
3. The parallel SPMV computation is executed: the correct position in x is located according to the column coordinate of each non-zero element and the product is formed.
4. Each thread iterates the product operation and performs the accumulation.
5. This repeats until every non-zero element of the CSR parallel stage has completed its SPMV calculation; the y values are kept on the GPU.
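The five steps above amount to a standard CSR SpMV. A minimal serial Python sketch, with each outer iteration standing in for one thread and a hypothetical example matrix:

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """CSR-format y = A @ x. Each outer iteration stands in for one GPU
    thread reading its assigned range of the non-zero arrays; the column
    coordinate of each non-zero locates the matching entry of x."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        for p in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[p] * x[col_idx[p]]
        y[i] = acc
    return y

# Hypothetical 3x3 matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
y = csr_spmv([0, 2, 3, 5], [0, 2, 1, 0, 2],
             [1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0])
```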
Finally, the DIA result and the CSR result are transferred from the device side to the host side; the transfer function is cudaMemcpy with the parameter cudaMemcpyDeviceToHost. The y values obtained by the SPMV in the two stages are merged at the host side to obtain the final result.
Step 6: return the calculation results from the GPU to the CPU and merge the y values of the two algorithms.
The step 6 specifically comprises the following steps:
step 6.1: the y values computed by CSR and DIA are each transferred through the cudaMemcpy function into the corresponding host-side arrays.
Step 6.2: at the host side, the y values computed by CSR and DIA are merged to produce the final result, iteratively realizing the simulation of the ship's marine navigation.

Claims (10)

1. A method for implementing a diagonal matrix SPMV based on a hybrid compression format on a GPU is characterized by comprising the following steps:
step 1: input the ship's COO-format marine navigation simulation matrix data file and convert it into a conventional matrix form;
step 2: divide the matrix converted in step 1 into a DIA matrix and a diagonal offset array based on a minimum-standard-deviation strategy on the number of non-zero elements;
step 3: store the remaining data of the matrix converted in step 1 in CSR form based on a partitioning strategy;
step 4: transmit the DIA matrix data of step 2 and the CSR-related data of step 3 from the host side to the device side, and perform the GPU parallel SPMV operation with each thread processing one row;
step 5: transmit the calculation results of the two stages of step 4 from the device side to the host side and integrate them at the host side, realizing the simulation of the ship's marine navigation.
2. The method according to claim 1, wherein the step 1 specifically includes the following steps:
step 1.1: acquiring a matrix file in a COO format in a sparse matrix data set for ship marine navigation simulation;
step 1.2: reading the parameter values of the rows, the columns and the non-zero elements in the matrix file in the step 1.1;
step 1.3: set a dynamic two-dimensional pointer, read the COO-format matrix data into it, and generate the conventional sparse matrix.
3. The method according to claim 1, wherein the step 2 specifically includes the following steps:
step 2.1: based on the matrix converted in the step 1, obtaining the number of non-zero elements in each diagonal line, and calculating the total number of the non-zero elements;
step 2.2: obtaining the average value of the number of non-zero elements in each diagonal line;
step 2.3: taking the average value of the number of the non-zero elements as a threshold value, and obtaining a DIA matrix part and an offset array in an original matrix;
step 2.4: obtaining the number of non-zero elements of each column from the DIA matrix, wherein the number of the non-zero elements of each row in the original matrix corresponds to the number of the non-zero elements of each column in the DIA matrix; storing the number of non-zero elements of each column by using an array A, and calculating the average value of the array on each block;
step 2.5: taking the average value obtained for each block in step 2.4 as the threshold, sequentially insert values from array A into a new array B_i until the sum of the elements of B_i is greater than or equal to the average value, then save B_i as a block; calculate the standard deviation among the blocks, and take the block count with the minimum standard deviation as the optimal number of blocks;
step 2.6: non-zero elements in the DIA matrix are stored in a diagonal manner while obtaining a diagonal offset.
4. The method according to claim 1, wherein the step 3 specifically includes the following steps:
step 3.1: after the DIA matrix is removed from the original matrix, the number of non-zero elements in each row is obtained and stored into an array C;
step 3.2: set the number of blocks i, and store the first i values of the array C of step 3.1, one per row, into a new two-dimensional pointer D_i; then find the row of D_i whose sum of non-zero element counts is minimal and record its row number;
step 3.3: insert the remaining data of array C in sequence into the two-dimensional pointer D_i based on the row number of step 3.2;
step 3.4: obtain the two-dimensional pointers D_i for multiple block counts and calculate the standard deviation of the number of non-zero elements in each block; if the standard deviation of D_i is minimal, then i is the optimal number of partitions.
5. The method according to claim 4, wherein the step 4 is specifically,
step 4.1: passing the DIA matrix and diagonal offset from the host side to the device side; start the parallel DIA _ SPMV operation;
step 4.2: transferring the CSR array, the column index and the offset of the non-zero element from the host end to the equipment end; the operation of the parallel CSR _ SPMV starts.
6. The method for implementing a diagonal matrix SPMV on a GPU based on a hybrid compression format according to claim 5, wherein the step 5 comprises the following steps:
step 5.1: setting a CPU and GPU synchronization function;
and step 5.2: based on the synchronization function of step 5.1, the result after parallel DIA computation is transferred from the device side to the host side;
step 5.3: based on the synchronization function of step 5.1, transmit the result of the parallel CSR calculation from the device side to the host side;
step 5.4: the results of step 5.2 and step 5.3 are integrated.
7. The method of claim 6, wherein the parallel DIA computation of step 5.2 specifically: sets the computation start position of each thread;
sets the position of the y value computed by each thread; sets the maximum number of non-zero elements processed by each thread;
and performs the SPMV calculation of each thread, each thread iteratively accumulating over its elements to obtain the value of y.
8. The method according to claim 6, wherein the parallel CSR calculation of step 5.3 specifically: gives each thread access to the corresponding range of the array storing the non-zero elements;
calculates the length of the array accessed by each thread based on the maximum data length within each block in step 3.3;
executes the parallel SPMV computation, locating the correct position in x according to the column coordinates of the non-zero elements and multiplying;
has each thread iterate the product operation and perform the accumulation;
and repeats until every non-zero element of the CSR parallel stage has completed its SPMV calculation, the y values being kept on the GPU.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the method according to any one of claims 1 to 8.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the implementation method according to any one of claims 1 to 8.
CN202210569070.3A 2022-05-24 2022-05-24 Method for realizing diagonal matrix SPMV (sparse matrix) on GPU (graphics processing Unit) based on mixed compression format Pending CN115048215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210569070.3A CN115048215A (en) 2022-05-24 2022-05-24 Method for realizing diagonal matrix SPMV (sparse matrix) on GPU (graphics processing Unit) based on mixed compression format

Publications (1)

Publication Number Publication Date
CN115048215A true CN115048215A (en) 2022-09-13

Family

ID=83159431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210569070.3A Pending CN115048215A (en) 2022-05-24 2022-05-24 Method for realizing diagonal matrix SPMV (sparse matrix) on GPU (graphics processing Unit) based on mixed compression format

Country Status (1)

Country Link
CN (1) CN115048215A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937199A (en) * 2023-01-06 2023-04-07 山东济宁圣地电业集团有限公司 Spraying quality detection method for insulating layer of power distribution cabinet
CN115937199B (en) * 2023-01-06 2023-05-23 山东济宁圣地电业集团有限公司 Spraying quality detection method for insulating layer of power distribution cabinet
CN116186526A (en) * 2023-05-04 2023-05-30 中国人民解放军国防科技大学 Feature detection method, device and medium based on sparse matrix vector multiplication
CN116186526B (en) * 2023-05-04 2023-07-18 中国人民解放军国防科技大学 Feature detection method, device and medium based on sparse matrix vector multiplication


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination