CN114491402A - Calculation method for sparse matrix vector multiplication access optimization - Google Patents

Calculation method for sparse matrix vector multiplication access optimization Download PDF

Info

Publication number
CN114491402A
CN114491402A CN202210066814.XA CN202210066814A CN114491402A CN 114491402 A CN114491402 A CN 114491402A CN 202210066814 A CN202210066814 A CN 202210066814A CN 114491402 A CN114491402 A CN 114491402A
Authority
CN
China
Prior art keywords
block
calculation
thread block
threads
sparse matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210066814.XA
Other languages
Chinese (zh)
Inventor
胡长军
储根深
何远杰
丁哲昭
董玲玉
戴潮虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB filed Critical University of Science and Technology Beijing USTB
Priority to CN202210066814.XA priority Critical patent/CN114491402A/en
Publication of CN114491402A publication Critical patent/CN114491402A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a calculation method for sparse matrix vector multiplication access optimization, which is suitable for DCU (distributed computing Unit) and GPU (graphics processing Unit) architectures, and comprises the following steps: dividing an original sparse matrix into a plurality of blocks according to a fixed row number, wherein each block is independently calculated by one thread block, and a space with a fixed size is opened up for each thread block in an LDS (laser direct structuring); calculating the number of turns required to be calculated by each thread block; in one round of calculation, all threads in each thread block finish a plurality of times of non-zero element calculation and write the result into an LDS; one or more threads in each thread block sum the multiplication results of the LDS and store the results to a register; and after all rounds of calculation are finished, further calculating the result in the register, and writing the result back to the video memory. The invention is based on the original CSR format, does not need preprocessing, can fully utilize the access merging characteristic and realizes higher access bandwidth utilization.

Description

Calculation method for sparse matrix vector multiplication access optimization
Technical Field
The invention relates to the technical field of high-performance calculation and algorithm, in particular to a calculation method for sparse matrix vector multiplication access optimization suitable for DCU (distributed computing Unit) and general GPU (graphics processing Unit) architecture.
Background
Sparse Matrix-Vector Multiplication (SpMV) is widely used in the fields of scientific computation, graphic analysis, signal processing, and the like. The matrix involved in the calculation is sparse, i.e. most elements in the matrix are zero. Due to the efficient compression rate, the Compressed Sparse Row compression Format (CSR) is the most common storage form for Sparse matrices. In sparse matrix vector multiplication, the performance bottleneck is memory access, and how to improve the memory access performance becomes a key.
Graphics Processing Units (GPUs) can provide powerful parallel computing power and huge data throughput, are excellent in handling large-scale numerical computing problems, and provide a solid computing platform for high-performance computing systems. A DCU (Deep Computing Unit) is one type of GPU accelerator that provides high memory access bandwidth. This makes it possible to implement efficient SpMV computation on DCU accelerators and GPU platforms.
The existing SpMV algorithms (such as CSR-Scale, CSR-Vector and other methods) often face the problem of discontinuous access and storage, and cannot fully utilize the access and storage combination characteristics of a GPU, so that the access and storage bandwidth utilization is not high, and the overall performance loss is large. Other schemes (such as the CSR-Stream algorithm, the CSR-Adaptive algorithm, the HOLA algorithm, and the CSR5) that can efficiently use the memory bandwidth often need to perform preprocessing or convert the CSR format into other formats, and thus, the overhead is large.
Disclosure of Invention
The invention provides a calculation method and a calculation device for sparse matrix vector multiplication memory access optimization, which aim to solve the technical problems of low memory access bandwidth utilization, large overall performance loss and large extra cost in the prior art.
In order to solve the technical problems, the invention provides the following technical scheme:
on one hand, the invention provides a calculation method for sparse matrix vector multiplication access optimization, which is suitable for DCU and general GPU architecture, and comprises the following steps:
dividing an original sparse matrix into a plurality of blocks according to a fixed number of rows, wherein each block is independently calculated by one thread block, and respectively opening up a space with a fixed size for each thread block in an LDS (shared memory);
calculating the number of nonzero elements of each thread block in charge of calculation, and calculating the number of turns required to be calculated by each thread block according to the calculation result by combining the preset number of nonzero elements of each thread block which are calculated at most in each turn;
in one round of calculation, all threads in each thread block finish a plurality of times of calculation of products of non-zero elements and corresponding elements in corresponding vectors, and a multiplication result is written into an LDS space corresponding to the current thread block;
dividing all threads in each thread block into a group of single or multiple threads, summing the multiplication results of the LDS by each group of threads, and storing the summation results in a register;
and when all rounds of calculation are finished, obtaining a final calculation result based on the summation result stored in the register, and writing the final calculation result back to the video memory.
Further, the original sparse matrix is stored using CSR format.
Further, when the original sparse matrix is divided into a plurality of blocks according to the fixed number of rows, except for the last block, the number of rows of each block is equal and the number of rows is N, and the number of rows of the last block is less than or equal to N.
Furthermore, the number of non-zero elements calculated at most in each round of each thread block is R multiplied by M; wherein M is the number of threads contained in each thread block, and R is a positive integer;
in a single calculation turn, each thread in each thread block completes the calculation of the product of at most R non-zero elements in the sparse matrix block for which the current thread block is responsible and the corresponding elements in the corresponding vector.
Further, when all threads in each thread block are divided into a group of single or multiple threads, the number of threads in each group is a non-negative integer power of 2;
and the total number of each thread block is greater than or equal to the matrix row number processed by the thread block.
The technical scheme provided by the invention has the beneficial effects that at least:
1. the access bandwidth is high. Because all threads in the thread block calculate the nonzero elements participating in multiple lines together, the reading of the nonzero elements of the sparse matrix among the threads is continuous, the locality of data access is improved, and the access bandwidth can be effectively improved.
2. The acceleration effect is good. Because all threads participate in the calculation in a single round of calculation together, only part of the threads in one round are in an idle state under the worst condition, the utilization rate of calculation resources is improved, and the calculation efficiency can be effectively improved.
3. Saving LDS space. By adopting batch calculation, only the LDS space with a fixed size needs to be distributed for each thread block, and the space is repeatedly utilized in the calculation process without distributing the space for the calculation of each batch.
4. No pre-processing operation is required. The current preprocessing-based algorithms are relatively expensive, and even the preprocessing time is longer than the calculation time. The method of the invention does not need pretreatment operation and has higher performance.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram of a calculation method for sparse matrix vector multiplication access optimization according to an embodiment of the present invention;
fig. 2 is a flowchart of a calculation method for sparse matrix vector multiplication access optimization according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
First embodiment
In order to achieve a better effect of the SpMV in the DCU and GPU architectures and to solve the problem of discontinuous access of non-zero elements in each row of the sparse matrix between threads, the embodiment provides a sparse matrix vector multiplication access optimization calculation method suitable for the DCU and general GPU architectures. In each calculation round circulation of the block, data are loaded from a video memory to carry out multiplication calculation, results are written into an LDS, multiplication results are read from the LDS to carry out specification summation, and the results are written into a register. After the calculation round circulation of the block is completed, data is required to be read from the register to calculate a final result, and the result is written back to the video memory.
The implementation principle of the sparse matrix vector multiplication memory access optimization calculation method suitable for DCU and general GPU architectures in this embodiment is shown in fig. 1, where in fig. 1, italic characters are memory access or LDS operations, and bold represents a storage space in a video memory or a storage space in LDS. Each block is distributed with the same number of rows, the internal thread of the block continuously accesses and stores the non-zero element value of the matrix, calculates the product of the non-zero element and the corresponding x vector element value, and stores the multiplication result in the LDS. After all threads in the block finish the writing of the LDS, one or more threads are used for reading the values in the LDS for stipulation and summation, and finally, the result is written back to the memory. In fig. 1, row _ range and reduce _ range respectively represent the index range of the matrix row for which the thread is responsible and the range of the specification for which the thread is responsible in the current round of calculation. Taking thread 1 as an example, thread 1 is responsible for calculating row 1, the number range of row 1 in the value array is [3,11), and thread 1 only calculates the element with the number of 10 in the second round of calculation, so row _ range of thread 1 is [3,11), and reduce _ range is [10, 11).
Based on the above, the execution flow of the sparse matrix vector multiplication access optimization calculation method applicable to DCU and general GPU architecture of the present embodiment is shown in fig. 2, and includes the following steps:
and S1, dividing the calculation task and initializing.
The original sparse matrix is divided into a plurality of blocks according to the fixed line number, each block is independently calculated by one thread block, and a space with a fixed size is respectively opened up for each thread block in the LDS.
And S2, counting the calculation turns.
And calculating the number of nonzero elements of each thread block in charge of calculation, and calculating the number of turns required to be calculated by each thread block according to the calculation result and the preset number of nonzero elements of each thread block which are calculated at most in each turn.
And S3, calculating a multiplication result.
In one round of calculation, all threads in each thread block finish a plurality of times of calculation of products of non-zero elements and corresponding elements in corresponding vectors, and the multiplication results are written into the LDS space corresponding to the current thread block.
S4, sum reduction.
All threads in each thread block are divided into a group of single threads or multiple threads, each group of threads sums the multiplication results of the LDS respectively, and the summation results are stored in a register.
S5, write back the final result.
And after finishing all rounds of calculation, further calculating according to needs based on the summation result stored in the register to obtain a final calculation result, and writing the final calculation result back to the video memory.
Multiplication by sparse matrix vector
Figure BDA0003480520380000041
For example, the method of the present embodiment is further described. Wherein A is a sparse matrix with m rows, n columns and s non-zero elements,
Figure BDA0003480520380000042
is a dense vector of length t (where t is n),
Figure BDA0003480520380000043
is a vector of length m, and both alpha and beta are constants. The matrix a is stored using the CSR format, i.e. comprising three arrays: value [ s ]],row_ptr[m+1]And col _ index [ s ]]. The value of each non-zero element in the value array storage matrix A, the position of the first non-zero element of each row of the row _ ptr array storage matrix A in the value array, and the column sequence number of each non-zero element in the col _ index array storage matrix A. Vector quantity
Figure BDA0003480520380000044
Using arrays x [ t ] respectively]And an array y [ m ]]And (5) storing.
Based on the above, the specific implementation of the calculation method of the embodiment is as follows:
dividing a computing task and initializing:
(1) defining the number of rows of a matrix processed by each block to be N, dividing the matrix A into a plurality of blocks with equal rows, namely, except the last block, the number of rows of the rest blocks is equal and the number of rows is N, and the number of rows of the last block is less than or equal to N. The available matrix A can be divided into
Figure BDA0003480520380000051
The block, for the block numbered i, consists of all the rows in matrix a between the starting row number N × i and the ending row number min (N × (i +1), m).
(2) According to the block information of the matrix A, each block is allocated with one block, the number of the blocks is equal to the number of the blocks, and the number of the blocks is
Figure BDA0003480520380000052
Wherein the content of the first and second substances,
Figure BDA0003480520380000053
is a round-up operation. I.e. each block processes a fixed number of matrix rows.
(3) And according to the block information of the matrix A, each block corresponds to one block. This results in the starting and ending row numbers for each block responsible row. For block i, the line numbers of its start and end are block _ row _ begin ═ N × i, and block _ row _ end ═ min (N × (i +1), m), respectively.
(4) For each block, by using a row _ ptr array, the number of nonzero elements contained in the block corresponding to the block, block _ nnz _ nums, can be obtained. Specifically, the index of the first non-zero element of the start row of the block corresponding to the block in the value array can be obtained through the row _ ptr array, and similarly, the index of the first non-zero element of the stop row in the value array can be obtained, so that block _ nnz _ nums is row _ ptr [ block _ row _ end ] -row _ ptr [ block _ row _ begin ]. And block _ row _ inx _ begin is written as block _ ptr [ block _ row _ begin ], block _ row _ inx _ end is written as block _ row _ end ].
(5) Allocating an LDS array shared _ val: setting the number of threads in each block as M, and allocating an array shared _ val [ R × M ] with a fixed size in the LDS for each block]To save the non-zero element and of the block corresponding to the block
Figure BDA0003480520380000054
The product of the corresponding elements in the vector. Wherein R is a positive integer, but shared _ val [ R × M]Is no longer than the available LDS space size.
(6) Each thread in the block initializes a variable sum to 0 for subsequent specification summing.
Counting and calculating turns:
and counting the number of times of each block batch calculation, and ensuring that the shared _ val array can store the calculation result of each round. At most R × M nonzero elements can be calculated in each round (only the last round may be less than R × M elements, and the rest rounds are R × M elements), so that the number of times rounds needed by each block to be batched is obtained
Figure BDA0003480520380000055
Wherein the content of the first and second substances,
Figure BDA0003480520380000056
indicating rounding up.
And calculating a multiplication result:
(1) and calculating the index ranges of the corresponding value array and the col _ index array calculated by the round in the block. Let the current round number be k (0 ≦ k < rounds, i.e., k is numbered from 0), the start index of these two arrays is block _ round _ inx _ start ═ block _ row _ inx _ begin + kxrxm, and the end index is block _ round _ inx _ end ═ min (block _ round _ inx _ start + rxxm, block _ round _ inx _ end). The start index and the end index are represented in interval form: round _ range [ block _ round _ inx _ start, block _ round _ inx _ end).
(2) In a single round of computation, each thread in the block completes at most R non-zero elements and vectors in the block for which the block is responsible
Figure BDA0003480520380000061
And (4) calculating the product of the corresponding element in the array, and writing the product into a shared _ val array. For integer i, if block _ round _ inx _ start is satisfied, i is less than or equal to<block _ round _ inx _ end, i is the index number at the round. For index number i meeting the above condition, multiply calculate value [ i]×x[col_index[i]]Will be processed by a thread with sequence number mod (i-block _ row _ inx _ begin, M) in block, and the multiplication result will be written into shared _ val [ mod (i-row _ ptr [ block _ row _ begin ] of LDS],R×M)]Location. Here, mod (a, b) is a modulo operation, representing the remainder of a/b.
(3) Waiting for all threads in the block to complete the computation and ensuring that all results have been written to the LDS.
And (3) sum of specifications:
(1) dividing all threads in each block into a group of single or multiple threads, wherein the number of threads in each group is a non-negative integer power of 2, and the total number of threads of each block is greater than or equal to the number of matrix rows processed by the block. Let the number of threads in each group be 2KK is a non-negative integer and satisfies M/2KThe number of the thread group with the number of i in each block is more than or equal to N, and the number of the thread group with the number of i in each block is obtained to be i/2KThe integer part of (2).
(2) And according to the grouping information of the threads in the block, allocating a row to each group of threads with the group number smaller than N, wherein all the threads in the group of threads are responsible for the summation of the specifications of the row and writing back the final result to the video memory. The thread group numbered i in each block (where i < N) is responsible for the row number r of one row in the matrix block _ row _ begin + i.
(3) The start and end indices of the array of non-zero element values corresponding to the row for which thread group i is responsible are calculated (the row does not include the non-zero element corresponding to the end index), as shown in the example of fig. 1. The start index and the end index are expressed in interval form: row _ range [ reduce _ row _ idx _ begin, reduce _ row _ idx _ end ]). Wherein, the start index reduce _ row _ idx _ begin is row _ ptr [ r ], and the end index is reduce _ row _ idx _ end is row _ ptr [ r +1 ]. In addition, since the start and end indexes are fixed in each calculation round, the step can be performed in advance in the calculation task division.
(4) Before the thread group i is subjected to reduction, the start index and the end index of a shared _ val array in the LDS to be traversed are calculated. As shown in the example of fig. 1. The start index and the end index are expressed in interval form: reduce _ range [ reduce _ start, reduce _ end). The reduce _ range interval is the intersection of the range row _ range and the range round _ range. The value of "reduce _ start" is max (block _ round _ inx _ start, reduce _ row _ idx _ begin), and the value of "reduce _ end" is min (block _ round _ inx _ end, reduce _ row _ idx _ end).
(5) If the intersection in the previous step is not empty, that is, reduce _ start < reduce _ end, that is, reduce _ row _ idx _ begin < block _ round _ inx _ end and reduce _ row _ idx _ end > block _ round _ inx _ start, the thread group i executes the reduction operation in the next step, otherwise, the reduction operation in the next step is skipped.
(6) For each set of threads i meeting the condition in the block (wherein the condition is i)<N) that together read the multiplication result of the row (i.e., the non-zero element and vector corresponding to the row) from the shared _ val array of the LDS
Figure BDA0003480520380000071
The product of corresponding elements) and adding the read value to the variable sum of the thread with the number of 0 in the thread group by using shfl _ down and a parallel reduction mode. Specifically, the thread group i reads the elements (excluding the reduce _ end) indexed from the reduce _ start to the reduce _ end in parallel from the LDS, and accumulates them, and the accumulated result is denoted as sum 1. If a thread group comprises a plurality of threads, shfl _ down operation is needed to be utilized for inter-thread communication, and the sum1 specification of the accumulation result is added into the thread group to be compiledAbove the variable sum for the thread with number 0. If the thread group contains only one thread, the accumulation result sum1 in the thread is directly added to sum.
(7) And waiting for all thread operations in the block to complete.
(8) And if the block completes all rounds, namely the multiplication calculation of all the non-zero elements in the matrix block and the sum of the specifications of all the rows are completed, performing a step of writing back a final result. Otherwise, jumping to the step of calculating multiplication results, and executing the next round.
Write back the final result:
and (3) each group of threads in the block is numbered with 0, the variable sum is a protocol summation result, if the group of threads is responsible for the calculation of the r-th row in the matrix A, calculating alpha multiplied sum + beta multiplied by y [ r ], obtaining a final calculation result, and writing the final calculation result back to y [ r ].
To sum up, the calculation method of the embodiment divides the sparse matrix according to the number of rows of the matrix, all threads in each thread block complete the calculation of the same number of rows together, each thread is responsible for the calculation of a plurality of elements in a plurality of rows, and the storage space is allocated in advance in the LDS. The computing method can fully utilize the architecture characteristics of the DCU and the GPU, and utilize the access and memory merging characteristics of hardware as much as possible, so that the access and memory bandwidth can be greatly improved.
It should also be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
Finally, it should be noted that while the above describes a preferred embodiment of the invention, it will be appreciated by those skilled in the art that, once the basic inventive concepts have been learned, numerous changes and modifications may be made without departing from the principles of the invention, which shall be deemed to be within the scope of the invention. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Claims (5)

1. A sparse matrix vector multiplication memory access optimization computing method is suitable for DCU and GPU architectures, and is characterized by comprising the following steps:
dividing an original sparse matrix into a plurality of blocks according to a fixed number of rows, wherein each block is independently calculated by one thread block, and a space with a fixed size is respectively opened up for each thread block in an LDS (laser direct structuring) mode;
calculating the number of nonzero elements of each thread block in charge of calculation, and calculating the number of turns required to be calculated by each thread block according to the calculation result by combining the preset number of nonzero elements of each thread block which are calculated at most in each turn;
in one round of calculation, all threads in each thread block finish a plurality of times of calculation of products of non-zero elements and corresponding elements in corresponding vectors, and a multiplication result is written into an LDS space corresponding to the current thread block;
dividing all threads in each thread block into a group of single or multiple threads, summing the multiplication results of the LDS by each group of threads, and storing the summation results in a register;
and when all rounds of calculation are finished, obtaining a final calculation result based on the summation result stored in the register, and writing the final calculation result back to the video memory.
2. The sparse matrix vector multiplication access optimization calculation method of claim 1, wherein the original sparse matrix is stored using a CSR format.
3. The sparse matrix vector multiplication access optimization calculation method of claim 1, wherein when the original sparse matrix is divided into a plurality of blocks according to fixed number of rows, except for the last block, the number of rows of each block is equal and the number of rows is N, and the number of rows of the last block is less than or equal to N.
4. The sparse matrix vector multiplication access optimization calculation method of claim 1, wherein the number of non-zero elements calculated at most per round of each thread block is R x M; wherein M is the number of threads contained in each thread block, and R is a positive integer;
in a single calculation turn, each thread in each thread block completes the calculation of the product of at most R non-zero elements in the sparse matrix block for which the current thread block is responsible and the corresponding elements in the corresponding vector.
5. The sparse matrix vector multiply access optimization computing method of claim 1, wherein when all threads in each thread block are divided in groups of single or multiple threads, the number of threads in each group is a non-negative integer power of 2;
and the total number of each thread block is greater than or equal to the matrix row number processed by the thread block.
CN202210066814.XA 2022-01-20 2022-01-20 Calculation method for sparse matrix vector multiplication access optimization Pending CN114491402A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210066814.XA CN114491402A (en) 2022-01-20 2022-01-20 Calculation method for sparse matrix vector multiplication access optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210066814.XA CN114491402A (en) 2022-01-20 2022-01-20 Calculation method for sparse matrix vector multiplication access optimization

Publications (1)

Publication Number Publication Date
CN114491402A true CN114491402A (en) 2022-05-13

Family

ID=81472378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210066814.XA Pending CN114491402A (en) 2022-01-20 2022-01-20 Calculation method for sparse matrix vector multiplication access optimization

Country Status (1)

Country Link
CN (1) CN114491402A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115718724A (en) * 2023-01-09 2023-02-28 阿里巴巴(中国)有限公司 GPU (graphics processing Unit), data selection method and chip
CN116150553A (en) * 2023-03-01 2023-05-23 北京科技大学 Sparse AMG optimization method for CPU+DCU heterogeneous mixed architecture
CN116610424A (en) * 2023-03-06 2023-08-18 北京科技大学 Template calculation two-dimensional thread block selection method based on GPU (graphics processing Unit) merging memory access
CN117725348A (en) * 2024-02-07 2024-03-19 蓝象智联(杭州)科技有限公司 Thread management method and system in GPU computing large-scale array summation process

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115718724A (en) * 2023-01-09 2023-02-28 阿里巴巴(中国)有限公司 GPU (graphics processing Unit), data selection method and chip
CN115718724B (en) * 2023-01-09 2023-05-09 阿里巴巴(中国)有限公司 GPU, data selection method and chip
CN116150553A (en) * 2023-03-01 2023-05-23 北京科技大学 Sparse AMG optimization method for CPU+DCU heterogeneous mixed architecture
CN116150553B (en) * 2023-03-01 2023-07-21 北京科技大学 Sparse AMG optimization method for CPU+DCU heterogeneous mixed architecture
CN116610424A (en) * 2023-03-06 2023-08-18 北京科技大学 Template calculation two-dimensional thread block selection method based on GPU (graphics processing Unit) merging memory access
CN116610424B (en) * 2023-03-06 2024-04-26 北京科技大学 Template calculation two-dimensional thread block selection method based on GPU (graphics processing Unit) merging memory access
CN117725348A (en) * 2024-02-07 2024-03-19 蓝象智联(杭州)科技有限公司 Thread management method and system in GPU computing large-scale array summation process
CN117725348B (en) * 2024-02-07 2024-05-10 蓝象智联(杭州)科技有限公司 Thread management method and system in GPU computing large-scale array summation process

Similar Documents

Publication Publication Date Title
CN114491402A (en) Calculation method for sparse matrix vector multiplication access optimization
US20220012593A1 (en) Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization
US10936941B2 (en) Efficient data access control device for neural network hardware acceleration system
CN110070178B (en) Convolutional neural network computing device and method
US20180046895A1 (en) Device and method for implementing a sparse neural network
CN112465110B (en) Hardware accelerator for convolution neural network calculation optimization
CN108985450B (en) Vector processor-oriented convolution neural network operation vectorization method
CN111242277B (en) Convolutional neural network accelerator supporting sparse pruning based on FPGA design
CN110807170B (en) Method for realizing Same convolution vectorization of multi-sample multi-channel convolution neural network
CN112668708B (en) Convolution operation device for improving data utilization rate
CN110796236B (en) Vectorization implementation method for pooling of multi-sample multi-channel convolutional neural network
CN110415157A (en) A kind of calculation method and device of matrix multiplication
CN109840585B (en) Sparse two-dimensional convolution-oriented operation method and system
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
CN115186802A (en) Block sparse method and device based on convolutional neural network and processing unit
KR20220154764A (en) Inference engine circuit architecture
CN114491401A (en) Adaptive sparse matrix vector multiplication strategy selection and optimization method
CN115309333A (en) Data storage format of strip-shaped sparse matrix and multiplication acceleration method thereof
CN113762493A (en) Neural network model compression method and device, acceleration unit and computing system
CN113721982A (en) Sparse matrix storage method, vector calculation method and electronic equipment
CN115828044B (en) Dual sparsity matrix multiplication circuit, method and device based on neural network
CN116842304A (en) Method and system for calculating irregular sparse matrix
CN111191774B (en) Simplified convolutional neural network-oriented low-cost accelerator architecture and processing method thereof
CN116992203A (en) FPGA-based large-scale high-throughput sparse matrix vector integer multiplication method
CN114819127B (en) Back pressure index type combined calculation unit based on FPGA

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination