CN110580675A - Matrix storage and calculation method suitable for GPU hardware - Google Patents

Matrix storage and calculation method suitable for GPU hardware

Info

Publication number
CN110580675A
CN110580675A (application CN201910859641.5A)
Authority
CN
China
Prior art keywords
matrix
zero
column
data
elements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910859641.5A
Other languages
Chinese (zh)
Inventor
邵雪
王晓光
周振亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huada Empyrean Software Co Ltd
Beijing CEC Huada Electronic Design Co Ltd
Original Assignee
Beijing CEC Huada Electronic Design Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing CEC Huada Electronic Design Co Ltd filed Critical Beijing CEC Huada Electronic Design Co Ltd
Priority to CN201910859641.5A priority Critical patent/CN110580675A/en
Publication of CN110580675A publication Critical patent/CN110580675A/en
Pending legal-status Critical Current

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00: General purpose image data processing
    • G06T1/20: Processor architectures; Processor configuration, e.g. pipelining
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00: General purpose image data processing
    • G06T1/60: Memory management

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Complex Calculations (AREA)

Abstract

A matrix storage and calculation method suitable for GPU hardware comprises the following steps: 1) storing the number of rows, the number of columns, the non-zero elements, a flag indicating whether each element is non-zero, and the position of the first non-zero element; 2) accessing matrix elements through the GPU, determining whether each element is non-zero, obtaining the values of the non-zero elements, and setting the values of the non-zero elements of the matrix; 3) performing matrix operations using the GPU. The method enables multi-threaded, high-speed access to any element of the matrix on GPU hardware, thereby greatly improving the speed of matrix computation on the GPU.

Description

Matrix storage and calculation method suitable for GPU hardware
Technical Field
The invention relates to the field of high-performance computing on GPU (Graphics Processing Unit) hardware, in particular to high-performance matrix multiplication and LU decomposition on GPU hardware, and specifically to a matrix storage and calculation method suitable for GPU hardware.
Background
In recent years, the scale of matrix operations in high-performance computing has grown rapidly. The traditional CPU architecture is limited by its power consumption bottleneck, making further performance improvement difficult, and it cannot meet the computational demand. By contrast, the GPU has abundant computing resources and high data access bandwidth, and in the ideal case can be tens of times faster than the CPU. However, matrix decomposition has strong data dependencies, so its algorithms are difficult to optimize and GPU adoption has progressed slowly.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a matrix storage and calculation method suitable for GPU hardware, which makes full use of the characteristics of GPU hardware to realize high-performance matrix computation.
In order to achieve the above object, the matrix storage and calculation method suitable for GPU hardware provided by the present invention comprises the following steps:
1) storing the number of rows, the number of columns, the non-zero elements, a flag indicating whether each element is non-zero, and the position of the first non-zero element;
2) accessing matrix elements through the GPU, determining whether each element is non-zero, obtaining the values of the non-zero elements, and setting the values of the non-zero elements of the matrix;
3) performing matrix operations using the GPU.
Further, the step 1) further comprises:
storing the number of rows and columns of the matrix;
Sequentially storing non-zero elements of the matrix to a first array according to the sequence of rows or columns of the matrix;
Storing a flag indicating whether each element in each row or column is non-zero in a second array;
storing the position of the starting non-zero element of each row or column in the first array to a third array.
Further, the step of sequentially storing the non-zero elements of the matrix to the first array in the order of the rows or columns of the matrix further includes determining the size of the first array according to the total number of non-zero elements of the matrix.
Further, the step of storing the flag indicating whether each element in each row or column is non-zero into the second array further includes determining the size of the second array according to the number of rows or columns of the matrix, and continuously storing the non-zero flags of the matrix elements of each row or column into the second array in order from the low bit to the high bit, where each bit corresponds to one matrix element and a bit value of 1 indicates that the corresponding matrix element is non-zero.
Further, the step 2) further comprises the following steps:
acquiring the non-zero flag bit according to the position information of the data to be read;
reading the position, in the first array, of the first non-zero element of the row or column containing the data to be read, recorded as the first position;
calculating the position difference between the data to be read and the first non-zero element of its row or column, recorded as the second position;
calculating the position of the data to be read in the first array, recorded as the third position, where the third position is the first position plus the second position;
reading the value of the data to be read from the first array according to the third position.
Further, the step 3) includes matrix addition, matrix subtraction, matrix multiplication and a matrix LU decomposition algorithm.
Further, the matrix addition comprises the steps of:
judging whether the two corresponding elements of the two matrices participating in the addition operation are non-zero;
if both corresponding elements are zero, the result of their addition or subtraction is zero;
if only one of the two corresponding elements is non-zero, the result of the addition or subtraction is the positive or negative value of the non-zero element;
if both corresponding elements are non-zero, the result is the addition or subtraction of the two corresponding elements.
Further, the matrix multiplication comprises the following steps:
first judging C_ij: if C_ij is a zero element, the computation ends; otherwise continuing;
initializing the variable v = 0;
traversing the elements a_ik of the i-th row of the A matrix for k from 1 to C_A and the elements b_kj of the j-th column of the B matrix for k from 1 to R_B; if a_ik and b_kj are both non-zero elements, then v = v + a_ik × b_kj;
obtaining the result C_ij = v.
Further, the matrix LU decomposition algorithm includes the following steps:
reading the compressed matrix data into a dense matrix D, where d_ij = a_ij if a_ij is a non-zero element and d_ij = 0 otherwise, then synchronizing all threads of the GPU to ensure the consistency of the data in the D matrix;
traversing the k-th row and k-th column of the D matrix for k from 1 to R or C, repeating the following steps:
taking the elements of row k from column k to column C of the D matrix as the k-th row of the U matrix: if u_kj is a non-zero element then u_kj = d_kj, with j ranging from k to C;
dividing the elements of column k from row k+1 to row R of the D matrix by d_kk to obtain the k-th column of the L matrix: if l_ik is a non-zero element then l_ik = d_ik / d_kk, with i ranging from k+1 to R;
updating all elements below and to the right of row k, column k of the D matrix: if l_ik and u_kj are both non-zero elements then d_ij = d_ij - l_ik × u_kj, with i ranging from k+1 to R and j ranging from k+1 to C;
synchronizing all threads of the GPU to ensure the consistency of all data in the D matrix.
To achieve the above object, the present invention further provides a computer-readable storage medium on which computer instructions are stored; when executed, the computer instructions perform the steps of the above matrix storage and calculation method suitable for GPU hardware.
Beneficial effects: according to the matrix storage and calculation method suitable for GPU hardware, the positions of the non-zero elements of the matrix in the storage space are calculated using the non-zero flags, so that any element of the matrix can be accessed at high speed by multiple threads on GPU hardware. This improves the efficiency with which the GPU hardware accesses matrix elements and thereby greatly improves the speed of matrix computation on GPU hardware.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart of the matrix storage and calculation method suitable for GPU hardware according to the present invention;
FIG. 2 is a schematic diagram of the matrix storage data according to an embodiment of the invention;
FIG. 3 is a flowchart of the matrix LU decomposition according to an embodiment of the invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Fig. 1 is a flowchart of a matrix storage and calculation method suitable for GPU hardware according to the present invention, and the following describes in detail the matrix storage and calculation method suitable for GPU hardware according to the present invention with reference to fig. 1.
At step 11, the data is stored.
In this step, storing the data further comprises the steps of:
(111) Storing the number of rows and columns of the matrix;
(112) sequentially storing the non-zero elements of the matrix in the order of the matrix rows (columns);
(113) storing a flag indicating whether each element in each row (column) is non-zero;
(114) storing the position of the first non-zero element of each row (column).
Fig. 2 is a schematic diagram of the matrix storage data according to an embodiment of the present invention; steps (111) to (114) are described below with reference to fig. 2.
First, in step (111), the number of rows and columns of the matrix are stored.
In the embodiment shown in fig. 2, the matrix has 5 rows and 6 columns, and these two numbers are stored.
At step (112), the non-zero elements of the matrix are stored sequentially in the order of the matrix rows (columns).
In this step, the storage order of the matrix data (by rows or by columns) is chosen as needed; the following takes the row-first scheme as an example. A storage space M[S] for the matrix elements is prepared according to the total number S of non-zero elements of the matrix (shown as 101 in fig. 2). The non-zero elements of the matrix are then stored sequentially in the order of the matrix rows (columns).
Specifically, referring to fig. 2, the total number of non-zero elements of the matrix is S = 12, so the prepared element storage space is M[12]. The non-zero elements of the 5 × 6 matrix are then stored in order in M[12].
At step (113), a flag indicating whether each element in each row (column) is non-zero is stored.
In this step, an array F[R] for the non-zero flags is prepared according to the number R of rows (columns) of the matrix (shown as 102 in fig. 2). F[i-1] stores the non-zero flags of all matrix elements of the i-th row (column) consecutively from the low bit to the high bit; each bit corresponds to one matrix element, and a bit value of 1 indicates that the matrix element is non-zero.
For a matrix whose number of columns (rows) is not greater than 32, each F[i] requires one 32-bit int variable; for no more than 64 columns (rows), one 64-bit int variable; for more than 64 columns (rows), ceil(number of columns / 64) 64-bit int variables, where ceil is the rounding-up operation. That is, when the number of columns (rows) exceeds 64, the flags of each row (column) are stored consecutively in several 64-bit int variables.
At step (114), the position of the starting non-zero element of each row (column) is stored.
In this step, the position stored is the ordinal position, within the contiguous storage space M[S] of step (112), of the first non-zero element of each row (column). One extra entry is appended at the end to store the total number of non-zero elements of the whole matrix, i.e., the position in M[S] of the last non-zero element of the last row (column) plus one.
Specifically, in this step, an array P[R+1] of row (column) start positions is prepared according to the number R of rows of the matrix (shown as 103 in fig. 2), where P[0] = 0, i.e., the start of M; P[i] = P[i-1] + the number of non-zero elements of the i-th row (column); and P[R] = the total number of non-zero elements of the matrix.
Therefore, when the data is stored by rows, a matrix A can be expressed as the following set of data:
the number of rows R, the number of columns C, and the total number of non-zero elements S of the matrix;
a non-zero element array M of length S;
a non-zero flag array F of length R × ceil(C/64);
a row start position array P of length R + 1.
If the number of matrix columns is not greater than 32, the length of the non-zero flag array is R.
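As a concrete illustration, the three arrays above can be built on the host side as follows. This is a minimal Python sketch, not the invention's GPU implementation; the function name `compress` and the use of Python lists and ints are illustrative, with one 64-bit flag word per ceil(C/64) columns as in the description:

```python
import math

def compress(dense):
    """Build the arrays of the storage scheme (row-major order).

    M: non-zero element values, stored row by row.
    F: flag words per row; bit j of a word is 1 iff that element is non-zero,
       packed from the low bit to the high bit as in the description.
    P: position in M of each row's first non-zero element, with the total
       non-zero count appended as the last entry.
    """
    R = len(dense)
    C = len(dense[0]) if R else 0
    words_per_row = math.ceil(C / 64) if C else 0
    M, F, P = [], [], [0]
    for row in dense:
        words = [0] * words_per_row
        for j, v in enumerate(row):
            if v != 0:
                M.append(v)                        # store the non-zero value
                words[j // 64] |= 1 << (j % 64)    # set its flag bit
        F.extend(words)
        P.append(len(M))                           # P[i] = P[i-1] + nnz of row i
    return R, C, M, F, P
```

For a small 3 × 3 example such as [[1,0,2],[0,0,3],[4,5,0]], this yields M = [1,2,3,4,5], F = [0b101, 0b100, 0b011] and P = [0, 2, 3, 5], matching the M/F/P layout described above.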
At step 12, the data is read.
In this step, data is read based on the data stored in step 11. Specifically, this step further comprises the steps of:
(121) reading whether the j-th bit of the i-th datum in the flag array F of the matrix storage data is true, to judge whether the element at row (column) i, column (row) j of the matrix is non-zero;
(122) reading the i-th datum of the array P of the matrix storage data to obtain the position P_i of the first non-zero element of the i-th row (column) in the array M;
(123) computing the sum of bits 1 through j-1 of the i-th datum of the array F to obtain the position Q_i of the element at row (column) i, column (row) j relative to the first non-zero element of its row (column) in the array M;
(124) adding P_i and Q_i to obtain the position of the element at row (column) i, column (row) j in the array M, after which read and assignment operations can be performed.
Specifically, in step (121), whether the matrix element a_ij is non-zero is judged as follows:
take the ((i-1) × ceil(C/64) + (j-1)/64)-th datum f from the non-zero flag array F (integer division);
compute r = (j-1) mod 64;
take the r-th bit of the datum f: if it is 1, a_ij is a non-zero element; if it is 0, a_ij is zero.
For a matrix with no more than 32 columns, this simplifies to:
take the (i-1)-th datum f from the non-zero flag array F;
take the (j-1)-th bit of the datum f: if it is 1, a_ij is a non-zero element; if it is 0, a_ij is zero.
At step (122), the position p of the first non-zero element of the i-th row is read: p = P[i-1].
At step (123), the offset q of the matrix element a_ij relative to the first non-zero element of row i is calculated:
initialize the offset q = 0;
traverse the elements a_ik of the i-th row for k from 1 to j-1; if a_ik is a non-zero element, then q = q + 1.
When the flags fit within 64 bits, the flag bits below bit j can be fetched at once and summed using the GPU's population-count instruction to increase speed.
At step (124), the element is read as a_ij = M[p+q], or written as M[p+q] = a_ij (the assignment is valid only when a_ij is a non-zero element).
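Steps (121) to (124) can be sketched as follows. This is a hedged, CPU-side Python illustration (the name `get_element` is not from the patent), in which a software popcount via `bin(...).count("1")` stands in for the GPU's population-count instruction:

```python
def get_element(C, M, F, P, i, j):
    """Read a_ij (1-based i, j) from the compressed arrays M, F, P.

    Mirrors steps (121)-(124): test the flag bit of a_ij, then count the
    non-zero flags before column j to obtain the offset q into M.
    """
    words = -(-C // 64)                          # ceil(C / 64) flag words per row
    f = F[(i - 1) * words + (j - 1) // 64]       # step (121): fetch the flag word
    if not (f >> ((j - 1) % 64)) & 1:
        return 0                                 # flag bit is 0: a_ij is zero
    p = P[i - 1]                                 # step (122): first position
    q = 0                                        # step (123): second position
    for w in range(words):                       # count set flag bits before column j
        span = min(64, (j - 1) - w * 64)
        if span <= 0:
            break
        mask = (1 << span) - 1                   # keep only bits below column j
        q += bin(F[(i - 1) * words + w] & mask).count("1")
    return M[p + q]                              # step (124): third position = p + q
```

With the arrays of the 3 × 3 example above (M = [1,2,3,4,5], F = [0b101, 0b100, 0b011], P = [0,2,3,5]), `get_element(3, M, F, P, 3, 2)` returns 5 and `get_element(3, M, F, P, 1, 2)` returns 0.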
At step 13, data calculations are performed.
In this step, the addition (subtraction), multiplication, and LU decomposition operations of the matrix can be implemented based on the data storage and data reading methods of step 11 and step 12.
For matrix addition (subtraction), let the two matrices participating in the operation be A and B. Judge whether the elements a_ij and b_ij at row i, column j of A and B are non-zero, with results f_A and f_B:
(1) if f_A and f_B are both 1, read the element values of A and B at row i, column j and add (subtract) them to obtain the element at row i, column j of the result matrix C;
(2) if f_A is 1 and f_B is 0, take the element value of A at row i, column j as the element at row i, column j of C;
(3) if f_A is 0 and f_B is 1, take the positive (negative) of the element value of B at row i, column j as the element at row i, column j of C;
(4) if f_A and f_B are both 0, the element at row i, column j of C is also zero.
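The four cases above can be sketched per element as follows; a hedged Python illustration (the name `add_elem` and the `subtract` flag are not from the patent), where `fa` and `fb` play the role of the flag bits f_A and f_B:

```python
def add_elem(fa, a, fb, b, subtract=False):
    """Combine one element pair of A and B following cases (1)-(4).

    fa, fb are the non-zero flag bits; a, b are the element values
    (only consulted when the corresponding flag is 1).
    """
    if fa and fb:
        return a - b if subtract else a + b      # case (1): both non-zero
    if fa:
        return a                                 # case (2): only A non-zero
    if fb:
        return -b if subtract else b             # case (3): only B non-zero
    return 0                                     # case (4): both zero
```

Note that cases (2) to (4) never touch the value arrays, so zero elements cost no memory reads of M.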
For the multiplication of matrices A and B, the number of columns of A must equal the number of rows of B, i.e. C_A = R_B. It can be realized by the following algorithm:
first judge C_ij: if C_ij is a zero element, the computation ends; otherwise continue;
initialize the variable v = 0;
traverse the elements a_ik of the i-th row of the A matrix for k from 1 to C_A and the elements b_kj of the j-th column of the B matrix for k from 1 to R_B; if a_ik and b_kj are both non-zero elements, then v = v + a_ik × b_kj;
obtain the result C_ij = v.
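The algorithm above, computing one result element C_ij, can be sketched in Python over dense inputs; this is an illustration only (the name `matmul_element` is assumed), with explicit zero tests standing in for the non-zero flag checks of the compressed format:

```python
def matmul_element(A, B, i, j):
    """Compute C_ij = sum over k of a_ik * b_kj, skipping zero operands.

    A is R_A x C_A and B is R_B x C_B with C_A == R_B; the indices i, j
    are 1-based as in the description.
    """
    v = 0
    for k in range(len(A[0])):                   # k = 1 .. C_A
        a_ik, b_kj = A[i - 1][k], B[k][j - 1]
        if a_ik != 0 and b_kj != 0:              # both non-zero: accumulate
            v += a_ik * b_kj
    return v
```

In the GPU version each (i, j) pair is assigned to its own thread, which is safe because distinct result positions never share a write target.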
The element computation at each position of the matrix addition, subtraction and multiplication operations is multithread-safe, so the multithreaded parallelism of GPU operation can be exploited, with each thread computing the result of one position.
The LU decomposition operation of the matrix requires data access between threads, using shared memory and thread synchronization techniques.
Fig. 3 is a flowchart of the matrix LU decomposition according to an embodiment of the present invention, described in detail below with reference to fig. 3:
(1) Read the compressed matrix data described by the invention (the numbers of rows and columns must be equal) into a complete, contiguous dense matrix D of size R × C. The storage space of the dense matrix is shared memory, so that any thread of the GPU can access it:
if a_ij is a non-zero element, d_ij = a_ij; otherwise d_ij = 0;
synchronize all threads of the GPU to ensure the consistency of the data in the D matrix.
(2) Traverse the k-th row and k-th column of the D matrix for k from 1 to R (C), repeating the following steps (3) to (6):
(3) Take the elements of row k from column k to column C of the D matrix as the k-th row of the U matrix:
if u_kj is a non-zero element, then u_kj = d_kj, with j ranging from k to C;
(4) Divide the elements of column k from row k+1 to row R of the D matrix by d_kk to obtain the k-th column of the L matrix:
if l_ik is a non-zero element, then l_ik = d_ik / d_kk, with i ranging from k+1 to R;
(5) Update all elements below and to the right of row k, column k of the D matrix:
if l_ik and u_kj are both non-zero elements, then d_ij = d_ij - l_ik × u_kj, with i ranging from k+1 to R and j ranging from k+1 to C;
(6) Synchronize all threads of the GPU to ensure the consistency of all data in the D matrix.
During the decomposition, the computation of the L, U and D matrix elements for the same value of k can be parallelized using the GPU's multithreading, with each thread processing a different (i, j) position for better parallel efficiency. Different values of k must be computed in order, and the data in shared memory is synchronized using thread synchronization operations.
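Steps (2) to (6) amount to an in-place Doolittle LU factorization of D. A sequential Python sketch follows (assuming a non-singular dense matrix with no pivoting, as the description implies; the name `lu_in_place` is illustrative, and the loops that would run as GPU threads for a fixed k run serially here):

```python
def lu_in_place(D):
    """In-place LU factorization of a dense square matrix D.

    After the call, the upper triangle of D (including the diagonal)
    holds U and the strict lower triangle holds L (unit diagonal of L
    is implicit). On the GPU, each (i, j) update for a fixed k runs in
    its own thread with a barrier between successive k values.
    """
    n = len(D)
    for k in range(n):
        pivot = D[k][k]                          # u_kk; row k is already U's row k
        for i in range(k + 1, n):
            D[i][k] /= pivot                     # l_ik = d_ik / d_kk
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                D[i][j] -= D[i][k] * D[k][j]     # d_ij = d_ij - l_ik * u_kj
    return D
```

Only the trailing (k+1..R) × (k+1..C) block is touched at step k, which is why the per-k updates are independent across threads while successive k values must be separated by a synchronization barrier.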
The invention further provides a computer-readable storage medium on which computer instructions are stored; when run, the computer instructions perform the steps of the matrix storage and calculation method suitable for GPU hardware described in the foregoing sections, which are not repeated here.
those of ordinary skill in the art will understand that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A matrix storage and calculation method suitable for GPU hardware, comprising the following steps:
1) storing the number of rows, the number of columns, the non-zero elements, a flag indicating whether each element is non-zero, and the position of the first non-zero element;
2) accessing matrix elements through the GPU, determining whether each element is non-zero, obtaining the values of the non-zero elements, and setting the values of the non-zero elements of the matrix;
3) performing matrix operations using the GPU.
2. The matrix storage and calculation method suitable for GPU hardware of claim 1, wherein the step 1) further comprises:
Storing the number of rows and columns of the matrix;
Sequentially storing non-zero elements of the matrix to a first array according to the sequence of rows or columns of the matrix;
Storing a flag indicating whether each element in each row or column is non-zero in a second array;
Storing the position of the starting non-zero element of each row or column in the first array to a third array.
3. A method for matrix storage and computation suitable for GPU hardware as claimed in claim 2, wherein the step of sequentially storing non-zero elements of the matrix into the first array in the order of rows or columns of the matrix further comprises determining the size of the first array according to the total number of non-zero elements of the matrix.
4. The method of claim 2, wherein the step of storing the flag indicating whether each element in each row or column is non-zero in the second array further comprises determining the size of the second array according to the number of rows or columns of the matrix, and continuously storing the non-zero flags of the matrix elements of each row or column in the second array in order from the low bit to the high bit, where each bit corresponds to one matrix element and a bit value of 1 indicates that the corresponding matrix element is non-zero.
5. The matrix storage and calculation method suitable for GPU hardware of claim 1, wherein the step 2) further comprises the following steps:
acquiring the non-zero flag bit according to the position information of the data to be read;
reading the position, in the first array, of the first non-zero element of the row or column containing the data to be read, recorded as the first position;
calculating the position difference between the data to be read and the first non-zero element of its row or column, recorded as the second position;
calculating the position of the data to be read in the first array, recorded as the third position, where the third position = the first position + the second position;
reading the value of the data to be read from the first array according to the third position.
6. A matrix storage and calculation method suitable for GPU hardware as claimed in claim 1, wherein said step 3) comprises matrix addition, matrix subtraction, matrix multiplication and matrix LU decomposition algorithm.
7. The matrix storage and calculation method for GPU hardware as in claim 6, wherein the matrix addition comprises the steps of:
judging whether the two corresponding elements of the two matrices participating in the addition operation are non-zero;
if both corresponding elements are zero, the result of their addition or subtraction is zero;
if only one of the two corresponding elements is non-zero, the result of the addition or subtraction is the positive or negative value of the non-zero element;
if both corresponding elements are non-zero, the result is the addition or subtraction of the two corresponding elements.
8. The matrix storage and calculation method suitable for GPU hardware according to claim 6, wherein the matrix multiplication comprises the following steps:
first judging C_ij: if C_ij is a zero element, the computation ends; otherwise continuing;
initializing the variable v = 0;
traversing the elements a_ik of the i-th row of the A matrix for k from 1 to C_A and the elements b_kj of the j-th column of the B matrix for k from 1 to R_B; if a_ik and b_kj are both non-zero elements, then v = v + a_ik × b_kj;
obtaining the result C_ij = v.
9. The matrix storage and calculation method for GPU hardware as in claim 6, wherein the LU decomposition algorithm of the matrix comprises the following steps:
reading the compressed matrix data into a dense matrix D, where d_ij = a_ij if a_ij is a non-zero element and d_ij = 0 otherwise, then synchronizing all threads of the GPU to ensure the consistency of the data in the D matrix;
traversing the k-th row and k-th column of the D matrix for k from 1 to R or C, repeating the following steps:
taking the elements of row k from column k to column C of the D matrix as the k-th row of the U matrix: if u_kj is a non-zero element then u_kj = d_kj, with j ranging from k to C;
dividing the elements of column k from row k+1 to row R of the D matrix by d_kk to obtain the k-th column of the L matrix: if l_ik is a non-zero element then l_ik = d_ik / d_kk, with i ranging from k+1 to R;
updating all elements below and to the right of row k, column k of the D matrix: if l_ik and u_kj are both non-zero elements then d_ij = d_ij - l_ik × u_kj, with i ranging from k+1 to R and j ranging from k+1 to C;
synchronizing all threads of the GPU to ensure the consistency of all data in the D matrix.
10. A computer-readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed, perform the steps of the matrix storage and calculation method suitable for GPU hardware of any of claims 1 to 9.
CN201910859641.5A 2019-09-11 2019-09-11 Matrix storage and calculation method suitable for GPU hardware Pending CN110580675A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910859641.5A CN110580675A (en) 2019-09-11 2019-09-11 Matrix storage and calculation method suitable for GPU hardware

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910859641.5A CN110580675A (en) 2019-09-11 2019-09-11 Matrix storage and calculation method suitable for GPU hardware

Publications (1)

Publication Number Publication Date
CN110580675A 2019-12-17

Family

ID=68811910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910859641.5A Pending CN110580675A (en) 2019-09-11 2019-09-11 Matrix storage and calculation method suitable for GPU hardware

Country Status (1)

Country Link
CN (1) CN110580675A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105593843A (en) * 2013-08-30 2016-05-18 微软技术许可有限责任公司 Sparse matrix data structure
CN106775598A (en) * 2016-12-12 2017-05-31 温州大学 A kind of Symmetric Matrices method of the compression sparse matrix based on GPU
CN108052309A (en) * 2017-12-26 2018-05-18 杭州迪普科技股份有限公司 A kind of object order method and device


Non-Patent Citations (3)

Title
Yin Mengjia et al., "Constructing a performance model for sparse matrix-vector multiplication on GPU", Computer Science *
Zhan Tongsheng et al., "Practical Numerical Algorithms: Applied Mathematics for Electronic Computers", Dalian University of Technology Press, 31 January 1992 *
Zhu Jinghua et al., "Analysis of Data Structure Problems and Examples", Huazhong University of Technology Press, 31 August 1995 *

Cited By (4)

Publication number Priority date Publication date Assignee Title
CN112306922A (en) * 2020-11-12 2021-02-02 山东云海国创云计算装备产业创新中心有限公司 Multi-data-pair multi-port arbitration method and related device
CN112306922B (en) * 2020-11-12 2023-09-22 山东云海国创云计算装备产业创新中心有限公司 Multi-data-to-multi-port arbitration method and related device
CN113094648A (en) * 2021-04-02 2021-07-09 算筹信息科技有限公司 Method for solving triangular matrix and matrix inner product by outer product accumulation
CN113094648B (en) * 2021-04-02 2022-08-09 算筹(深圳)信息科技有限公司 Method for solving triangular matrix and matrix inner product by outer product accumulation

Similar Documents

Publication Publication Date Title
CN107340993B (en) Arithmetic device and method
US10255547B2 (en) Indirectly accessing sample data to perform multi-convolution operations in a parallel processing system
US10346507B2 (en) Symmetric block sparse matrix-vector multiplication
CN109934331A (en) Device and method for executing artificial neural network forward operation
CN107170019B (en) Rapid low-storage image compression sensing method
US20200234129A1 (en) Techniques for removing masks from pruned neural networks
CN103336758A (en) Sparse matrix storage method CSRL (Compressed Sparse Row with Local Information) and SpMV (Sparse Matrix Vector Multiplication) realization method based on same
CN107341507B (en) GPU and cascade hash based rapid image SIFT feature matching method
DE102020112826A1 (en) PROCESS FOR EFFICIENT PERFORMANCE OF DATA REDUCTION IN PARALLEL PROCESSING UNITS
CN110580675A (en) Matrix storage and calculation method suitable for GPU hardware
WO2018129930A1 (en) Fast fourier transform processing method and device, and computer storage medium
US20190138922A1 (en) Apparatus and methods for forward propagation in neural networks supporting discrete data
CN103177414A (en) Structure-based dependency graph node similarity concurrent computation method
CN102647588B (en) GPU (Graphics Processing Unit) acceleration method used for hierarchical searching motion estimation
CN110796236A (en) Vectorization implementation method for pooling of multi-sample multi-channel convolutional neural network
CN106780415B (en) Histogram statistical circuit and multimedia processing system
CN110019184A (en) A kind of method of the orderly integer array of compression and decompression
JP2023070746A (en) Information processing program, information processing apparatus, and information processing method
CN106484532A (en) GPGPU parallel calculating method towards SPH fluid simulation
CN111797985A (en) Convolution operation memory access optimization method based on GPU
CN117539546A (en) Sparse matrix vector multiplication acceleration method and device based on non-empty column storage
Nishimura et al. Accelerating the Smith-waterman algorithm using bitwise parallel bulk computation technique on GPU
CN111191774B (en) Simplified convolutional neural network-oriented low-cost accelerator architecture and processing method thereof
CN111832144A (en) Full-amplitude quantum computation simulation method
CN202093573U (en) Parallel acceleration device used in industrial CT image reconstruction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 100102 floor 2, block a, No.2, lizezhong 2nd Road, Chaoyang District, Beijing
Applicant after: Beijing Huada Jiutian Technology Co.,Ltd.
Address before: 100102 floor 2, block a, No.2, lizezhong 2nd Road, Chaoyang District, Beijing
Applicant before: HUADA EMPYREAN SOFTWARE Co.,Ltd.
RJ01 Rejection of invention patent application after publication
Application publication date: 20191217