CN110580675A - Matrix storage and calculation method suitable for GPU hardware - Google Patents
Matrix storage and calculation method suitable for GPU hardware
- Publication number
- CN110580675A (application CN201910859641.5A)
- Authority
- CN
- China
- Prior art keywords
- matrix
- zero
- column
- data
- elements
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/60—Memory management
Landscapes
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Complex Calculations (AREA)
Abstract
A matrix storage and calculation method suitable for GPU hardware comprises the following steps: 1) storing the numbers of rows and columns, the non-zero elements, a flag indicating whether each element is non-zero, and the position of the starting non-zero element; 2) accessing matrix elements through a GPU (graphics processing unit), determining whether the matrix elements are non-zero, obtaining the values of the non-zero elements, and setting the values of the non-zero elements of the matrix; 3) performing matrix operations using the GPU. The method enables multi-threaded, high-speed access to any element of the matrix on GPU hardware, thereby greatly improving the speed of matrix computation on the GPU.
Description
Technical Field
The invention relates to the field of high-performance computing on GPU (graphics processing unit) hardware, in particular to high-performance computation of matrix multiplication and LU decomposition on GPU hardware, and more particularly to a matrix storage and calculation method suitable for GPU hardware.
Background
In recent years, the scale of matrix operations in high-performance computing has grown steadily, while the traditional CPU architecture is limited by a power-consumption bottleneck, can hardly improve its performance further, and cannot meet the computation demand. Compared with the CPU, the GPU has abundant computing resources and high data access bandwidth, and under ideal conditions can achieve a speed-up of tens of times over the CPU. However, matrix decomposition involves strong data dependencies, which makes the algorithm difficult to optimize, so progress in applying GPUs has been slow.
Disclosure of Invention
In order to solve the defects in the prior art, the invention aims to provide a matrix storage and calculation method suitable for GPU hardware, which makes full use of the characteristics of the GPU hardware and realizes high-performance calculation of a matrix.
In order to achieve the above object, the matrix storage and calculation method suitable for GPU hardware provided by the present invention comprises the following steps:
1) Storing the numbers of rows and columns, the non-zero elements, a flag indicating whether each element is non-zero, and the position of the starting non-zero element;
2) Accessing matrix elements through a GPU (graphics processing unit), determining whether the matrix elements are non-zero, obtaining the values of the non-zero elements, and setting the values of the non-zero elements of the matrix;
3) Performing matrix operations using the GPU.
Further, the step 1) further comprises:
Storing the number of rows and columns of the matrix;
Sequentially storing the non-zero elements of the matrix to a first array in the order of the rows or columns of the matrix;
Storing a flag indicating whether each element in each row or column is non-zero in a second array;
Storing the position, in the first array, of the starting non-zero element of each row or column to a third array.
Further, the step of sequentially storing the non-zero elements of the matrix to the first array in the order of the rows or columns of the matrix further includes determining the size of the first array according to the total number of non-zero elements of the matrix.
Further, the step of storing the flag indicating whether each element in each row or column is non-zero in the second array further includes determining the size of the second array according to the number of rows or columns of the matrix, and storing the non-zero flags of the matrix elements of each row or column consecutively in the second array from the low bit to the high bit, wherein each bit corresponds to one matrix element, and a bit value of 1 indicates that the corresponding matrix element is non-zero.
Further, the step 2) further comprises the following steps:
Obtaining the non-zero flag bit according to the position information of the data to be read;
Reading the position of the first non-zero element of the row or column containing the data to be read, and recording it as a first position;
Calculating the position offset between the data to be read and the first non-zero element of the row or column containing it, and recording it as a second position;
Calculating the position of the data to be read in the first array, and recording it as a third position, wherein the third position is the first position plus the second position;
Reading the value of the data to be read from the first array according to the third position.
Further, the step 3) includes matrix addition, matrix subtraction, matrix multiplication and matrix LU decomposition algorithm.
Further, the matrix addition (or subtraction) comprises the following steps:
Judging whether each of the two corresponding data in the two matrices participating in the operation is non-zero;
If the two corresponding data are both zero elements, the result of the addition or subtraction is zero;
If only one of the two corresponding data is non-zero, the result of the addition or subtraction is the positive or negative value of the non-zero data;
If both corresponding data are non-zero, the result is the sum or difference of the two corresponding data.
Further, the matrix multiplication comprises the following steps:
First, checking $C_{ij}$: if $C_{ij}$ is a non-zero element, ending; otherwise, continuing;
Initializing the variable $v = 0$;
Traversing the elements $a_{ik}$ of the i-th row of matrix A, with k from 1 to $C_A$, and the elements $b_{kj}$ of the j-th column of matrix B, with k from 1 to $R_B$; if $a_{ik}$ and $b_{kj}$ are both non-zero elements, then $v = v + a_{ik} \times b_{kj}$;
Obtaining the result $C_{ij} = v$.
Further, the LU decomposition algorithm of the matrix includes the following steps:
Reading the compressed matrix data into a dense matrix D, where $d_{ij} = a_{ij}$ if $a_{ij}$ is a non-zero element and $d_{ij} = 0$ otherwise, and then synchronizing all threads of the GPU to ensure the consistency of the data in the D matrix;
Traversing the k-th row and the k-th column of the D matrix being decomposed, with k in order from 1 to R or C, and repeating the following steps:
Taking the elements of the k-th row of the D matrix in columns k to C as the k-th row of the U matrix, where if $u_{kj}$ is a non-zero element then $u_{kj} = d_{kj}$, with j ranging from k+1 to C;
Dividing the elements of the k-th column of the D matrix in rows k+1 to R by $d_{kk}$ to obtain the k-th column of the L matrix, where if $l_{ik}$ is a non-zero element then $l_{ik} = d_{ik} / d_{kk}$, with i ranging from k+1 to R;
Updating all elements to the lower right of row k, column k of the D matrix, where if $l_{ik}$ and $u_{kj}$ are both non-zero elements then $d_{ij} = d_{ij} - l_{ik} \times u_{kj}$, with i ranging from k+1 to R and j ranging from k+1 to C;
Synchronizing all threads of the GPU to ensure the consistency of all data in the D matrix.
To achieve the above object, the present invention further provides a computer readable storage medium, on which computer instructions are stored, and the computer instructions execute the steps of the above matrix storage and calculation method suitable for GPU hardware when executed.
Advantageous effects: the matrix storage and calculation method suitable for GPU hardware according to the invention uses the non-zero flags to calculate the positions of the non-zero elements of the matrix in the storage space, so that any element of the matrix can be accessed at high speed by multiple threads on GPU hardware. This improves the efficiency with which GPU hardware accesses matrix elements and thereby greatly improves the speed of matrix computation on GPU hardware.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart of a matrix storage and computation method for GPU hardware according to the present invention;
FIG. 2 is a schematic diagram of the matrix storage data according to an embodiment of the invention;
FIG. 3 is a flow chart of matrix LU decomposition using the calculation method according to an embodiment of the invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Fig. 1 is a flowchart of a matrix storage and calculation method suitable for GPU hardware according to the present invention, and the following describes in detail the matrix storage and calculation method suitable for GPU hardware according to the present invention with reference to fig. 1.
At step 11, the data is stored.
In this step, storing the data further comprises the steps of:
(111) Storing the number of rows and columns of the matrix;
(112) Sequentially storing the non-zero elements of the matrix in the order of the rows (columns) of the matrix;
(113) Storing a flag indicating whether each element in each row (column) is non-zero;
(114) Storing the position of the starting non-zero element of each row (column).
Fig. 2 is a schematic diagram of the matrix storage data according to an embodiment of the present invention, and steps (111) to (114) are described below with reference to fig. 2.
First, in step (111), the number of rows and columns of the matrix are stored.
In the embodiment shown in fig. 2, the matrix has 5 rows and 6 columns, and these two values are stored.
At step (112), the non-zero elements of the matrix are stored sequentially in the order of the matrix rows (columns).
In this step, the storage order of the matrix data (row-major or column-major) is chosen as needed; the following description uses row-major order as an example. A storage space M[S] for the matrix elements (shown as 101 in fig. 2) is prepared according to the total number S of non-zero elements of the matrix. The non-zero elements of the matrix are then stored in it sequentially in the order of the matrix rows (columns).
Specifically, referring to fig. 2, the total number S of non-zero elements of the matrix is 12, so the storage space to be prepared for the matrix elements is M[12]. The non-zero elements of the 5 × 6 matrix are then stored in order in M[12].
At step (113), a flag is stored indicating whether each element in each row (column) is non-zero.
In this step, an array F[R] (shown as 102 in FIG. 2) is prepared to store the non-zero flags, sized according to the number of rows (columns) R of the matrix. F[i-1] stores the non-zero flags of all matrix elements in the i-th row (column) consecutively from the low bit to the high bit; each bit corresponds to one matrix element, and a bit value of 1 indicates that the element is non-zero.
If the matrix has no more than 32 columns (rows), each F[i] requires one 32-bit int variable; if it has no more than 64 columns (rows), each F[i] requires one 64-bit int variable; if it has more than 64 columns (rows), each F[i] requires ceil(number of matrix columns / 64) 64-bit int variables, where ceil denotes rounding up. That is, when the number of columns (rows) exceeds 64, the flags of each row (column) are stored consecutively in several 64-bit int variables.
At step (114), the position of the starting non-zero element of each row (column) is stored.
In this step, the position is the ordinal position, within the contiguous storage space M[S] of step (112), of the first non-zero element of each row (column). One additional entry is appended at the end to store the total number of non-zero elements of the whole matrix, which equals the ordinal position in M[S] of the last non-zero element of the last row (column) plus one.
Specifically, in this step, an array P[R+1] (shown as 103 in fig. 2) is prepared according to the number of rows R of the matrix to store the row (column) start positions, where P[0] = 0, i.e. the start position of M; P[i] = P[i-1] + the number of non-zero elements in the i-th row (column); and P[R] = the total number of non-zero elements in the matrix.
Therefore, when data is stored in rows, one matrix a can be expressed as a set of the following data:
The number of rows R, the number of columns C, and the total number of non-zero elements S of the matrix;
Non-zero element data array M, length S;
Non-zero flag array F, length R × ceil(C/64);
Row start position array P, length R + 1.
If the number of matrix columns is not greater than 32, the length of the non-zero flag array is R.
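The storage scheme of steps (111) to (114) can be sketched on the CPU as follows. This is a minimal Python illustration under stated assumptions, not the GPU implementation of the invention: the function name `compress` and the dictionary layout are hypothetical, 64-bit flag words are assumed for every matrix size, indices are 0-based (the text counts from 1), and the sample matrix is made up rather than taken from fig. 2.

```python
from math import ceil

WORD_BITS = 64  # assumption: flags are always packed into 64-bit words


def compress(dense):
    """Build R, C, S, M, F and P from a row-major list of lists."""
    R, C = len(dense), len(dense[0])
    words_per_row = ceil(C / WORD_BITS)
    M = []                         # non-zero values, stored row by row
    F = [0] * (R * words_per_row)  # one flag bit per matrix element
    P = [0] * (R + 1)              # P[i]: index in M of row i's first non-zero
    for i, row in enumerate(dense):
        for j, value in enumerate(row):
            if value != 0:
                # low bit first: bit (j mod 64) of word j // 64 marks column j
                F[i * words_per_row + j // WORD_BITS] |= 1 << (j % WORD_BITS)
                M.append(value)
        P[i + 1] = len(M)          # running count, so P[R] equals S
    return {"R": R, "C": C, "S": len(M), "M": M, "F": F, "P": P}


example = compress([[5, 0, 0, 2],
                    [0, 0, 0, 0],
                    [0, 3, 4, 0]])
```

For this sample, `M` holds the four non-zero values in row order, each entry of `F` is the bit mask of one row, and `P` repeats a value where a row is empty.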
At step 12, the data is read.
In this step, data reading is performed based on the data storage of step 11. Specifically, the step further comprises the steps of:
(121) Reading the j-th bit of the i-th data item in the flag array F of the stored matrix data to determine whether the element in the i-th row (column) and j-th column (row) of the matrix is non-zero;
(122) Reading the i-th data item in the array P of the stored matrix data to obtain the position $P_i$, in the array M, of the first non-zero element of the i-th row (column);
(123) Summing the 1st through (j-1)-th bits of the i-th data item in the array F, i.e. the flags of the elements preceding the j-th, to obtain the position $Q_i$ of the element in the i-th row (column) and j-th column (row) relative to the first non-zero element of that row (column) in the array M;
(124) Adding $P_i$ and $Q_i$ to obtain the position of the element in the i-th row (column) and j-th column (row) in the array M, and then performing the read or assignment operation.
Specifically, in step (121), it is judged whether the matrix element $a_{ij}$ is non-zero:
Take the data item f at index (i-1) × ceil(C/64) + floor((j-1)/64) of the non-zero flag array F;
Calculate the result r of (j-1) modulo 64;
Take the r-th bit of f: if it is 1, $a_{ij}$ is a non-zero element; if it is 0, $a_{ij}$ is zero.
For a matrix with no more than 32 columns, this simplifies to:
Take the (i-1)-th data item f of the non-zero flag array F;
Take the (j-1)-th bit of f: if it is 1, $a_{ij}$ is a non-zero element; if it is 0, $a_{ij}$ is zero.
At step (122), the position p of the first non-zero element of the i-th row of the matrix is read: p = P[i-1].
At step (123), the offset of the matrix element $a_{ij}$ relative to the position of the first non-zero element of row i is calculated:
Initialize the position offset q = 0;
Traverse the elements $a_{ik}$ of the i-th row, with k from 1 to j-1; if $a_{ik}$ is a non-zero element, then q = q + 1.
When j reaches 64, the flag bits of the first 64 elements can be fetched at once and summed using a GPU instruction to increase speed.
At step (124), the matrix element is read as $a_{ij}$ = M[p+q], or the assignment M[p+q] = $a_{ij}$ is performed to write it (valid only when $a_{ij}$ is a non-zero element).
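Steps (121) to (124) can be sketched as follows. This is an illustrative CPU version in Python, assuming 64-bit flag words and 0-based indices; the names `locate`, `get` and `set_nonzero` are hypothetical. The bit count is done here with `bin(...).count("1")`; on a GPU a population-count instruction would play this role (CUDA's `__popcll` intrinsic is one example, although the text does not name a specific instruction).

```python
WORD_BITS = 64  # assumption: flags are packed into 64-bit words


def locate(F, P, C, i, j):
    """Return the index in M of element (i, j), or None if it is a zero element."""
    words_per_row = -(-C // WORD_BITS)   # ceil(C / 64)
    base = i * words_per_row
    word, r = divmod(j, WORD_BITS)
    if not (F[base + word] >> r) & 1:    # step (121): test the flag bit
        return None
    p = P[i]                             # step (122): start of row i in M
    # step (123): count the set flag bits in front of column j
    q = sum(bin(F[base + w]).count("1") for w in range(word))
    q += bin(F[base + word] & ((1 << r) - 1)).count("1")
    return p + q                         # step (124): position in M


def get(M, F, P, C, i, j):
    pos = locate(F, P, C, i, j)
    return 0 if pos is None else M[pos]


def set_nonzero(M, F, P, C, i, j, value):
    pos = locate(F, P, C, i, j)
    if pos is None:
        raise IndexError("only elements flagged as non-zero can be assigned")
    M[pos] = value


# Compressed form of the made-up matrix [[5, 0, 0, 2], [0, 0, 0, 0], [0, 3, 4, 0]]
M, F, P, C = [5, 2, 3, 4], [0b1001, 0, 0b0110], [0, 2, 2, 4], 4
```

Because every lookup touches only F, P and one entry of M, independent threads can resolve different (i, j) positions without coordinating with each other.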
In step 13, data calculations are performed.
In this step, matrix addition (subtraction), multiplication, and LU decomposition can be implemented on the basis of the data storage and data reading methods of step 11 and step 12.
For matrix addition (subtraction), let the two matrices participating in the operation be matrix A and matrix B. For the elements $a_{ij}$ and $b_{ij}$ of matrix A and matrix B at the i-th row and j-th column, it is first judged whether each is non-zero, giving the results $f_A$ and $f_B$:
(1) If $f_A$ and $f_B$ are both 1, read the element values of matrices A and B at the i-th row and j-th column and perform the addition (subtraction) to obtain the element value at the i-th row and j-th column of the result matrix C;
(2) If $f_A$ is 1 and $f_B$ is 0, take the element value of matrix A at the i-th row and j-th column as the element value at the i-th row and j-th column of the result matrix C;
(3) If $f_A$ is 0 and $f_B$ is 1, take the positive (negative) element value of matrix B at the i-th row and j-th column as the element value at the i-th row and j-th column of the result matrix C;
(4) If $f_A$ and $f_B$ are both 0, the element value at the i-th row and j-th column of the result matrix C is also zero.
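The four cases above reduce to a small per-element function. The sketch below is an illustrative Python version; the name `combine` and the `subtract` parameter are hypothetical, and the value passed for an operand whose flag is 0 is simply ignored. How to flag a result that cancels to exactly zero in case (1) is not specified in the text; this sketch keeps the flag set.

```python
def combine(f_a, a, f_b, b, subtract=False):
    """Return (flag, value) for one position of C = A + B or C = A - B."""
    sign = -1 if subtract else 1
    if f_a and f_b:          # case (1): both operands are stored non-zeros
        return 1, a + sign * b
    if f_a:                  # case (2): only A contributes
        return 1, a
    if f_b:                  # case (3): only B contributes, negated for subtraction
        return 1, sign * b
    return 0, 0              # case (4): both are zero elements
```

For example, `combine(0, None, 1, 3.0, subtract=True)` yields the negated value of B's element, matching case (3).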
For the multiplication of matrices A and B, the number of columns of matrix A must equal the number of rows of matrix B, i.e. $C_A = R_B$. The multiplication can be implemented by the following algorithm:
First, check $C_{ij}$: if $C_{ij}$ is a non-zero element, end; otherwise, continue;
Initialize the variable $v = 0$;
Traverse the elements $a_{ik}$ of the i-th row of matrix A, with k from 1 to $C_A$, and the elements $b_{kj}$ of the j-th column of matrix B, with k from 1 to $R_B$; if $a_{ik}$ and $b_{kj}$ are both non-zero elements, then $v = v + a_{ik} \times b_{kj}$;
Obtain the result $C_{ij} = v$.
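The accumulation loop for one result position can be sketched as follows. This is an illustrative Python version with 0-based indices; `product_element`, `a_at` and `b_at` are hypothetical names, and plain dictionaries stand in for the compressed storage, with a missing key meaning that the flag bit is 0. The preliminary check on $C_{ij}$ is left out, because the text does not state how the flag of the result matrix is obtained.

```python
def product_element(a_at, b_at, inner, i, j):
    """Accumulate c_ij = sum over k of a_ik * b_kj, skipping zero elements."""
    v = 0
    for k in range(inner):   # inner = columns of A = rows of B
        a = a_at(i, k)       # None stands for "flag bit is 0"
        if a is None:
            continue
        b = b_at(k, j)
        if b is None:
            continue
        v += a * b           # multiply only when both elements are non-zero
    return v


# Made-up operands: A is 2 x 3, B is 3 x 2, stored as {(row, col): value}
A = {(0, 0): 2, (0, 2): 3, (1, 1): 4}
B = {(0, 1): 5, (2, 1): 6, (1, 0): 7}


def a_at(i, k):
    return A.get((i, k))


def b_at(k, j):
    return B.get((k, j))


c01 = product_element(a_at, b_at, 3, 0, 1)
```

Each call reads A and B but writes nothing shared, so one call per (i, j) position maps naturally onto one GPU thread per position.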
For the addition, subtraction, and multiplication of matrices, the computation of each element position of the result is thread-safe, so in GPU operation a multithreaded parallel algorithm can be used, with each thread computing the result of one position.
The LU decomposition of the matrix requires data to be exchanged between threads, using shared memory and thread synchronization.
Fig. 3 is a flowchart of matrix LU decomposition using the calculation method according to the embodiment of the present invention, which will be described in detail below with reference to fig. 3:
(1) Reading the data of the compressed matrix described by the invention (the numbers of rows and columns of the matrix should be equal) into a complete, contiguous dense matrix D. The dense matrix is stored in shared memory so that any thread of the GPU can access it, and its size is R × C:
If $a_{ij}$ is a non-zero element, $d_{ij} = a_{ij}$; otherwise $d_{ij} = 0$;
Synchronizing all threads of the GPU to ensure the consistency of the data in the D matrix;
(2) Traversing the k-th row and the k-th column of the D matrix being decomposed, with k in order from 1 to R (C), and repeating the following steps (3) to (6);
(3) Taking the elements of the k-th row of the D matrix in columns k to C as the k-th row of the U matrix:
If $u_{kj}$ is a non-zero element, then $u_{kj} = d_{kj}$, with j ranging from k+1 to C;
(4) Dividing the elements of the k-th column of the D matrix in rows k+1 to R by $d_{kk}$ to obtain the k-th column of the L matrix:
If $l_{ik}$ is a non-zero element, then $l_{ik} = d_{ik} / d_{kk}$, with i ranging from k+1 to R;
(5) Updating all elements to the lower right of row k, column k of the D matrix:
If $l_{ik}$ and $u_{kj}$ are both non-zero elements, then $d_{ij} = d_{ij} - l_{ik} \times u_{kj}$, with i ranging from k+1 to R and j ranging from k+1 to C;
(6) Synchronizing all threads of the GPU to ensure the consistency of all data in the D matrix.
In the decomposition process, the calculation process of L, U, D matrix elements under the condition of the same k value can use the multithreading technology of the GPU for parallel calculation, and each thread processes different (i, j) positions to obtain better parallel efficiency. Different values of k need to be calculated in order and the data in the shared memory is synchronized by using thread synchronization operation.
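The decomposition loop can be sketched sequentially as follows. This is an illustrative Python version under several assumptions that the text does not spell out: the function name `lu_decompose` is hypothetical, L is given a unit diagonal, the k-th row of U is copied from column k onward so that the pivot $u_{kk}$ is included, no pivoting is performed (a zero $d_{kk}$ raises `ZeroDivisionError`), and a test against 0 on the dense values stands in for the non-zero flag test. The inner loops over i and j for a fixed k are the part that would be spread across GPU threads, with a synchronization at the point marked in the comments.

```python
def lu_decompose(A):
    """Return (L, U) with A = L * U for a square dense matrix A (no pivoting)."""
    n = len(A)
    D = [list(row) for row in A]   # step (1): dense working copy
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):             # step (2): k must advance in order
        for j in range(k, n):      # step (3): row k of U
            if D[k][j] != 0:
                U[k][j] = D[k][j]
        for i in range(k + 1, n):  # step (4): column k of L
            if D[i][k] != 0:
                L[i][k] = D[i][k] / D[k][k]
        for i in range(k + 1, n):  # step (5): update the lower-right block
            if L[i][k] != 0:
                for j in range(k + 1, n):
                    if U[k][j] != 0:
                        D[i][j] -= L[i][k] * U[k][j]
        # step (6): on a GPU, all threads would synchronize here
    return L, U


L, U = lu_decompose([[4.0, 3.0],
                     [6.0, 3.0]])
```

Multiplying the returned factors reproduces the input matrix, which is a quick way to check a parallel implementation against this reference.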
The invention further provides a computer-readable storage medium on which computer instructions are stored; when run, the computer instructions execute the steps of the matrix storage and calculation method suitable for GPU hardware. The method is described in the foregoing sections and is not repeated here.
Those of ordinary skill in the art will understand that, although the present invention has been described in detail with reference to the foregoing embodiments, changes may be made to the embodiments, or equivalents substituted for some of their features, without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A matrix storage and calculation method suitable for GPU hardware, comprising the following steps:
1) Storing the numbers of rows and columns, the non-zero elements, a flag indicating whether each element is non-zero, and the position of the starting non-zero element;
2) Accessing matrix elements through a GPU (graphics processing unit), determining whether the matrix elements are non-zero, obtaining the values of the non-zero elements, and setting the values of the non-zero elements of the matrix;
3) Performing matrix operations using the GPU.
2. The matrix storage and calculation method suitable for GPU hardware of claim 1, wherein the step 1) further comprises:
Storing the number of rows and columns of the matrix;
Sequentially storing non-zero elements of the matrix to a first array according to the sequence of rows or columns of the matrix;
Storing a flag indicating whether each element in each row or column is non-zero in a second array;
Storing the position of the starting non-zero element of each row or column in the first array to a third array.
3. A method for matrix storage and computation suitable for GPU hardware as claimed in claim 2, wherein the step of sequentially storing non-zero elements of the matrix into the first array in the order of rows or columns of the matrix further comprises determining the size of the first array according to the total number of non-zero elements of the matrix.
4. The method of claim 2, wherein the step of storing the flag indicating whether each element in each row or column is non-zero in the second array further comprises determining the size of the second array according to the number of rows or columns of the matrix, and storing the non-zero flags of the matrix elements in the rows or columns respectively and continuously in the second array from low to high, wherein each bit corresponds to one matrix element, and a bit value of 1 indicates that the corresponding matrix element is non-zero.
5. The method for storing and computing matrices adapted for use in GPU hardware of claim 1, wherein said step 2) further comprises the steps of:
Obtaining the non-zero flag bit according to the position information of the data to be read;
Reading the position of the first non-zero element of the row or column of the data to be read, and recording the position as a first position;
Calculating the position difference between the data to be read and the first non-zero element of the row or column where the data to be read is located, and recording the position difference as a second position;
Calculating the position of the data to be read in the first array, and recording the position as a third position, wherein the third position = the first position + the second position;
And reading the value of the data to be read in the first array according to the third position.
6. A matrix storage and calculation method suitable for GPU hardware as claimed in claim 1, wherein said step 3) comprises matrix addition, matrix subtraction, matrix multiplication and matrix LU decomposition algorithm.
7. A method for matrix storage and computation for GPU hardware as in claim 6, where the matrix addition comprises the steps of:
Judging whether two corresponding data in the two matrixes participating in the addition operation are nonzero or not;
If the two corresponding data are both zero elements, the result of the addition or subtraction is zero;
If only one of the two corresponding data is non-zero, the result of the addition or subtraction is the positive or negative value of the non-zero data;
If both corresponding data are non-zero, the result is the sum or difference of the two corresponding data.
8. A matrix storage and calculation method suitable for GPU hardware according to claim 6, characterized in that the matrix multiplication comprises the following steps:
First, checking $C_{ij}$: if $C_{ij}$ is a non-zero element, ending; otherwise, continuing;
Initializing the variable $v = 0$;
Traversing the elements $a_{ik}$ of the i-th row of matrix A, with k from 1 to $C_A$, and the elements $b_{kj}$ of the j-th column of matrix B, with k from 1 to $R_B$; if $a_{ik}$ and $b_{kj}$ are both non-zero elements, then $v = v + a_{ik} \times b_{kj}$;
Obtaining the result $C_{ij} = v$.
9. A matrix storage and calculation method suitable for GPU hardware as in claim 6, wherein the LU decomposition algorithm of the matrix comprises the following steps:
Reading the compressed matrix data into a dense matrix D, where $d_{ij} = a_{ij}$ if $a_{ij}$ is a non-zero element and $d_{ij} = 0$ otherwise, and then synchronizing all threads of the GPU to ensure the consistency of the data in the D matrix;
Traversing the k-th row and the k-th column of the D matrix being decomposed, with k in order from 1 to R or C, and repeating the following steps:
Taking the elements of the k-th row of the D matrix in columns k to C as the k-th row of the U matrix, where if $u_{kj}$ is a non-zero element then $u_{kj} = d_{kj}$, with j ranging from k+1 to C;
Dividing the elements of the k-th column of the D matrix in rows k+1 to R by $d_{kk}$ to obtain the k-th column of the L matrix, where if $l_{ik}$ is a non-zero element then $l_{ik} = d_{ik} / d_{kk}$, with i ranging from k+1 to R;
Updating all elements to the lower right of row k, column k of the D matrix, where if $l_{ik}$ and $u_{kj}$ are both non-zero elements then $d_{ij} = d_{ij} - l_{ik} \times u_{kj}$, with i ranging from k+1 to R and j ranging from k+1 to C;
Synchronizing all threads of the GPU to ensure the consistency of all data in the D matrix.
10. A computer-readable storage medium having stored thereon computer instructions, wherein the computer instructions, when executed, perform the steps of the matrix storage and calculation method suitable for GPU hardware of any of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910859641.5A CN110580675A (en) | 2019-09-11 | 2019-09-11 | Matrix storage and calculation method suitable for GPU hardware |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910859641.5A CN110580675A (en) | 2019-09-11 | 2019-09-11 | Matrix storage and calculation method suitable for GPU hardware |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110580675A (en) | 2019-12-17 |
Family
ID=68811910
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910859641.5A Pending CN110580675A (en) | 2019-09-11 | 2019-09-11 | Matrix storage and calculation method suitable for GPU hardware |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110580675A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112306922A (en) * | 2020-11-12 | 2021-02-02 | 山东云海国创云计算装备产业创新中心有限公司 | Multi-data-pair multi-port arbitration method and related device |
CN113094648A (en) * | 2021-04-02 | 2021-07-09 | 算筹信息科技有限公司 | Method for solving triangular matrix and matrix inner product by outer product accumulation |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105593843A (en) * | 2013-08-30 | 2016-05-18 | 微软技术许可有限责任公司 | Sparse matrix data structure |
CN106775598A (en) * | 2016-12-12 | 2017-05-31 | 温州大学 | A kind of Symmetric Matrices method of the compression sparse matrix based on GPU |
CN108052309A (en) * | 2017-12-26 | 2018-05-18 | 杭州迪普科技股份有限公司 | A kind of object order method and device |
Non-Patent Citations (3)
Title |
---|
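Yin Mengjia (尹孟嘉) et al.: "Performance model construction for GPU sparse matrix-vector multiplication" (GPU稀疏矩阵向量乘的性能模型构造), Computer Science (《计算机科学》) * |
Zhan Tongsheng (战同胜) et al.: "Practical Numerical Algorithms: Applied Mathematics for Electronic Computers" (《实用数值算法-电子计算机应用数学》), 31 January 1992, Dalian University of Technology Press (大连理工出版社) * |
Zhu Jinghua (朱静华) et al.: "Analysis of Data Structure Problems and Examples" (《数据结构题例分析》), 31 August 1995, Huazhong University of Science and Technology Press (华中理工大学出版社) * |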
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112306922A (en) * | 2020-11-12 | 2021-02-02 | 山东云海国创云计算装备产业创新中心有限公司 | Multi-data-pair multi-port arbitration method and related device |
CN112306922B (en) * | 2020-11-12 | 2023-09-22 | 山东云海国创云计算装备产业创新中心有限公司 | Multi-data-to-multi-port arbitration method and related device |
CN113094648A (en) * | 2021-04-02 | 2021-07-09 | 算筹信息科技有限公司 | Method for solving triangular matrix and matrix inner product by outer product accumulation |
CN113094648B (en) * | 2021-04-02 | 2022-08-09 | 算筹(深圳)信息科技有限公司 | Method for solving triangular matrix and matrix inner product by outer product accumulation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107340993B (en) | Arithmetic device and method | |
US10255547B2 (en) | Indirectly accessing sample data to perform multi-convolution operations in a parallel processing system | |
US10346507B2 (en) | Symmetric block sparse matrix-vector multiplication | |
CN109934331A (en) | Device and method for executing artificial neural network forward operation | |
CN107170019B (en) | Rapid low-storage image compression sensing method | |
US20200234129A1 (en) | Techniques for removing masks from pruned neural networks | |
CN103336758A (en) | Sparse matrix storage method CSRL (Compressed Sparse Row with Local Information) and SpMV (Sparse Matrix Vector Multiplication) realization method based on same | |
CN107341507B (en) | GPU and cascade hash based rapid image SIFT feature matching method | |
DE102020112826A1 (en) | PROCESS FOR EFFICIENT PERFORMANCE OF DATA REDUCTION IN PARALLEL PROCESSING UNITS | |
CN110580675A (en) | Matrix storage and calculation method suitable for GPU hardware | |
WO2018129930A1 (en) | Fast fourier transform processing method and device, and computer storage medium | |
US20190138922A1 (en) | Apparatus and methods for forward propagation in neural networks supporting discrete data | |
CN103177414A (en) | Structure-based dependency graph node similarity concurrent computation method | |
CN102647588B (en) | GPU (Graphics Processing Unit) acceleration method used for hierarchical searching motion estimation | |
CN110796236A (en) | Vectorization implementation method for pooling of multi-sample multi-channel convolutional neural network | |
CN106780415B (en) | Histogram statistical circuit and multimedia processing system | |
CN110019184A (en) | A kind of method of the orderly integer array of compression and decompression | |
JP2023070746A (en) | Information processing program, information processing apparatus, and information processing method | |
CN106484532A (en) | GPGPU parallel calculating method towards SPH fluid simulation | |
CN111797985A (en) | Convolution operation memory access optimization method based on GPU | |
CN117539546A (en) | Sparse matrix vector multiplication acceleration method and device based on non-empty column storage | |
Nishimura et al. | Accelerating the Smith-waterman algorithm using bitwise parallel bulk computation technique on GPU | |
CN111191774B (en) | Simplified convolutional neural network-oriented low-cost accelerator architecture and processing method thereof | |
CN111832144A (en) | Full-amplitude quantum computation simulation method | |
CN202093573U (en) | Parallel acceleration device used in industrial CT image reconstruction |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 100102 floor 2, block a, No.2, lizezhong 2nd Road, Chaoyang District, Beijing. Applicant after: Beijing Huada Jiutian Technology Co.,Ltd. Address before: 100102 floor 2, block a, No.2, lizezhong 2nd Road, Chaoyang District, Beijing. Applicant before: HUADA EMPYREAN SOFTWARE Co.,Ltd. |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191217 |