CN102411558A - Vector processor oriented large matrix multiplied vectorization realizing method - Google Patents
- Publication number
- CN102411558A, CN102411558B, CN2011103381088A, CN201110338108A
- Authority
- CN
- China
- Prior art keywords
- matrix
- multiplier
- multiplicand
- vector
- parallel processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Complex Calculations (AREA)
Abstract
The invention discloses a vectorization method for large-matrix multiplication oriented to vector processors, comprising the following steps: (1) input a multiplicand matrix A and a multiplier matrix B; transfer both matrices into the vector storage unit through a DMA (direct memory access) controller, and during the transfer reorder the multiplier matrix B so that its 1st to nth rows become, in order, its 1st to nth columns; (2) load the elements of one row of the multiplicand matrix A and of one column of the multiplier matrix B into the K parallel processing units and multiply them element by element; reduce-sum the products into one designated parallel processing unit; store the sum in the vector storage unit as one element of the result matrix; (3) advance to the next row of the multiplicand matrix A and the next column of the multiplier matrix B and repeat step (2) until all data frames have been processed, yielding the result matrix C composed of these elements. The disclosed method has a simple principle, is convenient to apply, and improves computational efficiency.
Description
Technical field
The present invention relates generally to vector processors and the field of data processing, and in particular to a vectorization method for multiplying large matrices.
Background technology
Matrix multiplication arises in many scientific computing tasks and applications, such as image processing and signal encoding/decoding in communication systems. For large matrices, the computation involves a great number of multiplications and additions and therefore consumes substantial computing time. How to implement matrix multiplication simply and efficiently on a processor has long been a research focus in industry.
On traditional scalar processors, researchers have proposed a variety of effective matrix multiplication schemes, aiming to reduce the impact of data-reordering operations on the overall computation. However, as compute-intensive, real-time applications such as HD video encoding/decoding, 3G wireless communication, and radar signal processing continue to emerge, a single scalar core can hardly meet their real-time, high-density computation requirements, and vector processors have come into widespread use. Fig. 1 shows the typical structure of a vector processor, which comprises the processor proper, a program memory, and a data memory (either memory may be any addressable storage, including an external cache, external RAM, etc.). The processor is divided into a scalar processing unit and a vector processing unit. The vector processing unit usually contains K parallel processing elements (PEs), each with its own arithmetic units and registers; the PEs can exchange data through reduction instructions, e.g. adding or comparing data across the parallel PEs. The scalar processing unit mainly handles flow control and logical decision instructions, while the vector processing unit handles the dense data computation; the data it operates on are supplied by the vector data storage unit. As shown in Fig. 2, the number of BANKs (memory banks) of the vector data storage unit usually equals the number K of processing elements in the vector processing unit.
The patent document with application number 200380107095.7, filed by Intel, discloses an efficient small-matrix multiplication using SIMD registers: the diagonals of the multiplicand matrix A are written into different processor registers, and the multiplier matrix B is written in order into at least one vertically arranged register. By shifting one element at a time, each row of the multiplier matrix B held in the registers is selectively multiplied and accumulated, with the last element of a shifted row moved around to the front of that row. A diagonal of the multiplicand matrix A is multiplied by a row of the multiplier matrix B, and the product is added to the running result of a row of the result matrix C. This method achieves good results when the matrices are small, but as the matrix size grows it becomes difficult to sustain the performance. How to implement efficient large-matrix multiplication on a vector processor is therefore a current challenge.
Summary of the invention
The technical problem to be solved by the invention is: in view of the problems of the prior art, the invention provides a vectorization method for large-matrix multiplication oriented to vector processors that is simple in principle, convenient to apply, able to fully exploit the multi-level parallelism of a vector processor, and easy to implement.
To solve the above technical problem, the invention adopts the following technical scheme:
A vectorization method for multiplying large matrices on a vector processor comprises the following steps:
(1) Input the multiplicand matrix A and the multiplier matrix B; transfer the multiplicand matrix A and the multiplier matrix B through a DMA controller into the vector storage unit; during the transfer, reorder the multiplier matrix B so that its 1st to nth rows become, in order, its 1st to nth columns;
(2) Load the elements of one row of the multiplicand matrix A and of one column of the multiplier matrix B into the K parallel processing elements and multiply them element by element; reduce-sum the products into one designated parallel processing element; store the sum in the vector storage unit as one element of the result matrix;
(3) Advance to the next row of the multiplicand matrix A and the next column of the multiplier matrix B and repeat step (2) until all data frames have been processed, yielding the result matrix C composed of these result matrix elements.
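The three steps above can be sketched in software as follows. This is an illustrative Python simulation of the K parallel processing elements; the function name, the pure-Python lists, and the default K = 8 are assumptions for illustration, not part of the patent:

```python
def matmul_vectorized(A, B, K=8):
    """Simulate the patented flow: reorder B's rows into columns during
    the (simulated) DMA transfer, then for each (row of A, column of B)
    pair do K-wide elementwise multiplies followed by a reduction sum."""
    m, n = len(A), len(A[0])
    p = len(B[0])
    # Step (1): "DMA" transfer with reordering - row k of B becomes
    # column k, so each column of B is stored contiguously.
    B_reordered = [[B[k][j] for k in range(n)] for j in range(p)]
    C = [[0] * p for _ in range(m)]
    for i in range(m):             # step (3): advance row by row
        for j in range(p):         # ... and column by column
            acc = 0
            # Step (2): load up to K elements at a time into the K "PEs",
            # multiply elementwise, then reduce-sum into one designated PE.
            for base in range(0, n, K):
                lanes = [A[i][base + t] * B_reordered[j][base + t]
                         for t in range(min(K, n - base))]
                acc += sum(lanes)  # reduction into the designated PE
            C[i][j] = acc          # store one result-matrix element
    return C
```

A hardware implementation would perform the K multiplies in one vector instruction and the lane sum with a reduction instruction; the loops here only model that behavior sequentially.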
As a further improvement of the invention:
During the transfer, each row of the multiplicand matrix A is organized into one data frame and each column of the multiplier matrix B into one data frame. When the number of elements in a data frame is not a multiple of the number K of parallel processing elements in the vector processor, zeros are appended at the tail of the frame so that the element count of every data frame becomes a multiple of K.
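The zero-padding rule can be illustrated with a short sketch (the helper name `pad_frame` and the default K = 8 are assumed for illustration; since the pad values are zeros, the padded lanes contribute nothing to the later dot products):

```python
def pad_frame(frame, K=8):
    """Append zeros so the frame length becomes a multiple of K,
    the number of parallel processing elements."""
    r = len(frame) % K
    return frame + [0] * ((K - r) % K)
```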
Compared with the prior art, the invention has the following advantages:
The vectorization method of the invention reorders the data of the multiplier matrix B while the DMA controller is transferring it, and at the same time fully exploits the ability of the multiple parallel processing elements of the vector unit to perform the same operation simultaneously on large amounts of data of the same type. This greatly improves the efficiency of computing the matrix product, and the steps are simple and easy to implement.
Description of drawings
Fig. 1 is a schematic diagram of a typical vector processor structure.
Fig. 2 is a schematic diagram of the vector data storage unit in the vector processor of Fig. 1.
Fig. 3 is a schematic diagram of the main flow of the invention.
Fig. 4 is a schematic diagram of reordering the elements of the multiplier matrix B with the DMA controller in Embodiment 1.
Fig. 5 shows the storage formats of the elements of the multiplicand matrix A and the multiplier matrix B in the vector data storage unit of Fig. 2 in Embodiment 2; Fig. 5(1) shows the format for the multiplicand matrix A and Fig. 5(2) the format for the multiplier matrix B.
Fig. 6 shows how the multiplicand matrix A (16 × 16) and the multiplier matrix B (16 × 16) of Embodiment 2 are loaded into the K parallel processing elements.
Fig. 7 shows the matrix multiplication steps for the multiplicand matrix A (16 × 16) and the multiplier matrix B (16 × 16) of Embodiment 2.
Fig. 8 shows the storage formats of the elements of the multiplicand matrix A and the multiplier matrix B in the vector data storage unit of Fig. 2 in Embodiment 3; Fig. 8(1) shows the format for the multiplicand matrix A and Fig. 8(2) the format for the multiplier matrix B.
Fig. 9 shows how the multiplicand matrix A (26 × 22) and the multiplier matrix B (22 × 27) of Embodiment 3 are loaded into the K parallel processing elements.
Fig. 10 shows the matrix multiplication steps for the multiplicand matrix A (26 × 22) and the multiplier matrix B (22 × 27) of Embodiment 3.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and specific embodiments.
Embodiment 1:
As shown in Fig. 3, the vectorization method of the invention for multiplying large matrices on a vector processor comprises the following steps:
1. Input the multiplicand matrix A and the multiplier matrix B; transfer them through the DMA controller into the vector storage unit. As shown in Fig. 4, during the transfer the multiplier matrix B is reordered so that its 1st to nth rows become, in order, its 1st to nth columns.
By configuring the DMA controller, each row of the multiplicand matrix A can be organized into one data frame and each column of the multiplier matrix B into one data frame; the whole multiplier matrix B is thus divided into p data frames. When the number of elements in a data frame is not a multiple of the number K of parallel processing elements in the vector processor, zeros are appended at the tail of the frame so that the element count of every data frame becomes a multiple of K.
2. Load the elements of one row frame of the multiplicand matrix A and of one column frame of the multiplier matrix B into the K parallel processing elements and multiply them element by element; reduce-sum the products into one designated parallel processing element; store the sum in the vector storage unit as one element of the result matrix.
3. Advance to the next row of the multiplicand matrix A and the next column of the multiplier matrix B and repeat step 2 until all data frames have been processed, yielding the result matrix C composed of these result matrix elements.
Multiplying the m × n multiplicand matrix A by the n × p multiplier matrix B yields the m × p matrix C. Mathematically this can be expressed as:
C_ij = Σ_{k=0}^{n−1} A_ik · B_kj  (0 ≤ i < m, 0 ≤ j < p),
i.e. each element C_ij of the result matrix C is obtained by the dot product of the corresponding row elements A_ik of the multiplicand matrix A with the corresponding column elements B_kj of the multiplier matrix B.
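A direct transcription of this dot-product definition, useful as a reference against which a vectorized implementation can be checked (an illustrative sketch; the function name is assumed):

```python
def c_element(A, B, i, j):
    """C[i][j] as the dot product of row i of A with column j of B."""
    n = len(B)  # inner dimension: rows of B
    return sum(A[i][k] * B[k][j] for k in range(n))
```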
Embodiment 2:
As shown in Fig. 7, the vectorization method of the invention is used to multiply a 16 × 16 matrix by a 16 × 16 matrix (the number K of vector processing elements is 8), with the following steps:
1. As shown in Fig. 6, input the multiplicand matrix A (16 × 16) and the multiplier matrix B (16 × 16); transfer them by DMA into the vector storage unit, reordering the multiplier matrix B during the transfer (the reordering is the same as in Embodiment 1). The storage layouts of the multiplicand matrix A and the multiplier matrix B in the vector unit are shown in Fig. 5(1) and Fig. 5(2).
2. Load a row of the multiplicand matrix A and a column of the multiplier matrix B into the vector processing elements. Since both matrices are 16 × 16, the loading must be done in two passes; and since there are only 8 vector processing elements for 16 multiplicand and multiplier elements, the elementwise multiplication of this step must likewise be performed twice.
3. Use the reduction instruction to add up the products computed by the vector processing elements in step 2, reducing the result into a designated processing element X; this operation is likewise performed twice.
4. Add the two partial results in unit X to obtain one element of the result matrix and store it into the vector storage unit.
5. Advance to the next row of the multiplicand matrix A and the next column of the multiplier matrix B and repeat steps 2 to 4 to compute the whole result matrix C = A × B.
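Embodiment 2's two-pass handling of one 16-element row/column pair with K = 8 processing elements can be sketched as follows (a software simulation with assumed names, not the patent's own code):

```python
def dot16_in_two_passes(a_row, b_col, K=8):
    """Embodiment 2 flow for 16-element vectors with K = 8 PEs:
    two K-wide loads and multiplies, two reductions into PE X,
    then one final add in X (steps 2-4)."""
    assert len(a_row) == len(b_col) == 16
    partials = []
    for pass_no in range(2):            # steps 2-3, executed twice
        lo = pass_no * K
        lanes = [a_row[lo + t] * b_col[lo + t] for t in range(K)]
        partials.append(sum(lanes))     # reduction into PE X
    return partials[0] + partials[1]    # step 4: final add in X
```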
Embodiment 3:
As shown in Fig. 10, the vectorization method of the invention is used to multiply a 26 × 22 matrix by a 22 × 27 matrix (the number K of vector processing elements is 8), with the following steps:
1. As shown in Fig. 9, transfer the multiplicand matrix A and the multiplier matrix B by DMA into the vector storage unit, reordering the multiplier matrix B during the transfer (the reordering is the same as in Embodiment 1) and zero-padding both matrices. The storage layouts of the multiplicand matrix A and the multiplier matrix B in the vector unit are shown in Fig. 8(1) and Fig. 8(2).
2. Load a row of the multiplicand matrix A and a column of the multiplier matrix B into the vector processing elements. Since the rows of A and the columns of B have been zero-padded, the loading must be done in three passes; and since there are only 8 vector processing elements while the padded multiplicand and multiplier elements each number 24, the elementwise multiplication of this step must likewise be performed three times.
3. Use the reduction instruction to add up the products computed by the vector processing elements in step 2, reducing the result into a designated processing element X; this operation is likewise performed three times.
4. Add the three partial results in unit X to obtain one element of the result matrix and store it into the vector storage unit.
5. Advance to the next row of the multiplicand matrix A and the next column of the multiplier matrix B and repeat steps 2 to 4 to compute the whole result matrix C = A × B.
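Embodiment 3's flow, which pads 22-element rows/columns with zeros to 24 elements (a multiple of K = 8) and reduces in three passes, might be sketched as follows (illustrative only; names and the software simulation are assumptions):

```python
def dot_padded_three_passes(a_row, b_col, K=8):
    """Embodiment 3 flow: row/column vectors of equal length are
    zero-padded to the next multiple of K, then processed in
    K-wide multiply-and-reduce passes (3 passes for 22 -> 24)."""
    target = -(-len(a_row) // K) * K        # round up to a multiple of K
    a = a_row + [0] * (target - len(a_row))
    b = b_col + [0] * (target - len(b_col))
    # One K-wide multiply + reduction per pass; zero lanes add nothing.
    partials = [sum(a[lo + t] * b[lo + t] for t in range(K))
                for lo in range(0, target, K)]
    return sum(partials)                    # accumulate in PE X
```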
Following the steps above, simple and efficient code realizing large-matrix multiplication can be written for the structure and instruction set of a specific vector processor. The method of the invention is easy for programmers to understand, which facilitates its implementation in code.
The above is only a preferred embodiment of the invention, and the scope of protection of the invention is not limited to the embodiments above; all technical schemes under the idea of the invention belong to the scope of protection of the invention. It should be pointed out that, for those skilled in the art, improvements and modifications that do not depart from the principle of the invention shall also be regarded as falling within the scope of protection of the invention.
Claims (2)
1. A vectorization method for multiplying large matrices on a vector processor, characterized by comprising the following steps:
(1) inputting a multiplicand matrix A and a multiplier matrix B; transferring the multiplicand matrix A and the multiplier matrix B through a DMA controller into a vector storage unit; during the transfer, reordering the multiplier matrix B so that its 1st to nth rows become, in order, its 1st to nth columns;
(2) loading the elements of one row of the multiplicand matrix A and of one column of the multiplier matrix B into the K parallel processing elements and multiplying them element by element; reduce-summing the products into one designated parallel processing element; storing the sum in the vector storage unit as one element of the result matrix;
(3) advancing to the next row of the multiplicand matrix A and the next column of the multiplier matrix B and repeating step (2) until all data frames have been processed, yielding the result matrix C composed of these result matrix elements.
2. The vectorization method for multiplying large matrices on a vector processor according to claim 1, characterized in that, during the transfer, each row of the multiplicand matrix A is organized into one data frame and each column of the multiplier matrix B into one data frame; when the number of elements in a data frame is not a multiple of the number K of parallel processing elements in the vector processor, zeros are appended at the tail of the frame so that the element count of every data frame becomes a multiple of K.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110338108.8A CN102411558B (en) | 2011-10-31 | 2011-10-31 | Vector processor oriented large matrix multiplied vectorization realizing method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102411558A true CN102411558A (en) | 2012-04-11 |
CN102411558B CN102411558B (en) | 2015-05-13 |
Family
ID=45913637
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110338108.8A Active CN102411558B (en) | 2011-10-31 | 2011-10-31 | Vector processor oriented large matrix multiplied vectorization realizing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102411558B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109165733A (en) * | 2018-07-11 | 2019-01-08 | 中国人民解放军国防科技大学 | Multi-input multi-output matrix maximum pooling vectorization implementation method |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1155117A (en) * | 1996-01-19 | 1997-07-23 | 张胤微 | High-speed multiplication device |
CN1394314A (en) * | 2000-11-02 | 2003-01-29 | 索尼计算机娱乐公司 | Parallel operation device, entertainment device, operating method, computer program, and semiconductor device |
CN101061474A (en) * | 2004-06-10 | 2007-10-24 | 哈桑·塞希托格鲁 | Matrix-valued methods and apparatus for signal processing |
CN101089840A (en) * | 2007-07-12 | 2007-12-19 | 浙江大学 | Matrix multiplication parallel computing system based on multi-FPGA |
CN101163240A (en) * | 2006-10-13 | 2008-04-16 | 国际商业机器公司 | Filter arrangement and method thereof |
EP2017743A2 (en) * | 2007-07-19 | 2009-01-21 | Itt Manufacturing Enterprises, Inc. | High speed and efficient matrix multiplication hardware module |
Cited By (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104461449A (en) * | 2014-11-14 | 2015-03-25 | 中国科学院数据与通信保护研究教育中心 | Large integer multiplication realizing method and device based on vector instructions |
CN104461449B (en) * | 2014-11-14 | 2018-02-27 | 中国科学院数据与通信保护研究教育中心 | Large integer multiplication implementation method and device based on vector instruction |
CN104636316A (en) * | 2015-02-06 | 2015-05-20 | 中国人民解放军国防科学技术大学 | GPDSP-oriented large-scale matrix multiplication calculation method |
CN104899182B (en) * | 2015-06-09 | 2017-10-31 | 中国人民解放军国防科学技术大学 | A kind of Matrix Multiplication accelerated method for supporting variable partitioned blocks |
CN104899182A (en) * | 2015-06-09 | 2015-09-09 | 中国人民解放军国防科学技术大学 | Matrix multiplication acceleration method for supporting variable blocks |
CN106445471A (en) * | 2016-10-13 | 2017-02-22 | 北京百度网讯科技有限公司 | Processor and method for executing matrix multiplication on processor |
US10140251B2 (en) | 2016-10-13 | 2018-11-27 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Processor and method for executing matrix multiplication operation on processor |
US10454680B2 (en) | 2016-11-01 | 2019-10-22 | Beijing Baidu Netcom Science And Technology Co., Ltd. | RSA decryption processor and method for controlling RSA decryption processor |
CN108509384B (en) * | 2017-02-24 | 2022-04-12 | 富士通株式会社 | Calculation method, information processing apparatus, calculation program, and information processing system |
CN108509384A (en) * | 2017-02-24 | 2018-09-07 | 富士通株式会社 | Computational methods, information processing unit, calculation procedure and information processing system |
CN110494846A (en) * | 2017-03-20 | 2019-11-22 | 英特尔公司 | System, method and apparatus for addition of matrices, subtraction and multiplication |
CN106970896A (en) * | 2017-03-30 | 2017-07-21 | 中国人民解放军国防科学技术大学 | The vectorization implementation method of the two-dimensional matrix convolution of vector processor-oriented |
CN106959937B (en) * | 2017-03-30 | 2019-03-29 | 中国人民解放军国防科学技术大学 | A kind of vectorization implementation method of the warp product matrix towards GPDSP |
CN106970896B (en) * | 2017-03-30 | 2020-05-12 | 中国人民解放军国防科学技术大学 | Vector processor-oriented vectorization implementation method for two-dimensional matrix convolution |
CN106959937A (en) * | 2017-03-30 | 2017-07-18 | 中国人民解放军国防科学技术大学 | A kind of vectorization implementation method of warp product matrix towards GPDSP |
US10338919B2 (en) | 2017-05-08 | 2019-07-02 | Nvidia Corporation | Generalized acceleration of matrix multiply accumulate operations |
US11797302B2 (en) | 2017-05-08 | 2023-10-24 | Nvidia Corporation | Generalized acceleration of matrix multiply accumulate operations |
US10884734B2 (en) | 2017-05-08 | 2021-01-05 | Nvidia Corporation | Generalized acceleration of matrix multiply accumulate operations |
US11816481B2 (en) | 2017-05-08 | 2023-11-14 | Nvidia Corporation | Generalized acceleration of matrix multiply accumulate operations |
US11816482B2 (en) | 2017-05-08 | 2023-11-14 | Nvidia Corporation | Generalized acceleration of matrix multiply accumulate operations |
US11797303B2 (en) | 2017-05-08 | 2023-10-24 | Nvidia Corporation | Generalized acceleration of matrix multiply accumulate operations |
US11797301B2 (en) | 2017-05-08 | 2023-10-24 | Nvidia Corporation | Generalized acceleration of matrix multiply accumulate operations |
CN107256203A (en) * | 2017-06-28 | 2017-10-17 | 郑州云海信息技术有限公司 | The implementation method and device of a kind of matrix-vector multiplication |
CN111465924A (en) * | 2017-12-12 | 2020-07-28 | 特斯拉公司 | System and method for converting matrix input to vectorized input for a matrix processor |
CN111465924B (en) * | 2017-12-12 | 2023-11-17 | 特斯拉公司 | System and method for converting matrix input into vectorized input for matrix processor |
CN107977231B (en) * | 2017-12-15 | 2020-10-27 | 安徽寒武纪信息科技有限公司 | Calculation method and related product |
CN107977231A (en) * | 2017-12-15 | 2018-05-01 | 北京中科寒武纪科技有限公司 | A kind of computational methods and Related product |
CN110377877B (en) * | 2019-07-26 | 2022-12-23 | 苏州浪潮智能科技有限公司 | Data processing method, device, equipment and storage medium |
CN110377877A (en) * | 2019-07-26 | 2019-10-25 | 苏州浪潮智能科技有限公司 | A kind of data processing method, device, equipment and storage medium |
CN113536220A (en) * | 2020-04-21 | 2021-10-22 | 中科寒武纪科技股份有限公司 | Operation method, processor and related product |
CN112433760A (en) * | 2020-11-27 | 2021-03-02 | 海光信息技术股份有限公司 | Data sorting method and data sorting circuit |
Also Published As
Publication number | Publication date |
---|---|
CN102411558B (en) | 2015-05-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102411558B (en) | Vector processor oriented large matrix multiplied vectorization realizing method | |
Dou et al. | 64-bit floating-point FPGA matrix multiplication | |
CN1230735C (en) | Processing multiply-accumulate operations in single cycle | |
EP3659051B1 (en) | Accelerated mathematical engine | |
CN102141976B (en) | Method for storing diagonal data of sparse matrix and SpMV (Sparse Matrix Vector) realization method based on method | |
CN103294648B (en) | Support the partitioned matrix multiplication vectorization method of many MAC operation parts vector treatment device | |
CN111937009A (en) | Systolic convolutional neural network | |
CN100405361C (en) | Method and system for performing calculation operations and a device | |
CN110188869B (en) | Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm | |
CN103440121A (en) | Triangular matrix multiplication vectorization method of vector processor | |
CN107341133B (en) | Scheduling method of reconfigurable computing structure based on LU decomposition of arbitrary dimension matrix | |
CN103902507A (en) | Matrix multiplication calculating device and matrix multiplication calculating method both oriented to programmable algebra processor | |
CN105373517A (en) | Spark-based distributed matrix inversion parallel operation method | |
WO2020196407A1 (en) | Convolutional computation device | |
Yzelman et al. | A cache-oblivious sparse matrix–vector multiplication scheme based on the Hilbert curve | |
CN107667345A (en) | Packing data alignment plus computations, processor, method and system | |
CN103389967B (en) | The device and method of a kind of matrix transposition based on SRAM | |
CN102402415A (en) | Device and method for buffering data in dynamic reconfigurable array | |
CN102360281A (en) | Multifunctional fixed-point media access control (MAC) operation device for microprocessor | |
Haidar et al. | Out of memory SVD solver for big data | |
CN102411773B (en) | Vector-processor-oriented mean-residual normalized product correlation vectoring method | |
EP3842954A1 (en) | System and method for configurable systolic array with partial read/write | |
CN104615516A (en) | Method for achieving large-scale high-performance Linpack testing benchmark for GPDSP | |
CN102231624B (en) | Vector processor-oriented floating point complex number block finite impulse response (FIR) vectorization realization method | |
CN115576606A (en) | Method for realizing matrix transposition multiplication, coprocessor, server and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |