CN102411558A - Vector processor oriented large matrix multiplied vectorization realizing method - Google Patents

Vector processor oriented large matrix multiplied vectorization realizing method

Info

Publication number
CN102411558A
CN102411558A (application CN2011103381088A / CN201110338108A; granted as CN102411558B)
Authority
CN
China
Prior art keywords
matrix
multiplier
multiplicand
vector
parallel processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011103381088A
Other languages
Chinese (zh)
Other versions
CN102411558B (en)
Inventor
刘仲
陈书明
陈跃跃
曾咏涛
刘衡竹
陈海燕
龚国辉
彭元喜
陈胜刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201110338108.8A priority Critical patent/CN102411558B/en
Publication of CN102411558A publication Critical patent/CN102411558A/en
Application granted granted Critical
Publication of CN102411558B publication Critical patent/CN102411558B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a vectorization implementation method for large-matrix multiplication oriented to vector processors, comprising the following steps: (1) input a multiplicand matrix A and a multiplier matrix B; transfer A and B into the vector memory unit via a DMA (direct memory access) controller; during the transfer, reorder rows 1 to n of the multiplier matrix B into columns 1 to n; (2) load the elements of one row of A and of one column of B into the K parallel processing elements and multiply them element-wise; reduction-sum the products into one designated processing element; store the sum into the vector memory unit as one element of the result matrix; (3) advance to the next row of A and the next column of B and repeat step (2) until all data frames have been computed, obtaining the result matrix C composed of the result-matrix elements. The vectorization implementation method disclosed by the invention is simple in principle, convenient to operate, and improves computational efficiency.

Description

Vectorization Implementation Method for Large-Matrix Multiplication Oriented to Vector Processors
Technical Field
The present invention relates generally to vector processors and the data-processing field, and in particular to a vectorization implementation method for large-matrix multiplication.
Background Art
Matrix multiplication is involved in many scientific computing tasks and applications, such as image processing and the signal codecs in communication systems. Larger matrix-multiplication tasks involve a great number of multiplications and additions and therefore consume a great deal of computing time. How to implement matrix multiplication simply and efficiently on a processor has long been a research focus in industry.
On traditional scalar processors, researchers have proposed a variety of effective matrix-multiplication implementations that reduce the impact of data-reordering operations on the overall computation. However, as compute-intensive real-time applications such as HD video encoding/decoding, 3G wireless communication and radar signal processing keep emerging, a single scalar core can hardly meet their real-time demands, and vector processors have come into widespread use. Fig. 1 shows the typical structure of a vector processor. It comprises a processor together with a program memory and a data memory (both may be any addressable memory, including external cache, external RAM, etc.). The processor is divided into a scalar processing unit and a vector processing unit. The vector processing unit usually contains K parallel processing elements (PEs), each with its own arithmetic units and registers; the PEs can exchange data through reduction instructions, e.g. addition or comparison across the parallel PEs. The scalar unit is mainly responsible for flow control and logic decisions, while the vector unit is mainly responsible for the dense data computation, whose operands are supplied by the vector data memory. Usually, as shown in Fig. 2, the number of BANKs (memory banks) of the vector data memory equals the number K of processing elements of the vector unit.
The patent document with application number 200380107095.7 discloses a method proposed by Intel for efficient small-matrix multiplication using SIMD registers: the diagonals of the multiplicand matrix A are written into different registers of the processor, and the multiplier matrix B is written in order into at least one vertically arranged register. By shifting one element at a time, each row of B held in the registers is selectively multiplied and accumulated with the correspondingly shifted elements, the last element of a shifted row being wrapped around to the front of that row. Each diagonal of A is multiplied by the rows of B, and the products are accumulated into the corresponding row of the result matrix C. This method achieves fairly good results when the matrix is small, but as the matrix size grows it becomes difficult to sustain that performance. How to implement efficient large-matrix multiplication on a vector processor is therefore a difficulty currently faced.
Summary of the Invention
The technical problem to be solved by the present invention is: in view of the problems of the prior art, to provide a vectorization implementation method for large-matrix multiplication oriented to vector processors that is simple in principle, convenient to operate, able to make full use of the multi-level parallelism of a vector processor, and easy to implement.
To solve the above technical problem, the present invention adopts the following technical scheme:
A vectorization implementation method for large-matrix multiplication oriented to vector processors, comprising the following steps:
(1) Input the multiplicand matrix A and the multiplier matrix B; transfer A and B into the vector memory unit via a DMA controller; during the transfer, reorder the multiplier matrix B so that rows 1 to n of B become columns 1 to n;
(2) Load the elements of one row of A and of one column of B into the K parallel processing elements and multiply them element-wise; reduction-sum the products into one designated processing element; store the sum into the vector memory unit as one element of the result matrix;
(3) Advance to the next row of A and the next column of B and repeat step (2) until all data frames have been computed, obtaining the result matrix C composed of the result-matrix elements.
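As an illustrative model of the three steps above (not the patented vector instruction sequence: plain Python lists stand in for the vector memory, the K-element slices for the parallel processing elements, and `sum(lanes)` for the reduction instruction), the whole method can be sketched as:

```python
def large_matrix_multiply(A, B, K=8):
    """Sketch of the method: rows of B are reordered into columns during
    the "DMA" step, frames are zero-padded to a multiple of K, and each
    C[i][j] is built from K-wide multiply + reduction-sum passes."""
    m, n, p = len(A), len(B), len(B[0])
    pad = (-n) % K
    # Step (1): transpose B (rows 1..n become columns 1..n) and zero-pad
    # every data frame so its length is a multiple of the lane count K.
    a_frames = [list(row) + [0] * pad for row in A]
    b_frames = [[B[k][j] for k in range(n)] + [0] * pad for j in range(p)]
    C = [[0] * p for _ in range(m)]
    for i in range(m):                 # step (3): next row of A...
        for j in range(p):             # ...and next column of B
            acc = 0
            for k in range(0, n + pad, K):
                # Step (2): K lane-wise products, then reduction-sum
                lanes = [a_frames[i][k + t] * b_frames[j][k + t] for t in range(K)]
                acc += sum(lanes)      # reduction into one designated PE
            C[i][j] = acc              # one result-matrix element
    return C
```

The inner `K`-wide loop models one load/multiply pass across the parallel PEs; because padded positions hold zeros, they do not disturb the dot products.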
As a further improvement of the present invention:
During the transfer, each row of the multiplicand matrix A is organized into a data frame and each column of the multiplier matrix B is organized into a data frame; when the element count of a data frame is not a multiple of the number K of parallel processing elements in the vector processor, zeros are appended at the tail of the frame so that every frame's element count becomes a multiple of K.
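The zero-padding rule can be sketched as follows (`pad_frame` is a hypothetical helper name, not from the patent; the arithmetic simply brings the frame length up to the next multiple of K):

```python
def pad_frame(frame, K):
    """Append zeros so the frame's element count becomes a multiple of K.
    (Hypothetical helper for illustration; zeros contribute nothing to the
    later dot products, so the result matrix is unchanged.)"""
    remainder = (-len(frame)) % K   # zeros needed to reach a multiple of K
    return list(frame) + [0] * remainder
```

For example, a 22-element frame with K = 8 is padded to 24 elements, which then splits evenly into three 8-wide passes.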
Compared with the prior art, the present invention has the following advantages:
The vectorization implementation method of the present invention realizes the reordering of the multiplier matrix B while the DMA controller is transferring the data, and at the same time fully exploits the ability of the multiple parallel processing elements of the vector unit to perform the same operation simultaneously on large amounts of data of the same type. It thereby greatly improves the efficiency of matrix multiplication, and its steps are simple and easy to implement.
Description of the Drawings
Fig. 1 is a schematic diagram of a typical vector processor structure.
Fig. 2 is a schematic diagram of the vector data memory unit in the vector processor of Fig. 1.
Fig. 3 is a schematic diagram of the main flow of the present invention.
Fig. 4 is a schematic diagram of reordering the elements of the multiplier matrix B with the DMA controller in Embodiment 1.
Fig. 5 shows the storage format of the elements of the multiplicand matrix A and the multiplier matrix B in the vector data memory of Fig. 2 in Embodiment 2; Fig. 5(1) shows the format for A, and Fig. 5(2) the format for B.
Fig. 6 shows how the multiplicand matrix A (16 × 16) and the multiplier matrix B (16 × 16) of Embodiment 2 are loaded into the K parallel processing elements.
Fig. 7 shows the execution steps of the matrix multiplication of A (16 × 16) and B (16 × 16) in Embodiment 2.
Fig. 8 shows the storage format of the elements of A and B in the vector data memory of Fig. 2 in Embodiment 3; Fig. 8(1) shows the format for A, and Fig. 8(2) the format for B.
Fig. 9 shows how the multiplicand matrix A (26 × 22) and the multiplier matrix B (22 × 27) of Embodiment 3 are loaded into the K parallel processing elements.
Fig. 10 shows the execution steps of the matrix multiplication of A (26 × 22) and B (22 × 27) in Embodiment 3.
Embodiments
The present invention is further described below with reference to the accompanying drawings and specific embodiments.
Embodiment 1:
As shown in Fig. 3, the vectorization implementation method for large-matrix multiplication of the present invention comprises the following steps:
1. Input the multiplicand matrix A and the multiplier matrix B; transfer A and B into the vector memory unit via the DMA controller; as shown in Fig. 4, reorder B during the transfer so that rows 1 to n of B become columns 1 to n.
Through configuration of the DMA controller, each row of A can be organized into a data frame and each column of B into a data frame, so the whole multiplier matrix B is divided into p data frames in total. When the element count of a data frame is not a multiple of the number K of parallel processing elements in the vector processor, zeros are appended at the tail of the frame so that every frame's element count becomes a multiple of K.
2. Load the elements of one row frame of A and of one column frame of B into the K parallel processing elements and multiply them element-wise; reduction-sum the products into one designated processing element; store the sum into the vector memory unit as one result-matrix element.
3. Advance to the next row of A and the next column of B and repeat step 2 until all data frames have been computed, obtaining the result matrix C composed of the result-matrix elements.
Multiplying the m × n multiplicand matrix A by the n × p multiplier matrix B yields an m × p matrix C. Mathematically, C_ij = Σ_{k=0}^{n-1} A_ik · B_kj (0 ≤ i < m, 0 ≤ j < p): each element C_ij of the result matrix C is obtained as the dot product of the corresponding row elements A_ik of the multiplicand matrix A with the corresponding column elements B_kj of the multiplier matrix B.
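The dot-product formula above can be checked with a minimal sketch (plain Python, illustrative only):

```python
def c_element(A, B, i, j):
    # C[i][j] = sum over k of A[i][k] * B[k][j]   (0 <= i < m, 0 <= j < p)
    return sum(A[i][k] * B[k][j] for k in range(len(B)))
```

For A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], the element C[0][0] is 1·5 + 2·7 = 19.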
Embodiment 2:
As shown in Fig. 7, multiplying a 16 × 16 matrix by a 16 × 16 matrix (with K = 8 vector processing elements) using the vectorization implementation method of the present invention comprises the following steps:
1. As shown in Fig. 6, input the multiplicand matrix A (16 × 16) and the multiplier matrix B (16 × 16); transfer A and B to the vector memory unit via DMA, reordering B in the process (the reordering method is the same as in Embodiment 1); the storage layouts of A and B in the vector memory unit are shown in Fig. 5(1) and Fig. 5(2).
2. Load the elements of one row of A and of one column of B into the processing elements. Since A and B are both 16 × 16, each row and column must be loaded in two batches; with 8 processing elements and 16 multiplicand and multiplier elements, the element-wise multiplication of this step is performed twice.
3. Use the reduction instruction to add up the products computed by the processing elements in step 2, reducing the result into the designated processing element X; this step is likewise performed twice.
4. Add the two partial sums obtained above in unit X to obtain one result-matrix element and store it into the vector memory unit.
5. Advance to the next row of A and the next column of B, and repeat steps 2 to 4 of the above procedure to compute the whole result matrix C = A × B.
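A sketch of the two-pass computation of a single result element in this 16 × 16, K = 8 case (illustrative Python; the operand values are arbitrary stand-ins, and ordinary list indexing stands in for the DMA-reordered column frames):

```python
N, K = 16, 8
# Arbitrary 16x16 operand values, for illustration only.
A = [[(i * N + j) % 7 for j in range(N)] for i in range(N)]
B = [[(i * N + j) % 5 for j in range(N)] for i in range(N)]

def c_element(i, j):
    # Steps 2-3: two load/multiply/reduce passes (16 elements / 8 lanes).
    partials = []
    for half in range(2):
        lo = half * K
        lanes = [A[i][lo + t] * B[lo + t][j] for t in range(K)]  # 8 products
        partials.append(sum(lanes))       # reduction into designated PE X
    # Step 4: add the two partial sums in PE X.
    return partials[0] + partials[1]

# Step 5: sweep all rows of A and columns of B.
C = [[c_element(i, j) for j in range(N)] for i in range(N)]
```

Splitting the 16-element dot product into two 8-wide halves changes only the order of additions, so C equals the ordinary product A × B.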
Embodiment 3:
As shown in Fig. 10, multiplying a 26 × 22 matrix by a 22 × 27 matrix (with K = 8 vector processing elements) using the vectorization implementation method of the present invention comprises the following steps:
1. As shown in Fig. 9, transfer the multiplicand matrix A and the multiplier matrix B to the vector memory unit via DMA, reordering B in the process (the reordering method is the same as in Embodiment 1) and also zero-padding A and B; the storage layouts of A and B in the vector memory unit are shown in Fig. 8(1) and Fig. 8(2).
2. Load the elements of one row of A and of one column of B into the processing elements. Since the rows of A and the columns of B have been zero-padded, each must be loaded in three batches; with 8 processing elements and 24 multiplicand and multiplier elements after padding, the element-wise multiplication of this step is performed three times.
3. Use the reduction instruction to add up the products computed by the processing elements in step 2, reducing the result into the designated processing element X; this step is likewise performed three times.
4. Add the three partial sums obtained above in unit X to obtain one result-matrix element and store it into the vector memory unit.
5. Advance to the next row of A and the next column of B, and repeat steps 2 to 4 of the above procedure to compute the whole result matrix C = A × B.
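The padded three-pass variant of Embodiment 3 can be sketched likewise (illustrative Python with arbitrary operand values; padding A's row frames and B's column frames from 22 to 24 elements gives exactly three K = 8 passes):

```python
K = 8
m, n, p = 26, 22, 27
# Arbitrary operand values, for illustration only.
A = [[(i + j) % 9 for j in range(n)] for i in range(m)]
B = [[(i * j + 1) % 7 for j in range(p)] for i in range(n)]

pad = (-n) % K                          # 22 -> 2 zeros, padded length 24
Ap = [row + [0] * pad for row in A]     # pad each row frame of A
Bp = B + [[0] * p for _ in range(pad)]  # pad each column frame of B

C = [[0] * p for _ in range(m)]
for i in range(m):
    for j in range(p):
        acc = 0
        for k in range(0, n + pad, K):  # (22 + 2) / 8 = 3 passes
            lanes = [Ap[i][k + t] * Bp[k + t][j] for t in range(K)]
            acc += sum(lanes)           # reduction into PE X each pass
        C[i][j] = acc                   # step 4: accumulated element
```

The appended zeros multiply against each other or against real data positions that are likewise zero-extended, so they add nothing to any C[i][j].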
Following the above steps, simple and efficient code implementing large-matrix multiplication can be written for the structure and instruction set of the specific vector processor. The method of the present invention is easy for programmers to understand, which helps them realize it in code.
The above are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical schemes under the idea of the present invention belong to the protection scope of the present invention. It should be pointed out that, for those skilled in the art, improvements and modifications made without departing from the principle of the present invention shall also be regarded as falling within the protection scope of the present invention.

Claims (2)

1. A vectorization implementation method for large-matrix multiplication oriented to vector processors, characterized by comprising the following steps:
(1) inputting the multiplicand matrix A and the multiplier matrix B; transferring A and B into the vector memory unit via a DMA controller; during the transfer, reordering the multiplier matrix B so that rows 1 to n of B become columns 1 to n;
(2) loading the elements of one row of A and of one column of B into the K parallel processing elements and multiplying them element-wise; reduction-summing the products into one designated processing element; storing the sum into the vector memory unit as one element of the result matrix;
(3) advancing to the next row of A and the next column of B and repeating step (2) until all data frames have been computed, obtaining the result matrix C composed of the result-matrix elements.
2. The vectorization implementation method for large-matrix multiplication oriented to vector processors according to claim 1, characterized in that, during the transfer, each row of the multiplicand matrix A is organized into a data frame and each column of the multiplier matrix B is organized into a data frame; when the element count of a data frame is not a multiple of the number K of parallel processing elements in the vector processor, zeros are appended at the tail of the frame so that every frame's element count becomes a multiple of K.
CN201110338108.8A 2011-10-31 2011-10-31 Vector processor oriented large matrix multiplied vectorization realizing method Active CN102411558B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110338108.8A CN102411558B (en) 2011-10-31 2011-10-31 Vector processor oriented large matrix multiplied vectorization realizing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110338108.8A CN102411558B (en) 2011-10-31 2011-10-31 Vector processor oriented large matrix multiplied vectorization realizing method

Publications (2)

Publication Number Publication Date
CN102411558A true CN102411558A (en) 2012-04-11
CN102411558B CN102411558B (en) 2015-05-13

Family

ID=45913637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110338108.8A Active CN102411558B (en) 2011-10-31 2011-10-31 Vector processor oriented large matrix multiplied vectorization realizing method

Country Status (1)

Country Link
CN (1) CN102411558B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461449A (en) * 2014-11-14 2015-03-25 中国科学院数据与通信保护研究教育中心 Large integer multiplication realizing method and device based on vector instructions
CN104636316A (en) * 2015-02-06 2015-05-20 中国人民解放军国防科学技术大学 GPDSP-oriented large-scale matrix multiplication calculation method
CN104899182A (en) * 2015-06-09 2015-09-09 中国人民解放军国防科学技术大学 Matrix multiplication acceleration method for supporting variable blocks
CN106445471A (en) * 2016-10-13 2017-02-22 北京百度网讯科技有限公司 Processor and method for executing matrix multiplication on processor
CN106959937A (en) * 2017-03-30 2017-07-18 中国人民解放军国防科学技术大学 A kind of vectorization implementation method of warp product matrix towards GPDSP
CN106970896A (en) * 2017-03-30 2017-07-21 中国人民解放军国防科学技术大学 The vectorization implementation method of the two-dimensional matrix convolution of vector processor-oriented
CN107256203A (en) * 2017-06-28 2017-10-17 郑州云海信息技术有限公司 The implementation method and device of a kind of matrix-vector multiplication
CN107977231A (en) * 2017-12-15 2018-05-01 北京中科寒武纪科技有限公司 A kind of computational methods and Related product
CN108509384A (en) * 2017-02-24 2018-09-07 富士通株式会社 Computational methods, information processing unit, calculation procedure and information processing system
US10338919B2 (en) 2017-05-08 2019-07-02 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US10454680B2 (en) 2016-11-01 2019-10-22 Beijing Baidu Netcom Science And Technology Co., Ltd. RSA decryption processor and method for controlling RSA decryption processor
CN110377877A (en) * 2019-07-26 2019-10-25 苏州浪潮智能科技有限公司 A kind of data processing method, device, equipment and storage medium
CN110494846A (en) * 2017-03-20 2019-11-22 英特尔公司 System, method and apparatus for addition of matrices, subtraction and multiplication
CN111465924A (en) * 2017-12-12 2020-07-28 特斯拉公司 System and method for converting matrix input to vectorized input for a matrix processor
CN112433760A (en) * 2020-11-27 2021-03-02 海光信息技术股份有限公司 Data sorting method and data sorting circuit
CN113536220A (en) * 2020-04-21 2021-10-22 中科寒武纪科技股份有限公司 Operation method, processor and related product
US11816482B2 (en) 2017-05-08 2023-11-14 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165733A (en) * 2018-07-11 2019-01-08 中国人民解放军国防科技大学 Multi-input multi-output matrix maximum pooling vectorization implementation method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1155117A (en) * 1996-01-19 1997-07-23 张胤微 High-speed multiplication device
CN1394314A (en) * 2000-11-02 2003-01-29 索尼计算机娱乐公司 Parallel operation device, entertainment device, operating method, computer program, and semiconductor device
CN101061474A (en) * 2004-06-10 2007-10-24 哈桑·塞希托格鲁 Matrix-valued methods and apparatus for signal processing
CN101089840A (en) * 2007-07-12 2007-12-19 浙江大学 Matrix multiplication parallel computing system based on multi-FPGA
CN101163240A (en) * 2006-10-13 2008-04-16 国际商业机器公司 Filter arrangement and method thereof
EP2017743A2 (en) * 2007-07-19 2009-01-21 Itt Manufacturing Enterprises, Inc. High speed and efficient matrix multiplication hardware module

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461449A (en) * 2014-11-14 2015-03-25 中国科学院数据与通信保护研究教育中心 Large integer multiplication realizing method and device based on vector instructions
CN104461449B (en) * 2014-11-14 2018-02-27 中国科学院数据与通信保护研究教育中心 Large integer multiplication implementation method and device based on vector instruction
CN104636316A (en) * 2015-02-06 2015-05-20 中国人民解放军国防科学技术大学 GPDSP-oriented large-scale matrix multiplication calculation method
CN104899182B (en) * 2015-06-09 2017-10-31 中国人民解放军国防科学技术大学 A kind of Matrix Multiplication accelerated method for supporting variable partitioned blocks
CN104899182A (en) * 2015-06-09 2015-09-09 中国人民解放军国防科学技术大学 Matrix multiplication acceleration method for supporting variable blocks
CN106445471A (en) * 2016-10-13 2017-02-22 北京百度网讯科技有限公司 Processor and method for executing matrix multiplication on processor
US10140251B2 (en) 2016-10-13 2018-11-27 Beijing Baidu Netcom Science And Technology Co., Ltd. Processor and method for executing matrix multiplication operation on processor
US10454680B2 (en) 2016-11-01 2019-10-22 Beijing Baidu Netcom Science And Technology Co., Ltd. RSA decryption processor and method for controlling RSA decryption processor
CN108509384B (en) * 2017-02-24 2022-04-12 富士通株式会社 Calculation method, information processing apparatus, calculation program, and information processing system
CN108509384A (en) * 2017-02-24 2018-09-07 富士通株式会社 Computational methods, information processing unit, calculation procedure and information processing system
CN110494846A (en) * 2017-03-20 2019-11-22 英特尔公司 System, method and apparatus for addition of matrices, subtraction and multiplication
CN106970896A (en) * 2017-03-30 2017-07-21 中国人民解放军国防科学技术大学 The vectorization implementation method of the two-dimensional matrix convolution of vector processor-oriented
CN106959937B (en) * 2017-03-30 2019-03-29 中国人民解放军国防科学技术大学 A kind of vectorization implementation method of the warp product matrix towards GPDSP
CN106970896B (en) * 2017-03-30 2020-05-12 中国人民解放军国防科学技术大学 Vector processor-oriented vectorization implementation method for two-dimensional matrix convolution
CN106959937A (en) * 2017-03-30 2017-07-18 中国人民解放军国防科学技术大学 A kind of vectorization implementation method of warp product matrix towards GPDSP
US10338919B2 (en) 2017-05-08 2019-07-02 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11797302B2 (en) 2017-05-08 2023-10-24 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US10884734B2 (en) 2017-05-08 2021-01-05 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11816481B2 (en) 2017-05-08 2023-11-14 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11816482B2 (en) 2017-05-08 2023-11-14 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11797303B2 (en) 2017-05-08 2023-10-24 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11797301B2 (en) 2017-05-08 2023-10-24 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
CN107256203A (en) * 2017-06-28 2017-10-17 郑州云海信息技术有限公司 The implementation method and device of a kind of matrix-vector multiplication
CN111465924A (en) * 2017-12-12 2020-07-28 特斯拉公司 System and method for converting matrix input to vectorized input for a matrix processor
CN111465924B (en) * 2017-12-12 2023-11-17 特斯拉公司 System and method for converting matrix input into vectorized input for matrix processor
CN107977231B (en) * 2017-12-15 2020-10-27 安徽寒武纪信息科技有限公司 Calculation method and related product
CN107977231A (en) * 2017-12-15 2018-05-01 北京中科寒武纪科技有限公司 A kind of computational methods and Related product
CN110377877B (en) * 2019-07-26 2022-12-23 苏州浪潮智能科技有限公司 Data processing method, device, equipment and storage medium
CN110377877A (en) * 2019-07-26 2019-10-25 苏州浪潮智能科技有限公司 A kind of data processing method, device, equipment and storage medium
CN113536220A (en) * 2020-04-21 2021-10-22 中科寒武纪科技股份有限公司 Operation method, processor and related product
CN112433760A (en) * 2020-11-27 2021-03-02 海光信息技术股份有限公司 Data sorting method and data sorting circuit

Also Published As

Publication number Publication date
CN102411558B (en) 2015-05-13

Similar Documents

Publication Publication Date Title
CN102411558B (en) Vector processor oriented large matrix multiplied vectorization realizing method
Dou et al. 64-bit floating-point FPGA matrix multiplication
CN1230735C (en) Processing multiply-accumulate operations in single cycle
EP3659051B1 (en) Accelerated mathematical engine
CN102141976B (en) Method for storing diagonal data of sparse matrix and SpMV (Sparse Matrix Vector) realization method based on method
CN103294648B (en) Support the partitioned matrix multiplication vectorization method of many MAC operation parts vector treatment device
CN111937009A (en) Systolic convolutional neural network
CN100405361C (en) Method and system for performing calculation operations and a device
CN110188869B (en) Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm
CN103440121A (en) Triangular matrix multiplication vectorization method of vector processor
CN107341133B (en) Scheduling method of reconfigurable computing structure based on LU decomposition of arbitrary dimension matrix
CN103902507A (en) Matrix multiplication calculating device and matrix multiplication calculating method both oriented to programmable algebra processor
CN105373517A (en) Spark-based distributed matrix inversion parallel operation method
WO2020196407A1 (en) Convolutional computation device
Yzelman et al. A cache-oblivious sparse matrix–vector multiplication scheme based on the Hilbert curve
CN107667345A (en) Packing data alignment plus computations, processor, method and system
CN103389967B (en) The device and method of a kind of matrix transposition based on SRAM
CN102402415A (en) Device and method for buffering data in dynamic reconfigurable array
CN102360281A (en) Multifunctional fixed-point media access control (MAC) operation device for microprocessor
Haidar et al. Out of memory SVD solver for big data
CN102411773B (en) Vector-processor-oriented mean-residual normalized product correlation vectoring method
EP3842954A1 (en) System and method for configurable systolic array with partial read/write
CN104615516A (en) Method for achieving large-scale high-performance Linpack testing benchmark for GPDSP
CN102231624B (en) Vector processor-oriented floating point complex number block finite impulse response (FIR) vectorization realization method
CN115576606A (en) Method for realizing matrix transposition multiplication, coprocessor, server and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant