CN103294648A - Block matrix multiplication vectorization method supporting vector processor with multiple MAC (multiply accumulate) operational units - Google Patents

Block matrix multiplication vectorization method supporting vector processor with multiple MAC (multiply accumulate) operational units Download PDF

Info

Publication number
CN103294648A
CN103294648A CN2013101664113A CN201310166411A
Authority
CN
China
Prior art keywords
submatrix
matrix
vector
multiplication
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013101664113A
Other languages
Chinese (zh)
Other versions
CN103294648B (en)
Inventor
刘仲
陈书明
窦强
郭阳
刘衡竹
田希
龚国辉
陈海燕
彭元喜
万江华
刘胜
陈跃跃
扈啸
吴家铸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201310166411.3A priority Critical patent/CN103294648B/en
Publication of CN103294648A publication Critical patent/CN103294648A/en
Application granted granted Critical
Publication of CN103294648B publication Critical patent/CN103294648B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Complex Calculations (AREA)

Abstract

A block matrix multiplication vectorization method supporting a vector processor with multiple MAC (multiply accumulate) operational units includes the steps of: (1) determining the optimal submatrix block size, namely the numbers of rows and columns of the submatrices of a multiplier matrix B and of a multiplicand matrix A, according to the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC operational units in each VPE, the capacity s of the vector memory, and the data size d of the matrix elements; (2) dividing the vector memory of capacity s into two storage areas of equal capacity, Buffer 0 and Buffer 1, and carrying out the submatrix multiplications between Buffer 0 and Buffer 1 in ping-pong fashion until the multiplication of the whole matrix is complete. The method is simple to implement and convenient to operate, and improves the parallelism and operational efficiency of the vector processor.

Description

Block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units
Technical field
The present invention relates generally to the technical field of data processing, and in particular to a block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units.
Background technology
With the growing demand for high-performance computing in compute-intensive applications such as large-scale dense linear system solving, 4G wireless communication, radar signal processing, high-definition video, and digital image processing, computer architecture has changed markedly, and many novel architectures have appeared, such as multi-core architectures, heterogeneous multi-core architectures, stream processor architectures, and vector processor architectures. These new architectures integrate multiple processor cores on a single chip, each core containing abundant arithmetic units, which greatly increases the computational performance of the chip; at the same time, they pose new challenges for software development. Because a large number of existing programs and algorithms were designed for single-core processors, how to fully exploit parallelism at all levels for the characteristics of multi-core, multi-unit architectures, and how to parallelize and vectorize these application algorithms efficiently, are major current difficulties.
" matrix multiplication " is high-performance calculation (High Performance Computing, HPC) one of Chang Yong nucleus module, be typical computation-intensive and memory access intensive applications, taking advantage of of processor added (Multiply Accumulate, MAC) ability and memory access bandwidth requirement are very high, the time complexity that calculates is very high, is approximately O(N 3), N is matrix size.It is lower that the three traditional methods that recirculate the realization matrix multiplication are calculated memory access, and it is big that the data of Cache lack, matrix data is moved the expense accounting, causes the operation efficiency of processor lower.The partitioned matrix multiplication method is divided into the multiplication of large matrix the multiplication of a series of submatrixs, by the block size of submatrix reasonably is set, the block size blocksize of submatrix satisfies blocksize<=sqrt (M/3) usually, M is the capacity of Cache, make the data access when submatrix calculates all in Cache, to hit, reduce the computing time of whole large matrix multiplication the computing time by the minimizing submatrix, thereby increase substantially the operation efficiency of processor.
Fig. 1 is a general structural diagram of a vector processor with multiple MAC operational units. It comprises a scalar processing unit (Scalar Processing Unit, SPU) and a vector processing unit (Vector Processing Unit, VPU). The SPU is responsible for scalar computation and flow control, and the VPU is responsible for vector computation; the VPU comprises several vector processing elements (Vector Processing Element, VPE), each containing multiple computational functional units such as MAC0 and MAC1 as well as other functional units such as ALU and BP. The SPU and VPU provide data channels for transferring and exchanging data. The vector data access unit supports Load/Store of vector data and provides a large-capacity dedicated vector memory rather than the cache mechanism of a single-core processor, so existing blocked matrix multiplication methods are not suited to this class of vector processors. Therefore, an efficient block matrix multiplication vectorization method supporting vector processors with multiple MAC operational units is urgently needed, so as to bring out the best operational efficiency of the vector processor.
Summary of the invention
The technical problem to be solved by the present invention is: in view of the technical problems existing in the prior art, to provide a block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units that is simple to implement, convenient to operate, and improves the parallelism and computational efficiency of the vector processor.
To solve the above technical problems, the present invention adopts the following technical solution:
A block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units, the flow of which is:
(1) according to the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC operational units in each VPE, the capacity s of the vector memory, and the data size d of the matrix elements, determine the optimal submatrix block size blocksize, i.e. determine the numbers of rows and columns of the submatrix of the multiplier matrix B and of the submatrix of the multiplicand matrix A;
(2) divide the vector memory of capacity s into two storage areas of equal capacity, Buffer0 and Buffer1, and carry out the submatrix multiplications between Buffer0 and Buffer1 in ping-pong fashion in turn, until the whole matrix multiplication is complete.
As a further improvement of the present invention:
In step (1), the submatrix of the multiplier matrix B has p*m columns and (s/2/2)/(p*m*d) rows; after the submatrix block size of the multiplier matrix B is determined, the submatrix block size of the multiplicand matrix A is determined: the numbers of rows and columns of the submatrix of the multiplicand matrix A both equal the number of rows of the submatrix of the multiplier matrix B, i.e. (s/2/2)/(p*m*d).
The scalar processing unit SPU of the vector processor reads each element of each row of the multiplicand submatrix A in turn and extends it into a vector; the vector processing unit VPU reads the corresponding row of the multiplier submatrix B (row B0 first) and multiply-accumulates it with the broadcast vector of each element. When row A0 of the multiplicand submatrix has been traversed, row C0 of the result matrix C has been computed; when all rows of the multiplicand submatrix A have been traversed, the computation of the result matrix C is complete.
In step (2), storage area Buffer0 stores the multiplier submatrix B and the output submatrix of the result matrix C for the current submatrix multiplication, while the DMA controller transfers the submatrix data of the multiplier matrix B needed for the next submatrix multiplication into storage area Buffer1 and moves the previous result matrix data out to external storage.
Compared with the prior art, the advantages of the present invention are: the present invention determines the optimal submatrix block size blocksize according to the architectural characteristics of the vector processor and the data size of the matrix elements, effectively improving the compute-to-memory-access ratio of the processor. Using double-buffered ping-pong submatrix multiplication, data movement time and computation time can be effectively overlapped, reducing total computing time. The scalar processing unit SPU reads the row data of the multiplicand submatrix and extends each element into a vector, and the vector processing unit VPU multiply-accumulates it with the row-wise vector data of the multiplier submatrix, avoiding column-wise data accesses and reduction summation of vector data. These advantages make the method of the present invention simple to implement and convenient to operate; it can fully exploit the instruction-level, data-level, and task-level parallelism of the vector processor and raise processor efficiency above 90%, thereby giving full play to the high-performance computing capability of a vector processor with multiple MAC operational units.
Description of drawings
Fig. 1 is a general structural diagram of a vector processor with multiple MAC operational units.
Fig. 2 is a flow diagram of the method of the present invention.
Fig. 3 is a flow diagram of determining the optimal submatrix block size according to the architectural characteristics of the vector processor in a specific embodiment.
Fig. 4 is a schematic diagram of the computation in the submatrix multiplication of the present invention.
Fig. 5 is a flow diagram of the double-buffered ping-pong submatrix multiplication in a specific embodiment.
Embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in Fig. 2, the block matrix multiplication vectorization method of the present invention supporting a vector processor with multiple MAC operational units proceeds as follows:
(1) first, according to the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC operational units in each VPE, the capacity s of the vector memory, and the data size d of the matrix elements, determine the optimal submatrix block size blocksize, i.e. determine the numbers of rows and columns of the submatrix of the multiplier matrix B and of the submatrix of the multiplicand matrix A.
(2) divide the vector memory of capacity s into two storage areas of equal capacity, Buffer0 and Buffer1, and carry out the submatrix multiplications between Buffer0 and Buffer1 in ping-pong fashion in turn, until the whole matrix multiplication is complete.
As shown in Fig. 3, in a concrete application, the optimal submatrix block size blocksize is determined according to the number p of VPEs of the vector processor, the number m of MAC operational units in each VPE, the capacity s of the vector memory, and the data size d of the matrix elements. The submatrix of the multiplier matrix B has p*m columns and (s/2/2)/(p*m*d) rows. After the submatrix block size of the multiplier matrix B is determined, the submatrix block size of the multiplicand matrix A is determined: its numbers of rows and columns both equal the number of rows of the submatrix of B, i.e. (s/2/2)/(p*m*d). For example, suppose the matrix elements are single-precision floating-point data of size 4B (bytes), the capacity of the vector memory is 1024KB, the number of VPEs is p=16, and the number of MAC operational units per VPE is m=2. Then the submatrix of the multiplier matrix B has p*m=16*2=32 columns and (1024*1024/2/2)/(16*2*4)=2048 rows, and the submatrix of the multiplicand matrix A has 2048 rows and 2048 columns.
As shown in Fig. 4, in this embodiment the submatrix of the multiplier matrix B has 4 columns and 8 rows, and the submatrix of the multiplicand matrix A has 8 rows and 8 columns. The method adopted is: the scalar processing unit SPU of the vector processor reads each element of each row of the multiplicand submatrix A in turn and extends it into a vector; for example, element a00 of row A0 in Fig. 4 is extended into the vector (a00, a00, a00, a00), and element a01 into the vector (a01, a01, a01, a01). The vector processing unit VPU reads the rows of the multiplier submatrix B in turn, starting with row B0 = (b00, b01, b02, b03), and multiply-accumulates each row with the corresponding broadcast vector. When row A0 of the multiplicand submatrix has been traversed, row C0 = (c00, c01, c02, c03) of the result matrix C has been computed; when all rows of the multiplicand submatrix A have been traversed, the computation of the result matrix C is complete.
As shown in Fig. 5, the blocked matrix multiplication of this embodiment realizes the submatrix multiplications with double-buffered (ping-pong) operation. The vector memory of capacity s is divided into two storage areas of equal capacity, Buffer0 and Buffer1. Buffer0 stores the multiplier submatrix B and the output submatrix of the result matrix C for the current submatrix multiplication, while the DMA controller transfers the submatrix data of the multiplier matrix B needed for the next submatrix multiplication into Buffer1 and moves the previous result matrix data out to external storage.
In summary, the block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units realized by the present invention determines the optimal submatrix block size blocksize according to the architectural characteristics of the vector processor. Double-buffered ping-pong submatrix multiplication effectively overlaps data movement time with computation time, reducing total computing time. The scalar processing unit SPU reads the row data of the multiplicand submatrix and extends each element into a vector, and the vector processing unit VPU multiply-accumulates it with the row-wise vector data of the multiplier submatrix, avoiding column-wise data accesses and reduction summation of vector data. These advantages make the method simple to implement and convenient to operate; it can fully exploit the instruction-level, data-level, and task-level parallelism of the vector processor and raise processor efficiency above 90%, thereby giving full play to the high-performance computing capability of a vector processor with multiple MAC operational units.
The above are merely preferred embodiments of the present invention, and the scope of protection of the present invention is not limited to the above embodiments; all technical solutions falling under the idea of the present invention belong to its scope of protection. It should be pointed out that, for those skilled in the art, improvements and modifications made without departing from the principles of the present invention should also be regarded as within the scope of protection of the present invention.

Claims (4)

1. A block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units, characterized in that the flow is:
(1) according to the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC operational units in each VPE, the capacity s of the vector memory, and the data size d of the matrix elements, determining the optimal submatrix block size blocksize, i.e. determining the numbers of rows and columns of the submatrix of the multiplier matrix B and of the submatrix of the multiplicand matrix A;
(2) dividing the vector memory of capacity s into two storage areas of equal capacity, Buffer0 and Buffer1, and carrying out the submatrix multiplications between Buffer0 and Buffer1 in ping-pong fashion in turn, until the whole matrix multiplication is complete.
2. The block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units according to claim 1, characterized in that, in step (1), the submatrix of the multiplier matrix B has p*m columns and (s/2/2)/(p*m*d) rows; after the submatrix block size of the multiplier matrix B is determined, the submatrix block size of the multiplicand matrix A is determined, the numbers of rows and columns of the submatrix of the multiplicand matrix A both being equal to the number of rows of the submatrix of the multiplier matrix B, i.e. (s/2/2)/(p*m*d).
3. The block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units according to claim 2, characterized in that the scalar processing unit SPU of the vector processor reads each element of each row of the multiplicand submatrix A in turn and extends it into a vector; the vector processing unit VPU reads the corresponding row of the multiplier submatrix B (row B0 first) and multiply-accumulates it with the broadcast vector of each element; when row A0 of the multiplicand submatrix has been traversed, row C0 of the result matrix C has been computed; and when all rows of the multiplicand submatrix A have been traversed, the computation of the result matrix C is complete.
4. The block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units according to claim 1, 2 or 3, characterized in that, in step (2), storage area Buffer0 stores the multiplier submatrix B and the output submatrix of the result matrix C for the current submatrix multiplication, while the DMA controller transfers the submatrix data of the multiplier matrix B needed for the next submatrix multiplication into storage area Buffer1 and moves the previous result matrix data out to external storage.
CN201310166411.3A 2013-05-08 2013-05-08 Block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units Active CN103294648B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310166411.3A CN103294648B (en) 2013-05-08 2013-05-08 Block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units


Publications (2)

Publication Number Publication Date
CN103294648A true CN103294648A (en) 2013-09-11
CN103294648B CN103294648B (en) 2016-06-01

Family

ID=49095548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310166411.3A Active CN103294648B (en) Block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units

Country Status (1)

Country Link
CN (1) CN103294648B (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102018110607A1 (en) 2017-05-08 2018-11-08 Nvidia Corporation Generalized acceleration of matrix multiplication and accumulation operations

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7844630B2 (en) * 2007-09-01 2010-11-30 International Business Machines Corporation Method and structure for fast in-place transformation of standard full and packed matrix data formats
CN102214160A (en) * 2011-07-08 2011-10-12 中国科学技术大学 Single-accuracy matrix multiplication optimization method based on loongson chip 3A
CN102446160A (en) * 2011-09-06 2012-05-09 中国人民解放军国防科学技术大学 Dual-precision SIMD (Single Instruction Multiple Data) component-oriented matrix multiplication implementation method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Ji Kun, et al.: "Research and Implementation of a Blocked Algorithm for Matrix Triangular Decomposition", Computer Applications and Software *
Chen Jing, et al.: "Analysis of Distributed Parallel Matrix Multiplication Algorithms", Measurement & Control Technology *

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461465A (en) * 2014-12-29 2015-03-25 南京大学 High-efficiency controller based on ping-pong operation and method thereof
CN104899182A (en) * 2015-06-09 2015-09-09 中国人民解放军国防科学技术大学 Matrix multiplication acceleration method for supporting variable blocks
CN104899182B (en) * 2015-06-09 2017-10-31 中国人民解放军国防科学技术大学 Matrix multiplication acceleration method supporting variable-size blocks
CN105426344A (en) * 2015-11-09 2016-03-23 南京大学 Matrix calculation method of distributed large-scale matrix multiplication based on Spark
CN106445471A (en) * 2016-10-13 2017-02-22 北京百度网讯科技有限公司 Processor and method for executing matrix multiplication on processor
US10140251B2 (en) 2016-10-13 2018-11-27 Beijing Baidu Netcom Science And Technology Co., Ltd. Processor and method for executing matrix multiplication operation on processor
US10454680B2 (en) 2016-11-01 2019-10-22 Beijing Baidu Netcom Science And Technology Co., Ltd. RSA decryption processor and method for controlling RSA decryption processor
CN108509384A (en) * 2017-02-24 2018-09-07 富士通株式会社 Computational methods, information processing unit, calculation procedure and information processing system
CN108509384B (en) * 2017-02-24 2022-04-12 富士通株式会社 Calculation method, information processing apparatus, calculation program, and information processing system
US10884734B2 (en) 2017-05-08 2021-01-05 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US10338919B2 (en) 2017-05-08 2019-07-02 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
CN109086075A (en) * 2017-10-30 2018-12-25 上海寒武纪信息科技有限公司 Artificial intelligence process device and the method for executing Matrix Multiplication vector instruction using processor
US12050887B2 (en) 2017-10-30 2024-07-30 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
US11922132B2 (en) 2017-10-30 2024-03-05 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN109086075B (en) * 2017-10-30 2021-06-08 上海寒武纪信息科技有限公司 Artificial intelligence processor and method for executing matrix multiplication vector instruction by using same
US11762631B2 (en) 2017-10-30 2023-09-19 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN111902813B (en) * 2018-03-27 2024-05-07 Sapeon韩国株式会社 Apparatus and method for convolution operation
CN111902813A (en) * 2018-03-27 2020-11-06 Sk电信有限公司 Apparatus and method for convolution operation
CN110415157A (en) * 2018-04-26 2019-11-05 华为技术有限公司 A kind of calculation method and device of matrix multiplication
CN110415157B (en) * 2018-04-26 2024-01-30 华为技术有限公司 Matrix multiplication calculation method and device
CN108805273A (en) * 2018-05-20 2018-11-13 复旦大学 Door control unit accelerates the hardware circuit implementation of operation in a kind of LSTM
CN108985450B (en) * 2018-06-28 2019-10-29 中国人民解放军国防科技大学 Vector processor-oriented convolution neural network operation vectorization method
CN108985450A (en) * 2018-06-28 2018-12-11 中国人民解放军国防科技大学 Vector processor-oriented convolution neural network operation vectorization method
US11990137B2 (en) 2018-09-13 2024-05-21 Shanghai Cambricon Information Technology Co., Ltd. Image retouching method and terminal device
US11996105B2 (en) 2018-09-13 2024-05-28 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
US12094456B2 (en) 2018-09-13 2024-09-17 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and system
US12057110B2 (en) 2018-09-13 2024-08-06 Shanghai Cambricon Information Technology Co., Ltd. Voice recognition based on neural networks
US12057109B2 (en) 2018-09-13 2024-08-06 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN111045958A (en) * 2018-10-11 2020-04-21 展讯通信(上海)有限公司 Acceleration engine and processor
CN110263296A (en) * 2019-05-18 2019-09-20 南京惟心光电系统有限公司 A kind of matrix-vector multiplier and its operation method based on photoelectricity computing array
CN112346852A (en) * 2019-08-06 2021-02-09 脸谱公司 Distributed physical processing of matrix summation operations
CN112446007A (en) * 2019-08-29 2021-03-05 上海华为技术有限公司 Matrix operation method, operation device and processor
CN111737292A (en) * 2020-07-16 2020-10-02 腾讯科技(深圳)有限公司 Data retrieval method and related device
CN112948758A (en) * 2021-02-24 2021-06-11 上海商汤智能科技有限公司 Data processing method and device and chip
US11556337B2 (en) 2021-04-12 2023-01-17 Analog Devices International Unlimited Company Parallel matrix multiplication technique optimized for memory fetches
CN114489496A (en) * 2022-01-14 2022-05-13 南京邮电大学 Data storage and transmission method based on FPGA artificial intelligence accelerator
CN114489496B (en) * 2022-01-14 2024-05-21 南京邮电大学 Data storage and transmission method based on FPGA artificial intelligent accelerator

Also Published As

Publication number Publication date
CN103294648B (en) 2016-06-01

Similar Documents

Publication Publication Date Title
CN103294648B (en) Block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units
US12086700B2 (en) Neural processor
US12032653B2 (en) Method and apparatus for distributed and cooperative computation in artificial neural networks
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
CN103049241B Method for improving computing performance of CPU+GPU heterogeneous devices
CN103440121B Vector-processor-oriented triangular matrix multiplication vectorization method
CN102411558B Vector-processor-oriented vectorization implementation method for large matrix multiplication
WO2019128404A1 (en) Matrix multiplier
WO2019205617A1 (en) Calculation method and apparatus for matrix multiplication
CN112465110B (en) Hardware accelerator for convolution neural network calculation optimization
CN105912501B SM4-128 encryption algorithm implementation method and system based on a large-scale coarse-grained reconfigurable processor
US20150088954A1 (en) System and Method for Sparse Matrix Vector Multiplication Processing
CN105335331B SHA256 implementation method and system based on a large-scale coarse-grained reconfigurable processor
CN103336758A (en) Sparse matrix storage method CSRL (Compressed Sparse Row with Local Information) and SpMV (Sparse Matrix Vector Multiplication) realization method based on same
CN102110079B (en) Tuning calculation method of distributed conjugate gradient method based on MPI
CN103984527A (en) Method optimizing sparse matrix vector multiplication to improve incompressible pipe flow simulation efficiency
Yue et al. A 28nm 16.9-300TOPS/W computing-in-memory processor supporting floating-point NN inference/training with intensive-CIM sparse-digital architecture
CN111859277B (en) Sparse matrix vector multiplication vectorization implementation method
CN104317770A (en) Data storage structure and data access method for multiple core processing system
CN110598844A (en) Parallel convolution neural network accelerator based on FPGA and acceleration method
CN104615584A Vectorization method for solving large-scale triangular linear systems of equations on GPDSP
CN104636316A (en) GPDSP-oriented large-scale matrix multiplication calculation method
CN104636315A (en) GPDSP-oriented matrix LU decomposition vectorization calculation method
CN102411773B Vector-processor-oriented mean-residual normalized product correlation vectorization method
WO2021057111A1 (en) Computing device and method, chip, electronic device, storage medium and program

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant