CN103294648B - Partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units - Google Patents

Partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units Download PDF

Info

Publication number
CN103294648B
CN103294648B CN201310166411.3A CN201310166411A
Authority
CN
China
Prior art keywords
submatrix
matrix
vector
processor
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310166411.3A
Other languages
Chinese (zh)
Other versions
CN103294648A (en)
Inventor
刘仲
陈书明
窦强
郭阳
刘衡竹
田希
龚国辉
陈海燕
彭元喜
万江华
刘胜
陈跃跃
扈啸
吴家铸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201310166411.3A priority Critical patent/CN103294648B/en
Publication of CN103294648A publication Critical patent/CN103294648A/en
Application granted granted Critical
Publication of CN103294648B publication Critical patent/CN103294648B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Abstract

A partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units. The flow is: (1) according to the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC units in each VPE, the capacity s of the vector memory, and the data size d of a matrix element, determine the optimal submatrix block size blocksize, that is, determine the number of columns and rows of the submatrix of the multiplier matrix B and the number of rows and columns of the submatrix of the multiplicand matrix A; (2) divide the vector memory of capacity s into two storage areas of equal capacity, Buffer0 and Buffer1, and perform the submatrix multiplications in a ping-pong manner alternating between Buffer0 and Buffer1 until the whole matrix multiplication is completed. The present invention has the advantages of being simple to implement, easy to operate, able to exploit the parallelism of the vector processor, and able to improve the computational efficiency of the processor.

Description

Partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units
Technical field
The present invention relates generally to the technical field of data processing, and in particular to a partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units.
Background art
With the growing high-performance computing demands of compute-intensive applications such as the solution of large-scale dense linear systems, 4G wireless communication, radar signal processing, high-definition video and digital image processing, computer architecture has changed noticeably and many new architectures have appeared, such as multi-core architectures, heterogeneous multi-core architectures, stream processor architectures and vector processor architectures. These new architectures integrate multiple processor cores on a single chip, each core containing abundant arithmetic units, which greatly increases the computational performance of the chip; at the same time, they pose new challenges to software development. Because a large number of existing programs and algorithms are designed for single-core processors, how to fully exploit parallelism at all levels for architectural features such as multiple cores and multiple arithmetic units, and how to parallelize and vectorize these application algorithms efficiently, are the main difficulties currently faced.
" matrix multiplication " is high-performance calculation (HighPerformanceComputing, HPC) one of conventional core module, it is typical computation-intensive and memory access intensive applications, taking advantage of of treater is added (MultiplyAccumulate, MAC) ability and memory bandwidth require very high, the time complexity calculated is very high, is approximately O(N3), N is matrix scale. Traditional three recirculate, and to calculate memory access lower for the method for realization matrix multiplication, and the data of Cache disappearance, that matrix data moves expense accounting is big, causes the operation efficiency of treater lower. The multiplication of large matrix is divided into the multiplication of a series of submatrix by partitioned matrix multiplication method, by reasonably arranging the piece size of submatrix, the piece size blocksize of submatrix meets blocksize��sqrt (M/3) usually, M is the capacity of Cache, data access when submatrix is calculated can all be hit in Cache, by reducing the computing time reducing whole large matrix multiplication computing time of submatrix, thus increase substantially the operation efficiency of treater.
Fig. 1 is a general structural diagram of a vector processor with multiple MAC units. It comprises a scalar processing unit (SPU) and a vector processing unit (VPU). The SPU is responsible for scalar computation and flow control, and the VPU is responsible for vector computation and comprises several vector processing elements (VPEs); each VPE contains multiple arithmetic functional units such as MAC0 and MAC1, as well as other functional units such as ALU and BP. The SPU and the VPU provide data channels for transferring and exchanging data. A vector data access unit supports Load/Store of vector data and provides a large-capacity dedicated vector memory instead of the cache mechanism of a single-core processor, so existing partitioned matrix multiplication methods are not suitable for this kind of vector processor. Therefore, there is an urgent need for an efficient partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units, so that the vector processor can achieve its best efficiency.
Summary of the invention
The technical problem to be solved by the present invention is: in view of the problems of the prior art, the present invention provides a partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units that is simple to implement, easy to operate, able to exploit the parallelism of the vector processor, and able to improve the computational efficiency of the processor.
To solve the above technical problem, the present invention adopts the following technical solution:
A partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units, the flow being:
(1) according to the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC units in each VPE, the capacity s of the vector memory, and the data size d of a matrix element, determine the optimal submatrix block size blocksize, that is, determine the number of columns and rows of the submatrix of the multiplier matrix B and the number of rows and columns of the submatrix of the multiplicand matrix A;
(2) divide the vector memory of capacity s into two storage areas of equal capacity, Buffer0 and Buffer1, and perform the submatrix multiplications in a ping-pong manner alternating between Buffer0 and Buffer1 until the whole matrix multiplication is completed.
As a further improvement of the present invention:
In step (1), the number of columns of the submatrix of the multiplier matrix B is p*m and the number of rows is (s/2/2)/(p*m*d); after the submatrix block size of the multiplier matrix B is determined, the submatrix block size of the multiplicand matrix A is determined; the number of rows and the number of columns of the submatrix of the multiplicand matrix A both equal the number of rows of the submatrix of the multiplier matrix B, i.e. (s/2/2)/(p*m*d).
The scalar processing unit SPU of the vector processor reads each element of every row of the multiplicand submatrix A in turn and extends it into a vector; the vector processing unit VPU reads the row B0 data of the multiplier submatrix B and multiply-accumulates it with each element of the aforementioned vector; when the row A0 data of the multiplicand submatrix has been traversed, the row C0 data of the result matrix C has been computed; when all rows of the multiplicand submatrix A have been traversed, the computation of the result matrix C is complete.
In step (2), the storage area Buffer0 stores the submatrix of the multiplier matrix B and the output result matrix C used in the current submatrix multiplication, while a DMA controller transfers the submatrix data of the multiplier matrix B required for the next submatrix multiplication into the storage area Buffer1 and moves the result matrix data of the previous submatrix multiplication to external memory.
Compared with the prior art, the advantages of the present invention are: according to the architectural features of the vector processor and the data size of the matrix elements, the optimal submatrix block size blocksize is determined, which effectively improves the compute-to-memory-access ratio of the processor; the double-buffered ping-pong scheme used to perform the submatrix multiplications effectively overlaps the data-movement time with the computation time and reduces the total computation time. The scalar processing unit SPU of the vector processor reads the row data of the multiplicand submatrix and extends each element into a vector, and the vector processing unit VPU multiply-accumulates it with each element of the row data of the multiplier submatrix read row by row, which avoids column-wise data accesses and the reduction summation of vector data. These advantages make the method of the present invention simple to implement and easy to operate; it can fully exploit the instruction-, data- and task-level parallelism of the vector processor and raise the efficiency of the processor above 90%, thereby giving full play to the high-performance computing capability of a vector processor with multiple MAC units.
Brief description of the drawings
Fig. 1 is a general structural diagram of a vector processor with multiple MAC units.
Fig. 2 is a flow diagram of the execution of the method of the present invention.
Fig. 3 is a flow diagram of determining the optimal submatrix block size according to the architectural features of the vector processor in a specific embodiment.
Fig. 4 is a schematic diagram of the computation in a specific implementation of the submatrix multiplication of the present invention.
Fig. 5 is a flow diagram of performing the submatrix multiplications with the double-buffered ping-pong scheme in a specific embodiment.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 2, the partitioned matrix multiplication vectorization method of the present invention supporting a vector processor with multiple MAC units proceeds as follows:
(1) First, according to the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC units in each VPE, the capacity s of the vector memory, and the data size d of a matrix element, determine the optimal submatrix block size blocksize, that is, determine the number of columns and rows of the submatrix of the multiplier matrix B and the number of rows and columns of the submatrix of the multiplicand matrix A.
(2) Divide the vector memory of capacity s into two storage areas of equal capacity, Buffer0 and Buffer1, and perform the submatrix multiplications in a ping-pong manner alternating between Buffer0 and Buffer1 until the whole matrix multiplication is completed.
As shown in Fig. 3, in a concrete application the optimal submatrix block size blocksize is determined according to the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC units in each VPE, the capacity s of the vector memory, and the data size d of a matrix element. The number of columns of the submatrix of the multiplier matrix B is p*m, and the number of rows is (s/2/2)/(p*m*d). After the submatrix block size of the multiplier matrix B is determined, the submatrix block size of the multiplicand matrix A is determined: the number of rows and the number of columns of the submatrix of the multiplicand matrix A both equal the number of rows of the submatrix of the multiplier matrix B, i.e. (s/2/2)/(p*m*d). For example, suppose the matrix elements are single-precision floating-point data with a data size of 4 B (bytes), the capacity of the vector memory is 1024 KB, the number of VPEs is p=16, and the number of MAC units in each VPE is m=2; then the number of columns of the submatrix of the multiplier matrix B is p*m=16*2=32, and the number of rows is (1024*1024/2/2)/(16*2*4)=2048. The number of rows and the number of columns of the submatrix of the multiplicand matrix A both equal 2048.
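The following is a minimal C sketch of this block-size calculation, directly encoding the formulas above; the function and type names are illustrative only and are not defined by the patent:

```c
#include <stdio.h>

/* Submatrix dimensions derived from the architectural parameters:
 * p - number of VPEs, m - MAC units per VPE,
 * s - vector memory capacity in bytes, d - element size in bytes. */
typedef struct {
    unsigned b_cols;   /* columns of the B submatrix: p*m                    */
    unsigned b_rows;   /* rows of the B submatrix: (s/2/2)/(p*m*d)           */
    unsigned a_dim;    /* rows = columns of the A submatrix, equal to b_rows */
} block_size_t;

block_size_t choose_block_size(unsigned p, unsigned m, unsigned long s, unsigned d)
{
    block_size_t bs;
    bs.b_cols = p * m;
    bs.b_rows = (unsigned)((s / 2 / 2) / ((unsigned long)p * m * d));
    bs.a_dim  = bs.b_rows;
    return bs;
}

int main(void)
{
    /* Example from the embodiment: p = 16, m = 2, s = 1024 KB, d = 4 B. */
    block_size_t bs = choose_block_size(16, 2, 1024UL * 1024UL, 4);
    printf("B submatrix: %u rows x %u cols\n", bs.b_rows, bs.b_cols); /* 2048 x 32   */
    printf("A submatrix: %u x %u\n", bs.a_dim, bs.a_dim);             /* 2048 x 2048 */
    return 0;
}
```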
As shown in Fig. 4, in the present embodiment the submatrix of the multiplier matrix B has 4 columns and 8 rows, and the submatrix of the multiplicand matrix A has 8 rows and 8 columns. The method adopted is: the scalar processing unit SPU of the vector processor reads each element of every row of the multiplicand submatrix A in turn and extends it into a vector; for example, element a00 of row A0 in Fig. 4 is extended into the vector (a00, a00, a00, a00), and element a01 is extended into the vector (a01, a01, a01, a01). The vector processing unit VPU reads the row B0 data (b00, b01, b02, b03) of the multiplier submatrix B and multiply-accumulates it with each element of the aforementioned vectors. When the row A0 data of the multiplicand submatrix has been traversed, the row C0 data (c00, c01, c02, c03) of the result matrix C has been computed. When all rows of the multiplicand submatrix A have been traversed, the computation of the result matrix C is complete.
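The following is a minimal C sketch of this row-oriented kernel for general-purpose hardware; the broadcast of a scalar and the p*m parallel MAC lanes of the VPU are emulated here by an ordinary inner loop (in the Fig. 4 example, n = 8 and w = 4):

```c
/* Row-oriented submatrix multiply C = A * B, where the A submatrix is
 * n x n and the B submatrix is n x w (w = p*m vector lanes). Each element
 * a[i][k] is broadcast and multiply-accumulated with row k of B into
 * row i of C, so B and C are only accessed row by row and no column
 * accesses or vector reductions are needed. */
void submatrix_mul(const float *A, const float *B, float *C,
                   unsigned n, unsigned w)
{
    for (unsigned i = 0; i < n; i++) {
        for (unsigned j = 0; j < w; j++)
            C[i * w + j] = 0.0f;                  /* clear result row Ci              */
        for (unsigned k = 0; k < n; k++) {
            float a = A[i * n + k];               /* SPU reads and broadcasts a[i][k] */
            for (unsigned j = 0; j < w; j++)      /* VPU lanes (emulated)             */
                C[i * w + j] += a * B[k * w + j]; /* MAC with row Bk                  */
        }
    }
}
```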
As shown in Fig. 5, the partitioned matrix multiplication in the present embodiment uses a double-buffered ping-pong scheme to perform the submatrix multiplications. The vector memory of capacity s is divided into two storage areas of equal capacity, Buffer0 and Buffer1. Buffer0 stores the submatrix of the multiplier matrix B and the output result matrix C used in the current submatrix multiplication, while the DMA controller transfers the submatrix data of the multiplier matrix B required for the next submatrix multiplication into Buffer1 and moves the result matrix data of the previous submatrix multiplication to external memory.
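The following is a minimal C sketch of this ping-pong scheme for a single pass over the block-columns of B with one resident A submatrix, reusing submatrix_mul from the previous sketch; dma_load_async, dma_store_async and dma_wait are hypothetical stand-ins for the DMA controller interface (assumptions for illustration, not an API defined by the patent):

```c
#include <stddef.h>

/* Ping-pong double buffering over the block-columns of B. While the VPU
 * computes on buf[cur], the DMA controller fills buf[1-cur] with the next
 * B submatrix and the finished C submatrix is drained to external memory. */
typedef struct {
    float *b_sub;   /* B submatrix held in this half of the vector memory */
    float *c_sub;   /* C submatrix result held in this half               */
} half_buffer_t;

/* Hypothetical DMA interface (assumed for illustration only). */
extern void dma_load_async(float *dst, const float *src, size_t bytes);
extern void dma_store_async(float *dst, const float *src, size_t bytes);
extern void dma_wait(void);

/* Kernel from the previous sketch. */
void submatrix_mul(const float *A, const float *B, float *C,
                   unsigned n, unsigned w);

void blocked_mul_pingpong(const float *A, const float *B_ext, float *C_ext,
                          half_buffer_t buf[2], unsigned n, unsigned w,
                          unsigned num_blocks)
{
    unsigned cur = 0;
    dma_load_async(buf[cur].b_sub, B_ext, (size_t)n * w * sizeof(float));
    dma_wait();                                   /* prefetch block-column 0 */

    for (unsigned blk = 0; blk < num_blocks; blk++) {
        unsigned nxt = 1 - cur;
        if (blk + 1 < num_blocks)                 /* start loading the next block-column */
            dma_load_async(buf[nxt].b_sub, B_ext + (size_t)(blk + 1) * n * w,
                           (size_t)n * w * sizeof(float));

        submatrix_mul(A, buf[cur].b_sub, buf[cur].c_sub, n, w); /* compute on current half */

        dma_store_async(C_ext + (size_t)blk * n * w,            /* drain this result */
                        buf[cur].c_sub, (size_t)n * w * sizeof(float));
        dma_wait();                               /* sync DMA before swapping buffers */
        cur = nxt;
    }
}
```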
In summary, the partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units realized by the present invention can determine the optimal submatrix block size blocksize according to the architectural features of the vector processor. The double-buffered ping-pong scheme used to perform the submatrix multiplications effectively overlaps the data-movement time with the computation time and reduces the total computation time. The scalar processing unit SPU of the vector processor reads the row data of the multiplicand submatrix and extends each element into a vector, and the vector processing unit VPU multiply-accumulates it with each element of the row data of the multiplier submatrix read row by row, which avoids column-wise data accesses and the reduction summation of vector data. These advantages make the method of the present invention simple to implement and easy to operate; it can fully exploit the instruction-, data- and task-level parallelism of the vector processor and raise the efficiency of the processor above 90%, thereby giving full play to the high-performance computing capability of a vector processor with multiple MAC units.
The above are only preferred embodiments of the present invention, and the scope of protection of the present invention is not limited to the above embodiments; all technical solutions falling under the concept of the present invention belong to the scope of protection of the present invention. It should be noted that several improvements and modifications made by those skilled in the art without departing from the principles of the present invention should also be regarded as falling within the scope of protection of the present invention.

Claims (3)

1. A partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units, characterized in that the flow is:
(1) according to the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC units in each VPE, the capacity s of the vector memory, and the data size d of a matrix element, determine the optimal submatrix block size blocksize, that is, determine the number of columns and rows of the submatrix of the multiplier matrix B and the number of rows and columns of the submatrix of the multiplicand matrix A;
(2) divide the vector memory of capacity s into two storage areas of equal capacity, Buffer0 and Buffer1, and perform the submatrix multiplications in a ping-pong manner alternating between Buffer0 and Buffer1 until the whole matrix multiplication is completed;
in step (1), the number of columns of the submatrix of the multiplier matrix B is p*m and the number of rows is (s/2/2)/(p*m*d); after the submatrix block size of the multiplier matrix B is determined, the submatrix block size of the multiplicand matrix A is determined; the number of rows and the number of columns of the submatrix of the multiplicand matrix A both equal the number of rows of the submatrix of the multiplier matrix B, i.e. (s/2/2)/(p*m*d).
2. The partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units according to claim 1, characterized in that the scalar processing unit SPU of the vector processor reads each element of every row of the multiplicand submatrix A in turn and extends it into a vector; the vector processing unit VPU reads the row B0 data of the multiplier submatrix B and multiply-accumulates it with each element of the aforementioned vector; when the row A0 data of the multiplicand submatrix has been traversed, the row C0 data of the result matrix C has been computed; when all rows of the multiplicand submatrix A have been traversed, the computation of the result matrix C is complete.
3. The partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units according to claim 1 or 2, characterized in that in step (2) the storage area Buffer0 stores the submatrix of the multiplier matrix B and the output result matrix C used in the current submatrix multiplication, while a DMA controller transfers the submatrix data of the multiplier matrix B required for the next submatrix multiplication into the storage area Buffer1 and moves the result matrix data of the previous submatrix multiplication to external memory.
CN201310166411.3A 2013-05-08 2013-05-08 Partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units Active CN103294648B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310166411.3A CN103294648B (en) 2013-05-08 2013-05-08 Partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310166411.3A CN103294648B (en) 2013-05-08 2013-05-08 Partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units

Publications (2)

Publication Number Publication Date
CN103294648A CN103294648A (en) 2013-09-11
CN103294648B true CN103294648B (en) 2016-06-01

Family

ID=49095548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310166411.3A Active CN103294648B (en) 2013-05-08 2013-05-08 Partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units

Country Status (1)

Country Link
CN (1) CN103294648B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11797302B2 (en) 2017-05-08 2023-10-24 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11816481B2 (en) 2017-05-08 2023-11-14 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461465A (en) * 2014-12-29 2015-03-25 南京大学 High-efficiency controller based on ping-pong operation and method thereof
CN104899182B (en) * 2015-06-09 2017-10-31 中国人民解放军国防科学技术大学 A kind of Matrix Multiplication accelerated method for supporting variable partitioned blocks
CN105426344A (en) * 2015-11-09 2016-03-23 南京大学 Matrix calculation method of distributed large-scale matrix multiplication based on Spark
CN106445471B (en) * 2016-10-13 2018-06-01 北京百度网讯科技有限公司 Processor and the method for performing matrix multiplication on a processor
CN106411519B (en) 2016-11-01 2019-01-25 北京百度网讯科技有限公司 For the processor of RSA decryption and for the control method of RSA decryption processor
JP6912703B2 (en) * 2017-02-24 2021-08-04 富士通株式会社 Arithmetic method, arithmetic unit, arithmetic program and arithmetic system
CN109086075B (en) 2017-10-30 2021-06-08 上海寒武纪信息科技有限公司 Artificial intelligence processor and method for executing matrix multiplication vector instruction by using same
KR102065672B1 (en) * 2018-03-27 2020-01-13 에스케이텔레콤 주식회사 Apparatus and method for convolution operation
CN110415157B (en) * 2018-04-26 2024-01-30 华为技术有限公司 Matrix multiplication calculation method and device
CN108805273A (en) * 2018-05-20 2018-11-13 复旦大学 Door control unit accelerates the hardware circuit implementation of operation in a kind of LSTM
CN108985450B (en) * 2018-06-28 2019-10-29 中国人民解放军国防科技大学 Vector processor-oriented convolution neural network operation vectorization method
CN111045958B (en) * 2018-10-11 2022-09-16 展讯通信(上海)有限公司 Acceleration engine and processor
CN110263296B (en) * 2019-05-18 2020-12-04 南京惟心光电系统有限公司 Matrix vector multiplier based on photoelectric calculation array and operation method thereof
US11010202B2 (en) * 2019-08-06 2021-05-18 Facebook, Inc. Distributed physical processing of matrix sum operation
CN112446007A (en) * 2019-08-29 2021-03-05 上海华为技术有限公司 Matrix operation method, operation device and processor
CN111737292B (en) * 2020-07-16 2021-01-05 腾讯科技(深圳)有限公司 Data retrieval method and related device
US11556337B2 (en) 2021-04-12 2023-01-17 Analog Devices International Unlimited Company Parallel matrix multiplication technique optimized for memory fetches
CN114489496A (en) * 2022-01-14 2022-05-13 南京邮电大学 Data storage and transmission method based on FPGA artificial intelligence accelerator

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7844630B2 (en) * 2007-09-01 2010-11-30 International Business Machines Corporation Method and structure for fast in-place transformation of standard full and packed matrix data formats
CN102214160A (en) * 2011-07-08 2011-10-12 中国科学技术大学 Single-accuracy matrix multiplication optimization method based on loongson chip 3A
CN102446160A (en) * 2011-09-06 2012-05-09 中国人民解放军国防科学技术大学 Dual-precision SIMD (Single Instruction Multiple Data) component-oriented matrix multiplication implementation method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7844630B2 (en) * 2007-09-01 2010-11-30 International Business Machines Corporation Method and structure for fast in-place transformation of standard full and packed matrix data formats
CN102214160A (en) * 2011-07-08 2011-10-12 中国科学技术大学 Single-accuracy matrix multiplication optimization method based on loongson chip 3A
CN102446160A (en) * 2011-09-06 2012-05-09 中国人民解放军国防科学技术大学 Dual-precision SIMD (Single Instruction Multiple Data) component-oriented matrix multiplication implementation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Analysis of distributed parallel matrix multiplication algorithms; Chen Jing, et al.; Measurement & Control Technology; 2005-12-31; Vol. 24, No. 5; pp. 52-54 *
Research and implementation of blocked algorithms for matrix triangular factorization; Ji Kun, et al.; Computer Applications and Software; 2010-09-30; Vol. 27, No. 9; pp. 72-74 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11797302B2 (en) 2017-05-08 2023-10-24 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11797301B2 (en) 2017-05-08 2023-10-24 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11797303B2 (en) 2017-05-08 2023-10-24 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11816481B2 (en) 2017-05-08 2023-11-14 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11816482B2 (en) 2017-05-08 2023-11-14 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations

Also Published As

Publication number Publication date
CN103294648A (en) 2013-09-11

Similar Documents

Publication Publication Date Title
CN103294648B (en) Partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units
KR102492477B1 (en) Matrix multiplier
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
WO2019205617A1 (en) Calculation method and apparatus for matrix multiplication
US8862653B2 (en) System and method for sparse matrix vector multiplication processing
CN103440121B (en) A triangular matrix multiplication vectorization method for vector processors
US8984043B2 (en) Multiplying and adding matrices
CN103049241B (en) A kind of method improving CPU+GPU isomery device calculated performance
Nagar et al. A sparse matrix personality for the convey hc-1
CN105912501B (en) A kind of SM4-128 Encryption Algorithm realization method and systems based on extensive coarseness reconfigurable processor
CN105335331B (en) A kind of SHA256 realization method and systems based on extensive coarseness reconfigurable processor
CN102279818B (en) Vector data access and storage control method supporting limited sharing and vector memory
CN102402415B (en) Device and method for buffering data in dynamic reconfigurable array
Li et al. VBSF: a new storage format for SIMD sparse matrix–vector multiplication on modern processors
US20240119114A1 (en) Matrix Multiplier and Matrix Multiplier Control Method
CN104615584A (en) Method for vectorization computing of solution of large-scale trigonometric linear system of equations for GPDSP
CN111859277B (en) Sparse matrix vector multiplication vectorization implementation method
CN110598844A (en) Parallel convolution neural network accelerator based on FPGA and acceleration method
EP3762831A1 (en) A machine perception and dense algorithm integrated circuit
CN101561797A (en) Method and device for singular value and feature value composition of matrix on processing system
CN104615516A (en) Method for achieving large-scale high-performance Linpack testing benchmark for GPDSP
KR20220094180A (en) Systems, methods, and devices for acceleration of merge join operations
Sun et al. Optimizing sparse matrix-vector multiplication on GPUs via index compression
CN111382855B (en) Data processing device, method, chip and electronic equipment
CN107924331A (en) Technology for flexible remote measurement related to dynamic frequency

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant