CN103294648A - Block matrix multiplication vectorization method supporting vector processor with multiple MAC (multiply accumulate) operational units - Google Patents

Block matrix multiplication vectorization method supporting vector processor with multiple MAC (multiply accumulate) operational units Download PDF

Info

Publication number
CN103294648A
CN103294648A CN2013101664113A CN201310166411A
Authority
CN
China
Prior art keywords
submatrix
matrix
vector
multiplication
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013101664113A
Other languages
Chinese (zh)
Other versions
CN103294648B (en)
Inventor
刘仲
陈书明
窦强
郭阳
刘衡竹
田希
龚国辉
陈海燕
彭元喜
万江华
刘胜
陈跃跃
扈啸
吴家铸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201310166411.3A priority Critical patent/CN103294648B/en
Publication of CN103294648A publication Critical patent/CN103294648A/en
Application granted granted Critical
Publication of CN103294648B publication Critical patent/CN103294648B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Complex Calculations (AREA)

Abstract

A block matrix multiplication vectorization method supporting a vector processor with multiple MAC (multiply accumulate) operational units includes the steps of: (1) determining the optimal submatrix block size, namely the numbers of rows and columns of the submatrices of a multiplier matrix B and of a multiplicand matrix A, according to the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC operational units in each VPE, the capacity s of the vector memory, and the data size d of the matrix elements; (2) dividing the vector memory of capacity s into two storage areas of equal capacity, Buffer 0 and Buffer 1, and carrying out the submatrix multiplications between Buffer 0 and Buffer 1 in ping-pong fashion until the multiplication of the whole matrix is complete. The method is simple to implement and convenient to operate, and improves the parallelism and operational efficiency of the vector processor.

Description

Block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units
Technical field
The present invention relates generally to the technical field of data processing, and in particular to a block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units.
Background technology
With the growing demand for high-performance computing in compute-intensive applications such as large-scale dense linear system solving, 4G wireless communication, radar signal processing, high-definition video, and digital image processing, computer architecture has changed markedly, and many novel architectures have appeared, such as multi-core architectures, heterogeneous multi-core architectures, stream processor architectures, and vector processor architectures. These new architectures integrate multiple processor cores on a single chip, each core containing abundant arithmetic units, which greatly increases the computational performance of the chip; at the same time, they pose new challenges for software development. Because a large number of existing programs and algorithms were designed for single-core processors, how to fully exploit parallelism at all levels for the characteristics of multi-core, multi-unit architectures, and how to parallelize and vectorize these application algorithms efficiently, are major current difficulties.
" matrix multiplication " is high-performance calculation (High Performance Computing, HPC) one of Chang Yong nucleus module, be typical computation-intensive and memory access intensive applications, taking advantage of of processor added (Multiply Accumulate, MAC) ability and memory access bandwidth requirement are very high, the time complexity that calculates is very high, is approximately O(N 3), N is matrix size.It is lower that the three traditional methods that recirculate the realization matrix multiplication are calculated memory access, and it is big that the data of Cache lack, matrix data is moved the expense accounting, causes the operation efficiency of processor lower.The partitioned matrix multiplication method is divided into the multiplication of large matrix the multiplication of a series of submatrixs, by the block size of submatrix reasonably is set, the block size blocksize of submatrix satisfies blocksize<=sqrt (M/3) usually, M is the capacity of Cache, make the data access when submatrix calculates all in Cache, to hit, reduce the computing time of whole large matrix multiplication the computing time by the minimizing submatrix, thereby increase substantially the operation efficiency of processor.
Fig. 1 is a general structural diagram of a vector processor with multiple MAC operational units. It comprises a scalar processing unit (Scalar Processing Unit, SPU) and a vector processing unit (Vector Processing Unit, VPU). The SPU is responsible for scalar computation and flow control, and the VPU is responsible for vector computation; the VPU comprises several vector processing elements (Vector Processing Element, VPE), each containing multiple computational functional units such as MAC0 and MAC1 as well as other functional units such as ALU and BP. The SPU and VPU provide data channels for transferring and exchanging data. The vector data access unit supports Load/Store of vector data and provides a large-capacity dedicated vector memory rather than the cache mechanism of a single-core processor, so existing blocked matrix multiplication methods are not suited to this class of vector processors. Therefore, an efficient block matrix multiplication vectorization method supporting vector processors with multiple MAC operational units is urgently needed, so as to bring out the best operational efficiency of the vector processor.
Summary of the invention
The technical problem to be solved by the present invention is: in view of the technical problems existing in the prior art, to provide a block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units that is simple to implement, convenient to operate, and improves the parallelism and computational efficiency of the vector processor.
To solve the above technical problems, the present invention adopts the following technical solution:
A block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units, the flow of which is:
(1) according to the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC operational units in each VPE, the capacity s of the vector memory, and the data size d of the matrix elements, determine the optimal submatrix block size blocksize, i.e. determine the numbers of rows and columns of the submatrix of the multiplier matrix B and of the submatrix of the multiplicand matrix A;
(2) divide the vector memory of capacity s into two storage areas of equal capacity, Buffer0 and Buffer1, and carry out the submatrix multiplications between Buffer0 and Buffer1 in ping-pong fashion in turn, until the whole matrix multiplication is complete.
As a further improvement of the present invention:
In step (1), the submatrix of the multiplier matrix B has p*m columns and (s/2/2)/(p*m*d) rows; after the submatrix block size of the multiplier matrix B is determined, the submatrix block size of the multiplicand matrix A is determined: the numbers of rows and columns of the submatrix of the multiplicand matrix A both equal the number of rows of the submatrix of the multiplier matrix B, i.e. (s/2/2)/(p*m*d).
The scalar processing unit SPU of the vector processor reads each element of each row of the multiplicand submatrix A in turn and extends it into a vector; the vector processing unit VPU reads the corresponding row of the multiplier submatrix B (row B0 first) and multiply-accumulates it with the broadcast vector of each element. When row A0 of the multiplicand submatrix has been traversed, row C0 of the result matrix C has been computed; when all rows of the multiplicand submatrix A have been traversed, the computation of the result matrix C is complete.
In step (2), storage area Buffer0 stores the multiplier submatrix B and the output submatrix of the result matrix C for the current submatrix multiplication, while the DMA controller transfers the submatrix data of the multiplier matrix B needed for the next submatrix multiplication into storage area Buffer1 and moves the previous result matrix data out to external storage.
Compared with the prior art, the advantages of the present invention are: the present invention determines the optimal submatrix block size blocksize according to the architectural characteristics of the vector processor and the data size of the matrix elements, effectively improving the compute-to-memory-access ratio of the processor. Using double-buffered ping-pong submatrix multiplication, data movement time and computation time can be effectively overlapped, reducing total computing time. The scalar processing unit SPU reads the row data of the multiplicand submatrix and extends each element into a vector, and the vector processing unit VPU multiply-accumulates it with the row-wise vector data of the multiplier submatrix, avoiding column-wise data accesses and reduction summation of vector data. These advantages make the method of the present invention simple to implement and convenient to operate; it can fully exploit the instruction-level, data-level, and task-level parallelism of the vector processor and raise processor efficiency above 90%, thereby giving full play to the high-performance computing capability of a vector processor with multiple MAC operational units.
Description of drawings
Fig. 1 is a general structural diagram of a vector processor with multiple MAC operational units.
Fig. 2 is a flow diagram of the method of the present invention.
Fig. 3 is a flow diagram of determining the optimal submatrix block size according to the architectural characteristics of the vector processor in a specific embodiment.
Fig. 4 is a schematic diagram of the computation in the submatrix multiplication of the present invention.
Fig. 5 is a flow diagram of the double-buffered ping-pong submatrix multiplication in a specific embodiment.
Embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in Fig. 2, the block matrix multiplication vectorization method of the present invention supporting a vector processor with multiple MAC operational units proceeds as follows:
(1) first, according to the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC operational units in each VPE, the capacity s of the vector memory, and the data size d of the matrix elements, determine the optimal submatrix block size blocksize, i.e. determine the numbers of rows and columns of the submatrix of the multiplier matrix B and of the submatrix of the multiplicand matrix A.
(2) divide the vector memory of capacity s into two storage areas of equal capacity, Buffer0 and Buffer1, and carry out the submatrix multiplications between Buffer0 and Buffer1 in ping-pong fashion in turn, until the whole matrix multiplication is complete.
As shown in Fig. 3, in a concrete application, the optimal submatrix block size blocksize is determined according to the number p of VPEs of the vector processor, the number m of MAC operational units in each VPE, the capacity s of the vector memory, and the data size d of the matrix elements. The submatrix of the multiplier matrix B has p*m columns and (s/2/2)/(p*m*d) rows. After the submatrix block size of the multiplier matrix B is determined, the submatrix block size of the multiplicand matrix A is determined: its numbers of rows and columns both equal the number of rows of the submatrix of B, i.e. (s/2/2)/(p*m*d). For example, suppose the matrix elements are single-precision floating-point data of size 4B (bytes), the capacity of the vector memory is 1024KB, the number of VPEs is p=16, and the number of MAC operational units per VPE is m=2. Then the submatrix of the multiplier matrix B has p*m=16*2=32 columns and (1024*1024/2/2)/(16*2*4)=2048 rows, and the submatrix of the multiplicand matrix A has 2048 rows and 2048 columns.
As shown in Fig. 4, in this embodiment the submatrix of the multiplier matrix B has 4 columns and 8 rows, and the submatrix of the multiplicand matrix A has 8 rows and 8 columns. The method adopted is: the scalar processing unit SPU of the vector processor reads each element of each row of the multiplicand submatrix A in turn and extends it into a vector; for example, element a00 of row A0 in Fig. 4 is extended into the vector (a00, a00, a00, a00), and element a01 into the vector (a01, a01, a01, a01). The vector processing unit VPU reads the rows of the multiplier submatrix B in turn, starting with row B0 = (b00, b01, b02, b03), and multiply-accumulates each row with the corresponding broadcast vector. When row A0 of the multiplicand submatrix has been traversed, row C0 = (c00, c01, c02, c03) of the result matrix C has been computed; when all rows of the multiplicand submatrix A have been traversed, the computation of the result matrix C is complete.
As shown in Fig. 5, the blocked matrix multiplication of this embodiment realizes the submatrix multiplications with double-buffered (ping-pong) operation. The vector memory of capacity s is divided into two storage areas of equal capacity, Buffer0 and Buffer1. Buffer0 stores the multiplier submatrix B and the output submatrix of the result matrix C for the current submatrix multiplication, while the DMA controller transfers the submatrix data of the multiplier matrix B needed for the next submatrix multiplication into Buffer1 and moves the previous result matrix data out to external storage.
In summary, the block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units realized by the present invention determines the optimal submatrix block size blocksize according to the architectural characteristics of the vector processor. Double-buffered ping-pong submatrix multiplication effectively overlaps data movement time with computation time, reducing total computing time. The scalar processing unit SPU reads the row data of the multiplicand submatrix and extends each element into a vector, and the vector processing unit VPU multiply-accumulates it with the row-wise vector data of the multiplier submatrix, avoiding column-wise data accesses and reduction summation of vector data. These advantages make the method simple to implement and convenient to operate; it can fully exploit the instruction-level, data-level, and task-level parallelism of the vector processor and raise processor efficiency above 90%, thereby giving full play to the high-performance computing capability of a vector processor with multiple MAC operational units.
The above are merely preferred embodiments of the present invention, and the scope of protection of the present invention is not limited to the above embodiments; all technical solutions falling under the idea of the present invention belong to its scope of protection. It should be pointed out that, for those skilled in the art, improvements and modifications made without departing from the principles of the present invention should also be regarded as within the scope of protection of the present invention.

Claims (4)

1. A block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units, characterized in that the flow is:
(1) according to the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC operational units in each VPE, the capacity s of the vector memory, and the data size d of the matrix elements, determining the optimal submatrix block size blocksize, i.e. determining the numbers of rows and columns of the submatrix of the multiplier matrix B and of the submatrix of the multiplicand matrix A;
(2) dividing the vector memory of capacity s into two storage areas of equal capacity, Buffer0 and Buffer1, and carrying out the submatrix multiplications between Buffer0 and Buffer1 in ping-pong fashion in turn, until the whole matrix multiplication is complete.
2. The block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units according to claim 1, characterized in that, in step (1), the submatrix of the multiplier matrix B has p*m columns and (s/2/2)/(p*m*d) rows; after the submatrix block size of the multiplier matrix B is determined, the submatrix block size of the multiplicand matrix A is determined, the numbers of rows and columns of the submatrix of the multiplicand matrix A both being equal to the number of rows of the submatrix of the multiplier matrix B, i.e. (s/2/2)/(p*m*d).
3. The block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units according to claim 2, characterized in that the scalar processing unit SPU of the vector processor reads each element of each row of the multiplicand submatrix A in turn and extends it into a vector; the vector processing unit VPU reads the corresponding row of the multiplier submatrix B (row B0 first) and multiply-accumulates it with the broadcast vector of each element; when row A0 of the multiplicand submatrix has been traversed, row C0 of the result matrix C has been computed; and when all rows of the multiplicand submatrix A have been traversed, the computation of the result matrix C is complete.
4. The block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units according to claim 1, 2 or 3, characterized in that, in step (2), storage area Buffer0 stores the multiplier submatrix B and the output submatrix of the result matrix C for the current submatrix multiplication, while the DMA controller transfers the submatrix data of the multiplier matrix B needed for the next submatrix multiplication into storage area Buffer1 and moves the previous result matrix data out to external storage.
CN201310166411.3A 2013-05-08 2013-05-08 Block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units Active CN103294648B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310166411.3A CN103294648B (en) 2013-05-08 2013-05-08 Block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units


Publications (2)

Publication Number Publication Date
CN103294648A true CN103294648A (en) 2013-09-11
CN103294648B CN103294648B (en) 2016-06-01

Family

ID=49095548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310166411.3A Active CN103294648B (en) Block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units

Country Status (1)

Country Link
CN (1) CN103294648B (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102018110607A1 (en) 2017-05-08 2018-11-08 Nvidia Corporation Generalized acceleration of matrix multiplication and accumulation operations

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7844630B2 (en) * 2007-09-01 2010-11-30 International Business Machines Corporation Method and structure for fast in-place transformation of standard full and packed matrix data formats
CN102214160A (en) * 2011-07-08 2011-10-12 中国科学技术大学 Single-accuracy matrix multiplication optimization method based on loongson chip 3A
CN102446160A (en) * 2011-09-06 2012-05-09 中国人民解放军国防科学技术大学 Dual-precision SIMD (Single Instruction Multiple Data) component-oriented matrix multiplication implementation method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Ji Kun, et al.: "Research and Implementation of a Blocked Algorithm for Matrix Triangular Decomposition", Computer Applications and Software *
Chen Jing, et al.: "Analysis of Distributed Parallel Matrix Multiplication Algorithms", Measurement & Control Technology *

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461465A (en) * 2014-12-29 2015-03-25 南京大学 High-efficiency controller based on ping-pong operation and method thereof
CN104899182A (en) * 2015-06-09 2015-09-09 中国人民解放军国防科学技术大学 Matrix multiplication acceleration method for supporting variable blocks
CN104899182B (en) * 2015-06-09 2017-10-31 中国人民解放军国防科学技术大学 Matrix multiplication acceleration method supporting variable-size blocks
CN105426344A (en) * 2015-11-09 2016-03-23 南京大学 Matrix calculation method of distributed large-scale matrix multiplication based on Spark
CN106445471A (en) * 2016-10-13 2017-02-22 北京百度网讯科技有限公司 Processor and method for executing matrix multiplication on processor
US10140251B2 (en) 2016-10-13 2018-11-27 Beijing Baidu Netcom Science And Technology Co., Ltd. Processor and method for executing matrix multiplication operation on processor
US10454680B2 (en) 2016-11-01 2019-10-22 Beijing Baidu Netcom Science And Technology Co., Ltd. RSA decryption processor and method for controlling RSA decryption processor
CN108509384A (en) * 2017-02-24 2018-09-07 富士通株式会社 Computational methods, information processing unit, calculation procedure and information processing system
CN108509384B (en) * 2017-02-24 2022-04-12 富士通株式会社 Calculation method, information processing apparatus, calculation program, and information processing system
US10884734B2 (en) 2017-05-08 2021-01-05 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US10338919B2 (en) 2017-05-08 2019-07-02 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
CN109086075A (en) * 2017-10-30 2018-12-25 上海寒武纪信息科技有限公司 Artificial intelligence process device and the method for executing Matrix Multiplication vector instruction using processor
US12050887B2 (en) 2017-10-30 2024-07-30 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
US11922132B2 (en) 2017-10-30 2024-03-05 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN109086075B (en) * 2017-10-30 2021-06-08 上海寒武纪信息科技有限公司 Artificial intelligence processor and method for executing matrix multiplication vector instruction by using same
US11762631B2 (en) 2017-10-30 2023-09-19 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN111902813B (en) * 2018-03-27 2024-05-07 Sapeon韩国株式会社 Apparatus and method for convolution operation
CN111902813A (en) * 2018-03-27 2020-11-06 Sk电信有限公司 Apparatus and method for convolution operation
CN110415157A (en) * 2018-04-26 2019-11-05 华为技术有限公司 A kind of calculation method and device of matrix multiplication
CN110415157B (en) * 2018-04-26 2024-01-30 华为技术有限公司 Matrix multiplication calculation method and device
CN108805273A (en) * 2018-05-20 2018-11-13 复旦大学 Door control unit accelerates the hardware circuit implementation of operation in a kind of LSTM
CN108985450B (en) * 2018-06-28 2019-10-29 中国人民解放军国防科技大学 Vector processor-oriented convolution neural network operation vectorization method
CN108985450A (en) * 2018-06-28 2018-12-11 中国人民解放军国防科技大学 Vector processor-oriented convolution neural network operation vectorization method
US11990137B2 (en) 2018-09-13 2024-05-21 Shanghai Cambricon Information Technology Co., Ltd. Image retouching method and terminal device
US11996105B2 (en) 2018-09-13 2024-05-28 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
US12094456B2 (en) 2018-09-13 2024-09-17 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and system
US12057110B2 (en) 2018-09-13 2024-08-06 Shanghai Cambricon Information Technology Co., Ltd. Voice recognition based on neural networks
US12057109B2 (en) 2018-09-13 2024-08-06 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN111045958A (en) * 2018-10-11 2020-04-21 展讯通信(上海)有限公司 Acceleration engine and processor
CN110263296A (en) * 2019-05-18 2019-09-20 南京惟心光电系统有限公司 A kind of matrix-vector multiplier and its operation method based on photoelectricity computing array
CN112346852A (en) * 2019-08-06 2021-02-09 脸谱公司 Distributed physical processing of matrix summation operations
CN112446007A (en) * 2019-08-29 2021-03-05 上海华为技术有限公司 Matrix operation method, operation device and processor
CN111737292A (en) * 2020-07-16 2020-10-02 腾讯科技(深圳)有限公司 Data retrieval method and related device
CN112948758A (en) * 2021-02-24 2021-06-11 上海商汤智能科技有限公司 Data processing method and device and chip
US11556337B2 (en) 2021-04-12 2023-01-17 Analog Devices International Unlimited Company Parallel matrix multiplication technique optimized for memory fetches
CN114489496A (en) * 2022-01-14 2022-05-13 南京邮电大学 Data storage and transmission method based on FPGA artificial intelligence accelerator
CN114489496B (en) * 2022-01-14 2024-05-21 南京邮电大学 Data storage and transmission method based on FPGA artificial intelligent accelerator

Also Published As

Publication number Publication date
CN103294648B (en) 2016-06-01

Similar Documents

Publication Publication Date Title
CN103294648B (en) Block matrix multiplication vectorization method supporting a vector processor with multiple MAC operational units
US12086700B2 (en) Neural processor
US12032653B2 (en) Method and apparatus for distributed and cooperative computation in artificial neural networks
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
CN103049241B Method for improving computing performance of CPU+GPU heterogeneous devices
CN103440121B Vector-processor-oriented triangular matrix multiplication vectorization method
CN102411558B Vector-processor-oriented vectorization implementation method for large matrix multiplication
WO2019128404A1 (en) Matrix multiplier
WO2019205617A1 (en) Calculation method and apparatus for matrix multiplication
CN112465110B (en) Hardware accelerator for convolution neural network calculation optimization
CN105912501B SM4-128 encryption algorithm implementation method and system based on a large-scale coarse-grained reconfigurable processor
US20150088954A1 (en) System and Method for Sparse Matrix Vector Multiplication Processing
CN105335331B SHA256 implementation method and system based on a large-scale coarse-grained reconfigurable processor
CN103336758A (en) Sparse matrix storage method CSRL (Compressed Sparse Row with Local Information) and SpMV (Sparse Matrix Vector Multiplication) realization method based on same
CN102110079B (en) Tuning calculation method of distributed conjugate gradient method based on MPI
CN103984527A (en) Method optimizing sparse matrix vector multiplication to improve incompressible pipe flow simulation efficiency
Yue et al. A 28nm 16.9-300TOPS/W computing-in-memory processor supporting floating-point NN inference/training with intensive-CIM sparse-digital architecture
CN111859277B (en) Sparse matrix vector multiplication vectorization implementation method
CN104317770A (en) Data storage structure and data access method for multiple core processing system
CN110598844A (en) Parallel convolution neural network accelerator based on FPGA and acceleration method
CN104615584A Vectorization method for solving large-scale triangular linear systems of equations on GPDSP
CN104636316A (en) GPDSP-oriented large-scale matrix multiplication calculation method
CN104636315A (en) GPDSP-oriented matrix LU decomposition vectorization calculation method
CN102411773B Vector-processor-oriented mean-residual normalized product correlation vectorization method
WO2021057111A1 (en) Computing device and method, chip, electronic device, storage medium and program

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant