CN103294648B - Partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units - Google Patents

Partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units Download PDF

Info

Publication number
CN103294648B
CN103294648B CN201310166411.3A CN201310166411A
Authority
CN
China
Prior art keywords
submatrix
matrix
vector
processor
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310166411.3A
Other languages
Chinese (zh)
Other versions
CN103294648A (en)
Inventor
刘仲
陈书明
窦强
郭阳
刘衡竹
田希
龚国辉
陈海燕
彭元喜
万江华
刘胜
陈跃跃
扈啸
吴家铸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201310166411.3A priority Critical patent/CN103294648B/en
Publication of CN103294648A publication Critical patent/CN103294648A/en
Application granted granted Critical
Publication of CN103294648B publication Critical patent/CN103294648B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Abstract

A partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units. The flow is: (1) according to the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC units in each VPE, the capacity s of the vector memory, and the data size d of a matrix element, determine the optimal submatrix block size blocksize, that is, determine the number of columns and rows of the submatrix of the multiplier matrix B and the number of rows and columns of the submatrix of the multiplicand matrix A; (2) divide the vector memory of capacity s into two storage areas of equal capacity, Buffer0 and Buffer1, and perform the submatrix multiplications in a ping-pong manner alternating between Buffer0 and Buffer1 until the whole matrix multiplication is completed. The present invention has the advantages of being simple to implement, easy to operate, able to exploit the parallelism of the vector processor, and able to improve the computational efficiency of the processor.

Description

Partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units
Technical field
The present invention relates generally to the technical field of data processing, and in particular to a partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units.
Background art
With the growing high-performance computing demands of compute-intensive applications such as the solution of large-scale dense linear systems, 4G wireless communication, radar signal processing, high-definition video and digital image processing, computer architecture has changed noticeably and many new architectures have appeared, such as multi-core architectures, heterogeneous multi-core architectures, stream processor architectures and vector processor architectures. These new architectures integrate multiple processor cores on a single chip, each core containing abundant arithmetic units, which greatly increases the computational performance of the chip; at the same time, they pose new challenges to software development. Because a large number of existing programs and algorithms are designed for single-core processors, how to fully exploit parallelism at all levels for architectural features such as multiple cores and multiple arithmetic units, and how to parallelize and vectorize these application algorithms efficiently, are the main difficulties currently faced.
" matrix multiplication " is high-performance calculation (HighPerformanceComputing, HPC) one of conventional core module, it is typical computation-intensive and memory access intensive applications, taking advantage of of treater is added (MultiplyAccumulate, MAC) ability and memory bandwidth require very high, the time complexity calculated is very high, is approximately O(N3), N is matrix scale. Traditional three recirculate, and to calculate memory access lower for the method for realization matrix multiplication, and the data of Cache disappearance, that matrix data moves expense accounting is big, causes the operation efficiency of treater lower. The multiplication of large matrix is divided into the multiplication of a series of submatrix by partitioned matrix multiplication method, by reasonably arranging the piece size of submatrix, the piece size blocksize of submatrix meets blocksize��sqrt (M/3) usually, M is the capacity of Cache, data access when submatrix is calculated can all be hit in Cache, by reducing the computing time reducing whole large matrix multiplication computing time of submatrix, thus increase substantially the operation efficiency of treater.
Fig. 1 is a general structural diagram of a vector processor with multiple MAC units. It comprises a scalar processing unit (SPU) and a vector processing unit (VPU). The SPU is responsible for scalar computation and flow control, and the VPU is responsible for vector computation and comprises several vector processing elements (VPEs); each VPE contains multiple arithmetic functional units such as MAC0 and MAC1, as well as other functional units such as ALU and BP. The SPU and the VPU provide data channels for transferring and exchanging data. A vector data access unit supports Load/Store of vector data and provides a large-capacity dedicated vector memory instead of the cache mechanism of a single-core processor, so existing partitioned matrix multiplication methods are not suitable for this kind of vector processor. Therefore, there is an urgent need for an efficient partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units, so that the vector processor can achieve its best efficiency.
Summary of the invention
The technical problem to be solved by the present invention is: in view of the problems of the prior art, the present invention provides a partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units that is simple to implement, easy to operate, able to exploit the parallelism of the vector processor, and able to improve the computational efficiency of the processor.
To solve the above technical problem, the present invention adopts the following technical solution:
A partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units, the flow being:
(1) according to the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC units in each VPE, the capacity s of the vector memory, and the data size d of a matrix element, determine the optimal submatrix block size blocksize, that is, determine the number of columns and rows of the submatrix of the multiplier matrix B and the number of rows and columns of the submatrix of the multiplicand matrix A;
(2) divide the vector memory of capacity s into two storage areas of equal capacity, Buffer0 and Buffer1, and perform the submatrix multiplications in a ping-pong manner alternating between Buffer0 and Buffer1 until the whole matrix multiplication is completed.
As a further improvement of the present invention:
In step (1), the number of columns of the submatrix of the multiplier matrix B is p*m and the number of rows is (s/2/2)/(p*m*d); after the submatrix block size of the multiplier matrix B is determined, the submatrix block size of the multiplicand matrix A is determined; the number of rows and the number of columns of the submatrix of the multiplicand matrix A both equal the number of rows of the submatrix of the multiplier matrix B, i.e. (s/2/2)/(p*m*d).
The scalar processing unit SPU of the vector processor reads each element of every row of the multiplicand submatrix A in turn and extends it into a vector; the vector processing unit VPU reads the row B0 data of the multiplier submatrix B and multiply-accumulates it with each element of the aforementioned vector; when the row A0 data of the multiplicand submatrix has been traversed, the row C0 data of the result matrix C has been computed; when all rows of the multiplicand submatrix A have been traversed, the computation of the result matrix C is complete.
In step (2), the storage area Buffer0 stores the submatrix of the multiplier matrix B and the output result matrix C used in the current submatrix multiplication, while a DMA controller transfers the submatrix data of the multiplier matrix B required for the next submatrix multiplication into the storage area Buffer1 and moves the result matrix data of the previous submatrix multiplication to external memory.
Compared with the prior art, the advantages of the present invention are: according to the architectural features of the vector processor and the data size of the matrix elements, the optimal submatrix block size blocksize is determined, which effectively improves the compute-to-memory-access ratio of the processor; the double-buffered ping-pong scheme used to perform the submatrix multiplications effectively overlaps the data-movement time with the computation time and reduces the total computation time. The scalar processing unit SPU of the vector processor reads the row data of the multiplicand submatrix and extends each element into a vector, and the vector processing unit VPU multiply-accumulates it with each element of the row data of the multiplier submatrix read row by row, which avoids column-wise data accesses and the reduction summation of vector data. These advantages make the method of the present invention simple to implement and easy to operate; it can fully exploit the instruction-, data- and task-level parallelism of the vector processor and raise the efficiency of the processor above 90%, thereby giving full play to the high-performance computing capability of a vector processor with multiple MAC units.
Brief description of the drawings
Fig. 1 is a general structural diagram of a vector processor with multiple MAC units.
Fig. 2 is a flow diagram of the execution of the method of the present invention.
Fig. 3 is a flow diagram of determining the optimal submatrix block size according to the architectural features of the vector processor in a specific embodiment.
Fig. 4 is a schematic diagram of the computation in a specific implementation of the submatrix multiplication of the present invention.
Fig. 5 is a flow diagram of performing the submatrix multiplications with the double-buffered ping-pong scheme in a specific embodiment.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 2, the partitioned matrix multiplication vectorization method of the present invention supporting a vector processor with multiple MAC units proceeds as follows:
(1) First, according to the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC units in each VPE, the capacity s of the vector memory, and the data size d of a matrix element, determine the optimal submatrix block size blocksize, that is, determine the number of columns and rows of the submatrix of the multiplier matrix B and the number of rows and columns of the submatrix of the multiplicand matrix A.
(2) Divide the vector memory of capacity s into two storage areas of equal capacity, Buffer0 and Buffer1, and perform the submatrix multiplications in a ping-pong manner alternating between Buffer0 and Buffer1 until the whole matrix multiplication is completed.
As shown in Fig. 3, in a concrete application the optimal submatrix block size blocksize is determined according to the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC units in each VPE, the capacity s of the vector memory, and the data size d of a matrix element. The number of columns of the submatrix of the multiplier matrix B is p*m, and the number of rows is (s/2/2)/(p*m*d). After the submatrix block size of the multiplier matrix B is determined, the submatrix block size of the multiplicand matrix A is determined: the number of rows and the number of columns of the submatrix of the multiplicand matrix A both equal the number of rows of the submatrix of the multiplier matrix B, i.e. (s/2/2)/(p*m*d). For example, suppose the matrix elements are single-precision floating-point data with a data size of 4 B (bytes), the capacity of the vector memory is 1024 KB, the number of VPEs is p=16, and the number of MAC units in each VPE is m=2; then the number of columns of the submatrix of the multiplier matrix B is p*m=16*2=32, and the number of rows is (1024*1024/2/2)/(16*2*4)=2048. The number of rows and the number of columns of the submatrix of the multiplicand matrix A both equal 2048.
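The following is a minimal C sketch of this block-size calculation, directly encoding the formulas above; the function and type names are illustrative only and are not defined by the patent:

```c
#include <stdio.h>

/* Submatrix dimensions derived from the architectural parameters:
 * p - number of VPEs, m - MAC units per VPE,
 * s - vector memory capacity in bytes, d - element size in bytes. */
typedef struct {
    unsigned b_cols;   /* columns of the B submatrix: p*m                    */
    unsigned b_rows;   /* rows of the B submatrix: (s/2/2)/(p*m*d)           */
    unsigned a_dim;    /* rows = columns of the A submatrix, equal to b_rows */
} block_size_t;

block_size_t choose_block_size(unsigned p, unsigned m, unsigned long s, unsigned d)
{
    block_size_t bs;
    bs.b_cols = p * m;
    bs.b_rows = (unsigned)((s / 2 / 2) / ((unsigned long)p * m * d));
    bs.a_dim  = bs.b_rows;
    return bs;
}

int main(void)
{
    /* Example from the embodiment: p = 16, m = 2, s = 1024 KB, d = 4 B. */
    block_size_t bs = choose_block_size(16, 2, 1024UL * 1024UL, 4);
    printf("B submatrix: %u rows x %u cols\n", bs.b_rows, bs.b_cols); /* 2048 x 32   */
    printf("A submatrix: %u x %u\n", bs.a_dim, bs.a_dim);             /* 2048 x 2048 */
    return 0;
}
```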
As shown in Fig. 4, in the present embodiment the submatrix of the multiplier matrix B has 4 columns and 8 rows, and the submatrix of the multiplicand matrix A has 8 rows and 8 columns. The method adopted is: the scalar processing unit SPU of the vector processor reads each element of every row of the multiplicand submatrix A in turn and extends it into a vector; for example, element a00 of row A0 in Fig. 4 is extended into the vector (a00, a00, a00, a00), and element a01 is extended into the vector (a01, a01, a01, a01). The vector processing unit VPU reads the row B0 data (b00, b01, b02, b03) of the multiplier submatrix B and multiply-accumulates it with each element of the aforementioned vectors. When the row A0 data of the multiplicand submatrix has been traversed, the row C0 data (c00, c01, c02, c03) of the result matrix C has been computed. When all rows of the multiplicand submatrix A have been traversed, the computation of the result matrix C is complete.
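The following is a minimal C sketch of this row-oriented kernel for general-purpose hardware; the broadcast of a scalar and the p*m parallel MAC lanes of the VPU are emulated here by an ordinary inner loop (in the Fig. 4 example, n = 8 and w = 4):

```c
/* Row-oriented submatrix multiply C = A * B, where the A submatrix is
 * n x n and the B submatrix is n x w (w = p*m vector lanes). Each element
 * a[i][k] is broadcast and multiply-accumulated with row k of B into
 * row i of C, so B and C are only accessed row by row and no column
 * accesses or vector reductions are needed. */
void submatrix_mul(const float *A, const float *B, float *C,
                   unsigned n, unsigned w)
{
    for (unsigned i = 0; i < n; i++) {
        for (unsigned j = 0; j < w; j++)
            C[i * w + j] = 0.0f;                  /* clear result row Ci              */
        for (unsigned k = 0; k < n; k++) {
            float a = A[i * n + k];               /* SPU reads and broadcasts a[i][k] */
            for (unsigned j = 0; j < w; j++)      /* VPU lanes (emulated)             */
                C[i * w + j] += a * B[k * w + j]; /* MAC with row Bk                  */
        }
    }
}
```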
As shown in Fig. 5, the partitioned matrix multiplication in the present embodiment uses a double-buffered ping-pong scheme to perform the submatrix multiplications. The vector memory of capacity s is divided into two storage areas of equal capacity, Buffer0 and Buffer1. Buffer0 stores the submatrix of the multiplier matrix B and the output result matrix C used in the current submatrix multiplication, while the DMA controller transfers the submatrix data of the multiplier matrix B required for the next submatrix multiplication into Buffer1 and moves the result matrix data of the previous submatrix multiplication to external memory.
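The following is a minimal C sketch of this ping-pong scheme for a single pass over the block-columns of B with one resident A submatrix, reusing submatrix_mul from the previous sketch; dma_load_async, dma_store_async and dma_wait are hypothetical stand-ins for the DMA controller interface (assumptions for illustration, not an API defined by the patent):

```c
#include <stddef.h>

/* Ping-pong double buffering over the block-columns of B. While the VPU
 * computes on buf[cur], the DMA controller fills buf[1-cur] with the next
 * B submatrix and the finished C submatrix is drained to external memory. */
typedef struct {
    float *b_sub;   /* B submatrix held in this half of the vector memory */
    float *c_sub;   /* C submatrix result held in this half               */
} half_buffer_t;

/* Hypothetical DMA interface (assumed for illustration only). */
extern void dma_load_async(float *dst, const float *src, size_t bytes);
extern void dma_store_async(float *dst, const float *src, size_t bytes);
extern void dma_wait(void);

/* Kernel from the previous sketch. */
void submatrix_mul(const float *A, const float *B, float *C,
                   unsigned n, unsigned w);

void blocked_mul_pingpong(const float *A, const float *B_ext, float *C_ext,
                          half_buffer_t buf[2], unsigned n, unsigned w,
                          unsigned num_blocks)
{
    unsigned cur = 0;
    dma_load_async(buf[cur].b_sub, B_ext, (size_t)n * w * sizeof(float));
    dma_wait();                                   /* prefetch block-column 0 */

    for (unsigned blk = 0; blk < num_blocks; blk++) {
        unsigned nxt = 1 - cur;
        if (blk + 1 < num_blocks)                 /* start loading the next block-column */
            dma_load_async(buf[nxt].b_sub, B_ext + (size_t)(blk + 1) * n * w,
                           (size_t)n * w * sizeof(float));

        submatrix_mul(A, buf[cur].b_sub, buf[cur].c_sub, n, w); /* compute on current half */

        dma_store_async(C_ext + (size_t)blk * n * w,            /* drain this result */
                        buf[cur].c_sub, (size_t)n * w * sizeof(float));
        dma_wait();                               /* sync DMA before swapping buffers */
        cur = nxt;
    }
}
```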
In summary, the partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units realized by the present invention can determine the optimal submatrix block size blocksize according to the architectural features of the vector processor. The double-buffered ping-pong scheme used to perform the submatrix multiplications effectively overlaps the data-movement time with the computation time and reduces the total computation time. The scalar processing unit SPU of the vector processor reads the row data of the multiplicand submatrix and extends each element into a vector, and the vector processing unit VPU multiply-accumulates it with each element of the row data of the multiplier submatrix read row by row, which avoids column-wise data accesses and the reduction summation of vector data. These advantages make the method of the present invention simple to implement and easy to operate; it can fully exploit the instruction-, data- and task-level parallelism of the vector processor and raise the efficiency of the processor above 90%, thereby giving full play to the high-performance computing capability of a vector processor with multiple MAC units.
The above are only preferred embodiments of the present invention, and the scope of protection of the present invention is not limited to the above embodiments; all technical solutions falling under the concept of the present invention belong to the scope of protection of the present invention. It should be noted that several improvements and modifications made by those skilled in the art without departing from the principles of the present invention should also be regarded as falling within the scope of protection of the present invention.

Claims (3)

1. A partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units, characterized in that the flow is:
(1) according to the number p of vector processing elements (VPEs) of the vector processor, the number m of MAC units in each VPE, the capacity s of the vector memory, and the data size d of a matrix element, determine the optimal submatrix block size blocksize, that is, determine the number of columns and rows of the submatrix of the multiplier matrix B and the number of rows and columns of the submatrix of the multiplicand matrix A;
(2) divide the vector memory of capacity s into two storage areas of equal capacity, Buffer0 and Buffer1, and perform the submatrix multiplications in a ping-pong manner alternating between Buffer0 and Buffer1 until the whole matrix multiplication is completed;
in step (1), the number of columns of the submatrix of the multiplier matrix B is p*m and the number of rows is (s/2/2)/(p*m*d); after the submatrix block size of the multiplier matrix B is determined, the submatrix block size of the multiplicand matrix A is determined; the number of rows and the number of columns of the submatrix of the multiplicand matrix A both equal the number of rows of the submatrix of the multiplier matrix B, i.e. (s/2/2)/(p*m*d).
2. The partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units according to claim 1, characterized in that the scalar processing unit SPU of the vector processor reads each element of every row of the multiplicand submatrix A in turn and extends it into a vector; the vector processing unit VPU reads the row B0 data of the multiplier submatrix B and multiply-accumulates it with each element of the aforementioned vector; when the row A0 data of the multiplicand submatrix has been traversed, the row C0 data of the result matrix C has been computed; when all rows of the multiplicand submatrix A have been traversed, the computation of the result matrix C is complete.
3. The partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units according to claim 1 or 2, characterized in that in step (2) the storage area Buffer0 stores the submatrix of the multiplier matrix B and the output result matrix C used in the current submatrix multiplication, while a DMA controller transfers the submatrix data of the multiplier matrix B required for the next submatrix multiplication into the storage area Buffer1 and moves the result matrix data of the previous submatrix multiplication to external memory.
CN201310166411.3A 2013-05-08 2013-05-08 Partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units Active CN103294648B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310166411.3A CN103294648B (en) 2013-05-08 2013-05-08 Partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310166411.3A CN103294648B (en) 2013-05-08 2013-05-08 Partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units

Publications (2)

Publication Number Publication Date
CN103294648A CN103294648A (en) 2013-09-11
CN103294648B true CN103294648B (en) 2016-06-01

Family

ID=49095548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310166411.3A Active CN103294648B (en) 2013-05-08 2013-05-08 Partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units

Country Status (1)

Country Link
CN (1) CN103294648B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11797302B2 (en) 2017-05-08 2023-10-24 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11816481B2 (en) 2017-05-08 2023-11-14 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461465A (en) * 2014-12-29 2015-03-25 南京大学 High-efficiency controller based on ping-pong operation and method thereof
CN104899182B (en) * 2015-06-09 2017-10-31 中国人民解放军国防科学技术大学 A kind of Matrix Multiplication accelerated method for supporting variable partitioned blocks
CN105426344A (en) * 2015-11-09 2016-03-23 南京大学 Matrix calculation method of distributed large-scale matrix multiplication based on Spark
CN106445471B (en) * 2016-10-13 2018-06-01 北京百度网讯科技有限公司 Processor and the method for performing matrix multiplication on a processor
CN106411519B (en) 2016-11-01 2019-01-25 北京百度网讯科技有限公司 For the processor of RSA decryption and for the control method of RSA decryption processor
JP6912703B2 (en) * 2017-02-24 2021-08-04 富士通株式会社 Arithmetic method, arithmetic unit, arithmetic program and arithmetic system
CN109086075B (en) 2017-10-30 2021-06-08 上海寒武纪信息科技有限公司 Artificial intelligence processor and method for executing matrix multiplication vector instruction by using same
KR102065672B1 (en) * 2018-03-27 2020-01-13 에스케이텔레콤 주식회사 Apparatus and method for convolution operation
CN110415157B (en) * 2018-04-26 2024-01-30 华为技术有限公司 Matrix multiplication calculation method and device
CN108805273A (en) * 2018-05-20 2018-11-13 复旦大学 Door control unit accelerates the hardware circuit implementation of operation in a kind of LSTM
CN108985450B (en) * 2018-06-28 2019-10-29 中国人民解放军国防科技大学 Vector processor-oriented convolution neural network operation vectorization method
CN111045958B (en) * 2018-10-11 2022-09-16 展讯通信(上海)有限公司 Acceleration engine and processor
CN110263296B (en) * 2019-05-18 2020-12-04 南京惟心光电系统有限公司 Matrix vector multiplier based on photoelectric calculation array and operation method thereof
US11010202B2 (en) * 2019-08-06 2021-05-18 Facebook, Inc. Distributed physical processing of matrix sum operation
CN112446007A (en) * 2019-08-29 2021-03-05 上海华为技术有限公司 Matrix operation method, operation device and processor
CN111737292B (en) * 2020-07-16 2021-01-05 腾讯科技(深圳)有限公司 Data retrieval method and related device
US11556337B2 (en) 2021-04-12 2023-01-17 Analog Devices International Unlimited Company Parallel matrix multiplication technique optimized for memory fetches
CN114489496A (en) * 2022-01-14 2022-05-13 南京邮电大学 Data storage and transmission method based on FPGA artificial intelligence accelerator

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7844630B2 (en) * 2007-09-01 2010-11-30 International Business Machines Corporation Method and structure for fast in-place transformation of standard full and packed matrix data formats
CN102214160A (en) * 2011-07-08 2011-10-12 中国科学技术大学 Single-accuracy matrix multiplication optimization method based on loongson chip 3A
CN102446160A (en) * 2011-09-06 2012-05-09 中国人民解放军国防科学技术大学 Dual-precision SIMD (Single Instruction Multiple Data) component-oriented matrix multiplication implementation method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7844630B2 (en) * 2007-09-01 2010-11-30 International Business Machines Corporation Method and structure for fast in-place transformation of standard full and packed matrix data formats
CN102214160A (en) * 2011-07-08 2011-10-12 中国科学技术大学 Single-accuracy matrix multiplication optimization method based on loongson chip 3A
CN102446160A (en) * 2011-09-06 2012-05-09 中国人民解放军国防科学技术大学 Dual-precision SIMD (Single Instruction Multiple Data) component-oriented matrix multiplication implementation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Analysis of distributed parallel matrix multiplication algorithms; Chen Jing, et al.; Measurement & Control Technology; 2005-12-31; Vol. 24, No. 5; pp. 52-54 *
Research and implementation of blocked algorithms for matrix triangular factorization; Ji Kun, et al.; Computer Applications and Software; 2010-09-30; Vol. 27, No. 9; pp. 72-74 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11797302B2 (en) 2017-05-08 2023-10-24 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11797301B2 (en) 2017-05-08 2023-10-24 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11797303B2 (en) 2017-05-08 2023-10-24 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11816481B2 (en) 2017-05-08 2023-11-14 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11816482B2 (en) 2017-05-08 2023-11-14 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations

Also Published As

Publication number Publication date
CN103294648A (en) 2013-09-11

Similar Documents

Publication Publication Date Title
CN103294648B (en) Partitioned matrix multiplication vectorization method supporting a vector processor with multiple MAC units
KR102492477B1 (en) Matrix multiplier
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
WO2019205617A1 (en) Calculation method and apparatus for matrix multiplication
US8862653B2 (en) System and method for sparse matrix vector multiplication processing
CN103440121B (en) A triangular matrix multiplication vectorization method for vector processors
US8984043B2 (en) Multiplying and adding matrices
CN103049241B (en) A kind of method improving CPU+GPU isomery device calculated performance
Nagar et al. A sparse matrix personality for the convey hc-1
CN105912501B (en) A kind of SM4-128 Encryption Algorithm realization method and systems based on extensive coarseness reconfigurable processor
CN105335331B (en) A kind of SHA256 realization method and systems based on extensive coarseness reconfigurable processor
CN102279818B (en) Vector data access and storage control method supporting limited sharing and vector memory
CN102402415B (en) Device and method for buffering data in dynamic reconfigurable array
Li et al. VBSF: a new storage format for SIMD sparse matrix–vector multiplication on modern processors
US20240119114A1 (en) Matrix Multiplier and Matrix Multiplier Control Method
CN104615584A (en) Method for vectorization computing of solution of large-scale trigonometric linear system of equations for GPDSP
CN111859277B (en) Sparse matrix vector multiplication vectorization implementation method
CN110598844A (en) Parallel convolution neural network accelerator based on FPGA and acceleration method
EP3762831A1 (en) A machine perception and dense algorithm integrated circuit
CN101561797A (en) Method and device for singular value and feature value composition of matrix on processing system
CN104615516A (en) Method for achieving large-scale high-performance Linpack testing benchmark for GPDSP
KR20220094180A (en) Systems, methods, and devices for acceleration of merge join operations
Sun et al. Optimizing sparse matrix-vector multiplication on GPUs via index compression
CN111382855B (en) Data processing device, method, chip and electronic equipment
CN107924331A (en) Technology for flexible remote measurement related to dynamic frequency

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant