CN102411558A - Vector processor oriented large matrix multiplied vectorization realizing method - Google Patents

Vector processor oriented large matrix multiplied vectorization realizing method

Info

Publication number
CN102411558A
CN102411558A (application CN2011103381088A / CN201110338108A; granted as CN102411558B)
Authority
CN
China
Prior art keywords
matrix
multiplier
multiplicand
vector
parallel processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011103381088A
Other languages
Chinese (zh)
Other versions
CN102411558B (en)
Inventor
刘仲
陈书明
陈跃跃
曾咏涛
刘衡竹
陈海燕
龚国辉
彭元喜
陈胜刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201110338108.8A priority Critical patent/CN102411558B/en
Publication of CN102411558A publication Critical patent/CN102411558A/en
Application granted granted Critical
Publication of CN102411558B publication Critical patent/CN102411558B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a vectorization implementation method for large-matrix multiplication oriented to vector processors, comprising the following steps: (1) input a multiplicand matrix A and a multiplier matrix B; transfer A and B into the vector memory unit via a DMA (direct memory access) controller; during the transfer, reorder rows 1 to n of the multiplier matrix B into columns 1 to n; (2) load the elements of one row of A and of one column of B into the K parallel processing elements and multiply them element-wise; reduction-sum the products into one designated processing element; store the sum into the vector memory unit as one element of the result matrix; (3) advance to the next row of A and the next column of B and repeat step (2) until all data frames have been computed, obtaining the result matrix C composed of the result-matrix elements. The vectorization implementation method disclosed by the invention is simple in principle, convenient to operate, and improves computational efficiency.

Description

Vectorization Implementation Method for Large-Matrix Multiplication Oriented to Vector Processors
Technical Field
The present invention relates generally to vector processors and the data-processing field, and in particular to a vectorization implementation method for large-matrix multiplication.
Background Art
Matrix multiplication is involved in many scientific computing tasks and applications, such as image processing and the signal codecs in communication systems. Larger matrix-multiplication tasks involve a great number of multiplications and additions and therefore consume a great deal of computing time. How to implement matrix multiplication simply and efficiently on a processor has long been a research focus in industry.
On traditional scalar processors, researchers have proposed a variety of effective matrix-multiplication implementations that reduce the impact of data-reordering operations on the overall computation. However, as compute-intensive real-time applications such as HD video encoding/decoding, 3G wireless communication and radar signal processing keep emerging, a single scalar core can hardly meet their real-time demands, and vector processors have come into widespread use. Fig. 1 shows the typical structure of a vector processor. It comprises a processor together with a program memory and a data memory (both may be any addressable memory, including external cache, external RAM, etc.). The processor is divided into a scalar processing unit and a vector processing unit. The vector processing unit usually contains K parallel processing elements (PEs), each with its own arithmetic units and registers; the PEs can exchange data through reduction instructions, e.g. addition or comparison across the parallel PEs. The scalar unit is mainly responsible for flow control and logic decisions, while the vector unit is mainly responsible for the dense data computation, whose operands are supplied by the vector data memory. Usually, as shown in Fig. 2, the number of BANKs (memory banks) of the vector data memory equals the number K of processing elements of the vector unit.
The patent document with application number 200380107095.7 discloses a method proposed by Intel for efficient small-matrix multiplication using SIMD registers: the diagonals of the multiplicand matrix A are written into different registers of the processor, and the multiplier matrix B is written in order into at least one vertically arranged register. By shifting one element at a time, each row of B held in the registers is selectively multiplied and accumulated with the correspondingly shifted elements, the last element of a shifted row being wrapped around to the front of that row. Each diagonal of A is multiplied by the rows of B, and the products are accumulated into the corresponding row of the result matrix C. This method achieves fairly good results when the matrix is small, but as the matrix size grows it becomes difficult to sustain that performance. How to implement efficient large-matrix multiplication on a vector processor is therefore a difficulty currently faced.
Summary of the Invention
The technical problem to be solved by the present invention is: in view of the problems of the prior art, to provide a vectorization implementation method for large-matrix multiplication oriented to vector processors that is simple in principle, convenient to operate, able to make full use of the multi-level parallelism of a vector processor, and easy to implement.
To solve the above technical problem, the present invention adopts the following technical scheme:
A vectorization implementation method for large-matrix multiplication oriented to vector processors, comprising the following steps:
(1) Input the multiplicand matrix A and the multiplier matrix B; transfer A and B into the vector memory unit via a DMA controller; during the transfer, reorder the multiplier matrix B so that rows 1 to n of B become columns 1 to n;
(2) Load the elements of one row of A and of one column of B into the K parallel processing elements and multiply them element-wise; reduction-sum the products into one designated processing element; store the sum into the vector memory unit as one element of the result matrix;
(3) Advance to the next row of A and the next column of B and repeat step (2) until all data frames have been computed, obtaining the result matrix C composed of the result-matrix elements.
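As an illustrative model of the three steps above (not the patented vector instruction sequence: plain Python lists stand in for the vector memory, the K-element slices for the parallel processing elements, and `sum(lanes)` for the reduction instruction), the whole method can be sketched as:

```python
def large_matrix_multiply(A, B, K=8):
    """Sketch of the method: rows of B are reordered into columns during
    the "DMA" step, frames are zero-padded to a multiple of K, and each
    C[i][j] is built from K-wide multiply + reduction-sum passes."""
    m, n, p = len(A), len(B), len(B[0])
    pad = (-n) % K
    # Step (1): transpose B (rows 1..n become columns 1..n) and zero-pad
    # every data frame so its length is a multiple of the lane count K.
    a_frames = [list(row) + [0] * pad for row in A]
    b_frames = [[B[k][j] for k in range(n)] + [0] * pad for j in range(p)]
    C = [[0] * p for _ in range(m)]
    for i in range(m):                 # step (3): next row of A...
        for j in range(p):             # ...and next column of B
            acc = 0
            for k in range(0, n + pad, K):
                # Step (2): K lane-wise products, then reduction-sum
                lanes = [a_frames[i][k + t] * b_frames[j][k + t] for t in range(K)]
                acc += sum(lanes)      # reduction into one designated PE
            C[i][j] = acc              # one result-matrix element
    return C
```

The inner `K`-wide loop models one load/multiply pass across the parallel PEs; because padded positions hold zeros, they do not disturb the dot products.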
As a further improvement of the present invention:
During the transfer, each row of the multiplicand matrix A is organized into a data frame and each column of the multiplier matrix B is organized into a data frame; when the element count of a data frame is not a multiple of the number K of parallel processing elements in the vector processor, zeros are appended at the tail of the frame so that every frame's element count becomes a multiple of K.
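The zero-padding rule can be sketched as follows (`pad_frame` is a hypothetical helper name, not from the patent; the arithmetic simply brings the frame length up to the next multiple of K):

```python
def pad_frame(frame, K):
    """Append zeros so the frame's element count becomes a multiple of K.
    (Hypothetical helper for illustration; zeros contribute nothing to the
    later dot products, so the result matrix is unchanged.)"""
    remainder = (-len(frame)) % K   # zeros needed to reach a multiple of K
    return list(frame) + [0] * remainder
```

For example, a 22-element frame with K = 8 is padded to 24 elements, which then splits evenly into three 8-wide passes.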
Compared with the prior art, the present invention has the following advantages:
The vectorization implementation method of the present invention realizes the reordering of the multiplier matrix B while the DMA controller is transferring the data, and at the same time fully exploits the ability of the multiple parallel processing elements of the vector unit to perform the same operation simultaneously on large amounts of data of the same type. It thereby greatly improves the efficiency of matrix multiplication, and its steps are simple and easy to implement.
Description of the Drawings
Fig. 1 is a schematic diagram of a typical vector processor structure.
Fig. 2 is a schematic diagram of the vector data memory unit in the vector processor of Fig. 1.
Fig. 3 is a schematic diagram of the main flow of the present invention.
Fig. 4 is a schematic diagram of reordering the elements of the multiplier matrix B with the DMA controller in Embodiment 1.
Fig. 5 shows the storage format of the elements of the multiplicand matrix A and the multiplier matrix B in the vector data memory of Fig. 2 in Embodiment 2; Fig. 5(1) shows the format for A, and Fig. 5(2) the format for B.
Fig. 6 shows how the multiplicand matrix A (16 × 16) and the multiplier matrix B (16 × 16) of Embodiment 2 are loaded into the K parallel processing elements.
Fig. 7 shows the execution steps of the matrix multiplication of A (16 × 16) and B (16 × 16) in Embodiment 2.
Fig. 8 shows the storage format of the elements of A and B in the vector data memory of Fig. 2 in Embodiment 3; Fig. 8(1) shows the format for A, and Fig. 8(2) the format for B.
Fig. 9 shows how the multiplicand matrix A (26 × 22) and the multiplier matrix B (22 × 27) of Embodiment 3 are loaded into the K parallel processing elements.
Fig. 10 shows the execution steps of the matrix multiplication of A (26 × 22) and B (22 × 27) in Embodiment 3.
Embodiments
The present invention is further described below with reference to the accompanying drawings and specific embodiments.
Embodiment 1:
As shown in Fig. 3, the vectorization implementation method for large-matrix multiplication of the present invention comprises the following steps:
1. Input the multiplicand matrix A and the multiplier matrix B; transfer A and B into the vector memory unit via the DMA controller; as shown in Fig. 4, reorder B during the transfer so that rows 1 to n of B become columns 1 to n.
Through configuration of the DMA controller, each row of A can be organized into a data frame and each column of B into a data frame, so the whole multiplier matrix B is divided into p data frames in total. When the element count of a data frame is not a multiple of the number K of parallel processing elements in the vector processor, zeros are appended at the tail of the frame so that every frame's element count becomes a multiple of K.
2. Load the elements of one row frame of A and of one column frame of B into the K parallel processing elements and multiply them element-wise; reduction-sum the products into one designated processing element; store the sum into the vector memory unit as one result-matrix element.
3. Advance to the next row of A and the next column of B and repeat step 2 until all data frames have been computed, obtaining the result matrix C composed of the result-matrix elements.
Multiplying the m × n multiplicand matrix A by the n × p multiplier matrix B yields an m × p matrix C. Mathematically, C_ij = Σ_{k=0}^{n-1} A_ik · B_kj (0 ≤ i < m, 0 ≤ j < p): each element C_ij of the result matrix C is obtained as the dot product of the corresponding row elements A_ik of the multiplicand matrix A with the corresponding column elements B_kj of the multiplier matrix B.
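The dot-product formula above can be checked with a minimal sketch (plain Python, illustrative only):

```python
def c_element(A, B, i, j):
    # C[i][j] = sum over k of A[i][k] * B[k][j]   (0 <= i < m, 0 <= j < p)
    return sum(A[i][k] * B[k][j] for k in range(len(B)))
```

For A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], the element C[0][0] is 1·5 + 2·7 = 19.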
Embodiment 2:
As shown in Fig. 7, multiplying a 16 × 16 matrix by a 16 × 16 matrix (with K = 8 vector processing elements) using the vectorization implementation method of the present invention comprises the following steps:
1. As shown in Fig. 6, input the multiplicand matrix A (16 × 16) and the multiplier matrix B (16 × 16); transfer A and B to the vector memory unit via DMA, reordering B in the process (the reordering method is the same as in Embodiment 1); the storage layouts of A and B in the vector memory unit are shown in Fig. 5(1) and Fig. 5(2).
2. Load the elements of one row of A and of one column of B into the processing elements. Since A and B are both 16 × 16, each row and column must be loaded in two batches; with 8 processing elements and 16 multiplicand and multiplier elements, the element-wise multiplication of this step is performed twice.
3. Use the reduction instruction to add up the products computed by the processing elements in step 2, reducing the result into the designated processing element X; this step is likewise performed twice.
4. Add the two partial sums obtained above in unit X to obtain one result-matrix element and store it into the vector memory unit.
5. Advance to the next row of A and the next column of B, and repeat steps 2 to 4 of the above procedure to compute the whole result matrix C = A × B.
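A sketch of the two-pass computation of a single result element in this 16 × 16, K = 8 case (illustrative Python; the operand values are arbitrary stand-ins, and ordinary list indexing stands in for the DMA-reordered column frames):

```python
N, K = 16, 8
# Arbitrary 16x16 operand values, for illustration only.
A = [[(i * N + j) % 7 for j in range(N)] for i in range(N)]
B = [[(i * N + j) % 5 for j in range(N)] for i in range(N)]

def c_element(i, j):
    # Steps 2-3: two load/multiply/reduce passes (16 elements / 8 lanes).
    partials = []
    for half in range(2):
        lo = half * K
        lanes = [A[i][lo + t] * B[lo + t][j] for t in range(K)]  # 8 products
        partials.append(sum(lanes))       # reduction into designated PE X
    # Step 4: add the two partial sums in PE X.
    return partials[0] + partials[1]

# Step 5: sweep all rows of A and columns of B.
C = [[c_element(i, j) for j in range(N)] for i in range(N)]
```

Splitting the 16-element dot product into two 8-wide halves changes only the order of additions, so C equals the ordinary product A × B.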
Embodiment 3:
As shown in Fig. 10, multiplying a 26 × 22 matrix by a 22 × 27 matrix (with K = 8 vector processing elements) using the vectorization implementation method of the present invention comprises the following steps:
1. As shown in Fig. 9, transfer the multiplicand matrix A and the multiplier matrix B to the vector memory unit via DMA, reordering B in the process (the reordering method is the same as in Embodiment 1) and also zero-padding A and B; the storage layouts of A and B in the vector memory unit are shown in Fig. 8(1) and Fig. 8(2).
2. Load the elements of one row of A and of one column of B into the processing elements. Since the rows of A and the columns of B have been zero-padded, each must be loaded in three batches; with 8 processing elements and 24 multiplicand and multiplier elements after padding, the element-wise multiplication of this step is performed three times.
3. Use the reduction instruction to add up the products computed by the processing elements in step 2, reducing the result into the designated processing element X; this step is likewise performed three times.
4. Add the three partial sums obtained above in unit X to obtain one result-matrix element and store it into the vector memory unit.
5. Advance to the next row of A and the next column of B, and repeat steps 2 to 4 of the above procedure to compute the whole result matrix C = A × B.
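The padded three-pass variant of Embodiment 3 can be sketched likewise (illustrative Python with arbitrary operand values; padding A's row frames and B's column frames from 22 to 24 elements gives exactly three K = 8 passes):

```python
K = 8
m, n, p = 26, 22, 27
# Arbitrary operand values, for illustration only.
A = [[(i + j) % 9 for j in range(n)] for i in range(m)]
B = [[(i * j + 1) % 7 for j in range(p)] for i in range(n)]

pad = (-n) % K                          # 22 -> 2 zeros, padded length 24
Ap = [row + [0] * pad for row in A]     # pad each row frame of A
Bp = B + [[0] * p for _ in range(pad)]  # pad each column frame of B

C = [[0] * p for _ in range(m)]
for i in range(m):
    for j in range(p):
        acc = 0
        for k in range(0, n + pad, K):  # (22 + 2) / 8 = 3 passes
            lanes = [Ap[i][k + t] * Bp[k + t][j] for t in range(K)]
            acc += sum(lanes)           # reduction into PE X each pass
        C[i][j] = acc                   # step 4: accumulated element
```

The appended zeros multiply against each other or against real data positions that are likewise zero-extended, so they add nothing to any C[i][j].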
Following the above steps, simple and efficient code implementing large-matrix multiplication can be written for the structure and instruction set of the specific vector processor. The method of the present invention is easy for programmers to understand, which helps them realize it in code.
The above are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical schemes under the idea of the present invention belong to the protection scope of the present invention. It should be pointed out that, for those skilled in the art, improvements and modifications made without departing from the principle of the present invention shall also be regarded as falling within the protection scope of the present invention.

Claims (2)

1. A vectorization implementation method for large-matrix multiplication oriented to vector processors, characterized by comprising the following steps:
(1) inputting the multiplicand matrix A and the multiplier matrix B; transferring A and B into the vector memory unit via a DMA controller; during the transfer, reordering the multiplier matrix B so that rows 1 to n of B become columns 1 to n;
(2) loading the elements of one row of A and of one column of B into the K parallel processing elements and multiplying them element-wise; reduction-summing the products into one designated processing element; storing the sum into the vector memory unit as one element of the result matrix;
(3) advancing to the next row of A and the next column of B and repeating step (2) until all data frames have been computed, obtaining the result matrix C composed of the result-matrix elements.
2. The vectorization implementation method for large-matrix multiplication oriented to vector processors according to claim 1, characterized in that, during the transfer, each row of the multiplicand matrix A is organized into a data frame and each column of the multiplier matrix B is organized into a data frame; when the element count of a data frame is not a multiple of the number K of parallel processing elements in the vector processor, zeros are appended at the tail of the frame so that every frame's element count becomes a multiple of K.
CN201110338108.8A 2011-10-31 2011-10-31 Vector processor oriented large matrix multiplied vectorization realizing method Active CN102411558B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110338108.8A CN102411558B (en) 2011-10-31 2011-10-31 Vector processor oriented large matrix multiplied vectorization realizing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110338108.8A CN102411558B (en) 2011-10-31 2011-10-31 Vector processor oriented large matrix multiplied vectorization realizing method

Publications (2)

Publication Number Publication Date
CN102411558A true CN102411558A (en) 2012-04-11
CN102411558B CN102411558B (en) 2015-05-13

Family

ID=45913637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110338108.8A Active CN102411558B (en) 2011-10-31 2011-10-31 Vector processor oriented large matrix multiplied vectorization realizing method

Country Status (1)

Country Link
CN (1) CN102411558B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461449A (en) * 2014-11-14 2015-03-25 中国科学院数据与通信保护研究教育中心 Large integer multiplication realizing method and device based on vector instructions
CN104636316A (en) * 2015-02-06 2015-05-20 中国人民解放军国防科学技术大学 GPDSP-oriented large-scale matrix multiplication calculation method
CN104899182A (en) * 2015-06-09 2015-09-09 中国人民解放军国防科学技术大学 Matrix multiplication acceleration method for supporting variable blocks
CN106445471A (en) * 2016-10-13 2017-02-22 北京百度网讯科技有限公司 Processor and method for executing matrix multiplication on processor
CN106959937A (en) * 2017-03-30 2017-07-18 中国人民解放军国防科学技术大学 A kind of vectorization implementation method of warp product matrix towards GPDSP
CN106970896A (en) * 2017-03-30 2017-07-21 中国人民解放军国防科学技术大学 The vectorization implementation method of the two-dimensional matrix convolution of vector processor-oriented
CN107256203A (en) * 2017-06-28 2017-10-17 郑州云海信息技术有限公司 The implementation method and device of a kind of matrix-vector multiplication
CN107977231A (en) * 2017-12-15 2018-05-01 北京中科寒武纪科技有限公司 A kind of computational methods and Related product
CN108509384A (en) * 2017-02-24 2018-09-07 富士通株式会社 Computational methods, information processing unit, calculation procedure and information processing system
US10338919B2 (en) 2017-05-08 2019-07-02 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US10454680B2 (en) 2016-11-01 2019-10-22 Beijing Baidu Netcom Science And Technology Co., Ltd. RSA decryption processor and method for controlling RSA decryption processor
CN110377877A (en) * 2019-07-26 2019-10-25 苏州浪潮智能科技有限公司 A kind of data processing method, device, equipment and storage medium
CN110494846A (en) * 2017-03-20 2019-11-22 英特尔公司 System, method and apparatus for addition of matrices, subtraction and multiplication
CN111465924A (en) * 2017-12-12 2020-07-28 特斯拉公司 System and method for converting matrix input to vectorized input for a matrix processor
CN112433760A (en) * 2020-11-27 2021-03-02 海光信息技术股份有限公司 Data sorting method and data sorting circuit
CN113536220A (en) * 2020-04-21 2021-10-22 中科寒武纪科技股份有限公司 Operation method, processor and related product
US11816482B2 (en) 2017-05-08 2023-11-14 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165733A (en) * 2018-07-11 2019-01-08 中国人民解放军国防科技大学 Multi-input multi-output matrix maximum pooling vectorization implementation method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1155117A (en) * 1996-01-19 1997-07-23 张胤微 High-speed multiplication device
CN1394314A (en) * 2000-11-02 2003-01-29 索尼计算机娱乐公司 Parallel operation device, entertainment device, operating method, computer program, and semiconductor device
CN101061474A (en) * 2004-06-10 2007-10-24 哈桑·塞希托格鲁 Matrix-valued methods and apparatus for signal processing
CN101089840A (en) * 2007-07-12 2007-12-19 浙江大学 Matrix multiplication parallel computing system based on multi-FPGA
CN101163240A (en) * 2006-10-13 2008-04-16 国际商业机器公司 Filter arrangement and method thereof
EP2017743A2 (en) * 2007-07-19 2009-01-21 Itt Manufacturing Enterprises, Inc. High speed and efficient matrix multiplication hardware module

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104461449A (en) * 2014-11-14 2015-03-25 中国科学院数据与通信保护研究教育中心 Large integer multiplication realizing method and device based on vector instructions
CN104461449B (en) * 2014-11-14 2018-02-27 中国科学院数据与通信保护研究教育中心 Large integer multiplication implementation method and device based on vector instruction
CN104636316A (en) * 2015-02-06 2015-05-20 中国人民解放军国防科学技术大学 GPDSP-oriented large-scale matrix multiplication calculation method
CN104899182B (en) * 2015-06-09 2017-10-31 中国人民解放军国防科学技术大学 A kind of Matrix Multiplication accelerated method for supporting variable partitioned blocks
CN104899182A (en) * 2015-06-09 2015-09-09 中国人民解放军国防科学技术大学 Matrix multiplication acceleration method for supporting variable blocks
CN106445471A (en) * 2016-10-13 2017-02-22 北京百度网讯科技有限公司 Processor and method for executing matrix multiplication on processor
US10140251B2 (en) 2016-10-13 2018-11-27 Beijing Baidu Netcom Science And Technology Co., Ltd. Processor and method for executing matrix multiplication operation on processor
US10454680B2 (en) 2016-11-01 2019-10-22 Beijing Baidu Netcom Science And Technology Co., Ltd. RSA decryption processor and method for controlling RSA decryption processor
CN108509384B (en) * 2017-02-24 2022-04-12 富士通株式会社 Calculation method, information processing apparatus, calculation program, and information processing system
CN108509384A (en) * 2017-02-24 2018-09-07 富士通株式会社 Computational methods, information processing unit, calculation procedure and information processing system
CN110494846A (en) * 2017-03-20 2019-11-22 英特尔公司 System, method and apparatus for addition of matrices, subtraction and multiplication
CN106970896A (en) * 2017-03-30 2017-07-21 中国人民解放军国防科学技术大学 The vectorization implementation method of the two-dimensional matrix convolution of vector processor-oriented
CN106959937B (en) * 2017-03-30 2019-03-29 中国人民解放军国防科学技术大学 A kind of vectorization implementation method of the warp product matrix towards GPDSP
CN106970896B (en) * 2017-03-30 2020-05-12 中国人民解放军国防科学技术大学 Vector processor-oriented vectorization implementation method for two-dimensional matrix convolution
CN106959937A (en) * 2017-03-30 2017-07-18 中国人民解放军国防科学技术大学 A kind of vectorization implementation method of warp product matrix towards GPDSP
US10338919B2 (en) 2017-05-08 2019-07-02 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11797302B2 (en) 2017-05-08 2023-10-24 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US10884734B2 (en) 2017-05-08 2021-01-05 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11816481B2 (en) 2017-05-08 2023-11-14 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11816482B2 (en) 2017-05-08 2023-11-14 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11797303B2 (en) 2017-05-08 2023-10-24 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
US11797301B2 (en) 2017-05-08 2023-10-24 Nvidia Corporation Generalized acceleration of matrix multiply accumulate operations
CN107256203A (en) * 2017-06-28 2017-10-17 郑州云海信息技术有限公司 The implementation method and device of a kind of matrix-vector multiplication
CN111465924A (en) * 2017-12-12 2020-07-28 特斯拉公司 System and method for converting matrix input to vectorized input for a matrix processor
CN111465924B (en) * 2017-12-12 2023-11-17 特斯拉公司 System and method for converting matrix input into vectorized input for matrix processor
CN107977231B (en) * 2017-12-15 2020-10-27 安徽寒武纪信息科技有限公司 Calculation method and related product
CN107977231A (en) * 2017-12-15 2018-05-01 北京中科寒武纪科技有限公司 A kind of computational methods and Related product
CN110377877B (en) * 2019-07-26 2022-12-23 苏州浪潮智能科技有限公司 Data processing method, device, equipment and storage medium
CN110377877A (en) * 2019-07-26 2019-10-25 苏州浪潮智能科技有限公司 A kind of data processing method, device, equipment and storage medium
CN113536220A (en) * 2020-04-21 2021-10-22 中科寒武纪科技股份有限公司 Operation method, processor and related product
CN112433760A (en) * 2020-11-27 2021-03-02 海光信息技术股份有限公司 Data sorting method and data sorting circuit

Also Published As

Publication number Publication date
CN102411558B (en) 2015-05-13

Similar Documents

Publication Publication Date Title
CN102411558B (en) Vector processor oriented large matrix multiplied vectorization realizing method
Dou et al. 64-bit floating-point FPGA matrix multiplication
CN1230735C (en) Processing multiply-accumulate operations in single cycle
EP3659051B1 (en) Accelerated mathematical engine
CN102141976B (en) Method for storing diagonal data of sparse matrix and SpMV (Sparse Matrix Vector) realization method based on method
CN103294648B (en) Support the partitioned matrix multiplication vectorization method of many MAC operation parts vector treatment device
CN111937009A (en) Systolic convolutional neural network
CN100405361C (en) Method and system for performing calculation operations and a device
CN110188869B (en) Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm
CN103440121A (en) Triangular matrix multiplication vectorization method of vector processor
CN107341133B (en) Scheduling method of reconfigurable computing structure based on LU decomposition of arbitrary dimension matrix
CN103902507A (en) Matrix multiplication calculating device and matrix multiplication calculating method both oriented to programmable algebra processor
CN105373517A (en) Spark-based distributed matrix inversion parallel operation method
WO2020196407A1 (en) Convolutional computation device
Yzelman et al. A cache-oblivious sparse matrix–vector multiplication scheme based on the Hilbert curve
CN107667345A (en) Packing data alignment plus computations, processor, method and system
CN103389967B (en) The device and method of a kind of matrix transposition based on SRAM
CN102402415A (en) Device and method for buffering data in dynamic reconfigurable array
CN102360281A (en) Multifunctional fixed-point media access control (MAC) operation device for microprocessor
Haidar et al. Out of memory SVD solver for big data
CN102411773B (en) Vector-processor-oriented mean-residual normalized product correlation vectoring method
EP3842954A1 (en) System and method for configurable systolic array with partial read/write
CN104615516A (en) Method for achieving large-scale high-performance Linpack testing benchmark for GPDSP
CN102231624B (en) Vector processor-oriented floating point complex number block finite impulse response (FIR) vectorization realization method
CN115576606A (en) Method for realizing matrix transposition multiplication, coprocessor, server and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant