CN105589677A

CN105589677A - Systolic structure matrix multiplier based on FPGA (Field Programmable Gate Array) and implementation method thereof

Info

Publication number: CN105589677A
Application number: CN201410653363.5A
Authority: CN
Inventors: 陶耀东; 周磊涛; 李锁; 齐济
Original assignee: Shenyang Gaojing Numerical Control Intelligent Technology Co Ltd
Current assignee: Shenyang Gaojing Numerical Control Intelligent Technology Co Ltd
Priority date: 2014-11-17
Filing date: 2014-11-17
Publication date: 2016-05-18

Abstract

The invention provides an FPGA (Field Programmable Gate Array)-based systolic-structure matrix multiplier and an implementation method thereof. The systolic-structure matrix multiplier comprises a multiplier array composed of M × R node units, a clock control module, a data input control module and a data output control module. The M × R node units are interconnected in a two-dimensional mesh array of M rows and R columns, perform multiply-accumulate calculations on the data, and form a systolic structure along each data direction (rows and columns). The clock control module provides the clocks, controls the whole calculation process, and records the current state of the multiplier. The data input control module controls the input of the input matrices by rows or by columns so that the input satisfies the time-alignment input rule. The data output control module outputs the calculation results by rows or by columns under the control of an output clock. With the implementation method provided by the invention, data enter the matrix multiplier by rows or by columns according to the input rule of the systolic structure, undergo multiply-add operations, and the calculation results are output. The FPGA-based systolic-structure matrix multiplier provided by the invention offers very high calculation performance, good modularity and convenient reconfiguration.

Description

An FPGA-based systolic-structure matrix multiplier and an implementation method thereof

Technical field

The present invention relates to the fields of FPGA technology and high-performance computing, and specifically to an FPGA-based systolic-structure matrix multiplier and an implementation method thereof.

Background technology

Matrix multiplication is a basic operation widely used in scientific computing, digital signal processing and other fields, and its computational performance directly affects the overall performance of the system. As data volumes grow and precision requirements rise, matrix multiplication is gradually becoming the bottleneck of system computing performance.

Matrix multiplication has conventionally been implemented either in software on a general-purpose processor or on a dedicated digital signal processor (Digital Signal Processor, DSP). These approaches are technically mature, well supported by tools, and simple to program. Constrained by their internal structure and platform, however, they are unsuitable for applications with demanding real-time and reliability requirements, and for small embedded systems. General-purpose processors and DSPs are also limited by factors such as the device clock frequency and cannot reach very high computational performance.

With the continuous improvement of FPGA technology and process technology in recent years, FPGAs can be applied to complex, compute-intensive tasks. The programmability of FPGAs allows them to respond flexibly to different requirements and implement different logic functions, giving the system good scalability. Their inherent fine-grained parallelism and high performance-to-power ratio make FPGAs one of the best choices for accelerating embedded systems.

As shown in Fig. 1, a systolic structure is a linear structure that inherently provides pipelining. Linear structures of varying number running in different directions operate in parallel, which yields quite high data throughput and clock frequency. A systolic structure is simple, regular and highly modular, and only a small number of nodes perform I/O with the outside. This lets the system maintain a good processing speed while staying balanced against the external I/O bandwidth, which makes it very suitable for FPGA implementation.

In current FPGA implementations of parallel matrix multiplication, most designs broadcast the data of one row and one column to N nodes. As a result, the input module fan-out is excessive, routing delay increases, and there is no pipeline structure, so errors occur easily. The matrix multiplier as a whole struggles to reach high performance, is unreliable, and has difficulty meeting real-time computing requirements. In addition, data input control is complex, the results are difficult to read out after computation, and the design as a whole is not modular, so most of it must be modified when it is ported.

Summary of the invention

In view of the above deficiencies of the prior art, the technical problem to be solved by the present invention is to provide an FPGA-based systolic-structure matrix multiplier and an implementation method thereof that address the following problems of existing matrix multipliers: excessive fan-out of the input module, difficulty in reaching high overall performance, complex data input control, difficult readout of results after computation, and a lack of modularity that forces most of the design to be modified when it is ported.

To achieve the above object, the technical solution adopted by the present invention is an FPGA-based systolic-structure matrix multiplier and an implementation method thereof, the multiplier comprising: a multiplier array, a clock control module, a data input control module, and a data output control module;

The multiplier array is a two-dimensional mesh array of M rows and R columns, composed of M × R node units with adjacent node units interconnected, and is used to compute the product of input matrix A and matrix B;

The clock control module outputs two clocks, supplied respectively to the data input control module and the data output control module;

The data input control module inputs matrix A and matrix B in parallel under the control of the clock control module, where matrix A has M rows and N columns and matrix B has N rows and R columns;

The data output control module outputs matrix C, where matrix C has M rows and R columns.

The node unit consists of a multiplier, an adder, and a register for storing the calculation result. The multiplier receives the input data of the node unit and sends the product to the adder; the register stores the accumulated sum from the previous clock cycle; the adder adds the product from the multiplier to the accumulated sum of the previous clock cycle held in the register, and sends the result to the register to update its contents.

The multiplier inside the node unit uses a dedicated hard multiplier block inside the FPGA.

Under the control of the clock control module, the data input control module inputs matrix A by rows and matrix B by columns, until all data of matrix A and matrix B have passed through the full rows and full columns of all node units of the multiplier array.

The data output control module outputs by rows or by columns under the control of the clock control module.

An implementation method of the FPGA-based systolic-structure matrix multiplier comprises the following steps:

The data input control module inputs matrix A (M rows, N columns) and matrix B (N rows, R columns) in parallel into the multiplier array under the control of the clock control module;

The input data are multiplied and accumulated in each node unit of the multiplier array; under the control of the clock control module, all node units carry out their calculations until no more data is input;

The data output control module outputs matrix C (M rows, R columns) as the matrix multiplication result under the control of the clock control module.

When the data input control module inputs matrix A (M rows, N columns) and matrix B (N rows, R columns) in parallel into the multiplier array under the control of the clock control module, the following rule is used:

Matrix A has M rows and N columns, and A_{i,k} is an element of matrix A, with i = 1, 2, ..., M and k = 1, 2, ..., N. Matrix B has N rows and R columns, and B_{k,j} is an element of matrix B, with k = 1, 2, ..., N and j = 1, 2, ..., R.

The row vectors of matrix A enter the corresponding rows of the multiplier array in ascending order of row number, and adjacent row vectors enter the multiplier array one clock cycle apart; that is, the element A_{i,k} in row i, column k of matrix A enters the multiplier array at the same time as the element A_{i-1,k-1} in row i-1, column k-1 of matrix A;

The column vectors of matrix B enter the corresponding columns of the multiplier array in ascending order of column number, and adjacent column vectors enter the multiplier array one clock cycle apart; that is, the element B_{k,j} in row k, column j of matrix B enters the multiplier array at the same time as the element B_{k-1,j-1} in row k-1, column j-1 of matrix B;

Matrix A entering the multiplier array by rows and matrix B entering by columns proceed in parallel in time; that is, A_{i,k} and B_{k,j} enter the multiplier array in the same clock cycle, until all elements of matrix A and matrix B have passed through the full rows and full columns of the multiplier array composed of M × R node units, so that the data arriving at each node unit satisfies the time-alignment input rule.
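The time-alignment rule above can be sketched as a small scheduling function. This is a minimal sketch under stated assumptions, not the patent's implementation: the names `injection_cycle` and `feed_schedule` are hypothetical, indices are 0-based, and elements within a row (or column) are assumed to be fed highest k first, which is one ordering under which A_{i,k} and A_{i-1,k-1} enter in the same cycle as the rule states.

```python
def injection_cycle(lane: int, k: int, n: int) -> int:
    """Cycle (0-based) at which element k of a lane enters the array.

    `lane` is the row index i for matrix A or the column index j for matrix B.
    Each lane starts one cycle after the previous lane. Feeding the highest k
    first is an assumption made for this sketch.
    """
    return lane + (n - 1 - k)


def feed_schedule(m: int, n: int, r: int):
    """Return, per cycle, the (row, col) indices of the A and B elements fed."""
    total = max(m, r) + n - 1  # cycles until the last lane has been fully fed
    schedule = []
    for t in range(total):
        # Lane i is active while 0 <= t - i < n; it feeds element k = n-1-(t-i).
        a_in = [(i, n - 1 - (t - i)) for i in range(m) if 0 <= t - i < n]
        b_in = [(n - 1 - (t - j), j) for j in range(r) if 0 <= t - j < n]
        schedule.append((a_in, b_in))
    return schedule
```

In this sketch, row i of A and column i of B are always fed the same inner index k in a given cycle, so the operands of each product reach a node together.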

The multiplication and accumulation of the input data in each node unit of the multiplier array comprises the following steps in a single node unit:

1) receive the data from the previous node in the row and in the column along the systolic direction;

2) compute the product of the two data items and add it to the previously stored result;

3) store the accumulated value, output the received row input data to the next node in the row, and output the received column input data to the next node in the column.
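The three steps above can be sketched as a behavioral model of one node. This is an illustrative software sketch, not HDL and not the patent's circuit: the class name `NodeUnit` and its attribute names are hypothetical, and plain Python integers stand in for whatever word width the hardware would use.

```python
class NodeUnit:
    """Behavioral model of one node: a multiplier, an adder and a register."""

    def __init__(self) -> None:
        self.acc = 0      # register holding the running sum of products
        self.row_out = 0  # row data forwarded to the next node in the row
        self.col_out = 0  # column data forwarded to the next node in the column

    def reset(self) -> None:
        """Clear all stored data before a new matrix multiplication."""
        self.acc = 0
        self.row_out = 0
        self.col_out = 0

    def step(self, row_in: int, col_in: int):
        """One clock cycle: receive, multiply-accumulate, forward."""
        self.acc += row_in * col_in  # step 2: product added to stored result
        self.row_out = row_in        # step 3: pass row data along the row
        self.col_out = col_in        # step 3: pass column data down the column
        return self.row_out, self.col_out
```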

Compared with the prior art, the present invention has the following beneficial effects:

1. The present invention adopts a matrix array with a systolic structure that is fully pipelined and places very low fan-out requirements on the input side, so it can reach very high computational performance. In testing, an implementation on an Altera DE2 development board reached a maximum clock frequency of nearly 200 MHz, and a relatively high clock frequency was maintained as the matrix size increased.

2. The present invention controls the whole matrix multiplier with two clocks, so that the input-and-calculation process and the output process use independent clocks. This makes the multiplier easy to operate, helps avoid misoperation, and suits applications where the computing side and the reading side use different clocks.

3. The present invention has good modularity and is particularly suitable for use as a module in an embedded system. For different computation scales, because all nodes in the array are identical, the array only needs to be rearranged, which makes reconfiguration convenient.

4. The present invention has good low-power performance. Because two clocks control the calculation process, the clock is turned off when no calculation is in progress, which greatly reduces the overall power consumption of the matrix multiplier and makes it suitable for use in embedded systems.

Brief description of the drawings

Fig. 1 is a schematic diagram of the systolic array in the method of the present invention;

Fig. 2 is a schematic diagram of the internal structure of the matrix multiplier in the method of the present invention;

Fig. 3 is a schematic diagram of the clock control module of the matrix multiplier in the method of the present invention;

Fig. 4 is an internal structure diagram of the node unit in the method of the present invention;

Fig. 5 is a diagram of an input example of the matrix multiplier in the method of the present invention;

Fig. 6 is a state diagram of the data input and calculation of the matrix multiplier at different input-calculation clock cycles in the method of the present invention.

Detailed description of the invention

The present invention is described in further detail below with reference to the drawings and embodiments.

As shown in Fig. 2, an FPGA-based systolic-structure matrix multiplier comprises a multiplier array, a clock control module, a data input control module, and a data output control module. The multiplier array is a two-dimensional mesh array of M rows and R columns, composed of M × R node units with adjacent node units interconnected, and is used to compute the product of input matrix A and matrix B. The clock control module outputs two clocks, supplied respectively to the data input control module and the data output control module. The data input control module inputs matrix A and matrix B in parallel under the control of the clock control module, where matrix A has M rows and N columns and matrix B has N rows and R columns. The data output control module outputs matrix C, where matrix C has M rows and R columns and C = A × B. Under the control of the clock control module, the data input control module inputs matrix A by rows and matrix B by columns, until all data of matrix A and matrix B have passed through the full rows and full columns of all node units of the multiplier array. The data output control module outputs by rows or by columns under the control of the clock control module.

Fig. 3 is a schematic diagram of the clock control module of the matrix multiplier in the method of the present invention. The matrix multiplier uses two clocks: the input-calculation clock CLK1 and the output clock CLK2. The input-calculation clock comes from the computing side and is drawn as a black line in the figure; the output clock comes from the reading side and is drawn as a dashed line. The input-calculation clock mainly serves as the CLK1 clock of the input control module and the nodes; the output clock mainly serves as the CLK2 clock of the data output module and the nodes. The clock control module controls the whole matrix calculation flow. The computing side requests a calculation with Req1, and the module replies to the computing side with Ack1 according to its own state. The reading side requests the results with Req2, and the module replies to the reading side with Ack2 according to its own state. In the input and calculation state, the input-calculation clock CLK1 is enabled and the output clock CLK2 is turned off; in the output state, CLK1 is turned off and CLK2 is enabled.
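The request/acknowledge flow and clock gating described above can be sketched as a small state machine. This is a minimal sketch under stated assumptions: the state names, the class `ClockControl`, and the `compute_done` and `output_done` hooks are hypothetical, since the text only says that the module replies with Ack1 or Ack2 "according to its own state".

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()          # waiting for Req1
    COMPUTING = auto()     # CLK1 enabled, CLK2 off
    RESULT_READY = auto()  # both clocks off, waiting for Req2
    OUTPUTTING = auto()    # CLK2 enabled, CLK1 off


class ClockControl:
    def __init__(self) -> None:
        self.state = State.IDLE
        self.clk1_enabled = False
        self.clk2_enabled = False

    def req1(self) -> bool:
        """Calculation request from the computing side; returns Ack1."""
        if self.state is not State.IDLE:
            return False
        self.state = State.COMPUTING
        self.clk1_enabled = True
        return True

    def compute_done(self) -> None:
        """Called when the calculation cycles have elapsed; gates CLK1 off."""
        if self.state is State.COMPUTING:
            self.state = State.RESULT_READY
            self.clk1_enabled = False

    def req2(self) -> bool:
        """Read request from the reading side; returns Ack2."""
        if self.state is not State.RESULT_READY:
            return False
        self.state = State.OUTPUTTING
        self.clk2_enabled = True
        return True

    def output_done(self) -> None:
        """Called when all results have been shifted out; gates CLK2 off."""
        if self.state is State.OUTPUTTING:
            self.state = State.IDLE
            self.clk2_enabled = False
```

In this model the two clocks are never enabled at the same time, and both are off whenever the multiplier is idle or holding a finished result.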

As shown in Fig. 4, the node unit is the basic unit of the multiplier array and consists of a multiplier, an adder, and a register for storing the calculation result. The multiplier receives the input data of the node unit and sends the product to the adder; the register stores the accumulated sum from the previous clock cycle; the adder adds the product from the multiplier to the accumulated sum of the previous clock cycle held in the register, and sends the result to the register to update its contents. In the figure, Row_in and Col_in are the input data of the node, CLK1 is the synchronous clock, and rstn is the asynchronous reset signal; before each new matrix multiplication, a reset is needed to clear all data in the nodes. Under the control of the synchronous clock, Row_in and Col_in are the two inputs of the multiplier, and their product is one input of the accumulator. REG is the storage element in the node; it holds the previous accumulated result and serves as the other input of the accumulator. The accumulator adds the product to the data in REG and writes the result back to REG, completing one multiply-accumulate step. When the nodes in the array receive no further input data, the data in each node's REG is the final calculation result. When CLK2 is enabled, the node outputs the data in REG, accepts the data from the previous node, and stores it in REG.

An implementation method of the FPGA-based systolic-structure matrix multiplier is characterized by comprising the following steps:

1) receive the data from the previous node in the row and in the column along the systolic direction;

The output matrix C is computed as C_{i,j} = \sum_{k=1}^{N} A_{i,k} \times B_{k,j}, where i = 1, 2, ..., M; k = 1, 2, ..., N; j = 1, 2, ..., R.

All node output actions are performed under the control of CLK2: each node outputs the result in its accumulator to the next node in the same direction, accepts the data from the previous node, and stores it in its accumulator. This repeats until all result data have been output. The input and calculation time of the data input control module is (M+N+R-1) CLK1 clock cycles; the data output control module takes M CLK2 clock cycles to output the result by rows and R CLK2 clock cycles to output it by columns.
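The shift-out behavior and the cycle counts above can be sketched as follows. This is a minimal sketch, assuming that results drain one full row per CLK2 cycle through the bottom edge of the array, so the last row leaves first; the exit edge is not stated in the text, and the names `shift_out_by_rows` and `cycle_counts` are hypothetical.

```python
def shift_out_by_rows(acc):
    """Drain an M x R array of accumulated results, one row per CLK2 cycle."""
    m, r = len(acc), len(acc[0])
    regs = [row[:] for row in acc]  # copy so the caller's data is untouched
    rows_out = []
    for _ in range(m):              # M CLK2 cycles in total
        rows_out.append(regs[-1])   # the bottom row leaves the array
        # Every node takes the value of the node above; zeros enter at the top.
        regs = [[0] * r] + regs[:-1]
    return rows_out


def cycle_counts(m: int, n: int, r: int) -> dict:
    """Cycle counts as stated in the text for an M x N by N x R product."""
    return {
        "clk1_compute": m + n + r - 1,
        "clk2_by_rows": m,
        "clk2_by_cols": r,
    }
```

For the 3 × 3 case, `cycle_counts(3, 3, 3)` gives 8 CLK1 cycles and 3 CLK2 cycles in either output direction.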

As shown in Fig. 5, an embodiment of the present invention with 3 × 3 node units is implemented as follows. The input matrices A and B are both 3 × 3 arrays, and the matrix C = A × B is to be obtained:

(1) Configure the node array as 3 rows by 3 columns, interconnected as a two-dimensional mesh.

(2) Configure the input control module and the data output module according to the matrix size, so that they support 3-row and 3-column parallel input and output.

(3) Configure the clock control module according to the matrix size.

(4) Reset the whole matrix multiplier and wait for the computing side to send the calculation request Req1; check the current state, reply with Ack1 (ready to calculate), and enable CLK1.

(5) The data input control module feeds the row and column data according to the input rule.

(6) After (M+N+R-1) = (3+3+3-1) = 8 CLK1 clock cycles, the calculation is complete and CLK1 is turned off.

(7) Wait for the reading side to send the read request Req2; the clock control module replies with Ack2 (ready to read) according to the current state.

(8) Enable CLK2; the matrix outputs its data row by row or column by column.

(9) After 3 CLK2 clock cycles (M = R = 3), the data has been output by rows or by columns.
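Steps (5) and (6) can be checked with a cycle-level software model of the array. This is a simplified sketch, not the patent's hardware: the function name `systolic_matmul` and the register names are hypothetical, the data input control module is assumed to present each element through one register stage before the edge node consumes it, and elements within a row or column are assumed to be fed highest index first. Under these assumptions the model finishes in M+N+R-1 cycles, matching the count in step (6).

```python
def systolic_matmul(a, b, cycles=None):
    """Simulate an M x R output-stationary systolic array computing a x b."""
    m, n, r = len(a), len(b), len(b[0])
    if cycles is None:
        cycles = m + n + r - 1               # CLK1 cycles stated in the text
    acc = [[0] * r for _ in range(m)]        # REG of every node
    row_reg = [[0] * r for _ in range(m)]    # row data forwarded by each node
    col_reg = [[0] * r for _ in range(m)]    # column data forwarded by each node
    a_edge = [0] * m                         # input-module registers, one per row
    b_edge = [0] * r                         # input-module registers, one per column

    for t in range(cycles):
        new_row = [[0] * r for _ in range(m)]
        new_col = [[0] * r for _ in range(m)]
        for i in range(m):
            for j in range(r):
                # Edge nodes read the input module; others read their neighbor.
                row_in = a_edge[i] if j == 0 else row_reg[i][j - 1]
                col_in = b_edge[j] if i == 0 else col_reg[i - 1][j]
                acc[i][j] += row_in * col_in
                new_row[i][j] = row_in
                new_col[i][j] = col_in
        row_reg, col_reg = new_row, new_col
        # Skewed feed: lane i starts i cycles late; idle lanes present zero.
        a_edge = [a[i][n - 1 - (t - i)] if 0 <= t - i < n else 0 for i in range(m)]
        b_edge = [b[n - 1 - (t - j)][j] if 0 <= t - j < n else 0 for j in range(r)]
    return acc
```

With every row of A equal to (1, 2, 3) and the rows of B equal to (2, 2, 2), (3, 3, 3) and (4, 4, 4), the model returns 20 in every position after 8 cycles; stopping one cycle earlier leaves the bottom-right node still short of its final value.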

Fig. 6 shows the data input and calculation states of the matrix multiplier at different input-calculation clock cycles in the method of the present invention.

It can be seen that, for the input matrix

A = [\begin{matrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{matrix}] = [\begin{matrix} 1 & 2 & 3 \\ 1 & 2 & 3 \\ 1 & 2 & 3 \end{matrix}],

the matrix

B = [\begin{matrix} b_{1,1} & b_{1,2} & b_{1,3} \\ b_{2,1} & b_{2,2} & b_{2,3} \\ b_{3,1} & b_{3,2} & b_{3,3} \end{matrix}] = [\begin{matrix} 2 & 2 & 2 \\ 3 & 3 & 3 \\ 4 & 4 & 4 \end{matrix}],

and the output matrix

C = [\begin{matrix} c_{1,1} & c_{1,2} & c_{1,3} \\ c_{2,1} & c_{2,2} & c_{2,3} \\ c_{3,1} & c_{3,2} & c_{3,3} \end{matrix}] = [\begin{matrix} 20 & 20 & 20 \\ 20 & 20 & 20 \\ 20 & 20 & 20 \end{matrix}],

the product

[\begin{matrix} 1 & 2 & 3 \\ 1 & 2 & 3 \\ 1 & 2 & 3 \end{matrix}] \times [\begin{matrix} 2 & 2 & 2 \\ 3 & 3 & 3 \\ 4 & 4 & 4 \end{matrix}] = [\begin{matrix} 20 & 20 & 20 \\ 20 & 20 & 20 \\ 20 & 20 & 20 \end{matrix}]

holds, satisfying A × B = C.
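Each entry of the result can be checked by hand from the formula C_{i,j} = ∑ A_{i,k} × B_{k,j}, using the operands of the product shown above: every row of the left matrix is (1, 2, 3) and every column of the right matrix is (2, 3, 4), so all nine entries take the same value.

```latex
c_{i,j} = \sum_{k=1}^{3} a_{i,k}\, b_{k,j}
        = 1 \cdot 2 + 2 \cdot 3 + 3 \cdot 4
        = 20, \qquad i, j = 1, 2, 3 .
```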

The invention can be readily applied by following the embodiment described above.

Claims

1. An FPGA-based systolic-structure matrix multiplier, characterized by comprising: a multiplier array, a clock control module, a data input control module, and a data output control module;

2. The FPGA-based systolic-structure matrix multiplier according to claim 1, characterized in that the node unit consists of a multiplier, an adder, and a register for storing the calculation result; the multiplier receives the input data of the node unit and sends the product to the adder; the register stores the accumulated sum from the previous clock cycle; the adder adds the product from the multiplier to the accumulated sum of the previous clock cycle held in the register, and sends the result to the register to update its contents.

3. The FPGA-based systolic-structure matrix multiplier according to claim 2, characterized in that the multiplier inside the node unit uses a dedicated hard multiplier block inside the FPGA.

4. The FPGA-based systolic-structure matrix multiplier according to claim 1, characterized in that, under the control of the clock control module, the data input control module inputs matrix A by rows and matrix B by columns, until all data of matrix A and matrix B have passed through the full rows and full columns of all node units of the multiplier array.

5. The FPGA-based systolic-structure matrix multiplier according to claim 1, characterized in that the data output control module outputs by rows or by columns under the control of the clock control module.

6. An implementation method of an FPGA-based systolic-structure matrix multiplier, characterized by comprising the following steps:

7. The implementation method of an FPGA-based systolic-structure matrix multiplier according to claim 6, characterized in that, when the data input control module inputs matrix A (M rows, N columns) and matrix B (N rows, R columns) in parallel into the multiplier array under the control of the clock control module, the following rule is used:

8. The implementation method of an FPGA-based systolic-structure matrix multiplier according to claim 6, characterized in that the multiplication and accumulation of the input data in each node unit of the multiplier array comprises the following steps in a single node unit:

1) receive the data from the previous node in the row and in the column along the systolic direction;