CN109144469A - Pipeline organization neural network matrix operation framework and method - Google Patents
- Publication number
- CN109144469A (application number CN201810813920.3A)
- Authority
- CN
- China
- Prior art keywords
- input
- matrix
- vector
- column
- multiply
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/491—Computations with decimal numbers radix 12 or 20
- G06F7/498—Computations with decimal numbers radix 12 or 20 using counter-type accumulators
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention proposes a pipeline-structured neural network matrix operation architecture, comprising: an accelerator, implemented in digital circuitry, for performing a pipelined multiply-accumulate operation on an input vector A and an input matrix B to obtain the result A*B=D, wherein A is a vector of dimension 1*m (1 row, m columns), the dimension of B is m*n, and D is the 1-row, n-column vector-matrix output result. The pipelined multiply-accumulate operation means that the input matrix B is divided into multiple distinct column blocks: the input vector A is multiplied with the first column block of the input matrix B, the products are accumulated, and the result is output; the multiply, accumulate, and output are then performed for the next column block of the input matrix B; and so on iteratively, until the last column block of the input matrix B has also been multiplied, accumulated, and its result output, at which point the product D of the input vector A and the input matrix B is obtained.
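The column-blocked pipelined multiply-accumulate described in the abstract can be sketched as follows. This is an illustrative software model only, not part of the patented circuit; the function and parameter names (e.g. `block_cols`) are assumptions:

```python
def pipelined_matvec(A, B, block_cols):
    """Model of the pipelined A*B=D operation: A is 1*m, B is m*n.
    B is processed in column blocks; each block is multiplied with A,
    accumulated, and its partial result output before the next block."""
    m, n = len(A), len(B[0])
    D = []
    for start in range(0, n, block_cols):           # iterate over column blocks of B
        cols = range(start, min(start + block_cols, n))
        acc = [0] * len(cols)                       # one accumulator per column (one per MAC)
        for row in range(m):                        # m multiply-accumulate beats per block
            for k, col in enumerate(cols):
                acc[k] += A[row] * B[row][col]      # c = c + a*b
        D.extend(acc)                               # "reset pulse": output block result, clear
    return D
```

For example, `pipelined_matvec([1, 2], [[1, 2, 3], [4, 5, 6]], 2)` processes B in a 2-column block and a 1-column block and returns `[9, 12, 15]`, the ordinary product A*B.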
Description
Technical field
The present invention relates to the field of digital integrated circuit design, and in particular to a pipeline-structured neural network matrix operation architecture and method.
Background technique
For a set of input data, such as the feature vector of a speech signal or two-dimensional image data, computation through a neural network model can derive the morpheme information corresponding to the speech signal, or the annotation information corresponding to the image. From data input to the final output of the neural network computation, a large amount of computing or storage resources is generally consumed.
It is also known that the quality of an integrated circuit is evaluated mainly by its data processing speed, stability, material cost, and occupied area, and that the way data is processed affects many aspects of performance, such as operation speed. Chip designers currently on the market therefore apply all manner of optimizations to their processing algorithms in order to achieve efficiency, save cost, and enhance product performance. Existing neural network matrix operation architectures, however, usually have the following disadvantages:
1. The dimensions of the matrix operation are fixed, and the operation scale cannot be changed adaptively;
2. The operation is usually performed in software by a central processing unit (CPU) occupying memory such as RAM; its speed depends on the operating frequency of the CPU, it consumes a large amount of memory space when the scale is large, and its computational efficiency is very low;
3. Matrix-vector multiplication realized by a DSP processor is often executed serially, with low execution efficiency and long execution time; the input vector and the weight matrix must pre-exist in RAM space, and the intermediate variables of the computation also need to be written out, further increasing storage and bandwidth overhead.
Summary of the invention
The purpose of the present invention is to provide a pipeline-structured neural network matrix operation architecture and method. An accelerator comprising an array of multiply-accumulate (MAC) units together with a counter and a shift unit is realized in digital circuitry; the input data are circulated according to a cyclic principle so that operations overlap in pipeline fashion as requested, replaying and accumulating, and the matrix-vector multiplication can therefore be executed in parallel. Compared with CPU and DSP processing, this greatly improves processing speed, and intermediate results can be stored locally without consuming additional storage overhead. With the assistance of a controller, the dimensions of the matrix and vector participating in the multiply-accumulate operation, the pulse count of the counter, and the shift depth of the shift unit can all be configured dynamically.
In order to achieve the above object, the invention is realized by the following technical scheme:
A pipeline-structured neural network matrix operation architecture, characterized in that it comprises:
an accelerator, implemented in digital circuitry, for performing a pipelined multiply-accumulate operation on an input vector A and an input matrix B to obtain the result A*B=D, wherein A is a vector of dimension 1*m (1 row, m columns), the dimension of B is m*n, and D is the 1-row, n-column vector-matrix output result; the pipelined multiply-accumulate operation means that the input matrix B is divided into multiple distinct column blocks: the input vector A is multiplied with the first column block of the input matrix B, accumulated, and the result output; the multiply, accumulate, and output are then performed for the next column block of the input matrix B; and so on iteratively, until the last column block of the input matrix B has also been multiplied, accumulated, and its result output, at which point the product D of the input vector A and the input matrix B is obtained.
In the above pipeline-structured neural network matrix operation architecture, the accelerator comprises:
a fixed-point multiply-accumulate module, for performing the pipelined multiply-accumulate operation on the input vector A and the input matrix B; the fixed-point multiply-accumulate module comprises several fixed-point multiply-accumulators running in parallel; the two input terminals of each fixed-point multiply-accumulator sequentially receive the elements of the 1-row, m-column vector A and the elements of the corresponding column within the current column block of the input matrix B, so that the multiply-accumulate of the input vector A against each column of the current column block of the input matrix B is executed synchronously in parallel; after the computation completes, the counter reset pulse, applied to the RC reset-pulse enable terminal of each fixed-point multiply-accumulator, triggers output of the accumulated result and clearing to zero, after which the multiply-accumulate of the next column block of the input matrix B with the input vector A is executed;
a counter, which outputs a reset pulse each time the multiply-accumulators have finished the multiply-accumulate of one column block of the input matrix B with the input vector A; the pulse passes through the first register chain to generate the pipeline reset signal applied to the RC reset-pulse enable terminal of each fixed-point multiply-accumulator; the counter clears its own pulse count each time the fixed-point multiply-accumulate module completes one full pipelined multiply-accumulate operation of the input vector A and the input matrix B;
a shift unit, for controlling the shift depth over the columns of the input vector A;
a first register chain, through which the counter applies pulse control to the RC reset-pulse enable terminal of each fixed-point multiply-accumulator;
a second register chain, through which the 1-row, m-column elements of the input vector A are continuously fed to each fixed-point multiply-accumulator;
several third register chains, through which the corresponding column elements of the current column block of the input matrix B are continuously fed to the corresponding fixed-point multiply-accumulators.
The above pipeline-structured neural network matrix operation architecture further comprises:
a controller, connected to the accelerator, for dynamically configuring the number of columns of the input vector A and the number of rows m of the input matrix B, the number of columns n of the input matrix B, and the pulse count of the counter in the accelerator, so as to control the shift depth applied by the shift unit to the columns of the input vector A, the counter's control of the RC reset-pulse enable terminals after the multiply-accumulate of each column block of the input matrix B with the input vector A is complete, and the clearing of the counter's pulse count after the full pipelined multiply-accumulate operation of the input vector A and the input matrix B is complete.
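The controller's dynamic configuration of the run parameters can be illustrated with a small model. This is a hedged sketch only; the class and field names are assumptions, not part of the patent:

```python
from dataclasses import dataclass

@dataclass
class MatrixOpConfig:
    """Parameters the controller would write to the accelerator before a run."""
    m: int           # rows of B, equal to the length of the 1*m input vector A
    n: int           # columns of B
    block_cols: int  # columns per column block (number of parallel MAC units)

    def counter_period(self):
        # the counter emits one reset pulse after m multiply-accumulate beats
        return self.m

    def iterations(self):
        # number of column blocks needed to cover all n columns (ceiling division)
        return -(-self.n // self.block_cols)

cfg = MatrixOpConfig(m=128, n=100, block_cols=32)
```

With these values the counter fires every 128 beats and the accelerator iterates over 4 column blocks (the last block being only 4 columns wide).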
In the above pipeline-structured neural network matrix operation architecture:
the controller is realized by a CPU.
In the above pipeline-structured neural network matrix operation architecture:
the number of fixed-point multiply-accumulators and the number of third register chains are each equal to the number of columns contained in each column block of the input matrix B.
A method of pipeline-structured neural network matrix operation in a digital circuit, characterized in that it comprises:
performing, by a digital circuit, a pipelined multiply-accumulate operation on an input vector A and an input matrix B to obtain the result A*B=D, wherein A is a vector of dimension 1*m, the dimension of B is m*n, and D is the 1-row, n-column vector-matrix output result;
the pipelined multiply-accumulate operation means that the input matrix B is divided into multiple distinct column blocks: the input vector A is multiplied with the first column block of the input matrix B, accumulated, and the result output; the multiply, accumulate, and output are then performed for the next column block of the input matrix B; and so on iteratively, until the last column block of the input matrix B has been multiplied, accumulated, and its result output, at which point the product D of the input vector A and the input matrix B is obtained.
Compared with the prior art, the present invention has the following advantage: through the hardware acceleration of high-speed matrix-vector multiply-accumulate operations, the above operation architecture achieves fast neural network acceleration, so that results are computed in real time once the data and model are loaded, greatly improving the speed and efficiency of neural network computation and further accelerating image and speech recognition.
Detailed description of the invention
Fig. 1 is a structural block diagram of the invention;
Fig. 2 is a structural block diagram of the accelerator of the invention;
Fig. 3 is a detailed block diagram of the accelerator in an embodiment of the invention.
Specific embodiment
The present invention is further elaborated below by describing a preferred embodiment in detail in conjunction with the accompanying drawings.
As shown in Fig. 1, the invention proposes a pipeline-structured neural network matrix operation architecture, comprising:
an accelerator, implemented in digital circuitry, for performing a pipelined multiply-accumulate operation on an input vector A and an input matrix B to obtain the result A*B=D, wherein A is a vector of dimension 1*m, the dimension of B is m*n, and D is the 1-row, n-column vector-matrix output result; the pipelined multiply-accumulate operation means that the input matrix B is divided into multiple distinct column blocks: the input vector A is multiplied with the first column block of the input matrix B, accumulated, and the result output; the multiply, accumulate, and output are then performed for the next column block of the input matrix B; and so on iteratively, until the last column block of the input matrix B has also been multiplied, accumulated, and its result output, at which point the product D of the input vector A and the input matrix B is obtained.
As shown in Fig. 2, specifically, the accelerator comprises:
a fixed-point multiply-accumulate module, for performing the pipelined multiply-accumulate operation on the input vector A and the input matrix B; the module comprises several fixed-point multiply-accumulators running in parallel, the two input terminals of each of which sequentially receive the elements of the 1-row, m-column vector A and the elements of the corresponding column within the current column block of the input matrix B, so that the multiply-accumulate of the input vector A against each column of the current column block of the input matrix B is executed synchronously in parallel (in the figure, i denotes the i-th column of matrix B, B[:][i] denotes all elements of the i-th column of matrix B, and the number of columns given a multiply-accumulate operation in one iteration is x+1); after the computation completes, the counter reset pulse, applied to the RC reset-pulse enable terminal of each fixed-point multiply-accumulator, triggers output of the accumulated result and clearing to zero, after which the multiply-accumulate of the next column block of the input matrix B with the input vector A is executed;
a counter (which may be a cyclic counter or a timer), which outputs a reset pulse each time the multiply-accumulators have finished the multiply-accumulate of one column block of the input matrix B with the input vector A; the pulse passes through the first register chain to generate the pipeline reset signal applied to the RC reset-pulse enable terminal of each fixed-point multiply-accumulator; the counter clears its own pulse count each time the fixed-point multiply-accumulate module completes one full pipelined multiply-accumulate operation of the input vector A and the input matrix B;
a shift unit, for controlling the shift depth over the columns of the input vector A;
a first register chain, through which the counter applies pulse control to the RC reset-pulse enable terminal of each fixed-point multiply-accumulator;
a second register chain, through which the 1-row, m-column elements of the input vector A are continuously fed to each fixed-point multiply-accumulator;
several third register chains, through which the corresponding column elements of the current column block of the input matrix B are continuously fed to the corresponding fixed-point multiply-accumulators.
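The behavior of a single fixed-point multiply-accumulator with its RC reset-pulse enable can be modeled as follows. This is an illustrative sketch only (the actual unit in the embodiment is a 16-bit fixed-point digital circuit); the class name is an assumption:

```python
class FixedPointMAC:
    """Model of one multiply-accumulator: c = c + a*b on each beat;
    a reset pulse outputs the accumulated value and clears it to zero."""
    def __init__(self):
        self.c = 0
    def step(self, a, b):
        self.c += a * b          # one multiply and one accumulate per beat
    def reset_pulse(self):
        out, self.c = self.c, 0  # output the accumulated result, then zero it
        return out

mac = FixedPointMAC()
for a, b in [(1, 4), (2, 5), (3, 6)]:  # one column: dot([1,2,3], [4,5,6])
    mac.step(a, b)
result = mac.reset_pulse()  # 32; the MAC is now clear for the next column block
```

Because the accumulated result lives inside the unit until the reset pulse fires, no intermediate value ever needs to be written to external memory, which is the storage-saving property claimed above.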
The pipeline-structured neural network matrix operation architecture also comprises:
a controller, which may be realized by a CPU, connected to the accelerator, for dynamically configuring the number of columns of the input vector A and the number of rows m of the input matrix B, the number of columns n of the input matrix B, and the pulse count of the counter in the accelerator, so as to control the shift depth applied by the shift unit to the columns of the input vector A, the counter's control of the RC reset-pulse enable terminals after the multiply-accumulate of each column block of the input matrix B with the input vector A is complete, and the clearing of the counter's pulse count after the full pipelined multiply-accumulate operation of the input vector A and the input matrix B is complete.
In the present embodiment, the number of fixed-point multiply-accumulators and the number of third register chains are each equal to the number of columns contained in each column block of the input matrix B.
Specifically, greater parallelism can be obtained by further expansion: for example, increasing the number of fixed-point multiply-accumulators allows more columns to be computed at once, thereby reducing the number of iterations.
A most preferred embodiment is described below to further illustrate the operation of the architecture of the invention.
As shown in Fig. 3, the structure can perform the multiply-accumulate operation of 32 columns at once. The two matrices in the example are A (1*m) and B (m*n); m and n are configurable to adapt to neural networks of different scales. The counter (Timer) in this example is a cyclic counter or timer: every m multiply-accumulate operations it outputs a reset pulse, which triggers output of the accumulated results and clearing to zero. Each MAC in the figure is a 16-bit fixed-point multiply-accumulator; there are 32 in total, i.e. x=31 in the present embodiment, and each iteration performs the multiply-accumulate of 32 columns. Each MAC operation completes one multiply and one accumulate, with the formula c = c + a*b, where a is a value from the input vector A, b is a value from the input matrix B, c is the accumulated result, and RC is the reset-pulse enable.
The overall matrix operation proceeds as follows. The input vector A continuously feeds its 1-row, m-column elements into the second register chain (Register Chain); at the same time, the input matrix B is fed in 32 columns at a time out of its m rows and n columns. Once all m row elements of the current 32 columns of B have been input, the operation D[1:32] = A[1:m] * B[1:m][1:32] is complete, and the next iteration begins. Each iteration requires all elements of the input vector A but selects a different column block of B: for example, the first iteration selects columns 1 to 32 of B, and the second iteration selects columns 33 to 64. While the results of one iteration are being output, the same accumulation sequence restarts; only the elements of B differ between iterations, and the accumulation rule stays the same. After the last round of accumulation has been carried out, a new matrix is obtained, arranged as a 1*n array, at which point the counter issues a reset and clears to zero; each clearing marks the completion of one two-matrix operation. A new matrix operation can then be started, repeating the same sequence as before.
The advantages of this matrix multiplication are as follows: it can effectively reduce the energy-consumption and latency cost that a CPU or DSP would otherwise incur performing the same operation. First, it avoids the overhead of CPU-style data processing, which must fetch data, decode, analyze, and execute before finally outputting a result; this matrix-multiplication unit instead feeds data directly into the register chains and performs the operation on each clock beat, with no decoding required. The third advantage is that such a matrix circuit can be designed flexibly: for a larger matrix, the register chains can be made 32 or 64 deep, or, to save space and hardware material cost, the circuit can be designed for 16-bit fixed-point matrix operation and simply perform more loop iterations during computation. Fourth, this matrix operation hardware circuit improves the utilization of the adders in the circuit, saving material cost.
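The embodiment above — 32 parallel MACs, m beats per column block, and the counter firing after every m multiply-accumulates — can be sketched as a software model. This is illustrative only (the assumption that n is a multiple of 32 is mine, made to keep the sketch short); the function name is not from the patent:

```python
def matvec_32wide(A, B):
    """Model of Fig. 3: per iteration, 32 MACs compute
    D[j:j+32] = A[1:m] * B[1:m][j:j+32]; the counter resets every m beats."""
    m, n = len(A), len(B[0])
    assert n % 32 == 0, "illustrative model assumes n is a multiple of 32"
    D = [0] * n
    for block in range(n // 32):                  # one iteration per 32-column block of B
        base = block * 32
        acc = [0] * 32                            # the 32 MAC accumulators
        for beat in range(m):                     # m beats; the counter fires after m
            a = A[beat]                           # A streams in through the register chain
            for k in range(32):
                acc[k] += a * B[beat][base + k]   # c = c + a*b in each MAC
        D[base:base + 32] = acc                   # reset pulse: output block, clear MACs
    return D
```

Each iteration consumes all of A but a fresh 32-column block of B, exactly as in the description: the first iteration covers columns 1 to 32, the second columns 33 to 64, and so on until the 1*n result D is assembled.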
The invention also provides a method of pipeline-structured neural network matrix operation in a digital circuit, comprising:
performing, by a digital circuit, a pipelined multiply-accumulate operation on an input vector A and an input matrix B to obtain the result A*B=D, wherein A is a vector of dimension 1*m, the dimension of B is m*n, and D is the 1-row, n-column vector-matrix output result;
the pipelined multiply-accumulate operation means that the input matrix B is divided into multiple distinct column blocks: the input vector A is multiplied with the first column block of the input matrix B, accumulated, and the result output; the multiply, accumulate, and output are then performed for the next column block of the input matrix B; and so on iteratively, until the last column block of the input matrix B has been multiplied, accumulated, and its result output, at which point the product D of the input vector A and the input matrix B is obtained.
In conclusion the present invention is arranged using the integration of the fixed point adder and multiplier and counter of big array, circulation theory is utilized
Data are subjected to circulation input, is overlapped just like pipeline organization according to primitive request, playbacks and add up, by a large-scale matrix
Vector operations split into small matrix-vector dimension and are operated, and customized can successively do x+1 parallel vectors and multiply, so that
Multiplying operation and can executing parallel for vector matrix, for traditional CPU and DSP processing mode, substantially increases processing
Speed, and intermediate result can be stored in local, not consume additional storage overhead, such as vector A and any column of matrix B
M times multiply-add result is retained in corresponding adder and multiplier, without carrying out data-moving;For controller, Ke Yishun
It accesses to sequence and outputs and inputs data, successively moved into calculative data by shift register, for controller
Only need in advance to read in the data of input vector A, while reading in every column data of input matrix B in batches, when all data all
It reads in and completes, that is, complete a vector matrix and multiply operation.
Although the contents of the present invention have been described in detail through the above preferred embodiments, it should be appreciated that the above description is not to be considered a limitation of the present invention. After those skilled in the art have read the above, various modifications and substitutions to the present invention will be apparent. Therefore, the protection scope of the present invention shall be limited by the appended claims.
Claims (6)
1. A pipeline-structured neural network matrix operation architecture, characterized by comprising:
an accelerator, implemented in digital circuitry, for performing a pipelined multiply-accumulate operation on an input vector A and an input matrix B to obtain the result A*B=D, wherein A is a vector of dimension 1*m, the dimension of B is m*n, and D is the 1-row, n-column vector-matrix output result; the pipelined multiply-accumulate operation means that the input matrix B is divided into multiple distinct column blocks: the input vector A is multiplied with the first column block of the input matrix B, accumulated, and the result output; the multiply, accumulate, and output are then performed for the next column block of the input matrix B; and so on iteratively, until the last column block of the input matrix B has also been multiplied, accumulated, and its result output, at which point the product D of the input vector A and the input matrix B is obtained.
2. The pipeline-structured neural network matrix operation architecture as claimed in claim 1, characterized in that the accelerator comprises:
a fixed-point multiply-accumulate module, for performing the pipelined multiply-accumulate operation on the input vector A and the input matrix B; the fixed-point multiply-accumulate module comprises several fixed-point multiply-accumulators running in parallel; the two input terminals of each fixed-point multiply-accumulator sequentially receive the elements of the 1-row, m-column vector A and the elements of the corresponding column within the current column block of the input matrix B, so that the multiply-accumulate of the input vector A against each column of the current column block of the input matrix B is executed synchronously in parallel; after the computation completes, the counter reset pulse, applied to the RC reset-pulse enable terminal of each fixed-point multiply-accumulator, triggers output of the accumulated result and clearing to zero, after which the multiply-accumulate of the next column block of the input matrix B with the input vector A is executed;
a counter, which outputs a reset pulse each time the multiply-accumulators have finished the multiply-accumulate of one column block of the input matrix B with the input vector A; the pulse passes through the first register chain to generate the pipeline reset signal applied to the RC reset-pulse enable terminal of each fixed-point multiply-accumulator; the counter clears its own pulse count each time the fixed-point multiply-accumulate module completes one full pipelined multiply-accumulate operation of the input vector A and the input matrix B;
a shift unit, for controlling the shift depth over the columns of the input vector A;
a first register chain, through which the counter applies pulse control to the RC reset-pulse enable terminal of each fixed-point multiply-accumulator;
a second register chain, through which the 1-row, m-column elements of the input vector A are continuously fed to each fixed-point multiply-accumulator;
several third register chains, through which the corresponding column elements of the current column block of the input matrix B are continuously fed to the corresponding fixed-point multiply-accumulators.
3. The pipeline-structured neural network matrix operation architecture as claimed in claim 2, characterized in that it further comprises:
a controller, connected to the accelerator, for dynamically configuring the number of columns of the input vector A and the number of rows m of the input matrix B, the number of columns n of the input matrix B, and the pulse count of the counter in the accelerator, so as to control the shift depth applied by the shift unit to the columns of the input vector A, the counter's control of the RC reset-pulse enable terminals after the multiply-accumulate of each column block of the input matrix B with the input vector A is complete, and the clearing of the counter's pulse count after the full pipelined multiply-accumulate operation of the input vector A and the input matrix B is complete.
4. The pipeline-structured neural network matrix operation architecture as claimed in claim 1, characterized in that:
the controller is realized by a CPU.
5. The pipeline-structured neural network matrix operation architecture as claimed in claim 2, characterized in that:
the number of fixed-point multiply-accumulators and the number of third register chains are each equal to the number of columns contained in each column block of the input matrix B.
6. A method of pipeline-structured neural network matrix operation in a digital circuit, characterized by comprising:
performing, by a digital circuit, a pipelined multiply-accumulate operation on an input vector A and an input matrix B to obtain the result A*B=D, wherein A is a vector of dimension 1*m, the dimension of B is m*n, and D is the 1-row, n-column vector-matrix output result;
the pipelined multiply-accumulate operation means that the input matrix B is divided into multiple distinct column blocks: the input vector A is multiplied with the first column block of the input matrix B, accumulated, and the result output; the multiply, accumulate, and output are then performed for the next column block of the input matrix B; and so on iteratively, until the last column block of the input matrix B has been multiplied, accumulated, and its result output, at which point the product D of the input vector A and the input matrix B is obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810813920.3A CN109144469B (en) | 2018-07-23 | 2018-07-23 | Pipeline structure neural network matrix operation architecture and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810813920.3A CN109144469B (en) | 2018-07-23 | 2018-07-23 | Pipeline structure neural network matrix operation architecture and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109144469A true CN109144469A (en) | 2019-01-04 |
CN109144469B CN109144469B (en) | 2023-12-05 |
Family
ID=64801554
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810813920.3A Active CN109144469B (en) | 2018-07-23 | 2018-07-23 | Pipeline structure neural network matrix operation architecture and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109144469B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5008833A (en) * | 1988-11-18 | 1991-04-16 | California Institute Of Technology | Parallel optoelectronic neural network processors |
CN102662623A (en) * | 2012-04-28 | 2012-09-12 | 电子科技大学 | Parallel matrix multiplier based on single field programmable gate array (FPGA) and implementation method for parallel matrix multiplier |
CN104572011A (en) * | 2014-12-22 | 2015-04-29 | 上海交通大学 | FPGA (Field Programmable Gate Array)-based general matrix fixed-point multiplier and calculation method thereof |
CN105589677A (en) * | 2014-11-17 | 2016-05-18 | 沈阳高精数控智能技术股份有限公司 | Systolic structure matrix multiplier based on FPGA (Field Programmable Gate Array) and implementation method thereof |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110276047A (en) * | 2019-05-18 | 2019-09-24 | 南京惟心光电系统有限公司 | A method of matrix-vector multiplication is carried out using photoelectricity computing array |
CN110276047B (en) * | 2019-05-18 | 2023-01-17 | 南京惟心光电系统有限公司 | Method for performing matrix vector multiplication operation by using photoelectric calculation array |
CN110738311A (en) * | 2019-10-14 | 2020-01-31 | 哈尔滨工业大学 | LSTM network acceleration method based on high-level synthesis |
CN110889259A (en) * | 2019-11-06 | 2020-03-17 | 北京中科胜芯科技有限公司 | Sparse matrix vector multiplication calculation unit for arranged block diagonal weight matrix |
CN110889259B (en) * | 2019-11-06 | 2021-07-09 | 北京中科胜芯科技有限公司 | Sparse matrix vector multiplication calculation unit for arranged block diagonal weight matrix |
CN112434256A (en) * | 2020-12-03 | 2021-03-02 | 海光信息技术股份有限公司 | Matrix multiplier and processor |
CN112434256B (en) * | 2020-12-03 | 2022-09-13 | 海光信息技术股份有限公司 | Matrix multiplier and processor |
WO2022189872A1 (en) * | 2021-03-09 | 2022-09-15 | International Business Machines Corporation | Resistive memory device for matrix-vector multiplications |
GB2619654A (en) * | 2021-03-09 | 2023-12-13 | Ibm | Resistive memory device for matrix-vector multiplications |
CN113266559A (en) * | 2021-05-21 | 2021-08-17 | 华能秦煤瑞金发电有限责任公司 | Neural network-based wireless detection method for concrete delivery pump blockage |
CN113266559B (en) * | 2021-05-21 | 2022-10-28 | 华能秦煤瑞金发电有限责任公司 | Neural network-based wireless detection method for concrete delivery pump blockage |
Also Published As
Publication number | Publication date |
---|---|
CN109144469B (en) | 2023-12-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109144469A (en) | Pipeline organization neural network matrix operation framework and method | |
KR102443546B1 (en) | matrix multiplier | |
US10691996B2 (en) | Hardware accelerator for compressed LSTM | |
CN104899182B (en) | A kind of Matrix Multiplication accelerated method for supporting variable partitioned blocks | |
US10698657B2 (en) | Hardware accelerator for compressed RNN on FPGA | |
CN102197369B (en) | Apparatus and method for performing SIMD multiply-accumulate operations | |
CN104915322B (en) | A kind of hardware-accelerated method of convolutional neural networks | |
CN111062472B (en) | Sparse neural network accelerator based on structured pruning and acceleration method thereof | |
CN109472350A (en) | A kind of neural network acceleration system based on block circulation sparse matrix | |
CN107704916A (en) | A kind of hardware accelerator and method that RNN neutral nets are realized based on FPGA | |
CN109284824B (en) | Reconfigurable technology-based device for accelerating convolution and pooling operation | |
CN102945224A (en) | High-speed variable point FFT (Fast Fourier Transform) processor based on FPGA (Field-Programmable Gate Array) and processing method of high-speed variable point FFT processor | |
CN116710912A (en) | Matrix multiplier and control method thereof | |
Que et al. | Recurrent neural networks with column-wise matrix–vector multiplication on FPGAs | |
Cho et al. | FARNN: FPGA-GPU hybrid acceleration platform for recurrent neural networks | |
CN109284085B (en) | High-speed modular multiplication and modular exponentiation operation method and device based on FPGA | |
CN107368459A (en) | The dispatching method of Reconfigurable Computation structure based on Arbitrary Dimensions matrix multiplication | |
CN115408061B (en) | Hardware acceleration method, device, chip and storage medium for complex matrix operation | |
Cao et al. | FPGA-based accelerator for convolution operations | |
CN110716751B (en) | High-parallelism computing platform, system and computing implementation method | |
TWI688895B (en) | Fast vector multiplication and accumulation circuit | |
CN102231624B (en) | Vector processor-oriented floating point complex number block finite impulse response (FIR) vectorization realization method | |
Gao et al. | FPGA-based accelerator for independently recurrent neural network | |
CN104598199B (en) | The data processing method and system of a kind of Montgomery modular multipliers for smart card | |
CN1553310A (en) | Symmetric cutting algorithm for high-speed low loss multiplier and circuit strucure thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||