CN109144469B - Pipeline structure neural network matrix operation architecture and method - Google Patents

Pipeline structure neural network matrix operation architecture and method

Info

Publication number
CN109144469B
CN109144469B (application CN201810813920.3A)
Authority
CN
China
Prior art keywords
input
matrix
vector
multiplication
column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810813920.3A
Other languages
Chinese (zh)
Other versions
CN109144469A (en)
Inventor
王照钢
毛劲松
徐栋麟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Lightning Semiconductor Technology Co ltd
Original Assignee
Shanghai Lightning Semiconductor Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Lightning Semiconductor Technology Co ltd filed Critical Shanghai Lightning Semiconductor Technology Co ltd
Priority to CN201810813920.3A priority Critical patent/CN109144469B/en
Publication of CN109144469A publication Critical patent/CN109144469A/en
Application granted granted Critical
Publication of CN109144469B publication Critical patent/CN109144469B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/491Computations with decimal numbers radix 12 or 20.
    • G06F7/498Computations with decimal numbers radix 12 or 20. using counter-type accumulators
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Neurology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a pipeline structure neural network matrix operation architecture, which comprises: an accelerator, realized as a digital circuit, for carrying out a pipelined multiply-add operation on an input vector A and an input matrix B to obtain the result A×B=D, wherein A is a vector of dimension 1×m, B is an m×n matrix, and D is the output vector of 1 row and n columns. The pipelined multiply-add operation means that the input matrix B is divided into a plurality of column blocks: the input vector A is multiplied and accumulated with the first column block of the input matrix B and the result is output; the multiply-add of the input vector A with the next column block of the input matrix B is then performed and its result output; and this iteration repeats until the multiply-add of the input vector A with the last column block of the input matrix B is completed and its result output, whereupon the complete product D of the input vector A and the input matrix B has been obtained.

Description

Pipeline structure neural network matrix operation architecture and method
Technical Field
The invention relates to the technical field of digital integrated circuit design, and in particular to a pipeline structure neural network matrix operation architecture and method.
Background
For a group of input data, such as the feature vectors of a speech signal or two-dimensional image data, the corresponding morpheme information or image label can be obtained through computation with a neural network model. However, from data input to the final output, this computation often consumes a great deal of computing and storage resources.
The performance of an integrated circuit is mainly evaluated in terms of data processing speed, stability, material cost, and occupied area, and the data processing scheme largely determines the computing speed. Chip designers therefore seek various optimizations of the processing algorithm in order to achieve high efficiency, lower cost, and better product performance. Existing neural network matrix computing architectures generally have the following disadvantages:
1. The dimensions of the matrix operation are fixed, so the operation scale cannot adapt to different networks.
2. Computation is usually performed by a CPU occupying memory such as RAM. This is a software operation whose speed depends on the CPU's operating frequency; when the scale is large, a large amount of memory is consumed and the computation efficiency is very low.
3. When matrix-vector multiplication is realized on a DSP processor, the operation is typically serial, so execution efficiency is low and latency is long. The input vector and weight matrix must be pre-stored in RAM, and intermediate variables produced during the calculation must also be written out, further increasing storage and bandwidth overhead.
Disclosure of Invention
The invention aims to provide a pipeline structure neural network matrix operation architecture and method. A digital circuit realizes an accelerator comprising an array of multiply-accumulate (MAC) units, a counter matched with the MAC units, and a shifter. Data are fed in cyclically and, as in a pipeline, partial results are accumulated and returned to zero on schedule, so that matrix-vector multiply operations can be executed in parallel. The processing speed is greatly improved compared with CPU and DSP processing, and intermediate results are stored locally without consuming extra storage. In addition, a controller dynamically configures the dimensions of the matrix and vector participating in the multiply-add operation, the number of counter pulses, and the shift depth of the shifter.
In order to achieve the above purpose, the present invention is realized by the following technical scheme:
a pipeline structured neural network matrix operation architecture, comprising:
the accelerator is realized by a digital circuit and is used for carrying out a pipelined multiply-add operation on an input vector A and an input matrix B to obtain the result A×B=D, wherein A is a vector of dimension 1×m, B is an m×n matrix, and D is the output vector of 1 row and n columns; the pipelined multiply-add operation means that the input matrix B is divided into a plurality of column blocks: the input vector A is multiplied and accumulated with the first column block of the input matrix B and the result is output, then the multiply-add of the input vector A with the next column block of the input matrix B is performed and its result output, and this iteration repeats until the multiply-add of the input vector A with the last column block of the input matrix B is completed and its result output, whereupon the complete product D of the input vector A and the input matrix B has been obtained.
The pipeline structure neural network matrix operation architecture, wherein the accelerator comprises:
a fixed-point multiply-accumulate module for executing the pipelined multiply-add operation on the input vector A and the input matrix B; the fixed-point multiply-accumulate module comprises a plurality of fixed-point multiply-add devices running in parallel; the two inputs of each fixed-point multiply-add device sequentially receive the 1-row, m-column elements of the vector A and the elements of the corresponding column in the current column block of the input matrix B, so that multiplication and accumulation are executed synchronously for each corresponding column of the block; after a block's calculation is completed, the accumulated results are output and returned to zero under the control of the counter reset pulse applied to the RC reset-pulse enable of each fixed-point multiply-add device, after which the multiply-accumulate of the next column block of the input matrix B is executed;
a counter for outputting a reset pulse each time the fixed-point multiply-add devices finish the multiply-add of the input vector A with one column block of the input matrix B; the pulse generates a pipeline reset signal delivered through the first register chain to the RC reset-pulse enable of each fixed-point multiply-add device, and the counter clears its own count each time the fixed-point multiply-accumulate module finishes a complete pipelined multiply-add of the input vector A with the input matrix B;
a shifter for controlling the shift depth for the elements of the input vector A;
a first register chain, through which the counter applies pulse control to the RC reset-pulse enable of each fixed-point multiply-add device;
a second register chain, through which the 1-row, m-column elements of the input vector A are sequentially input to each fixed-point multiply-add device;
and a plurality of third register chains, through which the corresponding column elements in the current column block of the input matrix B are continuously input to the corresponding fixed-point multiply-add devices.
The pipeline structure neural network matrix operation architecture further comprises:
a controller, connected with the accelerator, for dynamically configuring the number of columns m of the input vector A (equal to the number of rows of the input matrix B), the number of columns n of the input matrix B, and the number of counter pulses in the accelerator, so that the shifter controls the shift depth for the elements of the input vector A, the counter controls the RC reset-pulse enable after each multiply-add of the input vector A with one column block of the input matrix B is finished, and the counter clears its count after each complete pipelined multiply-add of the input vector A with the input matrix B is finished.
The pipeline structure neural network matrix operation architecture, wherein:
the controller is realized by a CPU.
The pipeline structure neural network matrix operation architecture, wherein:
the number of fixed-point multiply-add devices and of third register chains is the same as the number of columns contained in each column block of the input matrix B.
A method for pipeline structure neural network matrix operation implemented by a digital circuit, comprising:
performing a pipelined multiply-add operation on an input vector A and an input matrix B through a digital circuit to obtain the result A×B=D, wherein A is a vector of dimension 1×m, B is an m×n matrix, and D is the output vector of 1 row and n columns;
the pipelined multiply-add operation means that the input matrix B is divided into a plurality of column blocks: the input vector A is multiplied and accumulated with the first column block of the input matrix B and the result is output, then the multiply-add of the input vector A with the next column block of the input matrix B is performed and its result output, and this iteration repeats until the multiply-add of the input vector A with the last column block of the input matrix B is completed and its result output, whereupon the complete product D of the input vector A and the input matrix B has been obtained.
Compared with the prior art, the invention has the following advantages: hardware acceleration of high-speed matrix-vector multiply-add operations provides fast neural network acceleration, so that once the data are input and the model is loaded, the result is computed in real time by the operation architecture. This greatly improves the operation speed and efficiency of the neural network and in turn accelerates image or speech recognition.
Drawings
FIG. 1 is a block diagram of the structure of the present invention;
FIG. 2 is a block diagram of the accelerator according to the present invention;
FIG. 3 is a block diagram of the specific structure of the accelerator according to an embodiment of the present invention.
Detailed Description
The invention will be further described by the following detailed description of a preferred embodiment, taken in conjunction with the accompanying drawings.
As shown in fig. 1, the present invention proposes a pipeline structure neural network matrix operation architecture, which includes:
the accelerator is realized by a digital circuit and is used for carrying out a pipelined multiply-add operation on an input vector A and an input matrix B to obtain the result A×B=D, wherein A is a vector of dimension 1×m, B is an m×n matrix, and D is the output vector of 1 row and n columns; the pipelined multiply-add operation means that the input matrix B is divided into a plurality of column blocks: the input vector A is multiplied and accumulated with the first column block of the input matrix B and the result is output, then the multiply-add of the input vector A with the next column block of the input matrix B is performed and its result output, and this iteration repeats until the multiply-add of the input vector A with the last column block of the input matrix B is completed and its result output, whereupon the complete product D of the input vector A and the input matrix B has been obtained.
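As a behavioral illustration only (the patent describes a hardware pipeline, not software), the column-block iteration just described can be sketched in a few lines of Python; the function name `pipelined_matvec` and the `block_cols` parameter, which stands in for the number of parallel multiply-add devices, are illustrative:

```python
import numpy as np

def pipelined_matvec(A, B, block_cols=32):
    """Behavioral sketch of the blocked multiply-add: B is processed one
    column block at a time, and each block's partial result is emitted as
    soon as its m multiply-add steps finish."""
    m, n = B.shape
    assert A.shape == (m,)
    D = np.empty(n, dtype=A.dtype)
    for j in range(0, n, block_cols):           # one iteration per column block
        block = B[:, j:j + block_cols]          # up to block_cols columns of B
        acc = np.zeros(block.shape[1], dtype=A.dtype)
        for k in range(m):                      # m multiply-add steps per block
            acc += A[k] * block[k, :]           # all columns of the block advance in parallel
        D[j:j + block.shape[1]] = acc           # output result; accumulators return to zero
    return D
```

Concatenating the per-block outputs reproduces the full product A×B, which is exactly what the iterate-until-last-block description asserts.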
As shown in fig. 2, specifically, the accelerator includes:
a fixed-point multiply-accumulate module for executing the pipelined multiply-add operation on the input vector A and the input matrix B; the fixed-point multiply-accumulate module comprises a plurality of fixed-point multiply-add devices running in parallel; the two inputs of each fixed-point multiply-add device sequentially receive the 1-row, m-column elements of the vector A and the elements of the corresponding column in the current column block of the input matrix B, so that multiplication and accumulation are executed synchronously for each corresponding column of the block (in the figure, i denotes the i-th column of the matrix B, B[i] denotes all elements of the i-th column of B, and the number of columns processed by one iteration of the multiply-add operation is x+1); after a block's calculation is completed, the accumulated results are output and returned to zero under the control of the counter reset pulse applied to the RC reset-pulse enable of each fixed-point multiply-add device, after which the multiply-accumulate of the next column block of the input matrix B is executed;
a counter (which may be a loop counter or a timer) for outputting a reset pulse each time the fixed-point multiply-add devices finish the multiply-add of the input vector A with one column block of the input matrix B; the reset pulse generates a pipeline reset signal delivered through the first register chain to the RC reset-pulse enable of each fixed-point multiply-add device, and the counter clears its own count each time the fixed-point multiply-accumulate module finishes a complete pipelined multiply-add of the input vector A with the input matrix B;
a shifter for controlling the shift depth for the elements of the input vector A;
a first register chain, through which the counter applies pulse control to the RC reset-pulse enable of each fixed-point multiply-add device;
a second register chain, through which the 1-row, m-column elements of the input vector A are sequentially input to each fixed-point multiply-add device;
and a plurality of third register chains, through which the corresponding column elements in the current column block of the input matrix B are continuously input to the corresponding fixed-point multiply-add devices.
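A minimal software model of one multiply-add device may clarify the RC reset behavior described above. This is a sketch under the assumption that each device keeps a local accumulator c and computes c = c + a·b per beat; the class name is hypothetical:

```python
class FixedPointMAC:
    """Behavioral model of one fixed-point multiply-add device: the
    intermediate accumulation stays local to the device until the
    counter's reset pulse outputs it and returns it to zero."""

    def __init__(self):
        self.c = 0  # local accumulator; never written to external memory

    def step(self, a, b):
        """One beat: accumulate the product of a vector element and a matrix element."""
        self.c += a * b

    def reset_pulse(self):
        """RC reset-pulse enable: output the accumulated result and return to zero."""
        result, self.c = self.c, 0
        return result
```

Streaming the m elements of A against one column of B through `step` and then firing `reset_pulse` yields that column's dot product, with no intermediate value ever leaving the device — which is why no extra storage is consumed.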
The pipeline structure neural network matrix operation architecture further comprises:
a controller, which can be realized by a CPU, connected with the accelerator and used for dynamically configuring the number of columns m of the input vector A (equal to the number of rows of the input matrix B), the number of columns n of the input matrix B, and the number of counter pulses in the accelerator, so that the shifter controls the shift depth for the elements of the input vector A, the counter controls the RC reset-pulse enable after each multiply-add of the input vector A with one column block of the input matrix B is finished, and the counter clears its count after each complete pipelined multiply-add of the input vector A with the input matrix B is finished.
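The controller's role is pure configuration. A hypothetical register map for it might look as follows; the field names and the `iterations` helper are illustrative assumptions, not taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class AcceleratorConfig:
    """Sketch of the values the CPU controller writes before an operation."""
    m: int            # length of A == rows of B; the counter fires a reset pulse every m beats
    n: int            # columns of B
    shift_depth: int  # shift depth for the elements of input vector A

    def iterations(self, num_macs: int = 32) -> int:
        """Column-block iterations needed to cover all n columns of B."""
        return -(-self.n // num_macs)  # ceiling division
```

Because m, n, and the shift depth are written at run time rather than fixed in the circuit, the same accelerator adapts to different neural network scales, addressing the first disadvantage listed in the Background.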
In this embodiment, the number of fixed-point multiply-add devices and the number of third register chains are each equal to the number of columns contained in each column block of the input matrix B.
Specifically, the parallelism can be further expanded: increasing the number of fixed-point multiply-add devices, for example, allows more columns to be calculated at a time, further reducing the number of iterations.
The implementation of the computing architecture of the present invention is further described in conjunction with a preferred embodiment:
as shown in fig. 3, the structure can calculate 32 columns of multiply-add operations at a time, and two matrices are a in the example: 1*m and B: m is n, m and n can be configured to adapt to different neural network scales, in the example, the counter (Timer) is a loop counter or Timer, a reset pulse is output to perform accumulation result output and return to zero every time m times of multiply-add operation is completed, the MAC in the figure is a 16-bit fixed-point multiply-add device, which is total 32, i.e. in this embodiment, x=31, the number of columns of multiply-add operation performed by one iteration is 32, each time the MAC performs multiply-add operation, the operation formula is c=c+a, where a is a certain value in the input vector a, B is a certain value in the input matrix B, c is the accumulation result, and RC is the reset pulse enable. The whole matrix operation process is that the input vector a inputs 1 row and m columns of elements continuously to a second Register Chain (Register Chain), meanwhile, the input matrix B inputs every 32 columns of m rows and n columns as a unit, after the input matrix B inputs all m rows of elements corresponding to the 32 columns, the operation of D [1:32] =a [1:m ] =b [1:m ] [1:32] is completed, and then the next iteration is performed, wherein each iteration needs to input all elements of the input vector a, but different column blocks of B are selected, for example, the first iteration selects the 1 st to 32 th columns of B, and the second iteration selects the 33 rd to 64 th columns of B. 
After each iteration finishes, its result is output and the accumulation procedure is repeated; only the elements of B differ between iterations, while the accumulation rule stays the same. After the last 32-wide accumulation is completed, a new matrix arranged as a 1×n array has been obtained; at this point the counter performs one reset clear, and each such clear marks the completion of one operation on the two matrices. If a new matrix operation is started, the same procedure is repeated. The advantages of this matrix multiplication are as follows. First, it effectively reduces the energy consumption and latency that a CPU or DSP would incur for the same operation. Second, it avoids the CPU pattern of reading data, decoding, analyzing, executing, and finally outputting results: here the data are fed directly into the register chains and processed beat by beat, so no decoding is needed. Third, the design can be sized flexibly: for relatively large matrices the register chains can be made 32-bit or 64-bit wide, while to save hardware area and material cost the circuit can instead be designed as a 16-bit fixed-point matrix operation circuit, at the price of more loop iterations. Fourth, this matrix operation hardware circuit improves the utilization of the adders in the circuit and saves material cost.
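The embodiment's beat-level flow, including the Timer firing once every m multiply-adds, can be modeled in plain Python. This is a software stand-in only; `run_embodiment` and its loop structure are illustrative, with `num_macs` playing the role of the 32 MAC devices:

```python
def run_embodiment(A, B, num_macs=32):
    """Beat-level sketch of the Fig. 3 flow: A is streamed once per
    iteration while one num_macs-wide column block of B streams in;
    after m beats the Timer fires, the accumulators are output into D,
    and they return to zero for the next block."""
    m, n = len(B), len(B[0])
    D = []
    for j in range(0, n, num_macs):             # one iteration per column block of B
        width = min(num_macs, n - j)            # last block may be narrower
        acc = [0] * width                       # one accumulator per MAC device
        beats = 0                               # the Timer counts multiply-add beats
        for k in range(m):                      # A[k] reaches every MAC via the register chain
            for i in range(width):
                acc[i] += A[k] * B[k][j + i]    # c = c + a*b in each MAC
            beats += 1
        assert beats == m                       # Timer fires the reset pulse here
        D.extend(acc)                           # results output; accumulators cleared
    return D
```

Running this for a 1×3 vector against a 3×4 matrix with `num_macs=2` takes two iterations and reproduces the ordinary matrix-vector product, mirroring the "same repeated operation" described above.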
The invention also provides a method for realizing pipeline structure neural network matrix operation by a digital circuit, which comprises the following steps:
performing a pipelined multiply-add operation on an input vector A and an input matrix B through a digital circuit to obtain the result A×B=D, wherein A is a vector of dimension 1×m, B is an m×n matrix, and D is the output vector of 1 row and n columns;
the pipelined multiply-add operation means that the input matrix B is divided into a plurality of column blocks: the input vector A is multiplied and accumulated with the first column block of the input matrix B and the result is output, then the multiply-add of the input vector A with the next column block of the input matrix B is performed and its result output, and this iteration repeats until the multiply-add of the input vector A with the last column block of the input matrix B is completed and its result output, whereupon the complete product D of the input vector A and the input matrix B has been obtained.
In summary, the invention uses a large array of fixed-point multiply-add devices integrated with a counter and feeds data in cyclically. Like a pipeline, partial results are accumulated and returned to zero on schedule, so one large matrix-vector operation is split into small matrix-vector operations and x+1 parallel vector multiplications can be performed in sequence. Matrix-vector multiplication is thus executed in parallel, and compared with conventional CPU and DSP processing the speed is greatly improved. Intermediate results are stored locally without extra storage cost: for example, the m multiply-add results of the vector A with any column of the matrix B remain in the corresponding multiply-add device, with no data movement required. Input and output data are accessed sequentially, with the data to be calculated shifted in order through the shift registers. The controller only needs to read in the data of the input vector A in advance and read each column block of the input matrix B in batches; once all data have been read in, the matrix-vector multiplication is complete.
While the present invention has been described in detail through the foregoing description of the preferred embodiment, it should be understood that the foregoing description is not to be considered as limiting the invention. Many modifications and substitutions of the present invention will become apparent to those of ordinary skill in the art upon reading the foregoing. Accordingly, the scope of the invention should be limited only by the attached claims.

Claims (4)

1. A pipeline architecture neural network matrix operation architecture, comprising:
the accelerator is realized by a digital circuit and is used for carrying out a pipelined multiply-add operation on an input vector A and an input matrix B to obtain the result A×B=D, wherein A is a vector of dimension 1×m, B is an m×n matrix, and D is the output vector of 1 row and n columns; the pipelined multiply-add operation means that the input matrix B is divided into a plurality of column blocks: the input vector A is multiplied and accumulated with the first column block of the input matrix B and the result is output, then the multiply-add of the input vector A with the next column block of the input matrix B is performed and its result output, and this iteration repeats until the multiply-add of the input vector A with the last column block of the input matrix B is completed and its result output, whereupon the product D of the input vector A and the input matrix B is obtained;
wherein, the accelerator includes:
a fixed-point multiply-accumulate module for executing the pipelined multiply-add operation on the input vector A and the input matrix B; the fixed-point multiply-accumulate module comprises x+1 fixed-point multiply-add devices running in parallel; the two inputs of each fixed-point multiply-add device sequentially receive the 1-row, m-column elements of the vector A and the elements of the corresponding column in the current column block of the input matrix B, so that multiplication and accumulation are executed synchronously for each corresponding column of the block; after a block's calculation is completed, the accumulated results are output and returned to zero under the control of the counter reset pulse applied to the RC reset-pulse enable of each fixed-point multiply-add device, after which the multiply-accumulate of the input vector A with the next column block of the input matrix B is executed;
a counter for outputting a reset pulse each time the fixed-point multiply-add devices finish the multiply-add of the input vector A with one column block of the input matrix B, the pulse generating a pipeline reset signal through the first register chain to the RC reset-pulse enable of each fixed-point multiply-add device, and the counter clearing its own count each time the fixed-point multiply-accumulate module finishes a complete pipelined multiply-add of the input vector A with the input matrix B;
a shifter for controlling the shift depth for the elements of the input vector A;
a first register chain, through which the counter applies pulse control to the RC reset-pulse enable of each fixed-point multiply-add device;
a second register chain, through which the 1-row, m-column elements of the input vector A are sequentially input to each fixed-point multiply-add device;
and x+1 third register chains, through which the corresponding column elements in the current column block of the input matrix B are continuously input to the corresponding fixed-point multiply-add devices;
wherein the number of fixed-point multiply-add devices and of third register chains equals the number of columns contained in each column block of the input matrix B; the number of columns processed by one iteration of the multiply-add operation is x+1, each iteration inputs all elements of the input vector A but selects a different column block of the input matrix B, and the input matrix B is fed in units of x+1 of its n columns.
2. The pipeline architecture neural network matrix operation architecture of claim 1, further comprising:
a controller, connected with the accelerator, for dynamically configuring the number of columns m of the input vector A (equal to the number of rows of the input matrix B), the number of columns n of the input matrix B, and the number of counter pulses in the accelerator, so that the shifter controls the shift depth for the elements of the input vector A, the counter controls the RC reset-pulse enable after each multiply-add of the input vector A with one column block of the input matrix B is finished, and the counter clears its count after each complete pipelined multiply-add of the input vector A with the input matrix B is finished.
3. The pipeline architecture neural network matrix operation architecture of claim 2, wherein:
the controller is realized by a CPU.
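The parameters that the CPU controller of claims 2 and 3 writes into the accelerator can be summarized in a short sketch (the names and structure below are illustrative assumptions, not taken from the patent):

```python
from dataclasses import dataclass

@dataclass
class AcceleratorConfig:
    """Hypothetical register set a CPU controller would configure,
    per claims 2 and 3: vector/matrix dimensions and counter pulses."""
    a_cols: int       # number of columns of input vector A (equals m)
    b_rows: int       # m, rows of input matrix B
    b_cols: int       # n, columns of input matrix B
    block_width: int  # x + 1, columns per column block

    def counter_pulses(self):
        # One reset pulse per completed column block: ceil(n / (x+1))
        return -(-self.b_cols // self.block_width)

cfg = AcceleratorConfig(a_cols=2, b_rows=2, b_cols=4, block_width=2)
```

Here `counter_pulses` models the "number of counter pulses" the controller configures: one pipeline-reset event per column block of B.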
4. A method for pipeline structure neural network matrix operations implemented by digital circuits, comprising:
performing operations using the pipeline architecture neural network matrix operation architecture of claim 1; performing a pipelined multiply-add operation on an input vector A and an input matrix B through a digital circuit to obtain the result A×B=D, wherein A is a 1×m vector, B is an m×n matrix, and D is the 1-row, n-column output vector of the matrix operation;
the pipelined multiply-add operation means that the input matrix B is divided into a plurality of different column blocks; the input vector A and the first column block of the input matrix B are multiplied and added and the result is output; the multiply-add of the input vector A and the next column block of the input matrix B then proceeds and its result is output; this iteration is repeated until the multiply-add of the input vector A and the last column block of the input matrix B is completed and its result is output, thereby obtaining the multiplication result D of the input vector A and the input matrix B.
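As a behavioural reference for the method of claim 4 (a plain software model under the stated dimensions, not the claimed digital circuit), D = A×B can be computed block by block, with one accumulator standing in for each fixed-point multiply-add device:

```python
def pipelined_matvec(A, B, block_width):
    """Compute D = A x B (A is 1 x m, B is m x n) column block by column
    block, emitting each block's partial results before moving on, as in
    the claimed pipelined multiply-add method."""
    m, n = len(B), len(B[0])
    assert len(A) == m
    D = []
    for start in range(0, n, block_width):          # iterate over column blocks
        for col in range(start, min(start + block_width, n)):
            acc = 0                                  # one multiply-add device per column
            for k in range(m):                       # A's elements stream in sequentially
                acc += A[k] * B[k][col]
            D.append(acc)                            # output this block's column result
    return D

A = [1, 2]                     # 1 x m vector, m = 2
B = [[1, 2, 3, 4],
     [5, 6, 7, 8]]             # m x n matrix, n = 4
D = pipelined_matvec(A, B, 2)  # -> [11, 14, 17, 20]
```

Note that the block order only changes when each partial result is emitted; the final D is identical to an ordinary vector-matrix product.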
CN201810813920.3A 2018-07-23 2018-07-23 Pipeline structure neural network matrix operation architecture and method Active CN109144469B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810813920.3A CN109144469B (en) 2018-07-23 2018-07-23 Pipeline structure neural network matrix operation architecture and method


Publications (2)

Publication Number Publication Date
CN109144469A CN109144469A (en) 2019-01-04
CN109144469B true CN109144469B (en) 2023-12-05

Family

ID=64801554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810813920.3A Active CN109144469B (en) 2018-07-23 2018-07-23 Pipeline structure neural network matrix operation architecture and method

Country Status (1)

Country Link
CN (1) CN109144469B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276047B (en) * 2019-05-18 2023-01-17 南京惟心光电系统有限公司 Method for performing matrix vector multiplication operation by using photoelectric calculation array
CN110738311A (en) * 2019-10-14 2020-01-31 哈尔滨工业大学 LSTM network acceleration method based on high-level synthesis
CN110889259B (en) * 2019-11-06 2021-07-09 北京中科胜芯科技有限公司 Sparse matrix vector multiplication calculation unit for arranged block diagonal weight matrix
CN112434256B (en) * 2020-12-03 2022-09-13 海光信息技术股份有限公司 Matrix multiplier and processor
US20220293174A1 (en) * 2021-03-09 2022-09-15 International Business Machines Corporation Resistive memory device for matrix-vector multiplications
CN113266559B (en) * 2021-05-21 2022-10-28 华能秦煤瑞金发电有限责任公司 Neural network-based wireless detection method for concrete delivery pump blockage

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102662623A (en) * 2012-04-28 2012-09-12 电子科技大学 Parallel matrix multiplier based on single field programmable gate array (FPGA) and implementation method for parallel matrix multiplier
CN104572011A (en) * 2014-12-22 2015-04-29 上海交通大学 FPGA (Field Programmable Gate Array)-based general matrix fixed-point multiplier and calculation method thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5008833A (en) * 1988-11-18 1991-04-16 California Institute Of Technology Parallel optoelectronic neural network processors
CN105589677A (en) * 2014-11-17 2016-05-18 沈阳高精数控智能技术股份有限公司 Systolic structure matrix multiplier based on FPGA (Field Programmable Gate Array) and implementation method thereof


Also Published As

Publication number Publication date
CN109144469A (en) 2019-01-04

Similar Documents

Publication Publication Date Title
CN109144469B (en) Pipeline structure neural network matrix operation architecture and method
US11698773B2 (en) Accelerated mathematical engine
Nguyen et al. A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection
CN107704916B (en) Hardware accelerator and method for realizing RNN neural network based on FPGA
US8051124B2 (en) High speed and efficient matrix multiplication hardware module
US8443170B2 (en) Apparatus and method for performing SIMD multiply-accumulate operations
US5880981A (en) Method and apparatus for reducing the power consumption in a programmable digital signal processor
GB2474901A (en) Multiply-accumulate instruction which adds or subtracts based on a predicate value
Huynh Deep neural network accelerator based on FPGA
CN110674927A (en) Data recombination method for pulse array structure
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
Sun et al. An I/O bandwidth-sensitive sparse matrix-vector multiplication engine on FPGAs
Que et al. Recurrent neural networks with column-wise matrix–vector multiplication on FPGAs
CN116710912A (en) Matrix multiplier and control method thereof
US6622153B1 (en) Virtual parallel multiplier-accumulator
CN112074810B (en) Parallel processing apparatus
CN110716751B (en) High-parallelism computing platform, system and computing implementation method
CN110659014B (en) Multiplier and neural network computing platform
CN110647309A (en) High-speed big bit width multiplier
CN109343826B (en) Reconfigurable processor operation unit for deep learning
Zhuo et al. High-performance and area-efficient reduction circuits on FPGAs
US11789701B2 (en) Controlling carry-save adders in multiplication
Hormigo-Jiménez et al. High-Throughput DTW accelerator with minimum area in AMD FPGA by HLS
ŞTEFAN Integral Parallel Computation
Zhang et al. New approach for multiple vector reduction on FPGA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant