CN116992932A - Parameterized LSTM acceleration system for data off-chip block transmission and design method thereof - Google Patents


Info

Publication number
CN116992932A
CN116992932A (application CN202210436245.3A)
Authority
CN
China
Prior art keywords
lstm
ffb
parameters
parallel
chip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210436245.3A
Other languages
Chinese (zh)
Inventor
姚睿
赵杰
余永传
钱淑冰
田祥瑞
陈燕
游霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202210436245.3A priority Critical patent/CN116992932A/en
Publication of CN116992932A publication Critical patent/CN116992932A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893Caches characterised by their organisation or structure
    • G06F12/0897Caches characterised by their organisation or structure with two or more cache hierarchy levels
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781On-chip cache; Off-chip memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs


Abstract

The invention discloses a parameterized LSTM acceleration system for off-chip block transmission of data and a design method thereof, belonging to the technical field of neural network acceleration. The acceleration system comprises a parameterized parallel acceleration processing unit LAU, a data distributor DS, an off-chip memory module FFB, and on-chip memory modules LMB, LWB, LCB and LOB. In the design method, each module of the acceleration system is instantiated according to user-specified parameters, and the network weights are interleaved and stored in the FFB according to a specific rule; the external feature input is then stored in the LMB, the weight parameters are transmitted from the FFB to the LWB in cyclic blocks through the DS, and the LAU is time-division multiplexed to complete the network acceleration operations; the final operation result is stored in the LOB and output off-chip. Through the parameterized design and the blocked transmission of the network weights, the invention improves the flexibility and universality of the system and effectively balances the resources between large-scale network models and small-scale embedded systems.

Description

Parameterized LSTM acceleration system for data off-chip block transmission and design method thereof
Technical Field
The invention belongs to the technical field of neural network acceleration, relates to the design of an LSTM acceleration system based on an FPGA, and particularly relates to a parameterized LSTM acceleration system for data off-chip block transmission and a design method thereof.
Background
LSTM retains memory of historical data, largely avoids the gradient explosion and vanishing-gradient problems, and is particularly suited to processing sequential signals. It is a computation-intensive algorithm, and GPUs, ASICs or FPGAs are usually employed for acceleration; in the field of embedded edge computing, FPGAs are currently adopted for hardware acceleration in view of the embedded, low-power requirements. However, most existing FPGA-based LSTM acceleration systems are implemented for a fixed network model on a specific hardware platform, cannot be configured for different network structures and hardware platforms, and are therefore poor in universality and flexibility; moreover, the on-chip logic and storage resources of a hardware platform are limited, making it difficult to meet the computation and parameter-storage requirements of larger-scale LSTM networks. Therefore, there is a need for a parameterized LSTM acceleration system for off-chip block transfer of data and a method of designing the same.
Disclosure of Invention
The invention aims to address the above problems and shortcomings by providing a parameterized LSTM acceleration system for off-chip block transmission and a design method thereof. By parameterizing each component module of the LSTM acceleration system, the invention allows the system to be flexibly configured for different network structures and hardware platforms; meanwhile, the LSTM network weights are cyclically transmitted in blocks from off-chip to on-chip parallel operation units, which effectively balances the resources between large-scale networks and small-scale embedded systems and further improves the universality of the system.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows: the parameterized LSTM acceleration system for data off-chip block transmission is characterized by comprising a parameterized parallel acceleration processing unit LAU, a parameterized on-chip memory module ONM, a data distributor DS and a parameterized off-chip memory module FFB;
the LAU mainly comprises a front operation unit HPU and a back operation unit TPU; the ONM comprises an input feature storage module LMB, a weight storage module LWB, an intermediate state storage module LCB and a result storage module LOB;
external pending feature input x t The outputs of the FFB are transmitted to LWB in batches through DS, and the outputs of the LMB and the LWB are transmitted to the HPU of the LAU in parallel after being processed by DS; the output of the HPU is transmitted to the TPU module in parallel, the output of the LCB is transmitted to the TPU through the DS, and the output of the TPU is transmitted to the LCB and the LOB through the DS; the output of the LOB is transmitted to the LMB and is output to the off-chip giving system output h t
Further, the HPU comprises CV FP units, each FP unit comprising a parameterized multiply-accumulate unit MAU and a block cache accumulation unit ABF; the MAU performs RE n-bit fixed-point multiply-accumulate operations in parallel, and the ABF stores and accumulates N n-bit fixed-point values; the TPU comprises CT BP units, each BP unit performing activation and point multiplication operations in parallel; CV, CT, RE, N and n are all non-negative integer powers of 2, satisfy CV = 4*CT, and are all parameterizable.
Further, the storage widths of the FFB, LMB, LCB and LOB are n bits, and their storage depths are H_W*H_L, H_L, l_h and l_h respectively, where H_W = 4*l_h and H_L = l_x + l_h; l_x and l_h are the dimensions of the LSTM input vector and hidden vector respectively. The LWB consists of CV BRAM blocks, each with a storage depth of N*RE; N, RE, CV, n, l_h and l_x are all non-negative integer powers of 2 and are parameterizable.
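The storage dimensions above are all derived from the user parameters. The following Python sketch (illustrative only; the function and dictionary names are not from the patent) computes the width and depth of each memory module from the relations H_W = 4*l_h and H_L = l_x + l_h stated in the text:

```python
# Hedged sketch: derive the storage dimensions of the memory modules from
# the user parameters, following the relations given in the text.
# Names are illustrative, not from the patent.

def memory_dims(n, l_x, l_h, CV, N, RE):
    H_W = 4 * l_h        # weight-matrix dimension in the column direction
    H_L = l_x + l_h      # weight-matrix dimension in the row direction
    return {
        "FFB": (n, H_W * H_L),   # (width in bits, depth)
        "LMB": (n, H_L),
        "LCB": (n, l_h),
        "LOB": (n, l_h),
        "LWB": (n, CV, N * RE),  # CV BRAM blocks, each of depth N*RE
    }
```

With the parameters of the worked example later in the description (n = 8, l_x = l_h = 128, CV = 32, N = 4, RE = 8), this gives an FFB depth of 512*256 = 131072 and 32 LWB BRAM blocks of depth 32 each.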
In addition, the invention also provides a parameterized LSTM acceleration system design method for data off-chip block transmission, which is characterized by comprising the following steps:
(1) According to the LSTM parameters specified by the user, the design dimensions of each module in the LSTM acceleration system are determined, and each module is instantiated:
(1.1) determining the system design parameters from the user description, including the fixed-point data bit width n, the input-vector and hidden-vector dimensions l_x and l_h, and the values of CV, CT, RE and N;
(1.2) determining the design dimension of each component module of the LSTM acceleration system according to the storage characteristic requirement;
(1.3) instantiating each component module of the LSTM acceleration system according to the design dimensions of (1.2), and initializing all storage units and buffer contents to 0;
(2) According to the storage characteristic requirement, interleaving and storing LSTM weight matrix parameters to the FFB:
(2.1) let variable i=0;
(2.2) let CL = round(i/(H_W*RE)), CS = round((i % (CV*N*RE))/(4*N*RE)), k = i % RE, s = round(i/(N*RE)) % 4 and CW = round((i % (H_W*RE))/(CV*N*RE)); then:
if s=0, FFB[i] = Wi[CW*CT+CS][CL*RE+k];
if s=1, FFB[i] = Wf[CW*CT+CS][CL*RE+k];
if s=2, FFB[i] = Wg[CW*CT+CS][CL*RE+k];
if s=3, FFB[i] = Wo[CW*CT+CS][CL*RE+k];
where FFB[i] is the value of the storage unit at address i in the FFB; Wi[i][j], Wf[i][j], Wg[i][j] and Wo[i][j] are the weight parameters in the i-th row and j-th column of the input-gate, forget-gate, update-gate and output-gate weight matrices of the LSTM network; round() is the rounding-down (integer-division) function; and % denotes the remainder;
(2.3) let i = i+1 and repeat steps (2.2)-(2.3) until i = H_W*H_L;
(3) According to the storage feature requirements, the external feature input x_t to be processed and the output h_t of the LOB are stored to the LMB, where x_t is stored to the low addresses 0 ~ l_x-1 of the LMB and the output h_t of the LOB is stored to the high addresses l_x ~ H_L-1; the loop control parameters are initialized to CL = CW = CS = 0;
(4) And the DS circularly and sectionally transmits the weight parameters from the FFB to the LWB, and sends the weight parameters to the LAU to finish network acceleration operation, and the final operation result is stored in the LOB:
(4.1) RE feature parameters in the LMB, starting from address CL*RE, are simultaneously transmitted through the DS to the CV FPs of the HPU unit;
(4.2) CV*N*RE weight parameters, starting from address CW*(CV*N*RE) in the FFB, are transmitted through the DS to the CV BRAM blocks of the LWB, where the parameters at addresses CW*(CV*N*RE)+i*(N*RE) ~ CW*(CV*N*RE)+(i+1)*(N*RE)-1 are stored at addresses 0 ~ N*RE-1 of the i-th BRAM block (i = 0, 1, ..., CV-1); then let CS = 0;
(4.3) RE data, starting from address CS*RE in the CV BRAMs of the LWB, are transmitted in parallel through the DS to the CV FPs of the HPU, where the content of the i-th BRAM is sent to the i-th FP; the MAU of each FP completes the multiply-accumulate operation in parallel, and the calculation result is accumulated to address CS of its ABF;
(4.4) let CS = CS+1 and repeat step (4.3) until CS = N, completing the operation on N*RE parameters;
(4.5) let CW = CW+1 and repeat steps (4.2)-(4.3), cyclically transmitting the network weight parameters from off-chip to the CV FPs to complete the multiply-accumulate operations and accumulate the results, until CW = H_W/(N*CV), completing the calculation over all parameters in the column direction of the weight matrix;
(4.6) let CL = CL+1 and repeat steps (4.1)-(4.5), cyclically transmitting the network weight parameters from off-chip to the CV FPs to complete the multiply-accumulate operations and accumulate the results, until CL = H_L/RE-1;
(4.7) repeat steps (4.1)-(4.3), proceeding to (4.4) after the first calculation results are obtained; at the same time, every group of CV calculation results is transmitted, four values at a time, to the CT BP units of the TPU module for calculation, until CW = H_W/(N*CV), completing the operation on all parameters in the row direction of the LSTM weight matrix;
(5) The result in the LOB is output off-chip as the system output h_t; steps (3)-(4) are repeated to perform the calculation for the next time step.
Further, in step (4.7), the calculation performed by the CT BP units on every group of CV calculation results transmitted, four values at a time, to the TPU module is implemented as follows:
(4.7.1) the CV operation results are converted in parallel into the k-th elements i_t[k], f_t[k], g_t[k] and o_t[k] of the matrix-vector products of the input gate, forget gate, update gate and output gate, by the following method:
(4.7.1a) let vt be the value of the CS-th temporary storage unit of the ABF of the i-th FP, and let temp = (CW*N+CS)*CV+i;
(4.7.1b) let k = round(temp/4) and s = temp % 4, where round() is the rounding-down function and % denotes the remainder; then: if s=0, i_t[k] = vt; if s=1, f_t[k] = vt; if s=2, g_t[k] = vt; if s=3, o_t[k] = vt;
(4.7.2) the i_t[k], f_t[k], g_t[k] and o_t[k] obtained in step (4.7.1b) are transmitted in parallel, four values per group, to the CT BP units in the TPU, where the j-th group of data is transmitted to the j-th BP unit, j = k % CT (% denotes the remainder); at the same time, the value c_{t-1}[k] of the unit at address k in the LCB is transmitted through the DS to the corresponding BP;
(4.7.3) performing the following activation and point multiplication operations in parallel by the CT BP units, and storing the calculation results:
(4.7.3a) performing the activation operations in parallel: s_i_t[k] = sigmoid(i_t[k]), s_f_t[k] = sigmoid(f_t[k]), t_g_t[k] = tanh(g_t[k]) and s_o_t[k] = sigmoid(o_t[k]), obtaining the corresponding n-bit outputs s_i_t[k], s_f_t[k], t_g_t[k] and s_o_t[k], where sigmoid(x) = 1/(1+e^(-x)) and tanh(x) = (e^x - e^(-x))/(e^x + e^(-x));
(4.7.3b) performing the vector element point multiplication operations in parallel to obtain the corresponding n-bit outputs c_t[k] and h_t[k], where c_t[k] = s_i_t[k] ⊙ t_g_t[k] + s_f_t[k] ⊙ c_{t-1}[k] and h_t[k] = s_o_t[k] ⊙ tanh(c_t[k]); ⊙ denotes the element-wise product;
(4.7.3c) h_t[k] is transmitted through the DS to the LOB, and c_t[k] to the LCB.
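The per-element computation of step (4.7.3) is the standard LSTM cell update. The sketch below renders it in floating point rather than the n-bit fixed-point arithmetic of the hardware, so it only illustrates the dataflow of one BP unit for a single element k; the function names are illustrative, not from the patent:

```python
# Hedged sketch of step (4.7.3): activation of the four gate
# pre-activations, then the element-wise cell and hidden-state updates.
# Floating point is used here; the hardware uses n-bit fixed point.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bp_unit(i_t, f_t, g_t, o_t, c_prev):
    """One BP-unit step for a single element k."""
    s_i, s_f = sigmoid(i_t), sigmoid(f_t)
    t_g, s_o = math.tanh(g_t), sigmoid(o_t)
    c_t = s_i * t_g + s_f * c_prev   # c_t[k] = s_i ⊙ t_g + s_f ⊙ c_{t-1}[k]
    h_t = s_o * math.tanh(c_t)       # h_t[k] = s_o ⊙ tanh(c_t[k])
    return c_t, h_t
```

For example, with all gate pre-activations and the previous cell state equal to 0, the unit returns c_t = 0 and h_t = 0, since sigmoid(0) = 0.5 and tanh(0) = 0.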
Compared with the prior art, the invention has the following beneficial effects: the parameterized design of each component module of the LSTM acceleration system allows the system to be flexibly configured for different network structures and hardware platforms, improving its flexibility; meanwhile, the off-chip cyclic block transmission of the LSTM network weights effectively balances the resources of large network models and small embedded systems, improving the universality of the system.
Drawings
FIG. 1 is a block diagram of an acceleration system according to the present invention
FIG. 2 is a schematic diagram of the parameterized parallel acceleration processing unit LAU according to the present invention
FIG. 3 shows the storage width and depth of each storage module according to the present invention
FIG. 4 is a schematic diagram showing the interleaving of LSTM weight matrix parameters in the FFB according to the present invention
Detailed Description
The following detailed description of the embodiments of the invention is exemplary and is provided merely to illustrate the invention and is not to be construed as limiting the invention. The parameterized LSTM network acceleration system for off-chip block transmission and the design method thereof are described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the parameterized LSTM acceleration system for off-chip block transmission of data according to the present invention includes a parameterized parallel acceleration processing unit LAU, a parameterized on-chip memory module ONM, a data distributor DS, and a parameterized off-chip memory module FFB;
the LAU mainly comprises a front operation unit HPU and a back operation unit TPU; the ONM comprises an input feature storage module LMB, a weight storage module LWB, an intermediate state storage module LCB and a result storage module LOB;
external pending feature input x t The outputs of the FFB are transmitted to LWB in batches through DS, and the outputs of the LMB and the LWB are transmitted to the HPU of the LAU in parallel after being processed by DS; the output of the HPU is transmitted to the TPU module in parallel, the output of the LCB is transmitted to the TPU through the DS, and the output of the TPU is transmitted to the LCB and the LOB through the DS; the output of the LOB is transmitted to the LMB and is output to the off-chip giving system output h t
As shown in fig. 2, the HPU includes CV FP units, each FP unit including a parameterized multiply-accumulate unit MAU and a block cache accumulation unit ABF; the MAU performs RE n-bit fixed-point multiply-accumulate operations in parallel, and the ABF stores and accumulates N n-bit fixed-point values; the TPU comprises CT BP units, each BP unit performing activation and point multiplication operations in parallel; CV, CT, RE, N and n are all non-negative integer powers of 2, satisfy CV = 4*CT, and are all parameterizable.
As shown in FIG. 3, the storage widths of the FFB, LMB, LCB and LOB are n bits, and their storage depths are H_W*H_L, H_L, l_h and l_h respectively, where H_W = 4*l_h and H_L = l_x + l_h; l_x and l_h are the dimensions of the LSTM input vector and hidden vector respectively. The LWB consists of CV BRAM blocks, each with a storage depth of N*RE; N, RE, CV, n, l_h and l_x are all non-negative integer powers of 2 and are parameterizable.
The method for designing the parameterized LSTM acceleration system for data off-chip block transmission comprises the following steps.
(1) According to the LSTM parameters specified by the user, the design dimensions of each module in the LSTM acceleration system are determined, and each module is instantiated:
(1.1) determining the system design parameters from the user description, assuming a fixed-point data bit width n = 8 bits, input-vector and hidden-vector dimensions l_x = l_h = 128, CV = 32, CT = 8, RE = 8 and N = 4;
(1.2) according to the storage feature requirements, determining the design dimensions of each component module of the LSTM acceleration system from l_x and l_h: the dimension of the weight matrix in the column direction is H_W = 512 and in the row direction H_L = 256;
(1.3) instantiating each component module of the LSTM acceleration system according to the design dimension in (1.2), and initializing all storage units and buffered contents to 0.
(2) According to the storage feature requirements, the LSTM weight matrix parameters are interleaved and stored to the FFB as follows:
(2.1) let variable i=0;
(2.2) let CL = round(i/(H_W*RE)) = round(i/(512*8)) = round(i/4096), CS = round((i % (CV*N*RE))/(4*N*RE)) = round((i % (32*4*8))/(4*4*8)) = round((i % 1024)/128), k = i % RE = i % 8, s = round(i/(N*RE)) % 4 = round(i/32) % 4 and CW = round((i % (H_W*RE))/(CV*N*RE)) = round((i % (512*8))/(32*4*8)) = round((i % 4096)/1024); then:
if s=0, FFB[i] = Wi[CW*CT+CS][CL*8+k];
if s=1, FFB[i] = Wf[CW*CT+CS][CL*8+k];
if s=2, FFB[i] = Wg[CW*CT+CS][CL*8+k];
if s=3, FFB[i] = Wo[CW*CT+CS][CL*8+k];
where FFB[i] is the value of the storage unit at address i in the FFB; Wi[i][j], Wf[i][j], Wg[i][j] and Wo[i][j] are the weight parameters in the i-th row and j-th column of the input-gate, forget-gate, update-gate and output-gate weight matrices of the LSTM network; round() is the rounding-down (integer-division) function; and % denotes the remainder;
(2.3) let i = i+1 and repeat steps (2.2)-(2.3) until i = H_W*H_L = 512*256 = 131072.
According to the method, an interleaving storage schematic diagram of the LSTM weight matrix parameters in the FFB is shown in fig. 4.
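The interleaving rule of step (2) can be transcribed directly into software for checking. The sketch below is an illustrative Python rendering (not from the patent) of the index formulas, reading round() as integer division, which is the interpretation that keeps every index in range:

```python
# Hedged sketch: literal transcription of the index formulas of steps
# (2.1)-(2.3); round() in the text is taken as floor (integer division).
# Gate codes 0..3 correspond to Wi, Wf, Wg, Wo.

def ffb_entry(i, H_W, RE, CV, N, CT):
    """Map FFB address i to (gate, row, col) of the weight matrices."""
    CL = i // (H_W * RE)
    CS = (i % (CV * N * RE)) // (4 * N * RE)
    k = i % RE
    s = (i // (N * RE)) % 4
    CW = (i % (H_W * RE)) // (CV * N * RE)
    return s, CW * CT + CS, CL * RE + k
```

With the worked-example parameters (H_W = 512, RE = 8, CV = 32, N = 4, CT = 8), address 0 maps to Wi[0][0], address 32 = N*RE to Wf[0][0] (the gate index s advances every N*RE addresses), and address 4096 = H_W*RE to Wi[0][8] (the column block CL advances every H_W*RE addresses).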
(3) According to the storage feature requirements, the external feature input x_t to be processed and the output h_t of the LOB are stored to the LMB, where x_t is stored to the low addresses 0 ~ 127 of the LMB and the output h_t of the LOB is stored to the high addresses 128 ~ 255; the loop control parameters are initialized to CL = CW = CS = 0.
(4) And the DS circularly and sectionally transmits the weight parameters from the FFB to the LWB, and sends the weight parameters to the LAU to finish network acceleration operation, and the final operation result is stored in the LOB:
(4.1) 8 feature parameters in the LMB, starting from address CL*8, are simultaneously transmitted through the DS to the 32 FPs of the HPU unit;
(4.2) 32*4*8 = 1024 weight parameters, starting from address CW*1024 in the FFB, are transmitted through the DS to the 32 BRAM blocks of the LWB, where the parameters at addresses CW*1024+i*32 ~ CW*1024+(i+1)*32-1 are stored at addresses 0 ~ 31 of the i-th BRAM block (i = 0, 1, ..., 31); then let CS = 0;
(4.3) 8 data, starting from address CS*8 in the 32 BRAMs of the LWB, are transmitted in parallel through the DS to the 32 FPs of the HPU, where the content of the i-th BRAM is sent to the i-th FP; the MAU of each FP completes the multiply-accumulate operation in parallel, and the calculation result is accumulated to address CS of its ABF;
(4.4) let CS = CS+1 and repeat step (4.3) until CS = 4, completing the operation on 4*8 = 32 parameters;
(4.5) let CW = CW+1 and repeat steps (4.2)-(4.3), cyclically transmitting the network weight parameters from off-chip to the 32 FPs to complete the multiply-accumulate operations and accumulate the results, until CW = 512/(4*32) = 4, completing the calculation over all parameters in the column direction of the weight matrix;
(4.6) let CL = CL+1 and repeat steps (4.1)-(4.5), cyclically transmitting the network weight parameters from off-chip to the 32 FPs to complete the multiply-accumulate operations and accumulate the results, until CL = 256/8-1 = 31;
(4.7) repeat steps (4.1)-(4.3), proceeding to (4.4) after the first calculation results are obtained; at the same time, every group of 32 operation results is transmitted, four values at a time, to the 8 BP units of the TPU module for calculation, until CW = 512/(4*32) = 4, completing the operation on all parameters in the row direction of the LSTM weight matrix. The calculation performed by the 8 BP units on every group of calculation results is implemented as follows:
(4.7.1) the 32 operation results are converted in parallel into the k-th elements i_t[k], f_t[k], g_t[k] and o_t[k] of the matrix-vector products of the input gate, forget gate, update gate and output gate, by the following method:
(4.7.1a) let vt be the value of the CS-th temporary storage unit of the ABF of the i-th FP, and let temp = (CW*4+CS)*32+i;
(4.7.1b) let k = round(temp/4) and s = temp % 4, where round() is the rounding-down function and % denotes the remainder; then: if s=0, i_t[k] = vt; if s=1, f_t[k] = vt; if s=2, g_t[k] = vt; if s=3, o_t[k] = vt;
(4.7.2) the i_t[k], f_t[k], g_t[k] and o_t[k] obtained in step (4.7.1b) are transmitted in parallel, four values per group, to the 8 BP units in the TPU, where the j-th group of data is transmitted to the j-th BP unit, j = k % 8 (% denotes the remainder); at the same time, the value c_{t-1}[k] of the unit at address k in the LCB is transmitted through the DS to the corresponding BP;
(4.7.3) the following activation and point multiplication operations are performed in parallel by the 8 BP units, and the calculation results are stored:
(4.7.3a) performing the activation operations in parallel: s_i_t[k] = sigmoid(i_t[k]), s_f_t[k] = sigmoid(f_t[k]), t_g_t[k] = tanh(g_t[k]) and s_o_t[k] = sigmoid(o_t[k]), obtaining the corresponding 8-bit outputs s_i_t[k], s_f_t[k], t_g_t[k] and s_o_t[k], where sigmoid(x) = 1/(1+e^(-x)) and tanh(x) = (e^x - e^(-x))/(e^x + e^(-x));
(4.7.3b) performing the vector element point multiplication operations in parallel to obtain the corresponding 8-bit outputs c_t[k] and h_t[k], where c_t[k] = s_i_t[k] ⊙ t_g_t[k] + s_f_t[k] ⊙ c_{t-1}[k] and h_t[k] = s_o_t[k] ⊙ tanh(c_t[k]); ⊙ denotes the element-wise product;
(4.7.3c) h_t[k] is transmitted through the DS to the LOB, and c_t[k] to the LCB.
(5) The result in the LOB is output off-chip as the system output h_t; steps (3)-(4) are repeated to perform the calculation for the next time step.
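As a plausibility check on the loop nest of step (4), the sketch below counts the multiply-accumulate work with the worked-example parameters; the loop bounds follow the text (CL = 0..H_L/RE-1, CW = 0..H_W/(N*CV)-1, CS = 0..N-1), while the counting itself is an illustration, not part of the patent:

```python
# Hedged sketch: count the multiply-accumulate operations of the CL/CW/CS
# loop nest of step (4) to check that the tiling covers the whole weight
# matrix exactly once per time step.

l_x = l_h = 128
CV, CT, RE, N = 32, 8, 8, 4
H_W, H_L = 4 * l_h, l_x + l_h     # 512 and 256

macs = 0
for CL in range(H_L // RE):            # 32 feature blocks
    for CW in range(H_W // (N * CV)):  # 4 weight blocks per feature block
        for CS in range(N):            # 4 sub-blocks per BRAM transfer
            macs += CV * RE            # 32 FPs, each doing 8 MACs in parallel
```

The count comes out to 512*256 = 131072, i.e. exactly one multiply-accumulate per weight parameter, matching the FFB depth H_W*H_L.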
The foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes or direct or indirect application in other related technical fields are included in the scope of the present invention.

Claims (5)

1. The LSTM acceleration system for data off-chip block transmission is characterized by comprising a parameterized parallel acceleration processing unit LAU, a parameterized on-chip memory module ONM, a data distributor DS and a parameterized off-chip memory module FFB;
the LAU mainly comprises a front operation unit HPU and a back operation unit TPU; the ONM comprises an input feature storage module LMB, a weight storage module LWB, an intermediate state storage module LCB and a result storage module LOB;
external pending feature input x t The outputs of the FFB are transmitted to LWB in batches through DS, and the outputs of the LMB and the LWB are transmitted to the HPU of the LAU in parallel after being processed by DS; the output of the HPU is transmitted to the TPU module in parallel, the output of the LCB is transmitted to the TPU through the DS, and the output of the TPU is transmitted to the LCB and the LOB through the DS; the output of the LOB is transmitted to the LMB and is output to the off-chip giving system output h t
2. The LSTM acceleration system for off-chip block transfer of claim 1, wherein the HPU includes CV FP units, each FP unit including a parameterized multiply-accumulate unit MAU and a block cache accumulation unit ABF; the MAU performs RE n-bit fixed-point multiply-accumulate operations in parallel, and the ABF stores and accumulates N n-bit fixed-point values; the TPU comprises CT BP units, each BP unit performing activation and point multiplication operations in parallel; CV, CT, RE, N and n are all non-negative integer powers of 2, satisfy CV = 4*CT, and are all parameterizable.
3. The LSTM acceleration system for off-chip block transmission of claim 1, wherein the storage widths of said FFB, LMB, LCB and LOB are n bits and their storage depths are H_W*H_L, H_L, l_h and l_h respectively, where H_W = 4*l_h and H_L = l_x + l_h; l_x and l_h are the dimensions of the LSTM input vector and hidden vector respectively; the LWB consists of CV BRAM blocks, each with a storage depth of N*RE; N, RE, CV, n, l_h and l_x are all non-negative integer powers of 2 and are parameterizable.
4. The LSTM acceleration system design method for data off-chip block transmission is characterized by comprising the following steps of:
(1) According to the LSTM parameters specified by the user, the design dimensions of each module in the LSTM acceleration system of claim 1 are determined, and each module is instantiated:
(1.1) determining the system design parameters from the user description, including the fixed-point data bit width n, the input-vector and hidden-vector dimensions l_x and l_h, and the values of CV, CT, RE and N;
(1.2) determining the design dimensions of the constituent modules of the LSTM acceleration system of claim 1 based on the feature requirements of claims 2 and 3;
(1.3) instantiating each constituent module of the LSTM acceleration system of claim 1 according to the design dimension of (1.2), and initializing all storage units and buffered contents to 0.
(2) According to the storage feature requirement of claim 3, the LSTM weight matrix parameters are interleaved and stored to the FFB as follows:
(2.1) let variable i=0;
(2.2) let CL = round(i/(H_W*RE)), CS = round((i % (CV*N*RE))/(4*N*RE)), k = i % RE, s = round(i/(N*RE)) % 4 and CW = round((i % (H_W*RE))/(CV*N*RE)); then:
if s=0, FFB[i] = Wi[CW*CT+CS][CL*RE+k];
if s=1, FFB[i] = Wf[CW*CT+CS][CL*RE+k];
if s=2, FFB[i] = Wg[CW*CT+CS][CL*RE+k];
if s=3, FFB[i] = Wo[CW*CT+CS][CL*RE+k];
where FFB[i] is the value of the storage unit at address i in the FFB; Wi[i][j], Wf[i][j], Wg[i][j] and Wo[i][j] are the weight parameters in the i-th row and j-th column of the input-gate, forget-gate, update-gate and output-gate weight matrices of the LSTM network; round() is the rounding-down (integer-division) function; and % denotes the remainder;
(2.3) let i = i+1 and repeat steps (2.2)-(2.3) until i = H_W*H_L;
(3) According to the storage feature requirement of claim 3, the external feature input x_t to be processed and the output h_t of the LOB are stored to the LMB, where x_t is stored to the low addresses 0 ~ l_x-1 of the LMB and the output h_t of the LOB is stored to the high addresses l_x ~ H_L-1; the loop control parameters are initialized to CL = CW = CS = 0;
(4) And the DS circularly and sectionally transmits the weight parameters from the FFB to the LWB, and sends the weight parameters to the LAU to finish network acceleration operation, and the final operation result is stored in the LOB:
(4.1) RE feature parameters in the LMB, starting from address CL*RE, are simultaneously transmitted through the DS to the CV FPs of the HPU unit;
(4.2) CV*N*RE weight parameters, starting from address CW*(CV*N*RE) in the FFB, are transmitted through the DS to the CV BRAM blocks of the LWB, where the parameters at addresses CW*(CV*N*RE)+i*(N*RE) ~ CW*(CV*N*RE)+(i+1)*(N*RE)-1 are stored at addresses 0 ~ N*RE-1 of the i-th BRAM block (i = 0, 1, ..., CV-1); then let CS = 0;
(4.3) transmitting the RE data starting from address CS*RE in each of the CV BRAMs of the LWB in parallel to the CV FPs of the HPU through the DS, where the content of the i-th BRAM is sent to the i-th FP; the MAU of each FP completes the multiply-accumulate operation in parallel and adds the result to address CS of its ABF;
(4.4) let CS = CS+1 and repeat step (4.3) until CS = N, completing the operation on N*RE parameters;
(4.5) let CW = CW+1 and repeat steps (4.2)–(4.3), cyclically transmitting the network weight parameters from off-chip to the CV FPs for multiply-accumulate operations and accumulating the results, until CW = H_W, completing the calculation over all parameters in the column direction of the weight matrix;
(4.6) let CL = CL+1 and repeat steps (4.1)–(4.5), cyclically transmitting the network weight parameters from off-chip to the CV FPs for multiply-accumulate operations and accumulating the results, until CL = H_L/RE−1;
(4.7) repeat steps (4.1)–(4.3); once the first calculation result is available, proceed to (4.4) while simultaneously transmitting the calculation results, every 4*CV as a group, to the CT BPs of the TPU module for computation, until CW = H_W, completing the (N:CV) operation on all parameters in the row direction of the LSTM weight matrix;
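The loop nest of steps (4.1)–(4.6) amounts to a tiled matrix–vector product. The sketch below models it in plain NumPy as an illustration only: the hardware parallelism over the CV FPs is serialized, the block sizes are hypothetical, and the output-row mapping r = (CW*N+CS)*CV+i follows step (4.7.1a).

```python
import numpy as np

def blocked_matvec(W, x, RE, CV, N):
    """Tiled matrix-vector product mirroring steps (4.1)-(4.6):
    (4.1)/(4.6) stream the input vector RE elements at a time (one LMB segment),
    (4.2)/(4.5) fetch one off-chip weight block per CW step,
    (4.3)/(4.4) let CV lanes (FPs) each accumulate N partial sums (ABF slots)."""
    rows, cols = W.shape
    assert cols % RE == 0 and rows % (CV * N) == 0
    y = np.zeros(rows)                            # ABF accumulators
    for CL in range(cols // RE):                  # feature segments
        x_seg = x[CL * RE:(CL + 1) * RE]          # (4.1) RE features to all FPs
        for CW in range(rows // (CV * N)):        # (4.2)/(4.5) weight blocks
            for CS in range(N):                   # (4.4) accumulator slots
                for i in range(CV):               # (4.3) CV FPs in parallel
                    r = (CW * N + CS) * CV + i    # row mapping of (4.7.1a)
                    y[r] += W[r, CL * RE:(CL + 1) * RE] @ x_seg
    return y
```

The function is numerically equivalent to W @ x; the value of the tiling is that only CV*N*RE weights need to be resident on-chip at any time.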
(5) Output the result in the LOB off-chip as the system output h_t, then repeat steps (3)–(4) to perform the calculation for the next time step.
5. The LSTM acceleration system design method for data off-chip block transmission according to claim 4, wherein in step (4.7) the operation on the groups of 4*CV calculation results transmitted to the CT BPs of the TPU module is implemented as follows:
(4.7.1) converting the CV calculation results in parallel into the k-th elements i_t[k], f_t[k], g_t[k] and o_t[k] of the input gate, forget gate, update gate and output gate matrix-vector multiplication results as follows:
(4.7.1a) let vt be the value of the CS-th temporary storage unit of the ABF of the i-th FP, and let the variable temp = (CW*N+CS)*CV+i;
(4.7.1b) let the variables k = round(temp/4) and s = temp%4, where round() is a rounding function and % denotes the remainder; then: if s=0, i_t[k] = vt; if s=1, f_t[k] = vt; if s=2, g_t[k] = vt; if s=3, o_t[k] = vt;
(4.7.2) transmit the i_t[k], f_t[k], g_t[k] and o_t[k] obtained in step (4.7.1b) in parallel, four at a time as a group, to the CT BP units in the TPU, where the j-th group of data is sent to the j-th BP unit, j = k%CT (% denotes the remainder), and the value c_{t-1}[k] of the unit at address k in the LCB is transmitted to the corresponding BP through the DS;
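A minimal sketch of the index mapping in steps (4.7.1b)–(4.7.2), again reading round() as integer division and using the gate order given in step (4.7.1b):

```python
def dispatch(temp, CT):
    """Map the flat result index temp to the gate slot s, the element
    index k, and the BP unit j that receives group k (steps (4.7.1b)-(4.7.2))."""
    k = temp // 4          # element index shared by the four gates
    s = temp % 4           # 0: i_t, 1: f_t, 2: g_t, 3: o_t
    j = k % CT             # group k is routed to BP unit j
    return k, s, j
```

Four consecutive temp values thus yield the (i_t, f_t, g_t, o_t) quadruple for one element k, and successive quadruples round-robin over the CT BP units.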
(4.7.3) the CT BP units perform the following activation and point-multiplication operations in parallel and store the calculation results:
(4.7.3a) perform the activation operations in parallel: s_i_t[k] = sigmoid(i_t[k]), s_f_t[k] = sigmoid(f_t[k]), t_g_t[k] = tanh(g_t[k]) and s_o_t[k] = sigmoid(o_t[k]), obtaining the corresponding n-bit outputs s_i_t[k], s_f_t[k], t_g_t[k] and s_o_t[k], where sigmoid(x) = 1/(1+e^(-x)) and tanh(x) = (e^x − e^(-x))/(e^x + e^(-x));
(4.7.3b) perform the vector element point-multiplication operations in parallel to obtain the corresponding n-bit outputs c_t[k] and h_t[k], where c_t[k] = s_i_t[k] ⊙ t_g_t[k] + s_f_t[k] ⊙ c_{t-1}[k] and h_t[k] = s_o_t[k] ⊙ tanh(c_t[k]), with ⊙ denoting the element-wise product;
(4.7.3c) transmit h_t[k] to the LOB and c_t[k] to the LCB through the DS.
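The per-BP computation of step (4.7.3) is the standard element-wise LSTM cell update. A floating-point sketch follows; the claim's n-bit fixed-point datapath is not modeled here.

```python
import numpy as np

def bp_pointwise(i_t, f_t, g_t, o_t, c_prev):
    """Step (4.7.3): gate activations followed by the element-wise
    state and output update of one BP unit."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    s_i, s_f, s_o = sigmoid(i_t), sigmoid(f_t), sigmoid(o_t)   # (4.7.3a)
    t_g = np.tanh(g_t)
    c_t = s_i * t_g + s_f * c_prev   # (4.7.3b) c_t = s_i ⊙ t_g + s_f ⊙ c_{t-1}
    h_t = s_o * np.tanh(c_t)         #          h_t = s_o ⊙ tanh(c_t)
    return c_t, h_t
```

Because every element k is updated independently, the CT BP units can evaluate this in parallel exactly as steps (4.7.2)–(4.7.3) distribute the work.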
CN202210436245.3A 2022-04-24 2022-04-24 Parameterized LSTM acceleration system for data off-chip block transmission and design method thereof Pending CN116992932A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210436245.3A CN116992932A (en) 2022-04-24 2022-04-24 Parameterized LSTM acceleration system for data off-chip block transmission and design method thereof

Publications (1)

Publication Number Publication Date
CN116992932A true CN116992932A (en) 2023-11-03

Family

ID=88528886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210436245.3A Pending CN116992932A (en) 2022-04-24 2022-04-24 Parameterized LSTM acceleration system for data off-chip block transmission and design method thereof

Country Status (1)

Country Link
CN (1) CN116992932A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118070865A (en) * 2024-04-25 2024-05-24 北京壁仞科技开发有限公司 Optimization method and device of artificial intelligent model, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Wang et al. Approximate policy-based accelerated deep reinforcement learning
CN108416422B (en) FPGA-based convolutional neural network implementation method and device
CN108090560A (en) The design method of LSTM recurrent neural network hardware accelerators based on FPGA
US20180164866A1 (en) Low-power architecture for sparse neural network
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
CN107633298B (en) Hardware architecture of recurrent neural network accelerator based on model compression
CN109409512B (en) Flexibly configurable neural network computing unit, computing array and construction method thereof
CN109409510B (en) Neuron circuit, chip, system and method thereof, and storage medium
US20140344203A1 (en) Neural network computing apparatus and system, and method therefor
CN108763159A (en) To arithmetic accelerator before a kind of LSTM based on FPGA
CN106127301A (en) A kind of stochastic neural net hardware realization apparatus
CN108304926B (en) Pooling computing device and method suitable for neural network
CN108205704A (en) A kind of neural network chip
CN114282678A (en) Method for training machine learning model and related equipment
CN112381209A (en) Model compression method, system, terminal and storage medium
CN111767994A (en) Neuron calculation module
CN116992932A (en) Parameterized LSTM acceleration system for data off-chip block transmission and design method thereof
CN111831354A (en) Data precision configuration method, device, chip array, equipment and medium
CN111831359A (en) Weight precision configuration method, device, equipment and storage medium
CN109685208B (en) Method and device for thinning and combing acceleration of data of neural network processor
CN110689123A (en) Long-short term memory neural network forward acceleration system and method based on pulse array
CN114519425A (en) Convolution neural network acceleration system with expandable scale
CN111831356B (en) Weight precision configuration method, device, equipment and storage medium
CN109978143B (en) Stack type self-encoder based on SIMD architecture and encoding method
RU2294561C2 (en) Device for hardware realization of probability genetic algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination