CN116992932A - Parameterized LSTM acceleration system for data off-chip block transmission and design method thereof - Google Patents


Info

Publication number
CN116992932A
CN116992932A (application CN202210436245.3A)
Authority
CN
China
Prior art keywords
lstm
ffb
parameters
parallel
chip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210436245.3A
Other languages
Chinese (zh)
Inventor
姚睿
赵杰
余永传
钱淑冰
田祥瑞
陈燕
游霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN202210436245.3A priority Critical patent/CN116992932A/en
Publication of CN116992932A publication Critical patent/CN116992932A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893Caches characterised by their organisation or structure
    • G06F12/0897Caches characterised by their organisation or structure with two or more cache hierarchy levels
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781On-chip cache; Off-chip memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs


Abstract

The invention discloses a parameterized LSTM acceleration system for off-chip block transmission of data and a design method thereof, belonging to the technical field of neural network acceleration. The acceleration system comprises a parameterized parallel acceleration processing unit LAU, a data distributor DS, an off-chip memory module FFB, and on-chip memory modules LMB, LWB, LCB and LOB. In the design method, each module of the acceleration system is instantiated according to user-specified parameters, and the network weights are interleaved and stored in the FFB according to a specific rule; the external feature input is then stored in the LMB, the weight parameters are transmitted from the FFB to the LWB in cyclic blocks through the DS, and the LAU is time-division multiplexed to complete the network acceleration operations; the final operation result is stored in the LOB and output off-chip. Through the parameterized design and the blocked transmission of the network weights, the invention improves the flexibility and universality of the system and effectively balances the resources between large-scale network models and small-scale embedded systems.

Description

Parameterized LSTM acceleration system for data off-chip block transmission and design method thereof
Technical Field
The invention belongs to the technical field of neural network acceleration, relates to the design of an LSTM acceleration system based on an FPGA, and particularly relates to a parameterized LSTM acceleration system for data off-chip block transmission and a design method thereof.
Background
LSTM retains memory of historical data, largely avoids the gradient explosion and vanishing-gradient problems, and is particularly suited to processing sequential signals. It is a computation-intensive algorithm, and GPUs, ASICs or FPGAs are usually employed for acceleration; in the field of embedded edge computing, FPGAs are currently adopted for hardware acceleration in view of the embedded, low-power requirements. However, most existing FPGA-based LSTM acceleration systems are implemented for a fixed network model on a specific hardware platform, cannot be configured for different network structures and hardware platforms, and are therefore poor in universality and flexibility; moreover, the on-chip logic and storage resources of a hardware platform are limited, making it difficult to meet the computation and parameter-storage requirements of larger-scale LSTM networks. Therefore, there is a need for a parameterized LSTM acceleration system for off-chip block transfer of data and a method of designing the same.
Disclosure of Invention
The invention aims to address the above problems and shortcomings by providing a parameterized LSTM acceleration system for off-chip block transmission and a design method thereof. By parameterizing each component module of the LSTM acceleration system, the invention allows the system to be flexibly configured for different network structures and hardware platforms; meanwhile, the LSTM network weights are cyclically transmitted in blocks from off-chip to on-chip parallel operation units, which effectively balances the resources between large-scale networks and small-scale embedded systems and further improves the universality of the system.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows: the parameterized LSTM acceleration system for data off-chip block transmission is characterized by comprising a parameterized parallel acceleration processing unit LAU, a parameterized on-chip memory module ONM, a data distributor DS and a parameterized off-chip memory module FFB;
the LAU mainly comprises a front operation unit HPU and a back operation unit TPU; the ONM comprises an input feature storage module LMB, a weight storage module LWB, an intermediate state storage module LCB and a result storage module LOB;
external pending feature input x t The outputs of the FFB are transmitted to LWB in batches through DS, and the outputs of the LMB and the LWB are transmitted to the HPU of the LAU in parallel after being processed by DS; the output of the HPU is transmitted to the TPU module in parallel, the output of the LCB is transmitted to the TPU through the DS, and the output of the TPU is transmitted to the LCB and the LOB through the DS; the output of the LOB is transmitted to the LMB and is output to the off-chip giving system output h t
Further, the HPU comprises CV FP units, each FP unit comprising a parameterized multiply-accumulate unit MAU and a block cache accumulation unit ABF; the MAU performs RE n-bit fixed-point multiply-accumulate operations in parallel, and the ABF stores and accumulates N n-bit fixed-point values; the TPU comprises CT BP units, each BP unit performing activation and point multiplication operations in parallel; CV, CT, RE, N and n are all non-negative integer powers of 2, satisfy CV = 4*CT, and are all parameterizable.
Further, the storage widths of the FFB, LMB, LCB and LOB are n bits, and their storage depths are H_W*H_L, H_L, l_h and l_h respectively, where H_W = 4*l_h and H_L = l_x + l_h; l_x and l_h are the dimensions of the LSTM input vector and hidden vector respectively. The LWB consists of CV BRAM blocks, each with a storage depth of N*RE; N, RE, CV, n, l_h and l_x are all non-negative integer powers of 2 and are parameterizable.
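The storage dimensions above are all derived from the user parameters. The following Python sketch (illustrative only; the function and dictionary names are not from the patent) computes the width and depth of each memory module from the relations H_W = 4*l_h and H_L = l_x + l_h stated in the text:

```python
# Hedged sketch: derive the storage dimensions of the memory modules from
# the user parameters, following the relations given in the text.
# Names are illustrative, not from the patent.

def memory_dims(n, l_x, l_h, CV, N, RE):
    H_W = 4 * l_h        # weight-matrix dimension in the column direction
    H_L = l_x + l_h      # weight-matrix dimension in the row direction
    return {
        "FFB": (n, H_W * H_L),   # (width in bits, depth)
        "LMB": (n, H_L),
        "LCB": (n, l_h),
        "LOB": (n, l_h),
        "LWB": (n, CV, N * RE),  # CV BRAM blocks, each of depth N*RE
    }
```

With the parameters of the worked example later in the description (n = 8, l_x = l_h = 128, CV = 32, N = 4, RE = 8), this gives an FFB depth of 512*256 = 131072 and 32 LWB BRAM blocks of depth 32 each.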
In addition, the invention also provides a parameterized LSTM acceleration system design method for data off-chip block transmission, which is characterized by comprising the following steps:
(1) According to the LSTM parameters specified by the user, the design dimensions of each module in the LSTM acceleration system are determined, and each module is instantiated:
(1.1) determining the system design parameters from the user description, including the fixed-point data bit width n, the input-vector and hidden-vector dimensions l_x and l_h, and the values of CV, CT, RE and N;
(1.2) determining the design dimension of each component module of the LSTM acceleration system according to the storage characteristic requirement;
(1.3) instantiating each component module of the LSTM acceleration system according to the design dimensions of (1.2), and initializing all storage units and buffer contents to 0;
(2) According to the storage characteristic requirement, interleaving and storing LSTM weight matrix parameters to the FFB:
(2.1) let variable i=0;
(2.2) let CL = round(i/(H_W*RE)), CS = round((i % (CV*N*RE))/(4*N*RE)), k = i % RE, s = round(i/(N*RE)) % 4 and CW = round((i % (H_W*RE))/(CV*N*RE)); then:
if s=0, FFB[i] = Wi[CW*CT+CS][CL*RE+k];
if s=1, FFB[i] = Wf[CW*CT+CS][CL*RE+k];
if s=2, FFB[i] = Wg[CW*CT+CS][CL*RE+k];
if s=3, FFB[i] = Wo[CW*CT+CS][CL*RE+k];
where FFB[i] is the value of the storage unit at address i in the FFB; Wi[i][j], Wf[i][j], Wg[i][j] and Wo[i][j] are the weight parameters in the i-th row and j-th column of the input-gate, forget-gate, update-gate and output-gate weight matrices of the LSTM network; round() is the rounding-down (integer-division) function; and % denotes the remainder;
(2.3) let i = i+1 and repeat steps (2.2)-(2.3) until i = H_W*H_L;
(3) According to the storage feature requirements, the external feature input x_t to be processed and the output h_t of the LOB are stored to the LMB, where x_t is stored to the low addresses 0 ~ l_x-1 of the LMB and the output h_t of the LOB is stored to the high addresses l_x ~ H_L-1; the loop control parameters are initialized to CL = CW = CS = 0;
(4) And the DS circularly and sectionally transmits the weight parameters from the FFB to the LWB, and sends the weight parameters to the LAU to finish network acceleration operation, and the final operation result is stored in the LOB:
(4.1) RE feature parameters in the LMB, starting from address CL*RE, are simultaneously transmitted through the DS to the CV FPs of the HPU unit;
(4.2) CV*N*RE weight parameters, starting from address CW*(CV*N*RE) in the FFB, are transmitted through the DS to the CV BRAM blocks of the LWB, where the parameters at addresses CW*(CV*N*RE)+i*(N*RE) ~ CW*(CV*N*RE)+(i+1)*(N*RE)-1 are stored at addresses 0 ~ N*RE-1 of the i-th BRAM block (i = 0, 1, ..., CV-1); then let CS = 0;
(4.3) RE data, starting from address CS*RE in the CV BRAMs of the LWB, are transmitted in parallel through the DS to the CV FPs of the HPU, where the content of the i-th BRAM is sent to the i-th FP; the MAU of each FP completes the multiply-accumulate operation in parallel, and the calculation result is accumulated to address CS of its ABF;
(4.4) let CS = CS+1 and repeat step (4.3) until CS = N, completing the operation on N*RE parameters;
(4.5) let CW = CW+1 and repeat steps (4.2)-(4.3), cyclically transmitting the network weight parameters from off-chip to the CV FPs to complete the multiply-accumulate operations and accumulate the results, until CW = H_W/(N*CV), completing the calculation over all parameters in the column direction of the weight matrix;
(4.6) let CL = CL+1 and repeat steps (4.1)-(4.5), cyclically transmitting the network weight parameters from off-chip to the CV FPs to complete the multiply-accumulate operations and accumulate the results, until CL = H_L/RE-1;
(4.7) repeat steps (4.1)-(4.3), proceeding to (4.4) after the first calculation results are obtained; at the same time, every group of CV calculation results is transmitted, four values at a time, to the CT BP units of the TPU module for calculation, until CW = H_W/(N*CV), completing the operation on all parameters in the row direction of the LSTM weight matrix;
(5) The result in the LOB is output off-chip as the system output h_t; steps (3)-(4) are repeated to perform the calculation for the next time step.
Further, in step (4.7), the calculation performed by the CT BP units on every group of CV calculation results transmitted, four values at a time, to the TPU module is implemented as follows:
(4.7.1) the CV operation results are converted in parallel into the k-th elements i_t[k], f_t[k], g_t[k] and o_t[k] of the matrix-vector products of the input gate, forget gate, update gate and output gate, by the following method:
(4.7.1a) let vt be the value of the CS-th temporary storage unit of the ABF of the i-th FP, and let temp = (CW*N+CS)*CV+i;
(4.7.1b) let k = round(temp/4) and s = temp % 4, where round() is the rounding-down function and % denotes the remainder; then: if s=0, i_t[k] = vt; if s=1, f_t[k] = vt; if s=2, g_t[k] = vt; if s=3, o_t[k] = vt;
(4.7.2) the i_t[k], f_t[k], g_t[k] and o_t[k] obtained in step (4.7.1b) are transmitted in parallel, four values per group, to the CT BP units in the TPU, where the j-th group of data is transmitted to the j-th BP unit, j = k % CT (% denotes the remainder); at the same time, the value c_{t-1}[k] of the unit at address k in the LCB is transmitted through the DS to the corresponding BP;
(4.7.3) performing the following activation and point multiplication operations in parallel by the CT BP units, and storing the calculation results:
(4.7.3a) performing the activation operations in parallel: s_i_t[k] = sigmoid(i_t[k]), s_f_t[k] = sigmoid(f_t[k]), t_g_t[k] = tanh(g_t[k]) and s_o_t[k] = sigmoid(o_t[k]), obtaining the corresponding n-bit outputs s_i_t[k], s_f_t[k], t_g_t[k] and s_o_t[k], where sigmoid(x) = 1/(1+e^(-x)) and tanh(x) = (e^x - e^(-x))/(e^x + e^(-x));
(4.7.3b) performing the vector element point multiplication operations in parallel to obtain the corresponding n-bit outputs c_t[k] and h_t[k], where c_t[k] = s_i_t[k] ⊙ t_g_t[k] + s_f_t[k] ⊙ c_{t-1}[k] and h_t[k] = s_o_t[k] ⊙ tanh(c_t[k]); ⊙ denotes the element-wise product;
(4.7.3c) h_t[k] is transmitted through the DS to the LOB, and c_t[k] to the LCB.
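The per-element computation of step (4.7.3) is the standard LSTM cell update. The sketch below renders it in floating point rather than the n-bit fixed-point arithmetic of the hardware, so it only illustrates the dataflow of one BP unit for a single element k; the function names are illustrative, not from the patent:

```python
# Hedged sketch of step (4.7.3): activation of the four gate
# pre-activations, then the element-wise cell and hidden-state updates.
# Floating point is used here; the hardware uses n-bit fixed point.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bp_unit(i_t, f_t, g_t, o_t, c_prev):
    """One BP-unit step for a single element k."""
    s_i, s_f = sigmoid(i_t), sigmoid(f_t)
    t_g, s_o = math.tanh(g_t), sigmoid(o_t)
    c_t = s_i * t_g + s_f * c_prev   # c_t[k] = s_i ⊙ t_g + s_f ⊙ c_{t-1}[k]
    h_t = s_o * math.tanh(c_t)       # h_t[k] = s_o ⊙ tanh(c_t[k])
    return c_t, h_t
```

For example, with all gate pre-activations and the previous cell state equal to 0, the unit returns c_t = 0 and h_t = 0, since sigmoid(0) = 0.5 and tanh(0) = 0.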
Compared with the prior art, the invention has the following beneficial effects: the parameterized design of each component module of the LSTM acceleration system allows the system to be flexibly configured for different network structures and hardware platforms, improving its flexibility; meanwhile, the off-chip cyclic block transmission of the LSTM network weights effectively balances the resources of large network models and small embedded systems, improving the universality of the system.
Drawings
FIG. 1 is a block diagram of an acceleration system according to the present invention
FIG. 2 is a schematic diagram of the parameterized parallel acceleration processing unit LAU according to the present invention
FIG. 3 shows the storage width and depth of each storage module according to the present invention
FIG. 4 is a schematic diagram showing the interleaving of LSTM weight matrix parameters in the FFB according to the present invention
Detailed Description
The following detailed description of the embodiments of the invention is exemplary and is provided merely to illustrate the invention and is not to be construed as limiting the invention. The parameterized LSTM network acceleration system for off-chip block transmission and the design method thereof are described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the parameterized LSTM acceleration system for off-chip block transmission of data according to the present invention includes a parameterized parallel acceleration processing unit LAU, a parameterized on-chip memory module ONM, a data distributor DS, and a parameterized off-chip memory module FFB;
the LAU mainly comprises a front operation unit HPU and a back operation unit TPU; the ONM comprises an input feature storage module LMB, a weight storage module LWB, an intermediate state storage module LCB and a result storage module LOB;
external pending feature input x t The outputs of the FFB are transmitted to LWB in batches through DS, and the outputs of the LMB and the LWB are transmitted to the HPU of the LAU in parallel after being processed by DS; the output of the HPU is transmitted to the TPU module in parallel, the output of the LCB is transmitted to the TPU through the DS, and the output of the TPU is transmitted to the LCB and the LOB through the DS; the output of the LOB is transmitted to the LMB and is output to the off-chip giving system output h t
As shown in fig. 2, the HPU includes CV FP units, each FP unit including a parameterized multiply-accumulate unit MAU and a block cache accumulation unit ABF; the MAU performs RE n-bit fixed-point multiply-accumulate operations in parallel, and the ABF stores and accumulates N n-bit fixed-point values; the TPU comprises CT BP units, each BP unit performing activation and point multiplication operations in parallel; CV, CT, RE, N and n are all non-negative integer powers of 2, satisfy CV = 4*CT, and are all parameterizable.
As shown in FIG. 3, the storage widths of the FFB, LMB, LCB and LOB are n bits, and their storage depths are H_W*H_L, H_L, l_h and l_h respectively, where H_W = 4*l_h and H_L = l_x + l_h; l_x and l_h are the dimensions of the LSTM input vector and hidden vector respectively. The LWB consists of CV BRAM blocks, each with a storage depth of N*RE; N, RE, CV, n, l_h and l_x are all non-negative integer powers of 2 and are parameterizable.
The method for designing the parameterized LSTM acceleration system for data off-chip block transmission comprises the following steps.
(1) According to the LSTM parameters specified by the user, the design dimensions of each module in the LSTM acceleration system are determined, and each module is instantiated:
(1.1) determining the system design parameters from the user description, assuming a fixed-point data bit width n = 8 bits, input-vector and hidden-vector dimensions l_x = l_h = 128, CV = 32, CT = 8, RE = 8 and N = 4;
(1.2) according to the storage feature requirements, determining the design dimensions of each component module of the LSTM acceleration system from l_x and l_h: the dimension of the weight matrix in the column direction is H_W = 512 and in the row direction H_L = 256;
(1.3) instantiating each component module of the LSTM acceleration system according to the design dimension in (1.2), and initializing all storage units and buffered contents to 0.
(2) According to the storage feature requirements, the LSTM weight matrix parameters are interleaved and stored to the FFB as follows:
(2.1) let variable i=0;
(2.2) let CL = round(i/(H_W*RE)) = round(i/(512*8)) = round(i/4096), CS = round((i % (CV*N*RE))/(4*N*RE)) = round((i % (32*4*8))/(4*4*8)) = round((i % 1024)/128), k = i % RE = i % 8, s = round(i/(N*RE)) % 4 = round(i/32) % 4 and CW = round((i % (H_W*RE))/(CV*N*RE)) = round((i % (512*8))/(32*4*8)) = round((i % 4096)/1024); then:
if s=0, FFB[i] = Wi[CW*CT+CS][CL*8+k];
if s=1, FFB[i] = Wf[CW*CT+CS][CL*8+k];
if s=2, FFB[i] = Wg[CW*CT+CS][CL*8+k];
if s=3, FFB[i] = Wo[CW*CT+CS][CL*8+k];
where FFB[i] is the value of the storage unit at address i in the FFB; Wi[i][j], Wf[i][j], Wg[i][j] and Wo[i][j] are the weight parameters in the i-th row and j-th column of the input-gate, forget-gate, update-gate and output-gate weight matrices of the LSTM network; round() is the rounding-down (integer-division) function; and % denotes the remainder;
(2.3) let i = i+1 and repeat steps (2.2)-(2.3) until i = H_W*H_L = 512*256 = 131072.
According to the method, an interleaving storage schematic diagram of the LSTM weight matrix parameters in the FFB is shown in fig. 4.
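The interleaving rule of step (2) can be transcribed directly into software for checking. The sketch below is an illustrative Python rendering (not from the patent) of the index formulas, reading round() as integer division, which is the interpretation that keeps every index in range:

```python
# Hedged sketch: literal transcription of the index formulas of steps
# (2.1)-(2.3); round() in the text is taken as floor (integer division).
# Gate codes 0..3 correspond to Wi, Wf, Wg, Wo.

def ffb_entry(i, H_W, RE, CV, N, CT):
    """Map FFB address i to (gate, row, col) of the weight matrices."""
    CL = i // (H_W * RE)
    CS = (i % (CV * N * RE)) // (4 * N * RE)
    k = i % RE
    s = (i // (N * RE)) % 4
    CW = (i % (H_W * RE)) // (CV * N * RE)
    return s, CW * CT + CS, CL * RE + k
```

With the worked-example parameters (H_W = 512, RE = 8, CV = 32, N = 4, CT = 8), address 0 maps to Wi[0][0], address 32 = N*RE to Wf[0][0] (the gate index s advances every N*RE addresses), and address 4096 = H_W*RE to Wi[0][8] (the column block CL advances every H_W*RE addresses).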
(3) According to the storage feature requirements, the external feature input x_t to be processed and the output h_t of the LOB are stored to the LMB, where x_t is stored to the low addresses 0 ~ 127 of the LMB and the output h_t of the LOB is stored to the high addresses 128 ~ 255; the loop control parameters are initialized to CL = CW = CS = 0.
(4) And the DS circularly and sectionally transmits the weight parameters from the FFB to the LWB, and sends the weight parameters to the LAU to finish network acceleration operation, and the final operation result is stored in the LOB:
(4.1) 8 feature parameters in the LMB, starting from address CL*8, are simultaneously transmitted through the DS to the 32 FPs of the HPU unit;
(4.2) 32*4*8 = 1024 weight parameters, starting from address CW*1024 in the FFB, are transmitted through the DS to the 32 BRAM blocks of the LWB, where the parameters at addresses CW*1024+i*32 ~ CW*1024+(i+1)*32-1 are stored at addresses 0 ~ 31 of the i-th BRAM block (i = 0, 1, ..., 31); then let CS = 0;
(4.3) 8 data, starting from address CS*8 in the 32 BRAMs of the LWB, are transmitted in parallel through the DS to the 32 FPs of the HPU, where the content of the i-th BRAM is sent to the i-th FP; the MAU of each FP completes the multiply-accumulate operation in parallel, and the calculation result is accumulated to address CS of its ABF;
(4.4) let CS = CS+1 and repeat step (4.3) until CS = 4, completing the operation on 4*8 = 32 parameters;
(4.5) let CW = CW+1 and repeat steps (4.2)-(4.3), cyclically transmitting the network weight parameters from off-chip to the 32 FPs to complete the multiply-accumulate operations and accumulate the results, until CW = 512/(4*32) = 4, completing the calculation over all parameters in the column direction of the weight matrix;
(4.6) let CL = CL+1 and repeat steps (4.1)-(4.5), cyclically transmitting the network weight parameters from off-chip to the 32 FPs to complete the multiply-accumulate operations and accumulate the results, until CL = 256/8-1 = 31;
(4.7) repeat steps (4.1)-(4.3), proceeding to (4.4) after the first calculation results are obtained; at the same time, every group of 32 operation results is transmitted, four values at a time, to the 8 BP units of the TPU module for calculation, until CW = 512/(4*32) = 4, completing the operation on all parameters in the row direction of the LSTM weight matrix. The calculation performed by the 8 BP units on every group of calculation results is implemented as follows:
(4.7.1) the 32 operation results are converted in parallel into the k-th elements i_t[k], f_t[k], g_t[k] and o_t[k] of the matrix-vector products of the input gate, forget gate, update gate and output gate, by the following method:
(4.7.1a) let vt be the value of the CS-th temporary storage unit of the ABF of the i-th FP, and let temp = (CW*4+CS)*32+i;
(4.7.1b) let k = round(temp/4) and s = temp % 4, where round() is the rounding-down function and % denotes the remainder; then: if s=0, i_t[k] = vt; if s=1, f_t[k] = vt; if s=2, g_t[k] = vt; if s=3, o_t[k] = vt;
(4.7.2) the i_t[k], f_t[k], g_t[k] and o_t[k] obtained in step (4.7.1b) are transmitted in parallel, four values per group, to the 8 BP units in the TPU, where the j-th group of data is transmitted to the j-th BP unit, j = k % 8 (% denotes the remainder); at the same time, the value c_{t-1}[k] of the unit at address k in the LCB is transmitted through the DS to the corresponding BP;
(4.7.3) the following activation and point multiplication operations are performed in parallel by the 8 BP units, and the calculation results are stored:
(4.7.3a) performing the activation operations in parallel: s_i_t[k] = sigmoid(i_t[k]), s_f_t[k] = sigmoid(f_t[k]), t_g_t[k] = tanh(g_t[k]) and s_o_t[k] = sigmoid(o_t[k]), obtaining the corresponding 8-bit outputs s_i_t[k], s_f_t[k], t_g_t[k] and s_o_t[k], where sigmoid(x) = 1/(1+e^(-x)) and tanh(x) = (e^x - e^(-x))/(e^x + e^(-x));
(4.7.3b) performing the vector element point multiplication operations in parallel to obtain the corresponding 8-bit outputs c_t[k] and h_t[k], where c_t[k] = s_i_t[k] ⊙ t_g_t[k] + s_f_t[k] ⊙ c_{t-1}[k] and h_t[k] = s_o_t[k] ⊙ tanh(c_t[k]); ⊙ denotes the element-wise product;
(4.7.3c) h_t[k] is transmitted through the DS to the LOB, and c_t[k] to the LCB.
(5) The result in the LOB is output off-chip as the system output h_t; steps (3)-(4) are repeated to perform the calculation for the next time step.
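As a plausibility check on the loop nest of step (4), the sketch below counts the multiply-accumulate work with the worked-example parameters; the loop bounds follow the text (CL = 0..H_L/RE-1, CW = 0..H_W/(N*CV)-1, CS = 0..N-1), while the counting itself is an illustration, not part of the patent:

```python
# Hedged sketch: count the multiply-accumulate operations of the CL/CW/CS
# loop nest of step (4) to check that the tiling covers the whole weight
# matrix exactly once per time step.

l_x = l_h = 128
CV, CT, RE, N = 32, 8, 8, 4
H_W, H_L = 4 * l_h, l_x + l_h     # 512 and 256

macs = 0
for CL in range(H_L // RE):            # 32 feature blocks
    for CW in range(H_W // (N * CV)):  # 4 weight blocks per feature block
        for CS in range(N):            # 4 sub-blocks per BRAM transfer
            macs += CV * RE            # 32 FPs, each doing 8 MACs in parallel
```

The count comes out to 512*256 = 131072, i.e. exactly one multiply-accumulate per weight parameter, matching the FFB depth H_W*H_L.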
The foregoing description is only illustrative of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes or direct or indirect application in other related technical fields are included in the scope of the present invention.

Claims (5)

1. The LSTM acceleration system for data off-chip block transmission is characterized by comprising a parameterized parallel acceleration processing unit LAU, a parameterized on-chip memory module ONM, a data distributor DS and a parameterized off-chip memory module FFB;
the LAU mainly comprises a front operation unit HPU and a back operation unit TPU; the ONM comprises an input feature storage module LMB, a weight storage module LWB, an intermediate state storage module LCB and a result storage module LOB;
external pending feature input x t The outputs of the FFB are transmitted to LWB in batches through DS, and the outputs of the LMB and the LWB are transmitted to the HPU of the LAU in parallel after being processed by DS; the output of the HPU is transmitted to the TPU module in parallel, the output of the LCB is transmitted to the TPU through the DS, and the output of the TPU is transmitted to the LCB and the LOB through the DS; the output of the LOB is transmitted to the LMB and is output to the off-chip giving system output h t
2. The LSTM acceleration system for off-chip block transfer of claim 1, wherein the HPU includes CV FP units, each FP unit including a parameterized multiply-accumulate unit MAU and a block cache accumulation unit ABF; the MAU performs RE n-bit fixed-point multiply-accumulate operations in parallel, and the ABF stores and accumulates N n-bit fixed-point values; the TPU comprises CT BP units, each BP unit performing activation and point multiplication operations in parallel; CV, CT, RE, N and n are all non-negative integer powers of 2, satisfy CV = 4*CT, and are all parameterizable.
3. The LSTM acceleration system for off-chip block transmission of claim 1, wherein the storage widths of said FFB, LMB, LCB and LOB are n bits and their storage depths are H_W*H_L, H_L, l_h and l_h respectively, where H_W = 4*l_h and H_L = l_x + l_h; l_x and l_h are the dimensions of the LSTM input vector and hidden vector respectively; the LWB consists of CV BRAM blocks, each with a storage depth of N*RE; N, RE, CV, n, l_h and l_x are all non-negative integer powers of 2 and are parameterizable.
4. The LSTM acceleration system design method for data off-chip block transmission is characterized by comprising the following steps of:
(1) According to the LSTM parameters specified by the user, the design dimensions of each module in the LSTM acceleration system of claim 1 are determined, and each module is instantiated:
(1.1) determining the system design parameters from the user description, including the fixed-point data bit width n, the input-vector and hidden-vector dimensions l_x and l_h, and the values of CV, CT, RE and N;
(1.2) determining the design dimensions of the constituent modules of the LSTM acceleration system of claim 1 based on the feature requirements of claims 2 and 3;
(1.3) instantiating each constituent module of the LSTM acceleration system of claim 1 according to the design dimension of (1.2), and initializing all storage units and buffered contents to 0.
(2) According to the storage feature requirement of claim 3, the LSTM weight matrix parameters are interleaved and stored to the FFB as follows:
(2.1) let variable i=0;
(2.2) let CL = round(i/(H_W*RE)), CS = round((i % (CV*N*RE))/(4*N*RE)), k = i % RE, s = round(i/(N*RE)) % 4 and CW = round((i % (H_W*RE))/(CV*N*RE)); then:
if s=0, FFB[i] = Wi[CW*CT+CS][CL*RE+k];
if s=1, FFB[i] = Wf[CW*CT+CS][CL*RE+k];
if s=2, FFB[i] = Wg[CW*CT+CS][CL*RE+k];
if s=3, FFB[i] = Wo[CW*CT+CS][CL*RE+k];
where FFB[i] is the value of the storage unit at address i in the FFB; Wi[i][j], Wf[i][j], Wg[i][j] and Wo[i][j] are the weight parameters in the i-th row and j-th column of the input-gate, forget-gate, update-gate and output-gate weight matrices of the LSTM network; round() is the rounding-down (integer-division) function; and % denotes the remainder;
(2.3) let i = i+1 and repeat steps (2.2)-(2.3) until i = H_W*H_L;
(3) According to the storage feature requirement of claim 3, the external feature input x_t to be processed and the output h_t of the LOB are stored to the LMB, where x_t is stored to the low addresses 0 ~ l_x-1 of the LMB and the output h_t of the LOB is stored to the high addresses l_x ~ H_L-1; the loop control parameters are initialized to CL = CW = CS = 0;
(4) And the DS circularly and sectionally transmits the weight parameters from the FFB to the LWB, and sends the weight parameters to the LAU to finish network acceleration operation, and the final operation result is stored in the LOB:
(4.1) RE feature parameters in the LMB, starting from address CL*RE, are simultaneously transmitted through the DS to the CV FPs of the HPU unit;
(4.2) CV*N*RE weight parameters, starting from address CW*(CV*N*RE) in the FFB, are transmitted through the DS to the CV BRAM blocks of the LWB, where the parameters at addresses CW*(CV*N*RE)+i*(N*RE) ~ CW*(CV*N*RE)+(i+1)*(N*RE)-1 are stored at addresses 0 ~ N*RE-1 of the i-th BRAM block (i = 0, 1, ..., CV-1); then let CS = 0;
(4.3) transmitting the RE data starting from address CS*RE in each of the CV BRAMs of the LWB in parallel to the CV FPs of the HPU through the DS, where the content of the i-th BRAM is sent to the i-th FP; the MAU of each FP completes the multiply-accumulate operation in parallel and adds the result to address CS of its ABF;
(4.4) let CS = CS+1 and repeat step (4.3) until CS = N, completing the operation on N*RE parameters;
(4.5) let CW = CW+1 and repeat steps (4.2)–(4.3), cyclically transmitting the network weight parameters from off-chip to the CV FPs for multiply-accumulate operations and accumulating the results, until CW = H_W, completing the calculation over all parameters in the column direction of the weight matrix;
(4.6) let CL = CL+1 and repeat steps (4.1)–(4.5), cyclically transmitting the network weight parameters from off-chip to the CV FPs for multiply-accumulate operations and accumulating the results, until CL = H_L/RE−1;
(4.7) repeat steps (4.1)–(4.3); once the first calculation result is available, proceed to (4.4) while simultaneously transmitting the calculation results, every 4*CV as a group, to the CT BPs of the TPU module for computation, until CW = H_W, completing the (N:CV) operation on all parameters in the row direction of the LSTM weight matrix;
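The loop nest of steps (4.1)–(4.6) amounts to a tiled matrix–vector product. The sketch below models it in plain NumPy as an illustration only: the hardware parallelism over the CV FPs is serialized, the block sizes are hypothetical, and the output-row mapping r = (CW*N+CS)*CV+i follows step (4.7.1a).

```python
import numpy as np

def blocked_matvec(W, x, RE, CV, N):
    """Tiled matrix-vector product mirroring steps (4.1)-(4.6):
    (4.1)/(4.6) stream the input vector RE elements at a time (one LMB segment),
    (4.2)/(4.5) fetch one off-chip weight block per CW step,
    (4.3)/(4.4) let CV lanes (FPs) each accumulate N partial sums (ABF slots)."""
    rows, cols = W.shape
    assert cols % RE == 0 and rows % (CV * N) == 0
    y = np.zeros(rows)                            # ABF accumulators
    for CL in range(cols // RE):                  # feature segments
        x_seg = x[CL * RE:(CL + 1) * RE]          # (4.1) RE features to all FPs
        for CW in range(rows // (CV * N)):        # (4.2)/(4.5) weight blocks
            for CS in range(N):                   # (4.4) accumulator slots
                for i in range(CV):               # (4.3) CV FPs in parallel
                    r = (CW * N + CS) * CV + i    # row mapping of (4.7.1a)
                    y[r] += W[r, CL * RE:(CL + 1) * RE] @ x_seg
    return y
```

The function is numerically equivalent to W @ x; the value of the tiling is that only CV*N*RE weights need to be resident on-chip at any time.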
(5) Output the result in the LOB off-chip as the system output h_t, then repeat steps (3)–(4) to perform the calculation for the next time step.
5. The LSTM acceleration system design method for data off-chip block transmission according to claim 4, wherein in step (4.7) the operation on the groups of 4*CV calculation results transmitted to the CT BPs of the TPU module is implemented as follows:
(4.7.1) converting the CV calculation results in parallel into the k-th elements i_t[k], f_t[k], g_t[k] and o_t[k] of the input gate, forget gate, update gate and output gate matrix-vector multiplication results as follows:
(4.7.1a) let vt be the value of the CS-th temporary storage unit of the ABF of the i-th FP, and let the variable temp = (CW*N+CS)*CV+i;
(4.7.1b) let the variables k = round(temp/4) and s = temp%4, where round() is a rounding function and % denotes the remainder; then: if s=0, i_t[k] = vt; if s=1, f_t[k] = vt; if s=2, g_t[k] = vt; if s=3, o_t[k] = vt;
(4.7.2) transmit the i_t[k], f_t[k], g_t[k] and o_t[k] obtained in step (4.7.1b) in parallel, four at a time as a group, to the CT BP units in the TPU, where the j-th group of data is sent to the j-th BP unit, j = k%CT (% denotes the remainder), and the value c_{t-1}[k] of the unit at address k in the LCB is transmitted to the corresponding BP through the DS;
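A minimal sketch of the index mapping in steps (4.7.1b)–(4.7.2), again reading round() as integer division and using the gate order given in step (4.7.1b):

```python
def dispatch(temp, CT):
    """Map the flat result index temp to the gate slot s, the element
    index k, and the BP unit j that receives group k (steps (4.7.1b)-(4.7.2))."""
    k = temp // 4          # element index shared by the four gates
    s = temp % 4           # 0: i_t, 1: f_t, 2: g_t, 3: o_t
    j = k % CT             # group k is routed to BP unit j
    return k, s, j
```

Four consecutive temp values thus yield the (i_t, f_t, g_t, o_t) quadruple for one element k, and successive quadruples round-robin over the CT BP units.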
(4.7.3) the CT BP units perform the following activation and point-multiplication operations in parallel and store the calculation results:
(4.7.3a) perform the activation operations in parallel: s_i_t[k] = sigmoid(i_t[k]), s_f_t[k] = sigmoid(f_t[k]), t_g_t[k] = tanh(g_t[k]) and s_o_t[k] = sigmoid(o_t[k]), obtaining the corresponding n-bit outputs s_i_t[k], s_f_t[k], t_g_t[k] and s_o_t[k], where sigmoid(x) = 1/(1+e^(-x)) and tanh(x) = (e^x − e^(-x))/(e^x + e^(-x));
(4.7.3b) perform the vector element point-multiplication operations in parallel to obtain the corresponding n-bit outputs c_t[k] and h_t[k], where c_t[k] = s_i_t[k] ⊙ t_g_t[k] + s_f_t[k] ⊙ c_{t-1}[k] and h_t[k] = s_o_t[k] ⊙ tanh(c_t[k]), with ⊙ denoting the element-wise product;
(4.7.3c) transmit h_t[k] to the LOB and c_t[k] to the LCB through the DS.
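The per-BP computation of step (4.7.3) is the standard element-wise LSTM cell update. A floating-point sketch follows; the claim's n-bit fixed-point datapath is not modeled here.

```python
import numpy as np

def bp_pointwise(i_t, f_t, g_t, o_t, c_prev):
    """Step (4.7.3): gate activations followed by the element-wise
    state and output update of one BP unit."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    s_i, s_f, s_o = sigmoid(i_t), sigmoid(f_t), sigmoid(o_t)   # (4.7.3a)
    t_g = np.tanh(g_t)
    c_t = s_i * t_g + s_f * c_prev   # (4.7.3b) c_t = s_i ⊙ t_g + s_f ⊙ c_{t-1}
    h_t = s_o * np.tanh(c_t)         #          h_t = s_o ⊙ tanh(c_t)
    return c_t, h_t
```

Because every element k is updated independently, the CT BP units can evaluate this in parallel exactly as steps (4.7.2)–(4.7.3) distribute the work.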
CN202210436245.3A 2022-04-24 2022-04-24 Parameterized LSTM acceleration system for data off-chip block transmission and design method thereof Pending CN116992932A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210436245.3A CN116992932A (en) 2022-04-24 2022-04-24 Parameterized LSTM acceleration system for data off-chip block transmission and design method thereof

Publications (1)

Publication Number Publication Date
CN116992932A true CN116992932A (en) 2023-11-03

Family

ID=88528886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210436245.3A Pending CN116992932A (en) 2022-04-24 2022-04-24 Parameterized LSTM acceleration system for data off-chip block transmission and design method thereof

Country Status (1)

Country Link
CN (1) CN116992932A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118070865A (en) * 2024-04-25 2024-05-24 北京壁仞科技开发有限公司 Optimization method and device of artificial intelligent model, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Wang et al. Approximate policy-based accelerated deep reinforcement learning
CN108416422B (en) FPGA-based convolutional neural network implementation method and device
CN108090560A (en) The design method of LSTM recurrent neural network hardware accelerators based on FPGA
US20180164866A1 (en) Low-power architecture for sparse neural network
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
CN107633298B (en) Hardware architecture of recurrent neural network accelerator based on model compression
CN109409512B (en) Flexibly configurable neural network computing unit, computing array and construction method thereof
CN109409510B (en) Neuron circuit, chip, system and method thereof, and storage medium
US20140344203A1 (en) Neural network computing apparatus and system, and method therefor
CN108763159A (en) To arithmetic accelerator before a kind of LSTM based on FPGA
CN106127301A (en) A kind of stochastic neural net hardware realization apparatus
CN108304926B (en) Pooling computing device and method suitable for neural network
CN108205704A (en) A kind of neural network chip
CN114282678A (en) Method for training machine learning model and related equipment
CN112381209A (en) Model compression method, system, terminal and storage medium
CN111767994A (en) Neuron calculation module
CN116992932A (en) Parameterized LSTM acceleration system for data off-chip block transmission and design method thereof
CN111831354A (en) Data precision configuration method, device, chip array, equipment and medium
CN111831359A (en) Weight precision configuration method, device, equipment and storage medium
CN109685208B (en) Method and device for thinning and combing acceleration of data of neural network processor
CN110689123A (en) Long-short term memory neural network forward acceleration system and method based on pulse array
CN114519425A (en) Convolution neural network acceleration system with expandable scale
CN111831356B (en) Weight precision configuration method, device, equipment and storage medium
CN109978143B (en) Stack type self-encoder based on SIMD architecture and encoding method
RU2294561C2 (en) Device for hardware realization of probability genetic algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination