WO2022062391A1 - System, method and storage medium for accelerating an RNN network - Google Patents

System, method and storage medium for accelerating an RNN network

Info

Publication number
WO2022062391A1
Authority
WO
WIPO (PCT)
Prior art keywords
state
output
data
gate
circuit
Prior art date
Application number
PCT/CN2021/089936
Other languages
English (en)
French (fr)
Inventor
刘海威
董刚
赵雅倩
李仁刚
蒋东东
杨宏斌
梁玲燕
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司
Priority to US18/012,938, published as US11775803B2
Publication of WO2022062391A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F5/00 Methods or arrangements for data conversion without changing the order or content of the data handled
    • G06F5/06 Methods or arrangements for data conversion without changing the order or content of the data handled for changing the speed of data flow, i.e. speed regularising or timing, e.g. delay lines, FIFO buffers; over- or underrun control therefor
    • G06F5/065 Partitioned buffers, e.g. allowing multiple independent queues, bidirectional FIFO's
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Definitions

  • The present invention relates to the technical field of neural networks, and in particular to a system, method and storage medium for accelerating an RNN network.
  • RNN: Recurrent Neural Network.
  • DNN: Deep Neural Network.
  • An RNN is a neural network for processing sequence data. It is one of the most promising tools in deep learning and is widely used in speech recognition, machine translation, text generation and other fields. It solves the problem that traditional neural networks cannot share features learned at different positions of the data.
  • In traditional neural network models such as CNNs and DNNs, the layers from the input layer through the hidden layer to the output layer are fully connected, while the nodes within each layer are unconnected.
  • Such ordinary neural networks are powerless for many problems. For example, to predict the next word of a sentence, the previous words are generally needed, because the words in a sentence are not independent of one another.
  • RNNs are called recurrent neural networks because the current output of a sequence also depends on the previous outputs. Concretely, the network memorizes the preceding information and applies it to the calculation of the current output; that is, the nodes between hidden layers are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment.
  • FIG. 1 is a standard RNN structure diagram; each arrow represents a transformation, that is, each connection carries a weight. The left side shows the network folded, the right side shows it unfolded, and the arrow beside h on the left represents the "loop" in this structure.
  • x is the input
  • h is the hidden layer unit
  • o is the output
  • L is the loss function
  • y is the label of the training set.
  • The superscript t on these elements denotes the state at time t. It can be seen that the state of the hidden unit h at time t is determined not only by the input at that moment, but is also affected by what happened before time t.
  • V, W and U are weights, and connections of the same type share the same weight.
  • One of the key points of RNNs is that they can connect previous information to the current task, as the sketch below illustrates.
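  • A minimal NumPy sketch of this recurrence (illustrative only; the matrix names U, W, b and the toy dimensions are assumptions, not taken from the patent):

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, b):
    """One step of the standard RNN recurrence: the hidden state at time t
    depends on the current input x_t and on the previous hidden state."""
    return np.tanh(U @ x_t + W @ h_prev + b)

rng = np.random.default_rng(0)
U = rng.standard_normal((3, 4))        # input-to-hidden weights (input size 4)
W = rng.standard_normal((3, 3))        # hidden-to-hidden weights: the "loop"
b = np.zeros(3)

h = np.zeros(3)                        # initial hidden state
for x in rng.standard_normal((5, 4)):  # a sequence of 5 inputs
    h = rnn_step(x, h, U, W, b)        # h carries information across time steps
```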
  • GRU and LSTM are among the most commonly used RNN networks.
  • LSTM: Long Short-Term Memory network.
  • Figure 2 is a schematic diagram of the LSTM structure and calculation formula.
  • LSTM removes information from or adds information to the "cell state" through "gate" structures, retaining important content and removing unimportant content.
  • Each gate outputs a probability value between 0 and 1 through a sigmoid layer, describing how much of each component can pass: 0 means "allow nothing to pass" and 1 means "allow everything to pass", as sketched below.
  • The gate structures included are the input gate i_t, the forget gate f_t, the output gate o_t and the cell gate c̃_t.
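  • A small sketch of this gating behaviour (illustrative; the example values are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A gate outputs values in (0, 1) that scale how much of each component passes.
cell_info = np.array([2.0, -1.0, 0.5])
gate = sigmoid(np.array([-10.0, 0.0, 10.0]))  # approximately [0, 0.5, 1]
passed = gate * cell_info                     # ~[0, -0.5, 0.5]: blocked, halved, passed
```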
  • As RNNs are more and more widely used in speech recognition, machine translation, language modeling, sentiment analysis and text prediction, the requirements on RNN networks keep growing. Therefore, in the face of increasingly complex networks and increasingly large model parameters, it is very important to accelerate the RNN network with appropriate methods.
  • The purpose of the present invention is to provide a system, method and storage medium for accelerating the RNN network, so as to effectively accelerate the RNN network, reduce time consumption and improve operation efficiency.
  • the present invention provides the following technical solutions:
  • A system for accelerating RNN networks, including:
  • a first cache, used for cyclically switching between a first state and a second state; in the first state it outputs W_{x1} to W_{xN} in N parallel channels, each with parallelism k, and in the second state it outputs W_{h1} to W_{hN} in N parallel channels, each with parallelism k; N is a positive integer ≥ 2;
  • a second cache, used for cyclically switching between the first state and the second state, outputting x_t in the first state and h_{t-1} in the second state;
  • a vector multiplication circuit, configured to calculate W_{x1}x_t to W_{xN}x_t with N groups of multiplication arrays upon receiving W_{x1} to W_{xN} output by the first cache, and to calculate W_{h1}h_{t-1} to W_{hN}h_{t-1} with the N groups of multiplication arrays upon receiving W_{h1} to W_{hN} output by the first cache; the vector multiplication circuit includes N groups of multiplication arrays, and each group of multiplication arrays includes k multiplication units;
  • an addition circuit, for receiving b_1 to b_N sent by the bias data cache and using the vector cache to calculate W_{x1}x_t + W_{h1}h_{t-1} + b_1 to W_{xN}x_t + W_{hN}h_{t-1} + b_N;
  • an activation circuit, for performing an activation operation according to the output of the addition circuit;
  • a state update circuit, used to obtain c_{t-1} from the cell state cache, calculate c_t and h_t according to the output of the activation circuit, update c_{t-1} in the cell state cache with c_t once c_t has been calculated, and send h_t to the second cache;
  • a bias data cache; a vector cache; a cell state cache;
  • where W_{x1} to W_{xN} represent the weight data matrices of the first gate to the Nth gate in turn;
  • W_{h1} to W_{hN} represent the hidden state weight data matrices of the first gate to the Nth gate in turn;
  • x_t represents the input data at time t,
  • h_{t-1} represents the hidden state data at time t-1,
  • h_t represents the hidden state data at time t,
  • c_t represents the cell state at time t, and c_{t-1} represents the cell state at time t-1.
  • The first cache is specifically used to: cyclically switch between the first state and the second state; in the first state it outputs W_{xi}, W_{xf}, W_{xo} and W_{xc} in 4 parallel channels, each with parallelism k, and in the second state it outputs W_{hi}, W_{hf}, W_{ho} and W_{hc} in 4 parallel channels, each with parallelism k;
  • the second cache is specifically used to: cyclically switch between the first state and the second state, outputting x_t in the first state and h_{t-1} in the second state;
  • the vector multiplication circuit is specifically used to: upon receiving W_{xi}, W_{xf}, W_{xo} and W_{xc} output by the first cache, use 4 groups of multiplication arrays to calculate W_{xi}x_t, W_{xf}x_t, W_{xo}x_t and W_{xc}x_t respectively, and upon receiving W_{hi}, W_{hf}, W_{ho} and W_{hc} output by the first cache, use the 4 groups of multiplication arrays to calculate W_{hi}h_{t-1}, W_{hf}h_{t-1}, W_{ho}h_{t-1} and W_{hc}h_{t-1} respectively; the vector multiplication circuit includes 4 groups of multiplication arrays, and each group of multiplication arrays includes k multiplication units;
  • the addition circuit is specifically used to: receive b_i, b_f, b_o and b_c sent by the bias data cache, and use the vector cache to calculate W_{xi}x_t + W_{hi}h_{t-1} + b_i, W_{xf}x_t + W_{hf}h_{t-1} + b_f, W_{xo}x_t + W_{ho}h_{t-1} + b_o, and W_{xc}x_t + W_{hc}h_{t-1} + b_c;
  • the activation circuit is specifically used to: perform an activation operation according to the output of the addition circuit, and output i_t, f_t, o_t and c̃_t;
  • the state update circuit is specifically used to: obtain c_{t-1} from the cell state cache, calculate c_t and h_t according to the output of the activation circuit, update c_{t-1} in the cell state cache with c_t once c_t has been calculated, and send h_t to the second cache;
  • where W_{xi}, W_{xf}, W_{xo} and W_{xc} represent the input gate weight data matrix, the forget gate weight data matrix, the output gate weight data matrix and the cell gate weight data matrix in turn;
  • W_{hi}, W_{hf}, W_{ho} and W_{hc} represent the input gate hidden state weight data matrix, the forget gate hidden state weight data matrix, the output gate hidden state weight data matrix and the cell gate hidden state weight data matrix in turn;
  • b_i, b_f, b_o and b_c represent the input gate bias data, the forget gate bias data, the output gate bias data and the cell gate bias data in turn;
  • x_t represents the input data at time t, h_{t-1} represents the hidden state data at time t-1, h_t represents the hidden state data at time t, c_t represents the cell state at time t, and c_{t-1} represents the cell state at time t-1.
  • the vector multiplication circuit is in a first pipeline
  • the adding circuit is in a second pipeline
  • the activation circuit and the state update circuit are in a third pipeline
  • the first pipeline, the second pipeline and the third pipeline run in parallel.
  • the first cache includes:
  • a first storage unit, used to obtain a target quantity of W_{xi}, a target quantity of W_{xf}, a target quantity of W_{xo} and a target quantity of W_{xc} from off-chip storage;
  • a second storage unit, used to obtain a target quantity of W_{hi}, a target quantity of W_{hf}, a target quantity of W_{ho} and a target quantity of W_{hc} from off-chip storage;
  • a first multiplexer, connected to the first storage unit and the second storage unit respectively, for cyclically switching between the first state and the second state, selecting the first storage unit for data output in the first state and the second storage unit for data output in the second state;
  • a first memory, a second memory, a third memory and a fourth memory, all connected to the first multiplexer through a data classifier; when the first multiplexer is in the first state they are used in turn to output W_{xi}, W_{xf}, W_{xo} and W_{xc} in parallel, each with parallelism k, and when the first multiplexer is in the second state they are used in turn to output W_{hi}, W_{hf}, W_{ho} and W_{hc} in parallel, each with parallelism k;
  • the target quantity is greater than k.
  • both the first storage unit and the second storage unit use a first clock;
  • the first memory, the second memory, the third memory and the fourth memory all use a second clock;
  • the first clock and the second clock are independent of each other, so that when the output rate of any one of the first memory, the second memory, the third memory and the fourth memory is lower than its input rate, the unsent data is buffered in that memory.
  • the second cache includes:
  • a third storage unit, used to obtain x_t at odd-numbered moments from off-chip storage;
  • a fourth storage unit, used to obtain x_t at even-numbered moments from off-chip storage;
  • a second multiplexer, connected to the third storage unit and the fourth storage unit respectively, for cyclically switching between the first state and the second state, selecting the third storage unit for data output in the first state and the fourth storage unit for data output in the second state;
  • a fifth storage unit, configured to obtain h_0 and h_t at even-numbered moments through the third multiplexer;
  • a sixth storage unit, configured to obtain h_t at odd-numbered moments through the third multiplexer;
  • a fourth multiplexer, for cyclically switching between the first state and the second state, selecting the fifth storage unit for data output in the first state and the sixth storage unit for data output in the second state;
  • a fifth multiplexer, for cyclically switching between the first state and the second state, selecting the second multiplexer for data output in the first state and the fourth multiplexer for data output in the second state.
  • the adding circuit includes:
  • 4 groups of adder circuits, where each group of adder circuits is used to sum the k input data;
  • a vector addition circuit, connected to the outputs of the 4 groups of adder circuits, for receiving b_i, b_f, b_o and b_c sent by the bias data cache and, according to the output of each group of adder circuits and using the vector cache, calculating W_{xi}x_t + W_{hi}h_{t-1} + b_i, W_{xf}x_t + W_{hf}h_{t-1} + b_f, W_{xo}x_t + W_{ho}h_{t-1} + b_o, and W_{xc}x_t + W_{hc}h_{t-1} + b_c.
  • the activation circuit is specifically used to:
  • perform the sigmoid activation operation and the tanh activation operation according to the output of the addition circuit, and output i_t, f_t, o_t and c̃_t;
  • the state update circuit is specifically used to:
  • calculate c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t and h_t = o_t ⊙ tanh(c_t), where ⊙ denotes the element-wise product.
  • A method for accelerating an RNN network, applied in the system for accelerating an RNN network described in any of the above, comprising:
  • the first cache switches cyclically between the first state and the second state; in the first state it outputs W_{x1} to W_{xN} in N parallel channels, each with parallelism k, and in the second state it outputs W_{h1} to W_{hN} in N parallel channels, each with parallelism k; N is a positive integer ≥ 2;
  • the second cache switches cyclically between the first state and the second state, outputting x_t in the first state and h_{t-1} in the second state;
  • the vector multiplication circuit uses N groups of multiplication arrays to calculate W_{x1}x_t to W_{xN}x_t respectively upon receiving W_{x1} to W_{xN} output by the first cache, and uses the N groups of multiplication arrays to calculate W_{h1}h_{t-1} to W_{hN}h_{t-1} respectively upon receiving W_{h1} to W_{hN} output by the first cache; the vector multiplication circuit includes N groups of multiplication arrays, and each group of multiplication arrays includes k multiplication units;
  • the addition circuit receives b_1 to b_N sent by the bias data cache, and uses the vector cache to calculate W_{x1}x_t + W_{h1}h_{t-1} + b_1 to W_{xN}x_t + W_{hN}h_{t-1} + b_N;
  • the activation circuit performs an activation operation according to the output of the addition circuit;
  • the state update circuit obtains c_{t-1} from the cell state cache, calculates c_t and h_t according to the output of the activation circuit, updates c_{t-1} in the cell state cache with c_t once c_t has been calculated, and sends h_t to the second cache;
  • where W_{x1} to W_{xN} represent the weight data matrices of the first gate to the Nth gate in turn;
  • W_{h1} to W_{hN} represent the hidden state weight data matrices of the first gate to the Nth gate in turn;
  • b_1 to b_N represent the bias data of the first gate to the Nth gate in turn;
  • x_t represents the input data at time t,
  • h_{t-1} represents the hidden state data at time t-1,
  • h_t represents the hidden state data at time t,
  • c_t represents the cell state at time t,
  • c_{t-1} represents the cell state at time t-1.
  • A computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the steps of the above method for accelerating an RNN network.
  • The present application uses a vector multiplication circuit that includes N groups of multiplication arrays, and each group of multiplication arrays includes k multiplication units, which is beneficial to improving the calculation speed.
  • If the calculations of W_x·x_t and W_h·h_{t-1} are combined and performed together, the calculation speed becomes very slow when the dimension of x_t or h_{t-1} is large.
  • In the present application, W_x·x_t and W_h·h_{t-1} are calculated in a time-division, segmented manner; that is, accumulation does not have to wait until all the values of W_x·x_t and W_h·h_{t-1} have been generated, which further improves the acceleration effect of the scheme.
  • The first cache is used for cyclic switching between the first state and the second state; in the first state it outputs W_{x1} to W_{xN} in N parallel channels, each with parallelism k, and in the second state it outputs W_{h1} to W_{hN} in N parallel channels, each with parallelism k; N is a positive integer ≥ 2. The second cache is used for cyclic switching between the first state and the second state, outputting x_t in the first state and h_{t-1} in the second state.
  • The vector multiplication circuit uses N groups of multiplication arrays to calculate W_{x1}x_t to W_{xN}x_t upon receiving W_{x1} to W_{xN} output by the first cache, and to calculate W_{h1}h_{t-1} to W_{hN}h_{t-1} upon receiving W_{h1} to W_{hN} output by the first cache.
  • The addition circuit can then receive b_1 to b_N sent by the bias data cache and use the vector cache to calculate W_{x1}x_t + W_{h1}h_{t-1} + b_1 to W_{xN}x_t + W_{hN}h_{t-1} + b_N.
  • Each group of multiplication arrays includes k multiplication units, and by setting and adjusting the value of k, the solution of the present application can be adapted to RNN networks of different sizes; that is, the solution has very strong flexibility and scalability. To sum up, the solution of the present application effectively realizes the acceleration of the RNN network, and has strong flexibility and scalability.
  • Figure 1 is a standard RNN structure diagram
  • Figure 2 is a schematic diagram of the LSTM structure and calculation formula
  • FIG. 3 is a schematic structural diagram of a system for accelerating an RNN network in the present invention.
  • FIG. 4 is a schematic structural diagram of the first cache in the present invention.
  • FIG. 5 is a schematic structural diagram of the second cache in the present invention.
  • FIG. 6 is a schematic structural diagram of a group of multiplication arrays in the present invention.
  • FIG. 7 is a schematic structural diagram of the addition circuit in the present invention.
  • FIG. 8 is a schematic diagram of pipeline operation in a specific embodiment of the present invention.
  • The core of the present invention is to provide a system for accelerating the RNN network, which effectively realizes the acceleration of the RNN network and has strong flexibility and scalability.
  • FIG. 3 is a schematic structural diagram of a system for accelerating an RNN network in the present invention.
  • The system for accelerating the RNN network can be applied to hardware such as FPGAs, ASICs and reconfigurable chips.
  • An FPGA has the advantages of strong flexibility, configurability and low power consumption, so an FPGA is used as an example in the following description.
  • the system for accelerating the RNN network may include:
  • the first cache 10, used for cyclically switching between the first state and the second state; in the first state it outputs W_{x1} to W_{xN} in N parallel channels, each with parallelism k, and in the second state it outputs W_{h1} to W_{hN} in N parallel channels, each with parallelism k; N is a positive integer ≥ 2;
  • the second cache 20, used for cyclically switching between the first state and the second state, outputting x_t in the first state and h_{t-1} in the second state;
  • the vector multiplication circuit 30, configured to calculate W_{x1}x_t to W_{xN}x_t with N groups of multiplication arrays upon receiving W_{x1} to W_{xN} output by the first cache 10, and to calculate W_{h1}h_{t-1} to W_{hN}h_{t-1} with the N groups of multiplication arrays upon receiving W_{h1} to W_{hN} output by the first cache 10; the vector multiplication circuit 30 includes N groups of multiplication arrays, and each group of multiplication arrays includes k multiplication units;
  • the addition circuit 40, for receiving b_1 to b_N sent by the bias data cache and using the vector cache to calculate W_{x1}x_t + W_{h1}h_{t-1} + b_1 to W_{xN}x_t + W_{hN}h_{t-1} + b_N;
  • the activation circuit 50, for performing an activation operation according to the output of the addition circuit 40;
  • the state update circuit 60, used to obtain c_{t-1} from the cell state cache, calculate c_t and h_t according to the output of the activation circuit 50, update c_{t-1} in the cell state cache with c_t once c_t has been calculated, and send h_t to the second cache;
  • the bias data cache 70; the vector cache 80; the cell state cache 90;
  • where W_{x1} to W_{xN} represent the weight data matrices of the first gate to the Nth gate in turn;
  • W_{h1} to W_{hN} represent the hidden state weight data matrices of the first gate to the Nth gate in turn;
  • x_t represents the input data at time t,
  • h_{t-1} represents the hidden state data at time t-1,
  • h_t represents the hidden state data at time t,
  • c_t represents the cell state at time t, and c_{t-1} represents the cell state at time t-1.
  • N can be set according to the actual situation.
  • GRU and LSTM are among the most commonly used RNN networks.
  • RNN networks can be used in speech recognition, text recognition, text translation, language modeling, sentiment analysis and text prediction. In particular, the LSTM network has been more and more widely used due to its excellent characteristics.
  • The input data can be operated on, and the output result can finally be obtained.
  • For example, if it is an LSTM network applied to speech recognition, the input data x_t at time t is specifically the speech input data to be recognized at time t, and through the operation of the LSTM network the speech recognition result can be output.
  • If the LSTM network is applied to text recognition, the input data x_t at time t is specifically the image input data carrying the text to be recognized at time t, and through the operation of the LSTM network the text recognition result can be output.
  • If the LSTM network is applied to text translation, the input data x_t at time t is specifically the text input data to be translated at time t, and through the operation of the LSTM network the translation result can be output.
  • If the LSTM network is applied to sentiment analysis, the input data x_t at time t is specifically the input data of the emotion to be analyzed at time t, which can be voice input data or text input data, and through the operation of the LSTM network the analysis result can be output.
  • The first cache 10 is specifically used to: cyclically switch between the first state and the second state; in the first state it outputs W_{xi}, W_{xf}, W_{xo} and W_{xc} in 4 parallel channels, each with parallelism k, and in the second state it outputs W_{hi}, W_{hf}, W_{ho} and W_{hc} in 4 parallel channels, each with parallelism k;
  • the second cache 20 is specifically used to: cyclically switch between the first state and the second state, outputting x_t in the first state and h_{t-1} in the second state;
  • the vector multiplication circuit 30 is specifically used to: upon receiving W_{xi}, W_{xf}, W_{xo} and W_{xc} output by the first cache 10, use 4 groups of multiplication arrays to calculate W_{xi}x_t, W_{xf}x_t, W_{xo}x_t and W_{xc}x_t respectively, and upon receiving W_{hi}, W_{hf}, W_{ho} and W_{hc} output by the first cache 10, use the 4 groups of multiplication arrays to calculate W_{hi}h_{t-1}, W_{hf}h_{t-1}, W_{ho}h_{t-1} and W_{hc}h_{t-1} respectively; the vector multiplication circuit 30 includes 4 groups of multiplication arrays, and each group of multiplication arrays includes k multiplication units;
  • the addition circuit 40 is specifically used to: receive b_i, b_f, b_o and b_c sent by the bias data cache 70, and use the vector cache 80 to calculate W_{xi}x_t + W_{hi}h_{t-1} + b_i, W_{xf}x_t + W_{hf}h_{t-1} + b_f, W_{xo}x_t + W_{ho}h_{t-1} + b_o, and W_{xc}x_t + W_{hc}h_{t-1} + b_c;
  • the activation circuit 50 is specifically used to: perform an activation operation according to the output of the addition circuit 40, and output i_t, f_t, o_t and c̃_t;
  • the state update circuit 60 is specifically used to: obtain c_{t-1} from the cell state cache 90, calculate c_t and h_t according to the output of the activation circuit 50, update c_{t-1} in the cell state cache 90 with c_t once c_t has been calculated, and send h_t to the second cache 20;
  • the bias data cache 70; the vector cache 80; the cell state cache 90;
  • where W_{xi}, W_{xf}, W_{xo} and W_{xc} represent the input gate weight data matrix, the forget gate weight data matrix, the output gate weight data matrix and the cell gate weight data matrix in turn;
  • W_{hi}, W_{hf}, W_{ho} and W_{hc} represent the input gate hidden state weight data matrix, the forget gate hidden state weight data matrix, the output gate hidden state weight data matrix and the cell gate hidden state weight data matrix in turn;
  • b_i, b_f, b_o and b_c represent the input gate bias data, the forget gate bias data, the output gate bias data and the cell gate bias data in turn;
  • x_t represents the input data at time t, h_{t-1} represents the hidden state data at time t-1, h_t represents the hidden state data at time t, c_t represents the cell state at time t, and c_{t-1} represents the cell state at time t-1.
  • Those skilled in the art usually refer to the four gate structures as the input gate, the forget gate, the output gate and the cell gate.
  • That is, W_{x1}, W_{x2}, W_{x3} and W_{x4} are denoted in turn by W_{xi}, W_{xf}, W_{xo} and W_{xc}; as described above, W_{xi}, W_{xf}, W_{xo} and W_{xc} represent the input gate weight data matrix, the forget gate weight data matrix, the output gate weight data matrix and the cell gate weight data matrix in turn.
  • b_1 to b_N described above represent the bias data of the first gate to the Nth gate in turn.
  • W_{h1} to W_{hN} described above represent the hidden state weight data matrices of the first gate to the Nth gate in turn.
  • W_{hi}, W_{hf}, W_{ho} and W_{hc} denote W_{h1}, W_{h2}, W_{h3} and W_{h4} in turn.
  • W_x in this application refers collectively to W_{xi}, W_{xf}, W_{xo} and W_{xc}, and W_h refers collectively to W_{hi}, W_{hf}, W_{ho} and W_{hc}; the same applies hereinafter.
  • Because the first cache 10 switches cyclically between the first state and the second state, and each output has parallelism k, the solution of the present application does not need to compute W_x·x_t and W_h·h_{t-1} together; they can be calculated in a time-division, segmented manner, which helps prevent any part of the system accelerating the LSTM network from stalling, thereby improving efficiency.
  • the first cache 10 includes:
  • the first storage unit 101, used to obtain a target quantity of W_{xi}, a target quantity of W_{xf}, a target quantity of W_{xo} and a target quantity of W_{xc} from off-chip storage;
  • the second storage unit 102, used to obtain a target quantity of W_{hi}, a target quantity of W_{hf}, a target quantity of W_{ho} and a target quantity of W_{hc} from off-chip storage;
  • the first multiplexer 103, connected to the first storage unit 101 and the second storage unit 102 respectively, used to cyclically switch between the first state and the second state, selecting the first storage unit 101 for data output in the first state and the second storage unit 102 for data output in the second state;
  • the first memory 105, the second memory 106, the third memory 107 and the fourth memory 108, all connected to the first multiplexer 103 through the data classifier 104; when the first multiplexer 103 is in the first state they are used in turn to output W_{xi}, W_{xf}, W_{xo} and W_{xc} in parallel, each with parallelism k, and when the first multiplexer 103 is in the second state they are used in turn to output W_{hi}, W_{hf}, W_{ho} and W_{hc} in parallel, each with parallelism k;
  • the target quantity is greater than k.
  • The first storage unit 101 can obtain a target quantity of W_x from off-chip storage, and the second storage unit 102 can obtain a target quantity of W_h from off-chip storage, the target quantity being greater than k. The consideration is that continuously reading a large amount of data onto the FPGA chip at one time reduces the number of communications between the FPGA and the off-chip storage; accordingly, the first storage unit 101 and the second storage unit 102 can be given larger capacities.
  • If the capacity allows, for example when all the data of W_x occupies a total size of 2M, the 2M of data can be stored in the first storage unit 101 at one time; that is, the target quantity is 2M, and it is no longer necessary to fetch W_x from off-chip storage again.
  • In practice, however, the capacity of the FPGA is limited. Suppose W_x occupies a total size of 2M but the capacity of the first storage unit 101 is only 1M. Then the target quantity can be set to 1M and the data read cyclically: one read fetches the first 1M of the 2M, the next read fetches the last 1M, and the cycle repeats, as sketched below.
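  • A minimal sketch of this cyclic, chunked reading (illustrative only; the function name and sizes are assumptions):

```python
def cyclic_weight_reader(total_size, unit_capacity):
    """Yield (start, end) ranges for fetching weights from off-chip storage
    when the on-chip storage unit is smaller than the full weight array; reads
    wrap around so the same weights are re-fetched on later cycles."""
    offset = 0
    while True:
        end = min(offset + unit_capacity, total_size)
        yield (offset, end)
        offset = end % total_size  # wrap back to the start after the last chunk

reader = cyclic_weight_reader(total_size=2 * 1024 * 1024,     # W_x occupies 2M
                              unit_capacity=1 * 1024 * 1024)  # unit holds only 1M
first = next(reader)   # (0, 1048576): the first 1M of the 2M
second = next(reader)  # (1048576, 2097152): the last 1M, then the cycle repeats
```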
  • The first storage unit 101 is used to store W_x and the second storage unit 102 is used to store W_h; together they constitute a ping-pong structure, which ensures high-speed, continuous output of data.
  • The present application implements the switching between W_x and W_h through the first multiplexer 103: when the first multiplexer 103 is in the first state, the first storage unit 101 is selected for data output, that is, W_x is output; when it is in the second state, the second storage unit 102 is selected for data output, that is, W_h is output.
  • Both outputs have parallelism k; that is, not all of W_x and W_h in the first storage unit 101 and the second storage unit 102 are output at once.
  • The dimension of W_x is expressed as N_h × N_x, the dimension of W_h as N_h × N_h, the dimension of x_t as N_x × 1, and the dimension of the bias data B as N_h × 1.
  • For example, suppose the dimension of W_x is 100 × 500, the dimension of W_h is 100 × 100, and the parallelism k is 10.
  • Then the first multiplexer 103 first selects the first 10 data of the first row of W_x; next, the first multiplexer 103 selects the first 10 data of the first row of W_h; after that, the first multiplexer 103 selects the 11th to 20th data of the first row of W_x, then the 11th to 20th data of the first row of W_h, and so on, as sketched below.
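  • A sketch of this alternating schedule (illustrative only; the generator and its names are assumptions, not the patented control logic):

```python
def weight_schedule(W_x_row, W_h_row, k):
    """Alternate between the first state (a k-wide slice of a W_x row) and the
    second state (a k-wide slice of a W_h row), mimicking the first multiplexer."""
    x_pos, h_pos = 0, 0
    while x_pos < len(W_x_row) or h_pos < len(W_h_row):
        if x_pos < len(W_x_row):
            yield ('W_x', W_x_row[x_pos:x_pos + k])  # first state
            x_pos += k
        if h_pos < len(W_h_row):
            yield ('W_h', W_h_row[h_pos:h_pos + k])  # second state
            h_pos += k

# With k = 10, a 500-wide W_x row and a 100-wide W_h row: first the 1st-10th
# data of the W_x row, then the 1st-10th of the W_h row, then the 11th-20th of
# the W_x row, and so on; each chunk feeds one group of k multiplication units.
for source, chunk in weight_schedule(list(range(500)), list(range(100)), k=10):
    pass
```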
  • The bias data B described in this application refers to b_i, b_f, b_o and b_c.
  • Since W_x in this application includes W_{xi}, W_{xf}, W_{xo} and W_{xc}, and the vector multiplication circuit 30 includes 4 groups of multiplication arrays, the data needs to be classified by the data classifier 104 so that the W_{xi}, W_{xf}, W_{xo} and W_{xc} output by the first multiplexer 103 are transferred to different multiplication arrays. The same is true for W_h.
  • The first memory 105, the second memory 106, the third memory 107 and the fourth memory 108 are all FIFO memories. In FIG. 4, FIFO-Wi 105 represents the first memory 105, used to output W_{xi} and W_{hi}; correspondingly, FIFO-Wf 106, FIFO-Wo 107 and FIFO-Wc 108 represent the second memory 106, the third memory 107 and the fourth memory 108 in turn, used to output W_{xf} and W_{hf}, W_{xo} and W_{ho}, and W_{xc} and W_{hc}, respectively.
  • The calculation of the LSTM can be expressed by the following six formulas:
  • Input gate: i_t = σ(W_{xi}x_t + W_{hi}h_{t-1} + b_i)
  • Forget gate: f_t = σ(W_{xf}x_t + W_{hf}h_{t-1} + b_f)
  • Output gate: o_t = σ(W_{xo}x_t + W_{ho}h_{t-1} + b_o)
  • Cell gate: c̃_t = tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)
  • Cell state: c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
  • Hidden state: h_t = o_t ⊙ tanh(c_t)
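  • These six formulas can be checked with a short NumPy sketch (a minimal illustration, not the patented hardware; the dictionary-based parameter layout is an assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the six formulas above; W['xi'] plays the
    role of W_xi, b['i'] the role of b_i, and so on."""
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + b['i'])    # input gate
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + b['f'])    # forget gate
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + b['o'])    # output gate
    c_hat = np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])  # cell gate
    c_t = f_t * c_prev + i_t * c_hat    # cell state: element-wise products
    h_t = o_t * np.tanh(c_t)            # hidden state
    return h_t, c_t
```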
  • W_x and W_h are provided by the first cache 10, and x_t and h_{t-1} are provided by the second cache 20.
  • Both the first storage unit 101 and the second storage unit 102 use the first clock, while the first memory 105, the second memory 106, the third memory 107 and the fourth memory 108 all use the second clock, and the first clock and the second clock are independent of each other. Therefore, when the output rate of any one of the first memory 105, the second memory 106, the third memory 107 and the fourth memory 108 is lower than its input rate, the unsent data can be buffered in that memory; that is, these four memories serve to cache data.
  • In this way, the continuous output of data by the first storage unit 101 and the second storage unit 102 is not affected, which further guarantees the acceleration effect of the solution of the present application.
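  • A software analogy of this decoupling (illustrative only; real dual-clock FIFOs are hardware primitives, and the class below is an assumption for exposition):

```python
from collections import deque

class WeightFifo:
    """Analogy of the FIFO between a storage unit (first clock domain) and a
    multiplication array (second clock domain): when the consumer is slower,
    writes still succeed and the data simply accumulates in the buffer."""
    def __init__(self):
        self.buf = deque()

    def write(self, chunk):   # producer side, first clock domain
        self.buf.append(chunk)

    def read(self):           # consumer side, second clock domain
        return self.buf.popleft() if self.buf else None

fifo = WeightFifo()
fifo.write([1, 2, 3])  # the storage unit keeps streaming...
fifo.write([4, 5, 6])  # ...even while the consumer is busy
assert fifo.read() == [1, 2, 3]
```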
  • the second cache 20 may specifically include:
  • the third storage unit 201, used to obtain x_t at odd-numbered moments from off-chip storage;
  • the fourth storage unit 202, used to obtain x_t at even-numbered moments from off-chip storage;
  • the second multiplexer 205, connected to the third storage unit 201 and the fourth storage unit 202 respectively, used to cyclically switch between the first state and the second state, selecting the third storage unit 201 for data output in the first state and the fourth storage unit 202 for data output in the second state;
  • the fifth storage unit 203, used to obtain h_0 and h_t at even-numbered moments through the third multiplexer 206;
  • the sixth storage unit 204, used to obtain h_t at odd-numbered moments through the third multiplexer 206;
  • the fourth multiplexer 207, used to cyclically switch between the first state and the second state, selecting the fifth storage unit 203 for data output in the first state and the sixth storage unit 204 for data output in the second state;
  • the fifth multiplexer 208, used to cyclically switch between the first state and the second state, selecting the second multiplexer 205 for data output in the first state and the fourth multiplexer 207 for data output in the second state.
  • The storage units are all BRAM storage units; that is, in FIGS. 4 and 5, the first storage unit 101, the second storage unit 102, the third storage unit 201, the fourth storage unit 202, the fifth storage unit 203 and the sixth storage unit 204 are represented in turn as the first BRAM 101, the second BRAM 102, the third BRAM 201, the fourth BRAM 202, the fifth BRAM 203 and the sixth BRAM 204.
  • The third storage unit 201 is used to obtain x_t at odd-numbered moments from off-chip storage, and the fourth storage unit 202 is used to obtain x_t at even-numbered moments from off-chip storage, considering that a single storage unit cannot perform read and write operations on x_t simultaneously, which is not conducive to high-speed continuous data output. The third storage unit 201 and the fourth storage unit 202 therefore form a ping-pong structure, which is conducive to high-speed continuous output of data.
  • The fifth storage unit 203 and the sixth storage unit 204 also form a ping-pong structure, which likewise helps realize high-speed, continuous output of data.
  • The hidden state data all come from the state update circuit 60.
  • In this embodiment x_t is divided by odd-numbered and even-numbered moments between the third storage unit 201 and the fourth storage unit 202, as sketched below. In other embodiments other division manners can be used without affecting the implementation of the present invention; for example, in a specific scenario, x_t at the first, second and third moments are all placed in the third storage unit 201, x_t at the next three moments are all placed in the fourth storage unit 202, x_t at the following three moments are again placed in the third storage unit 201, and so on in a cycle.
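  • A sketch of the odd/even division (illustrative; the helper function is an assumption):

```python
def pingpong_store(x_seq):
    """Place x_t at odd time steps in the third storage unit and x_t at even
    time steps in the fourth, so one unit can be written while the other is read."""
    third, fourth = [], []
    for t, x_t in enumerate(x_seq, start=1):  # time steps counted from 1
        (third if t % 2 == 1 else fourth).append((t, x_t))
    return third, fourth

third, fourth = pingpong_store(['x1', 'x2', 'x3', 'x4'])
# third  -> [(1, 'x1'), (3, 'x3')]   (odd-numbered moments)
# fourth -> [(2, 'x2'), (4, 'x4')]   (even-numbered moments)
```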
  • In the foregoing example, the first multiplexer 103 first selects the first 10 data of the first row of W_x. At the same time the fifth multiplexer 208 is in the first state, that is, it selects the second multiplexer 205 for data output; the second multiplexer 205 is also in the first state, that is, it selects the third storage unit 201 for data output, which in this specific scenario means the first 10 data of x_t at the first moment, that is, the first 10 data of x_1. At this point the vector multiplication circuit 30 multiplies the first 10 data of the first row of W_x with the first 10 data of x_1.
  • Next, the first multiplexer 103 selects the first 10 data of the first row of W_h; at the same time the fifth multiplexer 208 is in the second state, that is, it selects the fourth multiplexer 207 for data output. The fourth multiplexer 207 is in the first state, that is, it selects the fifth storage unit 203 for data output, which in this specific scenario means the first 10 data of h_0. At this point the vector multiplication circuit 30 multiplies the first 10 data of the first row of W_h with the first 10 data of h_0.
  • Then the first multiplexer 103 selects the 11th to 20th data of the first row of W_x; the fifth multiplexer 208 is again in the first state, selecting the second multiplexer 205 for data output, and the second multiplexer 205 is still in the first state, selecting the third storage unit 201 for data output, which means the 11th to 20th data of x_1. At this point the vector multiplication circuit 30 multiplies the 11th to 20th data of the first row of W_x with the 11th to 20th data of x_1.
  • The vector multiplication circuit 30 of the present application includes 4 identical groups of multiplication arrays, each group including k multiplication units. Referring to FIG. 6, which is a schematic structural diagram of one group of multiplication arrays, each PE in FIG. 6 is a multiplication unit and completes one multiplication operation. For example, in the foregoing embodiment where the value of k is 10, each group of multiplication arrays includes 10 PEs.
  • As before, the dimension of W_x is expressed as N_h × N_x, the dimension of W_h as N_h × N_h, the dimension of x_t as N_x × 1, and the dimension of the bias data B as N_h × 1.
  • For example, suppose W_x is a matrix with 3 rows and 5 columns, x_t is a vector with 5 rows and 1 column, and k is 5. Then x_t is fed to the multiplication array 3 times, once for each row of W_x; each pass produces one element of the result, and after 3 passes the vector V_x, the calculation result of the entire W_x·x_t, is obtained, as the sketch below shows.
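  • A worked version of this toy example (illustrative values; the loop stands in for the k parallel PEs and the adder circuit):

```python
import numpy as np

W_x = np.arange(15).reshape(3, 5)   # 3 rows, 5 columns
x_t = np.ones(5)                    # 5 rows, 1 column
k = 5                               # one full row fits in a single pass

V_x = np.empty(3)
for row in range(3):                # x_t is fed once per row of W_x
    products = W_x[row] * x_t       # k parallel multiplications (the k PEs)
    V_x[row] = products.sum()       # the adder circuit sums the k products
assert np.allclose(V_x, W_x @ x_t)  # after 3 passes, V_x equals W_x @ x_t
```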
  • It can be seen that the structure of the vector multiplication circuit 30 of the present application is very simple. When the size structure of the LSTM network changes, that is, when the values of N_h and N_x change, the present application only needs to change the value of k to adapt well to LSTM networks of different size structures.
  • The addition circuit 40 of the present application is used to receive b_i, b_f, b_o and b_c sent by the bias data cache 70 and to use the vector cache 80 to calculate W_{xi}x_t + W_{hi}h_{t-1} + b_i, W_{xf}x_t + W_{hf}h_{t-1} + b_f, W_{xo}x_t + W_{ho}h_{t-1} + b_o, and W_{xc}x_t + W_{hc}h_{t-1} + b_c.
  • The addition circuit 40 needs the vector cache 80 because each group of multiplication arrays of the vector multiplication circuit 30 outputs only k results at a time rather than all the results of W_x·x_t or W_h·h_{t-1}; that is, what the addition circuit 40 obtains each time is a partial sum of the matrix-vector multiplication.
  • the adding circuit 40 may include:
  • 4 groups of adder circuits, where each group of adder circuits is used to sum the k input data;
  • the vector addition circuit 401, connected to the outputs of the 4 groups of adder circuits, used to receive b_i, b_f, b_o and b_c sent by the bias data cache 70 and, according to the output of each group of adder circuits and using the vector cache 80, to calculate W_{xi}x_t + W_{hi}h_{t-1} + b_i, W_{xf}x_t + W_{hf}h_{t-1} + b_f, W_{xo}x_t + W_{ho}h_{t-1} + b_o, and W_{xc}x_t + W_{hc}h_{t-1} + b_c.
  • Taking the calculation of W_{xi}x_t as an example, each summation of the k data output by a multiplication array is called one accumulation. After ⌈N_x/k⌉ accumulations, one element of the output vector W_{xi}x_t is obtained, and after N_h · ⌈N_x/k⌉ accumulations the vector W_{xi}x_t of dimension N_h is obtained, that is, the calculation result V_{xi} of the entire W_{xi}x_t.
  • The calculation of W_{hi}h_{t-1} proceeds in the same way.
  • After W_{xi}x_t and W_{hi}h_{t-1} are obtained, W_{xi}x_t, W_{hi}h_{t-1} and b_i are summed, which completes the calculation of W_{xi}x_t + W_{hi}h_{t-1} + b_i, as sketched below.
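  • A sketch of this partial-sum accumulation (illustrative; matvec_partial and the toy sizes are assumptions):

```python
import numpy as np

def matvec_partial(W, v, k):
    """Compute W @ v in k-wide segments, accumulating partial sums the way the
    adder circuits do with the vector cache (one accumulation per k products)."""
    n_rows, n_cols = W.shape
    out = np.zeros(n_rows)                 # plays the role of the vector cache
    for row in range(n_rows):
        for start in range(0, n_cols, k):  # ceil(n_cols / k) accumulations
            seg = slice(start, start + k)
            out[row] += np.dot(W[row, seg], v[seg])  # partial sum of k products
    return out

rng = np.random.default_rng(1)
W_xi, x_t = rng.standard_normal((8, 20)), rng.standard_normal(20)
W_hi, h_prev = rng.standard_normal((8, 8)), rng.standard_normal(8)
b_i = rng.standard_normal(8)
gate_in = matvec_partial(W_xi, x_t, k=4) + matvec_partial(W_hi, h_prev, k=4) + b_i
assert np.allclose(gate_in, W_xi @ x_t + W_hi @ h_prev + b_i)
```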
  • The activation circuit 50 can usually complete the activation operations of the four gate structures at the same time; that is, in a specific implementation of the present invention, the activation circuit 50 is specifically used to perform the sigmoid activation operation and the tanh activation operation according to the output of the addition circuit 40, and output i_t, f_t, o_t and c̃_t.
  • The sigmoid activation operation corresponds to the σ symbol in the aforementioned six formulas of the LSTM calculation, and the tanh activation operation corresponds to the tanh symbol in those formulas.
  • The state update circuit 60 uses c_t to update c_{t-1} in the cell state cache 90, for calculating c_t at the next time step.
  • The cell state of the first time step can come from off-chip storage; that is, c_0 can come from off-chip storage.
  • the vector multiplication circuit 30 is in the first pipeline
  • the addition circuit 40 is in the second pipeline
  • the activation circuit and the state update circuit 60 are in the third pipeline
  • the first pipeline, the second pipeline and the third pipeline run in parallel.
  • The vector multiplication circuit 30 is arranged in the first pipeline, the addition circuit 40 is arranged in the second pipeline, and the activation circuit 50 and the state update circuit 60 are arranged in the third pipeline; the first pipeline, the second pipeline and the third pipeline all run in parallel. In this way, the subsequent addition operation can start without waiting for all the results of W_h·h_{t-1}.
  • While h_t of the current time step is still being produced, the vector multiplication circuit 30 has already started the multiplication operations of the next time step, and the addition circuit 40 likewise sums the partial sums as soon as they arrive, so that no part of the system of the present application needs to pause; that is, the aforementioned dependence is eliminated by the pipeline design, further improving the operation efficiency of the LSTM network.
  • For example, after x_1 has been consumed, the vector multiplication circuit 30 starts the operation of the next time step, that is, the operation of W_x·x_2, followed by the operation of W_h·h_1, and so on, until all time steps have been calculated, that is, x_t at every moment has been processed and the LSTM network completes the business process. A schedule sketch follows.
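  • An illustrative schedule for the three pipelines (a simplification that ignores exact cycle counts and the detail that W_x·x_{t+1} can begin even before h_t is ready; all names are assumptions):

```python
def pipeline_schedule(num_steps):
    """Print which time step each pipeline stage works on in each cycle:
    while stage 3 finishes step t, stage 1 is already on step t + 1."""
    stages = {1: 'multiply', 2: 'add', 3: 'activate/update'}
    for cycle in range(1, num_steps + 3):
        active = {s: cycle - s + 1 for s in stages
                  if 1 <= cycle - s + 1 <= num_steps}
        print(f"cycle {cycle}: " +
              ", ".join(f"stage {s} ({stages[s]}) -> t={t}"
                        for s, t in active.items()))

pipeline_schedule(3)
# cycle 1: stage 1 (multiply) -> t=1
# cycle 2: stage 1 (multiply) -> t=2, stage 2 (add) -> t=1
# cycle 3: stage 1 (multiply) -> t=3, stage 2 (add) -> t=2, stage 3 (activate/update) -> t=1
# ...
```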
  • The embodiment of the present invention further provides a method for accelerating an RNN network, which can be cross-referenced with the system described above.
  • The method for accelerating the RNN network can be applied to the system for accelerating the RNN network in any of the above-mentioned embodiments, and includes:
  • Step 1: the first cache switches cyclically between the first state and the second state; in the first state it outputs W_{x1} to W_{xN} in N parallel channels, each with parallelism k, and in the second state it outputs W_{h1} to W_{hN} in N parallel channels, each with parallelism k; N is a positive integer ≥ 2.
  • Step 2: the second cache switches cyclically between the first state and the second state, outputting x_t in the first state and h_{t-1} in the second state.
  • Step 3: upon receiving W_{x1} to W_{xN} output by the first cache, the vector multiplication circuit uses N groups of multiplication arrays to calculate W_{x1}x_t to W_{xN}x_t respectively, and upon receiving W_{h1} to W_{hN} output by the first cache, it uses the N groups of multiplication arrays to calculate W_{h1}h_{t-1} to W_{hN}h_{t-1} respectively; the vector multiplication circuit includes N groups of multiplication arrays, and each group of multiplication arrays includes k multiplication units.
  • Step 4: the addition circuit receives b_1 to b_N sent by the bias data cache, and uses the vector cache to calculate W_{x1}x_t + W_{h1}h_{t-1} + b_1 to W_{xN}x_t + W_{hN}h_{t-1} + b_N.
  • Step 5: the activation circuit performs an activation operation according to the output of the addition circuit.
  • Step 6: the state update circuit obtains c_{t-1} from the cell state cache, calculates c_t and h_t according to the output of the activation circuit, updates c_{t-1} in the cell state cache with c_t once c_t has been calculated, and sends h_t to the second cache.
  • Here W_{x1} to W_{xN} represent the weight data matrices of the first gate to the Nth gate in turn;
  • W_{h1} to W_{hN} represent the hidden state weight data matrices of the first gate to the Nth gate in turn;
  • b_1 to b_N represent the bias data of the first gate to the Nth gate in turn;
  • x_t represents the input data at time t,
  • h_{t-1} represents the hidden state data at time t-1,
  • h_t represents the hidden state data at time t,
  • c_t represents the cell state at time t,
  • c_{t-1} represents the cell state at time t-1.
  • Step 1 is specifically: the first cache switches cyclically between the first state and the second state; in the first state it outputs W_{xi}, W_{xf}, W_{xo} and W_{xc} in 4 parallel channels, each with parallelism k, and in the second state it outputs W_{hi}, W_{hf}, W_{ho} and W_{hc} in 4 parallel channels, each with parallelism k.
  • Step 2 is specifically: the second cache switches cyclically between the first state and the second state, outputting x_t in the first state and h_{t-1} in the second state.
  • Step 3 is specifically: upon receiving W_{xi}, W_{xf}, W_{xo} and W_{xc} output by the first cache, the vector multiplication circuit uses 4 groups of multiplication arrays to calculate W_{xi}x_t, W_{xf}x_t, W_{xo}x_t and W_{xc}x_t respectively, and upon receiving W_{hi}, W_{hf}, W_{ho} and W_{hc} output by the first cache, it uses the 4 groups of multiplication arrays to calculate W_{hi}h_{t-1}, W_{hf}h_{t-1}, W_{ho}h_{t-1} and W_{hc}h_{t-1} respectively; the vector multiplication circuit includes 4 groups of multiplication arrays, and each group of multiplication arrays includes k multiplication units.
  • Step 4 is specifically: the addition circuit receives b_i, b_f, b_o and b_c sent by the bias data cache, and uses the vector cache to calculate W_{xi}x_t + W_{hi}h_{t-1} + b_i, W_{xf}x_t + W_{hf}h_{t-1} + b_f, W_{xo}x_t + W_{ho}h_{t-1} + b_o, and W_{xc}x_t + W_{hc}h_{t-1} + b_c.
  • Step 5 is specifically: the activation circuit performs an activation operation according to the output of the addition circuit, and outputs i_t, f_t, o_t and c̃_t.
  • Step 6 is specifically: the state update circuit obtains c_{t-1} from the cell state cache, calculates c_t and h_t according to the output of the activation circuit, updates c_{t-1} in the cell state cache with c_t once c_t has been calculated, and sends h_t to the second cache.
  • Here W_{xi}, W_{xf}, W_{xo} and W_{xc} represent the input gate weight data matrix, the forget gate weight data matrix, the output gate weight data matrix and the cell gate weight data matrix in turn;
  • W_{hi}, W_{hf}, W_{ho} and W_{hc} represent the input gate hidden state weight data matrix, the forget gate hidden state weight data matrix, the output gate hidden state weight data matrix and the cell gate hidden state weight data matrix in turn;
  • b_i, b_f, b_o and b_c represent the input gate bias data, the forget gate bias data, the output gate bias data and the cell gate bias data in turn;
  • x_t represents the input data at time t, h_{t-1} represents the hidden state data at time t-1, h_t represents the hidden state data at time t, c_t represents the cell state at time t, and c_{t-1} represents the cell state at time t-1.
  • the vector multiplication circuit is in the first pipeline
  • the adding circuit is in the second pipeline
  • the activation circuit and the state update circuit are in the third pipeline
  • the first pipeline, the second pipeline and the third pipeline run in parallel.
  • step 1 includes:
  • the first storage unit obtains the target quantity of W xi , the target quantity of W xf , the target quantity of W xo and the target quantity of W xc from the off-chip storage;
  • the second storage unit obtains the target amount of W hi , the target amount of W hf , the target amount of W ho and the target amount of W hc from the off-chip storage;
  • the first multiplexer, connected to the first storage unit and the second storage unit respectively, realizes cyclic switching between the first state and the second state, selecting the first storage unit for data output in the first state and the second storage unit for data output in the second state;
  • the first memory, the second memory, the third memory and the fourth memory are all connected to the first multiplexer through the data classifier; when the first multiplexer is in the first state, they output $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$ in parallel in sequence, each with a degree of parallelism of k, and when the first multiplexer is in the second state, they output $W_{hi}$, $W_{hf}$, $W_{ho}$ and $W_{hc}$ in parallel in sequence, each with a degree of parallelism of k;
  • the target quantity is greater than k.
  • both the first storage unit and the second storage unit use a first clock;
  • the first memory, the second memory, the third memory and the fourth memory all use a second clock;
  • the first clock and the second clock are independent of each other, so that when the output rate of any one of the first memory, the second memory, the third memory and the fourth memory is lower than its input rate, the unsent data is buffered in that memory.
  • step 2 includes:
  • the third storage unit obtains $x_t$ at odd-numbered moments from off-chip storage;
  • the fourth storage unit obtains $x_t$ at even-numbered moments from off-chip storage;
  • the second multiplexer, connected to the third storage unit and the fourth storage unit respectively, realizes cyclic switching between the first state and the second state, selecting the third storage unit for data output in the first state and the fourth storage unit for data output in the second state;
  • the third multiplexer obtains $h_0$ from off-chip storage and receives the $h_t$ sent by the state update circuit, selecting $h_0$ only on the first selection; $h_0$ represents the hidden state data at time t=1;
  • the fifth storage unit obtains $h_t$ at even-numbered moments, as well as $h_0$, through the third multiplexer;
  • the sixth storage unit obtains $h_t$ at odd-numbered moments through the third multiplexer;
  • the fourth multiplexer realizes cyclic switching between the first state and the second state, selecting the fifth storage unit for data output in the first state and the sixth storage unit for data output in the second state;
  • the fifth multiplexer realizes cyclic switching between the first state and the second state, selecting the second multiplexer for data output in the first state and the fourth multiplexer for data output in the second state.
  • step 4 includes:
  • 4 groups of $\log_2 k$-stage adder circuits, each group of adder circuits summing the k input data;
  • a vector addition circuit, connected to the outputs of the 4 groups of adder circuits, receives $b_i$, $b_f$, $b_o$ and $b_c$ sent by the bias data buffer and, according to the output of each group of adder circuits, uses the vector buffer to compute $W_{xi}x_t+W_{hi}h_{t-1}+b_i$, $W_{xf}x_t+W_{hf}h_{t-1}+b_f$, $W_{xo}x_t+W_{ho}h_{t-1}+b_o$ and $W_{xc}x_t+W_{hc}h_{t-1}+b_c$.
  • step 5 includes:
  • the activation circuit performs a sigmoid activation operation and a tanh activation operation according to the output of the adding circuit, and outputs $i_t$, $f_t$, $o_t$ and $\tilde{c}_t$;
  • step 6 includes:
  • the state update circuit obtains $c_{t-1}$ from the cell state cache, calculates $c_t$ and $h_t$ according to the output of the activation circuit, uses $c_t$ to update $c_{t-1}$ in the cell state cache after calculating $c_t$, and sends $h_t$ to the second buffer;
  • $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$; $h_t = o_t \odot \tanh(c_t)$; $\odot$ denotes the dot product (element-wise multiplication).
  • an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method for accelerating the LSTM network described in any of the foregoing embodiments, which can be cross-referenced with the above.
  • the computer-readable storage medium mentioned here includes random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Neurology (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A system, method and storage medium for accelerating an RNN network, comprising: a first buffer for outputting, by cyclic switching, $W_{x1}$ to $W_{xN}$ or $W_{h1}$ to $W_{hN}$ in N parallel channels, each with a degree of parallelism of k; a second buffer for outputting $x_t$ or $h_{t-1}$ by cyclic switching; a vector multiplication circuit for calculating $W_{x1}x_t$ to $W_{xN}x_t$, or $W_{h1}h_{t-1}$ to $W_{hN}h_{t-1}$, with N groups of multiplication arrays; an adding circuit for computing $W_{x1}x_t+W_{h1}h_{t-1}+b_1$ to $W_{xN}x_t+W_{hN}h_{t-1}+b_N$; an activation circuit for performing an activation operation according to the output of the adding circuit; a state update circuit for obtaining $c_{t-1}$, calculating $c_t$ and $h_t$, updating $c_{t-1}$ and sending $h_t$ to the second buffer; a bias data buffer; a vector buffer; and a cell state cache. The solution of the present application effectively accelerates the RNN network and offers strong flexibility and scalability.

Description

System, Method and Storage Medium for Accelerating an RNN Network
This application claims priority to the Chinese patent application No. 202011023267.4, filed with the China Patent Office on September 25, 2020 and entitled "System, Method and Storage Medium for Accelerating an RNN Network", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the technical field of neural networks, and in particular to a system, a method and a storage medium for accelerating an RNN network.
Background
An RNN (Recurrent Neural Network) is a neural network for processing sequence data and one of the most promising tools in deep learning today, widely applied in speech recognition, machine translation, text generation and other fields. It solves the problem that traditional neural networks cannot share positional features across the data. In traditional neural network models such as CNNs and DNNs, the layers from the input layer through the hidden layer to the output layer are fully connected, while the nodes within a layer are unconnected. Such ordinary neural networks are helpless for many problems. For example, predicting the next word of a sentence generally requires the preceding words, because the words in a sentence are not independent of one another. An RNN is called a recurrent neural network because the current output of a sequence is also related to the outputs before it. Concretely, the network memorizes the preceding information and applies it to the computation of the current output; that is, the nodes between hidden layers are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at earlier moments.
Fig. 1 is a standard RNN structure diagram. Each arrow represents one transformation, i.e., the arrow connections carry weights. The left side is the folded form and the right side the unfolded form; the arrow beside h on the left represents the "recurrence" in this structure. x is the input, h is the hidden unit, o is the output, L is the loss function, and y is the label of the training set. The superscript t on these elements denotes the state at time t. It can be seen that the behavior of the hidden unit h at time t is determined not only by the input at that moment but is also influenced by the moments before t. V, W and U are weights; connections of the same type share the same weight. One of the key points of an RNN is that it can connect previous information to the current task.
GRU and LSTM are the most commonly used RNN networks. An LSTM (Long Short-Term Memory network) can solve the long-term dependency problem and is suitable for processing and predicting important events with very long intervals and delays in a time series.
Fig. 2 is a schematic diagram of the LSTM structure and its calculation formulas. The LSTM removes information from, or adds information to, the "cell state" through "gate" structures, thereby retaining important content and removing unimportant content. A sigmoid layer outputs a probability value between 0 and 1 that describes how much of each part may pass: 0 means "let no variable through" and 1 means "let all variables through". The gate structures involved are the input gate $i_t$, the forget gate $f_t$, the output gate $o_t$ and the cell gate $\tilde{c}_t$.
As RNNs are applied ever more widely in speech recognition, machine translation, language modeling, sentiment analysis, text prediction and other fields, the requirements on RNN networks keep rising. Faced with increasingly complex networks with ever larger model parameters, accelerating the RNN network in a suitable way is therefore very important.
In summary, how to effectively accelerate an RNN network, reduce time consumption and improve operating efficiency is a technical problem urgently to be solved by those skilled in the art.
Summary of the Invention
The object of the present invention is to provide a system, a method and a storage medium for accelerating an RNN network, so as to effectively accelerate the RNN network, reduce time consumption and improve operating efficiency.
To solve the above technical problem, the present invention provides the following technical solutions:
A system for accelerating an RNN network, comprising:
a first buffer for cyclically switching between a first state and a second state, outputting, in the first state, $W_{x1}$ to $W_{xN}$ in N parallel channels, each with a degree of parallelism of k, and, in the second state, $W_{h1}$ to $W_{hN}$ in N parallel channels, each with a degree of parallelism of k; N is a positive integer ≥ 2;
a second buffer for cyclically switching between a first state and a second state, outputting $x_t$ in the first state and $h_{t-1}$ in the second state;
a vector multiplication circuit for calculating, upon receiving $W_{x1}$ to $W_{xN}$ output by the first buffer, $W_{x1}x_t$ to $W_{xN}x_t$ with N groups of multiplication arrays respectively, and, upon receiving $W_{h1}$ to $W_{hN}$ output by the first buffer, $W_{h1}h_{t-1}$ to $W_{hN}h_{t-1}$ with the N groups of multiplication arrays respectively; the vector multiplication circuit comprises N groups of multiplication arrays, and each group of multiplication arrays comprises k multiplication units;
an adding circuit for receiving $b_1$ to $b_N$ sent by a bias data buffer, and for computing $W_{x1}x_t+W_{h1}h_{t-1}+b_1$ to $W_{xN}x_t+W_{hN}h_{t-1}+b_N$ with the aid of a vector buffer;
an activation circuit for performing an activation operation according to the output of the adding circuit;
a state update circuit for obtaining $c_{t-1}$ from a cell state cache, calculating $c_t$ and $h_t$ according to the output of the activation circuit, updating $c_{t-1}$ in the cell state cache with $c_t$ once $c_t$ has been calculated, and sending $h_t$ to the second buffer;
the bias data buffer; the vector buffer; the cell state cache;
wherein $W_{x1}$ to $W_{xN}$ in turn represent the weight data matrices of the first to the N-th gate; $W_{h1}$ to $W_{hN}$ in turn represent the hidden state weight data matrices of the first to the N-th gate; $b_1$ to $b_N$ in turn represent the bias data of the first to the N-th gate; $x_t$ represents the input data at time t, $h_{t-1}$ the hidden state data at time t-1, $h_t$ the hidden state data at time t, $c_t$ the cell state at time t, and $c_{t-1}$ the cell state at time t-1.
Preferably, the RNN network is specifically an LSTM network with N=4, comprising:
the first buffer, specifically for cyclically switching between the first state and the second state, outputting, in the first state, $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$ in 4 parallel channels, each with a degree of parallelism of k, and, in the second state, $W_{hi}$, $W_{hf}$, $W_{ho}$ and $W_{hc}$ in 4 parallel channels, each with a degree of parallelism of k;
the second buffer, specifically for cyclically switching between the first state and the second state, outputting $x_t$ in the first state and $h_{t-1}$ in the second state;
the vector multiplication circuit, specifically for calculating, upon receiving $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$ output by the first buffer, $W_{xi}x_t$, $W_{xf}x_t$, $W_{xo}x_t$ and $W_{xc}x_t$ with 4 groups of multiplication arrays respectively, and, upon receiving $W_{hi}$, $W_{hf}$, $W_{ho}$ and $W_{hc}$ output by the first buffer, $W_{hi}h_{t-1}$, $W_{hf}h_{t-1}$, $W_{ho}h_{t-1}$ and $W_{hc}h_{t-1}$ with the 4 groups of multiplication arrays respectively; the vector multiplication circuit comprises 4 groups of multiplication arrays, and each group of multiplication arrays comprises k multiplication units;
the adding circuit, specifically for receiving $b_i$, $b_f$, $b_o$ and $b_c$ sent by the bias data buffer, and for computing $W_{xi}x_t+W_{hi}h_{t-1}+b_i$, $W_{xf}x_t+W_{hf}h_{t-1}+b_f$, $W_{xo}x_t+W_{ho}h_{t-1}+b_o$ and $W_{xc}x_t+W_{hc}h_{t-1}+b_c$ with the aid of the vector buffer;
the activation circuit, specifically for performing an activation operation according to the output of the adding circuit and outputting $i_t$, $f_t$, $o_t$ and $\tilde{c}_t$;
the state update circuit, specifically for obtaining $c_{t-1}$ from the cell state cache, calculating $c_t$ and $h_t$ according to the output of the activation circuit, updating $c_{t-1}$ in the cell state cache with $c_t$ once $c_t$ has been calculated, and sending $h_t$ to the second buffer;
wherein $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$ in turn represent the input gate weight data matrix, the forget gate weight data matrix, the output gate weight data matrix and the cell gate weight data matrix; $W_{hi}$, $W_{hf}$, $W_{ho}$ and $W_{hc}$ in turn represent the input gate hidden state weight data matrix, the forget gate hidden state weight data matrix, the output gate hidden state weight data matrix and the cell gate hidden state weight data matrix; $b_i$, $b_f$, $b_o$ and $b_c$ in turn represent the input gate bias data, the forget gate bias data, the output gate bias data and the cell gate bias data; $i_t$, $f_t$, $o_t$ and $\tilde{c}_t$ in turn represent the input gate, the forget gate, the output gate and the cell gate; $x_t$ represents the input data at time t, $h_{t-1}$ the hidden state data at time t-1, $h_t$ the hidden state data at time t, $c_t$ the cell state at time t, and $c_{t-1}$ the cell state at time t-1.
Preferably, the vector multiplication circuit is in a first pipeline, the adding circuit is in a second pipeline, the activation circuit and the state update circuit are in a third pipeline, and the first pipeline, the second pipeline and the third pipeline run in parallel.
Preferably, the first buffer comprises:
a first storage unit for obtaining a target quantity of $W_{xi}$, a target quantity of $W_{xf}$, a target quantity of $W_{xo}$ and a target quantity of $W_{xc}$ from off-chip storage;
a second storage unit for obtaining a target quantity of $W_{hi}$, a target quantity of $W_{hf}$, a target quantity of $W_{ho}$ and a target quantity of $W_{hc}$ from off-chip storage;
a first multiplexer, connected to the first storage unit and the second storage unit respectively, for cyclically switching between the first state and the second state, selecting the first storage unit for data output in the first state and the second storage unit for data output in the second state;
a first memory, a second memory, a third memory and a fourth memory, all connected to the first multiplexer through a data classifier, for outputting $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$ in parallel in sequence, each with a degree of parallelism of k, when the first multiplexer is in the first state, and for outputting $W_{hi}$, $W_{hf}$, $W_{ho}$ and $W_{hc}$ in parallel in sequence, each with a degree of parallelism of k, when the first multiplexer is in the second state;
the data classifier;
wherein the target quantity is greater than k.
Preferably, both the first storage unit and the second storage unit use a first clock, the first memory, the second memory, the third memory and the fourth memory all use a second clock, and the first clock and the second clock are independent of each other, so that when the output rate of any one of the first memory, the second memory, the third memory and the fourth memory is lower than its input rate, the unsent data is buffered in that memory.
Preferably, the second buffer comprises:
a third storage unit for obtaining $x_t$ at odd-numbered moments from off-chip storage;
a fourth storage unit for obtaining $x_t$ at even-numbered moments from off-chip storage;
a second multiplexer, connected to the third storage unit and the fourth storage unit respectively, for cyclically switching between the first state and the second state, selecting the third storage unit for data output in the first state and the fourth storage unit for data output in the second state;
a third multiplexer for obtaining $h_0$ from off-chip storage and receiving the $h_t$ sent by the state update circuit, selecting $h_0$ only on the first selection; $h_0$ represents the hidden state data at time t=1;
a fifth storage unit for obtaining $h_t$ at even-numbered moments, as well as $h_0$, through the third multiplexer;
a sixth storage unit for obtaining $h_t$ at odd-numbered moments through the third multiplexer;
a fourth multiplexer for cyclically switching between the first state and the second state, selecting the fifth storage unit for data output in the first state and the sixth storage unit for data output in the second state;
a fifth multiplexer for cyclically switching between the first state and the second state, selecting the second multiplexer for data output in the first state and the fourth multiplexer for data output in the second state.
Preferably, the adding circuit comprises:
4 groups of $\log_2 k$-stage adder circuits, each group of adder circuits being used to sum the k input data;
a vector addition circuit, connected to the outputs of all 4 groups of adder circuits, for receiving $b_i$, $b_f$, $b_o$ and $b_c$ sent by the bias data buffer and, according to the output of each group of adder circuits, computing $W_{xi}x_t+W_{hi}h_{t-1}+b_i$, $W_{xf}x_t+W_{hf}h_{t-1}+b_f$, $W_{xo}x_t+W_{ho}h_{t-1}+b_o$ and $W_{xc}x_t+W_{hc}h_{t-1}+b_c$ with the aid of the vector buffer.
Preferably, the activation circuit is specifically used to perform a sigmoid activation operation and a tanh activation operation according to the output of the adding circuit, and to output $i_t$, $f_t$, $o_t$ and $\tilde{c}_t$.
Preferably, the state update circuit is specifically used to obtain $c_{t-1}$ from the cell state cache, calculate $c_t$ and $h_t$ according to the output of the activation circuit, update $c_{t-1}$ in the cell state cache with $c_t$ once $c_t$ has been calculated, and send $h_t$ to the second buffer;
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$; $h_t = o_t \odot \tanh(c_t)$; $\odot$ denotes the dot product (element-wise multiplication).
A method for accelerating an RNN network, applied in the system for accelerating an RNN network according to any one of the above, comprising:
the first buffer cyclically switches between the first state and the second state, outputting, in the first state, $W_{x1}$ to $W_{xN}$ in N parallel channels, each with a degree of parallelism of k, and, in the second state, $W_{h1}$ to $W_{hN}$ in N parallel channels, each with a degree of parallelism of k; N is a positive integer ≥ 2;
the second buffer cyclically switches between the first state and the second state, outputting $x_t$ in the first state and $h_{t-1}$ in the second state;
upon receiving $W_{x1}$ to $W_{xN}$ output by the first buffer, the vector multiplication circuit calculates $W_{x1}x_t$ to $W_{xN}x_t$ with N groups of multiplication arrays respectively, and upon receiving $W_{h1}$ to $W_{hN}$ output by the first buffer, it calculates $W_{h1}h_{t-1}$ to $W_{hN}h_{t-1}$ with the N groups of multiplication arrays respectively; the vector multiplication circuit comprises N groups of multiplication arrays, and each group of multiplication arrays comprises k multiplication units;
the adding circuit receives $b_1$ to $b_N$ sent by the bias data buffer and computes $W_{x1}x_t+W_{h1}h_{t-1}+b_1$ to $W_{xN}x_t+W_{hN}h_{t-1}+b_N$ with the aid of the vector buffer;
the activation circuit performs an activation operation according to the output of the adding circuit;
the state update circuit obtains $c_{t-1}$ from the cell state cache, calculates $c_t$ and $h_t$ according to the output of the activation circuit, updates $c_{t-1}$ in the cell state cache with $c_t$ once $c_t$ has been calculated, and sends $h_t$ to the second buffer;
wherein $W_{x1}$ to $W_{xN}$ in turn represent the weight data matrices of the first to the N-th gate; $W_{h1}$ to $W_{hN}$ in turn represent the hidden state weight data matrices of the first to the N-th gate; $b_1$ to $b_N$ in turn represent the bias data of the first to the N-th gate; $x_t$ represents the input data at time t, $h_{t-1}$ the hidden state data at time t-1, $h_t$ the hidden state data at time t, $c_t$ the cell state at time t, and $c_{t-1}$ the cell state at time t-1.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above method for accelerating an RNN network.
By applying the technical solution provided by the embodiments of the present invention — specifically, considering that the computation of the gate structures, mainly matrix-vector multiplications, accounts for the overwhelming majority of the computation of the whole RNN network — this application provides a vector multiplication circuit comprising N groups of multiplication arrays, each group comprising k multiplication units, which helps increase the computation speed. Moreover, in traditional solutions the computation of $W_x x_t$ and $W_h h_{t-1}$ is merged into one computation, which becomes very slow when the dimension of $x_t$ or $h_{t-1}$ is large. In the solution of this application, $W_x x_t$ and $W_h h_{t-1}$ are therefore computed in a time-multiplexed, segmented manner; that is, accumulation does not have to wait until all values of $W_x x_t$ and $W_h h_{t-1}$ have been produced, which further improves the acceleration effect of the solution. Specifically, the first buffer cyclically switches between a first state and a second state, outputting, in the first state, $W_{x1}$ to $W_{xN}$ in N parallel channels, each with a degree of parallelism of k, and, in the second state, $W_{h1}$ to $W_{hN}$ in N parallel channels, each with a degree of parallelism of k, N being a positive integer ≥ 2; the second buffer cyclically switches between a first state and a second state, outputting $x_t$ in the first state and $h_{t-1}$ in the second state. Upon receiving $W_{x1}$ to $W_{xN}$ output by the first buffer, the vector multiplication circuit calculates $W_{x1}x_t$ to $W_{xN}x_t$ with N groups of multiplication arrays, and upon receiving $W_{h1}$ to $W_{hN}$, it calculates $W_{h1}h_{t-1}$ to $W_{hN}h_{t-1}$. The adding circuit can then receive $b_1$ to $b_N$ sent by the bias data buffer and compute $W_{x1}x_t+W_{h1}h_{t-1}+b_1$ to $W_{xN}x_t+W_{hN}h_{t-1}+b_N$ with the aid of the vector buffer. In addition, in the solution of this application each group of multiplication arrays comprises k multiplication units, and by setting and adjusting the value of k, the solution can adapt to RNN networks of different sizes, i.e., it has strong flexibility and scalability. In summary, the solution of this application effectively accelerates the RNN network and has strong flexibility and scalability.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a standard RNN structure diagram;
Fig. 2 is a schematic diagram of the LSTM structure and its calculation formulas;
Fig. 3 is a schematic structural diagram of a system for accelerating an RNN network in the present invention;
Fig. 4 is a schematic structural diagram of the first buffer in the present invention;
Fig. 5 is a schematic structural diagram of the second buffer in the present invention;
Fig. 6 is a schematic structural diagram of one group of multiplication arrays in the present invention;
Fig. 7 is a schematic structural diagram of the adding circuit in the present invention;
Fig. 8 is a schematic diagram of pipelined operation in a specific embodiment of the present invention.
Detailed Description
The core of the present invention is to provide a system for accelerating an RNN network, which effectively accelerates the RNN network and has strong flexibility and scalability.
To enable those skilled in the art to better understand the solution of the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Please refer to Fig. 3, which is a schematic structural diagram of a system for accelerating an RNN network in the present invention. The system can be applied in hardware such as FPGAs, ASICs and reconfigurable chips; since an FPGA has the advantages of strong flexibility, configurability and low power consumption, the following description takes an FPGA as an example.
The system for accelerating an RNN network may comprise:
a first buffer 10 for cyclically switching between a first state and a second state, outputting, in the first state, $W_{x1}$ to $W_{xN}$ in N parallel channels, each with a degree of parallelism of k, and, in the second state, $W_{h1}$ to $W_{hN}$ in N parallel channels, each with a degree of parallelism of k; N is a positive integer ≥ 2;
a second buffer 20 for cyclically switching between a first state and a second state, outputting $x_t$ in the first state and $h_{t-1}$ in the second state;
a vector multiplication circuit 30 for calculating, upon receiving $W_{x1}$ to $W_{xN}$ output by the first buffer 10, $W_{x1}x_t$ to $W_{xN}x_t$ with N groups of multiplication arrays respectively, and, upon receiving $W_{h1}$ to $W_{hN}$ output by the first buffer 10, $W_{h1}h_{t-1}$ to $W_{hN}h_{t-1}$ with the N groups of multiplication arrays respectively; the vector multiplication circuit 30 comprises N groups of multiplication arrays, each group comprising k multiplication units;
an adding circuit 40 for receiving $b_1$ to $b_N$ sent by a bias data buffer and computing $W_{x1}x_t+W_{h1}h_{t-1}+b_1$ to $W_{xN}x_t+W_{hN}h_{t-1}+b_N$ with the aid of a vector buffer;
an activation circuit 50 for performing an activation operation according to the output of the adding circuit;
a state update circuit 60 for obtaining $c_{t-1}$ from a cell state cache, calculating $c_t$ and $h_t$ according to the output of the activation circuit 50, updating $c_{t-1}$ in the cell state cache with $c_t$ once $c_t$ has been calculated, and sending $h_t$ to the second buffer;
a bias data buffer 70; a vector buffer 80; a cell state cache 90;
wherein $W_{x1}$ to $W_{xN}$ in turn represent the weight data matrices of the first to the N-th gate; $W_{h1}$ to $W_{hN}$ in turn represent the hidden state weight data matrices of the first to the N-th gate; $b_1$ to $b_N$ in turn represent the bias data of the first to the N-th gate; $x_t$ represents the input data at time t, $h_{t-1}$ the hidden state data at time t-1, $h_t$ the hidden state data at time t, $c_t$ the cell state at time t, and $c_{t-1}$ the cell state at time t-1.
The value of N can be set according to the actual situation. For example, GRU and LSTM are the most commonly used RNN networks: a GRU has 2 gate structures, i.e., N=2, while an LSTM network has 4 gate structures, so N=4.
RNN networks can be applied in speech recognition, character recognition, text translation, language modeling, sentiment analysis, text prediction and other fields. The LSTM network in particular has found ever wider application thanks to its excellent properties.
In the remainder of this application, the description is given specifically for the LSTM network.
Through the RNN network, the input data can be processed to finally obtain an output result. For example, when the network is specifically an LSTM network applied to speech recognition, the input data $x_t$ at time t is the speech input data to be recognized at time t, and the recognition by the LSTM network outputs the speech recognition result. When the LSTM network is applied to character recognition, $x_t$ at time t is the image input data carrying the characters to be recognized at time t, and the recognition by the LSTM network outputs the character recognition result. When the LSTM network is applied to text translation, $x_t$ at time t is the text input data to be translated at time t, and the LSTM network outputs the translation result. When the LSTM network is applied to sentiment analysis, $x_t$ at time t is the input data whose sentiment is to be analyzed at time t — which may be speech input data or text input data — and the LSTM network outputs the analysis result.
In the embodiment of Fig. 3 of this application, the description is given for an LSTM network; that is, the system of Fig. 3 is specifically a system for accelerating an LSTM network with N=4, and may comprise:
the first buffer 10, specifically for cyclically switching between the first state and the second state, outputting, in the first state, $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$ in 4 parallel channels, each with a degree of parallelism of k, and, in the second state, $W_{hi}$, $W_{hf}$, $W_{ho}$ and $W_{hc}$ in 4 parallel channels, each with a degree of parallelism of k;
the second buffer 20, specifically for cyclically switching between the first state and the second state, outputting $x_t$ in the first state and $h_{t-1}$ in the second state;
the vector multiplication circuit 30, specifically for calculating, upon receiving $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$ output by the first buffer 10, $W_{xi}x_t$, $W_{xf}x_t$, $W_{xo}x_t$ and $W_{xc}x_t$ with 4 groups of multiplication arrays respectively, and, upon receiving $W_{hi}$, $W_{hf}$, $W_{ho}$ and $W_{hc}$ output by the first buffer 10, $W_{hi}h_{t-1}$, $W_{hf}h_{t-1}$, $W_{ho}h_{t-1}$ and $W_{hc}h_{t-1}$ with the 4 groups of multiplication arrays respectively; the vector multiplication circuit 30 comprises 4 groups of multiplication arrays, each group comprising k multiplication units;
the adding circuit 40, specifically for receiving $b_i$, $b_f$, $b_o$ and $b_c$ sent by the bias data buffer 70, and computing $W_{xi}x_t+W_{hi}h_{t-1}+b_i$, $W_{xf}x_t+W_{hf}h_{t-1}+b_f$, $W_{xo}x_t+W_{ho}h_{t-1}+b_o$ and $W_{xc}x_t+W_{hc}h_{t-1}+b_c$ with the aid of the vector buffer 80;
the activation circuit 50, specifically for performing an activation operation according to the output of the adding circuit 40 and outputting $i_t$, $f_t$, $o_t$ and $\tilde{c}_t$;
the state update circuit 60, specifically for obtaining $c_{t-1}$ from the cell state cache 90, calculating $c_t$ and $h_t$ according to the output of the activation circuit 50, updating $c_{t-1}$ in the cell state cache 90 with $c_t$ once $c_t$ has been calculated, and sending $h_t$ to the second buffer 20;
the bias data buffer 70; the vector buffer 80; the cell state cache 90;
wherein $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$ in turn represent the input gate weight data matrix, the forget gate weight data matrix, the output gate weight data matrix and the cell gate weight data matrix; $W_{hi}$, $W_{hf}$, $W_{ho}$ and $W_{hc}$ in turn represent the input gate hidden state weight data matrix, the forget gate hidden state weight data matrix, the output gate hidden state weight data matrix and the cell gate hidden state weight data matrix; $b_i$, $b_f$, $b_o$ and $b_c$ in turn represent the input gate bias data, the forget gate bias data, the output gate bias data and the cell gate bias data; $i_t$, $f_t$, $o_t$ and $\tilde{c}_t$ in turn represent the input gate, the forget gate, the output gate and the cell gate; $x_t$ represents the input data at time t, $h_{t-1}$ the hidden state data at time t-1, $h_t$ the hidden state data at time t, $c_t$ the cell state at time t, and $c_{t-1}$ the cell state at time t-1.
It should be noted that it was described above that $W_{x1}$ to $W_{xN}$ in turn represent the weight data matrices of the first to the N-th gate, while this application specifically takes the LSTM network, i.e., N=4. This means there are $W_{x1}$, $W_{x2}$, $W_{x3}$ and $W_{x4}$, representing in turn the weight data matrices of the first, second, third and fourth gate. In LSTM networks, practitioners usually call the four gate structures the input gate, the forget gate, the output gate and the cell gate; therefore, in the solution of this application, $W_{x1}$, $W_{x2}$, $W_{x3}$ and $W_{x4}$ are denoted $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$ respectively which, as described above, in turn represent the input gate weight data matrix, the forget gate weight data matrix, the output gate weight data matrix and the cell gate weight data matrix.
Similarly, $b_1$ to $b_N$ described above in turn represent the bias data of the first to the N-th gate; specifically in the LSTM, $b_i$, $b_f$, $b_o$ and $b_c$ denote $b_1$ to $b_4$ in turn. Likewise, $W_{h1}$ to $W_{hN}$ described above in turn represent the hidden state weight data matrices of the first to the N-th gate; specifically in the LSTM, $W_{hi}$, $W_{hf}$, $W_{ho}$ and $W_{hc}$ denote $W_{h1}$, $W_{h2}$, $W_{h3}$ and $W_{h4}$ in turn.
Specifically, in the solution of this application, for the LSTM network the output of $W_x$ and $W_h$ is performed through the first buffer 10. For the LSTM network, $W_x$ in this application denotes $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$, and $W_h$ denotes $W_{hi}$, $W_{hf}$, $W_{ho}$ and $W_{hc}$, likewise hereinafter. The first buffer 10 cyclically switches between the first state and the second state, each output having a degree of parallelism of k, so that the solution of this application does not need to merge the computation of $W_x x_t$ and $W_h h_{t-1}$ but can compute them in a time-multiplexed, segmented manner, which helps keep every part of the system for accelerating the LSTM network free of stalls and thereby improves efficiency.
The specific structure of the first buffer 10 can be set and adjusted according to actual needs. For example, in a specific embodiment of the present invention, referring to Fig. 4, the first buffer 10 comprises:
a first storage unit 101 for obtaining a target quantity of $W_{xi}$, a target quantity of $W_{xf}$, a target quantity of $W_{xo}$ and a target quantity of $W_{xc}$ from off-chip storage;
a second storage unit 102 for obtaining a target quantity of $W_{hi}$, a target quantity of $W_{hf}$, a target quantity of $W_{ho}$ and a target quantity of $W_{hc}$ from off-chip storage;
a first multiplexer 103, connected to the first storage unit 101 and the second storage unit 102 respectively, for cyclically switching between the first state and the second state, selecting the first storage unit 101 for data output in the first state and the second storage unit 102 for data output in the second state;
a first memory 105, a second memory 106, a third memory 107 and a fourth memory 108, all connected to the first multiplexer 103 through a data classifier 104, for outputting $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$ in parallel in sequence, each with a degree of parallelism of k, when the first multiplexer 103 is in the first state, and for outputting $W_{hi}$, $W_{hf}$, $W_{ho}$ and $W_{hc}$ in parallel in sequence, each with a degree of parallelism of k, when the first multiplexer 103 is in the second state;
the data classifier 104;
wherein the target quantity is greater than k.
In this embodiment, the first storage unit 101 can obtain a target quantity of $W_x$ from off-chip storage, and the second storage unit 102 a target quantity of $W_h$, the target quantity being greater than k. This takes into account that continuously reading a large amount of data onto the FPGA chip at once reduces the number of communications between the FPGA and the off-chip storage. It will be understood that the first storage unit 101 and the second storage unit 102 can be given a relatively large capacity. Of course, when the capacity permits — for example, if all the data of $W_x$ occupies 2 MB in total — the 2 MB of data can be stored in the first storage unit 101 at once, i.e., the target quantity is 2 MB, after which $W_x$ no longer needs to be fetched from off-chip storage. More commonly, the capacity of the FPGA is limited; for example, if $W_x$ occupies 2 MB in total but the capacity of the first storage unit 101 is only 1 MB, the target quantity can be set to 1 MB and the data read cyclically: the first 1 MB of the 2 MB is read first, then the second 1 MB, and so on in a loop.
Furthermore, this application uses the first storage unit 101 to store $W_x$ and the second storage unit 102 to store $W_h$; the two form a ping-pong structure, which guarantees high-speed, continuous data output.
This application switches between $W_x$ and $W_h$ through the first multiplexer 103. Specifically, when the first multiplexer 103 is in the first state, it selects the first storage unit 101 for data output, i.e., it outputs $W_x$; when in the second state, it selects the second storage unit 102, i.e., it outputs $W_h$. When outputting $W_x$ and $W_h$, the degree of parallelism is k in each case; that is, not all of the $W_x$ and $W_h$ in the first storage unit 101 and the second storage unit 102 is output at once. For example, denote the dimension of $W_x$ as $N_h \times N_x$, the dimension of $W_h$ as $N_h \times N_h$, the dimension of $x_t$ as $N_x \times 1$, and the dimension of the bias data B as $N_h \times 1$. In one specific scenario, the dimension of $W_x$ is 100×500 with a parallelism k of 10, and the dimension of $W_h$ is 100×100 with a parallelism k of 10. In that scenario, the first multiplexer 103 first selects the first 10 data of the first row of $W_x$, then the first 10 data of the first row of $W_h$, then the 11th to 20th data of the first row of $W_x$, then the 11th to 20th data of the first row of $W_h$, and so on; once all the data of $W_x$ has been read, reading of $W_x$ starts again from the beginning, and likewise for $W_h$. The bias data B described in this application denotes $b_i$, $b_f$, $b_o$ and $b_c$.
$W_x$ in this application comprises $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$, while the vector multiplication circuit 30 comprises 4 groups of multiplication arrays; classification by the data classifier 104 is therefore required, i.e., the $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$ output by the first multiplexer 103 must be routed to different multiplication arrays, and likewise for $W_h$. In the embodiment of Fig. 4, the first memory 105, the second memory 106, the third memory 107 and the fourth memory 108 are all FIFO memories: FIFO-Wi 105 in Fig. 4 denotes the first memory 105, used to output $W_{xi}$ and $W_{hi}$; correspondingly, FIFO-Wf 106, FIFO-Wo 107 and FIFO-Wc 108 in Fig. 4 denote the second memory 106, the third memory 107 and the fourth memory 108, used in turn to output $W_{xf}$ and $W_{hf}$, $W_{xo}$ and $W_{ho}$, and $W_{xc}$ and $W_{hc}$.
The computation of the LSTM can be expressed by the following six formulas:
Input gate: $i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$
Forget gate: $f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$
Output gate: $o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$
Cell gate: $\tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$
Cell state: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
Hidden state: $h_t = o_t \odot \tanh(c_t)$
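For reference, the six formulas can be checked against a minimal NumPy sketch of one LSTM time step (toy dimensions and random weights; the dictionary layout keyed by gate letters is purely illustrative, not part of the described hardware):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wx, Wh, b):
    """One time step of the six LSTM formulas above;
    Wx, Wh and b are dicts keyed by gate: 'i', 'f', 'o', 'c'."""
    i_t = sigmoid(Wx['i'] @ x_t + Wh['i'] @ h_prev + b['i'])    # input gate
    f_t = sigmoid(Wx['f'] @ x_t + Wh['f'] @ h_prev + b['f'])    # forget gate
    o_t = sigmoid(Wx['o'] @ x_t + Wh['o'] @ h_prev + b['o'])    # output gate
    c_hat = np.tanh(Wx['c'] @ x_t + Wh['c'] @ h_prev + b['c'])  # cell gate
    c_t = f_t * c_prev + i_t * c_hat    # cell state (element-wise products)
    h_t = o_t * np.tanh(c_t)            # hidden state
    return c_t, h_t

rng = np.random.default_rng(0)
Wx = {g: rng.normal(size=(3, 5)) for g in 'ifoc'}   # N_h = 3, N_x = 5
Wh = {g: rng.normal(size=(3, 3)) for g in 'ifoc'}
b = {g: rng.normal(size=3) for g in 'ifoc'}
c_t, h_t = lstm_step(rng.normal(size=5), np.zeros(3), np.zeros(3), Wx, Wh, b)
print(c_t, h_t)
```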
It can be seen that, when computing the first four formulas, in this application the first buffer 10 supplies $W_x$ and $W_h$ in a time-multiplexed manner, and the second buffer 20 supplies $x_t$ and $h_{t-1}$ in a time-multiplexed manner.
Further, in a specific embodiment of the present invention, the first storage unit 101 and the second storage unit 102 both use a first clock, while the first memory 105, the second memory 106, the third memory 107 and the fourth memory 108 all use a second clock, the first clock and the second clock being independent of each other. Consequently, when the output rate of any one of the first memory 105, the second memory 106, the third memory 107 and the fourth memory 108 falls below its input rate, the unsent data is buffered in that memory; that is, the four memories serve as data buffers. Compared with a single shared first clock, in this embodiment a momentary delay in data output from any of the four memories does not prevent the first storage unit 101 and the second storage unit 102 from outputting data continuously, which further safeguards the acceleration effect of the solution of this application.
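The rate-decoupling role of the FIFOs can be illustrated with a small software model (a plain deque standing in for a dual-clock FIFO; the tick counts and rates are made-up numbers, chosen only to show the backlog being absorbed):

```python
from collections import deque

fifo = deque()               # stands in for, e.g., FIFO-Wi under its own clock
produced = consumed = 0

for tick in range(10):
    fifo.append(f"w{tick}")  # the storage unit keeps streaming every tick
    produced += 1
    if tick % 2 == 0:        # the consumer is temporarily half as fast
        fifo.popleft()
        consumed += 1

# The backlog sits in the FIFO instead of stalling the upstream storage unit
print(produced, consumed, len(fifo))   # -> 10 5 5
```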
In a specific embodiment of the present invention, referring to Fig. 5, the second buffer 20 may specifically comprise:
a third storage unit 201 for obtaining $x_t$ at odd-numbered moments from off-chip storage;
a fourth storage unit 202 for obtaining $x_t$ at even-numbered moments from off-chip storage;
a second multiplexer 205, connected to the third storage unit 201 and the fourth storage unit 202 respectively, for cyclically switching between the first state and the second state, selecting the third storage unit 201 for data output in the first state and the fourth storage unit 202 for data output in the second state;
a third multiplexer 206 for obtaining $h_0$ from off-chip storage and receiving the $h_t$ sent by the state update circuit 60, selecting $h_0$ only on the first selection; $h_0$ represents the hidden state data at time t=1;
a fifth storage unit 203 for obtaining $h_t$ at even-numbered moments, as well as $h_0$, through the third multiplexer 206;
a sixth storage unit 204 for obtaining $h_t$ at odd-numbered moments through the third multiplexer 206;
a fourth multiplexer 207 for cyclically switching between the first state and the second state, selecting the fifth storage unit 203 for data output in the first state and the sixth storage unit 204 for data output in the second state;
a fifth multiplexer 208 for cyclically switching between the first state and the second state, selecting the second multiplexer 205 for data output in the first state and the fourth multiplexer 207 for data output in the second state.
It should be noted that in the embodiments of Fig. 4 and Fig. 5, the storage units are all BRAM storage units; that is, in Fig. 4 and Fig. 5, the first storage unit 101, the second storage unit 102, the third storage unit 201, the fourth storage unit 202, the fifth storage unit 203 and the sixth storage unit 204 are denoted, in turn, the first BRAM 101, the second BRAM 102, the third BRAM 201, the fourth BRAM 202, the fifth BRAM 203 and the sixth BRAM 204.
In the embodiment of Fig. 5, the third storage unit 201 obtains $x_t$ at odd-numbered moments from off-chip storage and the fourth storage unit 202 obtains $x_t$ at even-numbered moments, taking into account that a single storage unit cannot read and write $x_t$ simultaneously, which would hinder high-speed continuous data output; the third storage unit 201 and the fourth storage unit 202 therefore form a ping-pong structure, which facilitates high-speed continuous data output (see the sketch after this passage). The fifth storage unit 203 and the sixth storage unit 204 likewise form a ping-pong structure for the same reason.
The third multiplexer 206 selects $h_0$ only on its first selection. $h_0$ represents the hidden state data at time t=1; that is, the hidden state data $h_0$ of the first time step comes from off-chip storage, while the hidden state data of all other time steps comes from the state update circuit 60.
In addition, it should be pointed out that in this embodiment $x_t$ is partitioned by odd-numbered and even-numbered moments and placed into the third storage unit 201 or the fourth storage unit 202 accordingly; in other embodiments, other partitioning schemes can be used without affecting the implementation of the present invention. For example, in one specific scenario, the $x_t$ of the first, second and third moments are all placed into the third storage unit 201, the $x_t$ of the following three moments into the fourth storage unit 202, the $x_t$ of the three moments after that into the third storage unit 201 again, and so on in a loop.
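A minimal sketch of the parity-based routing described above (illustrative names; time steps are 1-indexed as in the text, and the routing of $h_t$ between the fifth and sixth storage units is analogous):

```python
def route_by_parity(xs):
    """Split x_1, x_2, ... between the third and fourth storage units
    by the parity of the time step (ping-pong writes)."""
    bram3, bram4 = [], []            # third / fourth storage units
    for t, x in enumerate(xs, start=1):
        (bram3 if t % 2 == 1 else bram4).append((t, x))
    return bram3, bram4

odd, even = route_by_parity(['x1', 'x2', 'x3', 'x4', 'x5'])
print(odd)    # [(1, 'x1'), (3, 'x3'), (5, 'x5')]
print(even)   # [(2, 'x2'), (4, 'x4')]
```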
For example, in one specific scenario, the first multiplexer 103 first selects the first 10 data of the first row of $W_x$; at the same time, the fifth multiplexer 208 is in the first state, i.e., it selects the second multiplexer 205 for data output. The second multiplexer 205 is then in the first state, i.e., it selects the third storage unit 201 for data output — in this scenario, the first 10 data of $x_t$ at the first moment, i.e., the first 10 data of $x_1$. In other words, the vector multiplication circuit 30 is at that point computing the product of the first 10 data of the first row of $W_x$ and the first 10 data of $x_1$.
Then the first multiplexer 103 selects the first 10 data of the first row of $W_h$; at the same time, the fifth multiplexer 208 is in the second state, i.e., it selects the fourth multiplexer 207 for data output. The fourth multiplexer 207 is then in the first state, i.e., it selects the fifth storage unit 203 for data output — in this scenario, the first 10 data of $h_0$. In other words, the vector multiplication circuit 30 is at that point computing the product of the first 10 data of the first row of $W_h$ and the first 10 data of $h_0$.
After that, the first multiplexer 103 selects the 11th to 20th data of the first row of $W_x$; at the same time, the fifth multiplexer 208 is in the first state, i.e., it selects the second multiplexer 205 for data output. The second multiplexer 205 is still in the first state, i.e., it selects the third storage unit 201 for data output — in this scenario, the 11th to 20th data of $x_t$ at the first moment, i.e., the 11th to 20th data of $x_1$. In other words, the vector multiplication circuit 30 is at that point computing the product of the 11th to 20th data of the first row of $W_x$ and the 11th to 20th data of $x_1$.
The subsequent process is similar, until the entire computation of $W_x x_t$ and the entire computation of $W_h h_{t-1}$ are completed, and is not repeated here.
The vector multiplication circuit 30 of this application comprises 4 identical groups of multiplication arrays, each group comprising k multiplication units; see Fig. 6, where each PE is one multiplication unit. Fig. 6 shows the structure of one group of multiplication arrays; each PE completes one multiplication operation. For example, when k is 10 as in the foregoing embodiment, each group of multiplication arrays comprises 10 PEs.
It should also be noted that, denoting the dimension of $W_x$ as $N_h \times N_x$, the dimension of $W_h$ as $N_h \times N_h$, the dimension of $x_t$ as $N_x \times 1$ and the dimension of the bias data B as $N_h \times 1$: when computing $W_x x_t$, the weight data $W_x$ must be traversed in $\lceil N_x/k \rceil \times N_h$ chunks of k elements ($\lceil \cdot \rceil$ denotes rounding up), while $x_t$ must be traversed $N_h$ times. In other words, $x_t$ must be reused, and the reuse count is $N_h$.
Expressed with a simple example: if $W_x$ is a matrix of 3 rows and 5 columns, $x_t$ is a vector of 5 rows and 1 column, and k is 5, then $x_t$ is reused 3 times: the first time $x_t$ is multiplied with the first row of $W_x$, the second time with the second row of $W_x$, and the last time with the third row of $W_x$, yielding a vector $V_x$ of 3 rows and 1 column, i.e., the result of the whole computation of $W_x x_t$.
Correspondingly, the process for $W_h h_{t-1}$ is the same and is not repeated here.
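The 3×5 example can be reproduced with a short sketch of the chunked matrix-vector product (a functional model, not the multiplication-array hardware; k is chosen to divide $N_x$ here for simplicity):

```python
import numpy as np

def chunked_matvec(W, x, k):
    """Compute W @ x row by row in k-wide chunks, the way the k
    multiplication units consume data; x is re-read once per row,
    i.e. it is reused N_h times."""
    n_h, n_x = W.shape
    y, reuse = np.zeros(n_h), 0
    for r in range(n_h):             # one traversal of x per output element
        reuse += 1
        for c in range(0, n_x, k):   # partial products, k at a time
            y[r] += W[r, c:c + k] @ x[c:c + k]
    return y, reuse

W = np.arange(15).reshape(3, 5)      # N_h = 3, N_x = 5 as in the example
y, reuse = chunked_matvec(W, np.ones(5), k=5)
print(y, reuse)                      # reuse == N_h == 3
```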
Moreover, it can be seen that the structure of the vector multiplication circuit 30 of this application is very simple: to match the size structure of the LSTM network, i.e., when the values of $N_h$ and $N_x$ change, this application only needs to change the value of k to adapt well to LSTM networks of different sizes.
The adding circuit 40 of this application is used to receive $b_i$, $b_f$, $b_o$ and $b_c$ sent by the bias data buffer 70 and to compute $W_{xi}x_t+W_{hi}h_{t-1}+b_i$, $W_{xf}x_t+W_{hf}h_{t-1}+b_f$, $W_{xo}x_t+W_{ho}h_{t-1}+b_o$ and $W_{xc}x_t+W_{hc}h_{t-1}+b_c$ with the aid of the vector buffer 80.
The adding circuit 40 needs the vector buffer 80 because each group of multiplication arrays of the vector multiplication circuit 30 outputs k results at a time rather than the complete result of $W_x x_t$ or $W_h h_{t-1}$; that is, what the adding circuit 40 obtains each time is a partial sum of the matrix-vector product. In addition, the adding circuit 40 must also complete the additions inside the parentheses of the four formulas: input gate $i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$, forget gate $f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$, output gate $o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$, and cell gate $\tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$.
In a specific embodiment of the present invention, referring to Fig. 7, the adding circuit 40 may comprise:
4 groups of $\log_2 k$-stage adder circuits, each group of adder circuits being used to sum the k input data;
a vector addition circuit 401, connected to the outputs of all 4 groups of adder circuits, for receiving $b_i$, $b_f$, $b_o$ and $b_c$ sent by the bias data buffer 70 and, according to the output of each group of adder circuits, computing $W_{xi}x_t+W_{hi}h_{t-1}+b_i$, $W_{xf}x_t+W_{hf}h_{t-1}+b_f$, $W_{xo}x_t+W_{ho}h_{t-1}+b_o$ and $W_{xc}x_t+W_{hc}h_{t-1}+b_c$ with the aid of the vector buffer 80.
Fig. 7 shows only one group of $\log_2 k$-stage adder circuits connected to the vector addition circuit 401. It should also be noted that k was 10 in the example of the foregoing embodiment, whereas in practical applications the value of k is usually set to a power of two, to avoid some adders in the adder circuit sitting idle, as happens when k is not a power of two. Of course, a value of k that is not a power of two does not prevent the solution from being implemented.
For $W_x x_t$ — taking $W_{xi} x_t$ as the concrete example — calling the summation of each k data output by the multiplication array one accumulation, one number of the final output vector $W_{xi} x_t$ is obtained after $\lceil N_x/k \rceil$ accumulations, and after $\lceil N_x/k \rceil \times N_h$ accumulations the vector $W_{xi} x_t$ of dimension $N_h$ is obtained, i.e., the result $V_{xi}$ of the whole computation of $W_{xi} x_t$. The computation of $W_{hi} h_{t-1}$ proceeds in the same way; once $W_{xi} x_t$ and $W_{hi} h_{t-1}$ have been obtained, $W_{xi} x_t$, $W_{hi} h_{t-1}$ and $b_i$ are summed, thereby completing the computation of $W_{xi} x_t + W_{hi} h_{t-1} + b_i$.
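The $\log_2 k$-stage summation plus the running accumulation can be sketched as follows (pure Python; k is assumed to be a power of two, matching the note above):

```python
def adder_tree(vals):
    """Sum k values in log2(k) pairwise stages, like one adder circuit."""
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]

def accumulate_row(products, k):
    """Feed k products per step through the tree and accumulate the partial
    sums; ceil(N_x / k) accumulations yield one element of W_xi @ x_t."""
    acc = 0.0
    for c in range(0, len(products), k):
        acc += adder_tree(list(products[c:c + k]))
    return acc

print(accumulate_row([1.0] * 8, k=4))   # two accumulations of a 4-input tree
```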
The activation circuit 50 can usually complete the activation operations of the 4 gate structures simultaneously; that is, in a specific embodiment of the present invention, the activation circuit 50 is specifically used to perform the sigmoid activation operation and the tanh activation operation according to the output of the adding circuit 40 and to output $i_t$, $f_t$, $o_t$ and $\tilde{c}_t$. The sigmoid activation operation corresponds to the σ symbol in the six LSTM formulas above, and the tanh activation operation to the tanh symbol in those formulas.
The state update circuit 60 can complete the computation of $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ and $h_t = o_t \odot \tanh(c_t)$. It should be noted that when computing $c_t$, $c_{t-1}$ can be obtained from the cell state cache 90; that is, in a specific embodiment of the present invention, the state update circuit 60 is specifically used to:
obtain $c_{t-1}$ from the cell state cache 90, calculate $c_t$ and $h_t$ according to the output of the activation circuit 50, update $c_{t-1}$ in the cell state cache 90 with $c_t$ once $c_t$ has been calculated, and send $h_t$ to the second buffer 20; with
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$; $h_t = o_t \odot \tanh(c_t)$; $\odot$ denotes the dot product (element-wise multiplication).
After $c_t$ is calculated, the state update circuit 60 uses $c_t$ to update $c_{t-1}$ in the cell state cache 90, for the computation of $c_t$ in the next time step. It should also be pointed out that the cell state of the first time step can come from off-chip storage, i.e., $c_0$ can come from off-chip storage.
In a specific embodiment of the present invention, the vector multiplication circuit 30 is in a first pipeline, the adding circuit 40 is in a second pipeline, the activation circuit and the state update circuit 60 are in a third pipeline, and the first pipeline, the second pipeline and the third pipeline run in parallel.
From the six LSTM formulas it can be seen that the update of $c_t$ depends on $c_{t-1}$, the computation of $h_t$ depends on $c_t$, and the computation of $i_t$, $f_t$, $o_t$ and $\tilde{c}_t$ depends on $h_{t-1}$. Although highly parallel computation can accelerate the matrix-vector multiplications, these dependencies mean that some data can only be processed serially, causing the processing to stall and hurting efficiency. In this embodiment, pipeline scheduling further improves the acceleration effect of the solution.
This embodiment takes into account that the input data $x_t$ of different time steps have no dependency on one another and that, in the solution of this application, $W_x x_t$ and $W_h h_{t-1}$ are computed in a time-multiplexed, segmented manner. Therefore, the vector multiplication circuit 30 is placed in the first pipeline, the adding circuit 40 in the second pipeline, and the activation circuit 50 and the state update circuit 60 in the third pipeline, with the first, second and third pipelines all running in parallel. The subsequent addition operations can thus begin before all results of $W_h h_{t-1}$ are available; while the activation circuit 50 and the state update circuit 60 are running, the multiplication circuit 30 has already started the multiplication operations of the next time step, and the adding circuit 40 immediately sums the partial results, so that no part of the system of this application needs to stall. In other words, the dependencies mentioned above are eliminated by the pipeline design, and the operating efficiency of the LSTM network is further improved.
For ease of understanding, refer to Fig. 8. When computing $W_x x_1$, the accumulation of partial data can proceed at the same time, without waiting for all results; meanwhile $W_h h_0$ is computed and likewise accumulated as it is computed — the pipelined accumulation shown in Fig. 8 — so the multiplication circuit 30 and the adding circuit 40 run simultaneously. The vector addition, activation, cell state update and hidden state data generation performed in the adding circuit 40 and downstream take comparatively long; during that time, the vector multiplication circuit 30 has already started the next time step, i.e., the computation of $W_x x_2$, immediately followed by the computation of $W_h h_1$, and so on back and forth until all time steps are computed, i.e., the $x_t$ of every moment has been processed and the LSTM network has completed its task.
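The overlap of Fig. 8 can be mimicked with a coarse schedule model (a sketch with made-up stage labels, intended only to show that stage 1 of a later time step runs while stage 3 of an earlier one is still busy):

```python
def pipeline_schedule(n_steps):
    """Print which (time step, stage) pairs occupy each cycle of a
    3-stage pipeline: stage 1 = multiply, 2 = add, 3 = activate/update."""
    for cycle in range(n_steps + 2):
        busy = [(t + 1, stage + 1)
                for stage in range(3)
                for t in [cycle - stage]
                if 0 <= t < n_steps]
        print(f"cycle {cycle}: {busy}")

pipeline_schedule(4)
# From cycle 2 onward, three different time steps are in flight at once.
```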
By applying the technical solution provided by the embodiments of the present invention — specifically, considering that the computation of the gate structures, mainly matrix-vector multiplications, accounts for the overwhelming majority of the computation of the whole RNN network — this application provides a vector multiplication circuit comprising N groups of multiplication arrays, each group comprising k multiplication units, which helps increase the computation speed. Moreover, in traditional solutions the computation of $W_x x_t$ and $W_h h_{t-1}$ is merged into one computation, which becomes very slow when the dimension of $x_t$ or $h_{t-1}$ is large. In the solution of this application, $W_x x_t$ and $W_h h_{t-1}$ are therefore computed in a time-multiplexed, segmented manner; that is, accumulation does not have to wait until all values of $W_x x_t$ and $W_h h_{t-1}$ have been produced, which further improves the acceleration effect of the solution. Specifically, the first buffer cyclically switches between a first state and a second state, outputting, in the first state, $W_{x1}$ to $W_{xN}$ in N parallel channels, each with a degree of parallelism of k, and, in the second state, $W_{h1}$ to $W_{hN}$ in N parallel channels, each with a degree of parallelism of k, N being a positive integer ≥ 2; the second buffer cyclically switches between a first state and a second state, outputting $x_t$ in the first state and $h_{t-1}$ in the second state. Upon receiving $W_{x1}$ to $W_{xN}$ output by the first buffer, the vector multiplication circuit calculates $W_{x1}x_t$ to $W_{xN}x_t$ with N groups of multiplication arrays, and upon receiving $W_{h1}$ to $W_{hN}$, it calculates $W_{h1}h_{t-1}$ to $W_{hN}h_{t-1}$. The adding circuit can then receive $b_1$ to $b_N$ sent by the bias data buffer and compute $W_{x1}x_t+W_{h1}h_{t-1}+b_1$ to $W_{xN}x_t+W_{hN}h_{t-1}+b_N$ with the aid of the vector buffer. In addition, in the solution of this application each group of multiplication arrays comprises k multiplication units, and by setting and adjusting the value of k, the solution can adapt to RNN networks of different sizes, i.e., it has strong flexibility and scalability. In summary, the solution of this application effectively accelerates the RNN network and has strong flexibility and scalability.
Corresponding to the above system embodiments, an embodiment of the present invention further provides a method for accelerating an RNN network, which can be cross-referenced with the above.
The method for accelerating an RNN network can be applied in the system for accelerating an RNN network of any of the above embodiments, and comprises:
Step 1: the first buffer cyclically switches between the first state and the second state, outputting, in the first state, $W_{x1}$ to $W_{xN}$ in N parallel channels, each with a degree of parallelism of k, and, in the second state, $W_{h1}$ to $W_{hN}$ in N parallel channels, each with a degree of parallelism of k; N is a positive integer ≥ 2;
Step 2: the second buffer cyclically switches between the first state and the second state, outputting $x_t$ in the first state and $h_{t-1}$ in the second state;
Step 3: upon receiving $W_{x1}$ to $W_{xN}$ output by the first buffer, the vector multiplication circuit calculates $W_{x1}x_t$ to $W_{xN}x_t$ with N groups of multiplication arrays respectively, and upon receiving $W_{h1}$ to $W_{hN}$ output by the first buffer, it calculates $W_{h1}h_{t-1}$ to $W_{hN}h_{t-1}$ with the N groups of multiplication arrays respectively; the vector multiplication circuit comprises N groups of multiplication arrays, and each group of multiplication arrays comprises k multiplication units;
Step 4: the adding circuit receives $b_1$ to $b_N$ sent by the bias data buffer and computes $W_{x1}x_t+W_{h1}h_{t-1}+b_1$ to $W_{xN}x_t+W_{hN}h_{t-1}+b_N$ with the aid of the vector buffer;
Step 5: the activation circuit performs an activation operation according to the output of the adding circuit;
Step 6: the state update circuit obtains $c_{t-1}$ from the cell state cache, calculates $c_t$ and $h_t$ according to the output of the activation circuit, updates $c_{t-1}$ in the cell state cache with $c_t$ once $c_t$ has been calculated, and sends $h_t$ to the second buffer;
wherein $W_{x1}$ to $W_{xN}$ in turn represent the weight data matrices of the first to the N-th gate; $W_{h1}$ to $W_{hN}$ in turn represent the hidden state weight data matrices of the first to the N-th gate; $b_1$ to $b_N$ in turn represent the bias data of the first to the N-th gate; $x_t$ represents the input data at time t, $h_{t-1}$ the hidden state data at time t-1, $h_t$ the hidden state data at time t, $c_t$ the cell state at time t, and $c_{t-1}$ the cell state at time t-1.
Further, in a specific embodiment of the present invention, the RNN network is specifically an LSTM network with N=4.
Step 1 above is then specifically: the first buffer cyclically switches between the first state and the second state, outputting, in the first state, $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$ in 4 parallel channels, each with a degree of parallelism of k, and, in the second state, $W_{hi}$, $W_{hf}$, $W_{ho}$ and $W_{hc}$ in 4 parallel channels, each with a degree of parallelism of k;
Step 2 is specifically: the second buffer cyclically switches between the first state and the second state, outputting $x_t$ in the first state and $h_{t-1}$ in the second state;
Step 3 is specifically: upon receiving $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$ output by the first buffer, the vector multiplication circuit calculates $W_{xi}x_t$, $W_{xf}x_t$, $W_{xo}x_t$ and $W_{xc}x_t$ with 4 groups of multiplication arrays respectively, and upon receiving $W_{hi}$, $W_{hf}$, $W_{ho}$ and $W_{hc}$ output by the first buffer, it calculates $W_{hi}h_{t-1}$, $W_{hf}h_{t-1}$, $W_{ho}h_{t-1}$ and $W_{hc}h_{t-1}$ with the 4 groups of multiplication arrays respectively; the vector multiplication circuit comprises 4 groups of multiplication arrays, each group comprising k multiplication units;
Step 4 is specifically: the adding circuit receives $b_i$, $b_f$, $b_o$ and $b_c$ sent by the bias data buffer and computes $W_{xi}x_t+W_{hi}h_{t-1}+b_i$, $W_{xf}x_t+W_{hf}h_{t-1}+b_f$, $W_{xo}x_t+W_{ho}h_{t-1}+b_o$ and $W_{xc}x_t+W_{hc}h_{t-1}+b_c$ with the aid of the vector buffer;
Step 5 is specifically: the activation circuit performs an activation operation according to the output of the adding circuit and outputs $i_t$, $f_t$, $o_t$ and $\tilde{c}_t$;
Step 6 is specifically: the state update circuit obtains $c_{t-1}$ from the cell state cache, calculates $c_t$ and $h_t$ according to the output of the activation circuit, updates $c_{t-1}$ in the cell state cache with $c_t$ once $c_t$ has been calculated, and sends $h_t$ to the second buffer;
wherein $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$ in turn represent the input gate weight data matrix, the forget gate weight data matrix, the output gate weight data matrix and the cell gate weight data matrix; $W_{hi}$, $W_{hf}$, $W_{ho}$ and $W_{hc}$ in turn represent the input gate hidden state weight data matrix, the forget gate hidden state weight data matrix, the output gate hidden state weight data matrix and the cell gate hidden state weight data matrix; $b_i$, $b_f$, $b_o$ and $b_c$ in turn represent the input gate bias data, the forget gate bias data, the output gate bias data and the cell gate bias data; $i_t$, $f_t$, $o_t$ and $\tilde{c}_t$ in turn represent the input gate, the forget gate, the output gate and the cell gate; $x_t$ represents the input data at time t, $h_{t-1}$ the hidden state data at time t-1, $h_t$ the hidden state data at time t, $c_t$ the cell state at time t, and $c_{t-1}$ the cell state at time t-1.
In a specific embodiment of the present invention, the vector multiplication circuit is in a first pipeline, the adding circuit is in a second pipeline, the activation circuit and the state update circuit are in a third pipeline, and the first pipeline, the second pipeline and the third pipeline run in parallel.
In a specific embodiment of the present invention, Step 1 comprises:
the first storage unit obtains a target quantity of $W_{xi}$, a target quantity of $W_{xf}$, a target quantity of $W_{xo}$ and a target quantity of $W_{xc}$ from off-chip storage;
the second storage unit obtains a target quantity of $W_{hi}$, a target quantity of $W_{hf}$, a target quantity of $W_{ho}$ and a target quantity of $W_{hc}$ from off-chip storage;
the first multiplexer, connected to the first storage unit and the second storage unit respectively, cyclically switches between the first state and the second state, selecting the first storage unit for data output in the first state and the second storage unit for data output in the second state;
the first memory, the second memory, the third memory and the fourth memory, all connected to the first multiplexer through the data classifier, output $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$ in parallel in sequence, each with a degree of parallelism of k, when the first multiplexer is in the first state, and output $W_{hi}$, $W_{hf}$, $W_{ho}$ and $W_{hc}$ in parallel in sequence, each with a degree of parallelism of k, when the first multiplexer is in the second state;
wherein the target quantity is greater than k.
In a specific embodiment of the present invention, both the first storage unit and the second storage unit use a first clock, the first memory, the second memory, the third memory and the fourth memory all use a second clock, and the first clock and the second clock are independent of each other, so that when the output rate of any one of the first memory, the second memory, the third memory and the fourth memory is lower than its input rate, the unsent data is buffered in that memory.
In a specific embodiment of the present invention, Step 2 comprises:
the third storage unit obtains $x_t$ at odd-numbered moments from off-chip storage;
the fourth storage unit obtains $x_t$ at even-numbered moments from off-chip storage;
the second multiplexer, connected to the third storage unit and the fourth storage unit respectively, cyclically switches between the first state and the second state, selecting the third storage unit for data output in the first state and the fourth storage unit for data output in the second state;
the third multiplexer obtains $h_0$ from off-chip storage and receives the $h_t$ sent by the state update circuit, selecting $h_0$ only on the first selection; $h_0$ represents the hidden state data at time t=1;
the fifth storage unit obtains $h_t$ at even-numbered moments, as well as $h_0$, through the third multiplexer;
the sixth storage unit obtains $h_t$ at odd-numbered moments through the third multiplexer;
the fourth multiplexer cyclically switches between the first state and the second state, selecting the fifth storage unit for data output in the first state and the sixth storage unit for data output in the second state;
the fifth multiplexer cyclically switches between the first state and the second state, selecting the second multiplexer for data output in the first state and the fourth multiplexer for data output in the second state.
In a specific embodiment of the present invention, Step 4 comprises:
4 groups of $\log_2 k$-stage adder circuits, each group of adder circuits summing the k input data;
a vector addition circuit, connected to the outputs of all 4 groups of adder circuits, receives $b_i$, $b_f$, $b_o$ and $b_c$ sent by the bias data buffer and, according to the output of each group of adder circuits, computes $W_{xi}x_t+W_{hi}h_{t-1}+b_i$, $W_{xf}x_t+W_{hf}h_{t-1}+b_f$, $W_{xo}x_t+W_{ho}h_{t-1}+b_o$ and $W_{xc}x_t+W_{hc}h_{t-1}+b_c$ with the aid of the vector buffer.
In a specific embodiment of the present invention, Step 5 comprises:
the activation circuit performs a sigmoid activation operation and a tanh activation operation according to the output of the adding circuit, and outputs $i_t$, $f_t$, $o_t$ and $\tilde{c}_t$.
In a specific embodiment of the present invention, Step 6 comprises:
the state update circuit obtains $c_{t-1}$ from the cell state cache, calculates $c_t$ and $h_t$ according to the output of the activation circuit, updates $c_{t-1}$ in the cell state cache with $c_t$ once $c_t$ has been calculated, and sends $h_t$ to the second buffer;
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$; $h_t = o_t \odot \tanh(c_t)$; $\odot$ denotes the dot product (element-wise multiplication).
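Putting Steps 1 to 6 together, a functional sketch of the whole method over a sequence might look as follows (illustrative names and toy sizes; the buffer switching of Steps 1 and 2 is abstracted to simply having the operands available):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(x_t, h_prev, c_prev, Wx, Wh, b):
    """Steps 3-6 for one time step."""
    pre = {g: Wx[g] @ x_t + Wh[g] @ h_prev + b[g] for g in 'ifoc'}  # Steps 3-4
    i_t, f_t, o_t = sigmoid(pre['i']), sigmoid(pre['f']), sigmoid(pre['o'])
    c_hat = np.tanh(pre['c'])                                       # Step 5
    c_t = f_t * c_prev + i_t * c_hat                                # Step 6
    return c_t, o_t * np.tanh(c_t)

rng = np.random.default_rng(1)
Wx = {g: rng.normal(size=(3, 5)) for g in 'ifoc'}
Wh = {g: rng.normal(size=(3, 3)) for g in 'ifoc'}
b = {g: np.zeros(3) for g in 'ifoc'}
h, c = np.zeros(3), np.zeros(3)     # h_0 and c_0 come from off-chip storage
for t in range(4):                  # every time step of the sequence
    c, h = step(rng.normal(size=5), h, c, Wx, Wh, b)  # c updates the cache
print(h)                            # each h_t is fed back to the second buffer
```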
Corresponding to the above method and system embodiments, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it implements the steps of the method for accelerating the LSTM network of any of the above embodiments, which can be cross-referenced with the above. The computer-readable storage medium mentioned here includes random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the technical field.
It should also be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of further identical elements in the process, method, article or device comprising that element.
Professionals may further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of their functions. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Professionals may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
Specific examples are used herein to explain the principles and embodiments of the present invention; the description of the above embodiments is only intended to help understand the technical solution and core idea of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications to the present invention without departing from its principles, and these improvements and modifications also fall within the scope of protection of the claims of the present invention.

Claims (11)

  1. A system for accelerating an RNN network, characterized by comprising:
    a first buffer for cyclically switching between a first state and a second state, outputting, in the first state, $W_{x1}$ to $W_{xN}$ in N parallel channels, each with a degree of parallelism of k, and, in the second state, $W_{h1}$ to $W_{hN}$ in N parallel channels, each with a degree of parallelism of k; N is a positive integer ≥ 2;
    a second buffer for cyclically switching between a first state and a second state, outputting $x_t$ in the first state and $h_{t-1}$ in the second state;
    a vector multiplication circuit for calculating, upon receiving $W_{x1}$ to $W_{xN}$ output by the first buffer, $W_{x1}x_t$ to $W_{xN}x_t$ with N groups of multiplication arrays respectively, and, upon receiving $W_{h1}$ to $W_{hN}$ output by the first buffer, $W_{h1}h_{t-1}$ to $W_{hN}h_{t-1}$ with the N groups of multiplication arrays respectively; the vector multiplication circuit comprises N groups of multiplication arrays, and each group of multiplication arrays comprises k multiplication units;
    an adding circuit for receiving $b_1$ to $b_N$ sent by a bias data buffer and computing $W_{x1}x_t+W_{h1}h_{t-1}+b_1$ to $W_{xN}x_t+W_{hN}h_{t-1}+b_N$ with the aid of a vector buffer;
    an activation circuit for performing an activation operation according to the output of the adding circuit;
    a state update circuit for obtaining $c_{t-1}$ from a cell state cache, calculating $c_t$ and $h_t$ according to the output of the activation circuit, updating $c_{t-1}$ in the cell state cache with $c_t$ once $c_t$ has been calculated, and sending $h_t$ to the second buffer;
    the bias data buffer; the vector buffer; the cell state cache;
    wherein $W_{x1}$ to $W_{xN}$ in turn represent the weight data matrices of the first to the N-th gate; $W_{h1}$ to $W_{hN}$ in turn represent the hidden state weight data matrices of the first to the N-th gate; $b_1$ to $b_N$ in turn represent the bias data of the first to the N-th gate; $x_t$ represents the input data at time t, $h_{t-1}$ the hidden state data at time t-1, $h_t$ the hidden state data at time t, $c_t$ the cell state at time t, and $c_{t-1}$ the cell state at time t-1.
  2. The system for accelerating an RNN network according to claim 1, characterized in that the RNN network is specifically an LSTM network with N=4, comprising:
    the first buffer, specifically for cyclically switching between the first state and the second state, outputting, in the first state, $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$ in 4 parallel channels, each with a degree of parallelism of k, and, in the second state, $W_{hi}$, $W_{hf}$, $W_{ho}$ and $W_{hc}$ in 4 parallel channels, each with a degree of parallelism of k;
    the second buffer, specifically for cyclically switching between the first state and the second state, outputting $x_t$ in the first state and $h_{t-1}$ in the second state;
    the vector multiplication circuit, specifically for calculating, upon receiving $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$ output by the first buffer, $W_{xi}x_t$, $W_{xf}x_t$, $W_{xo}x_t$ and $W_{xc}x_t$ with 4 groups of multiplication arrays respectively, and, upon receiving $W_{hi}$, $W_{hf}$, $W_{ho}$ and $W_{hc}$ output by the first buffer, $W_{hi}h_{t-1}$, $W_{hf}h_{t-1}$, $W_{ho}h_{t-1}$ and $W_{hc}h_{t-1}$ with the 4 groups of multiplication arrays respectively; the vector multiplication circuit comprises 4 groups of multiplication arrays, and each group of multiplication arrays comprises k multiplication units;
    the adding circuit, specifically for receiving $b_i$, $b_f$, $b_o$ and $b_c$ sent by the bias data buffer and computing $W_{xi}x_t+W_{hi}h_{t-1}+b_i$, $W_{xf}x_t+W_{hf}h_{t-1}+b_f$, $W_{xo}x_t+W_{ho}h_{t-1}+b_o$ and $W_{xc}x_t+W_{hc}h_{t-1}+b_c$ with the aid of the vector buffer;
    the activation circuit, specifically for performing an activation operation according to the output of the adding circuit and outputting $i_t$, $f_t$, $o_t$ and $\tilde{c}_t$;
    the state update circuit, specifically for obtaining $c_{t-1}$ from the cell state cache, calculating $c_t$ and $h_t$ according to the output of the activation circuit, updating $c_{t-1}$ in the cell state cache with $c_t$ once $c_t$ has been calculated, and sending $h_t$ to the second buffer;
    wherein $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$ in turn represent the input gate weight data matrix, the forget gate weight data matrix, the output gate weight data matrix and the cell gate weight data matrix; $W_{hi}$, $W_{hf}$, $W_{ho}$ and $W_{hc}$ in turn represent the input gate hidden state weight data matrix, the forget gate hidden state weight data matrix, the output gate hidden state weight data matrix and the cell gate hidden state weight data matrix; $b_i$, $b_f$, $b_o$ and $b_c$ in turn represent the input gate bias data, the forget gate bias data, the output gate bias data and the cell gate bias data; $i_t$, $f_t$, $o_t$ and $\tilde{c}_t$ in turn represent the input gate, the forget gate, the output gate and the cell gate; $x_t$ represents the input data at time t, $h_{t-1}$ the hidden state data at time t-1, $h_t$ the hidden state data at time t, $c_t$ the cell state at time t, and $c_{t-1}$ the cell state at time t-1.
  3. The system for accelerating an RNN network according to claim 2, characterized in that the vector multiplication circuit is in a first pipeline, the adding circuit is in a second pipeline, the activation circuit and the state update circuit are in a third pipeline, and the first pipeline, the second pipeline and the third pipeline run in parallel.
  4. The system for accelerating an RNN network according to claim 2, characterized in that the first buffer comprises:
    a first storage unit for obtaining a target quantity of $W_{xi}$, a target quantity of $W_{xf}$, a target quantity of $W_{xo}$ and a target quantity of $W_{xc}$ from off-chip storage;
    a second storage unit for obtaining a target quantity of $W_{hi}$, a target quantity of $W_{hf}$, a target quantity of $W_{ho}$ and a target quantity of $W_{hc}$ from off-chip storage;
    a first multiplexer, connected to the first storage unit and the second storage unit respectively, for cyclically switching between the first state and the second state, selecting the first storage unit for data output in the first state and the second storage unit for data output in the second state;
    a first memory, a second memory, a third memory and a fourth memory, all connected to the first multiplexer through a data classifier, for outputting $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{xc}$ in parallel in sequence, each with a degree of parallelism of k, when the first multiplexer is in the first state, and for outputting $W_{hi}$, $W_{hf}$, $W_{ho}$ and $W_{hc}$ in parallel in sequence, each with a degree of parallelism of k, when the first multiplexer is in the second state;
    the data classifier;
    wherein the target quantity is greater than k.
  5. The system for accelerating an RNN network according to claim 4, characterized in that both the first storage unit and the second storage unit use a first clock, the first memory, the second memory, the third memory and the fourth memory all use a second clock, and the first clock and the second clock are independent of each other, so that when the output rate of any one of the first memory, the second memory, the third memory and the fourth memory is lower than its input rate, the unsent data is buffered in that memory.
  6. The system for accelerating an RNN network according to claim 2, characterized in that the second buffer comprises:
    a third storage unit for obtaining $x_t$ at odd-numbered moments from off-chip storage;
    a fourth storage unit for obtaining $x_t$ at even-numbered moments from off-chip storage;
    a second multiplexer, connected to the third storage unit and the fourth storage unit respectively, for cyclically switching between the first state and the second state, selecting the third storage unit for data output in the first state and the fourth storage unit for data output in the second state;
    a third multiplexer for obtaining $h_0$ from off-chip storage and receiving the $h_t$ sent by the state update circuit, selecting $h_0$ only on the first selection; $h_0$ represents the hidden state data at time t=1;
    a fifth storage unit for obtaining $h_t$ at even-numbered moments, as well as $h_0$, through the third multiplexer;
    a sixth storage unit for obtaining $h_t$ at odd-numbered moments through the third multiplexer;
    a fourth multiplexer for cyclically switching between the first state and the second state, selecting the fifth storage unit for data output in the first state and the sixth storage unit for data output in the second state;
    a fifth multiplexer for cyclically switching between the first state and the second state, selecting the second multiplexer for data output in the first state and the fourth multiplexer for data output in the second state.
  7. The system for accelerating an RNN network according to claim 2, characterized in that the adding circuit comprises:
    4 groups of $\log_2 k$-stage adder circuits, each group of adder circuits being used to sum the k input data;
    a vector addition circuit, connected to the outputs of all 4 groups of adder circuits, for receiving $b_i$, $b_f$, $b_o$ and $b_c$ sent by the bias data buffer and, according to the output of each group of adder circuits, computing $W_{xi}x_t+W_{hi}h_{t-1}+b_i$, $W_{xf}x_t+W_{hf}h_{t-1}+b_f$, $W_{xo}x_t+W_{ho}h_{t-1}+b_o$ and $W_{xc}x_t+W_{hc}h_{t-1}+b_c$ with the aid of the vector buffer.
  8. The system for accelerating an RNN network according to claim 2, characterized in that the activation circuit is specifically used to perform a sigmoid activation operation and a tanh activation operation according to the output of the adding circuit, and to output $i_t$, $f_t$, $o_t$ and $\tilde{c}_t$.
  9. The system for accelerating an RNN network according to claim 2, characterized in that the state update circuit is specifically used to:
    obtain $c_{t-1}$ from the cell state cache, calculate $c_t$ and $h_t$ according to the output of the activation circuit, update $c_{t-1}$ in the cell state cache with $c_t$ once $c_t$ has been calculated, and send $h_t$ to the second buffer;
    $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$; $h_t = o_t \odot \tanh(c_t)$; $\odot$ denotes the dot product (element-wise multiplication).
  10. A method for accelerating an RNN network, characterized in that it is applied in the system for accelerating an RNN network according to any one of claims 1 to 9, and comprises:
    the first buffer cyclically switches between the first state and the second state, outputting, in the first state, $W_{x1}$ to $W_{xN}$ in N parallel channels, each with a degree of parallelism of k, and, in the second state, $W_{h1}$ to $W_{hN}$ in N parallel channels, each with a degree of parallelism of k; N is a positive integer ≥ 2;
    the second buffer cyclically switches between the first state and the second state, outputting $x_t$ in the first state and $h_{t-1}$ in the second state;
    upon receiving $W_{x1}$ to $W_{xN}$ output by the first buffer, the vector multiplication circuit calculates $W_{x1}x_t$ to $W_{xN}x_t$ with N groups of multiplication arrays respectively, and upon receiving $W_{h1}$ to $W_{hN}$ output by the first buffer, it calculates $W_{h1}h_{t-1}$ to $W_{hN}h_{t-1}$ with the N groups of multiplication arrays respectively; the vector multiplication circuit comprises N groups of multiplication arrays, and each group of multiplication arrays comprises k multiplication units;
    the adding circuit receives $b_1$ to $b_N$ sent by the bias data buffer and computes $W_{x1}x_t+W_{h1}h_{t-1}+b_1$ to $W_{xN}x_t+W_{hN}h_{t-1}+b_N$ with the aid of the vector buffer;
    the activation circuit performs an activation operation according to the output of the adding circuit;
    the state update circuit obtains $c_{t-1}$ from the cell state cache, calculates $c_t$ and $h_t$ according to the output of the activation circuit, updates $c_{t-1}$ in the cell state cache with $c_t$ once $c_t$ has been calculated, and sends $h_t$ to the second buffer;
    wherein $W_{x1}$ to $W_{xN}$ in turn represent the weight data matrices of the first to the N-th gate; $W_{h1}$ to $W_{hN}$ in turn represent the hidden state weight data matrices of the first to the N-th gate; $b_1$ to $b_N$ in turn represent the bias data of the first to the N-th gate; $x_t$ represents the input data at time t, $h_{t-1}$ the hidden state data at time t-1, $h_t$ the hidden state data at time t, $c_t$ the cell state at time t, and $c_{t-1}$ the cell state at time t-1.
  11. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, it implements the steps of the method for accelerating an RNN network according to claim 10.
PCT/CN2021/089936 2020-09-25 2021-04-26 System, method and storage medium for accelerating RNN network WO2022062391A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/012,938 US11775803B2 (en) 2020-09-25 2021-04-26 System and method for accelerating RNN network, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011023267.4A 2020-09-25 2020-09-25 System, method and storage medium for accelerating RNN network
CN202011023267.4 2020-09-25

Publications (1)

Publication Number Publication Date
WO2022062391A1 true WO2022062391A1 (zh) 2022-03-31

Family

ID=73450291

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/089936 WO2022062391A1 (zh) 2020-09-25 2021-04-26 一种加速rnn网络的系统、方法及存储介质

Country Status (3)

Country Link
US (1) US11775803B2 (zh)
CN (1) CN111985626B (zh)
WO (1) WO2022062391A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985626B (zh) 2020-09-25 2022-06-07 Suzhou Inspur Intelligent Technology Co., Ltd. System, method and storage medium for accelerating RNN network
CN112732638B (zh) * 2021-01-22 2022-05-06 Shanghai Jiao Tong University Heterogeneous acceleration system and method based on CTPN network

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704916A (zh) * 2016-08-12 2018-02-16 Beijing Deephi Technology Co., Ltd. Hardware accelerator and method for implementing RNN neural network based on FPGA
US20180189638A1 (en) * 2016-12-31 2018-07-05 Intel Corporation Hardware accelerator template and design framework for implementing recurrent neural networks
CN108446761A (zh) * 2018-03-23 2018-08-24 Institute of Computing Technology, Chinese Academy of Sciences Neural network accelerator and data processing method
CN110826710A (zh) * 2019-10-18 2020-02-21 Nanjing University Hardware acceleration implementation system and method for RNN forward propagation model based on transverse systolic array
US20200218965A1 (en) * 2019-01-08 2020-07-09 SimpleMachines Inc. Accelerating parallel processing of data in a recurrent neural network
CN111985626A (zh) * 2020-09-25 2020-11-24 Suzhou Inspur Intelligent Technology Co., Ltd. System, method and storage medium for accelerating RNN network

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10489063B2 (en) * 2016-12-19 2019-11-26 Intel Corporation Memory-to-memory instructions to accelerate sparse-matrix by dense-vector and sparse-vector by dense-vector multiplication
US10445451B2 (en) * 2017-07-01 2019-10-15 Intel Corporation Processors, methods, and systems for a configurable spatial accelerator with performance, correctness, and power reduction features
CN108376285A (zh) * 2018-03-23 2018-08-07 Institute of Computing Technology, Chinese Academy of Sciences Multi-variant-oriented LSTM neural network accelerator and data processing method
US11307873B2 (en) * 2018-04-03 2022-04-19 Intel Corporation Apparatus, methods, and systems for unstructured data flow in a configurable spatial accelerator with predicate propagation and merging
US11200186B2 (en) * 2018-06-30 2021-12-14 Intel Corporation Apparatuses, methods, and systems for operations in a configurable spatial accelerator
US10817291B2 (en) * 2019-03-30 2020-10-27 Intel Corporation Apparatuses, methods, and systems for swizzle operations in a configurable spatial accelerator
US10915471B2 (en) * 2019-03-30 2021-02-09 Intel Corporation Apparatuses, methods, and systems for memory interface circuit allocation in a configurable spatial accelerator
CN110110851B (zh) * 2019-04-30 2023-03-24 Nanjing University FPGA accelerator for LSTM neural network and acceleration method thereof
US11029958B1 (en) * 2019-12-28 2021-06-08 Intel Corporation Apparatuses, methods, and systems for configurable operand size operations in an operation configurable spatial accelerator

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704916A (zh) * 2016-08-12 2018-02-16 Beijing Deephi Technology Co., Ltd. Hardware accelerator and method for implementing RNN neural network based on FPGA
US20180189638A1 (en) * 2016-12-31 2018-07-05 Intel Corporation Hardware accelerator template and design framework for implementing recurrent neural networks
CN108446761A (zh) * 2018-03-23 2018-08-24 Institute of Computing Technology, Chinese Academy of Sciences Neural network accelerator and data processing method
US20200218965A1 (en) * 2019-01-08 2020-07-09 SimpleMachines Inc. Accelerating parallel processing of data in a recurrent neural network
CN110826710A (zh) * 2019-10-18 2020-02-21 Nanjing University Hardware acceleration implementation system and method for RNN forward propagation model based on transverse systolic array
CN111985626A (zh) * 2020-09-25 2020-11-24 Suzhou Inspur Intelligent Technology Co., Ltd. System, method and storage medium for accelerating RNN network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GAO, SHEN ET AL.: "Survey of FPGA Based Recurrent Neural Network Accelerator", CHINESE JOURNAL OF NETWORK AND INFORMATION SECURITY, vol. 5, no. 4, 31 August 2019 (2019-08-31), pages 1 - 13, XP055915003, ISSN: 2096-109X *
HE, JUNHUA ET AL.: "An LSTM Acceleration Engine for FPGAs Based on Caffe Framework", 2019 IEEE 5TH INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATIONS, 9 December 2019 (2019-12-09), XP033754906, ISSN: 7281-4743 *

Also Published As

Publication number Publication date
US20230196068A1 (en) 2023-06-22
CN111985626A (zh) 2020-11-24
CN111985626B (zh) 2022-06-07
US11775803B2 (en) 2023-10-03

Similar Documents

Publication Publication Date Title
CN111684473B (zh) Improving the performance of a neural network array
CN108133270A (zh) Convolutional neural network acceleration method and apparatus
WO2022062391A1 (zh) System, method and storage medium for accelerating RNN network
CN110543939B (zh) FPGA-based hardware acceleration implementation device for backward training of convolutional neural networks
KR102396447B1 (ko) Computation acceleration device for artificial neural networks having a pipeline structure
CN110580519B (zh) Convolution operation device and method thereof
CN115423081A (zh) FPGA-based neural network accelerator for the CNN_LSTM algorithm
US11983616B2 (en) Methods and apparatus for constructing digital circuits for performing matrix operations
WO1991018347A1 (en) Spin: a sequential pipelined neurocomputer
CN114675805A (zh) In-memory compute accumulator
CN114519425A (zh) Scalable convolutional neural network acceleration system
Ghasemzadeh et al. BRDS: An FPGA-based LSTM accelerator with row-balanced dual-ratio sparsification
Li et al. Input-aware dynamic timestep spiking neural networks for efficient in-memory computing
CN115879530B (zh) Method for optimizing the array structure of RRAM in-memory computing systems
Tao et al. Hima: A fast and scalable history-based memory access engine for differentiable neural computer
CN117574970A (zh) Inference acceleration method, system, terminal and medium for large-scale language models
Anis FPGA implementation of parallel particle swarm optimization algorithm and compared with genetic algorithm
He et al. An LSTM acceleration engine for FPGAs based on caffe framework
Wang et al. COSA: Co-Operative Systolic Arrays for Multi-head Attention Mechanism in Neural Network using Hybrid Data Reuse and Fusion Methodologies
Su et al. Processing element architecture design for deep reinforcement learning with flexible block floating point exploiting signal statistics
Rizk et al. A resource-saving energy-efficient reconfigurable hardware accelerator for bert-based deep neural network language models using FFT multiplication
Dey et al. An application specific processor architecture with 3D integration for recurrent neural networks
Wang et al. Implementation of Bidirectional LSTM Accelerator Based on FPGA
EP3948685A1 (en) Accelerating neuron computations in artificial neural networks by skipping bits
CN114239818B (zh) In-memory computing architecture neural network accelerator based on TCAM and LUT

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21870779

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21870779

Country of ref document: EP

Kind code of ref document: A1