WO2018120016A1 - Apparatus and operation method for performing LSTM neural network operations - Google Patents

Apparatus and operation method for performing LSTM neural network operations

Info

Publication number
WO2018120016A1
WO2018120016A1 (PCT/CN2016/113493)
Authority
WO
WIPO (PCT)
Prior art keywords
data
vector
unit
data processing
gate
Prior art date
Application number
PCT/CN2016/113493
Other languages
English (en)
French (fr)
Inventor
陈云霁
陈小兵
刘少礼
陈天石
Original Assignee
上海寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海寒武纪信息科技有限公司
Priority to EP16925613.8A priority Critical patent/EP3564863B1/en
Priority to PCT/CN2016/113493 priority patent/WO2018120016A1/zh
Publication of WO2018120016A1 publication Critical patent/WO2018120016A1/zh
Priority to US16/459,549 priority patent/US10853722B2/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures

Definitions

  • The present invention relates to the field of neural network computing technologies, and more particularly to an apparatus and an operation method for performing LSTM neural network operations.
  • The Long Short-Term Memory (LSTM) network is a kind of recurrent neural network (RNN). Owing to the unique structural design of the network itself, an LSTM is well suited to processing and predicting important events in time series with very long intervals and delays.
  • Compared with traditional recurrent neural networks, the LSTM network exhibits better performance, and it is well suited to learning from experience so as to classify, process, and predict time series in which important events are separated by lags of unknown duration.
  • LSTM networks are widely used in many fields such as speech recognition, video description, machine translation, and automatic music synthesis.
  • As research on LSTM networks continues to deepen, their performance has been greatly improved, and they have attracted widespread attention in both industry and academia.
  • The operation of an LSTM network involves a variety of algorithms, and the specific implementation devices are mainly of the following two types:
  • One device that implements LSTM network operations is the general-purpose processor, which supports the above algorithms by executing general-purpose instructions using a general-purpose register file and general-purpose functional units.
  • One of the disadvantages of this method is that the computational performance of a single general-purpose processor is low, unable to exploit the parallelism inherent in LSTM network operations for acceleration; when multiple general-purpose processors execute in parallel, the communication between processors in turn becomes a performance bottleneck.
  • In addition, the general-purpose processor needs to decode the artificial neural network operations into a long sequence of computation and memory-access instructions, and the processor front-end decoding incurs a large power overhead.
  • Another known method of supporting LSTM network operations is to use a graphics processing unit (GPU), which executes the above algorithms by issuing generic SIMD instructions using a general-purpose register file and general-purpose stream processing units.
  • Since the GPU is a device dedicated to graphics and image operations and scientific computing, it provides no dedicated support for LSTM networks, so a large amount of front-end decoding work is still required before LSTM network operations can be performed, which brings significant overhead.
  • In addition, the GPU has only a small on-chip cache, so the parameters used in the LSTM network must be repeatedly transferred from off-chip, and the off-chip bandwidth becomes a performance bottleneck.
  • To this end, the present invention provides an apparatus for performing an LSTM neural network operation, comprising:
  • a plurality of data buffer units arranged in parallel, for buffering the data, states, and results required by the operation;
  • a plurality of data processing modules arranged in parallel, for obtaining the input data and the weights and biases required by the operation from the corresponding data buffer units and performing the LSTM neural network operation; wherein the plurality of data processing modules correspond one-to-one with the data buffer units, and parallel operations are performed among the plurality of data processing modules.
  • The present invention further provides an apparatus for performing an LSTM neural network operation, including a memory and a processor, wherein the processor performs the following operations:
  • Step 1: read the weights and biases for the LSTM neural network operation from the externally specified address space, divide them into a plurality of parts corresponding to the neurons of the LSTM neural network operation, and store the parts in different spaces of the memory, where the numbers of weights and biases in each of the different spaces are the same; also read the input data for the LSTM neural network operation from the externally specified address space and store a complete copy in each of the different spaces of the memory;
  • Step 2: divide the weights and input data in each of the different spaces of the memory into several shares, where the number of weights or input data in each share equals the number of operations performed by the corresponding vector operation unit; each time, compute a partial sum from one share of weights and input data, then vector-add it to the previously obtained partial sum to obtain a new partial sum, where the initial value of the partial sum is the bias value;
  • Step 3: after all the input data in each of the different spaces of the memory have been processed, the resulting partial sum is the net activation of the corresponding neuron; transform the net activation of the neuron by the nonlinear tanh or sigmoid function to obtain the output value of the neuron;
  • Step 4: in this way, using different weights and biases, repeat steps 1 to 3 to compute, respectively, the vector values of the forget gate, the input gate, the output gate, and the candidate state unit in the LSTM neural network operation; the partial-sum computation uses vector operation instructions, and the input data in each of the different spaces of the memory are computed in parallel;
  • Step 5: determine whether the computations of the current forget gate, input gate, and candidate state unit vector values in each of the different spaces of the memory are complete; if so, compute the new state unit: obtain a partial sum from the old state unit and the forget gate vector value through the vector point-multiplication component, then obtain a partial sum from the candidate state unit and the input gate values through the vector point-multiplication component, and combine the two partial sums through the vector summation sub-module to obtain the updated state unit, while transforming the updated state unit by the nonlinear transformation function tanh; then determine whether the nonlinear transformation of the updated state unit and the output gate computation are complete, and if so, multiply the output gate with the nonlinearly transformed updated state unit through the vector point-multiplication component to obtain the final output value of each of the different spaces of the memory;
  • Step 6: splice the final output values of each of the different spaces of the memory to obtain the final output value. A code sketch of the partial-sum accumulation in steps 2 and 3 follows this list.
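As a non-authoritative illustration, the partial-sum accumulation of steps 2 and 3 can be sketched in NumPy as follows; the function names, the share width, and the dense-matrix layout are assumptions made for clarity, not details taken from the patent.

```python
import numpy as np

def gate_output(weights, bias, inputs, share, activation):
    # Steps 2-3: the partial sum starts from the bias value and is updated
    # with one share of weights and input data at a time.
    partial = bias.copy()
    for start in range(0, inputs.shape[0], share):
        w = weights[:, start:start + share]   # one share of weights
        x = inputs[start:start + share]       # one share of input data
        partial = partial + w @ x             # vector-add the new contribution
    return activation(partial)                # net activation -> neuron output

# Example: a 4-neuron gate slice over 16 inputs with a share width of 8.
rng = np.random.default_rng(0)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
out = gate_output(rng.normal(size=(4, 16)), np.zeros(4),
                  rng.normal(size=16), 8, sigmoid)
```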
  • The present invention further provides an operation method of an LSTM neural network, which comprises the following steps:
  • Step S1: read the weights and biases for the LSTM neural network operation from an externally specified address space, write them to a plurality of data buffer units arranged in parallel, and initialize the state unit of each data buffer unit; the weights and biases read from the externally specified address space are divided in correspondence with the neurons of the LSTM neural network operation and sent to the respective data buffer units, and the numbers of weights and biases in each data buffer unit are the same;
  • Step S2: read the input data from the externally specified address space and write it to the plurality of data buffer units, where the input data written into each data buffer unit is complete;
  • Step S3: the plurality of data processing modules, corresponding one-to-one with the plurality of data buffer units arranged in parallel, read the weights, biases, and input data from their corresponding data buffer units, and perform the LSTM neural network operation using a vector point-multiplication component, a vector addition component, a vector summation component, and a vector nonlinear function conversion component, obtaining the output value of each data processing module;
  • Step S4: splice the output values of the respective data processing modules to obtain the final output value, that is, the final result of the LSTM neural network operation.
  • FIG. 1 is a block diagram showing an overall structure of an apparatus for performing an LSTM network operation according to an embodiment of the present invention
  • FIG. 2 shows a schematic diagram of a data processing module of an apparatus for performing LSTM network operations, in accordance with an embodiment of the present invention
  • FIG. 3 shows a flow chart of a method for performing an LSTM network operation, in accordance with an embodiment of the present invention
  • FIG. 4 shows a detailed flow chart of a data processing procedure in a method for performing LSTM network operations in accordance with an embodiment of the present invention.
  • The apparatus for performing LSTM network operations of the present invention can be applied to the following scenarios, including but not limited to: various electronic products such as data processing devices, robots, computers, printers, scanners, telephones, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, cloud servers, cameras, video cameras, projectors, watches, headsets, mobile storage, and wearable devices; various vehicles such as aircraft, ships, and road vehicles; various household appliances such as televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; and various medical equipment such as nuclear magnetic resonance instruments, B-mode ultrasound scanners, and electrocardiographs.
  • Specifically, the present invention discloses an apparatus for performing an LSTM neural network operation, comprising:
  • a plurality of data buffer units arranged in parallel, for buffering the data, states, and results required by the operation;
  • a plurality of data processing modules arranged in parallel, for obtaining the input data and the weights and biases required by the operation from the corresponding data buffer units and performing the LSTM neural network operation; wherein the plurality of data processing modules correspond one-to-one with the data buffer units, and parallel operations are performed among the plurality of data processing modules.
  • The data buffer unit also buffers the intermediate results computed by the data processing module, and imports the weights and biases from the direct memory access unit only once during the entire execution, after which they do not change.
  • Each data buffer unit is written with the weights and biases divided in correspondence with the neurons of the LSTM neural network operation, where the numbers of weights and biases in each data buffer unit are the same, and each data buffer unit obtains a complete copy of the input data.
  • The data processing module performs the LSTM neural network operation using a vector point-multiplication component, a vector addition component, a vector summation component, and a vector nonlinear function conversion component.
  • The vector nonlinear function conversion component may perform the function evaluation by a table lookup method.
  • Each data processing module performs the vector operations by separately computing the vector values of the forget gate, the input gate, the output gate, and the candidate state unit in the LSTM network operation, then obtains the output value of the data processing module from these vector values; finally, the output values of the data processing modules are spliced to obtain the final output value, as sketched below.
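To make the neuron-wise division and final splicing concrete, here is a minimal sketch assuming a dense weight matrix and a caller-supplied per-module compute function; both are illustrative stand-ins, not the patent's actual interfaces.

```python
import numpy as np

def parallel_gate(W, b, x, num_modules, compute_module):
    # Split the weights and biases by neuron; every module receives the
    # complete input vector, mirroring the data buffer units.
    W_parts = np.array_split(W, num_modules, axis=0)
    b_parts = np.array_split(b, num_modules)
    outputs = [compute_module(Wp, bp, x) for Wp, bp in zip(W_parts, b_parts)]
    return np.concatenate(outputs)  # splice the per-module outputs
```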
  • As a preferred embodiment, the present invention discloses an apparatus for performing LSTM neural network operations, including:
  • a direct memory access unit, for acquiring the instructions and data required by the LSTM neural network operation from an external address space outside the apparatus, transmitting the instructions and data to the instruction cache unit and the data buffer units respectively, and writing the operation results from the data processing modules or the data buffer units back to the external address space;
  • an instruction cache unit, for caching the instructions acquired by the direct memory access unit from the external address space and feeding them to the controller unit;
  • a controller unit, which reads instructions from the instruction cache unit and decodes them into microinstructions that control the direct memory access unit to perform data IO operations, the data processing modules to perform the related computations, and the data buffer units to perform data buffering and transmission;
  • a plurality of data buffer units arranged in parallel, for buffering the data, states, and results required by the operation;
  • a plurality of data processing modules arranged in parallel, for obtaining the input data and the weights and biases required by the operation from the corresponding data buffer units, performing the LSTM neural network operation, and feeding the operation results to the corresponding data buffer units or the direct memory access unit; wherein the plurality of data processing modules correspond one-to-one with the data buffer units, and parallel operations are performed among them.
  • Preferably, the direct memory access unit, the instruction cache unit, the controller unit, the plurality of data buffer units, and the plurality of data processing modules are all implemented by hardware circuits.
  • Preferably, the data buffer unit also buffers the intermediate results computed by the data processing module, and imports the weights and biases from the direct memory access unit only once during the entire execution, after which they do not change.
  • Preferably, the plurality of data processing modules each perform the LSTM neural network operation using a vector point-multiplication component, a vector addition component, a vector summation component, and a vector nonlinear function conversion component.
  • Preferably, the vector nonlinear function conversion component performs the function evaluation by a table lookup method.
  • Preferably, the plurality of data processing modules perform parallel operations in the following manner:
  • Step 1: each corresponding data buffer unit is written with the weights and biases read from the externally specified address space and divided in correspondence with the neurons of the LSTM neural network operation, where the numbers of weights and biases in each data buffer unit are the same, and each data buffer unit obtains a complete copy of the input data; each data processing module divides the weights and input data in its corresponding data buffer unit into several shares, where the number of weights or input data in each share equals the number of operations performed by the vector operation unit in a single data processing module; each time, one share of weights and input data is sent to the corresponding data processing module, a partial sum is computed, the previously obtained partial sum is fetched from the data buffer unit, and the two are vector-added to obtain a new partial sum, which is sent back to the data buffer unit, where the initial value of the partial sum is the bias value;
  • Step 2: after all the input data in each data buffer unit have been sent to the corresponding data processing module and processed once, the resulting partial sum is the net activation of the corresponding neuron, and the data processing module transforms the net activation of the neuron by the nonlinear tanh or sigmoid function to obtain the output value of the neuron;
  • Step 3: in this way, using different weights and biases, steps 1 and 2 are repeated to compute, respectively, the vector values of the forget gate, the input gate, the output gate, and the candidate state unit in the LSTM network operation; within the same data processing module, the partial-sum computation uses vector operation instructions, and the data are operated on in parallel;
  • Step 4: each data processing module determines whether the computations of the current forget gate, input gate, and candidate state unit vector values are complete; if so, the new state unit is computed: the old state unit and the forget gate vector value are sent to the data processing unit, a partial sum is obtained through the vector point-multiplication component and sent back to the data buffer unit; then the candidate state unit and the input gate values are sent to the data processing unit and a partial sum is obtained through the vector point-multiplication component; the partial sum in the data buffer unit is sent to the data processing module, the updated state unit is obtained through the vector summation sub-module and sent back to the data buffer unit, while the updated state unit in the data processing module is transformed by the nonlinear transformation function tanh; each data processing module then determines whether the nonlinear transformation of the updated state unit and the output gate computation are complete, and if so, the output gate and the nonlinearly transformed updated state unit are multiplied through the vector point-multiplication component to obtain the final output value, which is written back to the data buffer unit;
  • Step 5: after the output values of all the data processing modules have been written back to the data buffer units, the output values of the respective data processing modules are spliced to obtain the final output value, which is sent to the externally specified address through the direct memory access unit.
  • The present invention also discloses an apparatus for performing an LSTM neural network operation, including a memory and a processor, wherein the processor performs the following operations:
  • Step 1: read the weights and biases for the LSTM neural network operation from the externally specified address space, divide them into a plurality of parts corresponding to the neurons of the LSTM neural network operation, and store the parts in different spaces of the memory, where the numbers of weights and biases in each of the different spaces are the same; also read the input data for the LSTM neural network operation from the externally specified address space and store a complete copy in each of the different spaces of the memory;
  • Step 2: divide the weights and input data in each of the different spaces of the memory into several shares, where the number of weights or input data in each share equals the number of operations performed by the corresponding vector operation unit; each time, compute a partial sum from one share of weights and input data, then vector-add it to the previously obtained partial sum to obtain a new partial sum, where the initial value of the partial sum is the bias value;
  • Step 3: after all the input data in each of the different spaces of the memory have been processed, the resulting partial sum is the net activation of the corresponding neuron; transform the net activation of the neuron by the nonlinear tanh or sigmoid function to obtain the output value of the neuron;
  • Step 4: in this way, using different weights and biases, repeat steps 1 to 3 to compute, respectively, the vector values of the forget gate, the input gate, the output gate, and the candidate state unit in the LSTM neural network operation; the partial-sum computation uses vector operation instructions, and the input data in each of the different spaces of the memory are computed in parallel;
  • Step 5: determine whether the computations of the current forget gate, input gate, and candidate state unit vector values in each of the different spaces of the memory are complete; if so, compute the new state unit: obtain a partial sum from the old state unit and the forget gate vector value through the vector point-multiplication component, then obtain a partial sum from the candidate state unit and the input gate values through the vector point-multiplication component, and combine the two partial sums through the vector summation sub-module to obtain the updated state unit, while transforming the updated state unit by the nonlinear transformation function tanh; then determine whether the nonlinear transformation of the updated state unit and the output gate computation are complete, and if so, multiply the output gate with the nonlinearly transformed updated state unit through the vector point-multiplication component to obtain the final output value of each of the different spaces of the memory;
  • Step 6: splice the final output values of each of the different spaces of the memory to obtain the final output value.
  • The present invention also discloses an operation method of an LSTM neural network, comprising the following steps:
  • Step S1: read the weights and biases for the LSTM neural network operation from the externally specified address space, write them to a plurality of data buffer units arranged in parallel, and initialize the state unit of each data buffer unit; the weights and biases read from the externally specified address space are divided in correspondence with the neurons of the LSTM neural network operation and sent to the respective data buffer units, and the numbers of weights and biases in each data buffer unit are the same;
  • Step S2: read the input data from the externally specified address space and write it to the plurality of data buffer units, where the input data written into each data buffer unit is complete;
  • Step S3: the plurality of data processing modules, corresponding one-to-one with the plurality of data buffer units arranged in parallel, read the weights, biases, and input data from their corresponding data buffer units, and perform the LSTM neural network operation using a vector point-multiplication component, a vector addition component, a vector summation component, and a vector nonlinear function conversion component, obtaining the output value of each data processing module;
  • Step S4: splice the output values of the respective data processing modules to obtain the final output value, that is, the final result of the LSTM neural network operation.
  • Preferably, in step S3, each data processing module divides the weights and input data in its corresponding data buffer unit into several shares, where the number of weights or input data in each share equals the number of operations performed by the vector operation unit in a single data processing module; each data buffer unit sends one share of weights and input data at a time to its corresponding data processing module, a partial sum is computed, the previously obtained partial sum is fetched from the data buffer unit, and the two are vector-added to obtain a new partial sum, which is sent back to the data buffer unit, where the initial value of the partial sum is the bias value;
  • after all the input data have been sent to the data processing module once, the resulting partial sum is the net activation of the corresponding neuron; the net activation is transformed by the nonlinear tanh or sigmoid function in the data operation sub-module to obtain the output value of the neuron; in this way, using different weights and biases, the vector values of the forget gate, the input gate, the output gate, and the candidate state unit in the LSTM neural network are computed respectively;
  • each data processing module determines whether the computations of the current forget gate, input gate, and candidate state unit vector values are complete; if so, the new state unit is computed: the old state unit and the forget gate vector value are sent to the data processing unit, a partial sum is obtained through the vector point-multiplication component and returned to the data buffer unit; then the candidate state unit and the input gate values are sent to the data processing unit and a partial sum is obtained through the vector point-multiplication component; the partial sum in the data buffer unit is sent to the data processing module, the updated state unit is obtained through the vector summation sub-module and sent back to the data buffer unit, while the updated state unit in the data processing module is transformed by the nonlinear transformation function tanh; each data processing module then determines whether the nonlinear transformation of the updated state unit and the output gate computation are complete, and if so, the output gate and the nonlinearly transformed updated state unit are multiplied through the vector point-multiplication component to obtain the final output value;
  • preferably, the nonlinear tanh or sigmoid function is evaluated by a table lookup method.
  • Based on the above, the present invention discloses an apparatus and method for LSTM network operations, which can be used to accelerate applications that use LSTM networks. The method specifically includes the following steps:
  • The weights and biases used in the LSTM network operation are fetched from the externally specified address space by the direct memory access unit and written to the respective data buffer units; the weights and biases fetched from the externally specified address space are divided and sent to the data buffer units, the numbers of weights and biases in each data buffer unit are the same, the weights and biases in each data buffer unit correspond to their neurons, and the state units of the data buffer units are initialized.
  • Each data processing module divides the weights and input data in its data buffer unit into several shares, where the number of weights or input data in each share equals the number of operations performed by the vector operation unit in a single data processing module; each time, one share of weights and input data is sent to the data processing module, a partial sum is computed, the previously obtained partial sum is fetched from the data buffer unit, and the two are vector-added to obtain a new partial sum, which is sent back to the data buffer unit, where the initial value of the partial sum is the bias value.
  • After all the input data have been sent to the data processing module once, the resulting partial sum is the net activation of the corresponding neuron, which is then transformed by the nonlinear tanh or sigmoid function in the data operation sub-module to obtain the output value of the neuron; the function transformation here can be carried out by two methods, table lookup or direct function evaluation. Using different weights and biases in this way, the vector values of the forget gate, the input gate, the output gate, and the candidate state unit in the LSTM network can be computed separately; the partial-sum computation uses vector operation instructions, and the data are processed in parallel.
  • The data dependency discrimination sub-module in each data processing module determines whether the computations of the current forget gate, input gate, and candidate state unit vector values are complete, and if so, the new state unit is computed: the old state unit and the forget gate vector value are sent to the data processing unit, a partial sum is obtained through the vector point-multiplication component in the data operation sub-module and sent back to the data buffer unit; then the candidate state unit and the input gate values are sent to the data processing unit, a partial sum is obtained through the vector point-multiplication component in the data operation sub-module, the partial sum in the data buffer unit is sent to the data processing module, and the updated state unit is obtained through the vector summation sub-module in the data operation sub-module and sent back to the data buffer unit; at the same time, the updated state unit in the data processing module is transformed by the nonlinear transformation function tanh in the data operation sub-module.
  • The data dependency discrimination sub-module in each data processing module then determines whether the nonlinear transformation of the updated state unit and the output gate computation are complete; if so, the output gate and the nonlinearly transformed updated state unit are multiplied through the vector point-multiplication component in the data operation sub-module to obtain the final output value, and the output value is written back to the data buffer unit.
  • FIG. 1 is a block diagram showing the overall structure of an apparatus for performing an LSTM network operation according to an embodiment of the present invention.
  • As shown in FIG. 1, the device includes a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, data buffer units 4, and data processing modules 5, all of which can be implemented by hardware circuits.
  • The direct memory access unit 1 can access the external address space, read and write data to each cache unit inside the device, and complete the loading and storing of data. Specifically, it reads instructions into the instruction cache unit 2, reads the weights, biases, and input data required by the LSTM network operation from the specified storage unit into the data buffer units 4, and writes the outputs after the operation directly from the data buffer units 4 to the externally specified space.
  • The instruction cache unit 2 reads instructions through the direct memory access unit 1 and caches them.
  • The controller unit 3 reads instructions from the instruction cache unit 2, decodes them into microinstructions that control the behavior of the other modules, and sends them to the other modules such as the direct memory access unit 1, the data buffer units 4, and the data processing modules 5.
  • The data buffer unit 4 initializes the state unit of the LSTM at device initialization and reads the weights and biases from the externally specified address through the direct memory access unit 1; the weights and biases read into each data buffer unit 4 correspond to the neurons to be computed there, that is, they are a part of the total weights and biases, and the weights and biases in all the data buffer units 4 together constitute the complete set read from the externally specified address.
  • In operation, a portion of the weights, biases, and input values is sent to the data processing module 5, the intermediate value is computed in the data processing module 5, read back, and kept in the data buffer unit 4 as a partial sum; the partial sum is fed back into the data processing module 5 to compute the neuron outputs and then written back to the data buffer unit 4, finally yielding the vector values of the input gate, the output gate, the forget gate, and the candidate state unit.
  • Then, the forget gate and the old state unit are sent to the data processing module 5 and a partial sum is computed and written back to the data buffer unit 4; the candidate state unit and the input gate are sent to the data processing module 5 and a partial sum is computed; the partial sum in the data buffer unit 4 is written into the data processing module 5, vector-added to the previously computed partial sum to obtain the updated state unit, and written back to the data buffer unit 4.
  • Finally, each data buffer unit 4 writes its partial output value back to the externally specified address space through the direct memory access unit 1.
  • In the notation used below, x_t is the input data at time t, and h_{t-1} is the output data at time t-1; W_f, W_i, W_c, and W_o are the weights corresponding to the forget gate, the input gate, the state unit update, and the output gate, respectively, and b_f, b_i, b_c, and b_o are the corresponding biases; f_t is the output of the forget gate, which is point-multiplied with the old state unit to selectively forget past state values; i_t is the output of the input gate, which is point-multiplied with the candidate state values obtained at time t to selectively add the candidate state values into the state unit at time t; c_t is the new state value, obtained by adding the selectively retained state value from time t-1 and the selectively admitted candidate state value at time t, and is used to compute the final output. The computation follows the standard LSTM equations:

    f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
    i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
    c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
    o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
    h_t = o_t ⊙ tanh(c_t)

  • Here ⊙ denotes the element-wise (point-wise) product of vectors; σ is the sigmoid function, computed as σ(x) = 1/(1 + e^(-x)); and the activation function tanh is computed as tanh(x) = (e^x - e^(-x))/(e^x + e^(-x)).
  • Each time, the data processing module 5 reads a portion of the weights W_i/W_f/W_o/W_c and the biases b_i/b_f/b_o/b_c, together with the corresponding input data [h_{t-1}, x_t], from its data buffer unit 4, and completes the partial-sum computation through the vector multiplication component and the summation component in the data processing module 5, until all the input data for each neuron have been computed once; the net activation net_i/net_f/net_o/net_c of the neuron is then obtained, and the output value is computed through the vector nonlinear sigmoid or tanh function conversion. In this way, the input gate i_t, the forget gate f_t, the output gate o_t, and the candidate state unit c̃_t are computed respectively.
  • Then, the point multiplications of the old state unit with the forget gate, and of the candidate state unit with the input gate, are computed by the vector point-multiplication component in the data processing module 5, and the two results are combined by the vector addition component to obtain the new state unit c_t. The newly obtained state unit is written back to the data buffer unit 4.
  • The state unit in the data processing module 5 is then transformed through the tanh function by the vector nonlinear function conversion component to obtain tanh(c_t), which can be done in the computation either by evaluating the tanh function directly or by table lookup. Then the vectors of the output gate and the tanh-transformed state unit are operated on by the vector point-multiplication component to obtain the final neuron output value h_t, which is written back to the data buffer unit 4. A software sketch of this time step follows.
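For reference, one LSTM time step following the equations above can be sketched in NumPy; the dict-based weight layout and the function names are illustrative assumptions rather than the patent's interface.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    # W and b are dicts keyed by 'f', 'i', 'c', 'o'; each W[k] has shape
    # (hidden, hidden + input) and acts on the concatenation [h_{t-1}, x_t].
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W['f'] @ z + b['f'])       # forget gate
    i_t = sigmoid(W['i'] @ z + b['i'])       # input gate
    c_tilde = np.tanh(W['c'] @ z + b['c'])   # candidate state unit
    c_t = f_t * c_prev + i_t * c_tilde       # point-multiply and add
    o_t = sigmoid(W['o'] @ z + b['o'])       # output gate
    h_t = o_t * np.tanh(c_t)                 # final neuron output
    return h_t, c_t
```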
  • FIG. 2 shows a schematic diagram of the data processing module of an apparatus for performing LSTM network operations according to an embodiment of the present invention.
  • As shown in FIG. 2, the data processing module 5 includes a data processing control sub-module 51, a data dependency discrimination sub-module 52, and a data operation sub-module 53.
  • The data processing control sub-module 51 controls the operations performed by the data operation sub-module 53 and controls the data dependency discrimination sub-module 52 to determine whether the current operation has a data dependency. For some operations, the data processing control sub-module 51 directly controls the data operation sub-module 53 to perform them; for operations that may have data dependencies, the data processing control sub-module 51 first controls the data dependency discrimination sub-module 52 to determine whether the current operation has a data dependency; if it does, the data processing control sub-module 51 inserts null operations into the data operation sub-module 53 and, after the data dependency is released, controls the data operation sub-module 53 to perform the data operation.
  • The data dependency discrimination sub-module 52 is controlled by the data processing control sub-module 51 to check whether a data dependency exists in the data operation sub-module 53: if the next operation requires a value that has not yet been computed, a data dependency currently exists; otherwise it does not.
  • One method of data dependency detection is to provide registers R1, R2, R3, R4, and R5 in the data operation sub-module 53, which mark, respectively, whether the computations of the input gate, the forget gate, the output gate, the candidate state unit, and the tanh transformation of the updated state unit are complete; a non-zero register value indicates the operation is complete, and 0 indicates it is not yet complete.
  • The data dependency is checked twice: when the new state unit is computed, it is determined whether the input gate, the forget gate, and the candidate state unit are free of data dependencies, that is, whether R1, R2, and R4 are all non-zero; when the final output is computed, it is determined whether the output gate and the tanh transformation of the updated state unit are free of data dependencies, that is, whether R3 and R5 are both non-zero. After each check, the result is sent back to the data processing control sub-module 51. A software analogue of this flag scheme is sketched below.
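A minimal software analogue of the flag scheme, using the register numbering above (the class and method names are hypothetical, chosen only for illustration):

```python
class DependencyFlags:
    # R1: input gate, R2: forget gate, R3: output gate,
    # R4: candidate state unit, R5: tanh of the updated state unit.
    # A non-zero value means the corresponding computation is complete.
    def __init__(self):
        self.R = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0}

    def ready_for_state_update(self):
        # The new state unit needs the input gate, forget gate,
        # and candidate state unit (the first dependency check).
        return all(self.R[k] != 0 for k in (1, 2, 4))

    def ready_for_output(self):
        # The final output needs the output gate and tanh(c_t)
        # (the second dependency check).
        return self.R[3] != 0 and self.R[5] != 0
```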
  • The data operation sub-module 53 is controlled by the data processing control sub-module 51 to complete the data processing in the network operation process.
  • The data operation sub-module 53 includes a vector point-multiplication component, a vector addition component, a vector summation component, a vector nonlinear transformation component, and the registers R1, R2, R3, R4, and R5 that flag whether the related data operations are complete.
  • The registers R1, R2, R3, R4, and R5 mark, respectively, whether the computations of the input gate, the forget gate, the output gate, the candidate state unit, and the tanh transformation of the updated state unit are complete; a non-zero register value indicates the operation is complete, and 0 indicates it is not yet complete.
  • The vector addition component adds the two input vectors position by position to obtain one vector; the vector summation component divides a vector into several segments and sums the elements within each segment, the length of the resulting vector being equal to the number of segments; the vector nonlinear transformation component takes each element of a vector as input and produces the corresponding nonlinear function output.
  • The nonlinear transformation can be done in two ways; take the sigmoid function with input x as an example: one way is to directly compute sigmoid(x) by function evaluation; the other is the table lookup method, in which the data operation sub-module 53 maintains a table for the sigmoid function that records, for a series of inputs x_1, x_2, ..., x_n, the values of the corresponding outputs y_1, y_2, ..., y_n, and the output for x is obtained from the table entries nearest to x. A sketch of the lookup method follows.
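Assuming the table stores sampled sigmoid values and that outputs between entries are linearly interpolated (an assumption; the patent does not fix the interpolation rule), a plausible realization is:

```python
import numpy as np

# Precomputed table: inputs x_1..x_n and outputs y_i = sigmoid(x_i).
TABLE_X = np.linspace(-8.0, 8.0, 1024)
TABLE_Y = 1.0 / (1.0 + np.exp(-TABLE_X))

def sigmoid_lookup(x):
    # Linear interpolation between neighboring table entries; inputs outside
    # the table range clamp to the end values (approximately 0 and 1).
    return np.interp(x, TABLE_X, TABLE_Y)
```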
  • At the start of a computation, R1, R2, R3, R4, and R5 are set to zero. To compute the input gate, a share of weights and input values is read and multiplied by the vector point-multiplication component, the vector summation component completes the summation into a temporary value, and the result is vector-added to the input-gate partial sum to update it; then another share of input data and weights is taken and the same operation updates the partial sum, until all the input data have been processed once.
  • The resulting partial sum is the net activation of the neuron.
  • The output value of the input gate is then computed by the vector nonlinear transformation component.
  • The output value is written back to the data buffer unit 4 and the R1 register is set to non-zero.
  • Next, a null operation or the computation of the updated state unit is performed according to the control command of the data processing control sub-module 51.
  • The computation of the updated state unit is: fetch the forget gate output value and the old state unit from the data buffer unit 4 and compute a partial sum through the vector point-multiplication component; then fetch the input gate output value and the candidate state unit from the data buffer unit 4 and compute another partial sum through the vector point-multiplication component; add the two partial sums through the vector addition component to obtain the updated state unit.
  • The updated state unit is finally written back to the data buffer unit 4.
  • Then, a null operation or the computation of the LSTM network output value is performed according to the control command of the data processing control sub-module 51.
  • The computation of the output value is: transform the updated state unit by the vector nonlinear function conversion component to obtain the nonlinear transformation value of the state unit, and set R5 to non-zero; then point-multiply the output gate with the nonlinear transformation value of the state unit through the vector point-multiplication component to obtain the final output value, that is, the output value of the neurons of the LSTM network handled by this module.
  • The output value is written back to the data buffer unit 4.
  • The device of the present invention runs a specially designed instruction set, so the efficiency of instruction decoding is high.
  • The multiple data processing modules compute in parallel and the multiple data buffer units operate in parallel without any data transfer between them, which greatly improves the parallelism of the operations.
  • Keeping the weights and biases in the data buffer units reduces the IO operations between the device and the external address space and the bandwidth required for memory access.
  • FIG. 3 illustrates a flow chart for performing LSTM network operations in accordance with an embodiment of the present invention.
  • In step S1, an IO instruction is pre-stored at the first address of the instruction cache unit 2.
  • In step S2, the controller unit 3 reads the IO instruction from the first address of the instruction cache unit 2, and according to the decoded microinstruction, the direct memory access unit 1 reads all the instructions related to the LSTM network computation from the external address space and caches them in the instruction cache unit 2.
  • In step S3, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the decoded microinstruction, the direct memory access unit 1 reads from the externally specified address space the weights and biases related to the LSTM network operation, including those of the input gate, the output gate, the forget gate, and the candidate state unit; the weights and biases are divided according to the neurons they correspond to and read into the different data buffer units 4.
  • In step S4, the controller unit 3 reads a state-unit initialization instruction from the instruction cache unit 2, and according to the decoded microinstruction, initializes the state unit values in the data buffer units 4, setting the partial sums of the input gate, the output gate, the forget gate, and the candidate state unit to the corresponding neuron bias values.
  • In step S5, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the decoded microinstruction, the direct memory access unit 1 reads the input values from the externally specified address space into the data buffer units 4, each data buffer unit 4 receiving a copy of the same input-value vector.
  • In step S6, the controller unit 3 reads a data processing instruction from the instruction cache unit 2, and according to the decoded microinstruction, each data processing module 5 fetches the relevant data required for the operation from its corresponding data buffer unit 4 and performs the operation; the result of the operation is the output values of the subset of neurons corresponding to one time point, and the output values processed by all the data processing modules 5 together form the output values for that time point; after processing, each data processing module 5 stores the intermediate values, output values, and state unit values in its data buffer unit 4.
  • In step S7, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the decoded microinstruction, the output values in the data buffer units 4 are spliced together and written to the externally specified address through the direct memory access unit 1.
  • In step S8, the controller unit 3 reads a discrimination instruction from the instruction cache unit 2, and according to the decoded microinstruction, the controller unit 3 determines whether the current forward pass is complete; if so, the operation ends; if not, the flow returns to step S6 and continues. A control-flow sketch of steps S1 to S8 follows.
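The control flow above can be summarized by the following sketch; the controller, dma, buffer, and module objects are hypothetical stand-ins for the hardware units, introduced only to show the ordering of the microinstruction-driven phases.

```python
def splice(outputs):
    # Concatenate the per-module output slices into the final output value.
    return [v for part in outputs for v in part]

def run_lstm_device(controller, dma, data_buffers, modules):
    dma.load_instructions()                     # S1-S2: fetch the program
    dma.load_weights_and_biases(data_buffers)   # S3: divided by neuron
    for buf in data_buffers:
        buf.init_state_units()                  # S4: partial sums <- biases
    dma.load_inputs(data_buffers)               # S5: full input to every buffer
    while not controller.forward_pass_done():   # S8: discrimination instruction
        outputs = [m.process(buf)               # S6: parallel module operation
                   for m, buf in zip(modules, data_buffers)]
        dma.write_back(splice(outputs))         # S7: splice and write out
```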
  • FIG. 4 shows a detailed flow chart of the data processing procedure in a method for performing LSTM network operations according to an embodiment of the present invention.
  • In step S1, the data processing module 5 reads a share of the input gate weights and input values from the data buffer unit 4.
  • In step S2, the data processing control sub-module 51 in the data processing module 5 controls the vector point-multiplication component in the data operation sub-module 53 to compute the point multiplication of the input gate weights and the input values; the results are then grouped by the neurons they belong to, and the point-multiplication results within each group are summed by the vector summation component in the data operation sub-module 53 to obtain a partial sum.
  • In step S3, the data processing module 5 reads the input-gate partial sum from the data buffer unit 4.
  • In step S4, the data processing control sub-module 51 controls the data operation sub-module 53 to add the newly computed partial sum to the one just read in, obtaining the updated input-gate partial sum.
  • In step S5, the data processing module 5 writes the updated partial sum back to the data buffer unit 4.
  • In step S6, the data processing module 5 determines whether all the input gate weights have been operated on once; if so, the partial sum in the data buffer unit is the value of the input gate, and the R1 register is set to non-zero; otherwise, a different share of the input gate weights and input values is used and the flow returns to step S1.
  • In step S7, the forget gate output value, the output gate output value, and the candidate state unit output value are obtained in the same manner, R2, R3, and R4 are set to non-zero, and the output values are all written back to the data buffer unit 4.
  • In step S8, the data processing control sub-module 51 controls the data dependency discrimination sub-module 52 to determine whether the forget gate, the input gate, and the candidate state unit have completed their computations, that is, whether R1, R2, and R4 are all non-zero; if not, the data processing control sub-module 51 controls the data operation sub-module 53 to perform a null operation and the flow returns to step S8; if so, the flow proceeds to step S9.
  • In step S9, the data processing module 5 reads the old state unit and the forget gate output value from the data buffer unit 4.
  • In step S10, the data processing control sub-module 51 controls the data operation sub-module 53 to compute a partial sum from the old state unit and the forget gate output value using the vector point-multiplication component.
  • In step S11, the data processing module 5 reads the candidate state unit and the input gate output value from the data buffer unit 4.
  • In step S12, the data processing control sub-module 51 controls the data operation sub-module 53 to compute a partial sum from the candidate state unit and the input gate output value through the vector point-multiplication component, and to add this partial sum to the one computed in step S10 through the vector addition component, obtaining the updated state unit.
  • In step S13, the data processing module 5 sends the updated state unit back to the data buffer unit 4.
  • In step S14, the data processing control sub-module 51 controls the data operation sub-module 53 to compute the tanh nonlinear transformation value of the updated state unit using the vector nonlinear transformation component, and R5 is set to non-zero.
  • In step S15, the data processing control sub-module 51 controls the data dependency discrimination sub-module 52 to determine whether the output gate output value and the tanh transformation value of the state unit have been computed, that is, whether R3 and R5 are both non-zero; if not, the data processing control sub-module 51 controls the data operation sub-module 53 to perform a null operation and the flow returns to step S15; if so, the flow proceeds to step S16.
  • In step S16, the data processing module 5 reads the output gate output value from the data buffer unit 4.
  • In step S17, the data processing control sub-module 51 controls the data operation sub-module 53 to point-multiply the output gate output value with the tanh transformation value of the state unit through the vector point-multiplication component to obtain the output value, that is, the output values of the neurons of the LSTM network handled by this data processing module 5.
  • In step S18, the data processing module 5 writes the output value back to the data buffer unit 4 (a consolidated sketch of steps S1 to S18 follows).
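Putting steps S1 to S18 together for a single data processing module, under the same hypothetical buffer interface and the DependencyFlags sketch given earlier (buf.shares, buf.gate, and the other helpers are illustrative names, not the patent's):

```python
import numpy as np

def process_module(buf, flags):
    # S1-S7: compute the four gate vectors by partial-sum accumulation.
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    gates = (('i', 1, sigmoid), ('f', 2, sigmoid),
             ('o', 3, sigmoid), ('c', 4, np.tanh))
    for name, reg, act in gates:
        partial = buf.bias(name).copy()
        for w, x in buf.shares(name):        # one share of weights and inputs
            partial = partial + w @ x
        buf.store_gate(name, act(partial))   # write back the gate output
        flags.R[reg] = 1                     # mark the computation complete
    # S8: first dependency check (R1, R2, R4 all non-zero); in hardware the
    # module would issue null operations while waiting, asserted here instead.
    assert flags.ready_for_state_update()
    # S9-S13: update the state unit with two point-multiplications and an add.
    c_t = buf.gate('f') * buf.old_state() + buf.gate('i') * buf.gate('c')
    buf.store_state(c_t)
    # S14: tanh of the updated state unit.
    tanh_c = np.tanh(c_t)
    flags.R[5] = 1
    # S15: second dependency check (R3 and R5 non-zero).
    assert flags.ready_for_output()
    # S16-S18: this module's slice of the final output value.
    return buf.gate('o') * tanh_c
```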
  • The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software embodied on a non-transitory computer-readable medium), or a combination of the two.
  • Although the processes or methods have been described above in a certain order, it should be understood that some of the operations described may be performed in a different order; moreover, some operations may be performed in parallel rather than sequentially.
  • The weights and biases of the hidden layer are reused during the LSTM network operation, and temporarily storing them in the data buffer units reduces the amount of IO with the external address space and the overhead incurred by data transmission;
  • the present invention does not limit the application field of the LSTM network; it can be used in fields such as speech recognition, text translation, and music synthesis, and has strong scalability;
  • the multiple data processing modules in the device are completely parallel, and the components inside each data processing module operate in parallel, which fully exploits the parallelism of the LSTM network and significantly improves its computation speed;
  • the vector nonlinear function conversion component can be realized by a table lookup method, which greatly improves efficiency compared with conventional function evaluation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

An apparatus and an operation method for performing LSTM neural network operations. The apparatus includes a direct memory access unit (1), an instruction cache unit (2), a controller unit (3), a plurality of data buffer units (4) arranged in parallel, and a plurality of data processing modules (5) arranged in parallel, where the plurality of data processing modules (5) correspond one-to-one with the data buffer units (4) and are used to obtain the input data and the weights and biases required by the operation from the corresponding data buffer units (4) and perform the LSTM neural network operation; parallel operations are performed among the plurality of data processing modules (5). The apparatus runs with dedicated instructions, so the number of instructions required for the operation is greatly reduced and the decoding overhead is lowered; buffering the weights and biases reduces the data transfer overhead; and the parallel operation of the multiple data processing modules (5) significantly improves the computation speed of the LSTM network.

Description

Apparatus and operation method for performing LSTM neural network operations

TECHNICAL FIELD

The present invention relates to the field of neural network computing technologies, and more particularly to an apparatus and an operation method for performing LSTM neural network operations.

BACKGROUND

The Long Short-Term Memory (LSTM) network is a kind of recurrent neural network (RNN). Owing to the unique structural design of the network itself, an LSTM is well suited to processing and predicting important events in time series with very long intervals and delays. Compared with traditional recurrent neural networks, the LSTM network exhibits better performance, and it is well suited to learning from experience so as to classify, process, and predict time series in which important events are separated by lags of unknown duration. At present, LSTM networks are widely applied in many fields such as speech recognition, video description, machine translation, and automatic music synthesis. Meanwhile, as research on LSTM networks continues to deepen, their performance has been greatly improved, attracting widespread attention in both industry and academia.

The operation of an LSTM network involves a variety of algorithms, and the specific implementation devices are mainly of the following two types:

One device that implements LSTM network operations is the general-purpose processor. This method supports the above algorithms by executing general-purpose instructions using a general-purpose register file and general-purpose functional units. One of the disadvantages of this method is that the computational performance of a single general-purpose processor is low, unable to exploit the parallelism inherent in LSTM network operations for acceleration. When multiple general-purpose processors execute in parallel, the communication between processors in turn becomes a performance bottleneck. In addition, the general-purpose processor needs to decode the artificial neural network operations into a long sequence of computation and memory-access instructions, and the processor front-end decoding incurs a large power overhead.

Another known method of supporting LSTM network operations is to use a graphics processing unit (GPU). This method executes the above algorithms by issuing generic SIMD instructions using a general-purpose register file and general-purpose stream processing units. Since the GPU is a device dedicated to graphics and image operations and scientific computing, it provides no dedicated support for LSTM networks, and a large amount of front-end decoding work is still required before LSTM network operations can be performed, which brings a large amount of extra overhead. In addition, the GPU has only a small on-chip cache, the relevant parameters used in the LSTM network must be repeatedly transferred from off-chip, and the off-chip bandwidth becomes a performance bottleneck.

It can thus be seen that how to design and provide an apparatus and method that realize LSTM network operations with high computational performance, a small amount of IO, and low overhead is a technical problem that urgently needs to be solved.
SUMMARY OF THE INVENTION

In view of this, the main object of the present invention is to provide an apparatus and method for performing LSTM network operations, so as to solve at least one of the above technical problems.

To achieve the above object, as one aspect of the present invention, there is provided an apparatus for performing LSTM neural network operations, comprising:

a plurality of data buffer units arranged in parallel, for buffering the data, states, and results required by the operation;

a plurality of data processing modules arranged in parallel, for obtaining the input data and the weights and biases required by the operation from the corresponding data buffer units and performing the LSTM neural network operation; wherein the plurality of data processing modules correspond one-to-one with the data buffer units, and parallel operations are performed among the plurality of data processing modules.

As another aspect of the present invention, there is also provided an apparatus for performing LSTM neural network operations, comprising:

a memory;

a processor that performs the following operations:

Step 1: reading the weights and biases for the LSTM neural network operation from an externally specified address space, dividing them into a plurality of parts corresponding to the neurons of the LSTM neural network operation, and storing the parts in different spaces of the memory, where the numbers of weights and biases in each space are the same; and reading the input data for the LSTM neural network operation from the externally specified address space and storing it in each of the different spaces of the memory;

Step 2: dividing the weights and input data in each of the different spaces of the memory into several shares, where the number of weights or input data in each share equals the number of operations performed by the corresponding vector operation unit; each time, computing a partial sum from one share of weights and input data, then vector-adding it to the previously obtained partial sum to obtain a new partial sum, where the initial value of the partial sum is the bias value;

Step 3: after all the input data in each of the different spaces of the memory have been processed, the resulting partial sum is the net activation of the corresponding neuron, and the net activation of the neuron is transformed by the nonlinear tanh or sigmoid function to obtain the output value of the neuron;

Step 4: in this way, using different weights and biases, repeating steps 1 to 3 to compute, respectively, the vector values of the forget gate, the input gate, the output gate, and the candidate state unit in the LSTM neural network operation; wherein the partial-sum computation uses vector operation instructions, and the input data in each of the different spaces of the memory are computed in parallel;

Step 5: determining whether the computations of the current forget gate, input gate, and candidate state unit vector values in each of the different spaces of the memory are complete; if so, computing the new state unit: obtaining a partial sum from the old state unit and the forget gate vector value through the vector point-multiplication component, then obtaining a partial sum from the candidate state unit and the input gate values through the vector point-multiplication component, and combining the two partial sums through the vector summation sub-module to obtain the updated state unit, while transforming the updated state unit by the nonlinear transformation function tanh; then determining whether the nonlinear transformation of the updated state unit and the output gate computation are complete, and if so, multiplying the output gate with the nonlinearly transformed updated state unit through the vector point-multiplication component to obtain the final output value of each of the different spaces of the memory;

Step 6: splicing the final output values of each of the different spaces of the memory to obtain the final output value.

As yet another aspect of the present invention, there is also provided an operation method of an LSTM neural network, comprising the following steps:

Step S1: reading the weights and biases for the LSTM neural network operation from an externally specified address space, writing them to a plurality of data buffer units arranged in parallel, and initializing the state unit of each data buffer unit; wherein the weights and biases read from the externally specified address space are divided in correspondence with the neurons of the LSTM neural network operation and sent to the respective data buffer units, and the numbers of weights and biases in each data buffer unit are the same;

Step S2: reading the input data from the externally specified address space and writing it to the plurality of data buffer units, wherein the input data written into each data buffer unit is complete;

Step S3: the plurality of data processing modules, corresponding one-to-one with the plurality of data buffer units arranged in parallel, reading the weights, biases, and input data from their corresponding data buffer units and performing the LSTM neural network operation using a vector point-multiplication component, a vector addition component, a vector summation component, and a vector nonlinear function conversion component, obtaining the output value of each data processing module;

Step S4: splicing the output values of the respective data processing modules to obtain the final output value, that is, the final result of the LSTM neural network operation.
BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:

FIG. 1 shows a schematic diagram of the overall structure of an apparatus for performing LSTM network operations according to an embodiment of the present invention;

FIG. 2 shows a schematic diagram of the data processing module of an apparatus for performing LSTM network operations according to an embodiment of the present invention;

FIG. 3 shows a flow chart of a method for performing LSTM network operations according to an embodiment of the present invention;

FIG. 4 shows a detailed flow chart of the data processing procedure in a method for performing LSTM network operations according to an embodiment of the present invention.
DETAILED DESCRIPTION

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

In this specification, the various embodiments described below for explaining the principles of the present invention are illustrative only and should not be construed in any way as limiting the scope of the invention. The following description with reference to the accompanying drawings is intended to assist in a comprehensive understanding of the exemplary embodiments of the invention as defined by the claims and their equivalents. The description includes various specific details to assist understanding, but these details should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. In addition, descriptions of well-known functions and structures are omitted for clarity and conciseness. Furthermore, throughout the drawings, the same reference numerals are used for similar functions and operations. In the present invention, the terms "comprise" and "contain" and their derivatives are intended to be inclusive rather than limiting.

The apparatus for performing LSTM network operations of the present invention can be applied to the following scenarios, including but not limited to: various electronic products such as data processing devices, robots, computers, printers, scanners, telephones, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, cloud servers, cameras, video cameras, projectors, watches, headsets, mobile storage, and wearable devices; various vehicles such as aircraft, ships, and road vehicles; various household appliances such as televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; and various medical equipment such as nuclear magnetic resonance instruments, B-mode ultrasound scanners, and electrocardiographs.
Specifically, the present invention discloses an apparatus for performing LSTM neural network operations, comprising:

a plurality of data buffer units arranged in parallel, for buffering the data, states, and results required by the operation;

a plurality of data processing modules arranged in parallel, for obtaining the input data and the weights and biases required by the operation from the corresponding data buffer units and performing the LSTM neural network operation; wherein the plurality of data processing modules correspond one-to-one with the data buffer units, and parallel operations are performed among the plurality of data processing modules.

The data buffer unit also buffers the intermediate results computed by the data processing module, and imports the weights and biases from the direct memory access unit only once during the entire execution, after which they do not change.

Each data buffer unit is written with the weights and biases divided in correspondence with the neurons of the LSTM neural network operation, where the numbers of weights and biases in each data buffer unit are the same, and each data buffer unit obtains a complete copy of the input data.

The data processing module performs the LSTM neural network operation using a vector point-multiplication component, a vector addition component, a vector summation component, and a vector nonlinear function conversion component.

The vector nonlinear function conversion component performs the function evaluation by a table lookup method.

Each data processing module performs the vector operations by separately computing the vector values of the forget gate, the input gate, the output gate, and the candidate state unit in the LSTM network operation, then obtains the output value of the data processing module from these vector values; finally, the output values of the data processing modules are spliced to obtain the final output value.
As a preferred embodiment, the present invention discloses an apparatus for executing LSTM neural network operations, comprising:
a direct memory access unit, used to fetch instructions and the data required for the LSTM neural network operation from an external address space outside the apparatus, to deliver the instructions and data to the instruction cache unit and the data cache units respectively, and to write the operation results back from the data processing modules or the data cache units to the external address space;
an instruction cache unit, used to cache the instructions fetched by the direct memory access unit from the external address space and feed them to the controller unit;
a controller unit, which reads instructions from the instruction cache unit and decodes them into microinstructions that control the direct memory access unit to perform data IO operations, the data processing modules to perform the relevant computations, and the data cache units to cache and transfer data;
a plurality of data cache units arranged in parallel, used to cache the data, states, and results required for the operation;
a plurality of data processing modules arranged in parallel, used to obtain, from the corresponding data cache units, the input data and the weights and biases required for the operation, to perform the LSTM neural network operation, and to deliver the operation results to the corresponding data cache units or the direct memory access unit; wherein the plurality of data processing modules correspond one-to-one with the data cache units, and the plurality of data processing modules perform operations in parallel with one another.
Preferably, the direct memory access unit, the instruction cache unit, the controller unit, the plurality of data cache units, and the plurality of data processing modules are all implemented by hardware circuits.
Preferably, the data cache units also cache the intermediate results computed by the data processing modules, and the weights and biases are imported from the direct memory access unit only once during the entire execution and are not changed thereafter.
Preferably, the plurality of data processing modules all perform the LSTM neural network operation using a vector dot-product component, a vector addition component, a vector summation component, and a vector nonlinear function conversion component.
Preferably, the vector nonlinear function conversion component performs the function computation by a table lookup method.
Preferably, the plurality of data processing modules perform parallel operations in the following manner:
Step 1: each corresponding data buffer unit is written with the weights and biases read from the externally designated address space and partitioned in correspondence with the neurons of the LSTM neural network operation, wherein the numbers of weights and biases in the respective data buffer units are all the same, and every data buffer unit obtains a complete copy of the input data; each data processing module partitions the weights and input data in its corresponding data buffer unit into several portions, wherein the number of weights or input data in each portion equals the number of operations of the vector operation unit in the corresponding single data processing module; each time, one portion of weights and input data is sent into the corresponding data processing module to compute a partial sum, the previously obtained partial sum is then fetched from the data buffer unit, and a vector addition is performed on the partial sums to obtain a new partial sum, which is sent back to the data buffer unit, wherein the initial value of the partial sum is the bias value;
Step 2: after all the input data in each data buffer unit have been sent into the corresponding data processing module once, the resulting partial sum is the net activation corresponding to the neuron, and the corresponding data processing module transforms the net activation of the neuron by the nonlinear function tanh or the sigmoid function to obtain the output value of the neuron;
Step 3: using different weights and biases in this manner, repeat the above steps 1 to 2 to compute the vector values of the forget gate, the input gate, the output gate, and the candidate state unit in the LSTM network operation, respectively; wherein, within one data processing module, the partial sums are computed using vector operation instructions and the data are operated on in parallel;
Step 4: each data processing module determines whether the computation of the current forget gate, input gate, and candidate state unit vector values is complete; if complete, it computes the new state unit, namely: the old state unit and the forget gate vector value are sent to the data processing unit, a partial sum is obtained through the vector dot-product component and sent back to the data buffer unit; the candidate state unit and the input gate value are then sent to the data processing unit, and a partial sum is obtained through the vector dot-product component; the partial sum in the data buffer unit is sent to the data processing module, and the updated state unit is obtained through the vector summation submodule and then sent back to the data cache unit; meanwhile, the updated state unit in the data processing module is transformed through the nonlinear transformation function tanh; each data processing module determines whether the nonlinear transformation of the currently updated data state unit and the output gate have finished computing; if so, the output gate and the vector resulting from the nonlinear transformation of the updated data state unit are computed through the vector dot-product component to obtain the final output value, and the output value is written back to the data buffer unit;
Step 5: after the output values in all data processing modules have been written back to the data buffer units, the output values of the individual data processing modules are concatenated to obtain the final output value, which is sent through the direct memory access unit to the externally designated address.
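Steps 1 and 2 above amount to a chunked accumulation of each neuron's net activation. The following minimal sketch illustrates this under the assumption of an 8-element vector operation width; the width and the function name are illustrative, not values fixed by the disclosure.

```python
import numpy as np

VECTOR_WIDTH = 8  # assumed width of one vector operation, chosen for illustration

def net_activation(weights_row, x_h, bias):
    """Accumulate one neuron's net activation portion by portion (steps 1-2):
    the partial sum starts at the bias value, and each round adds the dot
    product of one VECTOR_WIDTH-wide slice of weights with the matching
    slice of the input vector [h_{t-1}, x_t]."""
    partial_sum = bias                          # initial partial sum is the bias
    for start in range(0, len(x_h), VECTOR_WIDTH):
        w_part = weights_row[start:start + VECTOR_WIDTH]
        x_part = x_h[start:start + VECTOR_WIDTH]
        partial_sum += np.dot(w_part, x_part)   # vector dot product, then vector add
    return partial_sum                          # net activation, ready for tanh/sigmoid
```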
The present invention also discloses an apparatus for executing LSTM neural network operations, comprising:
a memory;
a processor, the processor performing the following operations:
Step 1: read the weights and biases for the LSTM neural network operation from an externally designated address space, partition them into a plurality of parts corresponding to the neurons of the LSTM neural network operation, and store the parts in different spaces of the memory, wherein the number of weights and biases in each space is the same; and read the input data for the LSTM neural network operation from the externally designated address space and store it in every one of said different spaces of the memory;
Step 2: partition the weights and input data in each of said different spaces of the memory into several portions, wherein the number of weights or input data in each portion equals the number of operations of the corresponding vector operation unit; each time, compute a partial sum from one portion of weights and input data, then perform a vector addition with the previously obtained partial sum to obtain a new partial sum, wherein the initial value of the partial sum is the bias value;
Step 3: after all the input data in each of said different spaces of the memory have been processed, the resulting partial sum is the net activation corresponding to the neuron; transform the net activation of the neuron by the nonlinear function tanh or the sigmoid function to obtain the output value of the neuron;
Step 4: using different weights and biases in this manner, repeat the above steps 1 to 3 to compute the vector values of the forget gate, the input gate, the output gate, and the candidate state unit in the LSTM neural network operation, respectively; wherein the partial sums are computed using vector operation instructions, and the input data in each of said different spaces of the memory are computed in parallel;
Step 5: determine whether the computation of the current forget gate, input gate, and candidate state unit vector values in each of said different spaces of the memory is complete; if complete, compute the new state unit, namely: obtain a partial sum from the old state unit and the forget gate vector value through the vector dot-product component, then obtain a partial sum from the candidate state unit and the input gate value through the vector dot-product component, and obtain the updated state unit from the two partial sums through the vector summation submodule; meanwhile, transform the updated state unit through the nonlinear transformation function tanh; determine whether the nonlinear transformation of the currently updated data state unit and the output gate have finished computing; if so, compute the output gate and the vector resulting from the nonlinear transformation of the updated data state unit through the vector dot-product component to obtain the final output value of each of said different spaces of the memory;
Step 6: concatenate the final output values of each of said different spaces of the memory to obtain the final output value.
The present invention also discloses an operation method for an LSTM neural network, comprising the following steps:
Step S1: read the weights and biases for the LSTM neural network operation from an externally designated address space, write them into a plurality of data buffer units arranged in parallel, and initialize the state units of each data cache unit; wherein the weights and biases read from the externally designated address space are partitioned in correspondence with the neurons of the LSTM neural network operation and delivered to the respective data buffer units, and the numbers of weights and biases in the respective data buffer units are the same;
Step S2: read the input data from the externally designated address space and write it into the plurality of data buffer units, wherein the input data written into every one of said data buffer units is complete;
Step S3: the plurality of data processing modules, in one-to-one correspondence with the plurality of data buffer units arranged in parallel, respectively read the weights, biases, and input data from the corresponding data buffer units, and perform the LSTM neural network operation on them using the vector dot-product component, the vector addition component, the vector summation component, and the vector nonlinear function conversion component, obtaining the output value of each data processing module;
Step S4: concatenate the output values of the individual data processing modules to obtain the final output value, i.e., the final result of the LSTM neural network operation.
Preferably, in step S3, each data processing module partitions the weights and input data in the corresponding data buffer unit into several portions, wherein the number of weights or input data in each portion equals the number of operations of the vector operation unit in the corresponding single data processing module; each time, each data buffer unit sends one portion of weights and input data into its corresponding data processing module to compute a partial sum, the previously obtained partial sum is then fetched from the data buffer unit, and a vector addition is performed on the partial sums to obtain a new partial sum, which is sent back to the data buffer unit, wherein the initial value of the partial sum is the bias value;
after all the input data have been sent into the data processing module once, the resulting partial sum is the net activation corresponding to the neuron; the net activation of the neuron is then sent into the data processing module and transformed by the nonlinear function tanh or the sigmoid function in the data operation submodule to obtain the output value of the neuron; using different weights and biases in this manner, the vector values of the forget gate, the input gate, the output gate, and the candidate state unit in the LSTM neural network are computed respectively;
each data processing module determines whether the computation of the current forget gate, input gate, and candidate state unit vector values is complete; if complete, it computes the new state unit, namely: the old state unit and the forget gate vector value are sent to the data processing unit, a partial sum is obtained through the vector dot-product component and sent back to the data buffer unit; the candidate state unit and the input gate value are then sent to the data processing unit, and a partial sum is obtained through the vector dot-product component; the partial sum in the data buffer unit is sent to the data processing module, and the updated state unit is obtained through the vector summation submodule and then sent back to the data cache unit; meanwhile, the updated state unit in the data processing module is transformed through the nonlinear transformation function tanh; each data processing module determines whether the nonlinear transformation of the currently updated data state unit and the output gate have finished computing; if so, the output gate and the vector resulting from the nonlinear transformation of the updated data state unit are computed through the vector dot-product component to obtain the final output value, and the output value is written back to the data buffer unit.
Preferably, the nonlinear function tanh or the sigmoid function performs the function computation by a table lookup method.
Other aspects, advantages, and salient features of the present invention will become apparent to those skilled in the art from the following detailed description of exemplary embodiments of the invention, taken in conjunction with the accompanying drawings.
As an embodiment of the present invention, the present invention discloses an apparatus and method for LSTM network operations, which can be used to accelerate applications that use LSTM networks. The method specifically comprises the following steps:
(1) The weights and biases used in the LSTM network operation are fetched from the externally designated address space through the direct memory access unit and written into the individual data buffer units, where the weights and biases fetched from the externally designated address space are partitioned before being delivered to the individual data buffer units; the numbers of weights and biases in the individual data buffer units are the same, the weights and biases in each data buffer unit correspond to its neurons, and the state units in the data cache units are initialized.
(2) The input data are fetched from the externally designated address space through the direct memory access unit and written into the data buffer units, where every data buffer unit obtains a complete copy of the input data.
(3) The weights and input data in each data buffer unit are partitioned into several portions, where the number of weights or input data in each portion equals the number of operations of the vector operation unit in the corresponding single data processing module. Each time, one portion of weights and input data is sent into the data processing module to compute a partial sum; the previously obtained partial sum is then fetched from the data buffer unit, a vector addition is performed on the partial sums to obtain a new partial sum, and the result is sent back to the data buffer unit, the initial value of the partial sum being the bias value. After all the input data have been sent into the data processing module once, the resulting partial sum is the net activation corresponding to the neuron; the net activation of the neuron is then sent into the data processing module and transformed by the nonlinear function tanh or the sigmoid function in the data operation submodule to obtain the output value of the neuron; the function transformation here can be carried out in two ways, by table lookup or by function computation. Using different weights and biases in this manner, the vector values of the forget gate, the input gate, the output gate, and the candidate state unit in the LSTM network can be computed respectively. Within one data processing module, the partial sums are computed using vector operation instructions, and there is parallelism among the data. Then the data dependency discrimination submodule in each data processing module determines whether the computation of the current forget gate, input gate, and candidate state unit vector values is complete; if complete, the new state unit is computed. First, the old state unit and the forget gate vector value are sent to the data processing unit, a partial sum is obtained through the vector dot-product component in the data operation submodule and sent back to the data buffer unit; then the candidate state unit and the input gate value are sent to the data processing unit, a partial sum is obtained through the vector dot-product component in the data operation submodule, the partial sum in the data buffer unit is sent to the data processing module, and the updated state unit is obtained through the vector summation submodule in the data operation submodule and then sent back to the data cache unit; meanwhile, the updated state unit in the data processing module is transformed through the nonlinear transformation function tanh in the data operation submodule. The data dependency discrimination submodule in each data processing module determines whether the nonlinear transformation of the currently updated data state unit and the output gate have finished computing; if so, the output gate and the vector resulting from the nonlinear transformation of the updated data state unit are computed through the vector dot-product component in the data operation submodule to obtain the final output value, which is written back to the data buffer unit. Throughout the entire operation, no data dependency or data conflict exists between different data processing modules, so they can always process in parallel.
(4) After the output values in all data processing modules have been written back to the data buffer units, the output values of the individual data processing modules are concatenated to obtain the final output value, which is sent through the direct memory access unit to the externally designated address.
(5) Determine whether the LSTM network needs to produce an output for the next time step; if so, go to (2); otherwise, end the run.
Fig. 1 is a schematic diagram of the overall structure of an apparatus for executing LSTM network operations according to an embodiment of the present invention. As shown in Fig. 1, the apparatus comprises a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, data cache units 4, and data processing modules 5, all of which can be implemented by hardware circuits.
The direct memory access unit 1 can access the external address space and read from and write to every cache unit inside the apparatus to complete the loading and storing of data. Specifically, it reads instructions into the instruction cache unit 2, reads the weights, biases, and input data required for the LSTM network operation from the designated storage location into the data cache units 4, and writes the post-operation output from the data cache units 4 directly to the externally designated space.
The instruction cache unit 2 reads instructions through the direct memory access unit 1 and caches the instructions read in.
The controller unit 3 reads instructions from the instruction cache unit 2, decodes them into microinstructions that control the behavior of the other modules, and sends them to the other modules such as the direct memory access unit 1, the data cache units 4, and the data processing modules 5.
The data cache units 4 initialize the state units of the LSTM at apparatus initialization and read in the weights and biases from the externally designated address through the direct memory access unit 1; the weights and biases read into each data cache unit 4 correspond to the neurons to be computed, i.e., the weights and biases read into each data cache unit 4 are a portion of the total weights and biases, and the weights and biases in all data cache units 4, when merged, equal the weights and biases read in from the externally designated address. In the specific operation, the input data are first obtained from the direct memory access unit 1, each data cache unit 4 receiving a copy of the input data, and the partial sum is initialized to the bias value; a portion of the weights, biases, and input values is then sent to the data processing module 5, where an intermediate value is computed; the intermediate value is then read out from the data processing module 5 and saved in the data cache unit 4. After all the inputs have been operated on once, the partial sum is fed into the data processing module 5 to compute the neuron output, which is then written back to the data buffer unit 4, finally yielding the vector values of the input gate, the output gate, the forget gate, and the candidate state unit. Then the forget gate and the old state unit are sent into the data processing module 5 to compute a partial sum, which is written back to the data buffer unit 4; the candidate state unit and the input gate are sent into the data processing module 5 to compute a partial sum; the partial sum in the data cache unit 4 is written into the data processing module 5 and vector-added to the previously computed partial sum to obtain the updated state unit, which is written back to the data cache unit 4. The output gate is sent into the data processing module 5 and dot-multiplied with the value of the updated state unit transformed by the nonlinear transformation function tanh to obtain the output value, which is written back to the data cache unit 4. Finally, each data cache unit 4 obtains the corresponding updated state unit and output value, and the output values in all data cache units 4, merged together, constitute the final output value. Finally, each data cache unit 4 writes its portion of the output value back to the externally designated address space through the direct memory access unit 1.
The corresponding operations in the LSTM network are as follows:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f);
i_t = σ(W_i · [h_{t-1}, x_t] + b_i);
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c);
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t;
o_t = σ(W_o · [h_{t-1}, x_t] + b_o);
h_t = o_t ⊙ tanh(c_t);
where x_t is the input data at time t, and h_{t-1} denotes the output data at time t-1; W_f, W_i, W_c, and W_o denote the weight vectors corresponding to the forget gate, the input gate, the state-unit update, and the output gate, respectively, and b_f, b_i, b_c, and b_o denote the corresponding biases; f_t denotes the output of the forget gate, which is dot-multiplied with the state unit at time t-1 to selectively forget past state-unit values; i_t denotes the output of the input gate, which is dot-multiplied with the candidate state value obtained at time t to selectively add the candidate state value of time t into the state unit; c̃_t denotes the candidate state value computed at time t; c_t denotes the new state value obtained by selectively forgetting the state value of time t-1 and selectively adding the state value of time t, which will be used when the final output is computed and will be passed on to the next time step; o_t denotes the condition that selects which part of the state unit at time t is to be output as the result; h_t denotes the output at time t, which is also passed on to the next time step; ⊙ is the element-wise product of vectors; σ is the sigmoid function, computed as:
σ(x) = 1 / (1 + e^(-x));
and the computation formula of the activation function tanh is:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)).
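For reference, the six formulas above correspond to the following minimal NumPy sketch of a single LSTM time step; the dictionary layout of the weights and the function names are assumptions made for illustration, not part of the disclosed hardware.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the formulas above. W and b are dicts
    keyed by 'f', 'i', 'c', 'o'; each W[k] has shape (hidden, hidden + input)
    so it multiplies the concatenation [h_{t-1}, x_t]."""
    hx = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W['f'] @ hx + b['f'])        # forget gate
    i_t = sigmoid(W['i'] @ hx + b['i'])        # input gate
    c_tilde = np.tanh(W['c'] @ hx + b['c'])    # candidate state
    c_t = f_t * c_prev + i_t * c_tilde         # element-wise ⊙ updates the state
    o_t = sigmoid(W['o'] @ hx + b['o'])        # output gate
    h_t = o_t * np.tanh(c_t)                   # output, passed on to time t+1
    return h_t, c_t
```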
Each time, the data processing module 5 reads a portion of the weights W_i/W_f/W_o/W_c and the biases b_i/b_f/b_o/b_c as well as the corresponding input data [h_{t-1}, x_t] from the corresponding data cache unit 4, and completes the partial-sum computation through the vector multiplication component and the summation component in the data processing module 5; once all the input data for each neuron have been operated on once, the net activation net_i/net_f/net_o/net_c of the neuron is obtained, and the output value is then computed through the transformation by the vector nonlinear function sigmoid or tanh. In this manner the input gate i_i, the forget gate f_i, the output gate o_i, and the candidate state unit c̃_i are each computed. Then the dot products of the old state unit with the forget gate, and of the candidate state unit with the input gate, are computed through the vector dot-product component in the data processing module 5, and these two results are operated on through the vector addition component to obtain the new state unit c_t. The newly obtained state unit is written back to the data cache unit 4. The state unit in the data processing module 5 is converted by the tanh function using the vector nonlinear function conversion component, yielding tanh(c_t); this computation can be completed in either of two ways, by computing the value of the tanh function or by table lookup. Then the output gate and the vector resulting from the tanh nonlinear transformation of the state unit are operated on through the vector dot-product component to obtain the final neuron output value h_t. Finally, the neuron output value h_t is written back to the data cache unit 4.
Fig. 2 is a schematic diagram of the data processing module of an apparatus for executing LSTM network operations according to an embodiment of the present invention.
As shown in Fig. 2, the data processing unit 5 comprises a data processing control submodule 51, a data dependency discrimination submodule 52, and a data operation submodule 53.
Here, the data processing control submodule 51 controls the operations performed by the data operation submodule 53 and controls the data dependency discrimination submodule 52 to determine whether the current operation has a data dependency. For some operations, the data processing control submodule 51 directly controls the operations performed by the data operation submodule 53; for operations that may involve a data dependency, the data processing control submodule 51 first controls the data dependency discrimination submodule 52 to determine whether the current operation has a data dependency; if a data dependency exists, the data processing control submodule 51 inserts null operations into the data operation submodule 53 and, after the data dependency is resolved, controls the data operation submodule 53 to perform the data operation.
The data dependency discrimination submodule 52, under the control of the data processing control submodule 51, checks whether a data dependency exists in the data operation submodule 53. If the next operation needs a value whose computation has not yet completed, a data dependency currently exists; otherwise, no data dependency exists. One method of data dependency detection uses registers R1, R2, R3, R4, and R5 in the data operation submodule 53, which mark respectively whether the input gate, the forget gate, the output gate, the candidate state unit, and the tanh function conversion of the updated state unit have finished computing; a non-zero register value indicates the operation is complete, and zero indicates it is not yet complete. Corresponding to the LSTM network, the data dependency discrimination submodule 52 checks two data dependencies: when computing the new state unit, it checks whether a data dependency exists among the input gate, the forget gate, and the candidate state unit; and when computing the output value, it checks whether a data dependency exists between the output gate and the tanh function conversion of the updated state unit; that is, it suffices to check whether R1, R2, and R4 are all non-zero, and whether R3 and R5 are both non-zero, respectively. After the check is complete, the result is passed back to the data processing control submodule 51.
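The two dependency checks can be pictured with the following sketch, in which plain booleans stand in for the non-zero and zero values of registers R1 to R5; the class and method names are illustrative only.

```python
class DependencyFlags:
    """Stand-ins for registers R1-R5: True once the input gate, forget gate,
    output gate, candidate state unit, and tanh of the updated state unit
    have finished computing (non-zero register), False otherwise (zero)."""
    def __init__(self):
        self.r1 = self.r2 = self.r3 = self.r4 = self.r5 = False

    def state_update_ready(self):
        # The new state unit needs the input gate (R1), forget gate (R2),
        # and candidate state unit (R4)
        return self.r1 and self.r2 and self.r4

    def output_ready(self):
        # The final output needs the output gate (R3) and the tanh of the
        # updated state unit (R5)
        return self.r3 and self.r5
```

While a check fails, the data processing control submodule would keep issuing null operations, exactly as described above.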
The data operation submodule 53, under the control of the data processing control submodule 51, completes the data processing in the network operation procedure. The data operation submodule 53 contains a vector dot-product component, a vector addition component, a vector summation component, and a vector nonlinear transformation component, as well as the registers R1, R2, R3, R4, and R5 that flag whether the related data operations are complete. Registers R1, R2, R3, R4, and R5 mark respectively whether the input gate, the forget gate, the output gate, the candidate state unit, and the tanh function conversion of the updated state unit have finished computing; a non-zero register value indicates the operation is complete, and zero indicates it is not yet complete. The vector addition component adds two vectors position by position to obtain one vector, whereas the vector summation component splits a vector into several segments and sums within each segment, the length of the resulting vector being equal to the number of segments. The vector nonlinear transformation component takes each element of a vector as input and produces the output after the nonlinear function transformation. The specific nonlinear transformation can be completed in two ways. Taking the sigmoid function with input x as an example, one way is function computation, directly computing sigmoid(x); the other way is a table lookup method: the data operation submodule 53 maintains a table of the sigmoid function that records the outputs y1, y2, ..., yn corresponding to the inputs x1, x2, ..., xn (x1 < x2 < ... < xn); to compute the function value corresponding to x, the interval [xi, xi+1] satisfying xi < x < xi+1 is first located, and
y_i + ((x - x_i) / (x_{i+1} - x_i)) × (y_{i+1} - y_i)
is computed as the output value. During the LSTM network operation procedure, the following operations are performed:
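A minimal sketch of the table-lookup path with linear interpolation follows; the table range and resolution are assumptions chosen only for illustration, not values fixed by the disclosure.

```python
import numpy as np

# Assumed table: 256 samples of sigmoid over [-8, 8]; range and resolution
# are illustrative choices, not part of the disclosure
XS = np.linspace(-8.0, 8.0, 256)
YS = 1.0 / (1.0 + np.exp(-XS))

def sigmoid_lookup(x):
    """Table-lookup sigmoid: locate the interval [x_i, x_{i+1}] containing x,
    then linearly interpolate y_i + (x - x_i)/(x_{i+1} - x_i) * (y_{i+1} - y_i)."""
    if x <= XS[0]:
        return YS[0]
    if x >= XS[-1]:
        return YS[-1]
    i = np.searchsorted(XS, x) - 1            # index with XS[i] <= x < XS[i+1]
    t = (x - XS[i]) / (XS[i + 1] - XS[i])
    return YS[i] + t * (YS[i + 1] - YS[i])
```

The same table shape would serve for tanh; a denser table trades memory for accuracy.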
First, R1, R2, R3, R4, and R5 are set to 0. The input gate partial sum is initialized with the bias; a temporary value is computed from a portion of the input data and the weights corresponding to this input data through the vector dot-product component; then, according to the temporary-value vectors corresponding to the different neurons, the temporary values are segmented and summed with the vector summation component, and the computed result is combined with the input gate partial sum to complete the update of the partial sum. Another portion of input data and weights is taken and the same operation is performed to update the partial sum; after all the input data have been operated on once, the resulting partial sum is the net activation of the neuron, and the output value of the input gate is then computed through the vector nonlinear transformation component. The output value is written back to the data cache unit 4, and the R1 register is set to non-zero.
The output values of the forget gate, the output gate, and the candidate state unit are computed by the same method as that used for the input gate output; the corresponding output values are written back to the data cache unit 4, and the R2, R3, and R4 registers are set to non-zero.
A null operation or the computation of the updated state unit is performed according to the control command of the data processing control submodule 51. The computation of the updated state unit is as follows: the forget gate output value and the old state unit are fetched from the data cache unit 4, and a partial sum is computed through the vector dot-product component; then the input gate output value and the candidate state unit are fetched from the data cache unit 4, and a partial sum is computed through the vector dot-product component; this partial sum and the previous state-unit partial sum are combined through the vector addition component to obtain the updated state unit. Finally, the resulting state unit is written back to the data cache unit 4.
A null operation or the computation of the LSTM network output value is performed according to the control command of the data processing control submodule 51. The computation of the output value is as follows: the nonlinear transformation value of the updated state unit is computed with the vector nonlinear function transformation component, and R5 is then set to non-zero; then the dot product of the output gate and the nonlinear transformation value of the state unit is computed with the vector dot-product component, and the resulting final output value is the output value of the neurons of the LSTM network corresponding to this module. The output value is written back to the data cache unit 4.
The apparatus of the present invention works with a specially designed instruction set, so instruction decoding is highly efficient. The multiple data processing modules compute in parallel, and the multiple data cache units run in parallel without requiring data transfer between them, which greatly increases the parallelism of the operation. In addition, placing the weights and biases in the data buffer units reduces the IO operations between the apparatus and the external address space and lowers the bandwidth required for memory access.
Fig. 3 is a flowchart of executing LSTM network operations according to an embodiment of the present invention.
In step S1, an IO instruction is pre-stored at the head address of the instruction cache unit 2.
In step S2, the controller unit 3 reads that IO instruction from the head address of the instruction cache unit 2, and according to the decoded microinstructions, the direct memory access unit 1 reads from the external address space all the instructions related to the LSTM network computation and caches them in the instruction cache unit 2.
In step S3, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the decoded microinstructions, the direct memory access unit 1 reads from the externally designated address space the weights and biases related to the LSTM network operation, including the weights and biases of the input gate, the output gate, the forget gate, and the candidate state unit; the weights and biases are partitioned according to the different neurons they correspond to and read into the different data cache modules 4.
In step S4, the controller unit 3 reads a state-unit initialization instruction from the instruction cache unit 2, and according to the decoded microinstructions, initializes the state unit values in the data cache modules 4, setting the input gate partial sum, the output gate partial sum, the forget gate partial sum, and the candidate state unit partial sum to the corresponding neuron bias values.
In step S5, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the decoded microinstructions, the direct memory access unit 1 reads the input values from the externally designated address space into the data cache units 4, each data cache unit 4 receiving the same vector of input values.
In step S6, the controller unit 3 reads a data processing instruction from the instruction cache unit 2, and according to the decoded microinstructions, the data processing modules 5 obtain the relevant data required for the operation from the corresponding data cache units 4 and perform the operation; the result of the operation is the output values of part of the neurons for one time point, and the output values processed by all the data processing modules 5, when merged, correspond to the output value for one time point; the detailed processing procedure is shown in Fig. 4. After processing, the data processing modules 5 store the resulting intermediate values or output values and the state unit values into the data cache units 4.
In step S7, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the decoded microinstructions, the output values in the data cache units 4 are concatenated and output to the externally designated address through the direct memory access unit 1.
In step S8, the controller unit 3 reads a discrimination instruction from the instruction cache unit 2, and according to the decoded microinstructions, the controller unit 3 decides whether the current forward pass is complete; if complete, the run ends; if not, it proceeds to S6 to continue running.
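The instruction sequence of steps S1 to S8 can be summarized in the following sketch of the controller loop; the instruction mnemonics and the controller interface are illustrative stand-ins, since the disclosure does not name its instruction mnemonics.

```python
# Illustrative controller loop for Fig. 3; all mnemonics are assumed, not the
# patent's actual instruction set.
PROGRAM = [
    "IO_FETCH_INSTRUCTIONS",   # S2: DMA loads all LSTM instructions
    "IO_LOAD_WEIGHTS_BIASES",  # S3: weights/biases split across cache units
    "INIT_STATE_UNITS",        # S4: partial sums initialized with the biases
    "IO_LOAD_INPUTS",          # S5: every cache unit receives the input vector
    "DATA_PROCESS",            # S6: modules compute one time step (Fig. 4)
    "IO_STORE_OUTPUTS",        # S7: concatenated outputs written externally
    "JUDGE_NEXT_TIMESTEP",     # S8: loop back to S6 if time steps remain
]

def run(controller):
    pc = 0
    while pc < len(PROGRAM):
        micro_ops = controller.decode(PROGRAM[pc])   # hypothetical decode step
        controller.dispatch(micro_ops)               # hypothetical dispatch step
        if PROGRAM[pc] == "JUDGE_NEXT_TIMESTEP" and controller.more_timesteps():
            pc = PROGRAM.index("DATA_PROCESS")       # continue from S6
        else:
            pc += 1
```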
Fig. 4 is a detailed flowchart of the data processing procedure in a method for executing LSTM network operations according to an embodiment of the present invention.
In step S1, the data processing module 5 reads in a portion of the input gate weights and input values from the data buffer unit 4.
In step S2, the data processing control submodule 51 in the data processing module 5 controls the vector dot-product component in the data operation submodule 53 to compute the dot product of the input gate weights and the input values; the results are then grouped according to the neurons they belong to, and the dot-product results within each group are computed through the vector summation component in the data operation submodule 53 to obtain a partial sum.
In step S3, the data processing module 5 reads in the input gate partial sum from the data buffer unit 4.
In step S4, the data processing control submodule 51 in the data processing module 5 controls the data operation submodule 53 to add the computed partial sum to the partial sum just read in, obtaining the updated input gate partial sum.
In step S5, the data processing module 5 writes the updated partial sum into the data cache module 4.
In step S6, the data processing module 5 determines whether all the input gate weights have been operated on once; if so, the partial sum in the data cache unit is the value of the input gate, and the R1 register is set to non-zero; otherwise, it switches to a different portion of the input gate weights and input values and goes to S1 to continue running.
In step S7, the forget gate output value, the output gate output value, and the candidate state unit output value are obtained using the same operation method as for the input gate, R2, R3, and R4 are set to non-zero, and the output values are all written back to the data cache unit 4.
In step S8, the data processing control submodule 51 in the data processing module 5 controls the data dependency discrimination submodule 52 to determine whether the forget gate, the input gate, and the candidate state unit have finished computing, i.e., whether R1, R2, and R4 are all non-zero; if not, the data processing control submodule 51 controls the data operation submodule 53 to perform a null operation and then goes to S8 to continue; if so, it goes to S9.
In step S9, the data processing module 5 reads the old state unit and the forget gate output value from the data cache unit 4.
In step S10, the data processing control submodule 51 in the data processing module 5 controls the data operation submodule 53 to compute a partial sum from the old state unit and the forget gate output value using the vector dot-product component.
In step S11, the data processing module 5 reads the candidate state unit and the input gate output value from the data cache unit 4.
In step S12, the data processing control submodule 51 in the data processing module 5 controls the data operation submodule 53 to compute a partial sum from the candidate state unit and the input gate output value using the vector dot-product component, and to compute the updated state unit from this partial sum and the partial sum computed in S10 through the vector addition component.
In step S13, the data processing module 5 sends the updated state unit back to the data cache unit 4.
In step S14, the data processing control submodule 51 in the data processing module 5 controls the data operation submodule 53 to compute the tanh nonlinear transformation value of the updated state unit using the vector nonlinear transformation component, and sets R5 to non-zero.
In step S15, the data processing control submodule 51 in the data processing module 5 controls the data dependency discrimination submodule 52 to determine whether the computation of the output gate output value and of the tanh nonlinear transformation value of the state unit is complete, i.e., whether R3 and R5 are both non-zero; if not, the data processing control submodule 51 controls the data operation submodule 53 to perform a null operation and then goes to S15 to continue; if so, it goes to S16.
In step S16, the data processing module 5 reads in the output of the output gate from the data cache unit 4.
In step S17, the data processing control submodule 51 in the data processing module 5 controls the data operation submodule 53 to compute the output value from the output gate output value and the tanh nonlinear transformation value of the state unit through the vector dot-product component; this is the output value of the neurons corresponding to the data processing module 5 in the LSTM network.
In step S18, the data processing module 5 writes the output value into the data cache unit 4.
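Steps S1 to S6 of this flow can be condensed into the following sketch, in which one round multiplies a slice of input gate weights element-wise with the matching inputs, sums the products per neuron as the vector summation component does, and folds the result into the running partial sums; the slicing scheme and all names are assumptions for illustration.

```python
import numpy as np

def input_gate_round(w_slice, x_slice, partial_sums, neuron_ids):
    """One S1-S6 round: element-wise products of a weight slice with the
    matching input slice (vector dot-product component), per-neuron segment
    summation (vector summation component), then partial-sum update.
    neuron_ids[k] names the neuron that w_slice[k] * x_slice[k] belongs to;
    partial_sums is a dict pre-initialized with each neuron's bias."""
    products = w_slice * x_slice                   # S2: element-wise multiply
    for nid in np.unique(neuron_ids):              # S2: group products by neuron
        partial_sums[int(nid)] += products[neuron_ids == nid].sum()  # S4: add
    return partial_sums                            # S5: written back to the buffer
```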
The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software embodied on a non-transitory computer-readable medium), or a combination of the two. Although the processes or methods are described above in terms of certain sequential operations, it should be understood that some of the operations described may be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.
Compared with existing implementations, the apparatus and method of the present invention for executing neural network operations have the following beneficial effects:
1. The apparatus runs on dedicated instructions; compared with existing implementations, the number of instructions required for the operation is greatly reduced, so that the decoding overhead incurred when performing LSTM network operations is lowered;
2. Exploiting the fact that the weights and biases of the hidden layer are reused during the LSTM network operation, the weight and bias values are temporarily stored in the data cache units, so the amount of IO between the apparatus and the outside is reduced and the overhead of data transfer is lowered;
3. The present invention does not limit the application field of the specific LSTM network; it can be used in fields such as speech recognition, text translation, and music synthesis, and has strong extensibility;
4. The multiple data processing modules in the apparatus are fully parallel with one another, and the interior of each data processing module is partially parallel, which can fully exploit the parallelism of the LSTM network and significantly increases the operation speed of the LSTM network;
5. Preferably, the vector nonlinear function conversion component can be implemented specifically by a table lookup method, which improves efficiency substantially compared with conventional function computation.
The specific embodiments described above further elaborate the objects, technical solutions, and beneficial effects of the present invention. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit the present invention; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

  1. An apparatus for executing LSTM neural network operations, characterized by comprising:
    a plurality of data cache units arranged in parallel, used to cache the data, states, and results required for the operation;
    a plurality of data processing modules arranged in parallel, used to obtain, from the corresponding data cache units, the input data and the weights and biases required for the operation, and to perform the LSTM neural network operation; wherein the plurality of data processing modules correspond one-to-one with the data cache units, and the plurality of data processing modules perform operations in parallel with one another.
  2. The apparatus according to claim 1, characterized in that the data cache units also cache the intermediate results computed by the data processing modules, and the weights and biases are imported from the direct memory access unit only once during the entire execution and are not changed thereafter.
  3. The apparatus according to claim 1, characterized in that each of the data buffer units is written with the weights and biases partitioned in correspondence with the neurons of the LSTM neural network operation, wherein the numbers of weights and biases in the respective data buffer units are all the same, and every data buffer unit obtains a complete copy of the input data.
  4. The apparatus according to claim 1, characterized in that the data processing modules perform the LSTM neural network operation using a vector dot-product component, a vector addition component, a vector summation component, and a vector nonlinear function conversion component.
  5. The apparatus according to claim 4, characterized in that the vector nonlinear function conversion component performs the function computation by a table lookup method.
  6. The apparatus according to claim 1, characterized in that each data processing module performs the vector operation by separately computing the vector values of the forget gate, the input gate, the output gate, and the candidate state unit in the LSTM network operation; the output value of each data processing module is then obtained from these vector values, and finally the output values of the data processing modules are concatenated to obtain the final output value.
  7. An apparatus for executing LSTM neural network operations, characterized by comprising:
    a memory;
    a processor, the processor performing the following operations:
    Step 1: read the weights and biases for the LSTM neural network operation from an externally designated address space, partition them into a plurality of parts corresponding to the neurons of the LSTM neural network operation, and store the parts in different spaces of the memory, wherein the number of weights and biases in each space is the same; and read the input data for the LSTM neural network operation from the externally designated address space and store it in every one of said different spaces of the memory;
    Step 2: partition the weights and input data in each of said different spaces of the memory into several portions, wherein the number of weights or input data in each portion equals the number of operations of the corresponding vector operation unit; each time, compute a partial sum from one portion of weights and input data, then perform a vector addition with the previously obtained partial sum to obtain a new partial sum, wherein the initial value of the partial sum is the bias value;
    Step 3: after all the input data in each of said different spaces of the memory have been processed, the resulting partial sum is the net activation corresponding to the neuron; transform the net activation of the neuron by the nonlinear function tanh or the sigmoid function to obtain the output value of the neuron;
    Step 4: using different weights and biases in this manner, repeat the above steps 1 to 3 to compute the vector values of the forget gate, the input gate, the output gate, and the candidate state unit in the LSTM neural network operation, respectively; wherein the partial sums are computed using vector operation instructions, and the input data in each of said different spaces of the memory are computed in parallel;
    Step 5: determine whether the computation of the current forget gate, input gate, and candidate state unit vector values in each of said different spaces of the memory is complete; if complete, compute the new state unit, namely: obtain a partial sum from the old state unit and the forget gate vector value through the vector dot-product component, then obtain a partial sum from the candidate state unit and the input gate value through the vector dot-product component, and obtain the updated state unit from the two partial sums through the vector summation submodule; meanwhile, transform the updated state unit through the nonlinear transformation function tanh; determine whether the nonlinear transformation of the currently updated data state unit and the output gate have finished computing; if so, compute the output gate and the vector resulting from the nonlinear transformation of the updated data state unit through the vector dot-product component to obtain the final output value of each of said different spaces of the memory;
    Step 6: concatenate the final output values of each of said different spaces of the memory to obtain the final output value.
  8. An operation method for an LSTM neural network, characterized by comprising the following steps:
    Step S1: read the weights and biases for the LSTM neural network operation from an externally designated address space, write them into a plurality of data buffer units arranged in parallel, and initialize the state units of each data cache unit; wherein the weights and biases read from the externally designated address space are partitioned in correspondence with the neurons of the LSTM neural network operation and delivered to the respective data buffer units, and the numbers of weights and biases in the respective data buffer units are the same;
    Step S2: read the input data from the externally designated address space and write it into the plurality of data buffer units, wherein the input data written into every one of said data buffer units is complete;
    Step S3: the plurality of data processing modules, in one-to-one correspondence with the plurality of data buffer units arranged in parallel, respectively read the weights, biases, and input data from the corresponding data buffer units, and perform the LSTM neural network operation on them using the vector dot-product component, the vector addition component, the vector summation component, and the vector nonlinear function conversion component, obtaining the output value of each data processing module;
    Step S4: concatenate the output values of the individual data processing modules to obtain the final output value, i.e., the final result of the LSTM neural network operation.
  9. The method according to claim 8, characterized in that in step S3, each data processing module partitions the weights and input data in the corresponding data buffer unit into several portions, wherein the number of weights or input data in each portion equals the number of operations of the vector operation unit in the corresponding single data processing module; each time, each data buffer unit sends one portion of weights and input data into its corresponding data processing module to compute a partial sum, the previously obtained partial sum is then fetched from the data buffer unit, and a vector addition is performed on the partial sums to obtain a new partial sum, which is sent back to the data buffer unit, wherein the initial value of the partial sum is the bias value;
    after all the input data have been sent into the data processing module once, the resulting partial sum is the net activation corresponding to the neuron; the net activation of the neuron is then sent into the data processing module and transformed by the nonlinear function tanh or the sigmoid function in the data operation submodule to obtain the output value of the neuron; using different weights and biases in this manner, the vector values of the forget gate, the input gate, the output gate, and the candidate state unit in the LSTM neural network are computed respectively;
    each data processing module determines whether the computation of the current forget gate, input gate, and candidate state unit vector values is complete; if complete, it computes the new state unit, namely: the old state unit and the forget gate vector value are sent to the data processing unit, a partial sum is obtained through the vector dot-product component and sent back to the data buffer unit; the candidate state unit and the input gate value are then sent to the data processing unit, and a partial sum is obtained through the vector dot-product component; the partial sum in the data buffer unit is sent to the data processing module, and the updated state unit is obtained through the vector summation submodule and then sent back to the data cache unit; meanwhile, the updated state unit in the data processing module is transformed through the nonlinear transformation function tanh; each data processing module determines whether the nonlinear transformation of the currently updated data state unit and the output gate have finished computing; if so, the output gate and the vector resulting from the nonlinear transformation of the updated data state unit are computed through the vector dot-product component to obtain the final output value, which is written back to the data buffer unit.
  10. The method according to claim 9, characterized in that the nonlinear function tanh or the sigmoid function performs the function computation by a table lookup method.
PCT/CN2016/113493 2016-12-30 2016-12-30 Apparatus for executing LSTM neural network operation, and operation method WO2018120016A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP16925613.8A EP3564863B1 (en) 2016-12-30 2016-12-30 Apparatus for executing lstm neural network operation, and operational method
PCT/CN2016/113493 WO2018120016A1 (zh) 2016-12-30 2016-12-30 Apparatus for executing LSTM neural network operation, and operation method
US16/459,549 US10853722B2 (en) 2016-12-30 2019-07-01 Apparatus for executing LSTM neural network operation, and operational method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/113493 WO2018120016A1 (zh) 2016-12-30 2016-12-30 Apparatus for executing LSTM neural network operation, and operation method

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/459,549 Continuation-In-Part US10853722B2 (en) 2016-12-30 2019-07-01 Apparatus for executing LSTM neural network operation, and operational method

Publications (1)

Publication Number Publication Date
WO2018120016A1 true WO2018120016A1 (zh) 2018-07-05

Family

ID=62706561

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/113493 WO2018120016A1 (zh) 2016-12-30 2016-12-30 Apparatus for executing LSTM neural network operation, and operation method

Country Status (2)

Country Link
EP (1) EP3564863B1 (zh)
WO (1) WO2018120016A1 (zh)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670581A (zh) * 2018-12-21 2019-04-23 北京中科寒武纪科技有限公司 Computing device and board card
CN109685252A (zh) * 2018-11-30 2019-04-26 西安工程大学 Building energy consumption prediction method based on recurrent neural network and multi-task learning model
CN110084367A (zh) * 2019-04-19 2019-08-02 安徽农业大学 Soil moisture prediction method based on an LSTM deep learning model
CN110163337A (zh) * 2018-11-12 2019-08-23 腾讯科技(深圳)有限公司 Neural-network-based data processing method, apparatus, device, and storage medium
CN110687584A (zh) * 2018-07-06 2020-01-14 中国人民解放军陆军防化学院 Rapid nuclide identification method based on LSTM
CN111767999A (zh) * 2019-04-02 2020-10-13 上海寒武纪信息科技有限公司 Data processing method and apparatus, and related products
CN112488395A (zh) * 2020-12-01 2021-03-12 湖南大学 Distribution network line loss prediction method and system
CN112793797A (zh) * 2021-02-03 2021-05-14 东航技术应用研发中心有限公司 Aircraft landing bounce early-warning method and system
US11042797B2 (en) 2019-01-08 2021-06-22 SimpleMachines Inc. Accelerating parallel processing of data in a recurrent neural network
CN113488052A (zh) * 2021-07-22 2021-10-08 深圳鑫思威科技有限公司 Mutual control method for wireless voice transmission and AI voice recognition

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036451A (zh) * 2014-06-20 2014-09-10 深圳市腾讯计算机系统有限公司 Model parallel processing method and apparatus based on multiple graphics processors
CN104657111A (zh) * 2013-11-20 2015-05-27 方正信息产业控股有限公司 Parallel computing method and apparatus
US20150356075A1 (en) * 2014-06-06 2015-12-10 Google Inc. Generating representations of input sequences using neural networks
CN105184366A (zh) * 2015-09-15 2015-12-23 中国科学院计算技术研究所 Time-division-multiplexed general-purpose neural network processor
US20160034812A1 (en) * 2014-07-31 2016-02-04 Qualcomm Incorporated Long short-term memory using a spiking neural network
CN105512723A (zh) * 2016-01-20 2016-04-20 南京艾溪信息科技有限公司 Artificial neural network computing apparatus and method for sparse connections
US20160196488A1 (en) * 2013-08-02 2016-07-07 Byungik Ahn Neural network computing device, system and method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3547482B2 (ja) * 1994-04-15 2004-07-28 株式会社日立製作所 Information processing apparatus
US10140572B2 (en) * 2015-06-25 2018-11-27 Microsoft Technology Licensing, Llc Memory bandwidth management for deep learning applications

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160196488A1 (en) * 2013-08-02 2016-07-07 Byungik Ahn Neural network computing device, system and method
CN104657111A (zh) * 2013-11-20 2015-05-27 方正信息产业控股有限公司 Parallel computing method and apparatus
US20150356075A1 (en) * 2014-06-06 2015-12-10 Google Inc. Generating representations of input sequences using neural networks
CN104036451A (zh) * 2014-06-20 2014-09-10 深圳市腾讯计算机系统有限公司 Model parallel processing method and apparatus based on multiple graphics processors
US20160034812A1 (en) * 2014-07-31 2016-02-04 Qualcomm Incorporated Long short-term memory using a spiking neural network
CN105184366A (zh) * 2015-09-15 2015-12-23 中国科学院计算技术研究所 Time-division-multiplexed general-purpose neural network processor
CN105512723A (zh) * 2016-01-20 2016-04-20 南京艾溪信息科技有限公司 Artificial neural network computing apparatus and method for sparse connections

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3564863A4 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110687584A (zh) * 2018-07-06 2020-01-14 中国人民解放军陆军防化学院 Rapid nuclide identification method based on LSTM
CN110687584B (zh) * 2018-07-06 2024-01-26 中国人民解放军陆军防化学院 Rapid nuclide identification method based on LSTM
CN110163337A (zh) * 2018-11-12 2019-08-23 腾讯科技(深圳)有限公司 Neural-network-based data processing method, apparatus, device, and storage medium
CN109685252B (zh) * 2018-11-30 2023-04-07 西安工程大学 Building energy consumption prediction method based on recurrent neural network and multi-task learning model
CN109685252A (zh) * 2018-11-30 2019-04-26 西安工程大学 Building energy consumption prediction method based on recurrent neural network and multi-task learning model
CN109670581A (zh) * 2018-12-21 2019-04-23 北京中科寒武纪科技有限公司 Computing device and board card
US11042797B2 (en) 2019-01-08 2021-06-22 SimpleMachines Inc. Accelerating parallel processing of data in a recurrent neural network
CN111767999A (zh) * 2019-04-02 2020-10-13 上海寒武纪信息科技有限公司 Data processing method and apparatus, and related products
CN111767999B (zh) * 2019-04-02 2023-12-05 上海寒武纪信息科技有限公司 Data processing method and apparatus, and related products
CN110084367A (zh) * 2019-04-19 2019-08-02 安徽农业大学 Soil moisture prediction method based on an LSTM deep learning model
CN112488395A (zh) * 2020-12-01 2021-03-12 湖南大学 Distribution network line loss prediction method and system
CN112488395B (zh) * 2020-12-01 2024-04-05 湖南大学 Distribution network line loss prediction method and system
CN112793797A (zh) * 2021-02-03 2021-05-14 东航技术应用研发中心有限公司 Aircraft landing bounce early-warning method and system
CN113488052B (zh) * 2021-07-22 2022-09-02 深圳鑫思威科技有限公司 Mutual control method for wireless voice transmission and AI voice recognition
CN113488052A (zh) * 2021-07-22 2021-10-08 深圳鑫思威科技有限公司 Mutual control method for wireless voice transmission and AI voice recognition

Also Published As

Publication number Publication date
EP3564863A1 (en) 2019-11-06
EP3564863B1 (en) 2024-03-13
EP3564863A4 (en) 2020-08-19

Similar Documents

Publication Publication Date Title
WO2018120016A1 (zh) Apparatus for executing LSTM neural network operation, and operation method
CN111260025B (zh) Apparatus for executing LSTM neural network operation, and operation method
CN109117948B (zh) Painting style conversion method and related products
KR102470264B1 (ko) Apparatus and method for executing reverse training of a fully connected layer neural network
WO2017185391A1 (zh) Apparatus and method for executing convolutional neural network training
CN109358900B (zh) Artificial neural network forward operation apparatus and method supporting discrete data representation
WO2018192500A1 (zh) Processing apparatus and processing method
US10853722B2 (en) Apparatus for executing LSTM neural network operation, and operational method
WO2017124644A1 (zh) Artificial neural network compression coding apparatus and method
US20200097799A1 (en) Heterogeneous multiplier
WO2017124646A1 (zh) Artificial neural network computing apparatus and method for sparse connections
WO2019015541A1 (zh) Computing method and related products
WO2017185347A1 (zh) Apparatus and method for executing recurrent neural network and LSTM operations
CN112559051A (zh) Deep learning implementations using systolic arrays and fused operations
WO2017185396A1 (zh) Apparatus and method for executing matrix addition/subtraction operations
US20170185888A1 (en) Interconnection Scheme for Reconfigurable Neuromorphic Hardware
JP2019204492A (ja) Neuromorphic accelerator multitasking
EP3451238A1 (en) Apparatus and method for executing pooling operation
EP3561732A1 (en) Operation apparatus and method for artificial neural network
US20170286829A1 (en) Event-driven Learning and Reward Modulation with Spike Timing Dependent Plasticity in Neuromorphic Computers
CN107315716B (zh) Apparatus and method for executing vector outer product operations
WO2017185395A1 (zh) Apparatus and method for executing vector comparison operations
WO2017185392A1 (zh) Apparatus and method for executing vector four-arithmetic operations
WO2017185404A1 (zh) Apparatus and method for executing vector logic operations
WO2018058427A1 (zh) Neural network operation apparatus and method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16925613

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2016925613

Country of ref document: EP

Effective date: 20190730