CN109284825B - Apparatus and method for performing LSTM operations - Google Patents

Apparatus and method for performing LSTM operations

Info

Publication number
CN109284825B
CN109284825B CN201811279404.3A CN201811279404A CN109284825B
Authority
CN
China
Prior art keywords
unit
data
operation module
neuron
input
Prior art date
Legal status
Active
Application number
CN201811279404.3A
Other languages
Chinese (zh)
Other versions
CN109284825A (en)
Inventor
郭崎
陈峋宇
陈云霁
陈天石
Current Assignee
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN201811279404.3A priority Critical patent/CN109284825B/en
Publication of CN109284825A publication Critical patent/CN109284825A/en
Application granted granted Critical
Publication of CN109284825B publication Critical patent/CN109284825B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003: Arrangements for executing specific machine instructions
    • G06F9/30007: Arrangements for executing specific machine instructions to perform operations on data operands
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)
  • Medicines Containing Antibodies Or Antigens For Use As Internal Diagnostic Agents (AREA)

Abstract

The invention provides an apparatus for executing recurrent neural network and LSTM operations, comprising an instruction storage unit, a controller unit, a data access unit, an interconnection module, a master operation module, and a plurality of slave operation modules. Each slave operation module multiplies and accumulates the input data to obtain partial sums, which it stores until all neuron data have been input, and then returns its result to the master operation module. The master operation module performs interpolated activation on the sums returned by the slave operation modules during the forward pass, and during the backward pass obtains the activation derivative by interpolation and multiplies it by the gradient. The invention addresses the insufficient operation performance and high front-end decoding overhead of CPUs and GPUs, and effectively improves support for forward operations of multilayer artificial neural networks.

Description

Apparatus and method for performing LSTM operations
Technical Field
The invention belongs to the technical field of artificial neural networks, and in particular relates to an apparatus and a method for executing LSTM operations.
Background
Recurrent neural networks and LSTM are widely applied in fields such as speech recognition, language modeling, translation, and image captioning, and in recent years have received increasingly broad attention in academia and industry owing to their high recognition accuracy and good parallelizability.
One known method of supporting recurrent neural networks and LSTM is to use a general-purpose processor, which supports the above algorithms by executing general instructions with a general register file and general functional units. One disadvantage of this approach is that the performance of a single general-purpose processor is low and cannot meet the performance requirements of typical recurrent neural network and LSTM operations. When multiple general-purpose processors execute in parallel, the communication between them becomes a performance bottleneck. In addition, a general-purpose processor must decode the backward operation of the recurrent neural network and LSTM into a long sequence of arithmetic and memory-access instructions, so the processor's front-end decoding brings a large power consumption overhead.
Another known method of supporting recurrent neural networks and LSTM is to use a graphics processor (GPU), which supports the above algorithms by executing general-purpose SIMD instructions with a general register file and general stream processing units. Because the GPU is a device designed for graphics operations and scientific computing, it has no dedicated support for multilayer artificial neural network operations, so a large amount of front-end decoding work is still required, which brings substantial additional overhead. In addition, the GPU has only a small on-chip cache, so the model data (weights) of the recurrent neural network and LSTM must be transferred from off-chip repeatedly; off-chip bandwidth therefore becomes the main performance bottleneck and brings a huge power consumption overhead.
Disclosure of Invention
One aspect of the present invention provides an apparatus for executing recurrent neural network and LSTM operations, comprising an instruction storage unit, a controller unit, a data access unit, an interconnection module, a master operation module, and a plurality of slave operation modules, wherein: the instruction storage unit is used for caching instructions; the controller unit is used for reading an instruction from the instruction storage unit and decoding it into microinstructions that control the behavior of the interconnection module, the master operation module, and the slave operation modules; the data access unit is used for writing data from memory into the corresponding data storage units of the master operation module and each slave operation module, or reading data from those data storage units back to memory; the interconnection module is used, at the stage where the backward training of each neural network layer begins, to transmit the input gradient vector of the layer to all the slave operation modules, and, after the computation of the slave operation modules is complete, to add the partial output gradient vectors of all the slave operation modules pairwise, stage by stage, to obtain the output gradient vector of the layer; each slave operation module is used for multiplying and accumulating the input data to obtain a partial sum and storing it until all neuron data have been input, then returning the result to the master operation module; the master operation module is used for performing interpolated activation on the sums returned by the slave operation modules in the forward pass, and in the backward pass for obtaining the activation derivative by interpolation and multiplying it by the gradient.
The invention also provides a method for executing the recurrent neural network and the LSTM operation by using the device.
The apparatus may be applied in scenarios including, but not limited to: various electronic products such as data processing devices, robots, computers, printers, scanners, telephones, tablet computers, intelligent terminals, mobile phones, driving recorders, navigators, sensors, cameras, cloud servers, video cameras, projectors, watches, earphones, mobile storage devices, and wearable devices; various vehicles such as airplanes, ships, and automobiles; various household appliances such as televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; and various medical devices such as nuclear magnetic resonance apparatuses, B-mode ultrasound scanners, and electrocardiographs.
Drawings
FIG. 1 shows an example block diagram of the overall structure of an apparatus for implementing a recurrent neural network and LSTM in accordance with an embodiment of the present invention;
FIG. 2 schematically illustrates the structure of interconnected modules in an apparatus for performing a recurrent neural network and LSTM, in accordance with an embodiment of the present invention;
FIG. 3 illustrates an example block diagram of a main operational block architecture in an apparatus for performing a recurrent neural network and LSTM in accordance with an embodiment of this invention;
FIG. 4 illustrates an example block diagram of a slave operational module architecture in an apparatus for executing a recurrent neural network and LSTM in accordance with an embodiment of this invention;
FIG. 5 shows an example block diagram of a recurrent neural network and LSTM forward-reverse process in accordance with an embodiment of the present invention;
FIG. 6 illustrates the operation process of an apparatus for performing recurrent neural network and LSTM operations according to an embodiment of the present invention;
FIG. 7 is a structure of a recurrent neural network;
FIG. 8 is the structure of a block of the LSTM algorithm;
FIG. 9 shows a flow chart of the recurrent neural network and LSTM single layer of the present invention;
FIG. 10 shows a flowchart of gradient back-propagation for the single-layer operation of the recurrent neural network and LSTM of the present invention.
Detailed Description
Fig. 1 is a schematic diagram showing the overall structure of an apparatus for performing recurrent neural network and LSTM operations according to an embodiment of the present invention. As shown in Fig. 1, the apparatus includes an instruction storage unit 1, a controller unit 2, a data access unit 3, an interconnection module 4, a master operation module 5, and a plurality of slave operation modules 6. The instruction storage unit 1, the controller unit 2, the data access unit 3, the interconnection module 4, the master operation module 5, and the slave operation modules 6 may each be implemented by hardware circuits (including but not limited to FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits, and memristors).
The instruction storage unit 1 reads in instructions through the data access unit 3 and caches them. It may be implemented by various memory devices (SRAM, DRAM, eDRAM, memristor, 3D-DRAM, nonvolatile memory, and the like).
The controller unit 2 reads instructions from the instruction storage unit 1, decodes the instructions into micro instructions that control the behavior of other units or modules, and sends the micro instructions to the units or modules, such as the data access unit 3, the master operation module 5, the slave operation module 6, and the like.
The data access unit 3 is used for accessing and storing an external address space, directly reading and writing data to each storage unit in the device, and completing the loading and storage of the data.
The interconnection module is used for distributing the input vector of the master operation module to the plurality of slave operation modules, merging the calculation results of the slave operation modules, and returning the merged result to the master operation module. Fig. 2 schematically shows the structure of an embodiment of the interconnection module. The interconnection module 4 constitutes the data path between the master operation module 5 and the plurality of slave operation modules 6, and in this embodiment has an H-tree structure. The H-tree is a binary tree path formed by a number of nodes; each node sends upstream data identically to its two downstream nodes, merges the data returned by the two downstream nodes, and returns the result to its upstream node.
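For illustration, the pairwise merging performed by the H-tree nodes can be modeled as a binary-tree reduction. The following is a minimal Python sketch under that reading (the names htree_reduce and combine are ours, not the patent's; the combine operation is addition when accumulating gradient partial sums, or concatenation when splicing neuron values):

```python
# Minimal sketch (our own model, not the patent's API) of the pairwise
# merge performed by H-tree nodes on the way back to the master module.
def htree_reduce(values, combine):
    """Combine leaf values pairwise up a binary tree."""
    level = list(values)
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an unpaired node passes its value upstream unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Adding gradient partial sums from 8 slave modules:
print(htree_reduce([1.0, 2.0, 0.5, 3.0, 1.5, 2.5, 0.0, 1.0], lambda a, b: a + b))  # 11.5
# Splicing neuron values instead: "+" on lists concatenates.
print(htree_reduce([[0], [1], [2], [3]], lambda a, b: a + b))  # [0, 1, 2, 3]
```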
Taking the typical recurrent neural network and LSTM computation out = w × in_data as an example, the neuron data in the master operation module 5 are sent to each slave operation module 6 through the interconnection module 4; after the computation of the slave operation modules 6 is complete, the neuron values output by each slave operation module are spliced together stage by stage in the H-tree into a complete vector of neuron data, which serves as the intermediate result vector. Assuming the device has N slave operation modules and the network has m × N output neurons, the intermediate result vector is divided into m segments of N elements each, and the i-th slave operation module computes the i-th element of each segment. The N elements of a segment are spliced into a vector of length N through the interconnection module and returned to the master operation module. Thus, if the network has only N output neurons, each slave operation unit needs to output only a single neuron value; if the network has m × N output neurons, each slave operation unit outputs m neuron values.
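To make the partitioning concrete, the sketch below computes which output neurons fall to slave module i under the interleaved segmentation just described (a minimal Python sketch; slave_outputs and its parameters are our names, not the patent's):

```python
# Hypothetical sketch: output-neuron partitioning across N slave modules.
# Output neuron j of an (m*N)-neuron layer is computed by slave j % N,
# so each slave emits m values, one per segment of N neurons.
def slave_outputs(num_outputs, num_slaves, slave_id):
    """Indices of the output neurons assigned to one slave module."""
    return [j for j in range(num_outputs) if j % num_slaves == slave_id]

# 8 slaves, 24 output neurons (m = 3): slave 2 produces neurons 2, 10, 18.
print(slave_outputs(24, 8, 2))  # -> [2, 10, 18]
```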
In the present invention, the master operation module performs interpolated activation on the sums returned by the slave operation modules in the forward pass, and in the backward pass obtains the activation derivative by interpolation and multiplies it by the gradient.
In the present invention, each slave operation module multiplies and accumulates the input data to obtain a partial sum, stores it until all neuron data have been input, and then returns the result to the master operation module.
Fig. 3 shows an exemplary block diagram of the structure of the main operation module 5 in the apparatus for performing recurrent neural networks and LSTM operations according to the present invention. As shown in fig. 3, the main operation block 5 includes an operation unit 51, a data dependency relationship determination unit 52, and a neuron storage unit 53.
The neuron storage unit 53 is used for caching the input neuron data and output neuron data used by the master operation module 5 during computation. The operation unit 51 performs the various arithmetic functions of the master operation module. The data dependency relationship determination unit 52 is the port through which the operation unit 51 reads and writes the neuron storage unit 53, and ensures that there are no consistency conflicts in reading and writing data in the neuron storage unit 53.
Specifically, the data dependency relationship determination unit 52 determines whether the data of a microinstruction that has not yet been issued depends on the data of a microinstruction still in execution; if not, the microinstruction is allowed to issue immediately, otherwise it may issue only after all the microinstructions it depends on have completed. For example, all microinstructions destined for the data dependency unit 52 are stored in an instruction queue inside the unit; in this queue, if the read data range of a read instruction conflicts with the write data range of a write instruction earlier in the queue, the read instruction must wait until the write instruction it depends on has executed. Meanwhile, the data dependency relationship determination unit 52 is also responsible for reading the input gradient vector from the neuron storage unit 53 and sending it to the slave operation modules 6 through the interconnection module 4, while the output data of the slave operation modules 6 are sent directly to the operation unit 51 through the interconnection module 4. The instructions output by the controller unit 2 are sent to the operation unit 51 and the data dependency relationship determination unit 52 to control their behavior.
Fig. 4 shows an example block diagram of the structure of the slave operation module 6 of the apparatus for executing the recurrent neural network and LSTM according to the present invention. As shown in fig. 4, each slave operation module 6 includes an operation unit 61, a data dependency relationship judgment unit 62, a slave neuron buffer unit 63, a weight storage unit 64, and a weight gradient storage unit 65.
The arithmetic unit 61 receives the microinstruction issued by the controller unit 2 and performs arithmetic logic operations.
The data dependency relationship judgment unit 62 is responsible for the read and write operations on the storage units during computation and ensures that there are no consistency conflicts in those reads and writes. Specifically, it determines whether the data of a microinstruction that has not yet been issued depends on the data of a microinstruction still in execution; if not, the microinstruction is allowed to issue immediately, otherwise it may issue only after all the microinstructions it depends on have completed. For example, all microinstructions destined for the data dependency unit 62 are stored in an instruction queue inside the unit; in this queue, if the read data range of a read instruction conflicts with the write data range of a write instruction earlier in the queue, the read instruction must wait until the write instruction it depends on has executed.
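The check in units 52 and 62 amounts to comparing the read range of a queued instruction against the write ranges of instructions ahead of it. A minimal sketch, assuming address-range operands (the queue model and names such as may_issue are ours, not the patent's):

```python
# Hypothetical sketch of the read-after-write check performed by the
# data dependency judgment units: a read may not issue while an earlier
# write in the queue overlaps its address range.
from collections import namedtuple

MicroOp = namedtuple("MicroOp", "kind start end")  # kind: 'read' or 'write'

def may_issue(op, queue_ahead):
    """True if `op` has no conflict with instructions ahead of it in the queue."""
    if op.kind != "read":
        return True  # simplified: only read-after-write hazards modeled here
    for prev in queue_ahead:
        overlaps = prev.kind == "write" and not (op.end <= prev.start or prev.end <= op.start)
        if overlaps:
            return False  # must wait until the dependent write completes
    return True

queue = [MicroOp("write", 0, 64)]
print(may_issue(MicroOp("read", 32, 96), queue))   # False: ranges overlap
print(may_issue(MicroOp("read", 64, 128), queue))  # True: disjoint ranges
```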
The slave neuron caching unit 63 caches the scalar elements of the input vector that correspond to this slave operation module 6, as well as the partial sums of the output vector computed by this slave operation module 6.
The weight storage unit 64 caches the weight data required by this slave operation module 6 during computation. Each slave operation module stores only the columns of the weight matrix that correspond to the scalar data it holds.
The weight gradient storage unit 65 caches the weight gradient data required by the corresponding slave operation module during the weight update. Each slave operation module 6 stores only the weight gradient data corresponding to the weight data it stores.
In the process by which the slave operation modules 6 compute the output gradient vector of the recurrent neural network and LSTM, the first half of the computation and the weight update can proceed in parallel.
Taking out = w × in_data as an example, the multiplication of the weight matrix w and the input gradient vector in_data can be divided into independent parallel subtasks: out and in_data are column vectors, and each slave operation module computes only the products of its corresponding scalar elements of in_data with the corresponding columns of the weight matrix w. Each resulting output vector is a partial sum to be accumulated into the final result, and these partial sums are added pairwise in the H-tree to obtain the final result. The computation thus becomes a parallel phase of computing partial sums followed by an accumulation phase. Each slave operation module 6 computes a partial sum of the output vector, and all partial sums are summed in the interconnection module 4 to obtain the final output vector. Each slave operation module 6 also multiplies the input vector by the output value of each layer from the forward pass to compute the weight gradients, in order to update the weights stored in that slave operation module 6. Forward operation and backward training are the two main processes of a neural network algorithm: to train (update) the weights in the network, the forward output of an input vector through the network formed by the current weights is computed first (the forward process), and then the weights of each layer are trained (updated) layer by layer in reverse according to the difference between the output value and the label of the input vector. During the forward computation, the output vectors of each layer and the derivative values of the activation functions are saved, since these data are needed by the backward training process; they are therefore guaranteed to exist when backward training begins. The output values of each layer in the forward operation are existing data when the backward operation starts, and can be cached in the master operation module through the data access unit and sent to the slave operation modules through the H-tree. The master operation module 5 performs subsequent computation based on the output gradient vector, for example multiplying the output gradient vector by the derivative of the activation function from the forward pass to obtain the input gradient value of the next layer. The derivative of the activation function from the forward pass is likewise existing data when the backward operation begins, and can be cached in the master operation module through the data access unit.
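As a sketch of this decomposition (our own minimal Python model, assuming numpy and a power-of-two slave count; the patent describes hardware, not this API): each slave holds a slice of in_data and the matching columns of w, produces a full-length partial output vector, and the partials are summed pairwise as the H-tree would.

```python
# Hypothetical sketch of the column-partitioned w * in_data decomposition:
# slave i holds elements i::N of in_data and the matching columns of w,
# emits a full-length partial sum, and the H-tree adds partials pairwise.
import numpy as np

def slave_partial(w, in_data, num_slaves, slave_id):
    idx = np.arange(slave_id, len(in_data), num_slaves)
    return w[:, idx] @ in_data[idx]          # partial sum of the output vector

def parallel_matvec(w, in_data, num_slaves=4):
    partials = [slave_partial(w, in_data, num_slaves, i) for i in range(num_slaves)]
    while len(partials) > 1:                 # pairwise accumulation, as in the H-tree
        partials = [partials[i] + partials[i + 1] for i in range(0, len(partials), 2)]
    return partials[0]

rng = np.random.default_rng(0)
w, x = rng.normal(size=(8, 16)), rng.normal(size=16)
assert np.allclose(parallel_matvec(w, x), w @ x)
```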
According to an embodiment of the invention, an instruction set for executing artificial neural network forward operations on the above apparatus is also provided. The instruction set comprises the CONFIG, COMPUTE, IO, NOP, JUMP, and MOVE instructions, wherein:
the CONFIG instruction configures the various constants required by the current layer's computation before the computation of each layer of the artificial neural network begins;
the COMPUTE instruction performs the arithmetic and logic computation of each layer of the artificial neural network;
the IO instruction reads the input data required for computation from the external address space, and stores the data back to the external space after the computation completes;
the NOP instruction empties the microinstructions currently loaded in all internal microinstruction cache queues, guaranteeing that all instructions before it have completed; the NOP instruction itself contains no operation;
the JUMP instruction changes the address of the next instruction the controller will read from the instruction storage unit, and is used to implement jumps in the control flow;
the MOVE instruction moves data at one address in the device's internal address space to another address in the internal address space; this process is independent of the operation unit and occupies no operation-unit resources during execution.
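To make the instruction roles concrete, here is a minimal sketch of the instruction set as an enumeration with a toy decoded form (a hypothetical encoding of ours; the patent names the instructions and their roles but specifies neither opcodes nor operand layouts):

```python
# Hypothetical encoding of the six-instruction set; all operand names
# and address strings below are illustrative, not from the patent.
from enum import Enum, auto

class Opcode(Enum):
    CONFIG = auto()   # set per-layer constants (precision, activation table, ...)
    COMPUTE = auto()  # arithmetic/logic computation of one network layer
    IO = auto()       # load inputs from / store results to external address space
    NOP = auto()      # drain all internal microinstruction queues; no operation
    JUMP = auto()     # redirect the controller's next-instruction address
    MOVE = auto()     # copy data between internal addresses, bypassing the ALUs

def decode(opcode, **operands):
    """Toy decoded instruction: opcode plus free-form operands."""
    return {"op": opcode, **operands}

prog = [
    decode(Opcode.IO, src="ext:0x1000", dst="neuron_ram"),
    decode(Opcode.CONFIG, precision=16),
    decode(Opcode.COMPUTE, layer=0),
    decode(Opcode.IO, src="neuron_ram", dst="ext:0x2000"),
]
```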
FIG. 5 shows an example block diagram of the recurrent neural network and LSTM forward and backward process according to an embodiment of the present invention. In the different slave operation modules 6, the input neuron vector is dot-multiplied with the weight vector of each slave operation module 6 to obtain the corresponding output neuron values; all these output neuron values form an intermediate result vector, which, after bias-vector addition and an activation operation, yields the final output neuron vector of this layer of the neural network. The formula is out = f(w × in_data + b), where f is the activation function and b is the bias vector.
The weight vector of each slave operation module 6 is the column vector of the weight matrix corresponding to that slave operation module 6. The interconnection module sends the input neuron vector [in0, …, inN] to all slave operation units, where it is temporarily stored in the slave neuron caching units. The i-th slave operation unit computes the dot product of its corresponding weight vector [w_i0, …, w_iN] with the input neuron vector. The results output by the slave operation units are spliced into a complete output vector through the interconnection module and returned to the master operation unit, where the activation operation is performed to obtain the final output neuron vector [out0, out1, out2, …, outN].
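A minimal sketch of this forward step under the same assumptions as before (our names; each row of w below plays the per-slave weight vector that the patent calls a column of the weight matrix, and np.tanh stands in for the interpolated activation):

```python
# Hypothetical sketch of the forward layer of FIG. 5: each slave computes
# one dot product, the interconnect splices the results, and the master
# adds the bias and applies the activation function.
import numpy as np

def forward_layer(w, bias, in_vec, activation=np.tanh):
    partial = [np.dot(w_row, in_vec) for w_row in w]  # per-slave dot products
    intermediate = np.array(partial)                  # spliced by the H-tree
    return activation(intermediate + bias)            # master: bias + activate

rng = np.random.default_rng(1)
w, b, x = rng.normal(size=(4, 8)), rng.normal(size=4), rng.normal(size=8)
out = forward_layer(w, b, x)  # final output neuron vector [out0..out3]
```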
FIG. 6 illustrates a process for implementing a recurrent neural network and LSTM operations using the apparatus and instruction set of the present invention.
In step S1, an IO instruction is stored in advance at the head address of the instruction storage unit 1.
In step S2, the operation starts, the controller unit 2 reads the IO instruction from the first address of the instruction storage unit 1, and according to the translated microinstruction, the data access unit 3 reads all corresponding artificial neural network operation instructions from the external address space and buffers them in the instruction storage unit 1.
At step S3, the controller unit 2 then reads in the next IO instruction from the instruction storage unit, and according to the translated microinstruction, the data access unit 3 reads all the data (e.g., including input neuron vectors, interpolation tables, constant tables, offsets, etc.) required by the main operation block 5 from the external address space to the neuron storage unit 53 of the main operation block 5.
In step S4, the controller unit 2 then reads in the next IO instruction from the instruction storage unit, and according to the translated microinstruction, the data access unit 3 reads the weight matrix data required by the slave operation modules 6 from the external address space.
At step S5, the controller unit 2 then reads in the next CONFIG instruction from the instruction storage unit, and based on the translated microinstructions, the device configures the various constants needed for this layer's neural network computation. For example, the operation units 51 and 61 configure the values of their internal registers according to parameters in the microinstructions, such as the precision setting of this layer's computation and the data of the activation function.
At step S6, the controller unit 2 reads in the next COMPUTE instruction from the instruction storage unit, and according to the translated microinstruction, the master operation module 5 first sends the input neuron vector to each slave operation module 6 through the interconnection module 4, to be stored in the slave neuron caching unit 63 of that slave operation module 6.
In step S7, according to the microinstruction translated from the COMPUTE instruction, the operation unit 61 of each slave operation module 6 reads its weight vector (the column vector of the weight matrix corresponding to that slave operation module 6) from the weight storage unit 64, reads the input neuron vector from the slave neuron caching unit, completes the dot product of the weight vector and the input neuron vector, and returns the intermediate result through the interconnection module.
In step S8, in the interconnection module 4, the intermediate results returned by the slave operation modules 6 are spliced stage by stage into a complete intermediate result vector.
In step S9, the master operation module 5 obtains the return value of the interconnection module 4, and according to the microinstruction translated from the COMPUTE instruction, reads the bias vector from the neuron storage unit 53, adds it to the vector returned by the interconnection module 4, activates the sum, and writes the final output neuron vector back to the neuron storage unit 53.
In step S10, the controller unit then reads in the next IO instruction from the instruction storage unit, and based on the translated microinstruction, the data access unit 3 stores the output neuron vector in the neuron storage unit 53 to the external address space specified address, and the operation ends.
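Gathering steps S1 to S10 into one place, the instruction stream executed by the controller looks roughly like the following (a hypothetical listing reusing the toy decode() sketch above; all addresses and operand strings are illustrative):

```python
# Hypothetical instruction stream for one layer, mirroring steps S1-S10.
layer_program = [
    decode(Opcode.IO, what="fetch all NN instructions", dst="instr_ram"),   # S1-S2
    decode(Opcode.IO, what="load inputs, interpolation/constant tables, bias",
           dst="master.neuron_ram"),                                        # S3
    decode(Opcode.IO, what="load weight matrix", dst="slaves.weight_ram"),  # S4
    decode(Opcode.CONFIG, precision=16, activation="interp_table"),         # S5
    decode(Opcode.COMPUTE, step="broadcast inputs, slave dot products, "
                                "H-tree splice, bias + activation"),        # S6-S9
    decode(Opcode.IO, src="master.neuron_ram", dst="ext:out_addr"),         # S10
]
```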
FIG. 7 shows the structure of a recurrent neural network. To capture the dependence of the network on earlier inputs in time, in the forward operation the input of the recurrent neural network comes from both the input at the current time and the hidden-layer output at the previous time. In the formulas, I is the number of inputs, H is the number of hidden-layer units, and K is the number of outputs; α_h^t is the intermediate (pre-activation) value of the h-th hidden unit at time t, b_h^t is its activated output at time t, δ_h^t denotes the partial derivative of the residual with respect to α_h^t, and θ denotes the activation function.

The forward propagation formula is:

$$\alpha_h^t = \sum_{i=1}^{I} w_{ih}\, x_i^t + \sum_{h'=1}^{H} w_{h'h}\, b_{h'}^{t-1}$$

$$b_h^t = \theta(\alpha_h^t)$$

The back-propagation formula is:

$$\delta_h^t = \theta'(\alpha_h^t)\left(\sum_{k=1}^{K} \delta_k^t\, w_{hk} + \sum_{h'=1}^{H} \delta_{h'}^{t+1}\, w_{hh'}\right)$$

The connection from the hidden layer at the previous time to the output of this layer is what gives the network its ability to integrate over the time sequence. However, such a recurrent neural network suffers from decay of information over time.
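Under the reconstruction above, one forward and one backward time step can be sketched as follows (a minimal numpy model of the formulas, not of the hardware; tanh stands in for θ, and all function and weight names are ours):

```python
# Minimal numpy sketch of the reconstructed RNN formulas; tanh plays θ.
import numpy as np

def rnn_forward_step(w_in, w_rec, x_t, b_prev):
    # α_h^t = Σ_i w_ih x_i^t + Σ_h' w_h'h b_h'^{t-1};  b_h^t = θ(α_h^t)
    alpha = w_in.T @ x_t + w_rec.T @ b_prev
    return alpha, np.tanh(alpha)

def rnn_backward_step(alpha, w_out, w_rec, delta_out_t, delta_hidden_next):
    # δ_h^t = θ'(α_h^t) · (Σ_k δ_k^t w_hk + Σ_h' δ_h'^{t+1} w_hh')
    theta_prime = 1.0 - np.tanh(alpha) ** 2
    return theta_prime * (w_out @ delta_out_t + w_rec @ delta_hidden_next)
```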
FIG. 8 shows the structure of one block of the LSTM algorithm. Compared with the traditional recurrent neural network, LSTM introduces a cell to record the information of the current time point. In the LSTM algorithm, a block consists of three gates and a cell: an input gate, an output gate, and a forgetting gate. The main idea of the LSTM algorithm is to use the cell to record the state at the current time and to pass the cell value of the previous time forward, so that information can be transmitted directly between different times. The weights of the current-time input and the previous-time cell in the cell's output are controlled by the input gate and the forgetting gate, respectively, and the output of the cell is controlled by the output gate. Under the control of the input gate and the forgetting gate, appropriate information can be stored for a long time, remaining recorded in the cell, which solves the problem that the recurrent neural network decays over time.
FIG. 9 shows a flow chart of the recurrent neural network and LSTM single layer of the present invention.
In step A1, the sum of products of the current-time input and the input gate's weights is computed and buffered in the neuron buffer; then the sum of products of the previous-time cell state and the corresponding weights and the sum of products of the previous-time hidden layer and the corresponding weights are computed and likewise buffered in the neuron buffer. Finally, the three are added together and activated to obtain the input gate value.
In step A2, the sum of products of the current-time input and the forgetting gate's weights is computed and buffered in the neuron buffer; then the sum of products of the previous-time cell state and the corresponding weights and the sum of products of the previous-time hidden layer and the corresponding weights are computed and likewise buffered in the neuron buffer. Finally, the three are added together and activated to obtain the forgetting gate value.
In step A3, the sum of products of the current-time input and the corresponding weights is computed and buffered in the neuron buffer; then the sum of products of the previous-time hidden layer and the corresponding weights is computed and also buffered in the neuron buffer. The two are added together and activated to obtain the cell-state intermediate value, which is buffered in the neuron buffer. Then this intermediate value is multiplied elementwise by the input gate value and buffered in the buffer of the operation unit (51 in FIG. 3), while the previous-time cell state is multiplied elementwise by the forgetting gate value and added to the previous buffer contents in the operation unit to obtain the cell state value.
In step A4, the sum of products of the current-time input and the output gate's weights is computed and buffered in the neuron buffer; then the sum of products of the current-time cell state and the corresponding weights and the sum of products of the previous-time hidden layer and the corresponding weights are computed and likewise buffered in the neuron buffer. Finally, the three are added together and activated to obtain the output gate value.
In step A5, the cell state is multiplied elementwise by the output gate value to obtain the output of the current layer.
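Steps A1 to A5 correspond to the standard LSTM forward pass with peephole connections from the cell state into the gates. A minimal numpy sketch under that reading (our variable names; sigmoid and tanh stand in for the patent's table-interpolated activations):

```python
# Minimal sketch of steps A1-A5 (LSTM forward with peepholes); sigmoid/tanh
# stand in for the patent's interpolated activations; p holds the weights.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward_step(p, x_t, h_prev, c_prev):
    # A1: input gate from current input, previous cell state, previous hidden.
    i = sigmoid(p["W_xi"] @ x_t + p["w_ci"] * c_prev + p["W_hi"] @ h_prev)
    # A2: forgetting gate, same three contributions.
    f = sigmoid(p["W_xf"] @ x_t + p["w_cf"] * c_prev + p["W_hf"] @ h_prev)
    # A3: cell-state intermediate value, then gated update of the cell state.
    g = np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev)
    c = i * g + f * c_prev
    # A4: output gate uses the *current* cell state; A5: layer output.
    o = sigmoid(p["W_xo"] @ x_t + p["w_co"] * c + p["W_ho"] @ h_prev)
    return o * np.tanh(c), c
```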
FIG. 10 shows the gradient backprojection flow diagram for the single layer operation of the recurrent neural network and LSTM of the present invention.
In step B1, the sum of the product of the next-time hidden-layer gradient and the weight at the corresponding position and the product of the residual of the current layer and the corresponding weight is computed to obtain the output gradient of the current layer.
In step B2, the output gradient is multiplied elementwise by the cell activation values and the results are accumulated, then finally multiplied by the activation-function derivative held in the neuron buffer, to obtain the output gate gradient.
In step B3, the cell state gradient is obtained as follows: the current output gradient multiplied by the current output gate value and the derivative of the state activation is stored in the neuron buffer; then the gradient contributions from the next-time cell, from the input gate and forgetting gate gradients, and from the output gate gradient at this time multiplied by the corresponding weights are stored in the neuron buffer; finally all of these are added to obtain the cell state gradient. The gradient of the cell intermediate value is obtained by multiplying together the current-time input gate activation value, the derivative of the cell activation function, and the cell state gradient.
In step B4, the cell state gradients at the current time are multiplied elementwise by the cell state outputs at the previous time, and finally by the derivative of the forgetting gate, to obtain the forgetting gate gradient.
In step B5, the cell state gradients at the current time are multiplied elementwise by the activation value of the current-time cell intermediate value, and finally by the derivative of the input gate, to obtain the input gate gradient.
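Read this way, steps B1 to B5 match the standard LSTM backward pass. A compact numpy sketch under the same assumptions as the forward sketch above (single layer; the peephole contributions to the cell gradient are omitted for brevity, and the derivative helpers are ours):

```python
# Compact sketch of steps B1-B5 (one backward time step); continues the
# forward sketch above, so i/f/g/o/c name the forward intermediate values.
import numpy as np

def dsigmoid(y):  # derivative expressed via the already-activated value y
    return y * (1.0 - y)

def lstm_backward_step(fwd, d_h, dc_next):
    i, f, g, o, c, c_prev = (fwd[k] for k in ("i", "f", "g", "o", "c", "c_prev"))
    tanh_c = np.tanh(c)
    d_o = d_h * tanh_c * dsigmoid(o)             # B2: output gate gradient
    d_c = d_h * o * (1.0 - tanh_c**2) + dc_next  # B3: cell state gradient
    d_g = d_c * i * (1.0 - g**2)                 # B3: cell intermediate value
    d_f = d_c * c_prev * dsigmoid(f)             # B4: forgetting gate gradient
    d_i = d_c * g * dsigmoid(i)                  # B5: input gate gradient
    return d_i, d_f, d_g, d_o, d_c * f           # last term is dc_next at t-1
```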
It should be noted that running the traditional recurrent neural network algorithm on this apparatus amounts to a greatly simplified LSTM: the output depends only on the input at the current time and the output at the previous time, and the forward and backward expressions are similar to the corresponding sub-processes of the LSTM, so the details are not repeated here.
For a complete recurrent neural network and LSTM algorithm, the implementation process is similar to the typical computation above: the corresponding weights and data are fetched for weighted summation according to the formulas, and the operation instruction of the next layer takes the output neuron address of the previous layer, stored in the master operation unit, as the input neuron address of this layer. Similarly, the weight address and bias address in the instruction are changed to the corresponding addresses of the current layer.
By adopting the apparatus and instruction set for executing recurrent neural network and LSTM operations, the problems of insufficient CPU and GPU operation performance and high front-end decoding overhead are solved, and support for forward operations of multilayer artificial neural networks is effectively improved.
By adopting dedicated on-chip caches for the recurrent neural network and LSTM, the reusability of input neuron and weight data is fully exploited, avoiding repeated reads of these data from memory, reducing memory access bandwidth, and preventing memory bandwidth from becoming a bottleneck for the forward operation performance of multilayer artificial neural networks.

Claims (16)

1. An apparatus for performing an LSTM operation, comprising a controller unit, a data access unit, an interconnection module, a master operation module, and a slave operation module, wherein a block of the LSTM operation comprises an input gate, an output gate, a forgetting gate, and a cell; the input gate and the forgetting gate are used to control the weights of the current-time input and of the previous-time cell in the output of the cell; the output gate is used to control the output of the cell;
the data access unit is used for accessing an external address space, completing the loading and storage of data and reading in instructions;
the controller unit is used for reading the instruction, decoding the instruction into a micro instruction for controlling the behavior of other units or modules, and then distributing the respective micro instruction to each unit or module;
the interconnection module is used for sending the input vector of the master operation module to the slave operation module and returning the operation result of the slave operation module to the master operation module;
the slave operation module is used for multiplying and adding the input data to obtain a partial sum and storing the partial sum until all neuron data are input, and then returning all the partial sums to the master operation module through the interconnection module;
the main operation module is used for carrying out interpolation activation on a calculation result during forward operation;
the apparatus is to perform LSTM single layer computations, the LSTM single layer computations including:
the apparatus is used for computing the sum of products of the current-time input corresponding to the input gate and the weights, caching it in a neuron cache unit, and computing the sum of products of the previous-time cell state and the corresponding weights and the sum of products of the previous-time hidden layer and the corresponding weights, both of which are also cached in the neuron cache unit; finally the three are added and activated to obtain the input gate value;
the apparatus is used for computing the sum of products of the current-time input corresponding to the forgetting gate and the weights, caching it in the neuron cache unit, and computing the sum of products of the previous-time cell state and the corresponding weights and the sum of products of the previous-time hidden layer and the corresponding weights, both of which are also cached in the neuron cache unit; finally the three are added and activated to obtain the forgetting gate value;
the apparatus is used for computing the sum of products of the current-time input corresponding to the input gate and the weights, caching it in the neuron cache unit, and computing the sum of products of the previous-time hidden layer and the corresponding weights, which is stored in the neuron cache unit; the two sums of products are then added and activated to obtain a cell-state intermediate value, which is cached in the neuron cache unit; then the intermediate value is multiplied elementwise by the input gate value and cached in a cache unit of the operation unit, and the previous-time cell state is multiplied elementwise by the forgetting gate value and added to the previous cache contents in the operation unit to obtain the cell state value;
the apparatus is used for computing the sum of products of the current-time input corresponding to the output gate and the weights, caching it in the neuron cache unit, and computing the sum of products of the current-time cell state and the corresponding weights and the sum of products of the previous-time hidden layer and the corresponding weights, both of which are also cached in the neuron cache unit; finally the three are added and activated to obtain the output gate value; and the cell state is multiplied elementwise by the output gate value to obtain the LSTM single-layer output.
2. The apparatus of claim 1, further comprising: and the instruction storage unit is used for caching the read-in instructions.
3. The apparatus of claim 1, wherein the apparatus comprises a plurality of slave computing modules,
the interconnection module distributes the input vector of the master operation module to the slave operation modules, and splices the operation results of the slave operation modules stage by stage into a complete operation result, which is returned to the master operation module.
4. The apparatus of claim 3, wherein the interconnection module comprises a binary tree path formed by a plurality of nodes;
each node sends the upstream data to the two downstream nodes in the same way, and combines the data returned by the two downstream nodes and returns the data to the upstream node.
5. The apparatus of claim 4,
the master operation module is further used, during the backward operation, for performing interpolated activation on the calculation result and for multiplying the derivative obtained by differentiation by the gradient.
6. The apparatus of claim 5, wherein the master operation module comprises: an arithmetic unit, a data dependency relationship judging unit and a neuron cache unit, wherein,
the arithmetic unit is used for receiving the microinstruction sent by the controller unit and carrying out arithmetic logic operation;
the data dependency relationship judging unit is used for performing read-write operation on the neuron cache unit and ensuring that read-write consistency conflict does not exist in data used between the instructions;
the neuron buffer unit is used for buffering input neuron data and output neuron data.
7. The apparatus of claim 4, wherein the slave operation module comprises an operation unit, a data dependency judgment unit, a slave neuron caching unit, a weight storage unit, and a weight gradient storage unit,
the data dependency relationship judging unit is used for performing read-write operation on the slave neuron cache unit and ensuring that read-write consistency conflict does not exist in data used between instructions;
the slave neuron buffer unit is used for buffering input neuron data and output neuron data;
the weight storage unit is used for caching weight data required by the slave operation module in the calculation process;
the weight gradient storage unit is used for caching weight gradient data required by the corresponding slave operation module in the process of updating the weight;
the arithmetic unit is used for receiving the microinstruction sent by the controller unit and carrying out arithmetic logic operation on the input neuron data and the weight data.
8. The apparatus according to claim 6 or 7,
the data dependency relationship determining unit is specifically configured to determine whether a dependency relationship exists between first data of an unexecuted control signal and second data of a control signal in the process of execution; if no dependency relationship exists, the unexecuted control signal is allowed to execute immediately; if a dependency relationship exists, the unexecuted control signal is allowed to execute only after all the control signals it depends on have completely executed.
9. A method of performing LSTM operations, the method being applied to an LSTM operation apparatus comprising a controller unit, an interconnect module, a data access unit, a master operation module and a slave operation module, a block of the LSTM operations comprising: an input gate, an output gate, a forgetting gate and a cell; the input gate and the forgetting gate are used for controlling the weight of the current time input and the last time cell in the output of the cell; the output gate is used for controlling the output of the cell; the method comprises the following steps:
the data access unit accesses an external address space, completes the loading and storage of data and reads in instructions; the controller unit reads the instruction, decodes the instruction into a micro instruction for controlling the behavior of other units or modules, and then distributes the respective micro instruction to each unit or module;
the interconnection module sends the input vector of the master operation module to the slave operation module and returns the operation result of the slave operation module to the master operation module; the slave operation module multiplies and adds the input data to obtain a partial sum, and stores the partial sum until the neuron data are all input, and then all the partial sums are returned to the master operation module through the interconnection module; the main operation module carries out interpolation activation on a calculation result during forward operation;
the method further comprises the following steps: for performing LSTM single layer computations, the LSTM single layer computations including:
the apparatus computes the sum of products of the current-time input corresponding to the input gate and the weights, caches it in the neuron cache unit, and computes the sum of products of the previous-time cell state and the corresponding weights and the sum of products of the previous-time hidden layer and the corresponding weights, both of which are also cached in the neuron cache unit; finally the three are added and activated to obtain the input gate value;
the apparatus computes the sum of products of the current-time input corresponding to the forgetting gate and the weights, caches it in the neuron cache unit, and computes the sum of products of the previous-time cell state and the corresponding weights and the sum of products of the previous-time hidden layer and the corresponding weights, both of which are also cached in the neuron cache unit; finally the three are added and activated to obtain the forgetting gate value;
the apparatus computes the sum of products of the current-time input corresponding to the input gate and the weights, caches it in the neuron cache unit, and computes the sum of products of the previous-time hidden layer and the corresponding weights, which is stored in the neuron cache unit; the two sums of products are then added and activated to obtain a cell-state intermediate value, which is cached in the neuron cache unit; then the intermediate value is multiplied elementwise by the input gate value and cached in a cache unit of the operation unit, and the previous-time cell state is multiplied elementwise by the forgetting gate value and added to the previous cache contents in the operation unit to obtain the cell state value;
the apparatus computes the sum of products of the current-time input corresponding to the output gate and the weights, caches it in the neuron cache unit, and computes the sum of products of the current-time cell state and the corresponding weights and the sum of products of the previous-time hidden layer and the corresponding weights, both of which are also cached in the neuron cache unit; finally the three are added and activated to obtain the output gate value; and the cell state is multiplied elementwise by the output gate value to obtain the LSTM single-layer output.
10. The method of claim 9, wherein the apparatus further comprises: and the instruction storage unit is used for caching the read-in instructions.
11. The method of claim 9, wherein the apparatus comprises a plurality of slave operation modules, and the interconnection module sends the input vector of the master operation module to the slave operation module, and returns the operation result of the slave operation module to the master operation module specifically comprises:
the interconnection module distributes the input vector of the master operation module to the slave operation modules, and splices the operation results of the slave operation modules stage by stage into a complete operation result, which is returned to the master operation module.
12. The method of claim 11, wherein the interconnect module comprises a binary tree path formed by a plurality of nodes; the interconnection module distributes the input vector of the master operation module to the slave operation module, and splices the operation results of the slave operation module step by step into operation results which are returned to the master operation module, and the interconnection module specifically comprises:
each node sends the upstream data to the two downstream nodes in the same way, and combines the data returned by the two downstream nodes and returns the data to the upstream node.
13. The method of claim 12, further comprising:
when the master operation module performs the backward operation, it performs interpolated activation on the calculation result and multiplies the derivative obtained by differentiation by the gradient.
14. The method of claim 12, wherein the master operation module comprises an operation unit, a data dependency relationship judgment unit, and a neuron cache unit, and the method specifically comprises the following steps:
the arithmetic unit receives the microinstruction sent by the controller unit and carries out arithmetic logic operation;
the data dependency relationship judging unit performs read-write operation on the neuron cache unit, and ensures that read-write consistency conflict does not exist in data used between instructions; the neuron buffer unit buffers input neuron data and output neuron data.
15. The method of claim 12, wherein the slave operation module comprises an operation unit, a data dependency relationship judgment unit, a slave neuron cache unit, a weight storage unit, and a weight gradient storage unit, and the method specifically comprises the following steps:
the data dependency relationship judging unit performs read-write operation on the slave neuron cache unit, and ensures that read-write consistency conflict does not exist in data used between instructions;
the slave neuron buffer unit buffers input neuron data and output neuron data;
the weight storage unit caches weight data required by the slave operation module in the calculation process;
the weight gradient storage unit caches the weight gradient data required by the corresponding slave operation module in the process of updating the weights;
the arithmetic unit receives the microinstruction sent by the controller unit and carries out arithmetic logic operation on the input neuron data and the weight data.
16. The method according to claim 14 or 15, characterized in that the method further comprises:
the data dependency relationship judging unit judges whether a dependency relationship exists between first data of an unexecuted control signal and second data of a control signal in the process of execution; if no dependency relationship exists, the unexecuted control signal is allowed to execute immediately; if a dependency relationship exists, the unexecuted control signal is allowed to execute only after all the control signals it depends on have completely executed.
CN201811279404.3A 2016-04-29 2016-04-29 Apparatus and method for performing LSTM operations Active CN109284825B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811279404.3A CN109284825B (en) 2016-04-29 2016-04-29 Apparatus and method for performing LSTM operations

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811279404.3A CN109284825B (en) 2016-04-29 2016-04-29 Apparatus and method for performing LSTM operations
CN201610285178.4A CN107341542B (en) 2016-04-29 2016-04-29 Apparatus and method for performing recurrent neural networks and LSTM operations

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201610285178.4A Division CN107341542B (en) 2016-04-29 2016-04-29 Apparatus and method for performing recurrent neural networks and LSTM operations

Publications (2)

Publication Number Publication Date
CN109284825A CN109284825A (en) 2019-01-29
CN109284825B true CN109284825B (en) 2020-04-14

Family

ID=60222675

Family Applications (3)

Application Number Title Priority Date Filing Date
CN201811279404.3A Active CN109284825B (en) 2016-04-29 2016-04-29 Apparatus and method for performing LSTM operations
CN201911175801.0A Active CN110929863B (en) 2016-04-29 2016-04-29 Apparatus and method for performing LSTM operations
CN201610285178.4A Active CN107341542B (en) 2016-04-29 2016-04-29 Apparatus and method for performing recurrent neural networks and LSTM operations

Family Applications After (2)

Application Number Title Priority Date Filing Date
CN201911175801.0A Active CN110929863B (en) 2016-04-29 2016-04-29 Apparatus and method for performing LSTM operations
CN201610285178.4A Active CN107341542B (en) 2016-04-29 2016-04-29 Apparatus and method for performing recurrent neural networks and LSTM operations

Country Status (1)

Country Link
CN (3) CN109284825B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160542B (en) * 2017-12-14 2023-08-29 中科寒武纪科技股份有限公司 Integrated circuit chip device and related products
CN110018970B (en) * 2018-01-08 2023-07-21 腾讯科技(深圳)有限公司 Cache prefetching method, device, equipment and computer readable storage medium
CN108280885B (en) * 2018-01-09 2021-12-03 上海大学 Method for constructing holographic even image
CN108510065A (en) * 2018-03-30 2018-09-07 中国科学院计算技术研究所 Computing device and computational methods applied to long Memory Neural Networks in short-term
CN108805273A (en) * 2018-05-20 2018-11-13 复旦大学 Door control unit accelerates the hardware circuit implementation of operation in a kind of LSTM
CN109088406A (en) * 2018-06-26 2018-12-25 河海大学常州校区 A kind of micro-capacitance sensor equivalent modeling method based on LSTM neural network
CN110059809B (en) * 2018-10-10 2020-01-17 中科寒武纪科技股份有限公司 Computing device and related product
CN111045726B (en) * 2018-10-12 2022-04-15 上海寒武纪信息科技有限公司 Deep learning processing device and method supporting coding and decoding
CN109522052B (en) * 2018-11-27 2020-05-08 中科寒武纪科技股份有限公司 Computing device and board card
CN109543832B (en) * 2018-11-27 2020-03-20 中科寒武纪科技股份有限公司 Computing device and board card
WO2020125092A1 (en) * 2018-12-20 2020-06-25 中科寒武纪科技股份有限公司 Computing device and board card
CN109711540B (en) * 2018-12-20 2021-09-21 中科寒武纪科技股份有限公司 Computing device and board card
CN109670581B (en) * 2018-12-21 2023-05-23 中科寒武纪科技股份有限公司 Computing device and board card
CN109726797B (en) * 2018-12-21 2019-11-19 北京中科寒武纪科技有限公司 Data processing method, device, computer system and storage medium
CN109620154A (en) * 2018-12-21 2019-04-16 平安科技(深圳)有限公司 Borborygmus voice recognition method and relevant apparatus based on deep learning
CN109697500B (en) * 2018-12-29 2020-06-09 中科寒武纪科技股份有限公司 Data processing method and device, electronic equipment and storage medium
US11042797B2 (en) 2019-01-08 2021-06-22 SimpleMachines Inc. Accelerating parallel processing of data in a recurrent neural network
CN110942140B (en) * 2019-11-29 2022-11-08 任科扬 Artificial neural network difference and iteration data processing method and device
CN113537476A (en) * 2020-04-16 2021-10-22 中科寒武纪科技股份有限公司 Arithmetic device and related product
CN111898752A (en) * 2020-08-03 2020-11-06 乐鑫信息科技(上海)股份有限公司 Apparatus and method for performing LSTM neural network operations
CN112784970B (en) * 2020-12-31 2023-08-15 深圳大普微电子科技有限公司 Hardware accelerator, data processing method, system-level chip and medium
CN116226702B (en) * 2022-09-09 2024-04-26 武汉中数医疗科技有限公司 Thyroid sampling data identification method based on bioelectrical impedance

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104145281A (en) * 2012-02-03 2014-11-12 安秉益 Neural network computing apparatus and system, and method therefor
CN104699631A (en) * 2015-03-26 2015-06-10 中国人民解放军国防科学技术大学 Storage device and fetching method for multilayered cooperation and sharing in GPDSP (General-Purpose Digital Signal Processor)
CN105512723A (en) * 2016-01-20 2016-04-20 南京艾溪信息科技有限公司 Artificial neural network calculating device and method for sparse connection

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102200964B (en) * 2011-06-17 2013-05-15 孙瑞琛 Parallel-processing-based fast Fourier transform (FFT) device and method thereof
US20160034812A1 (en) * 2014-07-31 2016-02-04 Qualcomm Incorporated Long short-term memory using a spiking neural network
US10783900B2 (en) * 2014-10-03 2020-09-22 Google Llc Convolutional, long short-term memory, fully connected deep neural networks
CN104615983B (en) * 2015-01-28 2018-07-31 中国科学院自动化研究所 Activity recognition method based on recurrent neural network and human skeleton motion sequence
CN105488565A (en) * 2015-11-17 2016-04-13 中国科学院计算技术研究所 Calculation apparatus and method for accelerator chip accelerating deep neural network algorithm
CN105389772B (en) * 2015-12-02 2018-09-07 百度在线网络技术(北京)有限公司 Data processing method and device based on graphics processor

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104145281A (en) * 2012-02-03 2014-11-12 安秉益 Neural network computing apparatus and system, and method therefor
CN104699631A (en) * 2015-03-26 2015-06-10 中国人民解放军国防科学技术大学 Storage device and fetching method for multilayered cooperation and sharing in GPDSP (General-Purpose Digital Signal Processor)
CN105512723A (en) * 2016-01-20 2016-04-20 南京艾溪信息科技有限公司 Artificial neural network calculating device and method for sparse connection

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DaDianNao: A Machine-Learning Supercomputer; Yunji Chen et al.; 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture; 2014-12-31; pp. 609-622 *
LSTM Implementation Explained (LSTM实现详解); Adam Paszke; csdn.net/article/2015-09-14/2825693; 2015-09-14; pp. 1-4 *

Also Published As

Publication number Publication date
CN109284825A (en) 2019-01-29
CN107341542B (en) 2021-06-11
CN107341542A (en) 2017-11-10
CN110929863B (en) 2023-11-28
CN110929863A (en) 2020-03-27

Similar Documents

Publication Publication Date Title
CN109284825B (en) Apparatus and method for performing LSTM operations
CN111860812B (en) Apparatus and method for performing convolutional neural network training
CN107832843B (en) Information processing method and related product
CN111860811B (en) Device and method for executing full-connection layer forward operation of artificial neural network
CN107316078B (en) Apparatus and method for performing artificial neural network self-learning operation
CN111860813B (en) Device and method for performing forward operation of convolutional neural network
KR102470264B1 (en) Apparatus and method for performing reverse training of a fully-connected layer neural network
KR102175044B1 (en) Apparatus and method for running artificial neural network reverse training
CN106991476B (en) Apparatus and method for performing artificial neural network forward operations
WO2017185347A1 (en) Apparatus and method for executing recurrent neural network and lstm computations
CN107886166B (en) Device and method for executing artificial neural network operation
WO2017185335A1 (en) Apparatus and method for executing batch normalization operation
WO2017185248A1 (en) Apparatus and method for performing auto-learning operation of artificial neural network
CN107341546B (en) Device and method for executing batch normalization operation
CN107329733B (en) Apparatus and method for performing posing operations

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant after: Zhongke Cambrian Technology Co., Ltd

Address before: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.

GR01 Patent grant
GR01 Patent grant