CN109284825A - Device and method for executing LSTM operation - Google Patents

Device and method for executing LSTM operation

Info

Publication number
CN109284825A
Authority
CN
China
Prior art keywords
computing module
gradient
weight
neuron
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811279404.3A
Other languages
Chinese (zh)
Other versions
CN109284825B (en)
Inventor
郭崎 (Guo Qi)
陈峋宇 (Chen Xunyu)
陈云霁 (Chen Yunji)
陈天石 (Chen Tianshi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Beijing Zhongke Cambrian Technology Co Ltd
Original Assignee
Beijing Zhongke Cambrian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Cambrian Technology Co Ltd
Priority to CN201811279404.3A
Publication of CN109284825A
Application granted
Publication of CN109284825B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 - Arrangements for executing specific machine instructions
    • G06F9/30007 - Arrangements for executing specific machine instructions to perform operations on data operands
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The present invention provides a device for executing recurrent neural network (RNN) and LSTM operations, comprising an instruction storage unit, a controller unit, a data access unit, an interconnection module, a master computing module, and multiple slave computing modules. The slave computing modules perform multiply-accumulate operations on input data to obtain partial sums and hold them until all neuron data have been input, then return the results to the master computing module. During the forward pass, the master computing module performs interpolated activation on the values returned by the slave computing modules; during the backward pass, it interpolates the activation derivative and multiplies it by the gradient. The invention addresses the insufficient computing performance of CPUs and GPUs and their high front-end decoding overhead, and effectively improves support for the forward operation of multi-layer artificial neural networks.

Description

Device and method for executing LSTM operation
Technical field
The present invention relates to the technical field of artificial neural networks, and in particular to LSTM; it especially concerns a device and method for executing LSTM operations.
Background art
Recurrent neural networks (RNNs) and LSTM are widely used in speech recognition, language modeling, translation, image captioning, and other fields. In recent years, owing to their high recognition accuracy and good parallelizability, they have received increasingly broad attention from both academia and industry.
One known method of supporting RNN and LSTM operations is to use a general-purpose processor, which executes general instructions through a general-purpose register file and general functional units to support the above algorithms. One disadvantage of this method is that the computational performance of a single general-purpose processor is low and cannot meet the performance requirements of typical RNN and LSTM operations. When multiple general-purpose processors execute in parallel, the communication between them in turn becomes a performance bottleneck. In addition, a general-purpose processor must decode the backward computation of an RNN or LSTM into a long sequence of arithmetic and memory-access instructions, and this front-end decoding incurs considerable power overhead.
Another known method of supporting RNN and LSTM operations is to use a graphics processing unit (GPU), which executes general SIMD instructions through a general-purpose register file and general stream processing units to support the above algorithms. Since the GPU is a device specially designed for graphics operations and scientific computing, it provides no dedicated support for multi-layer artificial neural network operations; a large amount of front-end decoding work is still required before such operations can be executed, bringing substantial extra overhead. Moreover, the GPU has only a small on-chip cache, so the model data (weights) of the RNN and LSTM must be transferred from off-chip again and again; off-chip bandwidth thus becomes the main performance bottleneck while also incurring huge power overhead.
Summary of the invention
One aspect of the present invention provides a device for executing RNN and LSTM operations, comprising an instruction storage unit, a controller unit, a data access unit, an interconnection module, a master computing module, and multiple slave computing modules, wherein: the instruction storage unit caches instructions; the controller unit reads instructions from the instruction storage unit and decodes them into micro-instructions that control the behavior of the interconnection module, the master computing module, and the slave computing modules; the data access unit writes data from memory to the respective data storage units of the master computing module and of each slave computing module, or reads data from those data storage units back to memory; the interconnection module operates as follows: at the stage where the backward training computation of each neural-network layer starts, the master computing module transmits the layer's input gradient vector to all slave computing modules through the interconnection module, and after the slave computing modules finish their computation, the interconnection module adds the partial output gradient vectors of the slave computing modules pairwise, stage by stage, to obtain the layer's output gradient vector; the slave computing modules perform multiply-accumulate operations on input data to obtain partial sums and hold them until all neuron data have been input, then return the results to the master computing module; the master computing module performs interpolated activation on the values returned by the slave computing modules during the forward pass, and during the backward pass interpolates the activation derivative and multiplies it by the gradient.
The present invention also provides a method for executing RNN and LSTM operations using the above device.
The device can be applied in the following (non-limiting) scenarios: data processing; electronic products such as robots, computers, printers, scanners, telephones, tablet computers, smart terminals, mobile phones, dashboard cameras, navigators, sensors, webcams, cloud servers, cameras, video cameras, projectors, watches, earphones, mobile storage, and wearable devices; means of transport such as aircraft, ships, and vehicles; household appliances such as televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; and various medical equipment including nuclear magnetic resonance scanners, B-mode ultrasound scanners, and electrocardiographs.
Brief description of the drawings
Fig. 1 shows an example block diagram of the overall structure of a device for executing RNN and LSTM operations according to an embodiment of the present invention;
Fig. 2 schematically illustrates the structure of the interconnection module in the device for executing RNN and LSTM operations according to an embodiment of the present invention;
Fig. 3 shows an example block diagram of the structure of the master computing module in the device for executing RNN and LSTM operations according to an embodiment of the present invention;
Fig. 4 shows an example block diagram of the structure of a slave computing module in the device for executing RNN and LSTM operations according to an embodiment of the present invention;
Fig. 5 shows an example block diagram of the forward and backward processes of RNN and LSTM according to an embodiment of the present invention;
Fig. 6 shows the flow of operations executed with the device for RNN and LSTM operations of the present invention;
Fig. 7 shows the structure of a recurrent neural network;
Fig. 8 shows the structure of one block of the LSTM algorithm;
Fig. 9 shows the flowchart of the single-layer RNN and LSTM operation of the present invention;
Fig. 10 shows the gradient back-propagation flowchart of the single-layer RNN and LSTM operation of the present invention.
Detailed description of the embodiments
Fig. 1 shows a schematic diagram of the overall structure of the device for executing RNN and LSTM operations according to an embodiment of the present invention. As shown in Fig. 1, the device comprises an instruction storage unit 1, a controller unit 2, a data access unit 3, an interconnection module 4, a master computing module 5, and multiple slave computing modules 6. The instruction storage unit 1, controller unit 2, data access unit 3, interconnection module 4, master computing module 5, and slave computing modules 6 can all be implemented in hardware circuits, including but not limited to FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits, and memristors.
The instruction storage unit 1 reads in instructions through the data access unit 3 and caches them. It can be implemented with various memory devices (SRAM, DRAM, eDRAM, memristor, 3D-DRAM, non-volatile memory, etc.).
The controller unit 2 reads instructions from the instruction storage unit 1, decodes them into micro-instructions that control the behavior of the other units or modules, and sends them to those units or modules, such as the data access unit 3, the master computing module 5, and the slave computing modules 6.
The data access unit 3 accesses the external address space, directly reading and writing data to each storage unit inside the device to complete the loading and storing of data.
The interconnection module 4 distributes the input vector of the master computing module 5 to the multiple slave computing modules 6, and merges the computation results of the slave computing modules before returning them to the master computing module. Fig. 2 schematically illustrates one embodiment of the interconnection module's structure. The interconnection module 4 forms the data path between the master computing module 5 and the multiple slave computing modules 6, and in this embodiment it has the structure of an H tree. The H tree is a binary-tree path consisting of multiple nodes; each node sends the same upstream data to its two downstream nodes, merges the data returned by its two downstream nodes, and returns the result to its upstream node.
Take the typical RNN and LSTM computation out = Σ w · in_data as an example. The neuron data in the master computing module 5 are sent to each slave computing module 6 through the interconnection module 4. After the computation of the slave computing modules 6 completes, the neuron values output by each slave computing module are combined stage by stage in the H tree into a complete vector of neuron data, which serves as the intermediate result vector. Suppose the device has N slave computing modules; then the intermediate result vector is divided into N segments, each segment has N elements, and the i-th element in each segment is computed by the i-th slave computing module. The interconnection module combines the N elements into a vector of length N and returns it to the master computing module. Thus, if the network has only N output neurons, each slave computing unit needs to output only a single neuron value; if the network has m×N output neurons, each slave computing unit needs to output m neuron values.
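As a concreteness check on this partitioning, here is a minimal numpy sketch (an illustration of the data flow, not the hardware): each slave module holds one weight column and one scalar of in_data, and the H tree is modeled as pairwise, stage-by-stage summation. The function names and the power-of-two module count are illustrative.

```python
import numpy as np

def slave_partial(w_col, x_i):
    """One slave module: scale its stored weight column by its scalar input element."""
    return w_col * x_i

def h_tree_reduce(partials):
    """Pairwise, stage-by-stage summation, mimicking how the H tree merges partial
    sums on the way back to the master module (assumes a power-of-two count)."""
    while len(partials) > 1:
        partials = [partials[i] + partials[i + 1] for i in range(0, len(partials), 2)]
    return partials[0]

# out = sum_i in_data[i] * W[:, i], split across N slave modules
N = 4
W = np.random.randn(N, N)
in_data = np.random.randn(N)
partials = [slave_partial(W[:, i], in_data[i]) for i in range(N)]
out = h_tree_reduce(partials)
assert np.allclose(out, W @ in_data)
```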
In the present invention, the master computing module performs interpolated activation on the values returned by the slave computing modules during the forward pass, and during the backward pass interpolates the activation derivative and multiplies it by the gradient.
In the present invention, the slave computing modules perform multiply-accumulate operations on input data to obtain partial sums and hold them until all neuron data have been input, then return the results to the master computing module.
Fig. 3 shows an example block diagram of the structure of the master computing module 5 in the device for executing RNN and LSTM operations according to the present invention. As shown in Fig. 3, the master computing module 5 comprises an arithmetic unit 51, a data dependence judging unit 52, and a neuron storage unit 53.
The neuron storage unit 53 caches the input neuron data and output neuron data used by the master computing module 5 during computation. The arithmetic unit 51 performs the various computational functions of the master computing module. The data dependence judging unit 52 is the port through which the arithmetic unit 51 reads and writes the neuron storage unit 53, and it guarantees that the reads and writes of the data in the neuron storage unit 53 are free of consistency conflicts.
Specifically, the data dependence judging unit 52 determines whether a dependence exists between the data of a micro-instruction that has not yet executed and a micro-instruction that is currently executing. If not, the micro-instruction is allowed to issue immediately; otherwise, it must wait until all micro-instructions on which it depends have finished executing before it is allowed to issue. For example, all micro-instructions sent to the data dependence unit 52 are stored in an instruction queue inside the unit; in this queue, if the read range of a read instruction conflicts with the write range of a write instruction earlier in the queue, the read instruction can execute only after the write instruction on which it depends has been performed. Meanwhile, the data dependence judging unit 52 is also responsible for reading the input gradient vector from the neuron storage unit 53 and sending it to the slave computing modules 6 through the interconnection module 4; the output data of the slave computing modules 6 are sent directly to the arithmetic unit 51 through the interconnection module 4. The instructions output by the controller unit 2 are sent to the arithmetic unit 51 and the dependence judging unit 52 to control their behavior.
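The queue policy just described amounts to a read-after-write address-range check; a minimal sketch follows, with the range representation and function name being illustrative.

```python
def ranges_conflict(read_range, pending_write_ranges):
    """True if the read's address range overlaps any write queued ahead of it;
    the dependence unit holds such a read back until those writes retire."""
    lo, hi = read_range
    return any(w_lo <= hi and lo <= w_hi for (w_lo, w_hi) in pending_write_ranges)
```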
Fig. 4 shows an example block diagram of the structure of a slave computing module 6 in the device for executing RNN and LSTM operations according to the present invention. As shown in Fig. 4, each slave computing module 6 comprises an arithmetic unit 61, a data dependence judging unit 62, a neuron storage unit 63, a weight storage unit 64, and a weight gradient storage unit 65.
The arithmetic unit 61 receives the micro-instructions issued by the controller unit 2 and performs arithmetic and logic operations.
The data dependence judging unit 62 is responsible for the read and write operations on the storage units during computation and guarantees that those reads and writes are free of consistency conflicts. Specifically, the data dependence judging unit 62 determines whether a dependence exists between the data of a micro-instruction that has not yet executed and a micro-instruction that is currently executing. If not, the micro-instruction is allowed to issue immediately; otherwise, it must wait until all micro-instructions on which it depends have finished executing before it is allowed to issue. For example, all micro-instructions sent to the data dependence unit 62 are stored in an instruction queue inside the unit; in this queue, if the read range of a read instruction conflicts with the write range of a write instruction earlier in the queue, the read instruction can execute only after the write instruction on which it depends has been performed.
The neuron storage unit 63 caches the scalar data in the input vector that correspond to this slave computing module 6, as well as the partial sums of the output vector computed by this slave computing module 6.
The weight storage unit 64 caches the weight data this slave computing module 6 needs during computation. Each slave computing module stores only the columns of the weight matrix that correspond to the scalar data stored by this slave computing module 6.
The weight gradient storage unit 65 caches the weight gradient data the corresponding slave computing module needs when updating weights. The weight gradient data stored by each slave computing module 6 correspond to the weights it stores.
The slave computing modules 6 realize the first, parallelizable half of the computation of the output gradient vector of the RNN and LSTM, as well as the weight update.
Take out = Σ w · in_data as an example. The multiplication of the weight matrix w and the input gradient vector in_data can be divided into independent parallel subtasks: out and in_data are column vectors, and each slave computing module computes only the products of the corresponding partial scalar elements of in_data with the corresponding columns of the weight matrix w. Each resulting output vector is a partial sum to be accumulated into the final result, and these partial sums are added pairwise, stage by stage, in the H tree to obtain the final result. The computation thus becomes a parallel partial-sum stage followed by an accumulation stage. Each slave computing module 6 computes partial sums of the output vector, and all the partial sums complete the summation in the interconnection module 4 to obtain the final output vector. Each slave computing module 6 also multiplies the input vector by each layer's output values from the forward pass to compute the weight gradients, so as to update the weights stored by that slave computing module 6. Forward computation and backward training are the two main processes of a neural network algorithm. To train (update) the weights in the network, the network first computes the forward output of the input vector in the network formed by the current weights; this is the forward process. Then the weights of each layer are trained (updated) backwards, layer by layer, according to the difference between the output value and the label value of the input vector itself. The output vectors of each layer and the derivative values of the activation function are saved during the forward computation; these data are required by the backward training process, so they are guaranteed to exist when backward training starts. The output values of each layer in the forward pass are data already present when the backward computation starts; they are cached in the master computing module through the data access unit and sent to the slave computing modules through the H tree. The master computing module 5 performs subsequent computation based on the output gradient vector, for example multiplying the output gradient vector by the derivative of the activation function from the forward pass to obtain the input gradient value of the next layer. The derivative of the activation function from the forward pass is likewise already present when the backward computation starts and is cached in the master computing module through the data access unit.
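A minimal numpy sketch of the per-module weight update implied above: for out = Σ_i x[i]·W[:, i], the weight gradient is dW[:, i] = grad_out · x[i], so each slave module only touches its own stored column. The learning-rate handling is illustrative, not specified by the text.

```python
import numpy as np

def slave_update_weight_column(w_col, grad_out, x_i, lr=0.01):
    """One slave module's weight update for out = sum_i x[i] * W[:, i]:
    dW[:, i] = grad_out * x[i], applied only to this module's column."""
    return w_col - lr * grad_out * x_i

# illustrative use with hypothetical shapes
w_col = np.random.randn(8)
grad_out = np.random.randn(8)   # output gradient vector of this layer
x_i = 0.5                       # this module's scalar element of the forward input
w_col_new = slave_update_weight_column(w_col, grad_out, x_i)
```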
According to an embodiment of the present invention, an instruction set for executing artificial neural network forward operations on the aforementioned device is also provided. The instruction set comprises the CONFIG, COMPUTE, IO, NOP, JUMP, and MOVE instructions, wherein:
the CONFIG instruction configures, before the computation of each layer of the artificial neural network starts, the various constants the current layer needs;
the COMPUTE instruction performs the arithmetic and logic computation of each layer of the artificial neural network;
the IO instruction reads in from the external address space the input data needed for computation, and stores data back to the external space after the computation completes;
the NOP instruction flushes the micro-instructions currently held in all internal micro-instruction buffer queues, guaranteeing that all instructions before the NOP instruction have finished; the NOP instruction itself contains no operation;
the JUMP instruction makes the controller jump the address of the next instruction to be read from the instruction storage unit, and is used to implement jumps in the control flow;
the MOVE instruction moves data at one address of the device's internal address space to another address of the internal address space; this process is independent of the arithmetic unit and occupies no arithmetic-unit resources during execution.
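As an illustration of how these six instructions divide the work, here is a minimal Python sketch of a controller decode/dispatch loop. The opcode encoding, operand fields, and unit interface (drain_all, configure_layer, and so on) are all hypothetical; the patent does not specify an instruction format.

```python
from enum import Enum, auto

class Op(Enum):
    CONFIG = auto(); COMPUTE = auto(); IO = auto()
    NOP = auto(); JUMP = auto(); MOVE = auto()

def run(program, units):
    """Hypothetical controller loop: decode each instruction into micro-ops
    and dispatch them to the data access unit, master module, or slave modules."""
    pc = 0
    while pc < len(program):
        op, args = program[pc]
        if op is Op.JUMP:        # redirect instruction fetch
            pc = args["target"]
            continue
        elif op is Op.NOP:       # drain all pending micro-instruction queues
            units.drain_all()
        elif op is Op.CONFIG:
            units.configure_layer(args)
        elif op is Op.IO:
            units.data_access.transfer(args)
        elif op is Op.MOVE:
            units.data_access.move_internal(args)
        elif op is Op.COMPUTE:
            units.master.dispatch_layer(args)
        pc += 1
```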
Fig. 5 shows an example block diagram of the forward and backward processes of RNN and LSTM according to an embodiment of the present invention. In the different slave computing modules 6, the input neuron vector is dotted with the weight vector of each slave computing module 6 to obtain the corresponding output neuron values; all these output neuron values form an intermediate result vector. After a bias vector is added to the intermediate result vector and an activation operation is applied, the final output neuron vector of this neural-network layer is obtained, the formula being out = Σ w · in_data, where the weight vector of each slave computing module 6 is the column vector of the weight matrix corresponding to that slave computing module 6. The interconnection module sends the input neuron vector [in0, ..., inN] to all slave computing units, where it is temporarily held in the neuron storage unit. The i-th slave computing unit computes the dot product of its weight vector [w_i0, ..., w_iN] with the input neuron vector. The results output by the slave computing units are combined into the complete output vector by the interconnection module and returned to the master computing unit, where the activation operation is performed to obtain the final output neuron vector [out0, out1, out2, ..., outN].
Fig. 6 shows the flow of implementing RNN and LSTM operations with the device and instruction set of the present invention.
In step S1, an IO instruction is pre-stored at the first address of the instruction storage unit 1.
In step S2, the operation starts: the controller unit 2 reads this IO instruction from the first address of the instruction storage unit 1, and according to the decoded micro-instructions, the data access unit 3 reads all the corresponding artificial-neural-network operation instructions from the external address space and caches them in the instruction storage unit 1.
In step S3, the controller unit 2 then reads the next IO instruction from the instruction storage unit; according to the decoded micro-instructions, the data access unit 3 reads all the data needed by the master computing module 5 (e.g., including the input neuron vector, interpolation table, constant table, and bias) from the external address space into the neuron storage unit 53 of the master computing module 5.
In step S4, the controller unit 2 then reads the next IO instruction from the instruction storage unit; according to the decoded micro-instructions, the data access unit 3 reads the weight-matrix data needed by the slave computing modules 6 from the external address space.
In step S5, the controller unit 2 then reads the next CONFIG instruction from the instruction storage unit; according to the decoded micro-instructions, the device configures the various constants this layer of the neural-network computation needs. For example, the arithmetic units 51 and 61 configure the values of their internal registers according to the parameters in the micro-instructions, the parameters including, for example, the precision setting of this layer's computation and the activation-function data (such as the precision bits of this layer's computation).
In step S6, the controller unit 2 then reads the next COMPUTE instruction from the instruction storage unit; according to the decoded micro-instructions, the master computing module 5 first sends the input neuron vector to each slave computing module 6 through the interconnection module 4, where it is saved to the neuron storage unit 63 of the slave computing module 6.
In step S7, according to the micro-instructions decoded from the COMPUTE instruction, the arithmetic unit 61 of each slave computing module 6 reads the weight vector (the column vector of the weight matrix corresponding to that slave computing module 6) from the weight storage unit 64, reads the input neuron vector from the neuron storage unit, completes the dot product of the weight vector and the input neuron vector, and returns the intermediate result through the interconnection module.
In step S8, in the interconnection module 4, the intermediate results returned by the slave computing modules 6 are combined stage by stage into a complete intermediate result vector.
In step S9, the master computing module 5 obtains the value returned by the interconnection module 4; according to the micro-instructions decoded from the COMPUTE instruction, it reads the bias vector from the neuron storage unit 53, adds it to the vector returned by the interconnection module 4, activates the sum, and writes the final output neuron vector back to the neuron storage unit 53.
In step S10, the controller unit then reads the next IO instruction from the instruction storage unit; according to the decoded micro-instructions, the data access unit 3 stores the output neuron vector in the neuron storage unit 53 to the specified address in the external address space, and the operation ends.
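Reusing the hypothetical Op encoding from the controller sketch above, the sequence S1-S10 corresponds to a short instruction program along these lines (the operand contents are illustrative only):

```python
program = [
    (Op.IO, {"what": "fetch all neural-network instructions"}),   # S1-S2
    (Op.IO, {"what": "load master data: inputs, tables, bias"}),  # S3
    (Op.IO, {"what": "load slave weight matrices"}),              # S4
    (Op.CONFIG, {"precision_bits": 16}),                          # S5
    (Op.COMPUTE, {"layer": 0}),                                   # S6-S9
    (Op.IO, {"what": "store output neuron vector"}),              # S10
]
```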
Fig. 7 shows the structure of a recurrent neural network. To capture the dependence of a traditional neural network on earlier inputs over time, the input of a recurrent neural network in the forward pass consists of the input at the current time step together with the hidden-layer output of the previous time step. In the formulas, I is the number of inputs, H the number of hidden units, and K the number of outputs; α_h is the pre-activation value of the h-th hidden unit at time t, b_h is the activated output of the h-th hidden unit at time t, δ_h denotes the partial derivative of the residual with respect to α_h, and θ denotes the activation function.
The forward propagation is expressed as:
α_h^t = Σ_{i=1..I} w_ih · x_i^t + Σ_{h'=1..H} w_h'h · b_h'^{t-1}
b_h^t = θ(α_h^t)
The backpropagation is expressed as:
δ_h^t = θ'(α_h^t) · ( Σ_{k=1..K} δ_k^t · w_hk + Σ_{h'=1..H} δ_h'^{t+1} · w_hh' )
Through the connection from the previous time step's output of this layer back into the hidden layer, the network gains the ability to generalize over time sequences. However, this kind of recurrent neural network suffers from decay over time.
Fig. 8 shows the structure of one block of the LSTM algorithm. Compared with a conventional recurrent neural network, LSTM introduces a cell that records the information of the current time point. As can be seen, a block in the LSTM algorithm consists of three gates and one cell: an input gate, an output gate, and a forget gate. The main idea of the LSTM algorithm is to use the cell to record the state of the current time step and to pass on the cell value of the previous time step, thereby transmitting information directly across time steps. The input gate and the forget gate control the weights of the current time step's input and the previous time step's cell value in the cell's output, while the output gate controls the output of the cell. Under the control of the input gate and the forget gate, suitable information is preserved for a long time and stays recorded inside the cell, which solves the problem that recurrent neural networks decay over time.
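The description above is qualitative; written out in the same notation as the formulas above, one common (peephole) formulation of an LSTM block, consistent with steps A1-A5 below, is the following. The weight symbols and activation choices are illustrative assumptions, since the text does not name them:

i_t = σ(W_xi · x_t + W_hi · h_{t-1} + w_ci ⊙ c_{t-1} + b_i)    (input gate)
f_t = σ(W_xf · x_t + W_hf · h_{t-1} + w_cf ⊙ c_{t-1} + b_f)    (forget gate)
g_t = tanh(W_xc · x_t + W_hc · h_{t-1} + b_c)                  (cell intermediate value)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t                                (cell state)
o_t = σ(W_xo · x_t + W_ho · h_{t-1} + w_co ⊙ c_t + b_o)        (output gate)
h_t = o_t ⊙ c_t                                                (layer output, per step A5)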
Fig. 9 shows the flowchart of the single-layer RNN and LSTM operation of the present invention.
In step A1, the sum of products of the current time step's input and the corresponding input-gate weights is computed and buffered in the neuron buffer; then the sum of products of the previous time step's cell state and the corresponding weights, and the sum of products of the previous time step's hidden layer and the corresponding weights, are computed and likewise stored in the buffer. Finally, the three sums are added together and activated to obtain the input gate value.
In step A2, the sum of products of the current time step's input and the corresponding forget-gate weights is computed and buffered in the neuron buffer; then the sum of products of the previous time step's cell state and the corresponding weights, and the sum of products of the previous time step's hidden layer and the corresponding weights, are computed and likewise stored in the buffer. Finally, the three sums are added together and activated to obtain the forget gate value.
In step A3, the sum of products of the current time step's input and the corresponding cell-input weights is computed and buffered in the neuron buffer; then the sum of products of the previous time step's hidden layer and the corresponding weights is computed and stored in the buffer. The two sums are added together and activated to obtain the cell intermediate value, which is cached in the neuron buffer. The intermediate value is then multiplied elementwise by the input gate and buffered in the buffer of the arithmetic unit (51 in Fig. 3); next, the previous time step's cell state is multiplied elementwise by the forget gate, and the arithmetic unit adds this to the previously cached product to obtain the cell state value.
In step A4, the sum of products of the current time step's input and the corresponding output-gate weights is computed and buffered in the neuron buffer; then the sum of products of the current time step's cell state and the corresponding weights, and the sum of products of the previous time step's hidden layer and the corresponding weights, are computed and likewise stored in the buffer. Finally, the three sums are added together and activated to obtain the output gate value.
In step A5, the cell state is multiplied elementwise by the output gate to obtain this layer's output.
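For concreteness, here is a minimal numpy sketch of steps A1-A5 for a single time step. It illustrates the data flow only, not the device; the weight names, the logistic/tanh activation choices, and the dict-based packing are assumptions not fixed by the text (which also leaves the output as o·c rather than the more common o·tanh(c)).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward_step(x, h_prev, c_prev, W, b):
    """Steps A1-A5 for one time step; W/b are hypothetical weight/bias dicts."""
    # A1: input gate from current input, previous cell state, previous hidden state
    i = sigmoid(W["xi"] @ x + W["ci"] * c_prev + W["hi"] @ h_prev + b["i"])
    # A2: forget gate, the same three contributions
    f = sigmoid(W["xf"] @ x + W["cf"] * c_prev + W["hf"] @ h_prev + b["f"])
    # A3: cell intermediate value from current input and previous hidden state,
    #     then cell state = input gate * candidate + forget gate * old cell state
    g = np.tanh(W["xc"] @ x + W["hc"] @ h_prev + b["c"])
    c = i * g + f * c_prev
    # A4: output gate, using the *current* cell state
    o = sigmoid(W["xo"] @ x + W["co"] * c + W["ho"] @ h_prev + b["o"])
    # A5: layer output = cell state gated by the output gate
    h = o * c
    return h, c, (i, f, g, o)

# illustrative shapes: H hidden units, I inputs
H, I = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(H, I)) for k in ("xi", "xf", "xc", "xo")}
W.update({k: rng.normal(size=(H, H)) for k in ("hi", "hf", "hc", "ho")})
W.update({k: rng.normal(size=H) for k in ("ci", "cf", "co")})  # peephole vectors
b = {k: np.zeros(H) for k in ("i", "f", "c", "o")}
h, c, gates = lstm_forward_step(rng.normal(size=I), np.zeros(H), np.zeros(H), W, b)
```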
Fig. 10 shows the gradient back-propagation flowchart of the single-layer RNN and LSTM operation of the present invention.
In step B1, this layer's output gradient is computed: the next time step's hidden-layer gradient is multiplied by the weights at the corresponding positions, and to this is added the sum of this layer's residual multiplied by the corresponding weights, giving this layer's output gradient.
In step B2, the output gradient is multiplied and accumulated elementwise with the cell activation value and buffered in the neuron buffer, and the result is finally multiplied by the activation-function derivative to obtain the output gate gradient.
In step B3, the cell state gradient is formed as follows: the current output gradient is multiplied by the current output gate value and the derivative of the state activation and stored in the neuron buffer; then the gradient coming from the next time step's cell is computed; the gradients of the next time step's input gate and forget gate and this time step's output gate gradient are each multiplied by the corresponding weights and stored in the neuron buffer; finally, all of these are added to obtain the cell state gradient. The gradient of the cell intermediate value is obtained by multiplying the current time step's input gate activation value, the derivative of the cell activation function, and the cell state gradient.
In step B4, the current time step's cell state gradient is multiplied elementwise by the previous time step's cell state output and accumulated, and the result is finally multiplied by the forget-gate derivative to obtain the forget gate gradient.
In step B5, the current time step's cell state gradient is multiplied elementwise by the activation value of this time step's cell intermediate value and accumulated, and the result is finally multiplied by the input-gate derivative to obtain the input gate gradient.
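A matching numpy sketch of steps B2-B5 for one time step follows, using the same hypothetical names as the forward sketch above; the next-time-step quantities required by step B3 are passed in (zero at the last time step), and step B1 is assumed to have already produced eps_h. This is one interpretation of the text, not the device's exact dataflow.

```python
import numpy as np

def dsigmoid(y):
    """Derivative of the logistic function w.r.t. its pre-activation, given its output y."""
    return y * (1.0 - y)

def lstm_backward_step(eps_h, c_prev, i, f, g, o, c, W,
                       eps_c_next=None, di_next=None, df_next=None, f_next=None):
    """Steps B2-B5 for one time step; i, f, g, o, c come from the forward sketch."""
    zeros = np.zeros_like(c)
    eps_c_next = zeros if eps_c_next is None else eps_c_next
    di_next = zeros if di_next is None else di_next
    df_next = zeros if df_next is None else df_next
    f_next = zeros if f_next is None else f_next
    # B2: output gate gradient: output gradient times cell value, times sigma'(a_o)
    do = eps_h * c * dsigmoid(o)
    # B3: cell state gradient: current output contribution (h = o * c, so the
    # state-activation derivative is 1 here), plus the next step's cell gradient
    # through its forget gate, plus the peephole contributions of the next step's
    # input/forget gate gradients and this step's output gate gradient
    eps_c = (eps_h * o
             + eps_c_next * f_next
             + di_next * W["ci"] + df_next * W["cf"] + do * W["co"])
    # B3 (cont.): gradient of the cell intermediate value
    dg = i * (1.0 - g ** 2) * eps_c      # tanh'(a_c) written in terms of g
    # B4: forget gate gradient
    df = eps_c * c_prev * dsigmoid(f)
    # B5: input gate gradient
    di = eps_c * g * dsigmoid(i)
    return eps_c, di, df, dg, do
```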
It should be noted that the conventional recurrent neural network algorithm, as applied on this device, is a greatly simplified LSTM algorithm: when computing the output, it depends only on the current time step's input and the previous time step's output, and the forward and backward expressions of its operation are similar to the sub-processes of LSTM, so they are not repeated here.
For a complete RNN and LSTM algorithm, the implementation process is similar to the typical computation above: the corresponding weights and data are fetched according to the formula for weighted summation. As the computation continues through time, the operation instruction of the next layer takes the output neuron address of the previous layer stored in the master computing unit as the input neuron address of this layer. Similarly, the weight address and the bias address in the instruction are also changed to the addresses corresponding to this layer.
By using the device and instruction set for executing RNN and LSTM operations, the problems of insufficient CPU and GPU computing performance and high front-end decoding overhead are solved, and support for the forward operation of multi-layer artificial neural networks is effectively improved.
By using dedicated on-chip caches for RNN and LSTM, the reusability of the input neuron data and the weight data is fully exploited, repeated reads of these data from memory are avoided, the memory access bandwidth is reduced, and memory bandwidth is prevented from becoming a performance bottleneck for the forward operation of multi-layer artificial neural networks.

Claims (20)

1. A device for executing LSTM operations, comprising a controller unit, a data access unit, an interconnection module, a master computing module, and a slave computing module, wherein one block of the LSTM operation comprises an input gate, an output gate, a forget gate, and a cell; the input gate and the forget gate are configured to control the weights of the current time step's input and the previous time step's cell in the output of the cell; and the output gate is configured to control the output of the cell;
the data access unit is configured to access the external address space, completing the loading and storing of data and reading in instructions;
the controller unit is configured to read instructions, decode the instructions into micro-instructions that control the behavior of the other units or modules, and then distribute the respective micro-instructions to each unit or module;
the interconnection module is configured to send the input vector of the master computing module to the slave computing module, and to return the operation result of the slave computing module to the master computing module;
the slave computing module is configured to perform multiply-accumulate operations on input data to obtain partial sums and hold them until all neuron data have been input, and then return all the partial sums to the master computing module through the interconnection module;
the master computing module is configured to perform interpolated activation on the computation result in the forward operation.
2. The device according to claim 1, wherein the device further comprises an instruction storage unit configured to cache the instructions that are read in.
3. The device according to claim 1 or 2, wherein the device is configured to execute a single-layer LSTM computation, the single-layer LSTM computation comprising:
the device computes the sum of products of the current time step's input and the corresponding input-gate weights and buffers it in the neuron cache unit, then computes the sum of products of the previous time step's cell state and the corresponding weights and the sum of products of the previous time step's hidden layer and the corresponding weights and stores both in the neuron cache unit, and finally adds the three sums together and activates the result to obtain the input gate value;
the device computes the sum of products of the current time step's input and the corresponding forget-gate weights and buffers it in the neuron cache unit, then computes the sum of products of the previous time step's cell state and the corresponding weights and the sum of products of the previous time step's hidden layer and the corresponding weights and stores both in the neuron cache unit, and finally adds the three sums together and activates the result to obtain the forget gate value;
the device computes the sum of products of the current time step's input and the corresponding cell-input weights and buffers it in the neuron cache unit, then computes the sum of products of the previous time step's hidden layer and the corresponding weights and stores it in the neuron cache unit, finally adds the two sums together and activates the result to obtain the cell intermediate value, which is cached in the neuron cache unit; the intermediate value is then multiplied elementwise by the input gate and buffered in the cache unit of the arithmetic unit, after which the previous time step's cell state is multiplied elementwise by the forget gate and added in the arithmetic unit to the previously cached product to obtain the cell state value;
the device computes the sum of products of the current time step's input and the corresponding output-gate weights and buffers it in the neuron cache unit, then computes the sum of products of the current time step's cell state and the corresponding weights and the sum of products of the previous time step's hidden layer and the corresponding weights and stores both in the cache unit, and finally adds the three sums together and activates the result to obtain the output gate value; the cell state is multiplied elementwise by the output gate to obtain the output of the LSTM single layer.
4. The device according to claim 1 or 2, wherein the device is further configured to execute the backward gradient operation of the single-layer LSTM operation, specifically comprising:
the device computes this layer's output gradient as the next time step's hidden-layer gradient multiplied by the weights at the corresponding positions, plus the sum of this layer's residual multiplied by the corresponding weights; the output gradient is multiplied and accumulated elementwise with the cell activation value, buffered in the neuron cache unit, and finally multiplied by the activation-function derivative to obtain the output gate gradient;
the device is configured to obtain the cell state gradient by multiplying the current output gradient by the current output gate value and the derivative of the state activation and storing the result in the neuron cache unit, then computing the gradient coming from the next time step's cell; the gradients of the next time step's input gate and forget gate and the output gate gradient at this time step are each multiplied by the corresponding weights and stored in the neuron cache unit, and finally all of these are added to obtain the cell state gradient; the gradient of the cell intermediate value is obtained by multiplying the current time step's input gate activation value, the derivative of the cell activation function, and the cell state gradient;
the device is configured to multiply the current time step's cell state gradient elementwise by the previous time step's cell state output, accumulate, and finally multiply by the forget-gate derivative to obtain the forget gate gradient; and to multiply the current time step's cell state gradient elementwise by the activation value of this time step's cell intermediate value, accumulate, and finally multiply by the input-gate derivative to obtain the input gate gradient.
5. The device according to any one of claims 1-4, wherein the slave computing module comprises a plurality of slave computing modules, and
the interconnection module is configured to distribute the input vector of the master computing module to the slave computing modules, and to splice the operation results of the slave computing modules stage by stage into an operation result and return it to the master computing module.
6. The device according to claim 5, wherein the interconnection module comprises a binary-tree path formed of multiple nodes;
each node sends the same upstream data to its two downstream nodes, merges the data returned by its two downstream nodes, and returns the result to its upstream node.
7. The device according to claim 6, wherein
the master computing module is further configured, in the backward operation, to perform interpolated activation on the computation result, take its derivative, and multiply the resulting derivative by the gradient.
8. The device according to claim 7, wherein the master computing module comprises an arithmetic unit, a data dependence judging unit, and a neuron cache unit, wherein:
the arithmetic unit is configured to receive the micro-instructions issued by the controller unit and perform arithmetic and logic operations;
the data dependence judging unit is configured to perform reads and writes on the neuron cache unit, guaranteeing that the data used between instructions have no read-write consistency conflicts;
the neuron cache unit is configured to cache input neuron data and output neuron data.
9. The device for executing LSTM operations according to claim 6, wherein the slave computing module comprises an arithmetic unit, a data dependence judging unit, a neuron cache unit, a weight cache unit, and a weight gradient cache unit, wherein:
the data dependence judging unit is configured to perform reads and writes on the neuron cache unit, guaranteeing that the data used between instructions have no read-write consistency conflicts;
the neuron cache unit is configured to cache input neuron data and output neuron data;
the weight cache unit is configured to cache the weight data the slave computing module needs during computation;
the weight gradient cache unit is configured to cache the weight gradient data the corresponding slave computing module needs when updating weights;
the arithmetic unit is configured to receive the micro-instructions issued by the controller unit and perform arithmetic and logic operations on the input neuron data and the weight data.
10. The device according to claim 8 or 9, wherein
the data dependence judging unit is specifically configured to judge whether a dependence exists between the first data of a control signal that has not yet executed and the second data of a control signal that is in the process of executing; if no dependence exists, the control signal that has not yet executed is allowed to issue immediately; if a dependence exists, the control signal that has not yet executed is allowed to execute only after all the control signals on which it depends have finished executing.
11. A method for executing LSTM operations, wherein the method is applied to an LSTM operation device comprising a controller unit, an interconnection module, a data access unit, a master computing module, and a slave computing module; one block of the LSTM operation comprises an input gate, an output gate, a forget gate, and a cell; the input gate and the forget gate are used to control the weights of the current time step's input and the previous time step's cell in the output of the cell; and the output gate is used to control the output of the cell; the method comprising the following steps:
the data access unit accesses the external address space, completes the loading and storing of data, and reads in instructions; the controller unit reads instructions, decodes the instructions into micro-instructions that control the behavior of the other units or modules, and then distributes the respective micro-instructions to each unit or module;
the interconnection module sends the input vector of the master computing module to the slave computing module, and returns the operation result of the slave computing module to the master computing module; the slave computing module performs multiply-accumulate operations on input data to obtain partial sums and holds them until all neuron data have been input, then returns all the partial sums to the master computing module through the interconnection module; and the master computing module performs interpolated activation on the computation result in the forward operation.
12. The method according to claim 11, wherein the device further comprises an instruction storage unit configured to cache the instructions that are read in.
13. The method according to claim 11 or 12, wherein the method further comprises executing a single-layer LSTM computation, the single-layer LSTM computation comprising:
the device computes the sum of products of the current time step's input and the corresponding input-gate weights and buffers it in the neuron cache unit, then computes the sum of products of the previous time step's cell state and the corresponding weights and the sum of products of the previous time step's hidden layer and the corresponding weights and stores both in the neuron cache unit, and finally adds the three sums together and activates the result to obtain the input gate value;
the device computes the sum of products of the current time step's input and the corresponding forget-gate weights and buffers it in the neuron cache unit, then computes the sum of products of the previous time step's cell state and the corresponding weights and the sum of products of the previous time step's hidden layer and the corresponding weights and stores both in the neuron cache unit, and finally adds the three sums together and activates the result to obtain the forget gate value;
the device computes the sum of products of the current time step's input and the corresponding cell-input weights and buffers it in the neuron cache unit, then computes the sum of products of the previous time step's hidden layer and the corresponding weights and stores it in the neuron cache unit, finally adds the two sums together and activates the result to obtain the cell intermediate value, which is cached in the neuron cache unit; the intermediate value is then multiplied elementwise by the input gate and buffered in the cache unit of the arithmetic unit, after which the previous time step's cell state is multiplied elementwise by the forget gate and added in the arithmetic unit to the previously cached product to obtain the cell state value;
the device computes the sum of products of the current time step's input and the corresponding output-gate weights and buffers it in the neuron cache unit, then computes the sum of products of the current time step's cell state and the corresponding weights and the sum of products of the previous time step's hidden layer and the corresponding weights and stores both in the cache unit, and finally adds the three sums together and activates the result to obtain the output gate value; the cell state is multiplied elementwise by the output gate to obtain the output of the LSTM single layer.
14. The method according to claim 11 or 12, wherein the method further comprises executing the backward gradient operation of the single-layer LSTM operation, specifically comprising:
the device computes this layer's output gradient as the next time step's hidden-layer gradient multiplied by the weights at the corresponding positions, plus the sum of this layer's residual multiplied by the corresponding weights; the output gradient is multiplied and accumulated elementwise with the cell activation value, buffered in the neuron cache unit, and finally multiplied by the activation-function derivative to obtain the output gate gradient;
the device obtains the cell state gradient by multiplying the current output gradient by the current output gate value and the derivative of the state activation and storing the result in the neuron cache unit, then computing the gradient coming from the next time step's cell; the gradients of the next time step's input gate and forget gate and the output gate gradient at this time step are each multiplied by the corresponding weights and stored in the neuron cache unit, and finally all of these are added to obtain the cell state gradient; the gradient of the cell intermediate value is obtained by multiplying the current time step's input gate activation value, the derivative of the cell activation function, and the cell state gradient;
the device multiplies the current time step's cell state gradient elementwise by the previous time step's cell state output, accumulates, and finally multiplies by the forget-gate derivative to obtain the forget gate gradient; and multiplies the current time step's cell state gradient elementwise by the activation value of this time step's cell intermediate value, accumulates, and finally multiplies by the input-gate derivative to obtain the input gate gradient.
15. The method according to any one of claims 11-14, wherein the slave computing module comprises a plurality of slave computing modules, and wherein the interconnection module sending the input vector of the master computing module to the slave computing module and returning the operation result of the slave computing module to the master computing module specifically comprises:
the interconnection module distributes the input vector of the master computing module to the slave computing modules, and splices the operation results of the slave computing modules stage by stage into an operation result and returns it to the master computing module.
16. The method according to claim 15, wherein the interconnection module comprises a binary-tree path formed of multiple nodes; and wherein the interconnection module distributing the input vector of the master computing module to the slave computing modules and splicing the operation results of the slave computing modules stage by stage into an operation result and returning it to the master computing module specifically comprises:
each node sends the same upstream data to its two downstream nodes, merges the data returned by its two downstream nodes, and returns the result to its upstream node.
17. The method according to claim 16, wherein the method further comprises:
in the backward operation, the master computing module performs interpolated activation on the computation result, takes its derivative, and multiplies the resulting derivative by the gradient.
18. The method according to claim 16, wherein the master computing module comprises an arithmetic unit, a data dependence judging unit, and a neuron cache unit, and the method specifically comprises:
the arithmetic unit receives the micro-instructions issued by the controller unit and performs arithmetic and logic operations;
the data dependence judging unit performs reads and writes on the neuron cache unit, guaranteeing that the data used between instructions have no read-write consistency conflicts; and the neuron cache unit caches input neuron data and output neuron data.
19. The method according to claim 16, wherein the slave computing module comprises an arithmetic unit, a data dependence judging unit, a neuron cache unit, a weight cache unit, and a weight gradient cache unit, and the method specifically comprises:
the data dependence judging unit performs reads and writes on the neuron cache unit, guaranteeing that the data used between instructions have no read-write consistency conflicts;
the neuron cache unit caches input neuron data and output neuron data;
the weight cache unit caches the weight data the slave computing module needs during computation;
the weight gradient cache unit caches the weight gradient data the corresponding slave computing module needs when updating weights;
the arithmetic unit receives the micro-instructions issued by the controller unit and performs arithmetic and logic operations on the input neuron data and the weight data.
20. The method according to claim 18 or 19, wherein the method further comprises:
the data dependence judging unit judges whether a dependence exists between the first data of a control signal that has not yet executed and the second data of a control signal that is in the process of executing; if no dependence exists, the control signal that has not yet executed is allowed to issue immediately; if a dependence exists, the control signal that has not yet executed is allowed to execute only after all the control signals on which it depends have finished executing.
CN201811279404.3A 2016-04-29 2016-04-29 Apparatus and method for performing LSTM operations Active CN109284825B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811279404.3A CN109284825B (en) 2016-04-29 2016-04-29 Apparatus and method for performing LSTM operations

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610285178.4A CN107341542B (en) 2016-04-29 2016-04-29 Apparatus and method for performing recurrent neural networks and LSTM operations
CN201811279404.3A CN109284825B (en) 2016-04-29 2016-04-29 Apparatus and method for performing LSTM operations

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201610285178.4A Division CN107341542B (en) 2016-04-29 2016-04-29 Apparatus and method for performing recurrent neural networks and LSTM operations

Publications (2)

Publication Number Publication Date
CN109284825A true CN109284825A (en) 2019-01-29
CN109284825B CN109284825B (en) 2020-04-14

Family

ID=60222675

Family Applications (3)

Application Number Title Priority Date Filing Date
CN201610285178.4A Active CN107341542B (en) 2016-04-29 2016-04-29 Apparatus and method for performing recurrent neural networks and LSTM operations
CN201911175801.0A Active CN110929863B (en) 2016-04-29 2016-04-29 Apparatus and method for performing LSTM operations
CN201811279404.3A Active CN109284825B (en) 2016-04-29 2016-04-29 Apparatus and method for performing LSTM operations

Family Applications Before (2)

Application Number Title Priority Date Filing Date
CN201610285178.4A Active CN107341542B (en) 2016-04-29 2016-04-29 Apparatus and method for performing recurrent neural networks and LSTM operations
CN201911175801.0A Active CN110929863B (en) 2016-04-29 2016-04-29 Apparatus and method for performing LSTM operations

Country Status (1)

Country Link
CN (3) CN107341542B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160542B (en) * 2017-12-14 2023-08-29 中科寒武纪科技股份有限公司 Integrated circuit chip device and related products
CN110018970B (en) * 2018-01-08 2023-07-21 腾讯科技(深圳)有限公司 Cache prefetching method, device, equipment and computer readable storage medium
CN108280885B (en) * 2018-01-09 2021-12-03 上海大学 Method for constructing holographic even image
CN108510065A (en) * 2018-03-30 2018-09-07 中国科学院计算技术研究所 Computing device and computing method applied to long short-term memory neural networks
CN108805273A (en) * 2018-05-20 2018-11-13 复旦大学 Hardware circuit implementation for accelerating gate unit operations in an LSTM
CN109088406A (en) * 2018-06-26 2018-12-25 河海大学常州校区 Microgrid equivalent modeling method based on LSTM neural network
CN110059809B (en) * 2018-10-10 2020-01-17 中科寒武纪科技股份有限公司 Computing device and related product
CN111045726B (en) * 2018-10-12 2022-04-15 上海寒武纪信息科技有限公司 Deep learning processing device and method supporting coding and decoding
CN109522052B (en) * 2018-11-27 2020-05-08 中科寒武纪科技股份有限公司 Computing device and board card
CN109543832B (en) * 2018-11-27 2020-03-20 中科寒武纪科技股份有限公司 Computing device and board card
WO2020125092A1 (en) * 2018-12-20 2020-06-25 中科寒武纪科技股份有限公司 Computing device and board card
CN109670581B (en) * 2018-12-21 2023-05-23 中科寒武纪科技股份有限公司 Computing device and board card
CN109711540B (en) * 2018-12-20 2021-09-21 中科寒武纪科技股份有限公司 Computing device and board card
CN109726797B (en) * 2018-12-21 2019-11-19 北京中科寒武纪科技有限公司 Data processing method, device, computer system and storage medium
CN109620154A (en) * 2018-12-21 2019-04-16 平安科技(深圳)有限公司 Deep-learning-based bowel sound recognition method and related apparatus
CN109697500B (en) * 2018-12-29 2020-06-09 中科寒武纪科技股份有限公司 Data processing method and device, electronic equipment and storage medium
US11042797B2 (en) 2019-01-08 2021-06-22 SimpleMachines Inc. Accelerating parallel processing of data in a recurrent neural network
CN113537476A (en) * 2020-04-16 2021-10-22 中科寒武纪科技股份有限公司 Arithmetic device and related product
US20230306236A1 (en) * 2020-08-03 2023-09-28 Espressif Systems (Shanghai) Co., Ltd. Device and method for executing lstm neural network operation
CN112784970B (en) * 2020-12-31 2023-08-15 深圳大普微电子科技有限公司 Hardware accelerator, data processing method, system-level chip and medium
CN116226702B (en) * 2022-09-09 2024-04-26 武汉中数医疗科技有限公司 Thyroid sampling data identification method based on bioelectrical impedance

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160034812A1 (en) * 2014-07-31 2016-02-04 Qualcomm Incorporated Long short-term memory using a spiking neural network
US10783900B2 (en) * 2014-10-03 2020-09-22 Google Llc Convolutional, long short-term memory, fully connected deep neural networks
CN104615983B (en) * 2015-01-28 2018-07-31 中国科学院自动化研究所 Activity recognition method based on recurrent neural network and human skeleton motion sequence
CN105389772B (en) * 2015-12-02 2018-09-07 百度在线网络技术(北京)有限公司 Data processing method and device based on graphics processor

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102200964A (en) * 2011-06-17 2011-09-28 孙瑞琛 Parallel-processing-based fast Fourier transform (FFT) device and method thereof
CN104145281A (en) * 2012-02-03 2014-11-12 安秉益 Neural network computing apparatus and system, and method therefor
CN104699631A (en) * 2015-03-26 2015-06-10 中国人民解放军国防科学技术大学 Storage device and fetching method for multilayered cooperation and sharing in GPDSP (General-Purpose Digital Signal Processor)
CN105488565A (en) * 2015-11-17 2016-04-13 中国科学院计算技术研究所 Calculation apparatus and method for accelerator chip accelerating deep neural network algorithm
CN105512723A (en) * 2016-01-20 2016-04-20 南京艾溪信息科技有限公司 Artificial neural network calculating device and method for sparse connection

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ADAM PASZKE: "LSTM Implementation Explained (LSTM实现详解)", 《CSDN.NET/ARTICLE/2015-09-14/2825693》 *
YUNJI CHEN et al.: "DaDianNao: A Machine-Learning Supercomputer", 《2014 47TH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110942140A (en) * 2019-11-29 2020-03-31 任科扬 Artificial neural network difference and iteration data processing method and device
CN110942140B (en) * 2019-11-29 2022-11-08 任科扬 Artificial neural network difference and iteration data processing method and device

Also Published As

Publication number Publication date
CN110929863A (en) 2020-03-27
CN109284825B (en) 2020-04-14
CN107341542B (en) 2021-06-11
CN110929863B (en) 2023-11-28
CN107341542A (en) 2017-11-10

Similar Documents

Publication Publication Date Title
CN109284825A (en) Device and method for executing LSTM operation
EP3451157B1 (en) Device and method for performing forward operation of convolutional neural network
CN107832843B (en) Information processing method and related product
CN111860812B (en) Apparatus and method for performing convolutional neural network training
CN110188870B (en) Apparatus and method for performing artificial neural network self-learning operation
CN107341541B (en) Apparatus and method for performing fully connected layer neural network training
CN113537480B (en) Apparatus and method for performing LSTM neural network operation
WO2017185387A1 (en) Method and device for executing forward operation of fully-connected layer neural network
CN109358900A (en) Artificial neural network forward operation apparatus and method supporting discrete data representation
CN109242094A (en) Device and method for executing artificial neural network forward operation
KR950008840B1 (en) Neuro chip
EP3451239A1 (en) Apparatus and method for executing recurrent neural network and lstm computations
CN107886166B (en) Device and method for executing artificial neural network operation
CN117195989B (en) Vector processor, neural network accelerator, chip and electronic equipment
WO2017185248A1 (en) Apparatus and method for performing auto-learning operation of artificial neural network
CN109711540B (en) Computing device and board card
WO2018058452A1 (en) Apparatus and method for performing artificial neural network operation
CN109359542A (en) Neural-network-based method and terminal device for determining vehicle damage level

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 644, No. 6, South Road, Academy of Sciences, Beijing 100000

Applicant after: Zhongke Cambrian Technology Co., Ltd

Address before: Room 644, No. 6, South Road, Academy of Sciences, Beijing 100000

Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.

GR01 Patent grant