CN111260025A - Apparatus and method for performing LSTM neural network operations


Info

Publication number: CN111260025A (application number CN202010018716.XA)
Authority: CN (China)
Prior art keywords: data, unit, data processing, vector, processing module
Inventor: Inventor not disclosed
Original and current assignee: Shanghai Cambricon Information Technology Co Ltd
Application filed by Shanghai Cambricon Information Technology Co Ltd; priority to CN202010018716.XA
Other languages: Chinese (zh)
Other versions: CN111260025B (en)
Legal status: Granted; active

Classifications

    • G06N 3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06F 13/28 — Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access (DMA), cycle steal
    • G06F 9/30036 — Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F 9/3802 — Instruction prefetching
    • G06F 9/3885 — Concurrent instruction execution using a plurality of independent parallel functional units
    • G06N 3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N 3/048 — Activation functions


Abstract

An apparatus and method for performing LSTM neural network operations. The apparatus comprises a direct memory access unit, an instruction cache unit, a controller unit, a plurality of data cache units arranged in parallel, and a plurality of data processing modules arranged in parallel. The data processing modules correspond one-to-one to the data cache units and are used to obtain the input data, weights, and biases required for operation from the corresponding data cache units and to perform the LSTM neural network operation; the data processing modules operate in parallel with one another. The invention uses dedicated instructions, so the number of instructions required for the operation is greatly reduced and the decoding overhead is lowered; the weights and biases are cached, reducing the overhead of data transfer; the invention is not limited to a specific application field and can be used in fields such as speech recognition, text translation, and music synthesis, giving it strong extensibility; and the plurality of data processing modules run in parallel, so the operation speed of the LSTM network is significantly improved.

Description

Apparatus and method for performing LSTM neural network operations
Technical Field
The present invention relates to the technical field of neural network operations, and more particularly, to an apparatus and an operation method for performing LSTM neural network operations.
Background
A long short-term memory (LSTM) network is a recurrent neural network (RNN). Owing to its unique structural design, an LSTM is well suited to processing and predicting important events separated by very long intervals and delays in a time series. LSTM networks outperform traditional recurrent neural networks and are well suited to learning from experience in order to classify, process, and predict time series in which significant events are separated by gaps of unknown duration. At present, LSTM networks are widely used in many fields such as speech recognition, video description, machine translation, and automatic music synthesis. Moreover, as research on LSTM networks has continued to deepen, their performance has improved greatly, and they have attracted wide attention in both industry and academia.
The operation of an LSTM network involves a variety of algorithms, and the devices that implement it mainly fall into the following two categories:
One device that implements LSTM network operations is the general-purpose processor, which supports the above algorithms by executing general-purpose instructions using a general-purpose register file and general-purpose functional units. One of the disadvantages of this approach is that a single general-purpose processor has low operational performance and cannot exploit the parallelism inherent in LSTM network operations to meet the demand for acceleration. When multiple general-purpose processors execute in parallel, communication between the processors in turn becomes a performance bottleneck. In addition, a general-purpose processor must decode the artificial neural network operation into a long sequence of arithmetic and memory-access instructions, and this front-end decoding incurs a large power overhead.
Another known method of supporting LSTM network operations is to use a graphics processing unit (GPU). This method supports the above algorithms by executing general-purpose SIMD instructions using a general-purpose register file and general-purpose stream processing units. Because the GPU is a device dedicated to graphics operations and scientific computation, it provides no special support for LSTM networks, and a large amount of front-end decoding is still required to perform LSTM network operations, which introduces substantial overhead. In addition, the GPU has only a small on-chip cache, so the parameters used by the LSTM network must be repeatedly transferred from off-chip, and off-chip bandwidth becomes a performance bottleneck.
Therefore, how to design and provide an apparatus and method that achieve high-performance LSTM network operation with a small amount of IO and low overhead is a technical problem that urgently needs to be solved.
Disclosure of Invention
It is therefore an objective of the claimed invention to provide an apparatus and method for performing LSTM network operations to solve at least one of the above problems.
To achieve the above object, as one aspect of the present invention, there is provided an apparatus for performing LSTM neural network operations, comprising:
the data cache units are arranged in parallel and used for caching data, states and results required by operation;
the data processing modules are arranged in parallel and used for acquiring input data and weights and offsets required during operation from the corresponding data cache units and performing LSTM neural network operation; the data processing modules correspond to the data cache units one by one, and parallel operation is executed among the data processing modules.
As another aspect of the present invention, the present invention also provides an apparatus for performing LSTM neural network operations, comprising:
a memory;
a processor that performs the following operations:
step 1, reading a weight and an offset used for LSTM neural network operation from an external designated address space, dividing the weight and the offset into a plurality of parts corresponding to neurons operated by the LSTM neural network, and storing the parts into different spaces of a memory, wherein the weight and the offset in each space are the same in number; and reading input data for LSTM neural network operations from an externally specified address space and storing it in each of said different spaces of said memory;
step 2, dividing the weight and the input data in each different space of the memory into a plurality of parts, wherein the number of the weight or the input data of each part is the same as the number of the corresponding vector operation units; calculating a weight and input data to obtain a partial sum, and performing vector addition on the partial sum and the previously obtained partial sum to obtain a new partial sum, wherein the initial value of the partial sum is an offset value;
step 3, after all the input data in each different space of the memory are processed, obtaining a partial sum which is a net activation amount corresponding to the neuron, and transforming the net activation amount of the neuron through a nonlinear function tanh or a sigmoid function to obtain an output value of the neuron;
step 4, using different weights and offsets in the mode, repeating the steps 1-3, and respectively calculating the vector values of a forgetting gate, an input gate, an output gate and a to-be-selected state unit in the LSTM neural network operation; vector operation instructions are adopted in the process of calculating the partial sums, and input data in each different space of the memory are calculated in a parallel operation mode;
step 5, judging whether the calculation of the vector values of the current forgetting gate, the input gate and the to-be-selected state unit in each different space of the memory is finished, if so, calculating a new state unit, namely, obtaining a partial sum of the vector values of the old state unit and the forgetting gate through a vector dot multiplication component, then obtaining a partial sum of the values of the to-be-selected state unit and the input gate through the vector dot multiplication component, obtaining an updated state unit by the two partial sums through a vector summation submodule, and simultaneously converting the updated state unit through a nonlinear conversion function tanh; judging whether the nonlinear transformation of the current updated data state unit and the output gate are calculated, if so, calculating the output gate and the vector with the updated data state unit after the nonlinear transformation through a vector dot product component to obtain the final output value of each different space of the memory;
and 6, splicing the final output values of each different space of each memory to obtain a final output value.
As another aspect of the present invention, the present invention further provides an LSTM neural network operation method, including the steps of:
step S1, reading the weight and bias for LSTM neural network operation from the external designated address space, writing the weight and bias into a plurality of data buffer units arranged in parallel, and initializing the state unit of each data buffer unit; wherein, the weight and the offset read from the external appointed address space are divided and sent to each corresponding data cache unit corresponding to the neurons operated by the LSTM neural network, and the weight and the offset in each data cache unit are respectively the same in quantity;
step S2, reading input data from an external designated address space, and writing the input data into the plurality of data buffer units, where the input data written into each data buffer unit is complete;
step S3, a plurality of data processing modules corresponding to the plurality of data cache units arranged in parallel one by one respectively read the weight, the offset and the input data from the corresponding data cache units, and carry out LSTM neural network operation on the data processing modules by adopting a vector point multiplication component, a vector addition component, a vector summation component and a vector nonlinear function conversion component to respectively obtain the output value of each data processing module;
and step S4, splicing the output values of the data processing modules to obtain a final output value, namely a final result of the LSTM neural network operation.
Based on the above technical solution, the apparatus and method for performing neural network operations of the present invention have the following advantages over existing implementations:
1. dedicated instructions are used for the operation, so that compared with existing implementations the number of instructions required is greatly reduced, reducing the decoding overhead incurred during LSTM network operation;
2. exploiting the fact that the weights and biases of the hidden layer are reused during LSTM network operation, the weights and biases are temporarily stored in the data cache units, so that the amount of IO between the device and the outside is reduced and the overhead of data transfer is lowered;
3. the invention does not restrict the application field of the LSTM network; it can be used in fields such as speech recognition, text translation, and music synthesis, and has strong extensibility;
4. the multiple data processing modules in the device are fully parallel, and the interior of each data processing module is also parallel, so the parallelism of the LSTM network can be fully exploited and its operation speed is significantly improved;
5. preferably, the vector nonlinear function conversion component can be implemented by a table look-up method, which is far more efficient than conventional function evaluation.
Drawings
For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic diagram illustrating an overall structure of an apparatus for performing LSTM network operations according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a data processing module of an apparatus for performing LSTM network operations according to an embodiment of the present invention;
FIG. 3 illustrates a flow diagram of a method for performing LSTM network operations in accordance with an embodiment of the present invention;
fig. 4 shows a detailed flowchart of a data processing procedure in a method for performing LSTM network operations according to an embodiment of the present invention.
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
In this specification, the various embodiments described below which are meant to illustrate the principles of this invention are illustrative only and should not be construed in any way to limit the scope of the invention. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the invention as defined by the claims and their equivalents. The following description includes various specific details to aid understanding, but such details are to be regarded as illustrative only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Moreover, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Moreover, throughout the drawings, the same reference numerals are used for similar functions and operations. In the present invention, the terms "include" and "comprise," as well as derivatives thereof, mean inclusion without limitation.
The apparatus for performing LSTM network operations of the present invention may be applied in scenarios including, but not limited to: various electronic products such as data processing devices, robots, computers, printers, scanners, telephones, tablet computers, intelligent terminals, mobile phones, driving recorders, navigators, sensors, cameras, cloud servers, video cameras, projectors, watches, earphones, mobile storage devices, and wearable devices; various vehicles such as airplanes, ships, and motor vehicles; various household appliances such as televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; and various medical devices such as nuclear magnetic resonance apparatuses, B-mode ultrasound apparatuses, and electrocardiographs.
Specifically, the invention discloses a device for executing LSTM neural network operation, which comprises:
the data cache units are arranged in parallel and used for caching data, states and results required by operation;
the data processing modules are arranged in parallel and used for acquiring input data and weights and offsets required during operation from the corresponding data cache units and performing LSTM neural network operation; the data processing modules correspond to the data cache units one by one, and parallel operation is executed among the data processing modules.
The data cache unit is used for caching the intermediate results computed by the data processing module; the weights and biases are imported from the direct memory access unit only once during the whole execution process and are not changed afterwards.
Each data cache unit is written with the weights and biases that have been divided up in correspondence with the neurons of the LSTM neural network operation, the number of weights and biases in each data cache unit being the same, and each data cache unit acquires a complete copy of the input data.
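As an illustration of this partitioning, the following sketch is a non-authoritative Python/NumPy approximation; the names `num_modules`, `W`, `b`, and `partition_by_neuron` are hypothetical and not taken from the patent. It splits the weights and biases by neuron (row) into equal shares, one per data cache unit, while every unit receives the full input vector:

```python
import numpy as np

def partition_by_neuron(W, b, x, num_modules):
    """Split weights/biases row-wise (per neuron) into equal shares,
    one share per data cache unit; every unit gets the full input x."""
    assert W.shape[0] % num_modules == 0, "neurons must divide evenly across modules"
    W_parts = np.split(W, num_modules, axis=0)   # same number of weight rows per unit
    b_parts = np.split(b, num_modules)           # same number of biases per unit
    x_copies = [x.copy() for _ in range(num_modules)]  # complete copy of the input each
    return list(zip(W_parts, b_parts, x_copies))

# Example: 8 neurons, 4 inputs, 2 parallel data cache units
W = np.arange(32, dtype=float).reshape(8, 4)
b = np.zeros(8)
x = np.ones(4)
shares = partition_by_neuron(W, b, x, num_modules=2)
print([s[0].shape for s in shares])   # [(4, 4), (4, 4)]
```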
The data processing module adopts a vector point multiplication component, a vector addition component, a vector summation component and a vector nonlinear function conversion component to carry out the LSTM neural network operation.
The vector nonlinear function conversion part performs function operation by a table look-up method.
The vector operation is carried out by each data processing module through respectively calculating vector values of a forgetting gate, an input gate, an output gate and a unit to be selected in the LSTM network operation, then the output value of each data processing module is obtained through each vector value, and finally the output values of the data processing modules are spliced to obtain a final output value.
As a preferred embodiment, the present invention discloses an apparatus for performing LSTM neural network operations, comprising:
the direct memory access unit is used for acquiring instructions and data required by LSTM neural network operation from an external address space outside the device, respectively transmitting the instructions and the data to the instruction cache unit and the data cache unit, and writing back an operation result to the external address space from the data processing module or the data cache unit;
the instruction cache unit is used for caching the instruction acquired by the direct memory access unit from the external address space and inputting the instruction into the controller unit;
the controller unit is used for reading instructions from the instruction cache unit, decoding them into microinstructions, and controlling the direct memory access unit to perform data IO operations, the data processing modules to perform the relevant operations, and the data cache units to perform data caching and transfer;
the data cache units are arranged in parallel and used for caching data, states and results required by operation;
the data processing modules are arranged in parallel and used for acquiring input data and weight and bias required during operation from the corresponding data cache units, performing LSTM neural network operation and inputting operation results into the corresponding data cache units or the direct memory access units; the data processing modules correspond to the data cache units one by one, and parallel operation is executed among the data processing modules.
Preferably, the direct memory access unit, the instruction cache unit, the controller unit, the plurality of data cache units, and the plurality of data processing modules are implemented by hardware circuits.
Preferably, the data cache unit is further configured to cache the intermediate results computed by the data processing module; the weights and biases are imported from the direct memory access unit only once during the whole execution process and are not changed afterwards.
Preferably, the plurality of data processing modules all adopt a vector dot multiplication component, a vector addition component, a vector summation component and a vector nonlinear function conversion component to perform the LSTM neural network operation.
Preferably, the vector nonlinear function conversion unit performs a function operation by a table lookup method.
Preferably, the plurality of data processing modules perform parallel operations as follows:
step 1, writing weight values and offsets read from an external designated address space and divided corresponding to neurons operated by the LSTM neural network into each corresponding data cache unit, wherein the number of the weight values and the offsets in each data cache unit is the same, and each data cache unit acquires a complete input data; each data processing module divides the weight and the input data in each corresponding data cache unit into a plurality of parts, wherein the number of the weight or the input data of each part is the same as the number of operations of the vector operation unit in the corresponding single data processing module; sending a weight and input data into the corresponding data processing module each time, calculating to obtain a partial sum, then taking out the partial sum obtained before from the data cache unit, carrying out vector addition on the partial sum to obtain a new partial sum, and sending the new partial sum back to the data cache unit, wherein the initial value of the partial sum is an offset value;
step 2, after all input data in each data cache unit are sent to a corresponding data processing module to be processed once, the obtained part sum is the net activation amount corresponding to the neuron, and the corresponding data processing module transforms the net activation amount of the neuron through a nonlinear function tanh or a sigmoid function to obtain an output value of the neuron;
step 3, using different weights and offsets in the mode, repeating the steps 1-2, and respectively calculating the vector values of a forgetting gate, an input gate, an output gate and a to-be-selected state unit in the LSTM network operation; in the same data processing module, vector operation instructions are adopted in the process of calculating partial sums, and parallel operation is adopted among data;
step 4, each data processing module judges whether the calculation of the vector values of the current forgetting gate, the input gate and the to-be-selected state unit is completed, if so, the calculation of a new state unit is carried out, namely, the vector values of the old state unit and the forgetting gate are sent to the data processing module, a partial sum is obtained through a vector dot multiplication component and sent back to a data cache unit, then the values of the to-be-selected state unit and the input gate are sent to the data processing module, the partial sum is obtained through the vector dot multiplication component, the partial sum in the data cache unit is sent to the data processing module, an updated state unit is obtained through a vector summation submodule and then sent back to the data cache unit, and meanwhile, the updated state unit in the data processing module is transformed through a nonlinear transformation function tanh; each data processing module judges whether the nonlinear transformation of the current updated data state unit and the output gate are calculated, if so, the output gate and the vector with the updated data state unit after the nonlinear transformation are calculated by a vector dot multiplication component to obtain a final output value, and the output value is written back to the data cache unit;
and 5, after the output values in all the data processing modules are written back to the data cache unit, splicing the output values in all the data processing modules to obtain a final output value, and sending the final output value to an external designated address through the direct memory access unit.
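To make the chunked partial-sum computation of steps 1 and 2 above concrete, the following hedged sketch accumulates one gate's net activation inside a single data processing module, starting from the bias and consuming the weights and inputs one vector-unit-sized chunk at a time. The names `vector_width`, `gate_net_activation`, and the direct use of sigmoid here are illustrative assumptions, not part of the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_net_activation(W_part, b_part, x, vector_width=16):
    """Accumulate a gate's net activation inside one data processing module.

    W_part : (neurons_in_module, input_len) weight share from this module's cache unit
    b_part : (neurons_in_module,) bias share; it is the initial value of the partial sum
    x      : full input vector [h_{t-1}, x_t]
    """
    partial = b_part.copy()                       # the partial sum starts at the bias
    for start in range(0, x.shape[0], vector_width):
        w_chunk = W_part[:, start:start + vector_width]
        x_chunk = x[start:start + vector_width]
        partial += w_chunk @ x_chunk              # one vector-unit-sized chunk per pass
    return partial                                 # net activation after all chunks

# After the net activation, the nonlinear transform gives the gate output, e.g. an
# input-gate value: i = sigmoid(gate_net_activation(W_i_part, b_i_part, x))
```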
The invention also discloses a device for executing the LSTM neural network operation, which comprises the following components:
a memory;
a processor that performs the following operations:
step 1, reading a weight and an offset used for LSTM neural network operation from an external designated address space, dividing the weight and the offset into a plurality of parts corresponding to neurons operated by the LSTM neural network, and storing the parts into different spaces of a memory, wherein the weight and the offset in each space are the same in number; and reading input data for LSTM neural network operations from an externally specified address space and storing it in each of said different spaces of said memory;
step 2, dividing the weight and the input data in each different space of the memory into a plurality of parts, wherein the number of the weight or the input data of each part is the same as the number of the corresponding vector operation units; calculating a weight and input data to obtain a partial sum, and performing vector addition on the partial sum and the previously obtained partial sum to obtain a new partial sum, wherein the initial value of the partial sum is an offset value;
step 3, after all the input data in each different space of the memory are processed, obtaining a partial sum which is a net activation amount corresponding to the neuron, and transforming the net activation amount of the neuron through a nonlinear function tanh or a sigmoid function to obtain an output value of the neuron;
step 4, using different weights and offsets in the mode, repeating the steps 1-3, and respectively calculating the vector values of a forgetting gate, an input gate, an output gate and a to-be-selected state unit in the LSTM neural network operation; vector operation instructions are adopted in the process of calculating the partial sums, and input data in each different space of the memory are calculated in a parallel operation mode;
step 5, judging whether the calculation of the vector values of the current forgetting gate, the input gate and the to-be-selected state unit in each different space of the memory is finished, if so, calculating a new state unit, namely, obtaining a partial sum of the vector values of the old state unit and the forgetting gate through a vector dot multiplication component, then obtaining a partial sum of the values of the to-be-selected state unit and the input gate through the vector dot multiplication component, obtaining an updated state unit by the two partial sums through a vector summation submodule, and simultaneously converting the updated state unit through a nonlinear conversion function tanh; judging whether the nonlinear transformation of the current updated data state unit and the output gate are calculated, if so, calculating the output gate and the vector with the updated data state unit after the nonlinear transformation through a vector dot product component to obtain the final output value of each different space of the memory;
and 6, splicing the final output values of each different space of each memory to obtain a final output value.
The invention also discloses an operation method of the LSTM neural network, which comprises the following steps:
step S1, reading the weight and bias for LSTM neural network operation from the external designated address space, writing the weight and bias into a plurality of data buffer units arranged in parallel, and initializing the state unit of each data buffer unit; wherein, the weight and the offset read from the external appointed address space are divided and sent to each corresponding data cache unit corresponding to the neurons operated by the LSTM neural network, and the weight and the offset in each data cache unit are respectively the same in quantity;
step S2, reading input data from an external designated address space, and writing the input data into the plurality of data buffer units, where the input data written into each data buffer unit is complete;
step S3, a plurality of data processing modules corresponding to the plurality of data cache units arranged in parallel one by one respectively read the weight, the offset and the input data from the corresponding data cache units, and carry out LSTM neural network operation on the data processing modules by adopting a vector point multiplication component, a vector addition component, a vector summation component and a vector nonlinear function conversion component to respectively obtain the output value of each data processing module;
and step S4, splicing the output values of the data processing modules to obtain a final output value, namely a final result of the LSTM neural network operation.
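Step S4's splicing of the per-module output values is, in effect, a concatenation in neuron order; a minimal NumPy sketch (the variable names are hypothetical) is:

```python
import numpy as np

# Hypothetical per-module output slices, listed in neuron order
module_outputs = [np.array([0.1, 0.2]), np.array([0.3, 0.4]), np.array([0.5, 0.6])]
final_output = np.concatenate(module_outputs)   # step S4: spliced final output value
print(final_output)                              # [0.1 0.2 0.3 0.4 0.5 0.6]
```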
Preferably, in step S3, each data processing module divides the weight and the input data in the corresponding data buffer unit into a plurality of parts, where the weight or the number of the input data in each part is the same as the number of operations performed by the vector operation unit in the corresponding single data processing module; each data cache unit sends a weight and input data to a data processing module corresponding to the weight and the input data each time, partial sums are obtained through calculation, partial sums obtained before are taken out from the data cache unit, vector addition is carried out on the partial sums to obtain new partial sums, and the new partial sums are sent back to the data cache unit, wherein the initial values of the partial sums are offset values;
after all input data are sent to the data processing module once, the obtained part sum is the net activation amount corresponding to the neuron, then the net activation amount of the neuron is sent to the data processing module, the output value of the neuron is obtained through nonlinear function tanh or sigmoid function transformation in the data operation submodule, and different weights and offsets are used in the mode to respectively calculate the vector values of a forgetting gate, an input gate, an output gate and a to-be-selected state unit in the LSTM neural network;
each data processing module judges whether the calculation of the vector values of the current forgetting gate, the input gate and the to-be-selected state unit is completed, if so, the calculation of a new state unit is carried out, namely, the vector values of the old state unit and the forgetting gate are sent to the data processing module, a partial sum is obtained through a vector dot multiplication component and sent back to a data cache unit, then the values of the to-be-selected state unit and the input gate are sent to the data processing module, a partial sum is obtained through the vector dot multiplication component, the partial sum in the data cache unit is sent to the data processing module, an updated state unit is obtained through a vector summation submodule and then sent back to the data cache unit, and meanwhile, the updated state unit in the data processing module is transformed through a nonlinear transformation function tanh; and each data processing module judges whether the nonlinear transformation of the current updated data state unit and the output gate are completed or not, if so, the output gate and the vector with the updated data state unit after the nonlinear transformation are calculated by a vector dot multiplication component to obtain a final output value, and the output value is written back to the data cache unit.
Preferably, the nonlinear function tanh or sigmoid function is operated by a table look-up method.
Other aspects, advantages and salient features of the invention will become apparent to those skilled in the art from the following detailed description of exemplary embodiments of the invention, which description is to be taken in conjunction with the accompanying drawings.
The invention discloses a device and a method for computing an LSTM network, which can be used for accelerating the application of using the LSTM network. The method specifically comprises the following steps:
(1) taking out weights and offsets used in LSTM network operation from an external designated address space through a direct memory access unit, and writing the weights and offsets into each data cache unit, wherein the weights and the offsets are taken out from the external designated address space, are divided and are sent into each data cache unit, the weights and the offsets in each data cache unit are the same in quantity, the weights and the offsets in each data cache unit correspond to neurons, and state units in the data cache units are initialized;
(2) the input data is taken out from an external appointed address space through a direct memory access unit and written into a data cache unit, wherein each data cache unit acquires a complete input data;
(3) dividing the weight and the input data in each data cache unit into a plurality of parts, wherein the weight or the number of the input data of each part is the same as the number of operations of the vector operation unit in a corresponding single data processing module, sending one part of the weight and the input data into the data processing module each time, calculating to obtain a partial sum, taking out the previously obtained partial sum from the data cache unit, carrying out vector addition on the partial sum to obtain a new partial sum, and sending the new partial sum back to the data cache unit. Wherein the initial value of the partial sum is the offset value. After all input data are sent to the data processing module once, the obtained part sum is the net activation amount corresponding to the neuron, then the net activation amount of the neuron is sent to the data processing module, the output value of the neuron is obtained through nonlinear function tanh or sigmoid function transformation in the data operation submodule, and the function transformation can be carried out through a table look-up method and a function operation method. By using different weights and offsets in this way, the vector values of the forgetting gate, the input gate, the output gate and the candidate state unit in the LSTM network can be respectively calculated. In the same data processing module, vector operation instructions are adopted in the process of calculating partial sums, and parallelism exists among data. And then, the data dependence judgment submodule in each data processing module judges whether the calculation of the vector values of the current forgetting gate, the input gate and the to-be-selected state unit is finished or not, and if so, the calculation of a new state unit is carried out. Firstly, sending old state unit and forgotten gate vector values to a data processing module, obtaining partial sums through a vector dot multiplication component in a data operation submodule, and sending the partial sums back to a data cache unit; and then, sending the values of the state unit to be selected and the input gate to a data processing module, obtaining a partial sum through a vector dot multiplication component in a data operation submodule, sending the partial sum in the data cache unit to the data processing module, obtaining an updated state unit through a vector summation submodule in the data operation submodule, then sending the updated state unit back to the data cache unit, and simultaneously transforming the updated state unit in the data processing module through a nonlinear transformation function tanh in the data operation submodule. And the data dependence judging submodule in each data processing module judges whether the nonlinear transformation of the current updated data state unit and the output gate are calculated, if so, the output gate and the vector subjected to the nonlinear transformation of the updated data state unit are calculated by a vector dot multiplication component in the data operation submodule to obtain a final output value, and the final output value is written back to the data cache unit. In the whole operation process, the problems of data dependence or data conflict do not exist among different data processing modules, and the parallel processing can be always performed.
(4) And after the output values in all the data processing modules are written back to the data cache unit, splicing the output values in all the data processing modules to obtain a final output value, and sending the final output value to an external designated address through the direct memory access unit.
(5) It is judged whether the LSTM network needs to produce output at the next time; if so, the flow returns to (2); otherwise, the operation ends.
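Steps (2)–(5) amount to a per-time-step loop around a core cell computation, with the weights and biases loaded only once in step (1). A minimal, non-authoritative sketch of that control flow follows; the helper names `load_weights`, `load_input`, `lstm_cell`, and `store_output` are placeholders for the device operations, not names from the patent:

```python
def run_lstm(num_steps, load_weights, load_input, lstm_cell, store_output):
    """Step (1): weights/biases enter the cache units once; steps (2)-(5): loop over time."""
    weights, biases = load_weights()          # imported once, unchanged afterwards
    state = None                              # state units start from their initial values
    h = None
    for t in range(num_steps):                # step (5): continue while output is needed
        x = load_input(t)                     # step (2): every cache unit gets the full input
        h, state = lstm_cell(x, h, state, weights, biases)  # step (3): gates, state, output
        store_output(t, h)                    # step (4): spliced output to the external address
    return h, state
```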
Fig. 1 is a schematic diagram illustrating an overall structure of an apparatus for performing LSTM network operations according to an embodiment of the present invention. As shown in fig. 1, the apparatus includes a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, which may all be implemented by hardware circuits.
The direct memory access unit 1 can access the external address space and can read and write data to each cache unit inside the device to complete the loading and storing of data. Specifically, it reads instructions into the instruction cache unit 2, reads the weights, biases, and input data required for the LSTM network operation from the designated storage location into the data cache units 4, and writes the post-operation output from the data cache units 4 directly to the external designated space.
The instruction cache unit 2 reads the instructions through the direct memory access unit 1 and caches the read instructions.
The controller unit 3 reads the instruction from the instruction cache unit 2, decodes the instruction into a microinstruction for controlling the behavior of other modules, and sends the microinstruction to other modules such as the direct memory access unit 1, the data cache unit 4, the data processing module 5, and the like.
The data cache unit 4 initializes a state unit of the LSTM when the device is initialized, and reads a weight and an offset from an external designated address through the direct memory access unit 1, wherein the weight and the offset read in each data cache unit 4 correspond to a neuron to be calculated, namely the weight and the offset read in each data cache unit 4 are part of the total weight and the offset, and the weight and the offset read in the external designated address are combined in all the data cache units 4; during specific operation, firstly, input data is obtained from the direct memory access unit 1, each data cache unit 4 obtains a copy of the input data, partial sum is initialized to be an offset value, then, a part of weight, offset and the input value is sent to the data processing module 5, an intermediate value is obtained through calculation in the data processing module 5, then, the intermediate value is read out from the data processing module 5 and stored in the data cache unit 4, when all the data are subjected to one-time operation, the partial sum is input to the data processing module 5, neuron output is obtained through calculation, then, the neuron output is written back to the data cache unit 4, and finally, vector values of an input gate, an output gate, a forgetting gate and a to-be-selected state unit are obtained. Then, the forgetting gate and the old state unit are sent into the data processing module 5, partial sums are obtained through calculation and written back to the data cache unit 4, the state unit to be selected and the input gate are sent into the data processing module 5, partial sums are obtained through calculation, the partial sums in the data cache unit 4 are written into the data processing module 5 and are subjected to vector addition with the partial sums obtained through calculation, an updated state unit is obtained, and the updated state unit is written back to the data cache unit 4. The output gate is sent to the data processing module 5, vector dot multiplication is performed on the output gate and the value after the nonlinear transformation function tanh transformation of the updated state unit to obtain an output value, and the output value is written back to the data cache unit 4. Finally, each data cache unit 4 obtains a corresponding updated state unit and an output value, and the output values in all the data cache units 4 are combined to obtain a final output value. Finally, each data cache unit 4 writes back the partial output value obtained by the data cache unit to the external designated address space through the direct memory access unit 1.
The corresponding operations in the LSTM network are as follows:
f_t = σ(W_f·[h_{t-1}, x_t] + b_f);
i_t = σ(W_i·[h_{t-1}, x_t] + b_i);
c̃_t = tanh(W_c·[h_{t-1}, x_t] + b_c);
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t;
o_t = σ(W_o·[h_{t-1}, x_t] + b_o);
h_t = o_t ⊙ tanh(c_t);
where x_t is the input data at time t and h_{t-1} is the output data at time t-1; W_f, W_i, W_c and W_o respectively denote the weight vectors corresponding to the forgetting gate, the input gate, the state-update unit and the output gate, and b_f, b_i, b_c and b_o respectively denote the corresponding biases of the forgetting gate, the input gate, the state-update unit and the output gate. f_t denotes the output of the forgetting gate, which is dot-multiplied with the state unit at time t-1 so that the state unit selectively forgets past values; i_t denotes the output of the input gate, which is dot-multiplied with the candidate state value at time t so that the candidate state value is selectively added to the state unit; c̃_t denotes the candidate state value computed at time t; c_t denotes the new state value obtained by selectively forgetting the state value at time t-1 and selectively adding the candidate state value at time t, which will be used when computing the final output and will also be passed on to the next time step; o_t denotes the selection of the part of the state unit at time t that is to be output as the result; h_t denotes the output at time t, which is also passed on to the next time step. ⊙ denotes the element-wise product of vectors, and σ is the sigmoid function, given by
σ(x) = 1/(1 + e^(-x));
the activation function tanh is computed as
tanh(x) = (e^x − e^(-x))/(e^x + e^(-x)).
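For reference, the six formulas above can be rendered as a plain NumPy sketch; this is an illustration only, using concatenation for [h_{t-1}, x_t] and ignoring the device's partitioning across data processing modules, and the dict-based parameter layout is an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the formulas above.
    W and b are dicts keyed by 'f', 'i', 'c', 'o'; each W[k] has shape
    (hidden, hidden + input_len) and each b[k] has shape (hidden,)."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W['f'] @ z + b['f'])         # forgetting gate
    i_t = sigmoid(W['i'] @ z + b['i'])         # input gate
    c_tilde = np.tanh(W['c'] @ z + b['c'])     # candidate state value
    c_t = f_t * c_prev + i_t * c_tilde         # element-wise products (⊙)
    o_t = sigmoid(W['o'] @ z + b['o'])         # output gate
    h_t = o_t * np.tanh(c_t)                   # output at time t
    return h_t, c_t
```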
Each time, the data processing module 5 reads a portion of the weights W_i/W_f/W_o/W_c and the biases b_i/b_f/b_o/b_c, together with the corresponding input data [h_{t-1}, x_t], from the corresponding data cache unit 4, and completes a partial-sum computation through the vector multiplication component and the summation component in the data processing module 5; when all the input data of each neuron have been operated on once, the net activation amount net_i/net_f/net_o/net_c of the neuron is obtained, and the output value is then computed through the vector nonlinear function conversion with the sigmoid or tanh function. In this way the input gate i_t, the forgetting gate f_t, the output gate o_t and the candidate state unit c̃_t are each computed. Then the dot products of the old state unit with the forgetting gate, and of the candidate state unit with the input gate, are computed by the vector dot-multiplication component in the data processing module 5, and the two results are combined by the vector addition component to obtain the new state unit c_t. The newly obtained state unit is written back to the data cache unit 4. The vector nonlinear function conversion component in the data processing module 5 then applies the tanh transformation to the state unit to obtain tanh(c_t); in this calculation the value of the tanh function can either be computed directly or obtained by table look-up. Then the vectors of the output gate and of the state unit after the tanh nonlinear transformation are computed by the vector dot-multiplication component to obtain the final neuron output value h_t. Finally, the neuron output value h_t is written back to the data cache unit 4.
FIG. 2 is a schematic diagram of a data processing module of an apparatus for performing LSTM network operations according to an embodiment of the present invention.
as shown in fig. 2, the data processing module 5 includes a data processing control sub-module 51, a data dependency discrimination sub-module 52, and a data operation sub-module 53.
Among them, the data processing control sub-module 51 controls the operations performed by the data operation sub-module 53 and controls the data dependency determination sub-module 52 to determine whether the current operation has a data dependency. For operations in which no data dependency can arise, the data processing control sub-module 51 directly controls the data operation sub-module 53 to perform them; for operations in which a data dependency may exist, the data processing control sub-module 51 first controls the data dependency determination sub-module 52 to check whether the current operation has a data dependency; if so, the data processing control sub-module 51 inserts null operations into the data operation sub-module 53 and only controls the data operation sub-module 53 to perform the data operation after the data dependency has been cleared.
The data dependency determination sub-module 52 is controlled by the data processing control sub-module 51 to check whether a data dependency exists in the data operation sub-module 53. If the next operation needs to use a value whose computation has not yet finished, a data dependency currently exists; otherwise there is none. One method of detecting data dependencies is to provide registers R1, R2, R3, R4 and R5 in the data operation sub-module 53, which mark whether the computations of the input gate, the forgetting gate, the output gate, the to-be-selected state unit and the tanh transformation of the updated state unit, respectively, have been completed; a nonzero register value indicates that the corresponding operation is complete, and 0 indicates that it is not yet complete. Corresponding to the LSTM network, the data dependency determination sub-module 52 has two data dependencies to check: when computing the new state unit it checks whether there is a data dependency on the input gate, the forgetting gate and the to-be-selected state unit, and when computing the output value it checks whether there is a data dependency on the output gate and the tanh transformation of the updated state unit; that is, it respectively checks whether R1, R2 and R4 are all nonzero and whether R3 and R5 are both nonzero. After the check is completed, the result is passed back to the data processing control sub-module 51.
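A toy sketch of this flag-register check follows; it is hedged, the register-to-gate assignment simply follows the description above, and the function names are illustrative only:

```python
# Flags: a nonzero value means the corresponding computation has finished.
# R1: input gate, R2: forgetting gate, R3: output gate,
# R4: to-be-selected (candidate) state unit, R5: tanh of the updated state unit.
flags = {"R1": 0, "R2": 0, "R3": 0, "R4": 0, "R5": 0}

def can_update_state_unit(flags):
    """The new state unit needs the input gate, forgetting gate and candidate state unit."""
    return flags["R1"] != 0 and flags["R2"] != 0 and flags["R4"] != 0

def can_compute_output(flags):
    """The output value needs the output gate and the tanh of the updated state unit."""
    return flags["R3"] != 0 and flags["R5"] != 0

# The data processing control sub-module would keep inserting null operations (stalls)
# until the relevant predicate becomes true.
```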
The data operation sub-module 53 is controlled by the data processing control sub-module 51 to complete the data processing in the network operation process. The data operation sub-module 53 includes a vector dot-multiplication component, a vector addition component, a vector summation component and a vector nonlinear conversion component, together with registers R1, R2, R3, R4 and R5 that flag whether the related data operations have been completed. Registers R1, R2, R3, R4 and R5 are used to mark whether the computations of the input gate, the forgetting gate, the output gate, the to-be-selected state unit and the tanh transformation of the updated state unit, respectively, have been completed; a register value that is not 0 indicates that the operation is complete, and 0 indicates that it has not yet been completed. The vector addition component adds the corresponding positions of two vectors to obtain a vector; the vector summation component divides a vector into several segments and sums within each segment, so that the length of the resulting vector equals the number of segments. The vector nonlinear conversion component takes each element of a vector as input and produces the output after nonlinear function transformation. The specific nonlinear transformation can be accomplished in two ways. Taking the sigmoid function with input x as an example, one way is to compute sigmoid(x) directly by function evaluation; the other way is to use a table look-up method: the data operation sub-module 53 maintains a table for the sigmoid function, recording the outputs y_1, y_2, …, y_n corresponding to the inputs x_1, x_2, …, x_n (x_1 < x_2 < … < x_n). To obtain the function value corresponding to x, the interval [x_i, x_{i+1}] satisfying x_i < x < x_{i+1} is first found, and then
y_i + (y_{i+1} − y_i)/(x_{i+1} − x_i) · (x − x_i)
is computed as the output value. During the LSTM network operation, the following operations are carried out:
first, R1, R2, R3, R4, and R5 are set to 0. Initializing an input gate portion and with an offset; using part of input data and the weight corresponding to the input data to obtain a temporary value through calculation by a vector dot multiplication component, then segmenting the temporary value according to temporary value vectors corresponding to different neurons, using a vector summation component to complete summation operation of the temporary value, and updating the calculation result with the sum of an input gate part and a completion part; and (3) carrying out the same operation on the other input data and the weight to update partial sums, obtaining partial sums which are the net activation quantity of the neurons after all the input data are operated once, and then calculating by a vector nonlinear transformation component to obtain the output value of the input gate. The output value is written back into the data cache unit 4 and the R1 register is set to not 0.
The output values of the forgetting gate, the output gate and the unit to be selected are calculated by the same method of calculating the output of the input gate, the corresponding output values are written back to the data buffer unit 4, and the registers R2, R3 and R4 are set to be not 0.
Next, according to the control command of the data processing control sub-module 51, either a null operation is executed or the computation of the updated state unit is performed. The computation of the updated state unit is: the forgetting gate output value and the old state unit are taken from the data cache unit 4 and their product is computed through the vector dot-multiplication component; then the input gate output value and the to-be-selected state unit are taken from the data cache unit 4 and their product is computed through the vector dot-multiplication component; and the updated state unit is obtained from the two products through the vector addition component. The updated state unit is finally written back to the data cache unit 4.
Likewise, according to the control command of the data processing control sub-module 51, either a null operation is executed or the computation of the LSTM network output value is performed. The output value is calculated as follows: the nonlinear transformation value of the updated state unit is computed using the vector nonlinear function conversion component, and R5 is set to a nonzero value; then the output gate and the nonlinear transformation value of the state unit are dot-multiplied using the vector dot-multiplication component to calculate the final output value, i.e. the output value of the corresponding neuron of the LSTM network. The output value is written back to the data cache unit 4.
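The table look-up method with linear interpolation described above for the data operation sub-module 53 can be sketched as follows; this is an illustrative approximation, not the device's actual table format, and `build_sigmoid_table` with its grid bounds is an assumption:

```python
import numpy as np

def build_sigmoid_table(lo=-8.0, hi=8.0, n=256):
    xs = np.linspace(lo, hi, n)                 # x_1 < x_2 < ... < x_n
    ys = 1.0 / (1.0 + np.exp(-xs))              # recorded outputs y_1 ... y_n
    return xs, ys

def sigmoid_lookup(x, xs, ys):
    """Find the interval [x_i, x_{i+1}] containing x and linearly interpolate,
    as in the formula given above."""
    x = np.clip(x, xs[0], xs[-1])
    i = np.searchsorted(xs, x) - 1
    i = np.clip(i, 0, len(xs) - 2)
    return ys[i] + (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) * (x - xs[i])

xs, ys = build_sigmoid_table()
print(sigmoid_lookup(0.3, xs, ys))              # close to 1/(1+e^-0.3) ≈ 0.574
```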
Fig. 3 illustrates a flow diagram for performing LSTM network operations provided in accordance with an embodiment of the present invention.
In step S1, an IO instruction is stored in advance at the head address of the instruction cache unit 2.
In step S2, the controller unit 3 reads the IO instruction from the first address of the instruction cache unit 2, and according to the translated micro instruction, the direct memory access unit 1 reads all instructions related to LSTM network computation from the external address space and caches the instructions in the instruction cache unit 2.
In step S3, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the translated microinstruction, the direct memory access unit 1 reads the weights and offsets related to the LSTM network operation, including those of the input gate, the output gate, the forgetting gate and the candidate state unit, from the specified external address space; the weights and offsets are divided according to the neurons they correspond to and read into different data cache units 4.
In step S4, the controller unit 3 reads a state unit initialization instruction from the instruction cache unit 2, and according to the translated microinstruction, initializes the state unit values in the data cache units 4 and sets the partial sums of the input gate, the output gate, the forgetting gate and the candidate state unit to the offset values of the corresponding neurons.
In step S5, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the translated microinstruction, the direct memory access unit 1 reads the input values from the designated external address space into the data cache units 4, with each data cache unit 4 receiving the same input value vector.
In step S6, the controller unit 3 reads a data processing instruction from the instruction cache unit 2, and according to the translated microinstruction, each data processing module 5 obtains the data required for the operation from its corresponding data cache unit 4 and performs the operation. The result of the operation is the output value of a part of the neurons for one time point, and the output values produced by all the data processing modules 5 are spliced together to form the output value for that time point; the detailed processing procedure is shown in Fig. 4. After the processing is finished, each data processing module 5 stores the intermediate values or output values and the state unit values into its data cache unit 4.
In step S7, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the translated microinstruction, the output values in the data cache units 4 are spliced together and written to the designated external address through the direct memory access unit 1.
In step S8, the controller unit 3 reads a discrimination instruction from the instruction cache unit 2, and based on the translated microinstruction, the controller unit 3 determines whether the forward process is completed, and if so, ends the operation. If not, the routine proceeds to S6.
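Read purely as software, steps S3 to S8 amount to partitioning the rows of each gate's weight matrix across the data processing modules, broadcasting the same input to every module at each time point, and splicing the per-module outputs; a minimal numpy sketch under assumed shapes and names, not the disclosed hardware:

```python
import numpy as np

def lstm_forward(x_seq, W, U, b, num_modules=2):
    """W, U, b are dicts keyed by 'i', 'f', 'o', 'c' holding the input weights,
    recurrent weights and offsets of the four gates; all names are illustrative."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    hidden = b['i'].shape[0]
    groups = np.array_split(np.arange(hidden), num_modules)   # S3: divide weights by neuron
    h, c = np.zeros(hidden), np.zeros(hidden)
    outputs = []
    for x in x_seq:                                            # S5: broadcast the same input vector
        h_parts, c_parts = [], []
        for g in groups:                                       # S6: per-module computation
            i = sigmoid(W['i'][g] @ x + U['i'][g] @ h + b['i'][g])
            f = sigmoid(W['f'][g] @ x + U['f'][g] @ h + b['f'][g])
            o = sigmoid(W['o'][g] @ x + U['o'][g] @ h + b['o'][g])
            cc = np.tanh(W['c'][g] @ x + U['c'][g] @ h + b['c'][g])
            c_new = f * c[g] + i * cc
            c_parts.append(c_new)
            h_parts.append(o * np.tanh(c_new))
        h, c = np.concatenate(h_parts), np.concatenate(c_parts)  # S7: splice module outputs
        outputs.append(h)
    return outputs                                             # S8: repeat for every time point
```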
Fig. 4 shows a detailed flowchart of a data processing procedure in a method for performing LSTM network operations according to an embodiment of the present invention.
In step S1, the data processing module 5 reads a part of the input gate weights and the input values from the data cache unit 4.
In step S2, the data processing control sub-module 51 in the data processing module 5 controls the vector dot multiplication unit in the data operation sub-module 53 to calculate the dot products of the input gate weights and the input values, groups the results according to the neurons they belong to, and sums the results within each group with the vector summation unit in the data operation sub-module 53 to obtain a partial sum.
In step S3, the data processing module 5 reads the input gate partial sum from the data cache unit 4.
In step S4, the data processing control sub-module 51 in the data processing module 5 controls the data operation sub-module 53 to add the newly calculated partial sum to the partial sum just read in, obtaining the updated input gate partial sum.
In step S5, the data processing module 5 writes the updated partial sum into the data cache unit 4.
In step S6, the data processing module 5 determines whether all the input gate weights have been operated on once; if so, the partial sum in the data cache unit is the input gate value and the R1 register is set to a non-zero value; otherwise, the flow returns to S1 and continues with a different part of the input gate weights and input values.
In step S7, the same operations are performed to obtain the forgetting gate output value, the output gate output value and the candidate state unit output value; R2, R3 and R4 are set to non-zero values, and all these output values are written back to the data cache unit 4.
In step S8, the data processing control sub-module 51 in the data processing module 5 controls the data dependency judgment sub-module 52 to determine whether the operations of the forgetting gate, the input gate and the candidate state unit have been completed, i.e. whether R1, R2 and R4 are all non-zero; if not, the data processing control sub-module 51 controls the data operation sub-module 53 to perform a null operation and then returns to S8, and if so, the flow proceeds to S9.
In step S9, the data processing module 5 reads the old state unit and the forgetting gate output value from the data cache unit 4.
In step S10, the data processing control sub-module 51 in the data processing module 5 controls the data operation sub-module 53 to calculate, with the vector dot multiplication unit, a partial sum from the old state unit and the forgetting gate output value.
In step S11, the data processing module 5 reads the candidate state unit and the input gate output value from the data cache unit 4.
In step S12, the data processing control sub-module 51 in the data processing module 5 controls the data operation sub-module 53 to calculate, with the vector dot multiplication unit, a partial sum from the candidate state unit and the input gate output value, and to add it to the partial sum calculated in S10 with the vector addition unit to obtain the updated state unit.
In step S13, the data processing module 5 writes the updated state unit back to the data cache unit 4.
In step S14, the data processing control sub-module 51 in the data processing module 5 controls the data operation sub-module 53 to calculate, with the vector nonlinear transformation unit, the tanh transformation value of the updated state unit, and sets R5 to a non-zero value.
In step S15, the data processing control sub-module 51 in the data processing module 5 controls the data dependency judgment sub-module 52 to determine whether the output gate output value and the tanh transformation value of the state unit have both been calculated, i.e. whether R3 and R5 are both non-zero; if not, the data processing control sub-module 51 controls the data operation sub-module 53 to perform a null operation and then returns to S15, and if so, the flow proceeds to S16.
In step S16, the data processing module 5 reads the output gate output value from the data cache unit 4.
In step S17, the data processing control sub-module 51 in the data processing module 5 controls the data operation sub-module 53 to multiply, with the vector dot multiplication unit, the output gate output value and the tanh transformation value of the state unit; the result is the output value of the neurons of the LSTM network that correspond to this data processing module 5.
In step S18, the data processing module 5 writes the output value into the data buffer unit 4.
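For reference, steps S1 to S18 for a single data processing module can be rendered sequentially in software as follows; the cache layout, key names and flag handling are illustrative assumptions (in hardware, null operations are issued while the dependency flags are still zero):

```python
import numpy as np

def module_timestep(cache):
    """Sequential rendering of steps S1-S18 for one data processing module.
    `cache` stands in for data cache unit 4 and is a plain dict in this sketch."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    activation = {'input': sigmoid, 'forget': sigmoid, 'output': sigmoid, 'candidate': np.tanh}
    flags = {'R1': 0, 'R2': 0, 'R3': 0, 'R4': 0, 'R5': 0}

    # S1-S7: accumulate every gate's partial sum block by block, then transform it.
    for gate, flag in (('input', 'R1'), ('forget', 'R2'), ('output', 'R3'), ('candidate', 'R4')):
        partial = cache[gate + '_offset'].copy()
        for w_block, x_block in zip(cache[gate + '_weight_blocks'], cache['input_blocks']):
            partial += (w_block * x_block).sum(axis=1)   # dot-multiplication + per-neuron summation
        cache[gate + '_out'] = activation[gate](partial) # vector nonlinear transformation
        flags[flag] = 1                                  # mark this gate's output as ready

    # S8-S13: the state unit is updated only once R1, R2 and R4 are all non-zero.
    assert flags['R1'] and flags['R2'] and flags['R4']
    cache['state'] = (cache['forget_out'] * cache['state']
                      + cache['input_out'] * cache['candidate_out'])

    # S14-S18: tanh of the updated state, then the output gate gives the neuron outputs.
    tanh_state = np.tanh(cache['state'])
    flags['R5'] = 1
    assert flags['R3'] and flags['R5']
    cache['h'] = cache['output_out'] * tanh_state
    return cache['h']
```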
The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software embodied on a non-transitory computer readable medium), or a combination thereof. Although the processes or methods are described above in terms of certain sequential operations, it should be understood that some of the operations described may be performed in a different order. Further, some operations may be performed in parallel rather than sequentially.
The device disclosed by the invention uses a specially designed instruction set, so instruction decoding is more efficient. The plurality of data processing modules compute in parallel, and the plurality of data cache units operate in parallel without transferring data between one another, which greatly improves the parallelism of the operation. In addition, placing the weights and offsets in the data cache units reduces the IO operations between the device and the external address space, and thus the bandwidth required for memory access.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (19)

1. An apparatus for performing LSTM neural network operations, comprising:
the data cache units are arranged in parallel and used for caching data, states and results required by the LSTM neural network operation;
the data processing modules are arranged in parallel and used for acquiring input data and weights and offsets required during operation from the corresponding data cache units and performing LSTM neural network operation; the data processing modules correspond to the data cache units one by one.
2. The apparatus of claim 1, wherein the apparatus further comprises a direct memory access unit, an instruction cache unit, and a controller unit, wherein,
the direct memory access unit is used for acquiring data required by the operation of an instruction and an LSTM neural network from an external address space, respectively transmitting the instruction and the data to the instruction cache unit and the data cache unit, and writing back an operation result to the external address space from the data processing module and the data cache unit;
the instruction cache unit is used for caching the instruction acquired by the direct memory access unit from an external address space and inputting the instruction into the controller unit;
the controller unit is used for reading the instruction from the instruction cache unit, decoding the instruction into a micro instruction, controlling the direct memory access unit to perform data IO operation, controlling the data processing module to perform operation, and controlling the data cache unit to perform data caching and transmission.
3. The apparatus of claim 1, wherein the data caching unit is further configured to cache intermediate results computed by the data processing module, and to import weights and offsets from the direct memory access unit during execution of the LSTM neural network operation.
4. The apparatus of claim 1, wherein weights and offsets divided corresponding to the neurons operated on by the LSTM neural network are buffered in each of the data buffer units, wherein the number of weights and offsets in each of the data buffer units is the same, and complete input data is buffered in each of the data buffer units.
5. The apparatus of claim 1, wherein the data processing module performs the LSTM neural network operations using a vector dot product component, a vector add component, a vector sum component, and a vector nonlinear function conversion component.
6. The apparatus of claim 5, wherein the vector nonlinear function conversion component performs the function operation by table lookup.
7. The apparatus according to any one of claims 2 to 6, wherein each of the data processing modules divides the weight and the input data in the corresponding data buffer unit into a plurality of parts, wherein the number of the weight or the input data in each part is the same as the number of the vector operation unit operations in the corresponding data processing module;
the controller unit is further configured to send a weight and input data from the data buffer unit to the corresponding data processing module to calculate a partial sum, then take out the previous partial sum from the data buffer unit and send the previous partial sum to the data processing module, so that the data processing module performs vector addition on the partial sum to obtain a new partial sum, and send the new partial sum back to the data buffer unit, where an initial value of the partial sum is an offset value.
8. The apparatus of claim 1, wherein each of the data processing modules performs vector operations by respectively calculating vector values of a forgetting gate, an input gate, an output gate, and a candidate state unit in LSTM network operations, obtains an output value of each of the data processing modules from the vector values, and finally splices the output values of the data processing modules to obtain a final output value.
9. The apparatus of claim 1, wherein the data processing module comprises a data processing control sub-module, a data dependent discrimination sub-module, and a data operation sub-module, wherein,
the data processing control sub-module controls the operation of the data operation sub-module and controls the data dependence judgment sub-module to judge whether the current operation has data dependence.
10. An operation method of an LSTM neural network is applied to an LSTM neural network operation device, the LSTM neural network operation device comprises a plurality of data cache units arranged in parallel and a plurality of data processing modules corresponding to the data cache units one by one, and the method comprises the following steps:
the target data processing module obtains input data and weight and bias required during operation from the corresponding data cache unit, carries out LSTM neural network operation, and caches the result obtained by the operation to the corresponding data cache unit, wherein the target data processing module is any one of the data processing modules.
11. The method of claim 10, wherein the apparatus further comprises a direct memory access unit, an instruction cache unit, and a controller unit, the method further comprising:
the direct memory access unit acquires instructions and data required by LSTM neural network operation from an external address space, and respectively transmits the instructions and the data to the instruction cache unit and the data cache unit;
the controller unit reads an instruction from the instruction cache unit, decodes the instruction into a micro instruction, controls the direct memory access unit to perform data IO operation, controls the data processing module to perform operation and controls the data cache unit to perform data caching and transmission;
the direct memory access unit writes back the operation result to the external address space from the data processing module and the data cache unit.
12. The method of claim 10, wherein the method further comprises:
and caching the intermediate result calculated by the data processing module into the data caching unit, and leading weight and offset from the direct memory access unit by the data caching unit in the execution process of the LSTM neural network operation.
13. The method of claim 10, wherein the method further comprises:
and each data cache unit caches weights and offsets which are divided corresponding to the neurons operated by the LSTM neural network, wherein the weights and the offsets in each data cache unit are the same in number, and complete input data is cached in each data cache unit.
14. The method of claim 10, wherein the data processing module performs the LSTM neural network operations using a vector dot product component, a vector add component, a vector sum component, and a vector nonlinear function conversion component.
15. The method of claim 14, wherein the vector nonlinear function conversion component performs the function operation by table lookup.
16. The method according to any one of claims 11 to 15, wherein each of the data processing modules divides the weight and the input data in the corresponding data buffer unit into a plurality of parts, wherein the number of the weight or the input data in each part is the same as the number of the vector operation unit operations in the corresponding data processing module;
the controller unit sends a weight and input data from the data cache unit to the corresponding data processing module to calculate a partial sum, then takes the partial sum obtained before from the data cache unit and sends the partial sum to the data processing module, so that the data processing module carries out vector addition on the partial sum to obtain a new partial sum, and sends the new partial sum back to the data cache unit, wherein the initial value of the partial sum is an offset value.
17. The method as claimed in claim 10, wherein each of the data processing modules performs vector operation by respectively calculating vector values of a forgetting gate, an input gate, an output gate and a candidate state unit in LSTM network operation, obtains an output value of each of the data processing modules from the vector values, and finally splices the output values of the data processing modules to obtain a final output value.
18. The method of claim 10, wherein the data processing module includes a data processing control sub-module, a data dependent discrimination sub-module, and a data operation sub-module, wherein,
the data processing control sub-module controls the operation of the data operation sub-module and controls the data dependence judgment sub-module to judge whether the current operation has data dependence.
19. An apparatus for performing LSTM neural network operations, comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any of claims 10 to 18 when executing the computer program.
CN202010018716.XA 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operation Active CN111260025B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010018716.XA CN111260025B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010018716.XA CN111260025B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operation
CN201611269665.8A CN108268939B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operations

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201611269665.8A Division CN108268939B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operations

Publications (2)

Publication Number Publication Date
CN111260025A true CN111260025A (en) 2020-06-09
CN111260025B CN111260025B (en) 2023-11-14

Family

ID=62771289

Family Applications (4)

Application Number Title Priority Date Filing Date
CN202010018716.XA Active CN111260025B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operation
CN202110713121.0A Active CN113537481B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operation
CN201611269665.8A Active CN108268939B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operations
CN202110708810.2A Active CN113537480B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operation

Family Applications After (3)

Application Number Title Priority Date Filing Date
CN202110713121.0A Active CN113537481B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operation
CN201611269665.8A Active CN108268939B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operations
CN202110708810.2A Active CN113537480B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operation

Country Status (1)

Country Link
CN (4) CN111260025B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110727462B (en) * 2018-07-16 2021-10-19 上海寒武纪信息科技有限公司 Data processor and data processing method
WO2020061870A1 (en) * 2018-09-27 2020-04-02 深圳大学 Lstm end-to-end single-lead electrocardiogram classification method
CN109543832B (en) * 2018-11-27 2020-03-20 中科寒武纪科技股份有限公司 Computing device and board card
CN111258636B (en) * 2018-11-30 2022-10-04 上海寒武纪信息科技有限公司 Data processing method, processor, data processing device and storage medium
US11494645B2 (en) * 2018-12-06 2022-11-08 Egis Technology Inc. Convolutional neural network processor and data processing method thereof
WO2020125092A1 (en) * 2018-12-20 2020-06-25 中科寒武纪科技股份有限公司 Computing device and board card
CN109670581B (en) * 2018-12-21 2023-05-23 中科寒武纪科技股份有限公司 Computing device and board card
US11042797B2 (en) 2019-01-08 2021-06-22 SimpleMachines Inc. Accelerating parallel processing of data in a recurrent neural network
CN110009100B (en) * 2019-03-28 2021-01-05 安徽寒武纪信息科技有限公司 Calculation method of user-defined operator and related product
CN110020720B (en) * 2019-04-01 2021-05-11 中科寒武纪科技股份有限公司 Operator splicing method and device
CN112346781A (en) * 2019-08-07 2021-02-09 上海寒武纪信息科技有限公司 Instruction processing method and device and related product
CN110347506B (en) * 2019-06-28 2023-01-06 Oppo广东移动通信有限公司 Data processing method and device based on LSTM, storage medium and electronic equipment
CN111652361B (en) * 2020-06-04 2023-09-26 南京博芯电子技术有限公司 Composite granularity near storage approximate acceleration structure system and method for long-short-term memory network

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104303162A (en) * 2012-01-12 2015-01-21 才智知识产权控股公司(2) Systems and methods for managing cache admission
CN105184366A (en) * 2015-09-15 2015-12-23 中国科学院计算技术研究所 Time-division-multiplexing general neural network processor
CN105893159A (en) * 2016-06-21 2016-08-24 北京百度网讯科技有限公司 Data processing method and device
CN106022468A (en) * 2016-05-17 2016-10-12 成都启英泰伦科技有限公司 Artificial neural network processor integrated circuit and design method therefor
US20160379111A1 (en) * 2015-06-25 2016-12-29 Microsoft Technology Licensing, Llc Memory bandwidth management for deep learning applications
US20160379109A1 (en) * 2015-06-29 2016-12-29 Microsoft Technology Licensing, Llc Convolutional neural networks on hardware accelerators

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0296861A (en) * 1988-10-03 1990-04-09 Mitsubishi Electric Corp Microprocessor peripheral function circuit device
JPH03162800A (en) * 1989-08-29 1991-07-12 Mitsubishi Electric Corp Semiconductor memory device
JP2001034597A (en) * 1999-07-22 2001-02-09 Fujitsu Ltd Cache memory device
JP2001188767A (en) * 1999-12-28 2001-07-10 Fuji Xerox Co Ltd Neutral network arithmetic unit and method
WO2001069424A2 (en) * 2000-03-10 2001-09-20 Jaber Associates, L.L.C. Parallel multiprocessing for the fast fourier transform with pipeline architecture
JP2008097572A (en) * 2006-09-11 2008-04-24 Matsushita Electric Ind Co Ltd Processing device, computer system, and mobile apparatus
CN101197017A (en) * 2007-12-24 2008-06-11 深圳市物证检验鉴定中心 Police criminal technology inspection and appraisal information system and method thereof
CN102004446A (en) * 2010-11-25 2011-04-06 福建师范大学 Self-adaptation method for back-propagation (BP) nerve cell with multilayer structure
CN103150596B (en) * 2013-02-22 2015-12-23 百度在线网络技术(北京)有限公司 The training system of a kind of reverse transmittance nerve network DNN
JP6115455B2 (en) * 2013-11-29 2017-04-19 富士通株式会社 Parallel computer system, parallel computer system control method, information processing apparatus, arithmetic processing apparatus, and communication control apparatus
US20150269481A1 (en) * 2014-03-24 2015-09-24 Qualcomm Incorporated Differential encoding in neural networks
US20160034812A1 (en) * 2014-07-31 2016-02-04 Qualcomm Incorporated Long short-term memory using a spiking neural network
JP6453681B2 (en) * 2015-03-18 2019-01-16 株式会社東芝 Arithmetic apparatus, arithmetic method and program
CN105095961B (en) * 2015-07-16 2017-09-29 清华大学 A kind of hybrid system of artificial neural network and impulsive neural networks
CN105513591B (en) * 2015-12-21 2019-09-03 百度在线网络技术(北京)有限公司 The method and apparatus for carrying out speech recognition with LSTM Recognition with Recurrent Neural Network model
CN107609642B (en) * 2016-01-20 2021-08-31 中科寒武纪科技股份有限公司 Computing device and method
CN106203621B (en) * 2016-07-11 2019-04-30 北京深鉴智能科技有限公司 The processor calculated for convolutional neural networks

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104303162A (en) * 2012-01-12 2015-01-21 才智知识产权控股公司(2) Systems and methods for managing cache admission
US20160379111A1 (en) * 2015-06-25 2016-12-29 Microsoft Technology Licensing, Llc Memory bandwidth management for deep learning applications
US20160379109A1 (en) * 2015-06-29 2016-12-29 Microsoft Technology Licensing, Llc Convolutional neural networks on hardware accelerators
CN105184366A (en) * 2015-09-15 2015-12-23 中国科学院计算技术研究所 Time-division-multiplexing general neural network processor
CN106022468A (en) * 2016-05-17 2016-10-12 成都启英泰伦科技有限公司 Artificial neural network processor integrated circuit and design method therefor
CN105893159A (en) * 2016-06-21 2016-08-24 北京百度网讯科技有限公司 Data processing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MINSI WANG ET AL.: "A PARALLEL-FUSION RNN-LSTM ARCHITECTURE FOR IMAGE CAPTION GENERATION" *
杨旭瑜 et al.: "Research on Deep Learning Acceleration Technology" *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022206536A1 (en) * 2021-03-29 2022-10-06 维沃移动通信有限公司 Data processing method and apparatus, and chip

Also Published As

Publication number Publication date
CN113537481A (en) 2021-10-22
CN111260025B (en) 2023-11-14
CN108268939A (en) 2018-07-10
CN108268939B (en) 2021-09-07
CN113537481B (en) 2024-04-02
CN113537480A (en) 2021-10-22
CN113537480B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN108268939B (en) Apparatus and method for performing LSTM neural network operations
CN111860812B (en) Apparatus and method for performing convolutional neural network training
KR102470264B1 (en) Apparatus and method for performing reverse training of a fully-connected layer neural network
EP3564863B1 (en) Apparatus for executing lstm neural network operation, and operational method
CN110929863B (en) Apparatus and method for performing LSTM operations
CN107832843B (en) Information processing method and related product
US11531540B2 (en) Processing apparatus and processing method with dynamically configurable operation bit width
CN110298443B (en) Neural network operation device and method
US10853722B2 (en) Apparatus for executing LSTM neural network operation, and operational method
CN111860813B (en) Device and method for performing forward operation of convolutional neural network
CN109358900B (en) Artificial neural network forward operation device and method supporting discrete data representation
CN111860811B (en) Device and method for executing full-connection layer forward operation of artificial neural network
WO2017185347A1 (en) Apparatus and method for executing recurrent neural network and lstm computations
EP3451238A1 (en) Apparatus and method for executing pooling operation
CN108171328B (en) Neural network processor and convolution operation method executed by same
CN111160547B (en) Device and method for artificial neural network operation
CN111651203A (en) Device and method for executing vector four-rule operation
CN111176608A (en) Apparatus and method for performing vector compare operations
WO2017185248A1 (en) Apparatus and method for performing auto-learning operation of artificial neural network
CN109711540B (en) Computing device and board card
WO2017177446A1 (en) Discrete data representation-supporting apparatus and method for back-training of artificial neural network
CN111860772B (en) Device and method for executing artificial neural network mapping operation
CN113570053A (en) Neural network model training method and device and computing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant