CN111260025A - Apparatus and method for performing LSTM neural network operations - Google Patents
- Publication number: CN111260025A
- Application number: CN202010018716.XA
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G06F13/28: Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access (DMA), cycle steal
- G06F9/30036: Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
- G06F9/3802: Instruction prefetching
- G06F9/3885: Concurrent instruction execution using a plurality of independent parallel functional units
- G06N3/044: Recurrent networks, e.g. Hopfield networks
- G06N3/048: Activation functions
Abstract
An apparatus and method for performing LSTM neural network operations. The apparatus comprises a direct memory access unit, an instruction cache unit, a controller unit, a plurality of data cache units arranged in parallel, and a plurality of data processing modules arranged in parallel. The data processing modules correspond one-to-one to the data cache units and obtain the input data, weights, and biases required for the operation from their corresponding data cache units to perform the LSTM neural network operation; the data processing modules execute operations in parallel with one another. The invention uses dedicated instructions, so the number of instructions required for the operation is greatly reduced and decoding overhead is lowered; the weights and biases are cached, reducing data-transfer overhead; the invention is not limited to a specific application field and can be used in fields such as speech recognition, text translation, and music synthesis, giving it strong extensibility; and the multiple data processing modules run in parallel, significantly increasing the speed of LSTM network operations.
Description
Technical Field
The present invention relates to the technical field of neural network operations, and more particularly, to an apparatus and an operation method for performing LSTM neural network operations.
Background
A long short-term memory (LSTM) network is a type of recurrent neural network (RNN). Owing to its distinctive structure, an LSTM is well suited to processing and predicting important events separated by very long intervals and delays in a time series. LSTM networks outperform traditional recurrent neural networks and are particularly good at learning from experience to classify, process, and predict time series in which significant events are separated by lags of unknown duration. LSTM networks are now widely used in fields such as speech recognition, video description, machine translation, and automatic music composition. Meanwhile, as research on LSTM networks continues to deepen, their performance has improved substantially, attracting broad attention in both industry and academia.
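For orientation, the gate structure that such devices accelerate can be written out as the standard LSTM cell equations; the following pure-Python sketch (scalar toy dimensions, with function and variable names that are illustrative rather than taken from the patent) shows one time step:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One standard LSTM time step, scalar toy dimensions.

    W maps each gate name to (w_x, w_h) weights; b maps it to a bias.
    Gates: f = forget, i = input, o = output, g = candidate state.
    """
    f = sigmoid(W['f'][0] * x + W['f'][1] * h_prev + b['f'])
    i = sigmoid(W['i'][0] * x + W['i'][1] * h_prev + b['i'])
    o = sigmoid(W['o'][0] * x + W['o'][1] * h_prev + b['o'])
    g = math.tanh(W['g'][0] * x + W['g'][1] * h_prev + b['g'])
    c = f * c_prev + i * g          # new cell (state unit) value
    h = o * math.tanh(c)            # new hidden/output value
    return h, c
```

In the real network each gate is a vector and the products are matrix-vector operations, which is exactly the parallelism the apparatus described below exploits.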
LSTM network operations involve a variety of algorithms, and the devices that implement them fall mainly into the following two categories:
One device that implements LSTM network operations is a general-purpose processor, which supports the above algorithms by executing general-purpose instructions with a general-purpose register file and general-purpose functional units. One disadvantage of this approach is that a single general-purpose processor has low computational performance and cannot exploit the parallelism inherent in LSTM network operations for acceleration. When multiple general-purpose processors execute in parallel, inter-processor communication becomes a performance bottleneck. In addition, a general-purpose processor must decode the artificial neural network operation into a long sequence of arithmetic and memory-access instructions, and the processor's front-end decoding incurs considerable power overhead.
Another known way to support LSTM network operations is to use a graphics processing unit (GPU), which executes the above algorithms with general-purpose SIMD instructions using a general-purpose register file and general-purpose stream processors. Because the GPU is a device specialized for graphics operations and scientific computing with no dedicated support for LSTM networks, a large amount of front-end decoding work is still required, incurring substantial overhead. In addition, the GPU has only a small on-chip cache, so the parameters used in the LSTM network must be repeatedly transferred from off-chip, and off-chip bandwidth also becomes a performance bottleneck.
Therefore, how to design a device and method that achieve high-performance LSTM network operations with a small amount of IO and low overhead is a technical problem that urgently needs to be solved.
Disclosure of Invention
It is therefore an objective of the present invention to provide an apparatus and method for performing LSTM network operations that solve at least one of the above problems.
To achieve the above object, as one aspect of the present invention, there is provided an apparatus for performing LSTM neural network operations, comprising:
a plurality of data cache units arranged in parallel, used for caching the data, states, and results required by the operation;
a plurality of data processing modules arranged in parallel, used for obtaining the input data, weights, and biases required by the operation from their corresponding data cache units and performing the LSTM neural network operation; the data processing modules correspond one-to-one to the data cache units and execute operations in parallel with one another.
As another aspect of the present invention, the present invention also provides an apparatus for performing LSTM neural network operations, comprising:
a memory;
a processor that performs the following operations:
step 6: splice the final output values from the different address spaces of each memory to obtain the final output value.
As another aspect of the present invention, the present invention further provides an LSTM neural network operation method, including the steps of:
step S1: read the weights and biases for the LSTM neural network operation from an externally designated address space, write them into a plurality of data cache units arranged in parallel, and initialize the state unit of each data cache unit; the weights and biases read from the externally designated address space are partitioned according to the neurons of the LSTM operation and sent to the corresponding data cache units, each data cache unit receiving the same number of weights and biases;
step S2: read the input data from an externally designated address space and write it into the plurality of data cache units, each data cache unit receiving a complete copy of the input data;
step S3: a plurality of data processing modules, corresponding one-to-one to the plurality of parallel data cache units, each read the weights, biases, and input data from their corresponding data cache unit and perform the LSTM neural network operation using a vector dot-multiplication unit, a vector addition unit, a vector summation unit, and a vector nonlinear-function transformation unit, obtaining the output value of each data processing module;
step S4: splice the output values of the data processing modules to obtain the final output value, which is the final result of the LSTM neural network operation.
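Under the stated assumptions (weights partitioned by neuron, the complete input broadcast to every unit, outputs spliced), steps S1 to S4 can be sketched in plain Python. The helper names and the plain affine per-unit computation are illustrative simplifications, not the patent's implementation:

```python
def split_rows(W, b, n_units):
    """Step S1: partition weights/biases by neuron across n_units caches.
    Each unit gets the same number of rows (assumes len(W) % n_units == 0)."""
    rows = len(W) // n_units
    return [(W[k*rows:(k+1)*rows], b[k*rows:(k+1)*rows]) for k in range(n_units)]

def unit_forward(W_part, b_part, x):
    """Step S3: one data processing module computes its slice of outputs
    (here a plain affine map; the real device also applies gate nonlinearities)."""
    return [sum(w * v for w, v in zip(row, x)) + bias
            for row, bias in zip(W_part, b_part)]

def forward(W, b, x, n_units):
    """Steps S2-S4: broadcast x to every unit, run the units independently,
    then splice their partial outputs into the final output vector."""
    out = []
    for W_part, b_part in split_rows(W, b, n_units):   # parallel in hardware
        out.extend(unit_forward(W_part, b_part, x))
    return out
```

A useful sanity property of this scheme is that the number of parallel units does not change the result, only how the rows are distributed.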
Based on the above technical solution, the apparatus and method for performing neural network operations of the present invention have the following advantages over existing implementations:
1. dedicated instructions are used for the operation, so that compared with existing implementations the number of instructions required is greatly reduced, lowering the decoding overhead incurred during LSTM network operations;
2. exploiting the fact that the weights and biases of a hidden layer are reused throughout the LSTM network operation, the weights and biases are kept in the data cache units, reducing the amount of IO between the device and the outside and the overhead of data transfer;
3. the invention is not limited to a specific LSTM application field and can be used in fields such as speech recognition, text translation, and music synthesis, giving it strong extensibility;
4. the multiple data processing modules in the device are fully parallel, and the interior of each data processing module is also parallel, so the parallelism of the LSTM network can be fully exploited, significantly increasing its operation speed;
5. preferably, the vector nonlinear-function transformation unit can be implemented by table lookup, which is far more efficient than conventional function evaluation.
Drawings
For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic diagram illustrating an overall structure of an apparatus for performing LSTM network operations according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a data processing module of an apparatus for performing LSTM network operations according to an embodiment of the present invention;
FIG. 3 illustrates a flow diagram of a method for performing LSTM network operations in accordance with an embodiment of the present invention;
fig. 4 shows a detailed flowchart of a data processing procedure in a method for performing LSTM network operations according to an embodiment of the present invention.
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
In this specification, the various embodiments described below serve only to illustrate the principles of the invention and should not be construed in any way as limiting its scope. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the invention as defined by the claims and their equivalents. It includes various specific details to aid understanding, but these are to be regarded as illustrative only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. Descriptions of well-known functions and constructions are omitted for clarity and conciseness. Throughout the drawings, the same reference numerals are used for similar functions and operations. In the present invention, the terms "include" and "comprise", as well as their derivatives, mean inclusion without limitation.
The apparatus for performing LSTM network operations of the present invention may be applied in scenarios including, but not limited to: various electronic products such as data processing devices, robots, computers, printers, scanners, telephones, tablet computers, intelligent terminals, mobile phones, dashboard cameras, navigators, sensors, cameras, cloud servers, video cameras, projectors, watches, earphones, mobile storage devices, and wearable devices; various vehicles such as airplanes, ships, and automobiles; various household appliances such as televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; and various medical devices such as MRI machines, B-mode ultrasound machines, and electrocardiographs.
Specifically, the invention discloses a device for executing LSTM neural network operation, which comprises:
a plurality of data cache units arranged in parallel, used for caching the data, states, and results required by the operation;
a plurality of data processing modules arranged in parallel, used for obtaining the input data, weights, and biases required by the operation from their corresponding data cache units and performing the LSTM neural network operation; the data processing modules correspond one-to-one to the data cache units and execute operations in parallel with one another.
The data cache unit is also used for caching intermediate results computed by the data processing module; the weights and biases are imported from the direct memory access unit only once during the entire execution and are not changed thereafter.
Each data cache unit is written with the weights and biases partitioned according to the neurons of the LSTM operation, each data cache unit holding the same number of weights and biases, and each data cache unit obtains a complete copy of the input data.
The data processing modules perform the LSTM neural network operation using a vector dot-multiplication unit, a vector addition unit, a vector summation unit, and a vector nonlinear-function transformation unit.
The vector nonlinear-function transformation unit performs function evaluation by table lookup.
Each data processing module carries out the vector operations by computing the vector values of the forget gate, input gate, output gate, and candidate state unit of the LSTM operation, and then derives its output value from these vector values; finally, the output values of all the data processing modules are spliced to obtain the final output value.
As a preferred embodiment, the present invention discloses an apparatus for performing LSTM neural network operations, comprising:
a direct memory access unit, used for fetching the instructions and data required by the LSTM neural network operation from an address space external to the apparatus, delivering them to the instruction cache unit and the data cache units respectively, and writing the operation results back to the external address space from the data processing modules or data cache units;
an instruction cache unit, used for caching the instructions that the direct memory access unit fetches from the external address space and feeding them to the controller unit;
a controller unit, used for reading instructions from the instruction cache unit, decoding them into microinstructions, and controlling the direct memory access unit to perform data IO, the data processing modules to perform the relevant computations, and the data cache units to cache and transfer data;
a plurality of data cache units arranged in parallel, used for caching the data, states, and results required by the operation;
a plurality of data processing modules arranged in parallel, used for obtaining the input data, weights, and biases required by the operation from their corresponding data cache units, performing the LSTM neural network operation, and delivering the results to the corresponding data cache units or the direct memory access unit; the data processing modules correspond one-to-one to the data cache units and execute operations in parallel with one another.
Preferably, the direct memory access unit, the instruction cache unit, the controller unit, the plurality of data cache units, and the plurality of data processing modules are implemented by hardware circuits.
Preferably, the data cache unit is further configured to cache intermediate results computed by the data processing module; the weights and biases are imported from the direct memory access unit only once during the entire execution and are not changed thereafter.
Preferably, the data processing modules all perform the LSTM neural network operation using a vector dot-multiplication unit, a vector addition unit, a vector summation unit, and a vector nonlinear-function transformation unit.
Preferably, the vector nonlinear-function transformation unit performs function evaluation by table lookup.
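A table-lookup evaluation of the nonlinear functions, as this preferred embodiment suggests, can be sketched by precomputing the function on a uniform grid and replacing each call with an index computation plus linear interpolation. The table size and input range below are illustrative choices, not values specified by the patent:

```python
import math

class LookupTable:
    """Approximate f on [lo, hi] with n precomputed samples + linear interpolation."""
    def __init__(self, f, lo=-8.0, hi=8.0, n=1024):
        self.lo, self.hi = lo, hi
        self.step = (hi - lo) / (n - 1)
        self.table = [f(lo + k * self.step) for k in range(n)]

    def __call__(self, x):
        if x <= self.lo:              # saturate outside the table range
            return self.table[0]
        if x >= self.hi:
            return self.table[-1]
        pos = (x - self.lo) / self.step
        k = int(pos)
        frac = pos - k                # interpolate between neighboring samples
        return self.table[k] * (1 - frac) + self.table[k + 1] * frac

tanh_lut = LookupTable(math.tanh)
sigmoid_lut = LookupTable(lambda x: 1.0 / (1.0 + math.exp(-x)))
```

Larger tables trade on-chip storage for accuracy; a hardware version might drop the interpolation and simply round to the nearest table entry.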
Preferably, the plurality of data processing modules perform parallel operations as follows:
step 4: each data processing module determines whether the vector values of the forget gate, input gate, and candidate state unit have all been computed. If so, the new state unit is computed: the old state unit and the forget-gate vector are sent to the data processing module, a partial result is obtained with the vector dot-multiplication unit and sent back to the data cache unit; then the candidate state unit and input-gate values are sent to the data processing module and another partial result is obtained with the vector dot-multiplication unit; the partial result held in the data cache unit is sent to the data processing module, the updated state unit is obtained with the vector summation submodule and sent back to the data cache unit, while the copy of the updated state unit in the data processing module is transformed with the nonlinear function tanh. Each data processing module then determines whether both the tanh transform of the updated state unit and the output gate have been computed; if so, the output gate and the tanh-transformed state unit are combined with the vector dot-multiplication unit to obtain the final output value, which is written back to the data cache unit;
step 5: after the output values of all the data processing modules have been written back to the data cache units, the output values of all the data processing modules are spliced to obtain the final output value, which is sent to the externally designated address through the direct memory access unit.
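The round trips of step 4 between a data cache unit and its data processing module can be traced in a small sketch; the `cache` dict stands in for the data cache unit, and `dot`/`add` stand in for the vector dot-multiplication (elementwise product) and vector addition units. All names are illustrative:

```python
import math

def step4_cell_update(cache, dot, add):
    """Trace the step-4 state-unit update.

    cache holds the vectors 'c_old' (old state unit), 'f' (forget gate),
    'i' (input gate), 'g' (candidate state unit), and 'o' (output gate).
    """
    # first partial result: f * c_old, sent back to the cache
    cache['partial'] = dot(cache['f'], cache['c_old'])
    # second partial result: i * g, summed with the cached partial result
    part2 = dot(cache['i'], cache['g'])
    cache['c_new'] = add(cache['partial'], part2)
    # tanh of the updated state unit, combined with the output gate
    c_t = [math.tanh(x) for x in cache['c_new']]
    cache['h'] = dot(cache['o'], c_t)
    return cache['h']
```

The sketch makes the dependency ordering explicit: the output multiply can only fire once both the tanh of the new state unit and the output gate are available, which is exactly the check the data dependency judgment submodule performs.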
The invention also discloses a device for executing the LSTM neural network operation, which comprises the following components:
a memory;
a processor that performs the following operations:
step 6: splice the final output values from the different address spaces of each memory to obtain the final output value.
The invention also discloses an operation method of the LSTM neural network, which comprises the following steps:
step S1: read the weights and biases for the LSTM neural network operation from an externally designated address space, write them into a plurality of data cache units arranged in parallel, and initialize the state unit of each data cache unit; the weights and biases read from the externally designated address space are partitioned according to the neurons of the LSTM operation and sent to the corresponding data cache units, each data cache unit receiving the same number of weights and biases;
step S2: read the input data from an externally designated address space and write it into the plurality of data cache units, each data cache unit receiving a complete copy of the input data;
step S3: a plurality of data processing modules, corresponding one-to-one to the plurality of parallel data cache units, each read the weights, biases, and input data from their corresponding data cache unit and perform the LSTM neural network operation using a vector dot-multiplication unit, a vector addition unit, a vector summation unit, and a vector nonlinear-function transformation unit, obtaining the output value of each data processing module;
step S4: splice the output values of the data processing modules to obtain the final output value, which is the final result of the LSTM neural network operation.
Preferably, in step S3 each data processing module divides the weights and input data in its corresponding data cache unit into several parts, where the number of weights or input values in each part equals the number of operations performed at once by the vector operation unit in a single data processing module. Each data cache unit sends one part of the weights and input data to its data processing module at a time; a partial sum is computed, the previously obtained partial sum is fetched from the data cache unit, the two are combined by vector addition into a new partial sum, and the new partial sum is sent back to the data cache unit, the initial value of the partial sum being the bias value;
after all the input data have been sent to the data processing module once, the accumulated partial sum is the net activation of the neuron; the neuron's net activation is then sent to the data processing module, and the neuron's output value is obtained through a tanh or sigmoid nonlinear transformation in the data operation submodule. Using different weights and biases in this way, the vector values of the forget gate, input gate, output gate, and candidate state unit of the LSTM neural network are computed respectively;
each data processing module determines whether the vector values of the forget gate, input gate, and candidate state unit have all been computed; if so, the new state unit is computed: the old state unit and the forget-gate vector are sent to the data processing module, a partial result is obtained with the vector dot-multiplication unit and sent back to the data cache unit; then the candidate state unit and input-gate values are sent to the data processing module and another partial result is obtained with the vector dot-multiplication unit; the partial result held in the data cache unit is sent to the data processing module, the updated state unit is obtained with the vector summation submodule and sent back to the data cache unit, while the copy of the updated state unit in the data processing module is transformed with the nonlinear function tanh. Each data processing module then determines whether both the tanh transform of the updated state unit and the output gate have been computed; if so, the output gate and the tanh-transformed state unit are combined with the vector dot-multiplication unit to obtain the final output value, which is written back to the data cache unit.
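The chunked partial-sum scheme of step S3 (split each neuron's weights into vector-width pieces and accumulate, with the partial sum initialized to the bias) can be sketched as follows; `width` plays the role of the vector unit's per-pass operation count, and the names are illustrative:

```python
def net_activation(weights, bias, x, width):
    """Accumulate the dot product w.x in chunks of `width`, starting from
    the bias, mirroring how a data cache unit feeds its processing module."""
    partial = bias                      # initial partial sum is the bias value
    for start in range(0, len(x), width):
        w_part = weights[start:start + width]
        x_part = x[start:start + width]
        # one pass through the vector unit: multiply the chunk, then add
        # the result to the previously cached partial sum
        partial += sum(w * v for w, v in zip(w_part, x_part))
    return partial                      # the neuron's net activation
```

Because addition is associative here, the chunk width changes only how many cache round trips occur, not the resulting net activation.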
Preferably, the tanh and sigmoid nonlinear functions are evaluated by table lookup.
Other aspects, advantages and salient features of the invention will become apparent to those skilled in the art from the following detailed description of exemplary embodiments of the invention, which description is to be taken in conjunction with the accompanying drawings.
The invention discloses an apparatus and method for LSTM network computation, which can be used to accelerate applications that use LSTM networks. The method specifically comprises the following steps:
(1) taking out weights and offsets used in LSTM network operation from an external designated address space through a direct memory access unit, and writing the weights and offsets into each data cache unit, wherein the weights and the offsets are taken out from the external designated address space, are divided and are sent into each data cache unit, the weights and the offsets in each data cache unit are the same in quantity, the weights and the offsets in each data cache unit correspond to neurons, and state units in the data cache units are initialized;
(2) the input data is taken out from an external appointed address space through a direct memory access unit and written into a data cache unit, wherein each data cache unit acquires a complete input data;
(3) dividing the weight and the input data in each data cache unit into a plurality of parts, wherein the weight or the number of the input data of each part is the same as the number of operations of the vector operation unit in a corresponding single data processing module, sending one part of the weight and the input data into the data processing module each time, calculating to obtain a partial sum, taking out the previously obtained partial sum from the data cache unit, carrying out vector addition on the partial sum to obtain a new partial sum, and sending the new partial sum back to the data cache unit. Wherein the initial value of the partial sum is the offset value. After all input data are sent to the data processing module once, the obtained part sum is the net activation amount corresponding to the neuron, then the net activation amount of the neuron is sent to the data processing module, the output value of the neuron is obtained through nonlinear function tanh or sigmoid function transformation in the data operation submodule, and the function transformation can be carried out through a table look-up method and a function operation method. By using different weights and offsets in this way, the vector values of the forgetting gate, the input gate, the output gate and the candidate state unit in the LSTM network can be respectively calculated. In the same data processing module, vector operation instructions are adopted in the process of calculating partial sums, and parallelism exists among data. And then, the data dependence judgment submodule in each data processing module judges whether the calculation of the vector values of the current forgetting gate, the input gate and the to-be-selected state unit is finished or not, and if so, the calculation of a new state unit is carried out. 
Firstly, the old state unit and the forgetting gate vector values are sent to the data processing module, a partial sum is obtained through the vector dot multiplication component in the data operation sub-module and sent back to the data cache unit. Then, the candidate state unit and the input gate values are sent to the data processing module, a partial sum is obtained through the vector dot multiplication component, the partial sum in the data cache unit is sent to the data processing module, and the updated state unit is obtained through the vector addition component in the data operation sub-module; the updated state unit is then sent back to the data cache unit, and at the same time is transformed in the data processing module by the nonlinear function tanh in the data operation sub-module. The data dependency discrimination sub-module in each data processing module then judges whether the nonlinear transformation of the updated state unit and the output gate have both been calculated; if so, the vector dot multiplication component in the data operation sub-module computes the dot product of the output gate and the tanh-transformed updated state unit to obtain the final output value, which is written back to the data cache unit. Throughout the whole operation process, there is no data dependency or data conflict among different data processing modules, so they can always process in parallel.
(4) After the output values in all the data processing modules have been written back to the data cache units, the output values of all the data processing modules are spliced to obtain the final output value, which is sent to the external designated address through the direct memory access unit.
(5) Judge whether the LSTM network needs to produce an output at the next moment; if so, go to (2), otherwise end the operation.
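The partial-sum scheme of step (3) can be sketched in software. The following Python fragment (names such as `gate_net_activation` are illustrative, not from the patent) emulates how a gate's net activation accumulates chunk by chunk, starting from the offset value:

```python
import numpy as np

def gate_net_activation(weight_chunks, input_chunks, bias):
    """Accumulate a gate's net activation from chunked weights/inputs.

    Hypothetical sketch of step (3): the partial sum starts at the
    offset (bias) value, and each (weight, input) chunk contributes
    one vector-wide update via a vector addition.
    """
    partial = bias.copy()          # initial partial sum is the offset value
    for w, x in zip(weight_chunks, input_chunks):
        partial = partial + w @ x  # add this chunk's contribution
    return partial                 # net activation of the neurons

# e.g. a gate over 4 neurons with the input split into two chunks
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
x = rng.standard_normal(8)
b = rng.standard_normal(4)
net = gate_net_activation([W[:, :4], W[:, 4:]], [x[:4], x[4:]], b)
```

The gate's output value would then be `sigmoid(net)` or `tanh(net)`, matching the nonlinear transformation described above.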
Fig. 1 is a schematic diagram illustrating an overall structure of an apparatus for performing LSTM network operations according to an embodiment of the present invention. As shown in fig. 1, the apparatus includes a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, which may all be implemented by hardware circuits.
The direct memory access unit 1 can access an external address space, and can read and write data to each cache unit in the device to complete loading and storing of the data. The method specifically comprises the steps of reading an instruction from the instruction cache unit 2, reading a weight value, bias and input data required by LSTM network operation from a designated storage unit to the data cache unit 4, and directly writing the output after operation into an external designated space from the data cache unit 4.
The instruction cache unit 2 reads the instructions through the direct memory access unit 1 and caches the read instructions.
The controller unit 3 reads the instruction from the instruction cache unit 2, decodes the instruction into a microinstruction for controlling the behavior of other modules, and sends the microinstruction to other modules such as the direct memory access unit 1, the data cache unit 4, the data processing module 5, and the like.
The data cache unit 4 initializes the state unit of the LSTM when the device is initialized, and reads weights and offsets from the external designated address through the direct memory access unit 1. The weights and offsets read into each data cache unit 4 correspond to the neurons to be calculated by that unit, i.e. they are a part of the total weights and offsets, and the contents of all the data cache units 4 together make up the weights and offsets read from the external designated address. During operation, input data is first obtained from the direct memory access unit 1, each data cache unit 4 receiving a copy, and the partial sum is initialized to the offset value. A part of the weights, offsets and input values is then sent to the data processing module 5, where an intermediate value is calculated; the intermediate value is read out of the data processing module 5 and stored in the data cache unit 4. When all the data have been operated on once, the partial sum is input to the data processing module 5, the neuron output is obtained by calculation and written back to the data cache unit 4, finally yielding the vector values of the input gate, the output gate, the forgetting gate and the candidate state unit. Then the forgetting gate and the old state unit are sent into the data processing module 5, a partial sum is calculated and written back to the data cache unit 4; the candidate state unit and the input gate are sent into the data processing module 5 and a partial sum is calculated; the partial sum in the data cache unit 4 is written into the data processing module 5 and added by vector addition to the newly calculated partial sum to obtain the updated state unit, which is written back to the data cache unit 4.
The output gate is sent to the data processing module 5, vector dot multiplication is performed on the output gate and the value after the nonlinear transformation function tanh transformation of the updated state unit to obtain an output value, and the output value is written back to the data cache unit 4. Finally, each data cache unit 4 obtains a corresponding updated state unit and an output value, and the output values in all the data cache units 4 are combined to obtain a final output value. Finally, each data cache unit 4 writes back the partial output value obtained by the data cache unit to the external designated address space through the direct memory access unit 1.
The corresponding operations in the LSTM network are as follows:
f_t = σ(W_f·[h_{t-1}, x_t] + b_f);
i_t = σ(W_i·[h_{t-1}, x_t] + b_i);
c̃_t = tanh(W_c·[h_{t-1}, x_t] + b_c);
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t;
o_t = σ(W_o·[h_{t-1}, x_t] + b_o);
h_t = o_t ⊙ tanh(c_t);
wherein x_t is the input data at time t, and h_{t-1} denotes the output data at time t-1; W_f, W_i, W_c and W_o respectively denote the weight vectors corresponding to the forgetting gate, the input gate, the update state unit and the output gate; b_f, b_i, b_c and b_o respectively denote the corresponding offsets of the forgetting gate, the input gate, the update state unit and the output gate. f_t denotes the output of the forgetting gate, which is dot-multiplied with the state unit at time t-1 to selectively forget past state values; i_t denotes the output of the input gate, which is dot-multiplied with the candidate state value at time t to selectively add the candidate state value at time t to the state unit; c̃_t denotes the candidate state value calculated at time t; c_t denotes the new state value obtained by selectively forgetting the state value at time t-1 and selectively adding the candidate state value at time t, which is used when calculating the final output and is also transmitted to the next time; o_t denotes the selection condition of the part of the state unit at time t that is to be output as the result; h_t denotes the output at time t, which is also transmitted to the next time. ⊙ is the element-wise product of vectors; σ is the sigmoid function, with formula σ(x) = 1/(1 + e^{-x}); the activation function tanh is calculated by tanh(x) = (e^x − e^{-x})/(e^x + e^{-x}).
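Assuming the standard LSTM formulation above, a minimal NumPy emulation of one time step might look as follows; the dictionary layout for the weights and offsets is an assumption for illustration, not the device's actual storage format:

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the patent's formulas.

    W and b are dicts keyed by 'f', 'i', 'c', 'o' (hypothetical layout);
    each weight multiplies the concatenation [h_{t-1}, x_t].
    """
    hx = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W['f'] @ hx + b['f'])      # forgetting gate
    i_t = sigmoid(W['i'] @ hx + b['i'])      # input gate
    c_tilde = np.tanh(W['c'] @ hx + b['c'])  # candidate state value
    c_t = f_t * c_prev + i_t * c_tilde       # updated state unit
    o_t = sigmoid(W['o'] @ hx + b['o'])      # output gate
    h_t = o_t * np.tanh(c_t)                 # output value
    return h_t, c_t

# tiny check: with all-zero weights and state, the state stays zero
W0 = {k: np.zeros((3, 5)) for k in ('f', 'i', 'c', 'o')}
b0 = {k: np.zeros(3) for k in ('f', 'i', 'c', 'o')}
h1, c1 = lstm_step(np.zeros(2), np.zeros(3), np.zeros(3), W0, b0)
```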
The data processing module 5 each time reads a part of the weights W_i/W_f/W_o/W_c, the offsets b_i/b_f/b_o/b_c, and the corresponding input data [h_{t-1}, x_t] from the corresponding data cache unit 4, and completes the partial sum calculation through the vector multiplication component and the summation component in the data processing module 5, until all input data of each neuron have been operated on once, at which point the net activation amount net_i/net_f/net_o/net_c of the neuron is obtained. The output values are then calculated by the vector nonlinear sigmoid or tanh function conversion; in this way the calculations of the input gate i_t, the forgetting gate f_t, the output gate o_t and the candidate state unit c̃_t are completed in turn. Then the vector dot multiplication components in the data processing module 5 calculate the dot products of the old state unit with the forgetting gate and of the candidate state unit with the input gate, and the two results are added by the vector addition component to obtain the new state unit c_t. The newly obtained state unit is written back to the data cache unit 4. The state unit is transformed by the tanh function using the vector nonlinear function conversion component in the data processing module 5 to obtain tanh(c_t); in this calculation the value of the tanh function can be computed directly or obtained by table lookup. Then the dot product of the output gate and the tanh-transformed state unit is calculated by the vector dot multiplication component to obtain the final neuron output value h_t. Finally, the neuron output value h_t is written back to the data cache unit 4.
FIG. 2 is a schematic diagram of a data processing module of an apparatus for performing LSTM network operations according to an embodiment of the present invention;
as shown in fig. 2, the data processing module 5 includes a data processing control sub-module 51, a data dependency discrimination sub-module 52, and a data operation sub-module 53.
Among them, the data processing control sub-module 51 controls the operations performed by the data operation sub-module 53, and controls the data dependency discrimination sub-module 52 to determine whether the current operation has a data dependency. For operations that cannot involve a data dependency, the data processing control sub-module 51 directly controls the data operation sub-module 53 to perform them; for operations that may have a data dependency, the data processing control sub-module 51 first controls the data dependency discrimination sub-module 52 to judge whether the current operation has a data dependency. If so, the data processing control sub-module 51 inserts null operations into the data operation sub-module 53, and controls the data operation sub-module 53 to perform the data operation after the data dependency is released.
The data dependency discrimination sub-module 52 is controlled by the data processing control sub-module 51 to check whether a data dependency exists in the data operation sub-module 53. If the next operation needs a value whose computation has not yet finished, a data dependency currently exists; otherwise it does not. One method of detecting data dependencies uses registers R1, R2, R3, R4 and R5 in the data operation sub-module 53, which respectively mark whether the input gate, the forgetting gate, the output gate, the candidate state unit, and the tanh function transformation of the updated state unit have been computed: a register value of non-zero indicates the operation is complete, and 0 indicates it is not. Corresponding to the LSTM network, the data dependency discrimination sub-module 52 judges two data dependencies: when calculating the new state unit, whether a dependency exists among the input gate, the forgetting gate and the candidate state unit, i.e. whether R1, R2 and R4 are all non-zero; and when calculating the output value, whether a dependency exists between the output gate and the tanh function transformation of the updated state unit, i.e. whether R3 and R5 are both non-zero. After the judgment is completed, the result is transmitted back to the data processing control sub-module 51.
The data operation sub-module 53 is controlled by the data processing control sub-module 51 to complete the data processing in the network operation process. The data operation sub-module 53 includes a vector dot multiplication component, a vector addition component, a vector summation component and a vector nonlinear function conversion component, as well as the registers R1, R2, R3, R4 and R5 that mark whether the related data operations are completed. Registers R1, R2, R3, R4 and R5 respectively mark whether the input gate, the forgetting gate, the output gate, the candidate state unit, and the tanh function transformation of the updated state unit have been computed; a register value of non-zero indicates the operation is complete, and 0 indicates it is not. The vector addition component adds the corresponding positions of two vectors to obtain a vector; the vector summation component divides a vector into several segments and sums within each segment, so the length of the resulting vector equals the number of segments. The vector nonlinear function conversion component takes each element of a vector as input and produces the output after nonlinear function transformation. The specific nonlinear transformation can be accomplished in two ways. Taking the sigmoid function with input x as an example, one way is to compute sigmoid(x) directly by function evaluation; the other is by table lookup: the data operation sub-module 53 maintains a table of the sigmoid function, recording the outputs y_1, y_2, …, y_n corresponding to inputs x_1, x_2, …, x_n (x_1 < x_2 < … < x_n). To obtain the function value for x, first find the interval [x_i, x_{i+1}] satisfying x_i < x < x_{i+1}, then calculate y_i + (y_{i+1} − y_i)·(x − x_i)/(x_{i+1} − x_i) as the output value. In the LSTM network operation process, the following operations are carried out:
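The table-lookup transformation with linear interpolation can be sketched as follows in Python; the table size and input range are arbitrary choices for illustration, not values from the patent:

```python
import numpy as np

def make_sigmoid_table(lo=-8.0, hi=8.0, n=1024):
    """Precompute a sigmoid lookup table over [lo, hi] with n entries."""
    xs = np.linspace(lo, hi, n)
    ys = 1.0 / (1.0 + np.exp(-xs))
    return xs, ys

def sigmoid_lookup(x, xs, ys):
    """Table-lookup sigmoid: find the interval [x_i, x_{i+1}] containing x,
    then linearly interpolate between y_i and y_{i+1}."""
    i = np.searchsorted(xs, x) - 1
    i = np.clip(i, 0, len(xs) - 2)          # clamp to the table's range
    t = (x - xs[i]) / (xs[i + 1] - xs[i])   # position inside the interval
    return ys[i] + t * (ys[i + 1] - ys[i])

xs, ys = make_sigmoid_table()
approx = sigmoid_lookup(0.3, xs, ys)
```

With 1024 entries over [-8, 8] the interpolation error is far below typical fixed-point precision, which is why a lookup table is a reasonable hardware alternative to direct evaluation.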
First, R1, R2, R3, R4 and R5 are set to 0, and the input gate partial sum is initialized with the offset. A temporary value is calculated by the vector dot multiplication component from a part of the input data and the corresponding weights; the temporary value is then segmented according to the temporary value vectors corresponding to different neurons, the vector summation component completes the summation over the temporary values, and the result is used to update the input gate partial sum. The same operation is performed with the remaining input data and weights to update the partial sum; after all input data have been operated on once, the resulting partial sum is the net activation amount of the neurons, and the output value of the input gate is then obtained through the vector nonlinear transformation component. The output value is written back into the data cache unit 4 and the R1 register is set to non-zero.
The output values of the forgetting gate, the output gate and the candidate state unit are calculated by the same method used for the input gate output, the corresponding output values are written back to the data cache unit 4, and the registers R2, R3 and R4 are set to non-zero.
A null operation is executed, or the operation of the updated state unit is performed, according to the control command of the data processing control sub-module 51. The operation of the updated state unit is: take the forgetting gate output value and the old state unit from the data cache unit 4 and calculate their partial sum through the vector dot multiplication component; then take the input gate output value and the candidate state unit from the data cache unit 4 and calculate their partial sum through the vector dot multiplication component; the updated state unit is obtained from the two partial sums through the vector addition component. The updated state unit is finally written back into the data cache unit 4.
A null operation is executed, or the operation of the LSTM network output value is performed, according to the control command of the data processing control sub-module 51. The output value is calculated as follows: the nonlinear transformation value of the updated state unit is calculated using the vector nonlinear function conversion component, and R5 is set to non-zero. Then the dot product of the output gate and the nonlinear transformation value of the state unit is computed using the vector dot multiplication component, yielding the final output value, i.e. the output value of the corresponding neurons of the LSTM network. The output value is written back to the data cache unit 4.
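The R1–R5 completion flags and the two dependency checks described above can be mirrored in a small software sketch (the class and method names are hypothetical, chosen only to make the two checks explicit):

```python
class DependencyFlags:
    """Sketch of the R1..R5 completion flags used by the data dependency
    discrimination sub-module. A non-zero value marks an operation done."""

    def __init__(self):
        # R1: input gate, R2: forgetting gate, R3: output gate,
        # R4: candidate state unit, R5: tanh of the updated state unit
        self.R = {k: 0 for k in ('R1', 'R2', 'R3', 'R4', 'R5')}

    def set_done(self, name):
        self.R[name] = 1  # non-zero marks the operation as complete

    def state_update_ready(self):
        # the new state unit needs input gate, forgetting gate,
        # and candidate state unit (R1, R2, R4 all non-zero)
        return all(self.R[k] != 0 for k in ('R1', 'R2', 'R4'))

    def output_ready(self):
        # the final output needs the output gate and the tanh of the
        # updated state unit (R3 and R5 both non-zero)
        return self.R['R3'] != 0 and self.R['R5'] != 0
```

While a check fails, the data processing control sub-module would keep issuing null operations, exactly as in the flow above.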
Fig. 3 illustrates a flow diagram for performing LSTM network operations provided in accordance with an embodiment of the present invention.
In step S1, an IO instruction is stored in advance at the head address of the instruction cache unit 2.
In step S2, the controller unit 3 reads the IO instruction from the first address of the instruction cache unit 2, and according to the translated micro instruction, the direct memory access unit 1 reads all instructions related to LSTM network computation from the external address space and caches the instructions in the instruction cache unit 2.
In step S3, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the translated microinstruction, the direct memory access unit 1 reads the weight and offset related to the LSTM network operation from the external specified address space, including the weight and offset of the input gate, the output gate, the forgetting gate, and the to-be-selected state unit, and according to the difference of the neurons corresponding to the weight, the weight and the offset are divided and then read into different data cache modules 4.
In step S4, the controller unit 3 reads a state unit initialization instruction from the instruction cache unit 2, and according to the translated microinstruction initializes the state unit values in the data cache modules 4, setting the partial sums of the input gate, the output gate, the forgetting gate and the candidate state unit to the corresponding neuron offset values.
In step S5, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the translated microinstruction, the direct memory access unit 1 reads the input values from the external designated address space into the data cache units 4, each data cache unit 4 receiving the same input value vector.
In step S6, the controller unit 3 reads a data processing instruction from the instruction cache unit 2, and according to the translated microinstruction, the data processing module 5 obtains the relevant data required for operation from the corresponding data cache unit 4 to perform the operation, the result of the operation is the output value of a part of neurons corresponding to a time point, the output values obtained by processing by all the data processing modules 5 are combined and correspond to the output value at a time point, and the detailed processing procedure is shown in fig. 4. After the processing is finished, the data processing module 5 stores the intermediate value or the output value and the state unit value obtained by the processing into the data cache unit 4.
In step S7, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and concatenates the output values in the data cache unit 4 according to the translated microinstruction and outputs the output values to the external designated address through the direct memory access unit 1.
In step S8, the controller unit 3 reads a discrimination instruction from the instruction cache unit 2, and based on the translated microinstruction, the controller unit 3 determines whether the forward process is completed, and if so, ends the operation. If not, the routine proceeds to S6.
Fig. 4 shows a detailed flowchart of a data processing procedure in a method for performing LSTM network operations according to an embodiment of the present invention.
In step S1, the data processing module 5 reads in the weight values and input values of a part of the input gates from the data buffer unit 4.
In step S2, the data processing control sub-module 51 in the data processing module 5 controls the vector dot multiplication component in the data operation sub-module 53 to calculate the dot product of the input gate weights and the input values, then groups the results according to the neuron to which each belongs, and sums the dot product results within each group using the vector summation component in the data operation sub-module 53 to obtain a partial sum.
In step S3, the data processing module 5 reads in the input gate portion from the data buffer unit 4.
In step S4, the data processing control sub-module 51 in the data processing module 5 controls the data operation sub-module 53 to add the calculated partial sum and the partial sum just read in to obtain the updated input gate partial sum.
In step S5, the data processing module 5 writes the updated partial sum into the data cache module 4.
In step S6, the data processing module 5 determines whether all the input gate weights have been operated on once; if so, the partial sum in the data cache unit is the value of the input gate, and the R1 register is set to non-zero; otherwise, go to S1 and continue the operation with a different part of the input gate weights and input values.
In step S7, the same operations are performed to obtain the forgetting gate output value, the output gate output value and the candidate state unit output value, R2, R3 and R4 are set to non-zero, and the output values are all written back to the data cache unit 4.
In step S8, the data processing control sub-module 51 in the data processing module 5 controls the data dependency determining sub-module 52 to determine whether operations are completed among the forgetting gate, the input gate, and the candidate state unit, i.e., whether R1, R2, and R4 are all non-zero, if not, the data processing control sub-module 51 controls the data operation sub-module 53 to perform a null operation, and then turns to S8 to continue operation, and if so, turns to S9 operation.
In step S9, the data processing module 5 reads the old status cell and the forgotten gate output value from the data buffer unit 4.
In step S10, the data processing control sub-module 51 in the data processing module 5 controls the data operation sub-module 53 to calculate a partial sum of the old status cells and the forgotten gate output values by the vector dot-product section.
In step S11, the data processing module 5 reads the candidate state cell and the input gate output value from the data buffer unit 4.
In step S12, the data processing control sub-module 51 in the data processing module 5 controls the data operation sub-module 53 to calculate a partial sum of the candidate state unit and the input gate output value with the vector dot multiplication component, and to add it to the partial sum calculated in S10 with the vector addition component to obtain the updated state unit.
In step S13, the data processing module 5 returns the updated status unit to the data caching unit 4.
In step S14, the data processing control sub-module 51 in the data processing module 5 controls the data operation sub-module 53 to calculate the transform value of the nonlinear transform function tanh of the state cell by using the vector nonlinear transform means for the updated state cell, and set R5 to be nonzero.
In step S15, the data processing control sub-module 51 in the data processing module 5 controls the data dependency discrimination sub-module 52 to judge whether the output gate output value and the tanh nonlinear transformation value of the state unit have both been calculated, i.e. whether R3 and R5 are both non-zero; if not, the data processing control sub-module 51 controls the data operation sub-module 53 to perform a null operation and then returns to S15 to continue; if so, it proceeds to S16.
In step S16, the data processing module 5 reads in the output of the output gate from the data buffer unit 4.
In step S17, the data processing control sub-module 51 in the data processing module 5 controls the data operation sub-module 53 to calculate the dot product of the output gate output value and the tanh nonlinear transformation value of the state unit with the vector dot multiplication component, obtaining the output value of the neurons of the LSTM network corresponding to the data processing module 5.
In step S18, the data processing module 5 writes the output value into the data buffer unit 4.
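The per-neuron grouping and reduction of step S2 corresponds to the vector summation component described earlier, which splits a vector into segments and sums within each segment. A minimal sketch, assuming equal-length segments (the function name is illustrative):

```python
import numpy as np

def vector_segment_sum(v, n_segments):
    """Vector summation component sketch: split v into n_segments
    equal-length segments and sum within each; the result length
    equals the number of segments (one value per neuron group)."""
    seg = np.asarray(v).reshape(n_segments, -1)  # one row per segment
    return seg.sum(axis=1)

# e.g. 6 per-element products grouped into 2 neurons of 3 elements each
out = vector_segment_sum(np.arange(6), 2)
```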
The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software embodied on a non-transitory computer readable medium), or a combination thereof. Although the processes or methods are described above in terms of some sequential operations, it should be understood that some of the operations described may be performed in a different order. Further, some operations may be performed in parallel rather than sequentially.
The device disclosed by the invention uses a specially designed instruction set, so instruction decoding is more efficient. The multiple data processing modules compute in parallel, and the multiple data cache units operate in parallel without data transmission between them, which greatly improves the parallelism of the operation. In addition, placing the weights and offsets in the data cache units reduces the IO operations between the device and the external address space, and reduces the bandwidth required for memory access.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (19)
1. An apparatus for performing LSTM neural network operations, comprising:
the data cache units are arranged in parallel and used for caching data, states and results required by the LSTM neural network operation;
the data processing modules are arranged in parallel and used for acquiring input data and weights and offsets required during operation from the corresponding data cache units and performing LSTM neural network operation; the data processing modules correspond to the data cache units one by one.
2. The apparatus of claim 1, wherein the apparatus further comprises a direct memory access unit, an instruction cache unit, and a controller unit, wherein,
the direct memory access unit is used for acquiring data required by the operation of an instruction and an LSTM neural network from an external address space, respectively transmitting the instruction and the data to the instruction cache unit and the data cache unit, and writing back an operation result to the external address space from the data processing module and the data cache unit;
the instruction cache unit is used for caching the instruction acquired by the direct memory access unit from an external address space and inputting the instruction into the controller unit;
the controller unit is used for reading the instruction from the instruction cache unit, decoding the instruction into a micro instruction, controlling the direct memory access unit to perform data IO operation, controlling the data processing module to perform operation, and controlling the data cache unit to perform data caching and transmission.
3. The apparatus of claim 1, wherein the data caching unit is further configured to cache intermediate results computed by the data processing module, and to import weights and offsets from the direct memory access unit during execution of the LSTM neural network operation.
4. The apparatus of claim 1, wherein weights and offsets divided corresponding to the neurons operated on by the LSTM neural network are buffered in each of the data buffer units, wherein the number of weights and offsets in each of the data buffer units is the same, and complete input data is buffered in each of the data buffer units.
5. The apparatus of claim 1, wherein the data processing module performs the LSTM neural network operations using a vector dot product component, a vector add component, a vector sum component, and a vector nonlinear function conversion component.
6. The apparatus of claim 5, wherein the vector nonlinear function conversion component performs the function operation by table lookup.
7. The apparatus according to any one of claims 2 to 6, wherein each of the data processing modules divides the weight and the input data in the corresponding data buffer unit into a plurality of parts, wherein the number of the weight or the input data in each part is the same as the number of the vector operation unit operations in the corresponding data processing module;
the controller unit is further configured to send a weight and input data from the data buffer unit to the corresponding data processing module to calculate a partial sum, then take out the previous partial sum from the data buffer unit and send the previous partial sum to the data processing module, so that the data processing module performs vector addition on the partial sum to obtain a new partial sum, and send the new partial sum back to the data buffer unit, where an initial value of the partial sum is an offset value.
8. The apparatus of claim 1, wherein each of the data processing modules performs vector operations by respectively calculating vector values of a forgetting gate, an input gate, an output gate, and a candidate state unit in LSTM network operations, obtains an output value of each of the data processing modules from the vector values, and finally splices the output values of the data processing modules to obtain a final output value.
9. The apparatus of claim 1, wherein the data processing module comprises a data processing control sub-module, a data dependency discrimination sub-module, and a data operation sub-module, wherein,
the data processing control sub-module controls the operation of the data operation sub-module and controls the data dependence judgment sub-module to judge whether the current operation has data dependence.
10. An operation method of an LSTM neural network is applied to an LSTM neural network operation device, the LSTM neural network operation device comprises a plurality of data cache units arranged in parallel and a plurality of data processing modules corresponding to the data cache units one by one, and the method comprises the following steps:
the target data processing module obtains input data and weight and bias required during operation from the corresponding data cache unit, carries out LSTM neural network operation, and caches the result obtained by the operation to the corresponding data cache unit, wherein the target data processing module is any one of the data processing modules.
11. The method of claim 10, wherein the apparatus further comprises a direct memory access unit, an instruction cache unit, and a controller unit, the method further comprising:
the direct memory access unit acquires instructions and data required by LSTM neural network operation from an external address space, and respectively transmits the instructions and the data to the instruction cache unit and the data cache unit;
the controller unit reads an instruction from the instruction cache unit, decodes it into microinstructions, controls the direct memory access unit to perform data I/O operations, controls the data processing modules to perform operations, and controls the data cache units to cache and transmit data;
the direct memory access unit writes the operation results from the data processing modules and the data cache units back to the external address space.
12. The method of claim 10, wherein the method further comprises:
caching the intermediate results calculated by the data processing module in the data cache unit, wherein during execution of the LSTM neural network operation the data cache unit imports weights and biases from the direct memory access unit.
13. The method of claim 10, wherein the method further comprises:
caching, in each data cache unit, the weights and biases partitioned in correspondence with the neurons operated on by the LSTM neural network, wherein each data cache unit holds the same number of weights and biases together with a complete copy of the input data.
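The even, per-neuron partitioning of claim 13 can be pictured as splitting the weight matrix by output rows while every cache unit keeps the full input vector. A minimal sketch under that assumed layout (the function names are mine, not the patent's):

```python
import numpy as np

def partition_across_units(W, b, num_units):
    """Split the rows of W (one row per output neuron) and the matching
    bias entries evenly across the cache units; every unit also keeps a
    complete copy of the input, as the claim requires."""
    W_parts = np.array_split(W, num_units, axis=0)
    b_parts = np.array_split(b, num_units)
    return list(zip(W_parts, b_parts))

def spliced_output(parts, x):
    # Each unit computes its own neuron slice from the full input x;
    # concatenating the slices reproduces the undivided result.
    return np.concatenate([W_k @ x + b_k for W_k, b_k in parts])
```

Because each unit sees the whole input, no inter-unit communication is needed until the final splicing step.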
14. The method of claim 10, wherein the data processing module performs the LSTM neural network operation using a vector dot product component, a vector addition component, a vector summation component, and a vector nonlinear function conversion component.
15. The method of claim 14, wherein the vector nonlinear function conversion component performs the function operation by table lookup.
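The table-lookup conversion of claim 15 is commonly realized by precomputing the activation function on a fixed grid and interpolating between entries. A hedged sketch of that idea (the grid range, entry count, and class name are arbitrary choices here, not specified by the patent):

```python
import numpy as np

class LookupActivation:
    """Approximate a nonlinear function by a precomputed table plus
    linear interpolation, standing in for a hardware lookup unit."""
    def __init__(self, fn, lo=-8.0, hi=8.0, entries=1024):
        self.xs = np.linspace(lo, hi, entries)   # uniform input grid
        self.ys = fn(self.xs)                    # precomputed outputs

    def __call__(self, x):
        # np.interp clamps to the endpoint values outside [lo, hi],
        # which matches the saturation of sigmoid/tanh at the tails
        return np.interp(x, self.xs, self.ys)

sigmoid_lut = LookupActivation(lambda z: 1.0 / (1.0 + np.exp(-z)))
```

With 1024 entries over [-8, 8], the linear-interpolation error for the sigmoid stays far below typical fixed-point quantization error, which is why such tables suffice in practice.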
16. The method according to any one of claims 11 to 15, wherein each data processing module divides the weights and input data in its corresponding data cache unit into several portions, wherein the number of weights or input data in each portion is equal to the number of operations of the vector operation unit in the corresponding data processing module;
the controller unit sends one portion of the weights and input data from the data cache unit to the corresponding data processing module to compute a partial sum, then fetches the previously obtained partial sum from the data cache unit and sends it to the data processing module, so that the data processing module performs vector addition on the two partial sums to obtain a new partial sum and sends the new partial sum back to the data cache unit, wherein the initial value of the partial sum is the bias value.
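The partial-sum procedure of claim 16 amounts to an accumulator that starts at the bias and is updated by vector addition as each portion's dot products arrive. A minimal sketch (variable names are illustrative; the portion split along the input dimension is an assumption):

```python
import numpy as np

def accumulate_partial_sums(weight_parts, input_parts, bias):
    """Compute sum_k(W_k @ x_k) + bias by repeatedly fetching the previous
    partial sum, adding the new portion's contribution, and storing the
    result back, mirroring the cache-unit round trips in the claim."""
    cache = bias.copy()                  # initial partial sum is the bias
    for W_k, x_k in zip(weight_parts, input_parts):
        contribution = W_k @ x_k         # one portion's dot products
        cache = cache + contribution     # vector addition of partial sums
    return cache
```

Splitting the input dimension into portions lets a fixed-width vector unit process weight matrices of arbitrary size, one portion per pass.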
17. The method of claim 10, wherein each data processing module performs vector operations by calculating the vector values of the forget gate, input gate, output gate, and candidate state unit of the LSTM network operation, derives its output value from these vector values, and finally the output values of the data processing modules are spliced to obtain the final output value.
18. The method of claim 10, wherein the data processing module comprises a data processing control sub-module, a data dependency determination sub-module, and a data operation sub-module, wherein
the data processing control sub-module controls the operation of the data operation sub-module and controls the data dependency determination sub-module to determine whether the current operation has a data dependency.
19. An apparatus for performing LSTM neural network operations, comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 10 to 18 when executing the computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010018716.XA CN111260025B (en) | 2016-12-30 | 2016-12-30 | Apparatus and method for performing LSTM neural network operation |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010018716.XA CN111260025B (en) | 2016-12-30 | 2016-12-30 | Apparatus and method for performing LSTM neural network operation |
CN201611269665.8A CN108268939B (en) | 2016-12-30 | 2016-12-30 | Apparatus and method for performing LSTM neural network operations |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611269665.8A Division CN108268939B (en) | 2016-12-30 | 2016-12-30 | Apparatus and method for performing LSTM neural network operations |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111260025A true CN111260025A (en) | 2020-06-09 |
CN111260025B CN111260025B (en) | 2023-11-14 |
Family
ID=62771289
Family Applications (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010018716.XA Active CN111260025B (en) | 2016-12-30 | 2016-12-30 | Apparatus and method for performing LSTM neural network operation |
CN202110713121.0A Active CN113537481B (en) | 2016-12-30 | 2016-12-30 | Apparatus and method for performing LSTM neural network operation |
CN201611269665.8A Active CN108268939B (en) | 2016-12-30 | 2016-12-30 | Apparatus and method for performing LSTM neural network operations |
CN202110708810.2A Active CN113537480B (en) | 2016-12-30 | 2016-12-30 | Apparatus and method for performing LSTM neural network operation |
Country Status (1)
Country | Link |
---|---|
CN (4) | CN111260025B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022206536A1 (en) * | 2021-03-29 | 2022-10-06 | 维沃移动通信有限公司 | Data processing method and apparatus, and chip |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110727462B (en) * | 2018-07-16 | 2021-10-19 | 上海寒武纪信息科技有限公司 | Data processor and data processing method |
WO2020061870A1 (en) * | 2018-09-27 | 2020-04-02 | 深圳大学 | Lstm end-to-end single-lead electrocardiogram classification method |
CN109543832B (en) * | 2018-11-27 | 2020-03-20 | 中科寒武纪科技股份有限公司 | Computing device and board card |
CN111258636B (en) * | 2018-11-30 | 2022-10-04 | 上海寒武纪信息科技有限公司 | Data processing method, processor, data processing device and storage medium |
US11494645B2 (en) * | 2018-12-06 | 2022-11-08 | Egis Technology Inc. | Convolutional neural network processor and data processing method thereof |
WO2020125092A1 (en) * | 2018-12-20 | 2020-06-25 | 中科寒武纪科技股份有限公司 | Computing device and board card |
CN109670581B (en) * | 2018-12-21 | 2023-05-23 | 中科寒武纪科技股份有限公司 | Computing device and board card |
US11042797B2 (en) | 2019-01-08 | 2021-06-22 | SimpleMachines Inc. | Accelerating parallel processing of data in a recurrent neural network |
CN110009100B (en) * | 2019-03-28 | 2021-01-05 | 安徽寒武纪信息科技有限公司 | Calculation method of user-defined operator and related product |
CN110020720B (en) * | 2019-04-01 | 2021-05-11 | 中科寒武纪科技股份有限公司 | Operator splicing method and device |
CN112346781A (en) * | 2019-08-07 | 2021-02-09 | 上海寒武纪信息科技有限公司 | Instruction processing method and device and related product |
CN110347506B (en) * | 2019-06-28 | 2023-01-06 | Oppo广东移动通信有限公司 | Data processing method and device based on LSTM, storage medium and electronic equipment |
CN111652361B (en) * | 2020-06-04 | 2023-09-26 | 南京博芯电子技术有限公司 | Composite granularity near storage approximate acceleration structure system and method for long-short-term memory network |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104303162A (en) * | 2012-01-12 | 2015-01-21 | 才智知识产权控股公司(2) | Systems and methods for managing cache admission |
CN105184366A (en) * | 2015-09-15 | 2015-12-23 | 中国科学院计算技术研究所 | Time-division-multiplexing general neural network processor |
CN105893159A (en) * | 2016-06-21 | 2016-08-24 | 北京百度网讯科技有限公司 | Data processing method and device |
CN106022468A (en) * | 2016-05-17 | 2016-10-12 | 成都启英泰伦科技有限公司 | Artificial neural network processor integrated circuit and design method therefor |
US20160379111A1 (en) * | 2015-06-25 | 2016-12-29 | Microsoft Technology Licensing, Llc | Memory bandwidth management for deep learning applications |
US20160379109A1 (en) * | 2015-06-29 | 2016-12-29 | Microsoft Technology Licensing, Llc | Convolutional neural networks on hardware accelerators |
Family Cites Families (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0296861A (en) * | 1988-10-03 | 1990-04-09 | Mitsubishi Electric Corp | Microprocessor peripheral function circuit device |
JPH03162800A (en) * | 1989-08-29 | 1991-07-12 | Mitsubishi Electric Corp | Semiconductor memory device |
JP2001034597A (en) * | 1999-07-22 | 2001-02-09 | Fujitsu Ltd | Cache memory device |
JP2001188767A (en) * | 1999-12-28 | 2001-07-10 | Fuji Xerox Co Ltd | Neutral network arithmetic unit and method |
WO2001069424A2 (en) * | 2000-03-10 | 2001-09-20 | Jaber Associates, L.L.C. | Parallel multiprocessing for the fast fourier transform with pipeline architecture |
JP2008097572A (en) * | 2006-09-11 | 2008-04-24 | Matsushita Electric Ind Co Ltd | Processing device, computer system, and mobile apparatus |
CN101197017A (en) * | 2007-12-24 | 2008-06-11 | 深圳市物证检验鉴定中心 | Police criminal technology inspection and appraisal information system and method thereof |
CN102004446A (en) * | 2010-11-25 | 2011-04-06 | 福建师范大学 | Self-adaptation method for back-propagation (BP) nerve cell with multilayer structure |
CN103150596B (en) * | 2013-02-22 | 2015-12-23 | 百度在线网络技术(北京)有限公司 | The training system of a kind of reverse transmittance nerve network DNN |
JP6115455B2 (en) * | 2013-11-29 | 2017-04-19 | 富士通株式会社 | Parallel computer system, parallel computer system control method, information processing apparatus, arithmetic processing apparatus, and communication control apparatus |
US20150269481A1 (en) * | 2014-03-24 | 2015-09-24 | Qualcomm Incorporated | Differential encoding in neural networks |
US20160034812A1 (en) * | 2014-07-31 | 2016-02-04 | Qualcomm Incorporated | Long short-term memory using a spiking neural network |
JP6453681B2 (en) * | 2015-03-18 | 2019-01-16 | 株式会社東芝 | Arithmetic apparatus, arithmetic method and program |
CN105095961B (en) * | 2015-07-16 | 2017-09-29 | 清华大学 | A kind of hybrid system of artificial neural network and impulsive neural networks |
CN105513591B (en) * | 2015-12-21 | 2019-09-03 | 百度在线网络技术(北京)有限公司 | The method and apparatus for carrying out speech recognition with LSTM Recognition with Recurrent Neural Network model |
CN107609642B (en) * | 2016-01-20 | 2021-08-31 | 中科寒武纪科技股份有限公司 | Computing device and method |
CN106203621B (en) * | 2016-07-11 | 2019-04-30 | 北京深鉴智能科技有限公司 | The processor calculated for convolutional neural networks |
- 2016-12-30 CN CN202010018716.XA patent/CN111260025B/en active Active
- 2016-12-30 CN CN202110713121.0A patent/CN113537481B/en active Active
- 2016-12-30 CN CN201611269665.8A patent/CN108268939B/en active Active
- 2016-12-30 CN CN202110708810.2A patent/CN113537480B/en active Active
Non-Patent Citations (2)
Title |
---|
MINSI WANG ET AL.: "A PARALLEL-FUSION RNN-LSTM ARCHITECTURE FOR IMAGE CAPTION GENERATION" * |
YANG, Xuyu et al.: "Research on deep learning acceleration technology" *
Also Published As
Publication number | Publication date |
---|---|
CN113537481A (en) | 2021-10-22 |
CN111260025B (en) | 2023-11-14 |
CN108268939A (en) | 2018-07-10 |
CN108268939B (en) | 2021-09-07 |
CN113537481B (en) | 2024-04-02 |
CN113537480A (en) | 2021-10-22 |
CN113537480B (en) | 2024-04-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108268939B (en) | Apparatus and method for performing LSTM neural network operations | |
CN111860812B (en) | Apparatus and method for performing convolutional neural network training | |
KR102470264B1 (en) | Apparatus and method for performing reverse training of a fully-connected layer neural network | |
EP3564863B1 (en) | Apparatus for executing lstm neural network operation, and operational method | |
CN110929863B (en) | Apparatus and method for performing LSTM operations | |
CN107832843B (en) | Information processing method and related product | |
US11531540B2 (en) | Processing apparatus and processing method with dynamically configurable operation bit width | |
CN110298443B (en) | Neural network operation device and method | |
US10853722B2 (en) | Apparatus for executing LSTM neural network operation, and operational method | |
CN111860813B (en) | Device and method for performing forward operation of convolutional neural network | |
CN109358900B (en) | Artificial neural network forward operation device and method supporting discrete data representation | |
CN111860811B (en) | Device and method for executing full-connection layer forward operation of artificial neural network | |
WO2017185347A1 (en) | Apparatus and method for executing recurrent neural network and lstm computations | |
EP3451238A1 (en) | Apparatus and method for executing pooling operation | |
CN108171328B (en) | Neural network processor and convolution operation method executed by same | |
CN111160547B (en) | Device and method for artificial neural network operation | |
CN111651203A (en) | Device and method for executing vector four-rule operation | |
CN111176608A (en) | Apparatus and method for performing vector compare operations | |
WO2017185248A1 (en) | Apparatus and method for performing auto-learning operation of artificial neural network | |
CN109711540B (en) | Computing device and board card | |
WO2017177446A1 (en) | Discrete data representation-supporting apparatus and method for back-training of artificial neural network | |
CN111860772B (en) | Device and method for executing artificial neural network mapping operation | |
CN113570053A (en) | Neural network model training method and device and computing equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||