CN113537480B - Apparatus and method for performing LSTM neural network operation


Info

Publication number
CN113537480B
Authority
CN
China
Prior art keywords
data
unit
data processing
vector
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110708810.2A
Other languages
Chinese (zh)
Other versions
CN113537480A
Inventor
陈云霁
陈小兵
刘少礼
陈天石
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202110708810.2A
Publication of CN113537480A
Application granted
Publication of CN113537480B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/20Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Neurology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)

Abstract

An apparatus and an operation method for performing an LSTM neural network operation. The apparatus comprises a direct memory access unit, an instruction cache unit, a controller unit, a plurality of data cache units arranged in parallel, and a plurality of data processing modules arranged in parallel. The data processing modules correspond one-to-one with the data cache units and are used to obtain the input data, weights and biases required for the operation from the corresponding data cache units and to perform the LSTM neural network operation; the data processing modules operate in parallel with one another. The invention runs with dedicated instructions, so the number of instructions required for the operation is greatly reduced and the decoding overhead is lowered; the weights and biases are cached, which reduces the overhead of data transfer; the apparatus is not limited to a specific application field and can be used in fields such as speech recognition, text translation and music synthesis, giving it strong extensibility; and the data processing modules run in parallel, which markedly improves the operation speed of the LSTM network.

Description

Apparatus and method for performing LSTM neural network operation
Technical Field
The present invention relates to the field of neural network operations, and more particularly, to an apparatus and an operation method for performing LSTM neural network operations.
Background
A long short-term memory (LSTM) network is a type of recurrent neural network (RNN) that, owing to the unique structural design of the network itself, is well suited to processing and predicting important events separated by very long intervals and delays in a time series. LSTM networks exhibit better performance than traditional recurrent neural networks and are well suited to learning from experience in order to classify, process and predict time series in which significant events are separated by lags of unknown duration. At present, LSTM networks are widely used in many fields, such as speech recognition, video description, machine translation and automatic music synthesis. Meanwhile, as research on LSTM networks continues to deepen, their performance has improved greatly, and they have attracted wide attention in both industry and academia.
LSTM network operation involves a variety of algorithms, and its implementations mainly fall into the following two kinds of devices:
one device for implementing LSTM network operations is the general-purpose processor, which supports the above algorithms by executing general-purpose instructions with a general-purpose register file and general-purpose functional units. One disadvantage of this approach is that the computational performance of a single general-purpose processor is low, so it cannot exploit the parallelism inherent in the LSTM network itself for acceleration. When multiple general-purpose processors execute in parallel, the communication between processors in turn becomes a performance bottleneck. In addition, a general-purpose processor must decode the artificial neural network operation into a long sequence of arithmetic and memory-access instructions, and the processor's front-end decoding incurs a considerable power overhead.
Another known way to support LSTM network operations is to use a graphics processing unit (GPU), which executes the above algorithms with general-purpose SIMD instructions using a general-purpose register file and general-purpose stream processing units. Because the GPU is a device dedicated to graphics operations and scientific computing, it provides no dedicated support for LSTM networks, so a large amount of front-end decoding work is still required to carry out LSTM network operations, which introduces substantial overhead. In addition, the GPU has only a small on-chip cache, so the parameters used by the LSTM network must be repeatedly transferred from off-chip, and off-chip bandwidth becomes a performance bottleneck.
It can be seen that how to design and provide an apparatus and method that implement LSTM network operations with high performance, a small amount of IO and low overhead is a technical problem that currently needs to be solved.
Disclosure of Invention
Accordingly, it is a primary objective of the present invention to provide an apparatus and method for performing LSTM network operations, so as to solve at least one of the above problems.
In order to achieve the above object, as one aspect of the present invention, there is provided an apparatus for performing LSTM neural network operations, comprising:
The data caching units are arranged in parallel and used for caching data, states and results required by operation;
the data processing modules are arranged in parallel and are used for acquiring input data and weight and bias required in operation from the corresponding data caching units and performing LSTM neural network operation; the data processing modules are in one-to-one correspondence with the data caching units, and parallel operation is executed among the data processing modules.
As another aspect of the present invention, there is also provided an apparatus for performing LSTM neural network operations, including:
a memory;
a processor that performs the operations of:
step 1, reading weight and bias for LSTM neural network operation from an external designated address space, dividing the weight and bias into a plurality of parts corresponding to neurons of the LSTM neural network operation, and storing the parts into different spaces of a memory, wherein the number of the weight and the bias in each space is the same; and reading input data for LSTM neural network operations from an externally specified address space and storing it in each of said different spaces of said memory;
Step 2, dividing the weight and the input data in each different space of the memory into a plurality of parts, wherein the number of the weight or the input data of each part is the same as the number of the corresponding vector operation units; calculating a weight and input data each time to obtain a partial sum, and adding vectors with the partial sum obtained before to obtain a new partial sum, wherein the initial value of the partial sum is a bias value;
step 3, after all input data in each different space of the memory are processed, obtaining a partial sum, namely net activation quantity corresponding to the neuron, and transforming the net activation quantity of the neuron through a nonlinear function tanh or a sigmoid function to obtain an output value of the neuron;
step 4, by using different weights and offsets in this way, repeating the steps 1-3, and respectively calculating the vector values of the forgetting gate, the input gate, the output gate and the state unit to be selected in the LSTM neural network operation; the vector operation instruction is adopted in the process of the calculation part, and the input data in each different space of the memory is calculated in a parallel operation mode;
Step 5, judging whether the calculation of the vector values of the current forgetting gate, the input gate and the state unit to be selected in each different space of the memory is completed, if so, calculating a new state unit, namely, obtaining partial sums of the vector values of the old state unit and the forgetting gate through a vector point multiplication component, obtaining partial sums of the values of the state unit to be selected and the input gate through a vector point multiplication component, obtaining updated state units through a vector summation sub-module, and simultaneously, transforming the updated state units through a nonlinear transformation function tanh; judging whether the nonlinear transformation of the current updated data state unit and the output gate are calculated, if so, calculating the output gate and the vector with the nonlinear transformation of the updated data state unit through a vector point multiplication component to obtain the final output value of each different space of the memory;
and 6, splicing the final output values of each different space of each memory to obtain a final output value.
As still another aspect of the present invention, the present invention further provides an operation method of an LSTM neural network, which is characterized by comprising the steps of:
Step S1, reading weight and bias for LSTM neural network operation from an external designated address space, writing the weight and bias into a plurality of data cache units which are arranged in parallel, and initializing state units of the data cache units; the weight and bias read from the external appointed address space are divided and sent into corresponding data caching units corresponding to the neurons operated by the LSTM neural network, and the number of the weight and bias in each data caching unit is the same;
step S2, reading input data from an external designated address space and writing the input data into a plurality of data caching units, wherein the input data written into each data caching unit is complete;
step S3, a plurality of data processing modules corresponding to the data caching units in a one-to-one mode are used for respectively reading the weight, the offset and the input data from the corresponding data caching units, and performing LSTM neural network operation on the data processing modules by adopting a vector point multiplication component, a vector addition component, a vector summation component and a vector nonlinear function conversion component to respectively obtain an output value of each data processing module;
and S4, splicing the output values of the data processing modules to obtain a final output value, namely a final result of the LSTM neural network operation.
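The following NumPy sketch is only an illustration of the data partitioning and splicing described in steps S1-S4 above, not the hardware implementation; all names (lstm_forward, num_modules and so on) are invented for the example, and the per-module arithmetic follows the standard LSTM gate formulas that are restated later in the description.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(Wf, Wi, Wo, Wc, bf, bi, bo, bc, xh, c_prev, num_modules):
    """Software model of steps S1-S4.

    xh is the concatenated input [h_{t-1}, x_t]; each weight matrix has one
    row per neuron, so splitting along axis 0 assigns an equal share of
    neurons (and their biases) to every data cache unit / processing module.
    """
    split = lambda a: np.split(a, num_modules, axis=0)   # S1: divide weights/biases per module
    outs, states = [], []
    for Wf_p, Wi_p, Wo_p, Wc_p, bf_p, bi_p, bo_p, bc_p, c_p in zip(
            split(Wf), split(Wi), split(Wo), split(Wc),
            split(bf), split(bi), split(bo), split(bc), split(c_prev)):
        # S2/S3: every module sees the complete input and works independently
        f = sigmoid(Wf_p @ xh + bf_p)        # forget gate
        i = sigmoid(Wi_p @ xh + bi_p)        # input gate
        o = sigmoid(Wo_p @ xh + bo_p)        # output gate
        c_tilde = np.tanh(Wc_p @ xh + bc_p)  # candidate state unit
        c = f * c_p + i * c_tilde            # updated state unit
        outs.append(o * np.tanh(c))          # this module's output values
        states.append(c)
    # S4: splice the per-module outputs into the final output value
    return np.concatenate(outs), np.concatenate(states)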
Based on the above technical solutions, the device and method for executing neural network operation of the present invention have the following advantages compared with the existing implementation manner:
1. compared with existing implementations, dedicated instructions are used for the operation, so the number of instructions required is greatly reduced and the decoding overhead incurred when performing LSTM network operations is lowered;
2. by exploiting the fact that the weights and biases of the hidden layer can be reused throughout the LSTM network operation, the weight and bias values are temporarily stored in the data cache units, so the amount of IO between the device and the outside is reduced, lowering the overhead of data transfer;
3. the invention is not limited to the application field of a specific LSTM network, can be used in the fields such as voice recognition, text translation, music synthesis and the like, and has strong expandability;
4. the data processing modules in the device are completely parallel, and the internal parts of the data processing modules are parallel, so that the parallelism of the LSTM network can be fully utilized, and the operation speed of the LSTM network is obviously improved;
5. preferably, the specific implementation of the vector nonlinear function conversion component can be performed by a table look-up method, and compared with the traditional function operation, the efficiency is greatly improved.
Drawings
For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic diagram showing the overall structure of an apparatus for performing LSTM network operations according to an embodiment of the present invention;
FIG. 2 shows a schematic diagram of a data processing module of an apparatus for performing LSTM network operations in accordance with an embodiment of the invention;
FIG. 3 illustrates a flow chart of a method for performing LSTM network operations in accordance with an embodiment of the invention;
fig. 4 is a detailed flowchart illustrating a data processing procedure in a method for performing LSTM network operations according to an embodiment of the present invention.
Detailed Description
The present invention will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent.
In this specification, the various embodiments described below for describing the principles of the present invention are illustrative only and should not be construed as limiting the scope of the invention in any way. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the invention defined by the claims and their equivalents. The following description includes numerous specific details to aid in understanding, but these details should be construed as exemplary only. Accordingly, those of ordinary skill in the art will recognize that many variations and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Furthermore, the same reference numerals are used for similar functions and operations throughout the drawings. In the present invention, the terms "include" and "comprise," as well as derivatives thereof, are intended to be inclusive rather than limiting.
The apparatus for performing LSTM network operations of the present invention may be applied in scenarios including but not limited to: various electronic products such as data processing devices, robots, computers, printers, scanners, telephones, tablet computers, intelligent terminals, mobile phones, driving recorders, navigators, sensors, cameras, cloud servers, video cameras, projectors, watches, earphones, mobile storage and wearable devices; various vehicles such as airplanes, ships and automobiles; various household appliances such as televisions, air conditioners, microwave ovens, refrigerators, electric rice cookers, humidifiers, washing machines, electric lamps, gas stoves and range hoods; and various medical equipment such as nuclear magnetic resonance instruments, B-mode ultrasound instruments and electrocardiographs.
Specifically, the invention discloses a device for executing LSTM neural network operation, which comprises:
the data caching units are arranged in parallel and used for caching data, states and results required by operation;
the data processing modules are arranged in parallel and are used for acquiring input data and weight and bias required in operation from the corresponding data caching units and performing LSTM neural network operation; the data processing modules are in one-to-one correspondence with the data caching units, and parallel operation is executed among the data processing modules.
The data caching unit caches the intermediate result calculated by the data processing module, and only imports the weight and the bias once from the direct memory access unit in the whole execution process, and then the weight and the bias are not changed.
The data caching units are respectively written with weights and offsets which are divided corresponding to neurons operated by the LSTM neural network, wherein the weights and offsets in the data caching units are the same in number, and each data caching unit acquires a piece of complete input data.
The data processing module adopts a vector point multiplication component, a vector addition component, a vector summation component and a vector nonlinear function conversion component to carry out LSTM neural network operation.
The vector nonlinear function conversion component performs function operation through a table look-up method.
Each data processing module performs vector operation by respectively calculating vector values of a forgetting gate, an input gate, an output gate and a state unit to be selected in LSTM network operation, obtains an output value of each data processing module from each vector value, and finally splices the output values of each data processing module to obtain a final output value.
As a preferred embodiment, the present invention discloses an apparatus for performing LSTM neural network operations, comprising:
the direct memory access unit is used for acquiring the instruction and the data required by the LSTM neural network operation from an external address space outside the device, respectively transmitting the instruction and the data to the instruction cache unit and the data cache unit, and writing the operation result back to the external address space from the data processing module or the data cache unit;
the instruction cache unit is used for caching the instruction acquired from the external address space by the direct memory access unit and inputting the instruction into the controller unit;
the controller unit reads instructions from the instruction cache unit and decodes them into microinstructions used to control the direct memory access unit to perform data IO operations, the data processing modules to perform the related operations, and the data cache units to cache and transfer data;
the data caching units are arranged in parallel and used for caching data, states and results required by operation;
the data processing modules are arranged in parallel and are used for acquiring input data and weight and bias required during operation from the corresponding data caching units, performing LSTM neural network operation and inputting operation results into the corresponding data caching units or the direct memory access units; the data processing modules are in one-to-one correspondence with the data caching units, and parallel operation is executed among the data processing modules.
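The composition just listed can be pictured as a small structural sketch on the software side; the class and field names are invented for the illustration, and only the data-holding parts of the apparatus are modelled (the direct memory access unit, instruction cache unit and controller unit are omitted):

from dataclasses import dataclass, field
from typing import Dict, List, Optional
import numpy as np

@dataclass
class DataCacheUnit:
    """One cache unit: holds its share of weights/biases, the cached state and partial sums."""
    weights: Dict[str, np.ndarray] = field(default_factory=dict)   # e.g. keys 'Wf', 'Wi', 'Wo', 'Wc'
    biases: Dict[str, np.ndarray] = field(default_factory=dict)    # e.g. keys 'bf', 'bi', 'bo', 'bc'
    state_unit: Optional[np.ndarray] = None                        # cached c for this unit's neurons
    partial_sums: Dict[str, np.ndarray] = field(default_factory=dict)

@dataclass
class DataProcessingModule:
    cache: DataCacheUnit        # one-to-one correspondence with a data cache unit

@dataclass
class LstmDevice:
    """Data-side composition only."""
    data_caches: List[DataCacheUnit]
    processing_modules: List[DataProcessingModule]   # arranged in parallel, same count as caches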
Preferably, the direct memory access unit, the instruction cache unit, the controller unit, the plurality of data cache units and the plurality of data processing modules are all implemented by hardware circuits.
Preferably, the data caching unit caches the intermediate result calculated by the data processing module, and only imports the weight and the bias from the direct memory access unit once in the whole execution process, and then does not change any more.
Preferably, each of the plurality of data processing modules performs the LSTM neural network operation using a vector point multiplication component, a vector addition component, a vector summation component, and a vector nonlinear function conversion component.
Preferably, the vector nonlinear function conversion component performs a function operation by a table look-up method.
Preferably, the plurality of data processing modules perform parallel operation by:
step 1, each corresponding data buffer unit is written with a weight and a bias which are read from an external designated address space and are divided corresponding to a neuron operated by the LSTM neural network, wherein the weight and the bias in each data buffer unit are the same in number, and each data buffer unit acquires a piece of complete input data; each data processing module divides the weight and the input data in each corresponding data caching unit into a plurality of parts, wherein the weight or the number of the input data in each part is the same as the number of vector operation units in the corresponding single data processing module; each time, sending a weight and input data into the corresponding data processing module, calculating to obtain a partial sum, then taking out the partial sum obtained before from the data caching unit, adding vectors to the partial sum to obtain a new partial sum, and sending the new partial sum back to the data caching unit, wherein the initial value of the partial sum is an offset value;
Step 2, after all input data in each data buffer unit are sent to a corresponding data processing module for processing once, the obtained part sum is the net activation quantity corresponding to the neuron, and the corresponding data processing module transforms the net activation quantity of the neuron through a nonlinear function tanh or a sigmoid function to obtain an output value of the neuron;
step 3, by using different weights and offsets in this way, repeating the steps 1-2, and respectively calculating the vector values of the forgetting gate, the input gate, the output gate and the state unit to be selected in the LSTM network operation; in the same data processing module, vector operation instructions are adopted in the process of calculating part sum, and parallel operation is adopted among data;
step 4, each data processing module judges whether the calculation of the vector values of the current forgetting gate, the input gate and the state unit to be selected is completed, if so, the calculation of a new state unit is carried out, namely, the vector values of the old state unit and the forgetting gate are sent to the data processing unit, partial sums are obtained through a vector point multiplication component and are sent to a data caching unit, then the values of the state unit to be selected and the input gate are sent to the data processing unit, partial sums in the data caching unit are obtained through the vector point multiplication component and are sent to the data processing module, the updated state unit is obtained through a vector summation sub-module and is then sent to the data caching unit, and meanwhile, the updated state unit in the data processing module is transformed through a nonlinear transformation function tanh; each data processing module judges whether the nonlinear transformation of the current updated data state unit and the output gate are calculated, if so, the output gate and the vector after the nonlinear transformation of the updated data state unit are calculated by a vector point multiplication component to obtain a final output value, and the output value is written back into the data cache unit;
And step 5, after the output values in all the data processing modules are written back to the data caching unit, the output values in all the data processing modules are spliced to obtain a final output value, and the final output value is sent to an external designated address through the direct memory access unit.
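The blocked partial-sum accumulation in steps 1 and 2 above can be illustrated with the short sketch below; it models a single gate of a single data processing module and assumes a fixed number of vector lanes per module. The names (gate_net_activation, vector_lanes) are illustrative, not taken from the patent.

import numpy as np

def gate_net_activation(W_part, b_part, xh, vector_lanes):
    """Accumulate the net activation of one gate inside one data processing module.

    W_part/b_part are this module's share of the gate weights and biases;
    xh is the complete input [h_{t-1}, x_t].  The input is consumed in
    chunks of `vector_lanes` elements, mirroring the vector operation units.
    """
    partial_sum = b_part.copy()                         # step 1: partial sum starts at the bias
    for start in range(0, xh.shape[0], vector_lanes):
        w_chunk = W_part[:, start:start + vector_lanes]
        x_chunk = xh[start:start + vector_lanes]
        partial_sum = partial_sum + w_chunk @ x_chunk   # add the new partial sum
    return partial_sum                                  # step 2: net activation of the neurons

# step 2 continued: the gate output is the nonlinear transform of the net activation,
# e.g. np.tanh(net) or 1/(1+np.exp(-net)), depending on the gate.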
The invention also discloses a device for executing the LSTM neural network operation, which comprises:
a memory;
a processor that performs the operations of:
step 1, reading weight and bias for LSTM neural network operation from an external designated address space, dividing the weight and bias into a plurality of parts corresponding to neurons of the LSTM neural network operation, and storing the parts into different spaces of a memory, wherein the number of the weight and the bias in each space is the same; and reading input data for LSTM neural network operations from an externally specified address space and storing it in each of said different spaces of said memory;
step 2, dividing the weight and the input data in each different space of the memory into a plurality of parts, wherein the number of the weight or the input data of each part is the same as the number of the corresponding vector operation units; calculating a weight and input data each time to obtain a partial sum, and adding vectors with the partial sum obtained before to obtain a new partial sum, wherein the initial value of the partial sum is a bias value;
Step 3, after all input data in each different space of the memory are processed, obtaining a partial sum, namely net activation quantity corresponding to the neuron, and transforming the net activation quantity of the neuron through a nonlinear function tanh or a sigmoid function to obtain an output value of the neuron;
step 4, by using different weights and offsets in this way, repeating the steps 1-3, and respectively calculating the vector values of the forgetting gate, the input gate, the output gate and the state unit to be selected in the LSTM neural network operation; the vector operation instruction is adopted in the process of the calculation part, and the input data in each different space of the memory is calculated in a parallel operation mode;
step 5, judging whether the calculation of the vector values of the current forgetting gate, the input gate and the state unit to be selected in each different space of the memory is completed, if so, calculating a new state unit, namely, obtaining partial sums of the vector values of the old state unit and the forgetting gate through a vector point multiplication component, obtaining partial sums of the values of the state unit to be selected and the input gate through a vector point multiplication component, obtaining updated state units through a vector summation sub-module, and simultaneously, transforming the updated state units through a nonlinear transformation function tanh; judging whether the nonlinear transformation of the current updated data state unit and the output gate are calculated, if so, calculating the output gate and the vector with the nonlinear transformation of the updated data state unit through a vector point multiplication component to obtain the final output value of each different space of the memory;
And 6, splicing the final output values of each different space of each memory to obtain a final output value.
The invention also discloses an operation method of the LSTM neural network, which comprises the following steps:
step S1, reading weight and bias for LSTM neural network operation from an external designated address space, writing the weight and bias into a plurality of data cache units which are arranged in parallel, and initializing state units of the data cache units; the weight and bias read from the external appointed address space are divided and sent into corresponding data caching units corresponding to the neurons operated by the LSTM neural network, and the number of the weight and bias in each data caching unit is the same;
step S2, reading input data from an external designated address space and writing the input data into a plurality of data caching units, wherein the input data written into each data caching unit is complete;
step S3, a plurality of data processing modules corresponding to the data caching units in a one-to-one mode are used for respectively reading the weight, the offset and the input data from the corresponding data caching units, and performing LSTM neural network operation on the data processing modules by adopting a vector point multiplication component, a vector addition component, a vector summation component and a vector nonlinear function conversion component to respectively obtain an output value of each data processing module;
And S4, splicing the output values of the data processing modules to obtain a final output value, namely a final result of the LSTM neural network operation.
Preferably, in the step S3, each data processing module divides the weight and the input data in the corresponding data buffer unit into a plurality of parts, where the number of the weight or the input data in each part is the same as the number of the vector operation units in the corresponding single data processing module; each data caching unit sends a weight and input data into a data processing module corresponding to the data caching unit at each time, a partial sum is obtained through calculation, the partial sum obtained before is taken out of the data caching unit, vector addition is carried out on the partial sum, a new partial sum is obtained, and the new partial sum is sent back to the data caching unit, wherein the initial value of the partial sum is an offset value;
after all input data are sent to a data processing module once, the obtained partial sum is the net activation quantity corresponding to the neuron, then the net activation quantity of the neuron is sent to the data processing module, the output value of the neuron is obtained through the transformation of a nonlinear function tanh or sigmoid function in a data operation submodule, and the vector values of a forgetting gate, an input gate, an output gate and a state unit to be selected in the LSTM neural network are respectively calculated by using different weights and offsets in the mode;
Each data processing module judges whether the calculation of the vector values of the current forgetting gate, the input gate and the state unit to be selected is finished, if so, the calculation of a new state unit is carried out, namely, the vector values of the old state unit and the forgetting gate are sent to the data processing unit, partial sums are obtained through a vector point multiplication component and are sent back to the data caching unit, then the values of the state unit to be selected and the input gate are sent to the data processing unit, partial sums in the data caching unit are obtained through the vector point multiplication component and are sent to the data processing module, an updated state unit is obtained through a vector summation sub-module and is then sent back to the data caching unit, and meanwhile, the updated state unit in the data processing module is transformed through a nonlinear transformation function tanh; and each data processing module judges whether the nonlinear transformation of the current updated data state unit and the output gate are calculated, if so, the output gate and the vector with the nonlinear transformation of the updated data state unit are calculated by a vector point multiplication component to obtain a final output value, and the output value is written back into the data cache unit.
Preferably, the nonlinear function tanh or sigmoid function performs a function operation by a table look-up method.
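As a concrete illustration of the table look-up mentioned here, the following sketch models a sigmoid lookup table with linear interpolation between sampled points; the sampling range, table size and interpolation step are assumptions made for the example and are not details taken from the patent.

import numpy as np

class LookupSigmoid:
    """Approximate sigmoid by table look-up with linear interpolation."""

    def __init__(self, lo=-8.0, hi=8.0, n=1024):
        self.xs = np.linspace(lo, hi, n)          # recorded inputs x_1 < x_2 < ... < x_n
        self.ys = 1.0 / (1.0 + np.exp(-self.xs))  # recorded outputs y_1 ... y_n

    def __call__(self, x):
        x = np.clip(x, self.xs[0], self.xs[-1])
        i = np.clip(np.searchsorted(self.xs, x) - 1, 0, len(self.xs) - 2)
        x0, x1 = self.xs[i], self.xs[i + 1]
        y0, y1 = self.ys[i], self.ys[i + 1]
        return y0 + (x - x0) * (y1 - y0) / (x1 - x0)  # interpolate inside [x_i, x_{i+1}]

# usage: sig = LookupSigmoid(); sig(np.array([-1.0, 0.0, 2.5]))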
Other aspects, advantages and salient features of the invention will become apparent to those skilled in the art from the following detailed description, which, taken in conjunction with the annexed drawings, discloses exemplary embodiments of the invention.
As one embodiment of the invention, the invention discloses a device and a method for LSTM network operation, which can be used to accelerate applications that use LSTM networks. The method specifically comprises the following steps:
(1) The weight and bias used in LSTM network operation are taken out from an external appointed address space through a direct memory access unit and written into each data cache unit, wherein the weight and bias are taken out from the external appointed address space, segmented and then sent into each data cache unit, the weight and bias in each data cache unit are the same in quantity, the weight and bias in each data cache unit correspond to neurons, and state units in the data cache unit are initialized;
(2) The input data is fetched from an external designated address space through a direct memory access unit and written into a data cache unit, and each data cache unit obtains a complete input data;
(3) Dividing the weight and the input data in each data caching unit into a plurality of parts, wherein the number of the weight or the input data of each part is the same as the number of vector operation units in a corresponding single data processing module, sending one part of weight and input data into the data processing module each time, calculating to obtain a partial sum, then taking out the partial sum obtained before from the data caching unit, vector adding the partial sum to obtain a new partial sum, and sending the new partial sum back into the data caching unit. Wherein the initial value of the partial sum is a bias value. After all input data are sent to the data processing module once, the obtained partial sum is the net activation quantity corresponding to the neuron, then the net activation quantity of the neuron is sent to the data processing module, the output value of the neuron is obtained through nonlinear function tanh or sigmoid function transformation in the data operation submodule, and the function transformation can be carried out through two methods of a table look-up method and a function operation. By using different weights and offsets in this way, the vector values of the forgetting gate, the input gate, the output gate and the state unit to be selected in the LSTM network can be calculated respectively. In the same data processing module, vector operation instructions are adopted in the process of calculating part sum, and parallelism exists among data. Then, the data dependency judging sub-module in each data processing module judges whether the calculation of the vector values of the current forgetting gate, the input gate and the state unit to be selected is completed or not, and if so, the calculation of the new state unit is performed. Firstly, an old state unit and a forgetting gate vector value are sent to a data processing unit, and partial sums are obtained through a vector point multiplication component in a data operation submodule and are sent back to a data caching unit; and then the values of the state unit to be selected and the input gate are sent to the data processing unit, partial sums are obtained through the vector point multiplication component in the data operation submodule, partial sums in the data caching unit are sent to the data processing module, the updated state unit is obtained through the vector summation submodule in the data operation submodule and then sent back to the data caching unit, and meanwhile, the updated state unit in the data processing module is transformed through the nonlinear transformation function tanh in the data operation submodule. The data dependency judging sub-module in each data processing module judges whether the nonlinear transformation of the current updated data state unit and the output gate are calculated, if so, the output gate and the vector after the nonlinear transformation of the updated data state unit are calculated through the vector point multiplication component in the data operation sub-module to obtain a final output value, and the output value is written back into the data cache unit. In the whole operation process, the problem of data dependence or data conflict does not exist among different data processing modules, and the data can be processed in parallel all the time.
(4) After the output values in all the data processing modules are written back to the data caching unit, the output values in all the data processing modules are spliced to obtain a final output value, and the final output value is sent to an external designated address through the direct memory access unit.
(5) It is judged whether the LSTM network needs to produce an output for the next moment; if so, the process returns to (2); otherwise, the operation ends.
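Steps (1)-(5) amount to loading the weights and biases once and then looping over time steps. A minimal sketch of that outer control flow, assuming a per-step function in the spirit of the earlier sketches (all names are illustrative):

import numpy as np

def run_lstm_sequence(weights, biases, inputs, h0, c0, step_fn):
    """Drive the per-time-step operation of steps (1)-(5).

    `weights`/`biases` are loaded once (step (1)) and reused for every time
    step; `inputs` is the sequence x_1..x_T; `step_fn` performs the per-step
    work of steps (2)-(4) and returns (h_t, c_t).
    """
    h, c = h0, c0
    outputs = []
    for x_t in inputs:                               # step (5): loop while another output is needed
        xh = np.concatenate([h, x_t])                # step (2): form the input [h_{t-1}, x_t]
        h, c = step_fn(weights, biases, xh, c)       # step (3): per-module compute and state update
        outputs.append(h)                            # step (4): spliced output written back out
    return outputs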
Fig. 1 is a schematic diagram showing an overall structure of an apparatus for performing LSTM network operations according to an embodiment of the present invention. As shown in fig. 1, the device includes a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, which may all be implemented by hardware circuits.
The direct memory access unit 1 can access the external address space and can read and write data to each cache unit inside the device to complete the loading and storing of data. Specifically, it reads instructions into the instruction cache unit 2, reads the weights, biases and input data required by the LSTM network operation from the specified storage locations into the data cache units 4, and writes the computed output from the data cache units 4 directly to the externally specified space.
The instruction cache unit 2 reads instructions through the direct memory access unit 1 and caches the read instructions.
The controller unit 3 reads instructions from the instruction cache unit 2, decodes the instructions into micro-instructions controlling the behavior of other modules and sends them to other modules such as the direct memory access unit 1, the data cache unit 4, the data processing module 5, etc.
The data cache unit 4 initializes the LSTM state unit when the device is initialized, and reads the weights and biases from the externally designated address through the direct memory access unit 1; the weights and biases read into each data cache unit 4 correspond to the neurons it is to compute, that is, each data cache unit 4 holds a part of the total weights and biases, and the weights and biases in all data cache units 4 taken together are exactly those read in from the externally designated address. During a specific operation, input data is first obtained from the direct memory access unit 1, each data cache unit 4 receiving its own copy of the input, and the partial sum is initialized to the bias value; then a portion of the weights, biases and input values is sent to the data processing module 5, which computes an intermediate value that is read back from the data processing module 5 and stored in the data cache unit 4. After all inputs have been processed once, the partial sum is fed into the data processing module 5 to compute the neuron outputs, which are written back to the data cache unit 4, finally yielding the vector values of the input gate, output gate, forget gate and candidate state unit. Then the forget gate and the old state unit are sent to the data processing module 5, a partial sum is computed and written back to the data cache unit 4; the candidate state unit and the input gate are sent to the data processing module 5, a partial sum is computed, and in the data processing module 5 it is added to the previously computed partial sum to obtain the updated state unit, which is written back to the data cache unit 4. The output gate is fed into the data processing module 5 and dot-multiplied with the value of the updated state unit transformed by the nonlinear function tanh to obtain the output value, which is written back to the data cache unit 4. In the end, each data cache unit 4 holds its corresponding updated state unit and output value, and the output values in all data cache units 4 are combined to obtain the final output value. Finally, each data cache unit 4 writes its part of the output value back to the externally designated address space through the direct memory access unit 1.
The corresponding operations in the LSTM network are as follows:
f_t = σ(W_f·[h_{t-1}, x_t] + b_f);
i_t = σ(W_i·[h_{t-1}, x_t] + b_i);
c̃_t = tanh(W_c·[h_{t-1}, x_t] + b_c);
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t;
o_t = σ(W_o·[h_{t-1}, x_t] + b_o);
h_t = o_t ⊙ tanh(c_t);
wherein x_t is the input data at time t and h_{t-1} is the output data at time t-1; W_f, W_i, W_c and W_o respectively denote the weight vectors corresponding to the forget gate, the input gate, the state-unit update and the output gate, and b_f, b_i, b_c and b_o respectively denote the corresponding biases of the forget gate, the input gate, the state-unit update and the output gate; f_t denotes the output of the forget gate, which is dot-multiplied with the state unit at time t-1 to select the state-unit values to be forgotten; i_t denotes the output of the input gate, which is dot-multiplied with the candidate state values obtained at time t to selectively add the candidate state values of time t to the state unit; c̃_t denotes the candidate state values calculated at time t; c_t denotes the new state value obtained by selectively forgetting the state value of time t-1 and selectively adding the candidate state value of time t, which is used when calculating the final output and is also passed on to the next time step; o_t denotes the selection condition for the part of the state unit to be output as the result at time t; h_t denotes the output at time t, which is also transmitted to the next time step; ⊙ denotes the element-wise product of vectors; σ is the sigmoid function, whose formula is σ(x) = 1/(1 + e^(-x)); the formula of the activation function tanh is tanh(x) = (e^x - e^(-x))/(e^x + e^(-x)).
The data processing module 5 each time reads a portion of the weights W_i/W_f/W_o/W_c and biases b_i/b_f/b_o/b_c, together with the corresponding input data [h_{t-1}, x_t], from its corresponding data cache unit 4, and computes a partial sum with the vector multiplication component and the summation component in the data processing module 5; once the input data of each neuron has been processed one time, the net activation net_i/net_f/net_o/net_c of the neuron is obtained, and the output value is then computed through the vector nonlinear function conversion with the sigmoid or tanh function. In this way the input gate i_t, the forget gate f_t, the output gate o_t and the candidate state unit c̃_t are computed in turn. Then the dot product of the old state unit with the forget gate, and the dot product of the candidate state unit with the input gate, are each computed by the vector dot-multiplication component in the data processing module 5, and the two results are combined by the vector addition component to obtain the new state unit c_t. The newly obtained state unit is written back to the data cache unit 4. The state unit in the data processing module 5 is then transformed by the vector nonlinear function conversion component to complete the tanh transformation and obtain tanh(c_t); in this computation the tanh function value can be computed directly or obtained by table look-up. The vector dot-multiplication component then computes the dot product of the output gate with the tanh-transformed state unit to obtain the final neuron output value h_t. Finally, the neuron output value h_t is written back to the data cache unit 4.
FIG. 2 shows a schematic diagram of a data processing module of an apparatus for performing LSTM network operations in accordance with an embodiment of the invention;
as shown in fig. 2, the data processing unit 5 includes a data processing control sub-module 51, a data dependency discrimination sub-module 52, and a data operation sub-module 53.
The data processing control sub-module 51 controls the operations performed by the data operation sub-module 53 and controls the data dependency discrimination sub-module 52 to determine whether the current operation has a data dependency. For some operations, the data processing control sub-module 51 directly controls the data operation sub-module 53 to perform them; for operations that may have a data dependency, the data processing control sub-module 51 first controls the data dependency discrimination sub-module 52 to determine whether the current operation has a data dependency; if so, the data processing control sub-module 51 inserts a null operation into the data operation sub-module 53, and only after the data dependency has been cleared does it control the data operation sub-module 53 to perform the data operation.
The data dependency discrimination sub-module 52 is controlled by the data processing control sub-module 51 and checks whether a data dependency exists in the data operation sub-module 53. If the next operation needs to use a value whose computation has not yet finished, a data dependency currently exists; otherwise it does not. The data dependency detection works as follows: the data operation sub-module 53 contains registers R1, R2, R3, R4 and R5, which respectively mark whether the computation of the input gate, the forget gate, the output gate, the candidate state unit, and the tanh transformation of the updated state unit has been completed; a non-zero register value indicates that the operation has completed, and a value of 0 indicates that it has not. For the LSTM network, the data dependency discrimination sub-module 52 performs two dependency checks: when calculating the new state unit it checks whether the input gate, the forget gate and the candidate state unit are ready, and when calculating the output value it checks whether the output gate and the tanh transformation of the updated state unit are ready, that is, it checks whether R1, R2 and R4 are all non-zero and whether R3 and R5 are all non-zero, respectively. After the check is completed, the result is returned to the data processing control sub-module 51.
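A minimal software analogue of these two dependency checks, assuming the completion registers are modelled as boolean flags (the class and method names are invented for the illustration):

class DependencyFlags:
    """Mirror of registers R1-R5: True (non-zero) means the value is ready."""

    def __init__(self):
        self.input_gate = False        # R1
        self.forget_gate = False       # R2
        self.output_gate = False       # R3
        self.candidate_state = False   # R4
        self.state_tanh = False        # R5

    def state_update_ready(self):
        # the new state unit needs the input gate, forget gate and candidate state unit
        return self.input_gate and self.forget_gate and self.candidate_state

    def output_ready(self):
        # the final output needs the output gate and tanh of the updated state unit
        return self.output_gate and self.state_tanh

While a check returns False, the data processing control sub-module would keep issuing null operations, as described above.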
The data operation sub-module 53 is controlled by the data processing control sub-module 51 and completes the data processing in the network operation process. The data operation sub-module 53 includes a vector dot-multiplication component, a vector addition component, a vector summation component, a vector nonlinear conversion component, and the registers R1, R2, R3, R4 and R5 that indicate whether the related data operations are complete. Registers R1, R2, R3, R4 and R5 respectively mark whether the computation of the input gate, the forget gate, the output gate, the candidate state unit, and the tanh transformation of the updated state unit has been completed; a non-zero register value indicates the operation has completed, and a value of 0 indicates it has not. The vector addition component adds the corresponding positions of two vectors to obtain a vector; the vector summation component divides a vector into several segments and sums within each segment, so that the length of the resulting vector equals the number of segments. The vector nonlinear conversion component takes each element of a vector as input and produces the output obtained after the nonlinear function transformation. The specific nonlinear transformation can be completed in two ways. Taking a sigmoid function with input x as an example, one way is to compute sigmoid(x) directly by function evaluation; the other is to complete it by table look-up: the data operation sub-module 53 maintains a table of the sigmoid function that records the inputs x_1, x_2, ..., x_n (x_1 < x_2 < ... < x_n) and the corresponding outputs y_1, y_2, ..., y_n. To compute the function value corresponding to x, the interval [x_i, x_(i+1)] satisfying x_i < x < x_(i+1) is first found, and the output value is obtained by interpolating between the recorded outputs y_i and y_(i+1) over that interval. During the LSTM network operation, the following operations are performed:
first, R1, R2, R3, R4, R5 are set to 0. Initializing the input gate portion sum with a bias; calculating a temporary value by using part of input data and a weight corresponding to the input data through a vector point multiplication component, then segmenting the temporary value according to temporary value vectors corresponding to different neurons, completing summation operation of the temporary value by using a vector summation component, and updating a calculation result, an input gate part and a completion part; and taking the other input data and the weight to perform the same operation to update the partial sum, after all the input data are operated once, obtaining the partial sum which is the net activation quantity of the neurons, and then calculating the output value of the input gate through a vector nonlinear conversion component. The output value is written back into the data cache unit 4 and the R1 register is set to non-0.
The output values of the forget gate, the output gate and the candidate state unit are computed by the same method used to compute the input gate output, the corresponding output values are written back to the data cache unit 4, and the R2, R3 and R4 registers are set to non-zero.
A null operation or the updating of the state unit is performed according to the control command of the data processing control sub-module 51. The update of the state unit proceeds as follows: the forget gate output value and the old state unit are fetched from the data cache unit 4, and a partial sum is computed by the vector dot-multiplication component; then the input gate output value and the candidate state unit are fetched from the data cache unit 4, and another partial sum is computed by the vector dot-multiplication component; the two partial sums are added by the vector addition component to obtain the updated state unit. The updated state unit is finally written back to the data cache unit 4.
A null operation or the computation of the LSTM network output value is performed according to the control command of the data processing control sub-module 51. The output value is computed as follows: the nonlinear transformation value of the updated state unit is computed with the vector nonlinear function conversion component, and R5 is then set to non-zero. The vector dot-multiplication component then performs the dot-multiplication of the output gate with the nonlinear transformation value of the state unit to obtain the final output value, namely the output value of the corresponding neurons of the LSTM network. The output value is written back into the data cache unit 4.
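The vector summation component used above (splitting a vector into segments and summing within each segment, one segment per neuron) can be modelled in a couple of lines; the equal-length contiguous segment layout is an assumption of this sketch:

import numpy as np

def vector_segment_sum(v, num_segments):
    """Split v into num_segments equal contiguous segments and sum each one.

    The result has one element per segment, matching the description that the
    output length equals the number of segments.
    """
    return v.reshape(num_segments, -1).sum(axis=1)

# example: vector_segment_sum(np.arange(8.0), 4) -> array([ 1.,  5.,  9., 13.])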
Fig. 3 shows a flowchart provided for performing LSTM network operations in accordance with an embodiment of the present invention.
In step S1, an IO instruction is stored in advance at the head address of the instruction cache unit 2.
In step S2, the controller unit 3 reads the IO instruction from the first address of the instruction cache unit 2, and according to the decoded microinstructions, the direct memory access unit 1 reads all instructions related to LSTM network computation from the external address space and caches them in the instruction cache unit 2.
In step S3, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the decoded microinstruction, the direct memory access unit 1 reads the weights and biases related to the LSTM network operation from the externally designated address space, including the weights and biases of the input gate, output gate, forget gate and candidate state unit, divides them according to the different neurons the weights correspond to, and reads them into the different data cache units 4.
In step S4, the controller unit 3 reads a state unit initialization instruction from the instruction cache unit 2, and according to the decoded micro instruction, initializes the state unit values in the data cache unit 4 and sets the partial sums of the input gate, the output gate, the forgetting gate and the state unit to be selected to the bias values of the corresponding neurons.
In step S5, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the decoded micro instruction, the direct memory access unit 1 reads the input values from the external designated address space into the data cache units 4, and each data cache unit 4 receives the same complete input value vector.
In step S6, the controller unit 3 reads a data processing instruction from the instruction cache unit 2, and according to the decoded micro instruction, each data processing module 5 obtains the data required for the operation from its corresponding data buffer unit 4 and performs the operation. The result of the operation is the output value of a subset of the neurons at one time point, and the output values obtained by all the data processing modules 5 are combined to form the output value for that time point (see Fig. 4 for the detailed procedure). After the processing is finished, the data processing module 5 stores the intermediate values or the output value and the state unit value obtained by the processing in the data buffer unit 4.
In step S7, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the decoded microinstruction, the output values in the data cache units 4 are spliced together and output to the external designated address through the direct memory access unit 1.
In step S8, the controller unit 3 reads a discrimination instruction from the instruction cache unit 2, and based on the decoded microinstruction, the controller unit 3 determines whether the current forward process is completed; if so, the operation ends. If not, the flow returns to S6 and the operation continues.
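Steps S1 to S8 amount to a simple driver loop. The sketch below mirrors that loop; device and its methods are hypothetical stand-ins for the units described above, not an interface defined by the patent.

```python
import numpy as np

def run_lstm_forward(device, timesteps):
    """Illustrative driver for the flow of Fig. 3 (steps S1-S8)."""
    device.load_instructions()             # S1-S2: fetch the LSTM instruction stream
    device.load_weights_and_biases()       # S3: split weights/biases across the data cache units
    device.init_state_units()              # S4: initialize state units and gate partial sums
    outputs = []
    for t in range(timesteps):             # S8: repeat until the forward process is complete
        device.load_input(t)               # S5: broadcast the input vector to every data cache unit
        parts = [m.process() for m in device.data_processing_modules]  # S6: parallel modules
        outputs.append(np.concatenate(parts))                          # S7: splice and output
    return outputs
```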
Fig. 4 is a detailed flowchart illustrating a data processing procedure in a method for performing LSTM network operations according to an embodiment of the present invention.
In step S1, the data processing module 5 reads a portion of the input gate weights and the corresponding input values from the data buffer unit 4.
In step S2, the data processing control sub-module 51 in the data processing module 5 controls the vector point multiplication component in the data operation sub-module 53 to calculate the point multiplication of the input gate weights and the input values, then groups the point multiplication results according to the neuron to which each result belongs, and calculates a partial sum with the vector summation component in the data operation sub-module 53.
In step S3, the data processing module 5 reads the input gate partial sum from the data buffer unit 4.
In step S4, the data processing control sub-module 51 in the data processing module 5 controls the data operation sub-module 53 to add the newly calculated partial sum to the partial sum just read in, obtaining the updated input gate partial sum.
In step S5, the data processing module 5 writes the updated partial sum into the data cache unit 4.
In step S6, the data processing module 5 determines whether all the input gate weights have been operated on once. If so, the partial sum in the data buffer unit is the net activation of the input gate, the output value of the input gate is obtained from it by the nonlinear transformation, and the R1 register is set to non-zero; otherwise, another portion of the input gate weights and input values is taken and the flow returns to S1 to continue the operation.
In step S7, the forgetting gate output value, the output gate output value and the output value of the state unit to be selected are obtained in the same manner, R2, R3 and R4 are set to non-zero, and the output values are written back to the data cache unit 4.
In step S8, the data processing control sub-module 51 in the data processing module 5 controls the data dependency determination sub-module 52 to determine whether the calculations of the forgetting gate, the input gate and the state unit to be selected are all complete, i.e. whether R1, R2 and R4 are all non-zero. If not, the data processing control sub-module 51 controls the data operation sub-module 53 to perform a no-op and the flow returns to S8; if so, the flow proceeds to S9.
In step S9, the data processing module 5 reads the old state unit and the forgetting gate output value from the data buffer unit 4.
In step S10, the data processing control sub-module 51 in the data processing module 5 controls the data operation sub-module 53 to calculate a partial sum from the old state unit and the forgetting gate output value with the vector point multiplication component.
In step S11, the data processing module 5 reads the candidate state unit and the input gate output value from the data buffer unit 4.
In step S12, the data processing control sub-module 51 in the data processing module 5 controls the data operation sub-module 53 to calculate a partial sum from the state unit to be selected and the input gate output value with the vector point multiplication component, and to combine it with the partial sum calculated in S10 through the vector addition component to obtain the updated state unit.
In step S13, the data processing module 5 writes the updated state unit back into the data buffer unit 4.
In step S14, the data processing control sub-module 51 in the data processing module 5 controls the data operation sub-module 53 to apply the vector nonlinear transformation component to the updated state unit, obtaining the tanh transformation value of the state unit, and sets R5 to non-zero.
In step S15, the data processing control sub-module 51 in the data processing module 5 controls the data dependency determination sub-module 52 to determine whether the calculation of the output gate output value and the tanh transformation value of the state unit is complete, i.e. whether R3 and R5 are both non-zero. If not, the data processing control sub-module 51 controls the data operation sub-module 53 to perform a no-op and the flow returns to S15; if so, the flow proceeds to S16.
In step S16, the data processing module 5 reads in the output of the output gate from the data buffer unit 4.
In step S17, the data processing control sub-module 51 in the data processing module 5 controls the data operation sub-module 53 to point-multiply the output gate output value and the tanh transformation value of the state unit through the vector point multiplication component to obtain the output value, namely the output value of the neurons of the LSTM network that correspond to this data processing module 5.
In step S18, the data processing module 5 writes the output value into the data buffer unit 4.
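Putting steps S1 through S18 together for a single data processing module gives the sketch below. It reuses the gate_output helper from the earlier sketch; cache, module and their attributes are illustrative stand-ins, and the comments note where the R1-R5 dependency flags would be set or checked.

```python
import numpy as np

def process_one_module(cache, module):
    """Sketch of Fig. 4 (steps S1-S18) for one data processing module."""
    i = gate_output(cache.x, cache.w_i, cache.b_i, module.width)              # S1-S6, sets R1
    f = gate_output(cache.x, cache.w_f, cache.b_f, module.width)              # S7, sets R2
    o = gate_output(cache.x, cache.w_o, cache.b_o, module.width)              # S7, sets R3
    c_tilde = gate_output(cache.x, cache.w_c, cache.b_c, module.width,
                          activation=np.tanh)                                 # S7, sets R4
    # S8: the dependency check spins (no-ops) until R1, R2 and R4 are all non-zero
    c_new = f * cache.c_old + i * c_tilde                                     # S9-S12
    cache.c_old = c_new                                                       # S13: write the state unit back
    tanh_c = np.tanh(c_new)                                                   # S14, sets R5
    # S15: spins until R3 and R5 are both non-zero
    h = o * tanh_c                                                            # S16-S17: output value
    cache.h = h                                                               # S18: write back
    return h
```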
The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software embodied on a non-transitory computer readable medium), or a combination thereof. Although the processes or methods are described above in terms of certain sequential operations, it should be appreciated that some of the operations described may be performed in a different order. Further, some operations may be performed in parallel rather than sequentially.
The device works with a specially designed instruction set, giving high instruction decoding efficiency. The data processing modules compute in parallel, and the data cache units operate in parallel without transferring data between one another, which greatly improves the parallelism of the operation. In addition, placing the weights and biases in the data cache units reduces the IO operations between the device and the external address space and reduces the bandwidth required for memory access.
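The absence of inter-module traffic follows from splitting the weights and biases by output neuron, so each module only ever needs its own slice plus the shared input. A small sketch of that split, under the same NumPy assumption and with hypothetical names:

```python
import numpy as np

def partition_across_modules(weights, biases, n_modules):
    """Neuron-wise split implied by the parallel layout: each data cache
    unit gets the full input but only its own slice of weights and biases,
    so the data processing modules never exchange data mid-computation."""
    w_slices = np.array_split(weights, n_modules, axis=0)  # split rows (output neurons)
    b_slices = np.array_split(biases, n_modules, axis=0)
    return list(zip(w_slices, b_slices))
```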
The foregoing description of the embodiments illustrates the principles of the invention and is not intended to limit it; any modifications, equivalent substitutions and improvements made without departing from the spirit and principles of the invention shall fall within its scope of protection.

Claims (24)

1. An apparatus for performing LSTM neural network operations, comprising:
the data cache units are arranged in parallel, and each data cache unit stores weights, biases and a complete piece of input data, wherein the weights and biases read into each data cache unit are a part of the total weights and biases; the total weights and biases refer to the weights and biases read in from an external designated address of the apparatus;
the data processing modules are arranged in parallel, correspond to the data caching units in a one-to-one correspondence, execute parallel operation among the data processing modules, and each data processing module is used for acquiring input data and weight and bias required during operation from the corresponding data caching unit, performing LSTM neural network operation to obtain an output value in each data processing module and writing back to the data caching unit;
And after the output values in all the data processing modules are written back to the data caching unit, splicing the output values in each data processing module to obtain a final output value.
2. The apparatus of claim 1, wherein each of the data processing modules is configured to perform a vector operation of the LSTM neural network operation by separately calculating vector values of a forget gate, an input gate, an output gate, and a state cell to be selected in the LSTM neural network operation.
3. The apparatus according to claim 2, wherein each of the data processing modules is configured to divide the weight and the input data in its corresponding data buffer unit into a plurality of parts, where the number of the weight or the input data in each part is the same as the number of vector operation units in the corresponding single data processing module;
each data processing module is further used for reading a weight and input data from the corresponding data caching unit, calculating a partial sum, then taking out the previously obtained partial sum from the data caching unit, vector-adding the two to obtain a new partial sum, and sending the new partial sum back to the data caching unit, wherein the initial value of the partial sum is the bias value;
After the data processing module has processed all the input data, the obtained partial sum is the net activation quantity corresponding to the neuron, and the data processing module is further used for transforming the net activation quantity of the neuron through the nonlinear function tanh or the sigmoid function to obtain the output value of the neuron.
4. The apparatus of claim 3, wherein the data processing module is further configured to obtain the forgetting gate output value and the old state unit, calculate a partial sum, and write it back to the data cache unit;
the data processing module is also used for acquiring the state unit to be selected and the input gate, calculating a partial sum, reading the partial sum in the data caching unit into the data processing module, vector-adding it to the partial sum obtained by the previous calculation to obtain the updated state unit, and writing the updated state unit back into the data caching unit;
the data processing module is also used for obtaining an output gate, performing vector point multiplication on the output gate and the value transformed by the nonlinear transformation function tanh of the updated state unit to obtain an output value, and writing the output value back into the data caching unit;
and obtaining corresponding updated state units and output values from each data caching unit, and combining the output values in all the data caching units to obtain a final output value.
5. The apparatus of claim 4, wherein each of the data processing modules performs the LSTM neural network operation using a vector point multiplication component, a vector addition component, a vector summation component, and a vector nonlinear function conversion component to obtain an output value of each of the data processing modules;
the vector point multiplication component and the vector summation component in the data processing module are used for completing the calculation of partial sums;
the vector point multiplication component in the data processing module is used for calculating the point multiplication of the old state unit and the forgetting gate, and the point multiplication of the state unit to be selected and the input gate, and then the two results are calculated by the vector addition component to obtain an updated state unit;
the vector nonlinear function conversion component in the data processing module is used for completing conversion of the tanh function with the updated state unit;
the vector point multiplication component is used for calculating the vector of the output gate and the updated state unit after the tanh nonlinear transformation to obtain a final neuron output value.
6. The apparatus of claim 5, wherein the vector nonlinear function conversion component performs a functional representation by a look-up table method.
7. The apparatus of claim 4, wherein each of the data processing modules is further configured to determine whether the current forget gate, the input gate, and the candidate state cell vector value calculation are complete;
If so, a new state unit calculation is performed.
8. The apparatus of claim 4 or 7, wherein each of the data processing modules is further configured to determine whether the nonlinear transformation of the currently updated state unit and the output gate have both been calculated;
and if the calculation is completed, to calculate the output gate and the nonlinearly transformed vector of the updated state unit through the vector point multiplication component to obtain the final output value.
9. The apparatus of claim 1, wherein the apparatus further comprises:
the direct memory access unit is used for acquiring the instruction and input data, weight and bias required by the LSTM neural network operation from an address space outside the device for executing the LSTM neural network operation, respectively transmitting the instruction and the input data, the weight and the bias to the instruction cache unit and the corresponding data cache unit, and writing the operation result back to the external address space from the data processing module or the data cache unit;
the instruction cache unit is used for caching the instruction acquired by the direct memory access unit from the external address space and inputting the instruction into the controller unit;
the controller unit is used for reading the instruction from the instruction cache unit, decoding the instruction into micro instructions, and controlling the direct memory access unit to perform data IO operations, the data processing module to perform the related operations and the data cache unit to perform data caching and transmission.
10. The apparatus of claim 1, wherein each of the data processing modules comprises a data processing control sub-module, a data dependency discrimination sub-module, and a data operation sub-module, wherein:
the data processing control sub-module is used for controlling the operation performed by the data operation sub-module;
the data dependence judging submodule is used for judging whether the data dependence exists in the current operation or not.
11. The apparatus of claim 10 wherein for operations that have a data dependency, the data processing control submodule is first configured to control the data dependency determination submodule to determine whether the current operation has a data dependency, and if so, the data processing control submodule is configured to insert a null operation into the data operation submodule, and after the data dependency is released, to control the data operation submodule to perform the data operation;
the data dependency judging sub-module is controlled by the data processing control sub-module and is used for checking whether a data dependency exists in the data operation sub-module; if the next operation needs to use the value which is not operated at present, the data dependence exists at present, otherwise, the data dependence does not exist;
The data operation sub-module is controlled by the data processing control sub-module and is used for completing data processing in the LSTM neural network operation process.
12. The apparatus of claim 11, wherein the data dependency determination submodule is configured to, when performing data dependency detection:
five registers R1, R2, R3, R4 and R5 are present in the data operation sub-module, respectively marking whether the operations of the input gate, the forgetting gate, the output gate and the state unit to be selected, and the tanh function conversion of the updated state unit, are finished; a register value other than 0 indicates that the operation is finished, and a value of 0 indicates that it is not; corresponding to the LSTM network, the data dependency judging sub-module performs two data dependency checks, namely judging whether a data dependency exists among the input gate, the forgetting gate and the state unit to be selected when calculating the new state unit, and judging whether a data dependency exists between the output gate and the tanh function conversion of the updated state unit when calculating the output value, which correspond to judging whether R1, R2 and R4 are all non-0 and whether R3 and R5 are both non-0, respectively.
13. A method of performing LSTM neural network operations, the method comprising:
writing input data, weights and offsets into a plurality of data caching units, wherein each data caching unit comprises a complete input data, and the weights and offsets in each data caching unit are a part of total weights and offsets; the total weight and bias refer to the weight and bias read in from an external designated address;
The method comprises the steps of obtaining weight values and bias required by input data and operation from a plurality of data caching units to a plurality of data processing modules, performing LSTM neural network operation to obtain output values in each data processing module, and writing back the output values to the data caching units, wherein the plurality of data processing modules are in one-to-one correspondence with the plurality of data caching units which are arranged in parallel, and parallel operation is performed among the plurality of data processing modules;
and after the output values in all the data processing modules are written back to the data caching unit, splicing the output values in each data processing module to obtain a final output value.
14. The method of claim 13, wherein the obtaining weights and biases required for the input data and operations from the plurality of data cache units to a plurality of data processing modules, performing LSTM neural network operations, comprises:
vector values of a forgetting gate, an input gate, an output gate and a state unit to be selected in the LSTM neural network operation are calculated respectively to perform vector operation of the LSTM neural network operation.
15. The method of claim 14, wherein the calculating vector values of the forget gate, the input gate, the output gate, and the candidate state cell in the LSTM neural network operation, respectively, comprises:
Dividing the weight value and the input data in each data buffer unit into a plurality of parts, wherein the number of the weight value or the input data in each part is the same as the number of vector operation units in a corresponding single data processing module;
calculating a weight and input data each time to obtain a partial sum, and adding vectors with the partial sum obtained before to obtain a new partial sum, wherein the initial value of the partial sum is a bias value;
after all input data are processed, the obtained partial sum is the net activation quantity corresponding to the neuron, and the net activation quantity of the neuron is transformed through a nonlinear function tanh or sigmoid function to obtain an output value of the neuron;
by using different weights and offsets in this way, the above steps are repeated to calculate the vector values of the forgetting gate, the input gate, the output gate and the state unit to be selected in the LSTM neural network operation respectively.
16. The method of claim 15, wherein the method further comprises:
sending the forgetting gate output value and the old state unit into the data processing module, calculating a partial sum, and writing the partial sum back into the data caching unit;
the state unit to be selected and the input gate are sent into the data processing module, a partial sum is obtained through calculation, the partial sum in the data caching unit is read into the data processing module, the updated state unit is obtained through vector addition with the previously calculated partial sum, and the updated state unit is written back into the data caching unit;
Sending the output gate into a data processing module, performing vector point multiplication on the value transformed by the nonlinear transformation function tanh of the updated state unit to obtain an output value, and writing the output value back into a data caching unit;
after the corresponding updated state units and output values are obtained from each data caching unit, the output values in all the data caching units are combined to obtain the final output value.
17. The method of claim 16, wherein the data processing module performs the LSTM neural network operations using a vector point multiplication component, a vector addition component, a vector summation component, and a vector nonlinear function conversion component to obtain an output value for each of the data processing modules; comprising the following steps:
the vector point multiplication component and the vector summation component in the data processing module complete the calculation of partial sums;
calculating the dot multiplication of an old state unit and a forgetting gate, and the dot multiplication of a state unit to be selected and an input gate through a vector dot multiplication component in a data processing module, and then calculating the two results through a vector addition component to obtain an updated state unit;
the updated state unit in the data processing module is converted into a tanh function by using a vector nonlinear function conversion component;
And carrying out vector dot multiplication component operation on the vector of the output gate and the updated state unit after the tanh nonlinear transformation to obtain a final neuron output value.
18. The method of claim 17, wherein the vector nonlinear function conversion is functionally represented by a look-up table method.
19. The method of claim 16, wherein the method further comprises:
judging whether the calculation of the vector values of the current forgetting gate, the input gate and the state unit to be selected is completed or not;
if so, a new state unit calculation is performed.
20. The method according to claim 16 or 19, characterized in that the method further comprises:
judging whether the nonlinear transformation of the currently updated state unit and the output gate have both been calculated;
and if the calculation is completed, calculating the output gate and the nonlinearly transformed vector of the updated state unit through the vector point multiplication component to obtain the final output value.
21. The method of claim 13, wherein the method further comprises:
the direct memory access unit reads the weights and biases for the LSTM neural network operation from an external designated address space, writes the weights and biases into a plurality of data cache units arranged in parallel, and initializes the state units of the data cache units; the weights and biases read from the external designated address space are divided according to the neurons operated on by the LSTM neural network and sent into the corresponding data cache units, and the number of weights and biases in each data cache unit is the same;
An IO instruction is stored in advance at the first address of the instruction cache unit;
and reading an instruction from the instruction cache unit, decoding the instruction into micro instructions, and controlling the direct memory access unit to perform data IO operations, the data processing modules to perform the related operations, and the data cache units to perform data caching and transmission.
22. The method of claim 13, wherein each of the data processing modules includes a data processing control sub-module, a data dependency discrimination sub-module, and a data operation sub-module, the method further comprising:
the data processing control sub-module controls the operation performed by the data operation sub-module;
the data dependency judging submodule judges whether the current operation has data dependency or not.
23. The method of claim 22, wherein the data processing control sub-module controls operations performed by a data operation sub-module, comprising:
for the operation with the data dependency relationship, the data processing control sub-module firstly controls the data dependency judging sub-module to judge whether the current operation has the data dependency relationship, if the current operation has the data dependency relationship, the data processing control sub-module enables the data operation sub-module to insert the null operation, and after the data dependency relationship is relieved, the data operation sub-module is controlled to perform the data operation;
The data dependency judging sub-module is controlled by the data processing control sub-module and checks whether a data dependency exists in the data operation sub-module; if the next operation needs to use the value which is not operated at present, the data dependence exists at present, otherwise, the data dependence does not exist;
the data operation sub-module is controlled by the data processing control sub-module and is used for completing data processing in the LSTM neural network operation process.
24. The method of claim 23, wherein the step of the data dependency determination submodule performing data dependency detection comprises:
five registers R1, R2, R3, R4 and R5 are present in the data operation sub-module, respectively marking whether the operations of the input gate, the forgetting gate, the output gate and the state unit to be selected, and the tanh function conversion of the updated state unit, are finished; a register value other than 0 indicates that the operation is finished, and a value of 0 indicates that it is not; corresponding to the LSTM network, the data dependency judging sub-module performs two data dependency checks, namely judging whether a data dependency exists among the input gate, the forgetting gate and the state unit to be selected when calculating the new state unit, and judging whether a data dependency exists between the output gate and the tanh function conversion of the updated state unit when calculating the output value, which correspond to judging whether R1, R2 and R4 are all non-0 and whether R3 and R5 are both non-0, respectively.
CN202110708810.2A 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operation Active CN113537480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110708810.2A CN113537480B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611269665.8A CN108268939B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operations
CN202110708810.2A CN113537480B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operation

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201611269665.8A Division CN108268939B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operations

Publications (2)

Publication Number Publication Date
CN113537480A CN113537480A (en) 2021-10-22
CN113537480B true CN113537480B (en) 2024-04-02

Family

ID=62771289

Family Applications (4)

Application Number Title Priority Date Filing Date
CN201611269665.8A Active CN108268939B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operations
CN202010018716.XA Active CN111260025B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operation
CN202110713121.0A Active CN113537481B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operation
CN202110708810.2A Active CN113537480B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operation

Family Applications Before (3)

Application Number Title Priority Date Filing Date
CN201611269665.8A Active CN108268939B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operations
CN202010018716.XA Active CN111260025B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operation
CN202110713121.0A Active CN113537481B (en) 2016-12-30 2016-12-30 Apparatus and method for performing LSTM neural network operation

Country Status (1)

Country Link
CN (4) CN108268939B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110727462B (en) * 2018-07-16 2021-10-19 上海寒武纪信息科技有限公司 Data processor and data processing method
WO2020061870A1 (en) * 2018-09-27 2020-04-02 深圳大学 Lstm end-to-end single-lead electrocardiogram classification method
CN109543832B (en) * 2018-11-27 2020-03-20 中科寒武纪科技股份有限公司 Computing device and board card
CN111258636B (en) * 2018-11-30 2022-10-04 上海寒武纪信息科技有限公司 Data processing method, processor, data processing device and storage medium
US11494645B2 (en) * 2018-12-06 2022-11-08 Egis Technology Inc. Convolutional neural network processor and data processing method thereof
CN109670581B (en) * 2018-12-21 2023-05-23 中科寒武纪科技股份有限公司 Computing device and board card
WO2020125092A1 (en) * 2018-12-20 2020-06-25 中科寒武纪科技股份有限公司 Computing device and board card
US11042797B2 (en) 2019-01-08 2021-06-22 SimpleMachines Inc. Accelerating parallel processing of data in a recurrent neural network
CN110009100B (en) * 2019-03-28 2021-01-05 安徽寒武纪信息科技有限公司 Calculation method of user-defined operator and related product
CN110020720B (en) * 2019-04-01 2021-05-11 中科寒武纪科技股份有限公司 Operator splicing method and device
CN112346781A (en) * 2019-08-07 2021-02-09 上海寒武纪信息科技有限公司 Instruction processing method and device and related product
CN110347506B (en) * 2019-06-28 2023-01-06 Oppo广东移动通信有限公司 Data processing method and device based on LSTM, storage medium and electronic equipment
CN111652361B (en) * 2020-06-04 2023-09-26 南京博芯电子技术有限公司 Composite granularity near storage approximate acceleration structure system and method for long-short-term memory network
CN111898752A (en) * 2020-08-03 2020-11-06 乐鑫信息科技(上海)股份有限公司 Apparatus and method for performing LSTM neural network operations
CN112948126A (en) * 2021-03-29 2021-06-11 维沃移动通信有限公司 Data processing method, device and chip

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001069424A2 (en) * 2000-03-10 2001-09-20 Jaber Associates, L.L.C. Parallel multiprocessing for the fast fourier transform with pipeline architecture
US6654730B1 * 1999-12-28 2003-11-25 Fuji Xerox Co., Ltd. Neural network arithmetic apparatus and neural network operation method
CN105184366A (en) * 2015-09-15 2015-12-23 中国科学院计算技术研究所 Time-division-multiplexing general neural network processor
WO2016018569A1 (en) * 2014-07-31 2016-02-04 Qualcomm Incorporated Long short-term memory using a spiking neural network
CN105513591A (en) * 2015-12-21 2016-04-20 百度在线网络技术(北京)有限公司 Method and device for speech recognition by use of LSTM recurrent neural network model

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0296861A (en) * 1988-10-03 1990-04-09 Mitsubishi Electric Corp Microprocessor peripheral function circuit device
JPH03162800A (en) * 1989-08-29 1991-07-12 Mitsubishi Electric Corp Semiconductor memory device
JP2001034597A (en) * 1999-07-22 2001-02-09 Fujitsu Ltd Cache memory device
JP2008097572A (en) * 2006-09-11 2008-04-24 Matsushita Electric Ind Co Ltd Processing device, computer system, and mobile apparatus
CN101197017A (en) * 2007-12-24 2008-06-11 深圳市物证检验鉴定中心 Police criminal technology inspection and appraisal information system and method thereof
CN102004446A (en) * 2010-11-25 2011-04-06 福建师范大学 Self-adaptation method for back-propagation (BP) nerve cell with multilayer structure
CN104303162B (en) * 2012-01-12 2018-03-27 桑迪士克科技有限责任公司 The system and method received for managing caching
CN103150596B (en) * 2013-02-22 2015-12-23 百度在线网络技术(北京)有限公司 The training system of a kind of reverse transmittance nerve network DNN
JP6115455B2 (en) * 2013-11-29 2017-04-19 富士通株式会社 Parallel computer system, parallel computer system control method, information processing apparatus, arithmetic processing apparatus, and communication control apparatus
US20150269481A1 (en) * 2014-03-24 2015-09-24 Qualcomm Incorporated Differential encoding in neural networks
JP6453681B2 (en) * 2015-03-18 2019-01-16 株式会社東芝 Arithmetic apparatus, arithmetic method and program
US10140572B2 (en) * 2015-06-25 2018-11-27 Microsoft Technology Licensing, Llc Memory bandwidth management for deep learning applications
US20160379109A1 (en) * 2015-06-29 2016-12-29 Microsoft Technology Licensing, Llc Convolutional neural networks on hardware accelerators
CN105095961B (en) * 2015-07-16 2017-09-29 清华大学 A kind of hybrid system of artificial neural network and impulsive neural networks
CN107563497B (en) * 2016-01-20 2021-03-19 中科寒武纪科技股份有限公司 Computing device and operation method for sparse artificial neural network
CN106022468B (en) * 2016-05-17 2018-06-01 成都启英泰伦科技有限公司 the design method of artificial neural network processor integrated circuit and the integrated circuit
CN105893159B (en) * 2016-06-21 2018-06-19 北京百度网讯科技有限公司 Data processing method and device
CN106203621B (en) * 2016-07-11 2019-04-30 北京深鉴智能科技有限公司 The processor calculated for convolutional neural networks

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6654730B1 * 1999-12-28 2003-11-25 Fuji Xerox Co., Ltd. Neural network arithmetic apparatus and neural network operation method
WO2001069424A2 (en) * 2000-03-10 2001-09-20 Jaber Associates, L.L.C. Parallel multiprocessing for the fast fourier transform with pipeline architecture
WO2016018569A1 (en) * 2014-07-31 2016-02-04 Qualcomm Incorporated Long short-term memory using a spiking neural network
CN105184366A (en) * 2015-09-15 2015-12-23 中国科学院计算技术研究所 Time-division-multiplexing general neural network processor
CN105513591A (en) * 2015-12-21 2016-04-20 百度在线网络技术(北京)有限公司 Method and device for speech recognition by use of LSTM recurrent neural network model

Also Published As

Publication number Publication date
CN108268939B (en) 2021-09-07
CN113537481B (en) 2024-04-02
CN111260025A (en) 2020-06-09
CN108268939A (en) 2018-07-10
CN111260025B (en) 2023-11-14
CN113537481A (en) 2021-10-22
CN113537480A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN113537480B (en) Apparatus and method for performing LSTM neural network operation
CN111860812B (en) Apparatus and method for performing convolutional neural network training
CN109117948B (en) Method for converting picture style and related product
CN109376861B (en) Apparatus and method for performing full connectivity layer neural network training
WO2018120016A1 (en) Apparatus for executing lstm neural network operation, and operational method
CN110929863B (en) Apparatus and method for performing LSTM operations
CN110298443B (en) Neural network operation device and method
CN107704267B (en) Convolution neural network operation instruction and method thereof
CN109358900B (en) Artificial neural network forward operation device and method supporting discrete data representation
WO2017185347A1 (en) Apparatus and method for executing recurrent neural network and lstm computations
US20200050918A1 (en) Processing apparatus and processing method
WO2017185387A1 (en) Method and device for executing forwarding operation of fully-connected layered neural network
US10853722B2 (en) Apparatus for executing LSTM neural network operation, and operational method
CN107886166B (en) Device and method for executing artificial neural network operation
CN109754062B (en) Execution method of convolution expansion instruction and related product
EP3451238A1 (en) Apparatus and method for executing pooling operation
CN108171328B (en) Neural network processor and convolution operation method executed by same
US20210098001A1 (en) Information processing method and terminal device
EP3561732A1 (en) Operation apparatus and method for artificial neural network
WO2018058427A1 (en) Neural network computation apparatus and method
CN109711540B (en) Computing device and board card
WO2017185248A1 (en) Apparatus and method for performing auto-learning operation of artificial neural network
WO2018058452A1 (en) Apparatus and method for performing artificial neural network operation
WO2017177446A1 (en) Discrete data representation-supporting apparatus and method for back-training of artificial neural network
WO2017185335A1 (en) Apparatus and method for executing batch normalization operation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant