CN113159309B - NAND flash memory-based low-power-consumption neural network accelerator storage architecture - Google Patents

NAND flash memory-based low-power-consumption neural network accelerator storage architecture

Info

Publication number
CN113159309B
Authority
CN
China
Prior art keywords
matrix
cache
weight
neural network
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110349392.2A
Other languages
Chinese (zh)
Other versions
CN113159309A (en)
Inventor
姜小波
邓晗珂
莫志杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110349392.2A
Publication of CN113159309A
Application granted
Publication of CN113159309B
Active legal status (current)
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0866 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
    • G06F12/0868 Data transfer between cache memory and other subsystems, e.g. storage devices or host systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0877 Cache access modes
    • G06F12/0882 Page mode
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/20 Employing a main memory using a specific memory technology
    • G06F2212/202 Non-volatile memory
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Neurology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Semiconductor Memories (AREA)

Abstract

The invention provides a NAND flash memory-based low-power-consumption neural network accelerator storage architecture, which comprises an off-chip NAND flash memory storage unit, a neural network computing circuit, an internal global cache and a controller; the internal global cache comprises a weight cache, an input cache, an intermediate result cache and an output cache. When neural network computation is carried out, the controller reads weight data from the off-chip NAND flash memory storage unit and loads it into the weight cache, and loads input data into the input cache; the neural network computing circuit loads the weight data stored in the weight cache and the input data stored in the input cache and then performs the operation; intermediate computation results of the neural network computing circuit are buffered in the intermediate result cache, and the final computation result is buffered in the output cache and then output. The architecture can meet the computational requirements of deep learning inference tasks on end-side devices, has low power consumption and provides power-off protection.

Description

NAND flash memory-based low-power-consumption neural network accelerator storage architecture
Technical Field
The invention relates to the technical field of integrated circuit design, in particular to a low-power-consumption neural network accelerator storage architecture based on a NAND flash memory.
Background
With the development of artificial intelligence and the Internet of Things, their combination is accelerating, and end-side artificial intelligence is in a stage of rapid development. At the same time, as the performance and complexity of deep learning algorithms increase, deep learning computation needs to be integrated into hardware architectures to speed up the operation.
End-side devices face a variety of complex application scenarios while being subject to power-consumption and cost constraints. In this context, low power consumption and low cost are basic requirements for an end-side neural network accelerator intended to run deep learning inference tasks on the device.
Memories currently in use fall into two categories, volatile and non-volatile. Volatile memories comprise static random access memory (SRAM) and dynamic random access memory (DRAM); among non-volatile memories, flash memory currently dominates the market. With the development of mobile devices and the Internet of Things, off-chip memory today is mainly DRAM and NAND flash. SRAM has the fastest read speed but the highest cost, and is therefore used only as high-speed on-chip cache. In existing neural network accelerator architectures, the common memory organization is SRAM as on-chip cache and DRAM as off-chip memory. DRAM offers higher read/write speed than NAND flash, but it cannot retain data without power and costs more. When an end-side device uses DRAM for off-chip storage, any data loss forces the data to be re-read from the cloud, wasting considerable power, bandwidth and time.
Disclosure of Invention
In order to overcome the defects and shortcomings of the prior art, the invention aims to provide a NAND flash memory-based low-power-consumption neural network accelerator storage architecture; the architecture can meet the computational requirements of deep learning inference tasks on end-side devices, has low power consumption and provides power-off protection.
In order to achieve this purpose, the invention is realized by the following technical scheme: a NAND flash memory-based low-power-consumption neural network accelerator storage architecture, characterized in that it comprises an off-chip NAND flash memory storage unit, a neural network computing circuit, an internal global cache and a controller; the internal global cache comprises a weight cache, an input cache, an intermediate result cache and an output cache;
the off-chip NAND flash memory storage unit is used for storing weight data from a cloud or a server;
the controller is used for controlling the flow and calculation of data and is responsible for controlling the calculation of the neural network calculation circuit and the writing-in and writing-out of the data;
the neural network computing circuit is used for carrying out data computation;
the weight cache is used for caching weight data loaded from an off-chip NAND flash memory storage unit; the input cache is used for caching input data; the intermediate result cache is used for caching the intermediate calculation result of the neural network calculation circuit; the output cache is used for caching output data of the neural network computing circuit;
when neural network calculation is carried out, the controller reads the weight data of the off-chip NAND flash memory storage unit and loads the weight data into the weight cache; loading input data into an input cache; the neural network computing circuit loads the weight data stored in the weight cache and the input data stored in the input cache, then the operation is carried out, the intermediate computing result of the neural network computing circuit is cached in the intermediate result cache, the final computing result is cached in the output cache, and then the final computing result is output.
Preferably, the architecture is applied to a neural network model with several module layers; the method for performing neural network computation with the low-power-consumption neural network accelerator storage architecture comprises the following steps: the weight data of the neural network model is loaded from a cloud or server into the off-chip NAND flash memory storage unit; the neural network computing circuit then computes each module layer of the neural network model in sequence, layer by layer:
when the first module layer is computed, the weight data corresponding to the first module layer in the off-chip NAND flash memory storage unit is loaded into the weight cache, the weight data stored in the weight cache and the input data stored in the input cache are loaded into the neural network computing circuit and the operation is performed, and the intermediate computation result of the neural network computing circuit is buffered in the intermediate result cache;
when each subsequent module layer is computed, the weight data corresponding to the currently computed module layer in the off-chip NAND flash memory storage unit is loaded into the weight cache, and the neural network computing circuit loads the weight data stored in the weight cache and the intermediate computation result stored in the intermediate result cache and then performs the operation; if the currently computed module layer is not the last module layer, the intermediate computation result of the neural network computing circuit is buffered in the intermediate result cache; if the currently computed module layer is the last module layer, the computation result of the neural network computing circuit is buffered in the output cache, and the final computation result is then output.
Preferably, the architecture is applied to a neural network model Transformer encoder; the method for performing neural network computation with the low-power-consumption neural network accelerator storage architecture comprises the following steps:
s1, loading the weight and bias data of the neural network model from a cloud or a server by an off-chip NAND flash memory storage unit: position coding weight matrix and query vector weight matrix W Q A key vector weight matrix W K Value vector weight matrix W V Query vector offset vector B Q Key vector offset vector B K Value vector offset vector B V Feedforward layer first layer weight matrix W E1 Feedforward layer second layer weight matrix W F2 Feedforward layer first layer offset vector B F1 Feedforward layer second layer offset vector B F2 The multi-head attention module outputs a weight matrix Wo and a multi-head attention output offset vector B O And layer normalized gain and bias;
s2, loading the position coding weight matrix stored in the off-chip NAND flash memory storage unit into a weight cache; loading an input matrix X corresponding to the current word vector into an input cache; the neural network computing circuit takes out an input matrix X from the input cache, takes out a position coding weight matrix from the weight cache, and computes an input matrix X after position coding;
query vector weight matrix W for storing off-chip NAND flash memory cells Q A key vector weight matrix W K Sum vector weight matrix W V And query vector offset vector B Q Key vector offset vector B K Sum vector offset vector B V Loading into a weight cache; taking out the weight and the offset data from the weight cache, performing linear layer calculation on the weight and the input matrix X subjected to position coding in a neural network calculation circuit to obtain a matrix Q, a matrix K and a matrix V, and storing the matrix Q, the matrix K and the matrix V into an intermediate result cache; the calculation formula is as follows:
Q = W_Q X + B_Q
K = W_K X + B_K
V = W_V X + B_V
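A minimal NumPy sketch of this linear-layer step, following the formulas just given (shape assumptions for illustration: X is taken as d_model × sequence_length so that the weight matrices left-multiply it, and the bias vectors are column vectors that broadcast over the sequence dimension):

```python
import numpy as np

def qkv_linear(X, W_Q, W_K, W_V, B_Q, B_K, B_V):
    # Linear-layer computation of the query, key and value matrices:
    # Q = W_Q X + B_Q, K = W_K X + B_K, V = W_V X + B_V.
    Q = W_Q @ X + B_Q
    K = W_K @ X + B_K
    V = W_V @ X + B_V
    return Q, K, V  # in the architecture these are stored in the intermediate result cache

# Illustrative shapes: d_model = 8, d_k = 4, sequence length = 3.
# X = np.random.rand(8, 3); W_Q = np.random.rand(4, 8); B_Q = np.random.rand(4, 1), etc.
```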
multi-head attention module output weight W for storing off-chip NAND flash memory storage unit O And an output offset B O Loading into a weight cache; loading the matrix Q and the matrix K which are cached and stored by the intermediate result to a neural network computing circuit; performing multiplication and addition operation on the matrix in the PE array, and calculating the multiplication and addition operation result through a Softmax module to obtain an attention fraction matrix S; the calculation formula is as follows:
S = Softmax(Q K^T / √d_k)
wherein d is k The column numbers of the matrix Q and the matrix K correspond to the dimensionality of the word vector;
then, loading the matrix V stored in the intermediate result cache to a neural network computing circuit, and multiplying the matrix V with the attention fractional matrix S to obtain an output matrix Z i (ii) a Multi-head attention module output weight W for caching weight O And an output offset B O Loading to a neural network computing circuit; output matrix Z of each head of multi-head attention module i Output weight matrix W of multi-head attention module after splicing O Performing multiplication and addition operation, adding output offset vector B to the result of multiplication and addition operation O Obtaining a matrix Z; storing the matrix Z into an intermediate result cache; the calculation formula is as follows:
Z_i = S V
Z = W_O (Z_1 ... Z_i) + B_O
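The attention step can be sketched in NumPy as follows (assumptions for illustration: Q, K and V here carry one token per row, i.e. shape sequence_length × d_k, so the formulas read literally as S = Softmax(Q K^T / √d_k) and Z_i = S V; the concatenated head outputs are combined with W_O on the right, an equivalent transposed layout of Z = W_O (Z_1 ... Z_i) + B_O):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # numerically stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(Q, K, V):
    d_k = Q.shape[1]                                   # number of columns of Q and K
    S = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)       # attention score matrix S
    return S @ V                                       # per-head output Z_i = S V

def multi_head_output(head_outputs, W_O, B_O):
    # Concatenate the head outputs (Z_1 ... Z_i) and combine them with the multi-head
    # attention output weight W_O and output bias B_O to obtain the matrix Z.
    Z_cat = np.concatenate(head_outputs, axis=-1)
    return Z_cat @ W_O + B_O
```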
feed-forward layer first layer weight matrix W for off-chip NAND flash memory cell storage F1 And feed forward layer first layer offset vector B F1 Loading into a weight cache; loading the input matrix X and the final output matrix Z cached and stored in the intermediate result to a neural network computing circuit, and performing addition operation of the input matrix X and the final output matrix Z in the PE array to obtain a matrix A 0 Matrix A 0 Calculating by a layer normalization module to obtain a matrix L 0
The feedforward layer first-layer weight matrix W_F1 and first-layer bias vector B_F1 stored in the weight cache are loaded into the neural network computing circuit; the matrix L_0 and the feedforward layer first-layer weight matrix W_F1 undergo a multiply-add operation, the feedforward layer first-layer bias vector B_F1 is added to the multiply-add result, and the matrix F_0 is obtained through the ReLU; the matrix F_0 is stored in the intermediate result cache; the calculation formula is as follows:
F_0 = ReLU(W_F1 L_0 + B_F1)
feed-forward layer second layer weight matrix W to store off-chip NAND flash memory cells F2 And a feedforward layer second layer offset vector B F2 Loading into a weight cache; matrix F for caching intermediate results 0 And a feedforward layer second layer weight matrix W stored in a weight cache F2 And a feedforward layer second layer offset vector B F2 Loading to a neural network computing circuit, and performing matrix F on the PE array 0 And a feedforward layer second layer weight matrix W F2 Multiply-add operation of adding the result of the multiply-add operation to the second layer offset vector B of the feedforward layer F2 Calculating to obtain a matrix F 1
F_1 = W_F2 F_0 + B_F2
The matrix A_0 stored in the intermediate result cache is loaded into the neural network computing circuit, the matrix F_1 and the matrix A_0 are added to obtain the matrix A_1, the matrix A_1 is passed through the layer normalization module to obtain the result L_1, and the result L_1 is stored in the output cache as the output of the neural network model Transformer encoder;
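The two residual-and-layer-normalization steps and the feedforward layers above can be sketched together as follows (a hedged sketch: one token per column so that F_0 = ReLU(W_F1 L_0 + B_F1) and F_1 = W_F2 F_0 + B_F2 apply literally; ε is the usual numerical-stability constant, and for brevity a single layer-normalization gain/bias pair is reused where a real encoder would have separate parameters per normalization):

```python
import numpy as np

def layer_norm(X, gain, bias, eps=1e-6):
    # Normalize each token (one token per column) and apply the layer-norm gain and bias.
    mean = X.mean(axis=0, keepdims=True)
    var = X.var(axis=0, keepdims=True)
    return gain * (X - mean) / np.sqrt(var + eps) + bias

def add_norm_feedforward(X, Z, W_F1, B_F1, W_F2, B_F2, gain, bias):
    A0 = X + Z                              # residual addition of input X and attention output Z
    L0 = layer_norm(A0, gain, bias)         # A_0 -> L_0 via the layer normalization module
    F0 = np.maximum(W_F1 @ L0 + B_F1, 0.0)  # F_0 = ReLU(W_F1 L_0 + B_F1)
    F1 = W_F2 @ F0 + B_F2                   # F_1 = W_F2 F_0 + B_F2
    A1 = F1 + A0                            # second residual addition with A_0
    return layer_norm(A1, gain, bias)       # L_1: the encoder-layer output
```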
and S3, the off-chip NAND flash memory storage unit adopts the weight data of the currently stored neural network model, and jumps to the step S2 to encode the word vector of the next sentence until all the word vectors of all the input sentences are encoded.
Preferably, the neural network computing circuit comprises a PE (basic operation unit) array for performing matrix multiply-add operations, and other computing modules; the other computing modules comprise any one or more of an adder tree, an activation function operation unit and a nonlinear function operation unit.
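As a behavioural illustration of what a PE array with an adder tree computes (a sketch only, not a hardware description: each PE contributes one multiply, and the partial products for an output element are reduced by a pairwise adder tree; the function names are illustrative):

```python
import numpy as np

def adder_tree_sum(values):
    # Pairwise reduction of partial products, as an adder tree would perform it.
    values = list(values)
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])
        values = paired
    return values[0]

def pe_array_matmul(A, B):
    # Behavioural model of the matrix multiply-add of the PE array: each output element
    # accumulates the products of one row of A with one column of B.
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            C[i, j] = adder_tree_sum(A[i, t] * B[t, j] for t in range(k))
    return C
```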
Preferably, a high-speed interface is also included; the high-speed interface is used for data transmission between the off-chip NAND flash memory storage unit and the internal global cache.
Preferably, between the off-chip NAND flash memory storage unit and the weight cache, the weight data is read and loaded in units of pages.
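A sketch of such page-unit weight loading from NAND flash into the weight cache is given below (hypothetical throughout: the 4096-byte page size, the flat byte layout of the stored weights and the function names are assumptions for illustration, not details taken from the patent):

```python
PAGE_SIZE = 4096  # assumed NAND page size in bytes, for illustration only

def load_layer_weights(nand_image: bytes, offset: int, length: int) -> bytes:
    """Read one layer's weight blob from the NAND image page by page into the weight cache."""
    weight_cache = bytearray()
    first_page = offset // PAGE_SIZE
    last_page = (offset + length - 1) // PAGE_SIZE
    for page in range(first_page, last_page + 1):
        page_data = nand_image[page * PAGE_SIZE:(page + 1) * PAGE_SIZE]  # one page read
        weight_cache.extend(page_data)
    # Trim the page-aligned buffer down to the requested weight region.
    start = offset - first_page * PAGE_SIZE
    return bytes(weight_cache[start:start + length])
```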
Preferably, the off-chip NAND flash memory storage unit is further configured to store intermediate calculation results and/or final operation results of the neural network calculation circuit.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the low-power-consumption neural network accelerator storage framework can meet the calculation requirement of a deep learning algorithm for completing an inference task on end-side equipment, is low in power consumption and has a power-off protection function; the deep learning algorithm is realized by adopting hardware, so that the performance of the deep learning algorithm can be improved, and the operation speed is accelerated.
Drawings
FIG. 1 is a block diagram of the structure of the storage architecture of the NAND flash memory based low power consumption neural network accelerator of the present invention;
FIG. 2 is a schematic block diagram of a neural network model Transformer encoder applied to the storage architecture of the NAND flash based low-power neural network accelerator according to the second embodiment.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
Example one
In this embodiment, a storage architecture of a low-power neural network accelerator based on a NAND flash memory, as shown in fig. 1, includes an off-chip NAND flash memory storage unit, a neural network computing circuit, an internal global cache, and a controller; the internal global cache includes a weight cache, an input cache, an intermediate result cache, and an output cache.
The off-chip NAND flash memory storage unit is used for storing weight data from a cloud or a server.
The controller is used for controlling the flow and calculation of data and is responsible for controlling the calculation of the neural network calculation circuit and the writing-in and writing-out of data.
The neural network computing circuit is used for performing data computation; the neural network computing circuit comprises a PE (basic operation unit) array for matrix multiply-add operations and other computing modules; the other computing modules include any one or more of an adder tree, an activation function operation unit and a nonlinear function operation unit.
The weight cache is used for caching weight data loaded from the off-chip NAND flash memory storage unit; the input cache is used for caching input data; the intermediate result cache is used for caching the intermediate calculation result of the neural network calculation circuit; the output buffer is used for buffering the output data of the neural network computing circuit.
The system also comprises a high-speed interface; the high-speed interface is a medium for data exchange between the off-chip NAND flash memory storage unit and the internal global cache, and is used for data transmission between the off-chip NAND flash memory storage unit and the internal global cache.
The off-chip NAND flash memory storage unit may also be used to store intermediate calculation results and/or final operation results of the neural network calculation circuit.
When neural network computation is carried out, the controller, by controlling the high-speed interface, reads weight data from the off-chip NAND flash memory storage unit and loads it into the weight cache; between the off-chip NAND flash memory storage unit and the weight cache, the weight data is read and loaded in units of pages. Input data is loaded into the input cache; the neural network computing circuit loads the weight data stored in the weight cache and the input data stored in the input cache and then performs the operation; intermediate computation results of the neural network computing circuit are buffered in the intermediate result cache, and the final computation result is buffered in the output cache and then output.
The following description will take the example that the storage architecture of the low-power neural network accelerator is applied to a neural network model with several module layers. The method for calculating the neural network by the storage framework of the low-power-consumption neural network accelerator comprises the following steps: loading the weight data of the neural network model into an off-chip NAND flash memory storage unit from a cloud or a server; the neural network computing circuit sequentially computes each module layer of the neural network model layer by layer:
when the first layer module layer is calculated, corresponding weight data of the first layer module layer in an off-chip NAND flash memory storage unit is loaded into a weight cache, the weight data stored in the weight cache and input data stored in an input cache are loaded into a neural network calculating circuit, then operation is carried out, and a middle calculation result of the neural network calculating circuit is cached into a middle result cache;
when the next module layer is calculated, the corresponding weight data of the currently calculated module layer in the off-chip NAND flash memory storage unit is loaded into the weight cache, and the neural network calculating circuit loads the weight data stored in the weight cache and the intermediate calculation result stored in the intermediate result cache and then performs operation; if the current calculated module layer is not the last module layer, caching the intermediate calculation result of the neural network calculation circuit into an intermediate result cache; if the current calculated module layer is the last module layer, the calculation result of the neural network calculation circuit is cached in an output cache, and then the final calculation result is output.
The low-power-consumption neural network accelerator storage framework can meet the calculation requirement of a deep learning algorithm for completing an inference task on end-side equipment, is low in power consumption and has a power-off protection function; the deep learning algorithm is realized by adopting hardware, so that the performance of the deep learning algorithm can be improved, and the operation speed is accelerated.
The off-chip storage of the invention uses NAND flash memory, a non-volatile storage device whose basic storage cell is a floating-gate MOS transistor. Data is written and read by using an external electric field to control electrons moving into and out of the floating gate, so the stored information is retained in the floating gate and is not lost when the power is cut off. By contrast, most off-chip memories used in neural network accelerators are DRAMs; a DRAM stores one bit with an MOS transistor plus a capacitor, and since the capacitor holds the charge, the charge must be replenished periodically to refresh the data. Once the power is turned off, the charge stored in the capacitor disappears and the data stored in the DRAM is lost. When the end side uses DRAM for off-chip storage, any data loss caused by power failure forces the data to be re-read from the cloud, wasting considerable power, bandwidth and time. Compared with current neural network accelerators based on DRAM as off-chip storage, a neural network accelerator based on NAND flash memory as off-chip storage therefore clearly provides better power-off protection.
The basic memory cell of the off-chip DRAM used by existing neural networks is an MOS transistor and a capacitor; data is stored by charging and discharging the capacitor, and because the charge in the capacitor gradually leaks over time, the data must be refreshed periodically, which causes extra power consumption.
Compared with existing neural network processor storage architectures that use DRAM as off-chip storage, a storage architecture that uses NAND flash memory as off-chip storage has a basic storage cell of only one floating-gate MOS transistor, one capacitor fewer than a DRAM cell consisting of a capacitor and an MOS transistor, and is cheaper than DRAM; this low cost is highly competitive for end-side devices. In addition, because the basic memory cell of NAND flash has one capacitor fewer than that of DRAM, NAND flash achieves higher integration density for the same area; for the same integration level it is lighter and smaller, so a neural network accelerator based on NAND flash as off-chip storage is lighter and more compact and is better suited to end-side devices.
Example two
In this embodiment, the low-power-consumption neural network accelerator storage architecture is applied to a neural network model Transformer encoder as an example. The principle of the neural network model Transformer encoder is shown in FIG. 2. The method for performing neural network computation with the low-power-consumption neural network accelerator storage architecture comprises the following steps:
s1, loading the weight and the bias data of the neural network model from a cloud or a server by an off-chip NAND flash memory storage unit: position coding weight matrix and query vector weight matrix W Q Key vector weight matrix W K Value vector weight matrix W V Query vector offset vector B Q Key vector offset vector B K Value vector offset vector B V Feedforward layer first layer weight matrix W F1 Feedforward layer second layer weight matrix W F2 Feedforward layer first layer offset vector B F1 Feedforward layer second layer offset vector B F2 The multi-head attention module outputs a weight matrix Wo and a multi-head attention output offset vector B O And layer normalized gain and bias;
s2, loading the position coding weight matrix stored in the off-chip NAND flash memory storage unit into a weight cache; loading an input matrix X corresponding to the current word vector into an input cache; the neural network computing circuit takes out an input matrix X from the input cache, takes out a position coding weight matrix from the weight cache, and computes the input matrix X after position coding;
query vector weight matrix W for off-chip NAND flash memory cell storage Q Key vector weight matrix W K Sum vector weight matrix W V And query vector offset vector B Q Key vector offset vector B K Sum vector offset vector B V Loading into a weight cache; taking out the weight and the offset data from the weight cache, carrying out linear layer calculation on the weight and the input matrix X after position coding in a neural network calculation circuit to obtain a matrix Q, a matrix K and a matrix V, and storing the matrix Q, the matrix K and the matrix V into an intermediate result cache; the calculation formula is as follows:
Q = W_Q X + B_Q
K = W_K X + B_K
V = W_V X + B_V
multi-headed stamp for off-chip NAND flash memory storage unit storageGravity module output weight W O And an output offset B O Loading into a weight cache; and loading the matrix Q and the matrix K stored in the intermediate result cache to a neural network computing circuit. Firstly, performing multiplication and addition operation on a matrix in a PE array, and calculating a result of the multiplication and addition operation through a Softmax module to obtain an attention fraction matrix S; the calculation formula is as follows:
S = Softmax(Q K^T / √d_k)
wherein d is k The column numbers of the matrix Q and the matrix K correspond to the dimensionality of the word vector;
then, loading the matrix V stored in the intermediate result cache to a neural network computing circuit, and multiplying the matrix V with the attention fractional matrix S to obtain an output matrix Z i (ii) a Multi-head attention module output weight W for caching weight O And an output bias B O Loading to a neural network computing circuit; output matrix Z of each head of multi-head attention module i Output weight matrix W of multi-head attention module after splicing O Performing multiplication and addition operation, adding output offset vector B to the result of the multiplication and addition operation O Obtaining a matrix Z; storing the matrix Z into an intermediate result cache; the calculation formula is as follows:
Z_i = S V
Z = W_O (Z_1 ... Z_i) + B_O
feed-forward layer first layer weight matrix W for off-chip NAND flash memory cell storage F1 And feed forward layer first layer offset vector B F1 Loading into a weight cache; loading the input matrix X and the final output matrix Z cached and stored in the intermediate result to a neural network computing circuit, and performing addition operation of the input matrix X and the final output matrix Z in the PE array to obtain a matrix A 0 Matrix A 0 Calculating by a layer normalization module to obtain a matrix L 0
The feedforward layer first-layer weight matrix W_F1 and first-layer bias vector B_F1 stored in the weight cache are loaded into the neural network computing circuit; the matrix L_0 and the feedforward layer first-layer weight matrix W_F1 undergo a multiply-add operation, the feedforward layer first-layer bias vector B_F1 is added to the multiply-add result, and the matrix F_0 is obtained through the ReLU; the matrix F_0 is stored in the intermediate result cache; the calculation formula is as follows:
F_0 = ReLU(W_F1 L_0 + B_F1)
feed-forward layer second layer weight matrix W to store off-chip NAND flash memory cells F2 And a feedforward layer second layer offset vector B F2 Loading into a weight cache; matrix F for caching intermediate results 0 And a feedforward layer second layer weight matrix W stored in a weight cache F2 And a feedforward layer second layer offset vector B F2 Loading to a neural network computing circuit, and performing matrix F on the PE array 0 And a feedforward layer second layer weight matrix W F2 Multiply-add operation of adding the result of the multiply-add operation to the second layer offset vector B of the feedforward layer F2 Calculating to obtain a matrix F 1
F_1 = W_F2 F_0 + B_F2
The matrix A_0 stored in the intermediate result cache is loaded into the neural network computing circuit, the matrix F_1 and the matrix A_0 are added to obtain the matrix A_1, the matrix A_1 is passed through the layer normalization module to obtain the result L_1, and the result L_1 is stored in the output cache as the output of the neural network model Transformer encoder;
and S3, the off-chip NAND flash memory storage unit adopts the weight data of the currently stored neural network model, and jumps to the step S2 to encode the word vector of the next sentence until all the word vectors of all the input sentences are encoded.
The Softmax module and the layer normalization module are both nonlinear function operation units of the neural network computing circuit.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such modifications are intended to be included in the scope of the present invention.

Claims (4)

1. A NAND flash memory-based low-power-consumption neural network accelerator storage architecture, characterized in that: it comprises an off-chip NAND flash memory storage unit, a neural network computing circuit, an internal global cache and a controller; the internal global cache comprises a weight cache, an input cache, an intermediate result cache and an output cache;
the off-chip NAND flash memory storage unit is used for storing weight data from a cloud or a server;
the controller is used for controlling the flow and calculation of data and is responsible for controlling the calculation of the neural network calculation circuit and the writing-in and writing-out of the data;
the neural network computing circuit is used for carrying out data computation;
the weight cache is used for caching weight data loaded from an off-chip NAND flash memory storage unit; the input cache is used for caching input data; the intermediate result cache is used for caching the intermediate calculation result of the neural network calculation circuit; the output cache is used for caching output data of the neural network computing circuit;
when the neural network calculation is carried out, the controller reads the weight data of the off-chip NAND flash memory storage unit and loads the weight data into the weight cache; loading input data into an input cache; the neural network computing circuit loads the weight data stored in the weight cache and the input data stored in the input cache, then the operation is carried out, the intermediate computing result of the neural network computing circuit is cached in the intermediate result cache, the final computing result is cached in the output cache, and the final computing result is output;
the method is applied to a neural network model with a plurality of module layers; the method for calculating the neural network by the low-power-consumption neural network accelerator storage framework comprises the following steps: loading the weight data of the neural network model into an off-chip NAND flash memory storage unit from a cloud or a server; the neural network computing circuit sequentially computes each module layer of the neural network model layer by layer:
when the first layer module layer is calculated, corresponding weight data of the first layer module layer in an off-chip NAND flash memory storage unit is loaded into a weight cache, the weight data stored in the weight cache and input data stored in an input cache are loaded into a neural network calculating circuit, then operation is carried out, and a middle calculation result of the neural network calculating circuit is cached into a middle result cache;
when the next module layer is calculated, the corresponding weight data of the currently calculated module layer in the off-chip NAND flash memory storage unit is loaded into the weight cache, and the neural network calculating circuit loads the weight data stored in the weight cache and the intermediate calculation result stored in the intermediate result cache and then performs operation; if the current calculated module layer is not the last module layer, caching the intermediate calculation result of the neural network calculation circuit into an intermediate result cache; if the current calculated module layer is the last module layer, caching the calculation result of the neural network calculation circuit into an output cache, and outputting the final calculation result;
applying the neural network model transform coder; the method for calculating the neural network by the low-power-consumption neural network accelerator storage framework comprises the following steps: the method comprises the following steps:
s1, loading the weight and the bias data of the neural network model from a cloud or a server by an off-chip NAND flash memory storage unit: position coding weight matrix and query vector weight matrix W Q Key vector weight matrix W K Value vector weight matrix W V Query vector offset vector B Q Key vector offset vector B K Value vector offset vector B V Feedforward layer first layer weight matrix W F1 Feedforward layer second layer weight matrix W F2 Feedforward layer first layer offset vector B F1 Feedforward layer second layer offset vector B F2 The multi-head attention module outputs a weight matrix Wo and a multi-head attention output offset vector B O And layer normalized gain and bias;
s2, loading the position coding weight matrix stored in the off-chip NAND flash memory storage unit into a weight cache; loading an input matrix X corresponding to the current word vector into an input cache; the neural network computing circuit takes out an input matrix X from the input cache, takes out a position coding weight matrix from the weight cache, and computes the input matrix X after position coding;
query vector weight matrix W for storing off-chip NAND flash memory cells Q Key vector weight matrix W K Sum vector weight matrix W V And query vector offset vector B Q Key vector offset vector B K Sum vector offset vector B V Loading into a weight cache; taking out the weight and the offset data from the weight cache, carrying out linear layer calculation on the weight and the input matrix X after position coding in a neural network calculation circuit to obtain a matrix Q, a matrix K and a matrix V, and storing the matrix Q, the matrix K and the matrix V into an intermediate result cache;
multi-head attention module output weight W for storing off-chip NAND flash memory storage unit O And an output offset B O Loading into a weight cache; loading the matrix Q and the matrix K which are cached and stored by the intermediate result to a neural network computing circuit; performing multiplication and addition operation on the matrix in the PE array, and calculating the multiplication and addition operation result through a Softmax module to obtain an attention fraction matrix S;
then, loading the matrix V stored in the intermediate result cache to a neural network computing circuit, and multiplying the matrix V with the attention fractional matrix S to obtain an output matrix Z i (ii) a Multi-head attention module output weight W for caching weight O And an output offset B O Loading to a neural network computing circuit; output matrix Z of each head of multi-head attention module i Output weight matrix W of multi-head attention module after splicing O Performing multiplication and addition operation, adding output offset vector B to the result of multiplication and addition operation O Obtaining a matrix Z; storing the matrix Z into an intermediate result cache;
feed-forward layer first layer weight matrix W for off-chip NAND flash memory cell storage F1 And feed forward layer first layer offset vector B F1 Loading into a weight cache; loading the input matrix X and the final output matrix Z stored in the intermediate result cache to the neural network computing computerIn the way, the addition operation of the input matrix X and the final output matrix Z is carried out on the PE array to obtain a matrix A 0 Matrix A 0 Obtaining a matrix L through calculation of a layer normalization module 0
the feedforward layer first-layer weight matrix W_F1 and first-layer bias vector B_F1 stored in the weight cache are loaded into the neural network computing circuit; the matrix L_0 and the feedforward layer first-layer weight matrix W_F1 undergo a multiply-add operation, the feedforward layer first-layer bias vector B_F1 is added to the multiply-add result, and the matrix F_0 is obtained through the ReLU; the matrix F_0 is stored in the intermediate result cache;
feed-forward layer second layer weight matrix W to store off-chip NAND flash memory cells F2 And a feedforward layer second layer offset vector B F2 Loading into a weight cache; matrix F for caching intermediate results 0 And a feedforward layer second layer weight matrix W stored in a weight cache F2 And a feedforward layer second layer offset vector B F2 Loading to a neural network computing circuit, and performing matrix F on the PE array 0 And a feedforward layer second layer weight matrix W F2 Multiply-add operation of adding the result of the multiply-add operation to the second layer offset vector B of the feedforward layer F2 Calculating to obtain a matrix F 1
the matrix A_0 stored in the intermediate result cache is loaded into the neural network computing circuit, the matrix F_1 and the matrix A_0 are added to obtain the matrix A_1, the matrix A_1 is passed through the layer normalization module to obtain the result L_1, and the result L_1 is stored in the output cache as the output of the neural network model Transformer encoder;
and S3, the off-chip NAND flash memory storage unit adopts the currently stored weight data of the neural network model, and skips to the step S2 to encode the word vector of the next sentence until all the word vectors of all the input sentences are encoded.
2. The NAND-flash-based low power neural network accelerator memory architecture of claim 1, wherein: the system also comprises a high-speed interface; the high-speed interface is used for data transmission between the off-chip NAND flash memory storage unit and the internal global cache.
3. The NAND-flash-based low power neural network accelerator memory architecture of claim 1, wherein: between the off-chip NAND flash memory cell and the weight cache, the weight data is read and loaded by page unit.
4. The NAND-flash-based low power neural network accelerator memory architecture of claim 1, wherein: the off-chip NAND flash memory storage unit is also used for storing the final operation result of the neural network computing circuit.
CN202110349392.2A 2021-03-31 2021-03-31 NAND flash memory-based low-power-consumption neural network accelerator storage architecture Active CN113159309B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110349392.2A CN113159309B (en) 2021-03-31 2021-03-31 NAND flash memory-based low-power-consumption neural network accelerator storage architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110349392.2A CN113159309B (en) 2021-03-31 2021-03-31 NAND flash memory-based low-power-consumption neural network accelerator storage architecture

Publications (2)

Publication Number Publication Date
CN113159309A CN113159309A (en) 2021-07-23
CN113159309B true CN113159309B (en) 2023-03-21

Family

ID=76885744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110349392.2A Active CN113159309B (en) 2021-03-31 2021-03-31 NAND flash memory-based low-power-consumption neural network accelerator storage architecture

Country Status (1)

Country Link
CN (1) CN113159309B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117787366B (en) * 2024-02-28 2024-05-10 苏州元脑智能科技有限公司 Hardware accelerator and scheduling method thereof

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222626A (en) * 2019-11-07 2020-06-02 合肥恒烁半导体有限公司 Data segmentation operation method of neural network based on NOR Flash module
CN111241028A (en) * 2018-11-28 2020-06-05 北京知存科技有限公司 Digital-analog hybrid storage and calculation integrated chip and calculation device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200184335A1 (en) * 2018-12-06 2020-06-11 Western Digital Technologies, Inc. Non-volatile memory die with deep learning neural network
CN109948774B (en) * 2019-01-25 2022-12-13 中山大学 Neural network accelerator based on network layer binding operation and implementation method thereof
CN110490311A (en) * 2019-07-08 2019-11-22 华南理工大学 Convolutional neural networks accelerator and its control method based on RISC-V framework
CN111062471B (en) * 2019-11-23 2023-05-02 复旦大学 Deep learning accelerator for accelerating BERT neural network operation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241028A (en) * 2018-11-28 2020-06-05 北京知存科技有限公司 Digital-analog hybrid storage and calculation integrated chip and calculation device
CN111222626A (en) * 2019-11-07 2020-06-02 合肥恒烁半导体有限公司 Data segmentation operation method of neural network based on NOR Flash module

Also Published As

Publication number Publication date
CN113159309A (en) 2021-07-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant