CN111967587A - Arithmetic unit array structure for neural network processing - Google Patents
- Publication number
- CN111967587A CN111967587A CN202010728621.7A CN202010728621A CN111967587A CN 111967587 A CN111967587 A CN 111967587A CN 202010728621 A CN202010728621 A CN 202010728621A CN 111967587 A CN111967587 A CN 111967587A
- Authority
- CN
- China
- Prior art keywords
- operation unit
- unit module
- module
- local bus
- intermediate result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000013528 artificial neural network Methods 0.000 title claims abstract description 14
- 230000005284 excitation Effects 0.000 claims description 12
- 230000005540 biological transmission Effects 0.000 claims description 6
- 238000009825 accumulation Methods 0.000 claims description 4
- 230000003993 interaction Effects 0.000 claims description 4
- 230000002457 bidirectional effect Effects 0.000 claims description 2
- 238000004364 calculation method Methods 0.000 abstract description 3
- 238000010586 diagram Methods 0.000 description 6
- 238000000034 method Methods 0.000 description 6
- 230000008569 process Effects 0.000 description 4
- 230000003068 static effect Effects 0.000 description 3
- 230000003044 adaptive effect Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 2
- 230000008707 rearrangement Effects 0.000 description 2
- 230000008859 change Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 238000013527 convolutional neural network Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000005265 energy consumption Methods 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 230000009191 jumping Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
- G06F9/30134—Register stacks; shift registers
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Neurology (AREA)
- Complex Calculations (AREA)
Abstract
The invention relates to an arithmetic unit array structure for neural network processing, composed of arithmetic unit modules and local bus modules. A single arithmetic unit module completes a one-dimensional convolution; the local bus module transmits intermediate results upward and accumulates them to complete the two-dimensional convolution, reducing intermediate-result write-back and improving the overall energy-efficiency ratio of the system. Multiple register files inside each arithmetic unit module allow super-two-dimensional convolution over several convolution kernels to proceed simultaneously, further increasing data reuse and reducing intermediate-result write-back. The array is self-organizing: upon receiving control signals from the top level, the local bus module automatically computes, from the IDs of adjacent arithmetic units, the spatial position each unit needs in order to complete the two-dimensional convolution; the units then receive and send data and perform the associated operations autonomously. The invention improves computational efficiency in neural network processing.
Description
Technical Field
The invention belongs to the technical field of integrated circuit design, and particularly relates to an arithmetic unit array structure for neural network processing.
Background
Deep convolutional neural networks have been successfully applied in important fields such as computer vision, speech recognition and robot control, but these applications place ever higher demands on the accuracy and complexity of neural network algorithms, so their implementation faces a series of challenging problems. Traditional processor architectures have made some progress, yet they still suffer from low data reuse and a poor energy-efficiency ratio when moving data between arithmetic units. To mitigate these problems, researchers have in recent years designed spatial processor architectures based on array parallelism; with a suitable dataflow strategy, the data reuse and operating speed of a neural network algorithm can be improved significantly.
Convolution is the most basic operation in a neural network algorithm, and today's deep convolutional neural networks generally require convolution operations with a huge computational load. Convolution, a tensor operation, is described by a mathematical expression whose key realization is the multiply-accumulation of the weights of multiple convolution kernels with the values of the input feature map. If it is solved directly, term by term according to the formula, then as algorithm complexity and data volume grow the direct method must frequently read and write data from external storage, greatly lowering the system's energy-efficiency ratio. The alternative is to adopt a suitable dataflow strategy that keeps a certain data type stationary and reduces the number of data reads and writes; such a dataflow can select the appropriate storage hierarchy for each access and minimize the energy consumed by memory traffic. An arithmetic unit array matched to the dataflow strategy is a common hardware implementation; it simplifies the input and output buses and thus greatly improves data transmission efficiency. Common dataflow strategies include Weight Stationary (WS), Output Stationary (OS) and Row Stationary (RS). The weight-stationary strategy improves only the reuse of weights, and the output-stationary strategy reduces only the number of intermediate-result reads and writes, whereas the row-stationary strategy improves the reuse of all three data types and reduces their read-write counts.
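The multiply-accumulate core of the convolution just described can be sketched directly. This is an illustrative reference implementation, not code from the patent; the function name and the NumPy-based formulation are my own choices.

```python
import numpy as np

def conv2d_mac(ifmap, weights):
    """Direct 2-D convolution as nested multiply-accumulate loops --
    the operation the patent's array decomposes into 1-D row convolutions."""
    H, W = ifmap.shape
    R, S = weights.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for oy in range(H - R + 1):          # output rows
        for ox in range(W - S + 1):      # output columns
            for ky in range(R):          # kernel rows
                for kx in range(S):      # kernel columns
                    out[oy, ox] += ifmap[oy + ky, ox + kx] * weights[ky, kx]
    return out
```

Solving the formula this way on a conventional processor streams every operand from memory each time it is needed, which is exactly the frequent external read/write traffic the text identifies as the energy-efficiency problem.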
With the row-stationary dataflow strategy, multiple convolution kernels can also be brought into each arithmetic unit for super-two-dimensional operation, further increasing data reuse and reducing the write-back frequency of intermediate results. This design adopts the row-stationary dataflow strategy and uses a two-dimensional arithmetic unit array structure to implement the neural network algorithm.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention aims to provide an arithmetic unit array structure for neural network processing that adopts a row-stationary dataflow strategy and a two-dimensional arithmetic unit array arrangement, maximizing the degree of data reuse.
the invention provides an arithmetic unit array structure facing neural network processing, wherein: the system comprises an arithmetic unit module and a local bus module, wherein the input end of the local bus module is connected with an ID value;
the operation unit module is divided into a top operation unit module, a middle operation unit module and a bottom operation unit module, and the local bus module is positioned in the vertical direction of the operation unit module; the operation unit module consists of a state machine, a register file and a multiply-accumulate unit module, wherein the register file comprises an input excitation register file, a weight register file and an intermediate result register file, the data request input end of the state machine is connected with the input end of the register file, and the register file and the multiply-accumulate unit are in bidirectional interaction;
the operation unit module is the most basic unit for completing convolution operation, is responsible for receiving input data from the input local bus module, completes one-dimensional convolution operation, sends an intermediate result to the local bus module for upward transmission and completes accumulation of the intermediate result according to different array positions of the operation unit module, and finally obtains output excitation;
the local bus module is responsible for unidirectional interaction of intermediate results between adjacent operation units in the vertical direction and calculates the spatial position of the operation units according to the ID values of the adjacent operation unit modules;
the local bus module obtains a group of enabling signals according to the ID value of the operation unit module in the vertical direction, and feeds the enabling signals back to the operation unit module connected with the local bus module, so that the spatial position of each operation unit module is calculated. According to the enabling signal, the operation unit module reads input data and carries out a convolution operation unit to obtain an intermediate result of one-dimensional convolution operation; then the operation unit module at the bottom transmits the middle result to the middle operation unit module along the local bus, the middle operation unit module accumulates the middle result and the input middle result, and transmits the accumulated middle result to the operation unit module at the top; the top arithmetic unit module accumulates the intermediate result of the top arithmetic unit module and the intermediate result from the storage system and the bottom arithmetic unit module to finally obtain the output excitation.
In the invention, each arithmetic unit module contains register files for three data types, forming the lowest storage level of the processor architecture. Keeping input data fixed in these registers raises the degree of data reuse and reduces the number of input-data reads and writes, thereby lowering the power consumed by memory access. Intermediate results of the multiply-accumulate operations are also buffered there, reducing their read-write count and further improving the system's energy-efficiency ratio.
The beneficial effects of the invention are as follows: the row-stationary dataflow strategy avoids writing back intermediate results during the two-dimensional convolution, reducing their read-write count and achieving a high energy-efficiency ratio. In addition, introducing the register-file storage level inside the arithmetic unit not only guarantees functional correctness but further raises data reuse and lowers the intermediate-result write-back frequency, further improving the system's energy-efficiency ratio.
Drawings
Fig. 1 is a basic block diagram of an arithmetic unit array structure.
Fig. 2 is a block diagram of the arithmetic unit.
FIG. 3 is a layout diagram of spatial locations of the operation cells.
FIG. 4 is a schematic diagram of window-sliding data rearrangement inside an arithmetic unit, where (a) shows the input-stimulus register of the arithmetic unit and (b) shows the shift operation on that register.
Reference numbers in the figures: 1 is an arithmetic unit module, 2 is a local bus module, 3 is a register file, 4 is an input-stimulus register file, 5 is a weight register file, 6 is an intermediate-result register file, 7 is a state machine, 8 is a multiply-accumulate unit module, 9 is a top arithmetic unit module, 10 is a middle arithmetic unit module, and 11 is a bottom arithmetic unit module.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings.
Example 1: the basic block diagram of the arithmetic unit array structure is shown in fig. 1, and the design works as follows. First, the local bus module 2 derives a group of enable signals from the ID values of the arithmetic unit modules 1 in the vertical direction and feeds them back to the arithmetic unit modules 1 connected to it, thereby determining the spatial position of each arithmetic unit module 1. Driven by the enable signals, each arithmetic unit module 1 reads its input data and performs a convolution to obtain the intermediate result of a one-dimensional convolution. The bottom arithmetic unit module 11 then passes its intermediate result up the local bus to the middle arithmetic unit module 10, which accumulates it with its own intermediate result and passes the sum on to the top arithmetic unit module 9; module 9 accumulates its own intermediate result with those from the storage system and from module 11, finally obtaining the output stimulus.
The structure of the arithmetic unit module 1 is shown in fig. 2. The module includes six FIFO interfaces: the read FIFOs correspond to the input stimulus, the weights, the intermediate result along the channel direction and the intermediate result from the bottom arithmetic unit module 11, while the write FIFOs correspond to the intermediate result transmitted upward and the intermediate result written back to the storage system. Together these FIFOs form the data transmission channel between the arithmetic unit module 1 and the local bus module 2. The enable signals from the local bus module 2 control how the arithmetic unit module 1 reads and writes the intermediate-result FIFOs, and different combinations of enables realize the module's relative position in the spatial array of the software model; fig. 3 lists the combinations of enable signals and the corresponding spatial positions of the arithmetic unit. Note that any enable combination other than those listed in fig. 3 is illegal. Register files 3 for the three data types are introduced into the arithmetic unit module 1 to form the lowest level of the storage system, which raises data reuse, reduces the read-write count of the higher storage levels and lowers the power cost of data access. Table 1 lists the data bit widths and depths of the three register files inside the convolution arithmetic unit. The depth of the input-stimulus register file 4 is 12, equal to the height of the convolution array, i.e. the maximum convolution kernel size; the maximum window width for each convolution can be extended according to design requirements.
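The derivation of a unit's role (bottom, middle, top) from the column of configured IDs can be sketched behaviorally. This is hypothetical: the patent does not disclose the actual enable encoding of the local bus, so the function name, flag names and bottom-to-top ordering below are all my own assumptions.

```python
def classify_position(ids_in_column):
    """Hypothetical sketch: given the configured IDs of the arithmetic units
    in one vertical column (assumed bottom-to-top order), produce per-unit
    enable flags marking each unit as bottom, middle or top of the
    accumulation chain. The real enable encoding is not specified here."""
    n = len(ids_in_column)
    enables = []
    for i, _uid in enumerate(ids_in_column):
        enables.append({
            'is_bottom': i == 0,      # pushes its partial sum upward only
            'is_top': i == n - 1,     # final accumulation and write-back
        })                            # neither flag set => middle unit
    return enables
```

Any combination outside {bottom, middle, top} — e.g. a unit flagged as both bottom and top in a multi-unit column — would correspond to the illegal enable combinations the description warns about.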
The depth of the weight register file 5 is 224; this large value is needed because multiple convolution kernels must be stored simultaneously to support super-two-dimensional operation inside the arithmetic unit. The depth of the intermediate-result register file 6 is 24, where each position represents an output stimulus in a different channel direction; this value limits the maximum number of convolution kernels that can be kept stationary inside the one-dimensional convolution unit. The data transmission paths and the internal storage hierarchy of the convolution unit thus guarantee the data supply for the operation. The internal workflow is governed by state machine 7, whose state jumps complete the one-dimensional convolution and the data reads in order: IDLE is the idle state; SET_CONFIG reads the configuration parameters, including the row width of the input stimulus, the row width of the convolution kernel, the horizontal stride of the convolution, and the number of kernels used for super-two-dimensional operation; READ_FILTER reads the corresponding number of weights from the FIFO into the internal register file according to the configuration parameters; READ_WINDOW_DATA reads the input stimuli: on the first jump into this state it reads a full kernel row-width of input stimuli, and thereafter it reads a number of input stimuli determined by the stride parameter and places them into the internal register file.
Note that the data read for each window slide, governed by the stride parameter, overwrites the data discarded by the slide, which increases the reuse of the input stimuli; however, the slide changes the relative positions of the input stimuli inside the register file, so a shifter is introduced in the register to rearrange the window data into the correct positions. The rearrangement process is shown in fig. 4. The data-and-ALU state feeds data from the internal registers into the arithmetic unit for multiply-accumulate operations. According to the classification of arithmetic unit modules 1, the bottom module 11 writes its intermediate result back to the FIFO for upward transmission, while the middle modules 10 and top modules 9 temporarily write the result back to register file 3. PSUM_READ_MEM & ACCUM_PSUM_MEM accumulates intermediate results inside the arithmetic unit along the channel direction; only top arithmetic units may jump into this state, and the jump is illegal for the other unit types.
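The shifter behavior just described — keep the window data still covered by the next window, shift it into place, and refill the vacated slots with newly read stimuli — can be sketched as follows. The function name and list representation are illustrative assumptions, not the patent's implementation.

```python
def slide_window(regfile, new_data, stride):
    """Model of the input-stimulus register shifter: discard the `stride`
    oldest entries, keep the reused window data, and append the stimuli
    newly read from the FIFO for the next window position."""
    kept = list(regfile[stride:])    # data still covered by the next window
    return kept + list(new_data)     # vacated slots refilled with new reads
```

For example, with a register holding [1, 2, 3] and stride 1, one slide reuses 2 and 3 and reads only one new value, which is how the shifter raises input-stimulus reuse.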
TABLE 1
Register file | Data bit width | Depth |
---|---|---|
4 (input stimulus) | INT8 | 12 |
5 (weight) | INT8 | 224 |
6 (intermediate result) | INT32 | 24 |
PSUM_READ_LOCAL & ACCUM_PSUM_LOCAL reads the intermediate result from the FIFO for accumulation; only top and middle arithmetic units jump into this state. FINISH marks the end of the current operation and resets the relevant registers.
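The state machine walked through above can be sketched as an enum with a jump function. The state names are taken from the description (with the PSUM read/accumulate pairs shortened), but the exact ordering of the two PSUM states in a top unit is my own guess, since the patent only says which unit types may enter each state.

```python
from enum import Enum, auto

class PEState(Enum):
    IDLE = auto()
    SET_CONFIG = auto()        # read row widths, stride, kernel count
    READ_FILTER = auto()       # load weights from the FIFO
    READ_WINDOW_DATA = auto()  # load input stimuli (full row, then by stride)
    DATA_ALU = auto()          # multiply-accumulate on register data
    PSUM_LOCAL = auto()        # PSUM_READ_LOCAL & ACCUM_PSUM_LOCAL (top/middle)
    PSUM_MEM = auto()          # PSUM_READ_MEM & ACCUM_PSUM_MEM (top only)
    FINISH = auto()            # reset the relevant registers

def next_state(state, position):
    """Sketch of the legal jumps; `position` in {'top', 'middle', 'bottom'}
    is the role assigned by the local-bus enable signals."""
    if state is PEState.IDLE:
        return PEState.SET_CONFIG
    if state is PEState.SET_CONFIG:
        return PEState.READ_FILTER
    if state is PEState.READ_FILTER:
        return PEState.READ_WINDOW_DATA
    if state is PEState.READ_WINDOW_DATA:
        return PEState.DATA_ALU
    if state is PEState.DATA_ALU:
        if position == 'bottom':
            return PEState.FINISH       # bottom unit pushes its psum to the FIFO
        return PEState.PSUM_LOCAL       # top/middle units accumulate from below
    if state is PEState.PSUM_LOCAL:
        return PEState.PSUM_MEM if position == 'top' else PEState.FINISH
    if state is PEState.PSUM_MEM:
        return PEState.FINISH           # legal for top units only
    return PEState.IDLE                 # FINISH returns to IDLE
```

Jumps not produced by this function (for instance a middle unit entering PSUM_MEM) model the illegal jumps called out in the description.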
Claims (1)
1. An arithmetic unit array structure for neural network processing, characterized in that: it comprises arithmetic unit modules and local bus modules, wherein the input of each local bus module is connected to an ID value;
the arithmetic unit modules are divided into top, middle and bottom arithmetic unit modules, with the local bus modules placed along the vertical direction of the arithmetic unit modules; each arithmetic unit module consists of a state machine, a register file and a multiply-accumulate unit module, wherein the register file comprises an input-stimulus register file, a weight register file and an intermediate-result register file, the data-request port of the state machine is connected to the input of the register file, and the register file and the multiply-accumulate unit interact bidirectionally;
the arithmetic unit module is the most basic unit for completing a convolution: it receives input data from the local bus module, completes a one-dimensional convolution, sends the intermediate result to the local bus module for upward transmission, and, depending on its position in the array, accumulates intermediate results to finally obtain the output stimulus;
the local bus module is responsible for the unidirectional exchange of intermediate results between vertically adjacent arithmetic units, and computes each unit's spatial position from the ID values of the adjacent arithmetic unit modules;
from the ID values of the arithmetic unit modules in the vertical direction, the local bus module derives a group of enable signals and feeds them back to the arithmetic unit modules connected to it, thereby determining the spatial position of each arithmetic unit module; driven by the enable signals, each arithmetic unit module reads its input data and performs a convolution to obtain the intermediate result of a one-dimensional convolution; the bottom arithmetic unit module then passes its intermediate result up the local bus to the middle arithmetic unit module, which accumulates it with its own intermediate result and passes the sum on to the top arithmetic unit module; the top module accumulates its own intermediate result with those from the storage system and from the bottom module, finally obtaining the output stimulus.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010728621.7A CN111967587B (en) | 2020-07-27 | 2020-07-27 | Method for constructing operation unit array structure facing neural network processing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010728621.7A CN111967587B (en) | 2020-07-27 | 2020-07-27 | Method for constructing operation unit array structure facing neural network processing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111967587A true CN111967587A (en) | 2020-11-20 |
CN111967587B CN111967587B (en) | 2024-03-29 |
Family
ID=73362989
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010728621.7A Active CN111967587B (en) | 2020-07-27 | 2020-07-27 | Method for constructing operation unit array structure facing neural network processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111967587B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113422786A (en) * | 2021-08-24 | 2021-09-21 | 机械科学研究总院江苏分院有限公司 | Communication system and communication method based on Internet of things equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107341544A (en) * | 2017-06-30 | 2017-11-10 | 清华大学 | A kind of reconfigurable accelerator and its implementation based on divisible array |
US20180307976A1 (en) * | 2017-04-19 | 2018-10-25 | Beijing Deephi Intelligence Technology Co., Ltd. | Device for implementing artificial neural network with separate computation units |
CN109993297A (en) * | 2019-04-02 | 2019-07-09 | 南京吉相传感成像技术研究院有限公司 | A kind of the sparse convolution neural network accelerator and its accelerated method of load balancing |
CN110390384A (en) * | 2019-06-25 | 2019-10-29 | 东南大学 | A kind of configurable general convolutional neural networks accelerator |
CN110751280A (en) * | 2019-09-19 | 2020-02-04 | 华中科技大学 | Configurable convolution accelerator applied to convolutional neural network |
-
2020
- 2020-07-27 CN CN202010728621.7A patent/CN111967587B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180307976A1 (en) * | 2017-04-19 | 2018-10-25 | Beijing Deephi Intelligence Technology Co., Ltd. | Device for implementing artificial neural network with separate computation units |
CN107341544A (en) * | 2017-06-30 | 2017-11-10 | 清华大学 | A kind of reconfigurable accelerator and its implementation based on divisible array |
CN109993297A (en) * | 2019-04-02 | 2019-07-09 | 南京吉相传感成像技术研究院有限公司 | A kind of the sparse convolution neural network accelerator and its accelerated method of load balancing |
CN110390384A (en) * | 2019-06-25 | 2019-10-29 | 东南大学 | A kind of configurable general convolutional neural networks accelerator |
CN110751280A (en) * | 2019-09-19 | 2020-02-04 | 华中科技大学 | Configurable convolution accelerator applied to convolutional neural network |
Non-Patent Citations (1)
Title |
---|
CHEN Xiaodong et al.: "FPGA-based neural network controller and its applications" (基于FPGA的神经网络控制器及其应用), Proceedings of the Sixth National Conference on Information Acquisition and Processing (3), 31 August 2008 (2008-08-31) *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113422786A (en) * | 2021-08-24 | 2021-09-21 | 机械科学研究总院江苏分院有限公司 | Communication system and communication method based on Internet of things equipment |
Also Published As
Publication number | Publication date |
---|---|
CN111967587B (en) | 2024-03-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ma et al. | Optimizing the convolution operation to accelerate deep neural networks on FPGA | |
CN110097174B (en) | Method, system and device for realizing convolutional neural network based on FPGA and row output priority | |
JP7329533B2 (en) | Method and accelerator apparatus for accelerating operations | |
Demmel et al. | Avoiding communication in sparse matrix computations | |
JP7358382B2 (en) | Accelerators and systems for accelerating calculations | |
Li et al. | A high performance FPGA-based accelerator for large-scale convolutional neural networks | |
CN106940815A (en) | A kind of programmable convolutional neural networks Crypto Coprocessor IP Core | |
CN107657581A (en) | Convolutional neural network CNN hardware accelerator and acceleration method | |
CN109977347B (en) | Reconfigurable FFT processor supporting multimode configuration | |
WO2020156508A1 (en) | Method and device for operating on basis of chip with operation array, and chip | |
CN111898733A (en) | Deep separable convolutional neural network accelerator architecture | |
US20230135185A1 (en) | Pooling unit for deep learning acceleration | |
Liu et al. | Toward full-stack acceleration of deep convolutional neural networks on FPGAs | |
Lu et al. | CHIP-KNN: A configurable and high-performance k-nearest neighbors accelerator on cloud FPGAs | |
CN111144556B (en) | Hardware circuit of range batch normalization algorithm for deep neural network training and reasoning | |
CN111967587B (en) | Method for constructing operation unit array structure facing neural network processing | |
Andri et al. | Going further with winograd convolutions: Tap-wise quantization for efficient inference on 4x4 tiles | |
CN113157638A (en) | Low-power-consumption in-memory calculation processor and processing operation method | |
CN104869284A (en) | High-efficiency FPGA implementation method and device for bilinear interpolation amplification algorithm | |
Sun et al. | Efficient tensor cores support in tvm for low-latency deep learning | |
CN113312285B (en) | Convolutional neural network accelerator and working method thereof | |
Zhao et al. | A deep residual networks accelerator on FPGA | |
CN115170381A (en) | Visual SLAM acceleration system and method based on deep learning | |
CN113642722A (en) | Chip for convolution calculation, control method thereof and electronic device | |
Brown et al. | Nemo-cnn: An efficient near-memory accelerator for convolutional neural networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |