CN111967587A - Arithmetic unit array structure for neural network processing - Google Patents

Arithmetic unit array structure for neural network processing

Info

Publication number
CN111967587A
CN111967587A
Authority
CN
China
Prior art keywords
operation unit
unit module
module
local bus
intermediate result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010728621.7A
Other languages
Chinese (zh)
Other versions
CN111967587B (en)
Inventor
韩军 (Han Jun)
张权 (Zhang Quan)
张永亮 (Zhang Yongliang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202010728621.7A priority Critical patent/CN111967587B/en
Publication of CN111967587A publication Critical patent/CN111967587A/en
Application granted granted Critical
Publication of CN111967587B publication Critical patent/CN111967587B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/3012Organisation of register space, e.g. banked or distributed register file
    • G06F9/30134Register stacks; shift registers

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to an arithmetic unit array structure for neural network processing, composed of arithmetic unit modules and local bus modules. A single arithmetic unit module completes a one-dimensional convolution; the local bus module passes intermediate results upward, where they are accumulated to complete the two-dimensional convolution, which reduces write-back of intermediate results and improves the overall energy efficiency ratio of the system. Each arithmetic unit module contains several register files and can perform the "super-two-dimensional" convolution of several convolution kernels simultaneously, further raising data reuse and reducing intermediate-result write-back. The array is self-organizing: it receives control signals from the top level, each local bus module automatically computes, from the ID configuration of adjacent arithmetic units, the spatial position the current unit occupies in the two-dimensional convolution, and the units then autonomously handle data reception, transmission, and the associated operations. The invention improves computational efficiency in neural network processing.

Description

Arithmetic unit array structure for neural network processing
Technical Field
The invention belongs to the technical field of integrated circuit design, and particularly relates to an arithmetic unit array structure for neural network processing.
Background
Deep convolutional neural networks are widely applied in important fields such as computer vision, speech recognition, and robot control, but these applications also place ever higher demands on the precision and complexity of neural network algorithms, so their implementation faces a series of challenging problems. Traditional processor architectures have made some progress, but they suffer from low data reuse and a poor energy efficiency ratio caused by the way data is communicated directly between arithmetic units. To alleviate these problems, researchers have in recent years designed spatial processor architectures based on array parallelism; combined with a suitable dataflow strategy, such architectures can significantly improve the data reuse and operation speed of neural network algorithms.
Convolution is the most basic operation in a neural network algorithm, and today's deep convolutional neural networks generally require convolution operations with a huge amount of computation. A convolution is a tensor operation described by a mathematical expression of the standard form

O[m][x][y] = Σ_c Σ_i Σ_j I[c][S·x + i][S·y + j] × W[m][c][i][j]

where I is the input feature map, W[m] is the weight tensor of convolution kernel m, S is the stride, and the sums run over the input channels c and the kernel window positions (i, j).
The key to its realization is the multiply-accumulate of the weights of multiple convolution kernels with the values of the input feature map. If the expression is solved directly as written, then as the complexity of neural network algorithms and the amount of computed data grow, the direct method must frequently read and write data from external storage, which greatly lowers the energy efficiency ratio of the system. The alternative is to adopt a suitable dataflow strategy that keeps one data type stationary and reduces the number of data reads and writes. An adapted dataflow can select the appropriate level of the storage hierarchy for each access and thereby minimize the energy consumed by memory accesses. An arithmetic unit array matched to the dataflow strategy is a common hardware implementation; it lends itself to the implementation of input and output buses and so greatly improves data transmission efficiency. Common dataflow strategies include Weight Stationary (WS), Output Stationary (OS), and Row Stationary (RS). A weight-stationary dataflow improves the reuse of weights only, and an output-stationary dataflow reduces the reads and writes of intermediate results only, whereas a row-stationary dataflow both improves the reuse of all three data types and reduces their read and write counts. Moreover, with a row-stationary dataflow, several convolution kernels can be brought into an arithmetic unit for "super-two-dimensional" operation, further improving data reuse and reducing the write-back frequency of intermediate results. This design adopts a row-stationary dataflow strategy and uses a two-dimensional arithmetic unit array structure to implement the neural network algorithm.
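To make the row-stationary mapping concrete, the following Python sketch (illustrative only, not part of the original disclosure; all names are invented) shows how a single arithmetic unit computes a one-dimensional convolution of one filter row over one input row, and how a vertical column of such units sums its partial results into one row of the two-dimensional output without writing intermediate results back to memory.

```python
# Illustrative sketch of the row-stationary (RS) mapping described above.
# Each arithmetic unit holds one filter row and one input row ("row fixed"),
# computes a 1D convolution, and a vertical column of units accumulates the
# partial sums into one output row. All names are illustrative, not taken
# from the patent.

def conv1d_row(input_row, filter_row, stride=1):
    """One arithmetic unit: 1D convolution of a filter row over an input row."""
    k = len(filter_row)
    out_w = (len(input_row) - k) // stride + 1
    return [sum(input_row[x * stride + i] * filter_row[i] for i in range(k))
            for x in range(out_w)]

def conv2d_output_row(input_rows, filter_rows, stride=1):
    """A column of R units (R = kernel height): unit j convolves filter row j
    with input row j; the partial sums are accumulated upward to yield one
    output row, so no intermediate result is written back to memory."""
    partial_sums = [conv1d_row(ir, fr, stride)
                    for ir, fr in zip(input_rows, filter_rows)]
    return [sum(col) for col in zip(*partial_sums)]

# Example: 3x3 kernel, input rows 0..2 produce output row 0.
inp = [[1, 2, 3, 4, 5], [0, 1, 0, 1, 0], [2, 2, 2, 2, 2]]
flt = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
print(conv2d_output_row(inp, flt))  # one row of the 2D convolution: [-2, -2, -2]
```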
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide an arithmetic unit array structure for neural network processing that adopts a row-stationary dataflow strategy and a two-dimensional arithmetic unit array arrangement and can maximize the degree of data reuse.
the invention provides an arithmetic unit array structure facing neural network processing, wherein: the system comprises an arithmetic unit module and a local bus module, wherein the input end of the local bus module is connected with an ID value;
the operation unit module is divided into a top operation unit module, a middle operation unit module and a bottom operation unit module, and the local bus module is positioned in the vertical direction of the operation unit module; the operation unit module consists of a state machine, a register file and a multiply-accumulate unit module, wherein the register file comprises an input excitation register file, a weight register file and an intermediate result register file, the data request input end of the state machine is connected with the input end of the register file, and the register file and the multiply-accumulate unit are in bidirectional interaction;
the operation unit module is the most basic unit for completing convolution operation, is responsible for receiving input data from the input local bus module, completes one-dimensional convolution operation, sends an intermediate result to the local bus module for upward transmission and completes accumulation of the intermediate result according to different array positions of the operation unit module, and finally obtains output excitation;
the local bus module is responsible for unidirectional interaction of intermediate results between adjacent operation units in the vertical direction and calculates the spatial position of the operation units according to the ID values of the adjacent operation unit modules;
the local bus module obtains a group of enabling signals according to the ID value of the operation unit module in the vertical direction, and feeds the enabling signals back to the operation unit module connected with the local bus module, so that the spatial position of each operation unit module is calculated. According to the enabling signal, the operation unit module reads input data and carries out a convolution operation unit to obtain an intermediate result of one-dimensional convolution operation; then the operation unit module at the bottom transmits the middle result to the middle operation unit module along the local bus, the middle operation unit module accumulates the middle result and the input middle result, and transmits the accumulated middle result to the operation unit module at the top; the top arithmetic unit module accumulates the intermediate result of the top arithmetic unit module and the intermediate result from the storage system and the bottom arithmetic unit module to finally obtain the output excitation.
In the invention, each arithmetic unit module contains register files for the three data types, forming the lowest storage level of the processor architecture. Input data is held stationary in these registers, which raises data reuse and reduces the number of input data reads and writes, thereby lowering the power consumed by memory accesses. The intermediate results of the multiply-accumulate operations are also buffered there, reducing their read and write counts and further improving the energy efficiency ratio of the system.
The beneficial effects of the invention are: the design adopts a row-stationary dataflow strategy, which avoids write-back of intermediate results during the two-dimensional convolution, reduces the number of intermediate-result reads and writes, and achieves a high energy efficiency ratio. At the same time, the register file storage hierarchy introduced in the arithmetic unit implementation not only keeps the function correct but further raises data reuse and lowers the write-back frequency of intermediate results, improving the system's energy efficiency ratio even more.
Drawings
Fig. 1 is a basic block diagram of an arithmetic unit array structure.
Fig. 2 is a block diagram of the arithmetic unit.
Fig. 3 is a layout diagram of the spatial positions of the arithmetic units.
Fig. 4 is a schematic diagram of the window-sliding data rearrangement inside an arithmetic unit, where: (a) is the input excitation register of the arithmetic unit, and (b) is the shift operation on the input excitation register.
Reference numbers in the figures: 1 is the arithmetic unit module, 2 is the local bus module, 3 is the register file, 4 is the input excitation register file, 5 is the weight register file, 6 is the intermediate result register file, 7 is the state machine, 8 is the multiply-accumulate unit module, 9 is the top arithmetic unit module, 10 is the middle arithmetic unit module, and 11 is the bottom arithmetic unit module.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings.
Example 1: the basic block diagram of the arithmetic unit array structure is shown in Fig. 1. The working process of the design is as follows. First, the local bus module 2 derives a group of enable signals from the ID values of the arithmetic unit modules 1 in the vertical direction and feeds them back to the arithmetic unit modules 1 connected to it, thereby computing the spatial position of each arithmetic unit module 1. Driven by the enable signals, each arithmetic unit module 1 reads its input data and performs the convolution, obtaining the intermediate result of a one-dimensional convolution; the bottom arithmetic unit module 11 then passes its intermediate result along the local bus to the middle arithmetic unit module 10, which accumulates it with its own intermediate result and passes the sum on to the top arithmetic unit module 9; the top arithmetic unit module 9 accumulates its own intermediate result with those from the storage system and from the units below, finally obtaining the output excitation.
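The partial-sum flow of Example 1 can be summarized in a short sketch (illustrative only; names invented): the bottom unit starts the accumulation chain, middle units add and forward, and the top unit folds in the channel-direction intermediate result from the storage system before emitting the output excitation.

```python
# Sketch of the vertical partial-sum flow of Example 1 (names illustrative):
# the bottom unit sends its 1D-convolution result up the local bus, middle
# units add their own result and forward, and the top unit additionally adds
# the intermediate result fetched from the storage system (channel direction)
# before producing the output excitation.

def column_accumulate(unit_psums, psum_from_memory):
    """unit_psums[0] is the bottom unit's 1D result, unit_psums[-1] the top's.
    Each entry is a list of partial sums for one output row."""
    acc = unit_psums[0]                          # bottom: starts the chain
    for psum in unit_psums[1:]:                  # middle/top: accumulate
        acc = [a + p for a, p in zip(acc, psum)]
    # the top unit also folds in the channel-direction psum from memory
    return [a + m for a, m in zip(acc, psum_from_memory)]

out = column_accumulate([[1, 2], [3, 4], [5, 6]], psum_from_memory=[10, 10])
print(out)  # [19, 22] -> output excitations for one row
```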
The structure of the arithmetic unit module 1 is shown in Fig. 2. The arithmetic unit module 1 has six FIFO interfaces: the read FIFOs correspond to the input excitation, the weights, the intermediate result in the channel direction, and the intermediate result from the bottom arithmetic unit module 11, while the write FIFOs correspond to the intermediate result transmitted upward and the intermediate result written back to the storage system. Together these FIFOs form the data transmission channel between the arithmetic unit module 1 and the local bus module 2. The enable signals from the local bus module 2 control the reading and writing of the intermediate result FIFOs by the arithmetic unit module 1, and different combinations of enable signals realize the relative position of the arithmetic unit module 1 within the spatial array of the software model; Fig. 3 lists the combinations of enable signals and the corresponding spatial positions of the arithmetic units. Note that any enable combination other than those listed in Fig. 3 is illegal. Register files 3 for the three data types are introduced into the arithmetic unit module 1 to form the lowest level of the storage system; they raise data reuse, reduce the read and write counts of higher storage levels, and lower the power overhead of data accesses. Table 1 lists the data bit widths and depths of the three register files inside the convolution arithmetic unit. The depth of the input excitation register file 4 is 12, equal to the height of the convolution arithmetic array, that is, the maximum convolution kernel size and the maximum window width of each convolution; it can be enlarged according to design requirements. The depth of the weight register file 5 is 224; this comparatively large value is needed because several convolution kernels must be stored simultaneously to support the super-two-dimensional operation inside the arithmetic unit. The depth of the intermediate result register file 6 is 24, each entry representing an output excitation in a different channel direction; this value limits the maximum number of convolution kernels that can be held stationary inside the one-dimensional convolution unit. The data transmission paths and the internal storage hierarchy of the convolution arithmetic unit provide the data guarantees for the operation. The internal workflow is driven by the state machine 7, whose transitions complete the one-dimensional convolution and the data reads in sequence, where: IDLE is the idle state; SET_CONFIG reads the configuration parameters, including the line width of the input excitation, the line width of the convolution kernel, the horizontal stride of the convolution, and the number of convolution kernels for super-two-dimensional operation; READ_FILTER reads the corresponding number of weights from the FIFO into the internal register file according to the configuration parameters; READ_WINDOW_DATA is responsible for reading input excitations: on the first entry into this state, a full convolution-kernel line width of input excitations is read, and on subsequent entries a number of input excitations determined by the stride parameter is read and placed into the internal register file. Note that the data read as the window slides by the stride overwrites the data discarded by the slide, increasing the reuse of input excitations; but the slide changes the relative positions of the input excitations in the register file, so a shifter is introduced inside the register to rearrange the window data back to the correct positions. The rearrangement is shown in Fig. 4 (a small sketch of it follows this paragraph). The DATA & ALU state sends data from the internal registers into the arithmetic unit for the multiply-accumulate operations. According to the classification of the arithmetic unit modules 1, the bottom arithmetic unit module 11 writes its intermediate result back to the FIFO for upward transmission, while the middle arithmetic unit modules 10 and the top arithmetic unit modules 9 write the result temporarily back to the register file 3. PSUM_READ_MEM & ACCUM_PSUM_MEM accumulates the intermediate results along the channel direction inside the arithmetic unit; this state is entered only when the unit is at the top, and the jump is illegal for the other unit types.
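The window-sliding rearrangement of Fig. 4 can be pictured with the following sketch (an assumed rendering of the described behavior, not code from the disclosure): with stride S, the S oldest entries of the input excitation register file are discarded, the survivors are shifted toward the front so their relative positions remain correct, and S freshly read excitations fill the tail.

```python
# Sketch of the window-sliding rearrangement of Fig. 4 (assumed behavior):
# with stride S, the S oldest entries of the input excitation register file
# are dropped, surviving entries are shifted to the front so their relative
# positions stay correct, and S newly read excitations fill the tail.

def slide_window(reg_file, new_inputs, stride):
    assert len(new_inputs) == stride
    shifted = reg_file[stride:]          # shifter: move survivors forward
    return shifted + new_inputs          # fresh data appended at the tail

window = [10, 11, 12]                    # one kernel width of excitations
window = slide_window(window, [13], stride=1)
print(window)                            # [11, 12, 13]
window = slide_window(window, [14, 15], stride=2)
print(window)                            # [13, 14, 15]
```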
TABLE 1
Register file                          Data bit width   Depth
Input excitation register file (4)     INT8             12
Weight register file (5)               INT8             224
Intermediate result register file (6)  INT32            24
PSUM_READ_LOCAL & ACCUM_PSUM_LOCAL reads an intermediate result from the FIFO and accumulates it; this state is entered only on the top and middle arithmetic units. FINISH marks the end of the current operation and resets the relevant registers.
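Putting the states together, the sketch below models the state machine flow described above, including the position-dependent jumps (PSUM_READ_LOCAL only on top and middle units, PSUM_READ_MEM only at the top). Transition details beyond those stated in the description, such as when the loop back to READ_WINDOW_DATA ends, are assumptions made for illustration.

```python
# Hedged sketch of the arithmetic unit's state machine flow described above.
# State names follow the description; transitions beyond those stated
# (e.g., loop termination) are assumptions for illustration.

from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    SET_CONFIG = auto()          # read line widths, stride, kernel count
    READ_FILTER = auto()         # load weights into the weight register file
    READ_WINDOW_DATA = auto()    # load/slide the input excitation window
    DATA_ALU = auto()            # multiply-accumulate in the MAC unit
    PSUM_READ_LOCAL = auto()     # top/middle only: take psum from the FIFO
    PSUM_READ_MEM = auto()       # top only: accumulate channel-direction psum
    FINISH = auto()              # reset the relevant registers

def next_state(state, position, window_done):
    if state is State.IDLE:
        return State.SET_CONFIG
    if state is State.SET_CONFIG:
        return State.READ_FILTER
    if state is State.READ_FILTER:
        return State.READ_WINDOW_DATA
    if state is State.READ_WINDOW_DATA:
        return State.DATA_ALU
    if state is State.DATA_ALU:
        if position in ("TOP", "MIDDLE"):
            return State.PSUM_READ_LOCAL         # illegal for BOTTOM units
        return State.FINISH if window_done else State.READ_WINDOW_DATA
    if state is State.PSUM_READ_LOCAL:
        if position == "TOP":
            return State.PSUM_READ_MEM           # illegal for MIDDLE units
        return State.FINISH if window_done else State.READ_WINDOW_DATA
    if state is State.PSUM_READ_MEM:
        return State.FINISH if window_done else State.READ_WINDOW_DATA
    return State.IDLE

print(next_state(State.DATA_ALU, "BOTTOM", window_done=True))  # State.FINISH
```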

Claims (1)

1. An arithmetic unit array structure for neural network processing, characterized in that: the structure comprises arithmetic unit modules and local bus modules, and the input end of each local bus module is connected to an ID value;
the arithmetic unit modules are divided into top, middle, and bottom arithmetic unit modules, and the local bus modules run in the vertical direction of the arithmetic unit modules; each arithmetic unit module consists of a state machine, a register file, and a multiply-accumulate unit module, where the register file comprises an input excitation register file, a weight register file, and an intermediate result register file; the data request input of the state machine is connected to the input of the register file, and the register file and the multiply-accumulate unit interact bidirectionally;
the arithmetic unit module is the most basic unit for completing the convolution operation; it receives input data from the input local bus module, completes a one-dimensional convolution, and, according to its position in the array, sends its intermediate result to the local bus module for upward transmission or completes the accumulation of intermediate results, finally obtaining the output excitation;
the local bus module is responsible for the unidirectional exchange of intermediate results between vertically adjacent arithmetic units and computes the spatial position of each arithmetic unit from the ID values of the adjacent arithmetic unit modules;
the local bus module derives a group of enable signals from the ID values of the arithmetic unit modules in the vertical direction and feeds them back to the arithmetic unit modules connected to it, thereby computing the spatial position of each arithmetic unit module; driven by the enable signals, each arithmetic unit module reads its input data and performs the convolution, obtaining the intermediate result of a one-dimensional convolution; the bottom arithmetic unit module then passes its intermediate result along the local bus to the middle arithmetic unit module, which accumulates it with its own intermediate result and passes the sum on to the top arithmetic unit module; the top arithmetic unit module accumulates its own intermediate result with those from the storage system and from the units below, finally obtaining the output excitation.
CN202010728621.7A 2020-07-27 2020-07-27 Method for constructing operation unit array structure facing neural network processing Active CN111967587B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010728621.7A CN111967587B (en) 2020-07-27 2020-07-27 Method for constructing operation unit array structure facing neural network processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010728621.7A CN111967587B (en) 2020-07-27 2020-07-27 Method for constructing operation unit array structure facing neural network processing

Publications (2)

Publication Number Publication Date
CN111967587A true CN111967587A (en) 2020-11-20
CN111967587B CN111967587B (en) 2024-03-29

Family

ID=73362989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010728621.7A Active CN111967587B (en) 2020-07-27 2020-07-27 Method for constructing operation unit array structure facing neural network processing

Country Status (1)

Country Link
CN (1) CN111967587B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113422786A (en) * 2021-08-24 2021-09-21 机械科学研究总院江苏分院有限公司 Communication system and communication method based on Internet of things equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341544A (en) * 2017-06-30 2017-11-10 清华大学 A kind of reconfigurable accelerator and its implementation based on divisible array
US20180307976A1 (en) * 2017-04-19 2018-10-25 Beijing Deephi Intelligence Technology Co., Ltd. Device for implementing artificial neural network with separate computation units
CN109993297A (en) * 2019-04-02 2019-07-09 南京吉相传感成像技术研究院有限公司 A kind of the sparse convolution neural network accelerator and its accelerated method of load balancing
CN110390384A (en) * 2019-06-25 2019-10-29 东南大学 A kind of configurable general convolutional neural networks accelerator
CN110751280A (en) * 2019-09-19 2020-02-04 华中科技大学 Configurable convolution accelerator applied to convolutional neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180307976A1 (en) * 2017-04-19 2018-10-25 Beijing Deephi Intelligence Technology Co., Ltd. Device for implementing artificial neural network with separate computation units
CN107341544A (en) * 2017-06-30 2017-11-10 清华大学 A kind of reconfigurable accelerator and its implementation based on divisible array
CN109993297A (en) * 2019-04-02 2019-07-09 南京吉相传感成像技术研究院有限公司 A kind of the sparse convolution neural network accelerator and its accelerated method of load balancing
CN110390384A (en) * 2019-06-25 2019-10-29 东南大学 A kind of configurable general convolutional neural networks accelerator
CN110751280A (en) * 2019-09-19 2020-02-04 华中科技大学 Configurable convolution accelerator applied to convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN Xiaodong et al.: "FPGA-based neural network controller and its applications", Proceedings of the 6th National Conference on Information Acquisition and Processing (3), 31 August 2008 (2008-08-31) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113422786A (en) * 2021-08-24 2021-09-21 机械科学研究总院江苏分院有限公司 Communication system and communication method based on Internet of things equipment

Also Published As

Publication number Publication date
CN111967587B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
Ma et al. Optimizing the convolution operation to accelerate deep neural networks on FPGA
CN110097174B (en) Method, system and device for realizing convolutional neural network based on FPGA and row output priority
JP7329533B2 (en) Method and accelerator apparatus for accelerating operations
Demmel et al. Avoiding communication in sparse matrix computations
JP7358382B2 (en) Accelerators and systems for accelerating calculations
Li et al. A high performance FPGA-based accelerator for large-scale convolutional neural networks
CN106940815A (en) A kind of programmable convolutional neural networks Crypto Coprocessor IP Core
CN107657581A (en) Convolutional neural network CNN hardware accelerator and acceleration method
CN109977347B (en) Reconfigurable FFT processor supporting multimode configuration
WO2020156508A1 (en) Method and device for operating on basis of chip with operation array, and chip
CN111898733A (en) Deep separable convolutional neural network accelerator architecture
US20230135185A1 (en) Pooling unit for deep learning acceleration
Liu et al. Toward full-stack acceleration of deep convolutional neural networks on FPGAs
Lu et al. CHIP-KNN: A configurable and high-performance k-nearest neighbors accelerator on cloud FPGAs
CN111144556B (en) Hardware circuit of range batch normalization algorithm for deep neural network training and reasoning
CN111967587B (en) Method for constructing operation unit array structure facing neural network processing
Andri et al. Going further with winograd convolutions: Tap-wise quantization for efficient inference on 4x4 tiles
CN113157638A (en) Low-power-consumption in-memory calculation processor and processing operation method
CN104869284A (en) High-efficiency FPGA implementation method and device for bilinear interpolation amplification algorithm
Sun et al. Efficient tensor cores support in tvm for low-latency deep learning
CN113312285B (en) Convolutional neural network accelerator and working method thereof
Zhao et al. A deep residual networks accelerator on FPGA
CN115170381A (en) Visual SLAM acceleration system and method based on deep learning
CN113642722A (en) Chip for convolution calculation, control method thereof and electronic device
Brown et al. Nemo-cnn: An efficient near-memory accelerator for convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant