CN111967587B - Method for constructing operation unit array structure facing neural network processing - Google Patents

Method for constructing operation unit array structure facing neural network processing

Info

Publication number
CN111967587B
CN111967587B (application CN202010728621.7A)
Authority
CN
China
Prior art keywords
operation unit
module
unit module
local bus
register file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010728621.7A
Other languages
Chinese (zh)
Other versions
CN111967587A (en)
Inventor
韩军
张权
张永亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University
Priority to CN202010728621.7A
Publication of CN111967587A
Application granted
Publication of CN111967587B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30098 - Register arrangements
    • G06F 9/3012 - Organisation of register space, e.g. banked or distributed register file
    • G06F 9/30134 - Register stacks; shift registers

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to a method for constructing an operation unit array structure facing neural network processing. The array consists of operation unit modules and local bus modules. A single operation unit module is responsible for completing a one-dimensional convolution; the local bus module transmits the intermediate results upward and accumulates them so that the two-dimensional convolution is completed, which reduces the write-back of intermediate results and improves the overall energy-efficiency ratio of the system. Several register files are provided inside each operation unit module, so that the super-two-dimensional convolution of several convolution kernels can be performed at the same time, further raising data reuse and reducing intermediate-result write-backs. The operation unit array is self-organizing: it receives control signals from the top level, the local bus module automatically computes from the IDs of adjacent operation units the spatial position each operation unit needs for the two-dimensional convolution, and the units then complete the reception, transmission and processing of data on their own, giving the array a degree of autonomy. The invention improves computational efficiency in neural network processing.

Description

Method for constructing operation unit array structure facing neural network processing
Technical Field
The invention belongs to the technical field of integrated circuit design, and particularly relates to a method for constructing an operation unit array structure facing neural network processing.
Background
Deep convolutional neural networks are applied successfully in important fields such as computer vision, speech recognition and robot control, but these applications also place ever higher demands on the accuracy and complexity of neural network algorithms, so implementing the algorithms faces a series of challenging problems. Although conventional processor architectures have advanced, the direct exchange of data between operation units still suffers from low data reuse and poor energy efficiency. To alleviate these problems, researchers have in recent years designed spatial processor architectures based on array parallelism; combined with a suitable dataflow strategy, they can markedly raise the data reuse and operation speed of neural network algorithms.
Convolution is the most basic operation in neural network algorithms, and for current deep convolutional neural networks it accounts for an enormous amount of computation. Convolution is a tensor operation described by a mathematical expression whose key step is multiplying the weights of several convolution kernels with the values of the input feature map and accumulating the products; the standard form of this expression is given below. If the expression is solved directly, then as the complexity of neural network algorithms and the amount of computed data grow, the direct method reads and writes data from external storage frequently and greatly lowers the energy-efficiency ratio of the system. The alternative is to adopt a suitable dataflow strategy that keeps one data type stationary and so reduces the number of data reads and writes. A well-chosen dataflow selects the appropriate storage level for each data access, minimizing the energy consumed by memory accesses. An operation unit array matched to the dataflow strategy is a common hardware implementation; it suits the realization of input and output buses and thus greatly improves data-transfer efficiency. The common dataflow strategies are weight stationary (WS), output stationary (OS) and row stationary (RS). Comparing the three: the weight-stationary strategy only improves the reuse of weights, and the output-stationary strategy only reduces the reads and writes of intermediate results, whereas the row-stationary strategy improves the reuse of all three data types while also reducing the number of data reads and writes. Moreover, a row-stationary dataflow can bring several convolution kernels into one operation unit for super-two-dimensional operation, further increasing data reuse and reducing intermediate-result write-backs. This design therefore proposes a two-dimensional operation unit array structure based on a row-stationary dataflow strategy to implement neural network algorithms.
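For reference, the multiply-accumulate expression referred to above can be written in its standard form (the formula itself is omitted in this text; the notation below is the conventional one, with input feature map $I$, kernel weights $W$, bias $B$, stride $S$, output excitation $O$, $C$ input channels and an $R \times R$ kernel):

$$O[m][x][y] = B[m] + \sum_{c=0}^{C-1}\sum_{i=0}^{R-1}\sum_{j=0}^{R-1} I[c][Sx+i][Sy+j]\cdot W[m][c][i][j]$$

Each one-dimensional convolution computed by a single operation unit corresponds to fixing one kernel row, i.e. one value of $i$ in the inner sums, with the column of units accumulating over $i$ to form the two-dimensional result.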
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a method for constructing an operation unit array structure facing neural network processing which adopts a row-stationary dataflow strategy and a two-dimensional arrangement of operation units, thereby maximizing data reuse and reducing the write-back of intermediate results.
the invention provides a method for constructing an operation unit array structure facing neural network processing, wherein: the device comprises an operation unit module and a local bus module, wherein the input end of the local bus module is connected with an ID value;
the operation unit module is divided into a top operation unit module, a middle operation unit module and a bottom operation unit module, and the local bus module is positioned in the vertical direction of the operation unit module; the operation unit module consists of a state machine, a register file and a multiplication and accumulation unit module, wherein the register file comprises an input excitation register file, a weight register file and an intermediate result register file, the data request input end of the state machine is connected with the input end of the register file, and the register file and the multiplication and accumulation unit are in bidirectional interaction;
the operation unit module is used for completing the most basic unit of convolution operation, is responsible for receiving input data from the input local bus module, completing one-dimensional convolution operation, sending an intermediate result to the local bus module for upward transmission and completing accumulation of the intermediate result according to different positions of the operation unit module array, and finally obtaining output excitation;
the local bus module is responsible for unidirectional interaction of intermediate results between adjacent operation units in the vertical direction and calculates the spatial position of the operation units according to the ID numerical values of the adjacent operation unit modules;
the local bus module obtains a group of enabling signals according to the ID values of the operation unit modules in the vertical direction and feeds back the enabling signals to the operation unit modules connected with the local bus module, so that the space position of each operation unit module is calculated. According to the enabling signal, the operation unit module reads the input data and carries out convolution operation to obtain an intermediate result of one-dimensional convolution operation; then the middle result is upwards transmitted to the middle operation unit module along the local bus by the bottom operation unit module, the middle operation unit module accumulates the middle result of the middle operation unit module and the input middle result, and the accumulated middle result is upwards transmitted to the top operation unit module; the top operation unit module accumulates the intermediate result with the intermediate result from the storage system and the bottom operation unit module to finally obtain output excitation.
In the invention, the operation unit module contains register files for the three data types, forming the lowest storage level of the processor architecture. Keeping the input data stationary in these registers raises data reuse and reduces the number of input-data reads and writes, thereby lowering the power consumed by memory access. The intermediate results of the multiply-accumulate operations are also buffered there, reducing intermediate-result reads and writes and further improving the energy-efficiency ratio of the system.
The beneficial effects of the invention are: the row-stationary dataflow strategy adopted in this design avoids the write-back of intermediate results during two-dimensional convolution, reducing the number of intermediate-result reads and writes and achieving a high energy-efficiency ratio. At the same time, the register-file storage level introduced in the implementation of the operation unit preserves functional correctness while further raising data reuse and reducing intermediate-result write-backs, further improving the energy-efficiency ratio of the system.
Drawings
Fig. 1 is a basic block diagram of an arithmetic unit array structure.
Fig. 2 is a block diagram of an arithmetic unit module.
FIG. 3 is a layout of the spatial locations of the arithmetic units.
Fig. 4 is a schematic diagram of the rearrangement of sliding-window data inside the operation unit. Wherein: (a) shows the input excitation register of the operation unit, and (b) shows the shift operation of the input excitation register.
Reference numerals in the drawings: 1 is an operation unit module, 2 is a local bus module, 3 is a register file, 4 is an input excitation register file, 5 is a weight register file, 6 is an intermediate result register file, 7 is a state machine, 8 is a multiplication accumulation unit module, 9 is a top operation unit module, 10 is an intermediate operation unit module, and 11 is a bottom operation unit module.
Description of the embodiments
The invention is further illustrated by the following examples in connection with the accompanying drawings.
Example 1: the basic block diagram of the operation unit array structure is shown in fig. 1. The design works as follows. First, the local bus module 2 derives a set of enable signals from the ID values of the operation unit modules 1 in the vertical direction and feeds them back to the operation unit modules 1 connected to it, thereby determining the spatial position of each operation unit module 1. Driven by the enable signals, the operation unit module 1 reads its input data and performs the convolution, obtaining the intermediate result of a one-dimensional convolution. The bottom operation unit module 11 then passes this intermediate result upward along the local bus to the middle operation unit module 10; the middle operation unit module 10 accumulates its own intermediate result with the incoming one and passes the sum upward to the top operation unit module 9; the top operation unit module 9 accumulates its own intermediate results with those from the storage system and the bottom operation unit module 11, finally obtaining the output excitation.
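The enable mechanism can be made concrete with a small model. The encoding here is our assumption only: the patent states that the enables control the reading and writing of the intermediate-result FIFOs and that only the combinations listed in fig. 3 (discussed next) are legal, but it does not disclose the bit patterns.

```python
# Hypothetical enable encoding, ours: (read_psum_from_below,
# send_psum_upward, write_back_to_memory). Per fig. 3, exactly one
# combination is legal for each spatial position.
LEGAL_ENABLES = {
    "bottom": (False, True, False),  # produces psums and sends them up
    "middle": (True, True, False),   # accumulates and forwards upward
    "top": (True, False, True),      # accumulates and writes back to memory
}

def position_from_enables(combo):
    """Return the spatial position encoded by an enable combination;
    any pattern outside the fig. 3 set is illegal and rejected."""
    for position, legal in LEGAL_ENABLES.items():
        if combo == legal:
            return position
    raise ValueError(f"illegal enable combination: {combo}")

print(position_from_enables((True, True, False)))  # -> middle
```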
The structure of the operation unit module 1 is shown in fig. 2 and comprises six FIFO interfaces: the read FIFOs correspond to the input excitation, the weights, the intermediate result in the channel direction and the intermediate result from the bottom operation unit module 11, while the write FIFOs correspond to the intermediate result transmitted upward and the intermediate result written back to the storage system. Together these FIFOs form the data-transfer channels between the operation unit module 1 and the local bus module 2. The enable signals of the local bus module 2 control the operation unit module 1's reading and writing of the intermediate-result FIFOs; by mixing and matching the enable signals, the relative position of the operation unit module 1 in the spatial array of the software model is realized, and fig. 3 lists the enable-signal combinations corresponding to the spatial positions of the operation units (cf. the sketch above). It should be noted that any enable combination other than the patterns of fig. 3 is illegal.
The register files 3 of the three data types introduced inside the operation unit module 1 form the bottommost level of the storage system; they raise data reuse, reduce the number of reads and writes to higher storage levels, and lower the power overhead of data access. Table 1 lists the data bit widths and depths of the three register files. The data depth of the input excitation register file 4 is 12, equal to the height of the convolution operation array, i.e. the maximum convolution kernel size; the maximum window width of each convolution can be extended according to design requirements. The depth of the weight register file 5 is 224; this larger value is needed because several convolution kernels must be stored simultaneously to support super-two-dimensional operation inside the operation unit. The data depth of the intermediate result register file 6 is 24, one entry per output excitation in a different channel direction; this value bounds the maximum number of convolution kernels that can be kept stationary inside the one-dimensional convolution operation unit. The data-transfer channels and the internal storage hierarchy of the convolution operation unit together guarantee the data supply for operation.
The internal workflow is governed by a state machine 7, whose transitions complete the one-dimensional convolution and the data reads in order, wherein: IDLE represents the idle state; SET_CONFIG reads the configuration parameters, namely the line width of the input excitation, the line width of the convolution kernel, the horizontal stride of the convolution and the number of convolution kernels used for super-two-dimensional operation; READ_FILTER reads the corresponding number of weights from the FIFO according to the configuration parameters and places them in the internal register file; READ_WINDOW_DATA is responsible for reading the input excitation: on the first entry into this state, input excitations amounting to the convolution-kernel line width must be read, while on each subsequent entry only the number given by the stride parameter is read, and the newly read input excitations are placed in the internal register file.
It should be noted that the data read as the window slides by the stride parameter overwrite the data discarded by the slide, which raises the reuse of the input excitation; the slide, however, changes the relative positions of the input excitations within the register file, so a shifter is introduced at the registers to rearrange the window data into the correct positions. The rearrangement process is shown in fig. 4 and sketched below. DISTRIBUTE_DATA & ALU is responsible for feeding the data in the internal registers to the multiply-accumulate unit. Depending on the classification of the operation unit module 1, the bottom operation unit module 11 writes the computed intermediate results back into the FIFO for upward transmission, while the middle and top operation unit modules 10 and 9 temporarily write their results back into the register file 3. PSUM_READ_MEM & ACCUM_PSUM_MEM accumulates intermediate results in the channel direction inside the operation unit; this transition is taken only when the operation unit is located at the top, and for the other unit types the transition is illegal.
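The window update can be sketched in a few lines. This is our behavioral model, assuming the register file behaves as a shift register of kernel-line-width depth (consistent with fig. 4 and Table 1):

```python
def slide_window(window, new_values, stride):
    """window holds the current contents of the input excitation register
    file, oldest entry first; new_values are the `stride` freshly read
    excitations. The shifter drops the `stride` oldest entries, shifts the
    surviving data forward, and appends the new data at the rear, which
    restores every excitation to its correct relative position."""
    assert len(new_values) == stride
    return window[stride:] + new_values

# A kernel line width of 4 with horizontal stride 2: the first entry into
# READ_WINDOW_DATA fills the whole window; later entries reuse two old
# values and read only two new ones.
window = [1, 2, 3, 4]
window = slide_window(window, [5, 6], stride=2)
print(window)  # -> [3, 4, 5, 6]
```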
TABLE 1
Register file                          Data bit width    Depth
Input excitation register file (4)     INT8              12
Weight register file (5)               INT8              224
Intermediate result register file (6)  INT32             24
PSUM_READ_LOCAL & ACCUM_PSUM_LOCAL reads intermediate results from the FIFO for accumulation; this transition is taken only at the top and middle operation units. FINISH marks the end of the current operation and resets the relevant registers. The full state sequence is sketched below.
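To summarize the workflow of state machine 7, the following Python sketch (an illustrative model of ours, not the patent's RTL; state names are condensed from the text) steps through the states in the order described, with the position-dependent transitions guarded:

```python
from enum import Enum, auto

class S(Enum):
    IDLE = auto()
    SET_CONFIG = auto()
    READ_FILTER = auto()
    READ_WINDOW_DATA = auto()
    DISTRIBUTE_DATA_ALU = auto()
    PSUM_READ_MEM_ACCUM = auto()    # legal only for top units
    PSUM_READ_LOCAL_ACCUM = auto()  # legal only for top and middle units
    FINISH = auto()

def next_state(state: S, position: str) -> S:
    """position is 'top', 'middle' or 'bottom', as derived from the
    local bus enables; transitions follow the order in the text."""
    if state is S.IDLE:
        return S.SET_CONFIG                  # read configuration parameters
    if state is S.SET_CONFIG:
        return S.READ_FILTER                 # load weights into register file
    if state is S.READ_FILTER:
        return S.READ_WINDOW_DATA            # load input excitation window
    if state is S.READ_WINDOW_DATA:
        return S.DISTRIBUTE_DATA_ALU         # multiply-accumulate
    if state is S.DISTRIBUTE_DATA_ALU:
        if position == "top":
            return S.PSUM_READ_MEM_ACCUM     # add channel-direction psums
        if position == "middle":
            return S.PSUM_READ_LOCAL_ACCUM   # add psums from the unit below
        return S.FINISH                      # bottom: psum leaves via FIFO
    if state is S.PSUM_READ_MEM_ACCUM:
        return S.PSUM_READ_LOCAL_ACCUM       # top also adds the local psum
    return S.FINISH                          # reset the relevant registers

# Walk a top unit through one complete pass:
s = S.IDLE
while s is not S.FINISH:
    s = next_state(s, "top")
    print(s.name)
```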

Claims (1)

1. A method for constructing an operation unit array structure facing neural network processing, characterized in that: the structure comprises operation unit modules and local bus modules, and the input end of each local bus module is connected with an ID value;
the operation unit modules are divided into top, middle and bottom operation unit modules, and the local bus modules lie along the vertical direction of the operation unit modules; each operation unit module consists of a state machine, a register file and a multiply-accumulate unit module, the register file comprising an input excitation register file, a weight register file and an intermediate result register file; the data-request end of the state machine is connected to the input end of the register file, and the register file and the multiply-accumulate unit interact bidirectionally;
the operation unit module is the most basic unit for performing convolution: it receives input data from the input local bus module, completes a one-dimensional convolution, sends its intermediate result to the local bus module for upward transmission and, according to its position in the operation unit module array, accumulates intermediate results until the output excitation is finally obtained;
the local bus module is responsible for the unidirectional transfer of intermediate results between vertically adjacent operation units and computes the spatial position of each operation unit from the ID values of the adjacent operation unit modules;
the local bus module derives a group of enable signals from the ID values of the operation unit modules in the vertical direction and feeds them back to the operation unit modules connected to it, thereby determining the spatial position of each operation unit module; driven by the enable signals, the operation unit module reads its input data and performs the convolution, obtaining the intermediate result of a one-dimensional convolution; the bottom operation unit module then passes this intermediate result upward along the local bus to the middle operation unit module; the middle operation unit module accumulates its own intermediate result with the incoming one and passes the sum upward to the top operation unit module; the top operation unit module accumulates its own intermediate result with those from the storage system and from the units below, finally obtaining the output excitation.
CN202010728621.7A 2020-07-27 2020-07-27 Method for constructing operation unit array structure facing neural network processing Active CN111967587B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010728621.7A CN111967587B (en) 2020-07-27 2020-07-27 Method for constructing operation unit array structure facing neural network processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010728621.7A CN111967587B (en) 2020-07-27 2020-07-27 Method for constructing operation unit array structure facing neural network processing

Publications (2)

Publication Number Publication Date
CN111967587A (en) 2020-11-20
CN111967587B (en) 2024-03-29

Family

ID=73362989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010728621.7A Active CN111967587B (en) 2020-07-27 2020-07-27 Method for constructing operation unit array structure facing neural network processing

Country Status (1)

Country Link
CN (1) CN111967587B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113422786B (en) * 2021-08-24 2021-11-30 机械科学研究总院江苏分院有限公司 Communication system and communication method based on Internet of things equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679621B (en) * 2017-04-19 2020-12-08 赛灵思公司 Artificial neural network processing device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341544A * 2017-06-30 2017-11-10 清华大学 Reconfigurable accelerator based on a divisible array and implementation method thereof
CN109993297A * 2019-04-02 2019-07-09 南京吉相传感成像技术研究院有限公司 Load-balanced sparse convolutional neural network accelerator and acceleration method thereof
CN110390384A * 2019-06-25 2019-10-29 东南大学 Configurable general convolutional neural network accelerator
CN110751280A * 2019-09-19 2020-02-04 华中科技大学 Configurable convolution accelerator applied to convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FPGA-based neural network controller and its application; Chen Xiaodong et al.; Proceedings of the 6th National Conference on Information Acquisition and Processing (3); 2008-08-31; full text *

Also Published As

Publication number Publication date
CN111967587A (en) 2020-11-20

Similar Documents

Publication Publication Date Title
CN110097174B (en) Method, system and device for realizing convolutional neural network based on FPGA and row output priority
CN106940815A Programmable convolutional neural network crypto coprocessor IP core
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN109086244A (en) Matrix convolution vectorization implementation method based on vector processor
CN109977347B (en) Reconfigurable FFT processor supporting multimode configuration
CN104391820A (en) Universal floating point matrix processor hardware structure based on FPGA (field programmable gate array)
US20230135185A1 (en) Pooling unit for deep learning acceleration
CN111967587B (en) Method for constructing operation unit array structure facing neural network processing
Wang et al. A low-latency sparse-winograd accelerator for convolutional neural networks
CN111144556B (en) Hardware circuit of range batch normalization algorithm for deep neural network training and reasoning
Liu et al. WinoCNN: Kernel sharing Winograd systolic array for efficient convolutional neural network acceleration on FPGAs
Liu et al. Toward full-stack acceleration of deep convolutional neural networks on FPGAs
Liu et al. Memory-efficient architecture for accelerating generative networks on FPGA
CN113220630A (en) Reconfigurable array optimization method and automatic tuning method of hardware accelerator
US8214818B2 (en) Method and apparatus to achieve maximum outer level parallelism of a loop
CN114356836A (en) RISC-V based three-dimensional interconnected many-core processor architecture and working method thereof
Andri et al. Going further with winograd convolutions: Tap-wise quantization for efficient inference on 4x4 tiles
CN116451752A (en) Deep neural network hardware accelerator device
CN109447257B (en) Operation device of deep neural network acceleration chip with self-organized channels
CN113157638A (en) Low-power-consumption in-memory calculation processor and processing operation method
CN110555793B (en) Efficient deep convolution implementation method and visual processing method comprising same
CN113312285B (en) Convolutional neural network accelerator and working method thereof
CN115170381A (en) Visual SLAM acceleration system and method based on deep learning
CN113642722A (en) Chip for convolution calculation, control method thereof and electronic device
Zhao et al. A deep residual networks accelerator on FPGA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant