CN112766479B - Neural network accelerator supporting channel separation convolution based on FPGA - Google Patents

Neural network accelerator supporting channel separation convolution based on FPGA

Info

Publication number
CN112766479B
CN112766479B CN202110100516.3A
Authority
CN
China
Prior art keywords
ormu
convolution
neural network
cluster
fpga
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110100516.3A
Other languages
Chinese (zh)
Other versions
CN112766479A (en)
Inventor
陆生礼
苏晶晶
庞伟
刘昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202110100516.3A priority Critical patent/CN112766479B/en
Publication of CN112766479A publication Critical patent/CN112766479A/en
Application granted granted Critical
Publication of CN112766479B publication Critical patent/CN112766479B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483 Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/485 Adding; Subtracting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Nonlinear Science (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Advance Control (AREA)
  • Logic Circuits (AREA)

Abstract

The invention discloses an FPGA-based neural network accelerator supporting channel separation convolution, comprising a Ping-Pong register file, an ORMU (Output feature map Row Mapping Unit) array with configurable dataflow, a functional unit module and a memory interface module, wherein the ORMU array maps output feature values under the configurable dataflow. The Ping-Pong register file receives configuration and control words from the control processor and sends an interrupt signal after the computation is completed; the ORMU array interconnects the ORMU units and the buffers through a configurable network-on-chip so as to serve neural networks with different data-bandwidth requirements; the functional unit module implements pooling, ReLU activation, batch normalization (BN) and similar functions; the memory interface module transfers the weights and feature values. Through a flexible, hierarchical mesh network-on-chip, the invention supports the differing data-bandwidth requirements of channel separation convolution (channel-by-channel and point-by-point convolution), traditional convolution and fully connected layers, thereby keeping the utilization of the computing units high and greatly improving inference/computation speed.

Description

Neural network accelerator supporting channel separation convolution based on FPGA
Technical Field
The invention relates to the hardware architecture of an FPGA (Field-Programmable Gate Array)-based neural network accelerator supporting channel separation convolution, and belongs to the technical fields of electronic information and deep learning.
Background
In recent years, owing to the explosive growth of usable data (text, video, audio, etc.) and advances in semiconductor technology, deep learning has developed rapidly and achieved great success in fields such as machine vision and natural language processing. Because a deep network has a multi-layer nonlinear structure, it offers strong feature-expression and modeling capability for complex tasks, but this also brings a huge number of parameters and heavy computation. Although today's servers, with powerful computing capability and mass storage, can easily perform inference for even the most complex convolutional neural networks, in most practical applications the forward inference of a convolutional neural network must be executed on terminals with limited resources and power budget in order to reduce latency and security risks. Examples include autonomous driving, drone navigation and robotics.
To meet the requirements of practical applications, extending convolutional neural networks to embedded terminals has become an important trend in their recent development, with the goals of reducing model size and improving hardware processing efficiency. In this exploration, many innovative techniques have been proposed, including weight and feature-value quantization, weight pruning, and the replacement of traditional convolution with channel separation convolution, which make the structure of the convolutional neural network very compact and make the feature values and weights sparser.
Nevertheless, these algorithmic optimizations only reduce computation and memory cost in theory, and most conventional convolutional neural network accelerators today do not translate this theoretical benefit well into practical gains in energy efficiency and processing speed. The irregularity of the network structure and the sparsity of the data can greatly reduce the temporal and spatial utilization of the MAC (multiply-accumulate) units of a neural network hardware accelerator, which in turn degrades performance.
Among the above-mentioned methods, DW-CNN (Depth-Wise Convolutional Neural Networks) and PW-CNN (Point-Wise Convolutional Neural Networks), which replace standard convolution with channel separation convolution, are widely used in various lightweight neural networks to greatly reduce the number of parameters and the computational complexity.
Based on the above analysis, designing a flexible and efficient accelerator that supports channel separation convolution gives the accelerator inherent advantages in energy efficiency and processing speed.
Disclosure of Invention
Technical problem: The invention aims to provide an FPGA (Field-Programmable Gate Array)-based neural network accelerator supporting channel separation convolution. By supporting channel separation convolution, the large reduction in parameters and computation offered by compact convolutional neural networks is fully exploited, improving energy efficiency and processing speed. To this end, a flexibly configurable network-on-chip is adopted to accommodate the changing bandwidth requirements that different network structures place on the computing units, and an output-feature-value row-stationary dataflow is adopted to fully exploit data reuse and improve energy efficiency.
Technical solution: The FPGA (Field-Programmable Gate Array)-based neural network accelerator supporting channel separation convolution according to the invention comprises a Ping-Pong register file module, an ORMU array with configurable dataflow, a functional unit module and a memory interface module;
the Ping-Pong register file module receives configuration information and control commands (such as start-computation) from an external control processor through a configuration bus, configures the dataflow and controls the computation process according to this configuration and control information, and at the same time sends the status information of each unit of the accelerator and a computation-complete interrupt signal to the external controller;
the ORMU array with configurable dataflow contains 4 independent ORMU array slices, each with its own configurable dataflow; each slice interconnects the on-chip memory and the ORMU computing units through a configurable mesh network-on-chip, so as to serve neural networks with different data-bandwidth requirements and complete the convolution computation;
the functional unit module receives the output feature values computed by the ORMU array, performs bias addition, normalization, activation and pooling on them, and finally outputs the computation result of the neural network;
the memory interface module is used for reading the input feature values and weights stored in the external memory and for writing the output feature values back to the external memory.
Wherein:
the Ping-Pong register file module comprises a configuration register group, a command register group and a state register group, wherein the configuration register group stores basic parameters of a convolutional neural network and data stream configuration information of a computing processing unit array, ping-Pong operation is adopted at the same time, namely two groups of configuration register groups are adopted, when a computing unit adopts the configuration information of a first group, a CPU (central processing unit) can configure parameters of a next layer through the configuration register group of a second group, and the mechanism realizes the calculation switching of an accelerator at different layers and hides the reconfiguration time of the CPU.
The ORMU array with configurable dataflow comprises a main router cluster, sub-router clusters, a global buffer cluster and ORMU clusters, wherein the global buffer cluster is used to store input feature values and computed partial sums; the sub-router clusters and the main router cluster interconnect the grouped global buffers and the ORMUs, and the network-on-chip is configured into different dataflow modes according to the data-reuse opportunities and data-bandwidth requirements of different neural networks; the ORMU cluster completes the row mapping of the output feature values in an output-feature-value row-stationary manner, improving data reuse as much as possible and completing the convolution computation efficiently.
The global buffer cluster comprises 3 input buffer sub-regions and 4 partial-sum buffer sub-regions. The input buffer sub-regions store input feature values from the external memory, and the partial-sum buffers hold the partial sums generated by the ORMU clusters during convolution; each buffer sub-region in the global buffer cluster is individually interconnected with the corresponding sub-router in the router cluster.
The main router cluster and each sub-router cluster comprise 3 input routers, 3 weight routers and 4 partial-sum routers, which correspond respectively to the input feature value buffers, the weight buffers and the partial-sum buffers; the weight routers are directly interconnected with the external memory; each router is interconnected with all ORMUs in the ORMU cluster.
The main router cluster and the sub-router clusters are configured into different dataflow modes, namely unicast, group multicast, cross multicast and broadcast, according to the structure of the convolutional neural network being computed.
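The four modes can be pictured as different fan-out patterns from a router to the ORMUs of a cluster, as in the following illustrative sketch (the exact grouping functions are assumptions; only the mode names follow the patent):

    # Hypothetical model of how a router fans data out to an ORMU cluster,
    # depending on the configured dataflow mode.
    def route(mode, data, num_ormus, group_size=2):
        """Return, for each ORMU index, the data item it receives."""
        if mode == "unicast":            # one distinct item per ORMU
            return {i: data[i] for i in range(num_ormus)}
        if mode == "group_multicast":    # ORMUs in the same group share one item
            return {i: data[i // group_size] for i in range(num_ormus)}
        if mode == "cross_multicast":    # ORMUs with the same offset across groups share one item
            return {i: data[i % group_size] for i in range(num_ormus)}
        if mode == "broadcast":          # all ORMUs receive the same item
            return {i: data[0] for i in range(num_ormus)}
        raise ValueError(mode)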
The ORMU cluster is composed of 1*4 ORMU units, where each ORMU unit comprises 3 input register stacks (Iact Scratch Pads), 1 partial-sum register stack (Psum Scratch Pad), 3 SRAM weight buffers (Weight Scratch Pads) and 3 multiply-add units.
The ORMU unit supports the mapping and computation of traditional convolution, channel separation convolution and fully connected layers.
The functional unit module comprises 4 functional slices, respectively corresponding to the 4 configurable-dataflow ORMU array slices, for implementing pooling, ReLU activation and batch normalization on their outputs.
The memory interface module includes three DMAs: DMA_IFM for reading input feature values, DMA_WT for reading weights, and DMA_OFM for writing output feature values to the external memory.
Beneficial effects: with the configurable-dataflow network-on-chip, different network structures including traditional convolution, channel separation convolution and fully connected layers are supported flexibly and efficiently while the utilization of the computing units is kept high; a row-stationary data-reuse scheme is adopted, data reuse is fully exploited through local registers and caches, accesses to the external memory are reduced, and energy efficiency is improved.
Description of the drawings:
Figure 1 is a system architecture diagram of the present invention;
Figure 2 shows the network-on-chip that transmits input feature values;
Figure 3 shows the network-on-chip that transmits weights;
Figure 4 shows the network-on-chip that transmits partial sums;
Figure 5 is a block diagram of the ORMU unit.
Description of the symbols:
FPGA: field-Programmable Gate Array
Ping-Pong register: ping-pong register
ORMU array: output Feature Map Row Mapping Unit, output Feature value Row Mapping Unit
SRAM: static random access Memory
DMA: direct Memory Access (DMA)
DMA _ IFM: DMAinput Feature Map, DMA for transferring input Feature values
DMA _ WT: DMAweight, DMA for transfer weights
DMA _ OFM: DMAoutput Feature Map, DMA for transferring output Feature values
Detailed Description
The technical solution and the advantages of the present invention will be described in detail with reference to the accompanying drawings.
As shown in Fig. 1, the four convolution types listed in Table 1 are used as examples to describe in detail how the convolutional neural network accelerator hardware designed according to the present invention operates.
The external control processor first writes the parameters of the current layer, such as the input feature map size, the number of channels, whether padding is applied and the convolution mode (fully connected, channel separation convolution or traditional convolution), together with the network-on-chip dataflow configuration, into the corresponding accelerator registers through the configuration bus. It then controls the DMAs to write the input feature values and the weights into the corresponding input buffer sub-regions and the weight buffers inside the ORMU units, respectively. After the computation is finished, the resulting output features are written to the functional unit, where pooling, ReLU and similar operations are completed; an interrupt is then raised to the external controller and the result is written back to the external memory.
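The per-layer control sequence just described can be summarized in the following Python-style sketch (illustrative only; the driver object acc and its method names are assumptions, not an API defined by the patent):

    # Hypothetical per-layer control flow as seen from the external control
    # processor; acc stands for a driver whose methods mirror the configuration
    # bus and the three DMAs.
    def run_layer(acc, layer):
        acc.write_config(size=layer.ifm_size, channels=layer.channels,
                         padding=layer.padding, mode=layer.conv_mode,
                         noc_dataflow=layer.dataflow)   # via the configuration bus
        acc.dma_ifm.load(layer.ifm)       # input feature values -> input buffer sub-regions
        acc.dma_wt.load(layer.weights)    # weights -> ORMU weight buffers
        acc.start()                       # ORMU array computes partial sums
        acc.wait_interrupt()              # functional unit applies bias/BN/ReLU/pooling, raises interrupt
        acc.dma_ofm.store(layer.ofm)      # write output feature values back to external memory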
For traditional convolution, the computation of the output feature values of channels 1-16 is mapped to slice 1, that of channels 17-32 to slice 2, and so on. Each slice therefore computes 24 rows of output feature values for 16 channels, and each slice contains 8 (columns) * 6 (rows) ORMUs, each ORMU mapping the computation of one row of output feature values. Rows 1-3 of the input feature values of channels 1-8, together with the corresponding 16 groups of 3*3*8 weights, are mapped to the ORMU at position (1,1); rows 2-4 of channels 1-8 and the corresponding weights are mapped to the ORMU at (1,2); and so on, so that the first ORMU row computes the partial sums of output rows 1-8 of the 16 channels. Likewise, rows 1-3 of the input feature values of channels 9-16 and the corresponding weights are mapped to the ORMU at (2,1), rows 2-4 of channels 9-16 to the ORMU at (2,2), and so on; accumulating the corresponding ORMUs of the 1st and 2nd ORMU rows yields the partial sums of output rows 1-8 of the 16 channels. In the same way, accumulating the ORMUs of the 3rd and 4th rows yields the partial sums of output rows 9-16, and accumulating the ORMUs of the 5th and 6th rows yields the partial sums of output rows 17-24. After this computation the partial sums are written into the corresponding partial-sum buffer, the input feature values and weights of channels 17-32 are read in, and the accumulation continues until all output feature values are obtained.
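Under the stated assumption of zero-based indexing, the row mapping described above for the conv layer of Table 1 can be modeled by the following sketch (a software illustration of the dataflow, not the hardware itself):

    # Illustrative model of how output rows are mapped to ORMUs for the
    # traditional-convolution example of Table 1 (24*24*64 output, 3*3 kernel).
    # Each slice handles 16 output channels; ORMU (r, c) of a slice computes
    # one output row over a group of 8 input channels of the current pass.
    def ormu_assignment(slice_id):
        out_ch = range(slice_id * 16, (slice_id + 1) * 16)    # output channels of this slice
        tasks = {}
        for r in range(6):           # 6 ORMU rows
            for c in range(8):       # 8 ORMU columns
                out_row = (r // 2) * 8 + c                      # ORMU rows are paired per output-row block
                in_ch = range((r % 2) * 8, (r % 2) * 8 + 8)     # channels 1-8 or 9-16 of this pass
                in_rows = range(out_row, out_row + 3)           # 3 input rows feeding this output row
                tasks[(r, c)] = dict(out_channels=out_ch, out_row=out_row,
                                     in_channels=in_ch, in_rows=in_rows)
        return tasks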
For channel-by-channel (depth-wise) convolution, the computation of the output feature values of channels 1-8 is mapped to slice 1, that of channels 9-16 to slice 2, and so on. Each slice therefore computes 48 rows of output feature values for 8 channels. Rows 1-3 of the input feature values of channels 1-8, together with the corresponding weights, are mapped to the ORMU at (1,1); rows 2-4 of channels 1-8 are mapped to (1,2); and so on. Each ORMU thus maps the same output row of all 8 channels, and together the 48 ORMUs complete the 48 output rows of the 8 channels.
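For reference, the work of a single ORMU under this depth-wise mapping can be written as the following sketch (an illustrative model assuming zero-based indexing, the 3*3 kernel and the 50-wide input rows of Table 1):

    import numpy as np

    # One ORMU computes one output row (width 48) of the same row index for each
    # of its 8 depth-wise channels, from 3 buffered input rows per channel.
    def ormu_depthwise_row(in_rows, weights):
        # in_rows : (8, 3, 50)  -> 8 channels, 3 input rows, input width 50
        # weights : (8, 3, 3)   -> one 3*3 kernel per channel
        out = np.zeros((8, 48))
        for ch in range(8):
            for x in range(48):
                out[ch, x] = np.sum(in_rows[ch, :, x:x + 3] * weights[ch])
        return out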
For point-by-point convolution, the computation of the output feature values of channels 1-32 is mapped to slice 1, that of channels 33-64 to slice 2, and so on. Each slice therefore computes 48 rows of output feature values for 32 channels. Row 1 of the input feature values of channels 1-24, together with the corresponding 32 groups of 1*24 weights, is mapped to the ORMU at (1,1), completing the mapping of row 1 of the 32 output channels; row 2 of channels 1-24 and the corresponding 32 groups of 1*24 weights are mapped to (1,2), completing the mapping of row 2; and so on until all 48 rows of the 32 output channels are mapped. The 4 slices together complete the computation of 128 channels.
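A corresponding reference computation for one ORMU under this point-wise mapping might look like the following sketch (illustrative; the array shapes follow the point-wise example of Table 1 and are an assumption):

    import numpy as np

    # One ORMU computes one output row (width 48) for 32 output channels of a
    # 1*1 convolution, consuming the same row of 24 input channels.
    def ormu_pointwise_row(in_row, weights):
        # in_row  : (24, 48)  -> 24 input channels, row width 48
        # weights : (32, 24)  -> 32 output channels, each a 1*24 weight vector
        return weights @ in_row   # (32, 48): one row of 32 output channels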
For the fully connected layer, the computation of output channels 1-480 is mapped to slice 1, that of channels 481-960 to slice 2, and so on. Each slice therefore computes 1*480 output feature values. The input feature values of channels 1-24 and the corresponding weight groups 1-10 (each of size 1*24) are mapped to the ORMU at (1,1); the input feature values of channels 1-24 and weight groups 11-20 are mapped to (1,2); and so on, so that the 48 ORMUs map the partial sums of 480 output channels. After this computation the partial sums are written to the partial-sum buffer, the input feature values of channels 25-48 and the corresponding weights are read in and accumulated onto the previous partial sums, and so on until all 480 input channels have been accumulated, yielding the 1*480 output feature values; the 4 slices together complete the computation of 1*1920 output feature values.
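The per-slice accumulation over input-channel blocks can be sketched as follows (an illustrative model; the 24-input block size and the 10-outputs-per-ORMU grouping follow the description above, while the function itself is hypothetical):

    import numpy as np

    # Fully connected layer of Table 1: 480 inputs -> 1920 outputs, split as
    # 4 slices * 48 ORMUs * 10 outputs, accumulated over blocks of 24 inputs.
    def fc_slice(inputs, weights, slice_id):
        # inputs  : (480,)        full input feature vector
        # weights : (1920, 480)   full weight matrix
        out = np.zeros(480)                              # this slice's 480 outputs
        out_base = slice_id * 480
        for blk in range(480 // 24):                     # accumulate one 24-input block at a time
            x = inputs[blk * 24:(blk + 1) * 24]
            for ormu in range(48):                       # each ORMU produces 10 partial outputs
                rows = slice(out_base + ormu * 10, out_base + ormu * 10 + 10)
                out[ormu * 10:ormu * 10 + 10] += weights[rows, blk * 24:(blk + 1) * 24] @ x
        return out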
As shown in Fig. 2, the input buffer sub-regions and the input feature value routers are interconnected in one-to-one correspondence, and each input feature value router is interconnected with all ORMUs in the ORMU cluster, forming a fully connected network.
As shown in Fig. 3, the weight data come directly from the external memory: each weight router is directly interconnected with the external memory, and all ORMUs in an ORMU cluster share three weight buses, which connect to the 3 weight buffers inside each ORMU and cache the weights of different rows for traditional or channel-by-channel convolution, or the weights of different channels for point-by-point convolution or the fully connected layer. For channel-by-channel convolution weight reuse still exists, so the weight-transmission network is configured in broadcast mode and the weight data are relayed through the routers, allowing the ORMUs of different ORMU clusters to share the same weights.
With reference to Fig. 4, most convolution kernels are 3*3, which a single ORMU can map by itself. When the kernel is larger than 3, two or more adjacent rows of ORMUs are needed to map the different kernel rows, and the output row is then obtained by accumulating their partial sums. Similarly, traditional convolution, point-by-point convolution and the fully connected layer require accumulation over input channels; this is achieved by mapping the input feature values of different channels to ORMUs in different rows and accumulating their partial sums to produce the output row.
With reference to Fig. 5, for traditional convolution the three Iact Scratch Pads each buffer 1 row of input feature values of 8 channels, and the three Weight Scratch Pads each buffer 1 row of weights of 16 groups for the 8 channels. During computation, the 1st input feature value of each of the 3 rows of channel 1 is read, multiplied element-wise by the 1st weight of each of the 3 weight rows of group 1 for channel 1, and the products are summed and stored in the Psum Scratch Pad. Keeping the input feature values unchanged, they are then multiplied by the corresponding weights of group 2 and accumulated into the Psum Scratch Pad, and so on until the partial sums of all 16 groups are computed. Next, the 1st input feature value of each of the 3 rows of channel 2 is read, the 16 computations are repeated and accumulated onto the previous results, and so on until the 8 channels have been computed and accumulated. Then the 2nd input feature values of the 3 rows of channel 1 and the corresponding weights are read and the above two steps are repeated and accumulated; after the 3rd input feature values have been processed in the same way, the complete partial sum of the 1st output feature value of the same row of the 16 channels is obtained and written to the external partial-sum buffer. The 2nd, 3rd, ..., 24th output feature value partial sums of the same row of the 16 channels are then completed in the same manner.
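The loop order inside one ORMU described above (weight group innermost, then input channel, then kernel column) can be modeled by the following sketch (a reference model under the stated assumptions of 8 channels, 16 weight groups and a 3*3 kernel; it is not the hardware implementation):

    import numpy as np

    # Reference model of one ORMU producing the 1st output value of the same
    # row of 16 output-channel groups for traditional convolution.
    def ormu_first_output(iact, weight):
        # iact   : (8, 3, 3)      8 channels, 3 buffered rows, first 3 columns of each row
        # weight : (16, 8, 3, 3)  16 groups, 8 channels, 3*3 kernel
        psum = np.zeros(16)
        for col in range(3):            # kernel column (the "1st, 2nd, 3rd input feature value")
            for ch in range(8):         # input channel
                for grp in range(16):   # weight group, with the input held constant
                    psum[grp] += np.dot(iact[ch, :, col], weight[grp, ch, :, col])
        return psum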
The computation of channel-by-channel convolution is similar to that of traditional convolution, except that there is no accumulation over input channels; details are omitted.
For point-by-point convolution, because the kernel size is 1*1, in order to fully use the 3 multipliers inside the ORMU, the 3 Iact Scratch Pads do not buffer input feature values of different rows but instead buffer input feature values of different channels of the same row: for example, the first buffer holds row 1 of channels 1-8, the second holds row 1 of channels 9-16, and the third holds row 1 of channels 17-24. The weights are arranged in the same way, and the computation proceeds as in traditional convolution; details are not repeated.
The computation of the fully connected layer is similar to point-by-point convolution and is not described again.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention.
Table 1 is a size description of four different convolution types
TABLE 1
Convolution Type     Input Size    Output Size    Stride    Kernel Size
conv                 26*26*32      24*24*64       1         3*3*32*64
depth-wise conv      50*50*32      48*48*32       1         3*3*32
point-wise conv      48*48*24      48*48*128      1         1*1*72*128
FC                   1*1*480       1*1*1920       -         1*1*480*1440

Claims (10)

1. An FPGA-based neural network accelerator supporting channel separation convolution, characterized in that: the neural network accelerator comprises a Ping-Pong register file module, an ORMU array with configurable dataflow, a functional unit module and a memory interface module;
the Ping-Pong register file module receives configuration information and control commands (such as start-computation) from an external control processor through a configuration bus, configures the dataflow and controls the computation process according to this configuration and control information, and at the same time sends the status information of each unit of the accelerator and a computation-complete interrupt signal to the external controller;
the ORMU array with configurable dataflow contains 4 independent ORMU array slices, each with its own configurable dataflow; each slice interconnects the on-chip memory and the ORMU computing units through a configurable network-on-chip to serve the computation of neural networks with different data-bandwidth requirements;
the functional unit module receives the output feature values computed by the ORMU array, performs operations such as bias addition, normalization, activation and pooling on them, and finally outputs the computation result of the neural network;
the memory interface module is used for reading the input feature values and weights stored in the external memory and for writing the output feature values back to the external memory.
2. The FPGA-based neural network accelerator supporting channel separation convolution of claim 1, wherein: the Ping-Pong register file module comprises a configuration register group, a command register group and a status register group; the configuration register group stores the basic parameters of the convolutional neural network and the ORMU array dataflow configuration information, and is operated in Ping-Pong fashion, i.e. two sets of configuration registers are provided, so that while the computing units use the configuration of the first set, the CPU can configure the parameters of the next layer through the second set; this mechanism switches the accelerator between the computations of different layers and hides the CPU reconfiguration time.
3. The FPGA-based neural network accelerator supporting channel separation convolution of claim 1, wherein: the ORMU array with configurable dataflow comprises a main router cluster, sub-router clusters, a global buffer cluster and ORMU clusters, wherein the global buffer cluster is used to store input feature values and computed partial sums; the sub-router clusters and the main router cluster interconnect the global buffers and the ORMUs, and the network-on-chip is configured into different dataflow modes according to the data-reuse opportunities and data-bandwidth requirements of different neural networks; the ORMU cluster completes the row mapping of the output feature values in an output-feature-value row-stationary manner, improving data reuse as much as possible and completing the convolution computation efficiently.
4. The FPGA-based neural network accelerator supporting channel separation convolution of claim 3, wherein: the global buffer cluster comprises 3 input buffer sub-regions and 4 partial-sum buffer sub-regions; the input buffer sub-regions store input feature values from the external memory, and the partial-sum buffers hold the partial sums generated by the ORMU clusters during convolution; each buffer sub-region in the global buffer cluster is individually interconnected with the corresponding sub-router in the router cluster.
5. The FPGA-based neural network accelerator supporting channel separation convolution of claim 3, wherein: the main router cluster and each sub-router cluster comprise 3 input routers, 3 weight routers and 4 partial-sum routers, which correspond respectively to the input feature value buffers, the weight buffers and the partial-sum buffers; the weight routers are directly interconnected with the external memory; each router is interconnected with all ORMUs in the ORMU cluster.
6. The FPGA-based neural network accelerator supporting channel separation convolution of claim 5, wherein: the main router cluster and the sub-router clusters are configured into different dataflow modes, namely unicast, group multicast, cross multicast and broadcast, according to the structure of the convolutional neural network.
7. The FPGA-based neural network accelerator supporting channel separation convolution of claim 3, wherein: the ORMU cluster is composed of 1*4 ORMU units, where each ORMU unit comprises 3 input register stacks, 1 partial-sum register stack, 3 SRAM weight buffers and 3 multiply-add units.
8. The FPGA-based neural network accelerator supporting channel separation convolution of claim 7, wherein: the ORMU unit supports the mapping and computation of traditional convolution, channel separation convolution and fully connected layers.
9. The FPGA-based neural network accelerator supporting channel separation convolution of claim 1, wherein: the functional unit module comprises 4 functional slices, respectively corresponding to the 4 configurable-dataflow ORMU array slices.
10. The FPGA-based neural network accelerator supporting channel separation convolution of claim 1, wherein: the memory interface module includes three DMAs: DMA_IFM for reading input feature values, DMA_WT for reading weights, and DMA_OFM for writing output feature values to the external memory.
CN202110100516.3A 2021-01-26 2021-01-26 Neural network accelerator supporting channel separation convolution based on FPGA Active CN112766479B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110100516.3A CN112766479B (en) 2021-01-26 2021-01-26 Neural network accelerator supporting channel separation convolution based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110100516.3A CN112766479B (en) 2021-01-26 2021-01-26 Neural network accelerator supporting channel separation convolution based on FPGA

Publications (2)

Publication Number Publication Date
CN112766479A CN112766479A (en) 2021-05-07
CN112766479B (en) 2022-11-11

Family

ID=75707328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110100516.3A Active CN112766479B (en) 2021-01-26 2021-01-26 Neural network accelerator supporting channel separation convolution based on FPGA

Country Status (1)

Country Link
CN (1) CN112766479B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948774B (en) * 2019-01-25 2022-12-13 中山大学 Neural network accelerator based on network layer binding operation and implementation method thereof
CN109934339B (en) * 2019-03-06 2023-05-16 东南大学 General convolutional neural network accelerator based on one-dimensional pulse array
CN110390384B (en) * 2019-06-25 2021-07-06 东南大学 Configurable general convolutional neural network accelerator

Also Published As

Publication number Publication date
CN112766479A (en) 2021-05-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant