CN112766479B - Neural network accelerator supporting channel separation convolution based on FPGA - Google Patents

Neural network accelerator supporting channel separation convolution based on FPGA

Info

Publication number
CN112766479B
CN112766479B CN202110100516.3A
Authority
CN
China
Prior art keywords
ormu
convolution
neural network
cluster
fpga
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110100516.3A
Other languages
Chinese (zh)
Other versions
CN112766479A (en)
Inventor
陆生礼
苏晶晶
庞伟
刘昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202110100516.3A priority Critical patent/CN112766479B/en
Publication of CN112766479A publication Critical patent/CN112766479A/en
Application granted granted Critical
Publication of CN112766479B publication Critical patent/CN112766479B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483 Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/485 Adding; Subtracting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Nonlinear Science (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Advance Control (AREA)
  • Logic Circuits (AREA)

Abstract

The invention discloses an FPGA-based neural network accelerator supporting channel separation convolution, comprising a Ping-Pong register file, an ORMU (Output feature map Row Mapping Unit) array with configurable dataflow, a functional unit module and a memory interface module, wherein the ORMU array maps output feature values under the configurable dataflow. The Ping-Pong register file receives configuration and control words from the control processor and sends an interrupt signal after the computation is completed; the ORMU array interconnects the ORMU units and the buffers through a configurable network-on-chip so as to serve neural networks with different data-bandwidth requirements; the functional unit module implements pooling, ReLU activation, batch normalization (BN) and similar functions; the memory interface module transfers the weights and feature values. Through a flexible, hierarchical mesh network-on-chip, the invention supports the differing data-bandwidth requirements of channel separation convolution (channel-by-channel and point-by-point convolution), traditional convolution and fully connected layers, thereby keeping the utilization of the computing units high and greatly improving inference/computation speed.

Description

Neural network accelerator supporting channel separation convolution based on FPGA
Technical Field
The invention relates to the hardware architecture of an FPGA (Field-Programmable Gate Array)-based neural network accelerator supporting channel separation convolution, and belongs to the technical fields of electronic information and deep learning.
Background
In recent years, owing to the explosive growth of usable data (text, video, audio, etc.) and advances in semiconductor technology, deep learning has developed rapidly and achieved great success in fields such as machine vision and natural language processing. Because a deep network has a multi-layer nonlinear structure, it offers strong feature-expression and modeling capability for complex tasks, but this also brings a huge number of parameters and heavy computation. Although today's servers, with powerful computing capability and mass storage, can easily perform inference for even the most complex convolutional neural networks, in most practical applications the forward inference of a convolutional neural network must be executed on terminals with limited resources and power budget in order to reduce latency and security risks. Examples include autonomous driving, drone navigation and robotics.
To meet the requirements of practical applications, extending convolutional neural networks to embedded terminals has become an important trend in their recent development, with the goals of reducing model size and improving hardware processing efficiency. In this exploration, many innovative techniques have been proposed, including weight and feature-value quantization, weight pruning, and the replacement of traditional convolution with channel separation convolution, which make the structure of the convolutional neural network very compact and make the feature values and weights sparser.
Nevertheless, these algorithmic optimizations only reduce computation and memory cost in theory, and most conventional convolutional neural network accelerators today do not translate this theoretical benefit well into practical gains in energy efficiency and processing speed. The irregularity of the network structure and the sparsity of the data can greatly reduce the temporal and spatial utilization of the MAC (multiply-accumulate) units of a neural network hardware accelerator, which in turn degrades performance.
Among the above-mentioned methods, DW-CNN (Depth-Wise Convolutional Neural Networks) and PW-CNN (Point-Wise Convolutional Neural Networks), which replace standard convolution with channel separation convolution, are widely used in various lightweight neural networks to greatly reduce the number of parameters and the computational complexity.
Based on the above analysis, designing a flexible and efficient accelerator that supports channel separation convolution gives the accelerator inherent advantages in energy efficiency and processing speed.
Disclosure of Invention
Technical problem: The invention aims to provide an FPGA (Field-Programmable Gate Array)-based neural network accelerator supporting channel separation convolution. By supporting channel separation convolution, the large reduction in parameters and computation offered by compact convolutional neural networks is fully exploited, improving energy efficiency and processing speed. To this end, a flexibly configurable network-on-chip is adopted to accommodate the changing bandwidth requirements that different network structures place on the computing units, and an output-feature-value row-stationary dataflow is adopted to fully exploit data reuse and improve energy efficiency.
Technical solution: The FPGA (Field-Programmable Gate Array)-based neural network accelerator supporting channel separation convolution according to the invention comprises a Ping-Pong register file module, an ORMU array with configurable dataflow, a functional unit module and a memory interface module;
the Ping-Pong register file module receives configuration information and control commands (such as start-computation) from an external control processor through a configuration bus, configures the dataflow and controls the computation process according to this configuration and control information, and at the same time sends the status information of each unit of the accelerator and a computation-complete interrupt signal to the external controller;
the ORMU array with configurable dataflow contains 4 independent ORMU array slices, each with its own configurable dataflow; each slice interconnects the on-chip memory and the ORMU computing units through a configurable mesh network-on-chip, so as to serve neural networks with different data-bandwidth requirements and complete the convolution computation;
the functional unit module receives the output feature values computed by the ORMU array, performs bias addition, normalization, activation and pooling on them, and finally outputs the computation result of the neural network;
the memory interface module is used for reading the input feature values and weights stored in the external memory and for writing the output feature values back to the external memory.
Wherein:
the Ping-Pong register file module comprises a configuration register group, a command register group and a state register group, wherein the configuration register group stores basic parameters of a convolutional neural network and data stream configuration information of a computing processing unit array, ping-Pong operation is adopted at the same time, namely two groups of configuration register groups are adopted, when a computing unit adopts the configuration information of a first group, a CPU (central processing unit) can configure parameters of a next layer through the configuration register group of a second group, and the mechanism realizes the calculation switching of an accelerator at different layers and hides the reconfiguration time of the CPU.
The ORMU array with configurable dataflow comprises a main router cluster, sub-router clusters, a global buffer cluster and ORMU clusters, wherein the global buffer cluster is used to store input feature values and computed partial sums; the sub-router clusters and the main router cluster interconnect the grouped global buffers and the ORMUs, and the network-on-chip is configured into different dataflow modes according to the data-reuse opportunities and data-bandwidth requirements of different neural networks; the ORMU cluster completes the row mapping of the output feature values in an output-feature-value row-stationary manner, improving data reuse as much as possible and completing the convolution computation efficiently.
The global buffer cluster comprises 3 input buffer sub-regions and 4 partial-sum buffer sub-regions. The input buffer sub-regions store input feature values from the external memory, and the partial-sum buffers hold the partial sums generated by the ORMU clusters during convolution; each buffer sub-region in the global buffer cluster is individually interconnected with the corresponding sub-router in the router cluster.
The main router cluster and each sub-router cluster comprise 3 input routers, 3 weight routers and 4 partial-sum routers, which correspond respectively to the input feature value buffers, the weight buffers and the partial-sum buffers; the weight routers are directly interconnected with the external memory; each router is interconnected with all ORMUs in the ORMU cluster.
The main router cluster and the sub-router clusters are configured into different dataflow modes, namely unicast, group multicast, cross multicast and broadcast, according to the structure of the convolutional neural network being computed.
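The four modes can be pictured as different fan-out patterns from a router to the ORMUs of a cluster, as in the following illustrative sketch (the exact grouping functions are assumptions; only the mode names follow the patent):

    # Hypothetical model of how a router fans data out to an ORMU cluster,
    # depending on the configured dataflow mode.
    def route(mode, data, num_ormus, group_size=2):
        """Return, for each ORMU index, the data item it receives."""
        if mode == "unicast":            # one distinct item per ORMU
            return {i: data[i] for i in range(num_ormus)}
        if mode == "group_multicast":    # ORMUs in the same group share one item
            return {i: data[i // group_size] for i in range(num_ormus)}
        if mode == "cross_multicast":    # ORMUs with the same offset across groups share one item
            return {i: data[i % group_size] for i in range(num_ormus)}
        if mode == "broadcast":          # all ORMUs receive the same item
            return {i: data[0] for i in range(num_ormus)}
        raise ValueError(mode)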
The ORMU cluster is composed of 1*4 ORMU units, where each ORMU unit comprises 3 input register stacks (Iact Scratch Pads), 1 partial-sum register stack (Psum Scratch Pad), 3 SRAM weight buffers (Weight Scratch Pads) and 3 multiply-add units.
The ORMU unit supports the mapping and computation of traditional convolution, channel separation convolution and fully connected layers.
The functional unit module comprises 4 functional slices, respectively corresponding to the 4 configurable-dataflow ORMU array slices, for implementing pooling, ReLU activation and batch normalization on their outputs.
The memory interface module includes three DMAs: DMA_IFM for reading input feature values, DMA_WT for reading weights, and DMA_OFM for writing output feature values to the external memory.
Beneficial effects: with the configurable-dataflow network-on-chip, different network structures including traditional convolution, channel separation convolution and fully connected layers are supported flexibly and efficiently while the utilization of the computing units is kept high; a row-stationary data-reuse scheme is adopted, data reuse is fully exploited through local registers and caches, accesses to the external memory are reduced, and energy efficiency is improved.
Description of the drawings:
Figure 1 is a system architecture diagram of the present invention;
Figure 2 shows the network-on-chip that transmits input feature values;
Figure 3 shows the network-on-chip that transmits weights;
Figure 4 shows the network-on-chip that transmits partial sums;
Figure 5 is a block diagram of the ORMU unit.
Description of the symbols:
FPGA: field-Programmable Gate Array
Ping-Pong register: ping-pong register
ORMU array: output Feature Map Row Mapping Unit, output Feature value Row Mapping Unit
SRAM: static random access Memory
DMA: direct Memory Access (DMA)
DMA _ IFM: DMAinput Feature Map, DMA for transferring input Feature values
DMA _ WT: DMAweight, DMA for transfer weights
DMA _ OFM: DMAoutput Feature Map, DMA for transferring output Feature values
Detailed Description
The technical solution and the advantages of the present invention will be described in detail with reference to the accompanying drawings.
As shown in Fig. 1, the four convolution types listed in Table 1 are used as examples to describe in detail how the convolutional neural network accelerator hardware designed according to the present invention operates.
The external control processor first writes the parameters of the current layer, such as the input feature map size, the number of channels, whether padding is applied and the convolution mode (fully connected, channel separation convolution or traditional convolution), together with the network-on-chip dataflow configuration, into the corresponding accelerator registers through the configuration bus. It then controls the DMAs to write the input feature values and the weights into the corresponding input buffer sub-regions and the weight buffers inside the ORMU units, respectively. After the computation is finished, the resulting output features are written to the functional unit, where pooling, ReLU and similar operations are completed; an interrupt is then raised to the external controller and the result is written back to the external memory.
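The per-layer control sequence just described can be summarized in the following Python-style sketch (illustrative only; the driver object acc and its method names are assumptions, not an API defined by the patent):

    # Hypothetical per-layer control flow as seen from the external control
    # processor; acc stands for a driver whose methods mirror the configuration
    # bus and the three DMAs.
    def run_layer(acc, layer):
        acc.write_config(size=layer.ifm_size, channels=layer.channels,
                         padding=layer.padding, mode=layer.conv_mode,
                         noc_dataflow=layer.dataflow)   # via the configuration bus
        acc.dma_ifm.load(layer.ifm)       # input feature values -> input buffer sub-regions
        acc.dma_wt.load(layer.weights)    # weights -> ORMU weight buffers
        acc.start()                       # ORMU array computes partial sums
        acc.wait_interrupt()              # functional unit applies bias/BN/ReLU/pooling, raises interrupt
        acc.dma_ofm.store(layer.ofm)      # write output feature values back to external memory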
For traditional convolution, the computation of the output feature values of channels 1-16 is mapped to slice 1, that of channels 17-32 to slice 2, and so on. Each slice therefore computes 24 rows of output feature values for 16 channels, and each slice contains 8 (columns) * 6 (rows) ORMUs, each ORMU mapping the computation of one row of output feature values. Rows 1-3 of the input feature values of channels 1-8, together with the corresponding 16 groups of 3*3*8 weights, are mapped to the ORMU at position (1,1); rows 2-4 of channels 1-8 and the corresponding weights are mapped to the ORMU at (1,2); and so on, so that the first ORMU row computes the partial sums of output rows 1-8 of the 16 channels. Likewise, rows 1-3 of the input feature values of channels 9-16 and the corresponding weights are mapped to the ORMU at (2,1), rows 2-4 of channels 9-16 to the ORMU at (2,2), and so on; accumulating the corresponding ORMUs of the 1st and 2nd ORMU rows yields the partial sums of output rows 1-8 of the 16 channels. In the same way, accumulating the ORMUs of the 3rd and 4th rows yields the partial sums of output rows 9-16, and accumulating the ORMUs of the 5th and 6th rows yields the partial sums of output rows 17-24. After this computation the partial sums are written into the corresponding partial-sum buffer, the input feature values and weights of channels 17-32 are read in, and the accumulation continues until all output feature values are obtained.
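Under the stated assumption of zero-based indexing, the row mapping described above for the conv layer of Table 1 can be modeled by the following sketch (a software illustration of the dataflow, not the hardware itself):

    # Illustrative model of how output rows are mapped to ORMUs for the
    # traditional-convolution example of Table 1 (24*24*64 output, 3*3 kernel).
    # Each slice handles 16 output channels; ORMU (r, c) of a slice computes
    # one output row over a group of 8 input channels of the current pass.
    def ormu_assignment(slice_id):
        out_ch = range(slice_id * 16, (slice_id + 1) * 16)    # output channels of this slice
        tasks = {}
        for r in range(6):           # 6 ORMU rows
            for c in range(8):       # 8 ORMU columns
                out_row = (r // 2) * 8 + c                      # ORMU rows are paired per output-row block
                in_ch = range((r % 2) * 8, (r % 2) * 8 + 8)     # channels 1-8 or 9-16 of this pass
                in_rows = range(out_row, out_row + 3)           # 3 input rows feeding this output row
                tasks[(r, c)] = dict(out_channels=out_ch, out_row=out_row,
                                     in_channels=in_ch, in_rows=in_rows)
        return tasks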
For channel-by-channel (depth-wise) convolution, the computation of the output feature values of channels 1-8 is mapped to slice 1, that of channels 9-16 to slice 2, and so on. Each slice therefore computes 48 rows of output feature values for 8 channels. Rows 1-3 of the input feature values of channels 1-8, together with the corresponding weights, are mapped to the ORMU at (1,1); rows 2-4 of channels 1-8 are mapped to (1,2); and so on. Each ORMU thus maps the same output row of all 8 channels, and together the 48 ORMUs complete the 48 output rows of the 8 channels.
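For reference, the work of a single ORMU under this depth-wise mapping can be written as the following sketch (an illustrative model assuming zero-based indexing, the 3*3 kernel and the 50-wide input rows of Table 1):

    import numpy as np

    # One ORMU computes one output row (width 48) of the same row index for each
    # of its 8 depth-wise channels, from 3 buffered input rows per channel.
    def ormu_depthwise_row(in_rows, weights):
        # in_rows : (8, 3, 50)  -> 8 channels, 3 input rows, input width 50
        # weights : (8, 3, 3)   -> one 3*3 kernel per channel
        out = np.zeros((8, 48))
        for ch in range(8):
            for x in range(48):
                out[ch, x] = np.sum(in_rows[ch, :, x:x + 3] * weights[ch])
        return out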
For point-by-point convolution, the computation of the output feature values of channels 1-32 is mapped to slice 1, that of channels 33-64 to slice 2, and so on. Each slice therefore computes 48 rows of output feature values for 32 channels. Row 1 of the input feature values of channels 1-24, together with the corresponding 32 groups of 1*24 weights, is mapped to the ORMU at (1,1), completing the mapping of row 1 of the 32 output channels; row 2 of channels 1-24 and the corresponding 32 groups of 1*24 weights are mapped to (1,2), completing the mapping of row 2; and so on until all 48 rows of the 32 output channels are mapped. The 4 slices together complete the computation of 128 channels.
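A corresponding reference computation for one ORMU under this point-wise mapping might look like the following sketch (illustrative; the array shapes follow the point-wise example of Table 1 and are an assumption):

    import numpy as np

    # One ORMU computes one output row (width 48) for 32 output channels of a
    # 1*1 convolution, consuming the same row of 24 input channels.
    def ormu_pointwise_row(in_row, weights):
        # in_row  : (24, 48)  -> 24 input channels, row width 48
        # weights : (32, 24)  -> 32 output channels, each a 1*24 weight vector
        return weights @ in_row   # (32, 48): one row of 32 output channels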
For the fully connected layer, the computation of output channels 1-480 is mapped to slice 1, that of channels 481-960 to slice 2, and so on. Each slice therefore computes 1*480 output feature values. The input feature values of channels 1-24 and the corresponding weight groups 1-10 (each of size 1*24) are mapped to the ORMU at (1,1); the input feature values of channels 1-24 and weight groups 11-20 are mapped to (1,2); and so on, so that the 48 ORMUs map the partial sums of 480 output channels. After this computation the partial sums are written to the partial-sum buffer, the input feature values of channels 25-48 and the corresponding weights are read in and accumulated onto the previous partial sums, and so on until all 480 input channels have been accumulated, yielding the 1*480 output feature values; the 4 slices together complete the computation of 1*1920 output feature values.
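The per-slice accumulation over input-channel blocks can be sketched as follows (an illustrative model; the 24-input block size and the 10-outputs-per-ORMU grouping follow the description above, while the function itself is hypothetical):

    import numpy as np

    # Fully connected layer of Table 1: 480 inputs -> 1920 outputs, split as
    # 4 slices * 48 ORMUs * 10 outputs, accumulated over blocks of 24 inputs.
    def fc_slice(inputs, weights, slice_id):
        # inputs  : (480,)        full input feature vector
        # weights : (1920, 480)   full weight matrix
        out = np.zeros(480)                              # this slice's 480 outputs
        out_base = slice_id * 480
        for blk in range(480 // 24):                     # accumulate one 24-input block at a time
            x = inputs[blk * 24:(blk + 1) * 24]
            for ormu in range(48):                       # each ORMU produces 10 partial outputs
                rows = slice(out_base + ormu * 10, out_base + ormu * 10 + 10)
                out[ormu * 10:ormu * 10 + 10] += weights[rows, blk * 24:(blk + 1) * 24] @ x
        return out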
As shown in Fig. 2, the input buffer sub-regions and the input feature value routers are interconnected in one-to-one correspondence, and each input feature value router is interconnected with all ORMUs in the ORMU cluster, forming a fully connected network.
As shown in Fig. 3, the weight data come directly from the external memory: each weight router is directly interconnected with the external memory, and all ORMUs in an ORMU cluster share three weight buses, which connect to the 3 weight buffers inside each ORMU and cache the weights of different rows for traditional or channel-by-channel convolution, or the weights of different channels for point-by-point convolution or the fully connected layer. For channel-by-channel convolution weight reuse still exists, so the weight-transmission network is configured in broadcast mode and the weight data are relayed through the routers, allowing the ORMUs of different ORMU clusters to share the same weights.
With reference to Fig. 4, most convolution kernels are 3*3, which a single ORMU can map by itself. When the kernel is larger than 3, two or more adjacent rows of ORMUs are needed to map the different kernel rows, and the output row is then obtained by accumulating their partial sums. Similarly, traditional convolution, point-by-point convolution and the fully connected layer require accumulation over input channels; this is achieved by mapping the input feature values of different channels to ORMUs in different rows and accumulating their partial sums to produce the output row.
With reference to Fig. 5, for traditional convolution the three Iact Scratch Pads each buffer 1 row of input feature values of 8 channels, and the three Weight Scratch Pads each buffer 1 row of weights of 16 groups for the 8 channels. During computation, the 1st input feature value of each of the 3 rows of channel 1 is read, multiplied element-wise by the 1st weight of each of the 3 weight rows of group 1 for channel 1, and the products are summed and stored in the Psum Scratch Pad. Keeping the input feature values unchanged, they are then multiplied by the corresponding weights of group 2 and accumulated into the Psum Scratch Pad, and so on until the partial sums of all 16 groups are computed. Next, the 1st input feature value of each of the 3 rows of channel 2 is read, the 16 computations are repeated and accumulated onto the previous results, and so on until the 8 channels have been computed and accumulated. Then the 2nd input feature values of the 3 rows of channel 1 and the corresponding weights are read and the above two steps are repeated and accumulated; after the 3rd input feature values have been processed in the same way, the complete partial sum of the 1st output feature value of the same row of the 16 channels is obtained and written to the external partial-sum buffer. The 2nd, 3rd, ..., 24th output feature value partial sums of the same row of the 16 channels are then completed in the same manner.
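The loop order inside one ORMU described above (weight group innermost, then input channel, then kernel column) can be modeled by the following sketch (a reference model under the stated assumptions of 8 channels, 16 weight groups and a 3*3 kernel; it is not the hardware implementation):

    import numpy as np

    # Reference model of one ORMU producing the 1st output value of the same
    # row of 16 output-channel groups for traditional convolution.
    def ormu_first_output(iact, weight):
        # iact   : (8, 3, 3)      8 channels, 3 buffered rows, first 3 columns of each row
        # weight : (16, 8, 3, 3)  16 groups, 8 channels, 3*3 kernel
        psum = np.zeros(16)
        for col in range(3):            # kernel column (the "1st, 2nd, 3rd input feature value")
            for ch in range(8):         # input channel
                for grp in range(16):   # weight group, with the input held constant
                    psum[grp] += np.dot(iact[ch, :, col], weight[grp, ch, :, col])
        return psum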
The computation of channel-by-channel convolution is similar to that of traditional convolution, except that there is no accumulation over input channels; details are omitted.
For point-by-point convolution, because the kernel size is 1*1, in order to fully use the 3 multipliers inside the ORMU, the 3 Iact Scratch Pads do not buffer input feature values of different rows but instead buffer input feature values of different channels of the same row: for example, the first buffer holds row 1 of channels 1-8, the second holds row 1 of channels 9-16, and the third holds row 1 of channels 17-24. The weights are arranged in the same way, and the computation proceeds as in traditional convolution; details are not repeated.
The computation of the fully connected layer is similar to point-by-point convolution and is not described again.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention.
Table 1 is a size description of four different convolution types
TABLE 1
Convolution Type     Input Size    Output Size    Stride    Kernel Size
conv                 26*26*32      24*24*64       1         3*3*32*64
depth-wise conv      50*50*32      48*48*32       1         3*3*32
point-wise conv      48*48*24      48*48*128      1         1*1*72*128
FC                   1*1*480       1*1*1920       -         1*1*480*1440

Claims (10)

1. An FPGA-based neural network accelerator supporting channel separation convolution, characterized in that: the neural network accelerator comprises a Ping-Pong register file module, an ORMU array with configurable dataflow, a functional unit module and a memory interface module;
the Ping-Pong register file module receives configuration information and control commands (such as start-computation) from an external control processor through a configuration bus, configures the dataflow and controls the computation process according to this configuration and control information, and at the same time sends the status information of each unit of the accelerator and a computation-complete interrupt signal to the external controller;
the ORMU array with configurable dataflow contains 4 independent ORMU array slices, each with its own configurable dataflow; each slice interconnects the on-chip memory and the ORMU computing units through a configurable network-on-chip to serve the computation of neural networks with different data-bandwidth requirements;
the functional unit module receives the output feature values computed by the ORMU array, performs operations such as bias addition, normalization, activation and pooling on them, and finally outputs the computation result of the neural network;
the memory interface module is used for reading the input feature values and weights stored in the external memory and for writing the output feature values back to the external memory.
2. The FPGA-based neural network accelerator supporting channel separation convolution of claim 1, wherein: the Ping-Pong register file module comprises a configuration register group, a command register group and a status register group; the configuration register group stores the basic parameters of the convolutional neural network and the ORMU array dataflow configuration information, and is operated in Ping-Pong fashion, i.e. two sets of configuration registers are provided, so that while the computing units use the configuration of the first set, the CPU can configure the parameters of the next layer through the second set; this mechanism switches the accelerator between the computations of different layers and hides the CPU reconfiguration time.
3. The FPGA-based neural network accelerator supporting channel separation convolution of claim 1, wherein: the ORMU array with configurable dataflow comprises a main router cluster, sub-router clusters, a global buffer cluster and ORMU clusters, wherein the global buffer cluster is used to store input feature values and computed partial sums; the sub-router clusters and the main router cluster interconnect the global buffers and the ORMUs, and the network-on-chip is configured into different dataflow modes according to the data-reuse opportunities and data-bandwidth requirements of different neural networks; the ORMU cluster completes the row mapping of the output feature values in an output-feature-value row-stationary manner, improving data reuse as much as possible and completing the convolution computation efficiently.
4. The FPGA-based neural network accelerator supporting channel separation convolution of claim 3, wherein: the global buffer cluster comprises 3 input buffer sub-regions and 4 partial-sum buffer sub-regions; the input buffer sub-regions store input feature values from the external memory, and the partial-sum buffers hold the partial sums generated by the ORMU clusters during convolution; each buffer sub-region in the global buffer cluster is individually interconnected with the corresponding sub-router in the router cluster.
5. The FPGA-based neural network accelerator supporting channel separation convolution of claim 3, wherein: the main router cluster and each sub-router cluster comprise 3 input routers, 3 weight routers and 4 partial-sum routers, which correspond respectively to the input feature value buffers, the weight buffers and the partial-sum buffers; the weight routers are directly interconnected with the external memory; each router is interconnected with all ORMUs in the ORMU cluster.
6. The FPGA-based neural network accelerator supporting channel separation convolution of claim 5, wherein: the main router cluster and the sub-router clusters are configured into different dataflow modes, namely unicast, group multicast, cross multicast and broadcast, according to the structure of the convolutional neural network.
7. The FPGA-based neural network accelerator supporting channel separation convolution of claim 3, wherein: the ORMU cluster is composed of 1*4 ORMU units, where each ORMU unit comprises 3 input register stacks, 1 partial-sum register stack, 3 SRAM weight buffers and 3 multiply-add units.
8. The FPGA-based neural network accelerator supporting channel separation convolution of claim 7, wherein: the ORMU unit supports the mapping and computation of traditional convolution, channel separation convolution and fully connected layers.
9. The FPGA-based neural network accelerator supporting channel separation convolution of claim 1, wherein: the functional unit module comprises 4 functional slices, respectively corresponding to the 4 configurable-dataflow ORMU array slices.
10. The FPGA-based neural network accelerator supporting channel separation convolution of claim 1, wherein: the memory interface module includes three DMAs: DMA_IFM for reading input feature values, DMA_WT for reading weights, and DMA_OFM for writing output feature values to the external memory.
CN202110100516.3A 2021-01-26 2021-01-26 Neural network accelerator supporting channel separation convolution based on FPGA Active CN112766479B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110100516.3A CN112766479B (en) 2021-01-26 2021-01-26 Neural network accelerator supporting channel separation convolution based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110100516.3A CN112766479B (en) 2021-01-26 2021-01-26 Neural network accelerator supporting channel separation convolution based on FPGA

Publications (2)

Publication Number Publication Date
CN112766479A CN112766479A (en) 2021-05-07
CN112766479B (en) 2022-11-11

Family

ID=75707328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110100516.3A Active CN112766479B (en) 2021-01-26 2021-01-26 Neural network accelerator supporting channel separation convolution based on FPGA

Country Status (1)

Country Link
CN (1) CN112766479B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948774B (en) * 2019-01-25 2022-12-13 中山大学 Neural network accelerator based on network layer binding operation and implementation method thereof
CN109934339B (en) * 2019-03-06 2023-05-16 东南大学 General convolutional neural network accelerator based on one-dimensional pulse array
CN110390384B (en) * 2019-06-25 2021-07-06 东南大学 Configurable general convolutional neural network accelerator

Also Published As

Publication number Publication date
CN112766479A (en) 2021-05-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant