CN114911628A - MobileNet hardware acceleration system based on FPGA - Google Patents

MobileNet hardware acceleration system based on FPGA

Info

Publication number
CN114911628A
CN114911628A
Authority
CN
China
Prior art keywords
module
mobilenet
point
convolution
fpga
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210675284.9A
Other languages
Chinese (zh)
Inventor
魏榕山
林宇轩
陈标发
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202210675284.9A
Publication of CN114911628A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources to service a request
    • G06F 9/5027 - Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention relates to a MobileNet hardware acceleration system based on an FPGA. The system comprises a PL end, a CPU end, a communication module and a storage module, wherein the PL end is responsible for accelerating the MobileNet network, and the CPU end is responsible for overall task coordination and instruction issuing; the PL end comprises a core control module and operation modules connected with the core control module; the communication module realizes data transmission between the PL end and the CPU end and between the PL end and the storage module; and the storage module cooperatively stores the PL-end data. The invention enables the network to fully exploit its parallel unrolling degree during inference, improving resource utilization and system throughput.

Description

MobileNet hardware acceleration system based on FPGA
Technical Field
The invention relates to a MobileNet hardware acceleration system based on an FPGA.
Background
With the rapid development of artificial intelligence, and of deep learning in particular, the continuous pursuit of model accuracy has pushed neural network models toward ever greater depth and structural complexity. As a result, the computation and parameter volumes of most existing neural networks make them difficult to deploy in real scenarios such as mobile terminals or edge devices. Applying neural networks in such scenarios faces two main problems: storage and speed. Mobile terminals and edge devices are constrained by cost and power consumption, typically have limited resources, and can hardly store such huge model parameters. Many scenarios also place strict demands on real-time performance, notably military reconnaissance and automotive driver-assistance systems, and the huge computational load of deep neural networks makes them hard to apply there. Research into compact and efficient CNN models is therefore important for these application scenarios.
The emergence and development of lightweight neural networks have cleared the obstacles to deploying neural networks on mobile terminals and edge devices. MobileNet is a new generation of mobile-end convolutional neural network model proposed by Google. Its core idea is to replace standard convolution with depthwise separable convolution; compared with a network built from standard convolution layers, the parameter count and computation can be reduced to roughly one ninth of the original. Although a lightweight neural network sacrifices a little overall accuracy, it greatly relaxes the storage and computing requirements placed on the application scenario.
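The roughly one-ninth figure can be checked with a quick count of multiply-accumulate (MAC) operations. The Python sketch below is illustrative only; the layer shape is an assumption, not a figure from the patent:

```python
# Compare the MAC cost of a standard 3x3 convolution with a depthwise
# separable one (depthwise 3x3 followed by pointwise 1x1).

def standard_conv_macs(h, w, cin, cout, k=3):
    # Each output pixel needs k*k*cin MACs for each of the cout filters.
    return h * w * cout * k * k * cin

def depthwise_separable_macs(h, w, cin, cout, k=3):
    depthwise = h * w * cin * k * k   # one k x k filter per input channel
    pointwise = h * w * cout * cin    # 1x1 convolution across channels
    return depthwise + pointwise

h, w, cin, cout = 112, 112, 64, 128   # illustrative MobileNet-like layer
std = standard_conv_macs(h, w, cin, cout)
dws = depthwise_separable_macs(h, w, cin, cout)
print(f"reduction factor: {std / dws:.1f}x")   # ~8.4x for this shape
```

For a 3 x 3 kernel the theoretical ratio is 1/N + 1/9, where N is the number of output channels, which approaches one ninth as N grows.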
Meanwhile, with the rapid growth of mobile internet applications and the explosive expansion of the data volumes to be computed, a general-purpose processor (CPU) at the edge device can no longer meet the performance demands of energy-efficient computing and diversified applications. The graphics processing unit (GPU) offers excellent computational performance, but its cost and power consumption have confined it largely to training and testing in the early stages of research, and it is hard to deploy on edge devices. An ASIC offers the best performance and power consumption but the worst flexibility, struggling to keep pace with the fast iteration of neural networks; ASIC solutions are also costly and need large shipment volumes to amortize that cost. The reconfigurable, high-performance, low-power FPGA is therefore the optimal choice.
In conclusion, high-performance hardware acceleration built on the programmability of the FPGA has great advantages in target detection applications. Accordingly, to meet the speed, power and resource requirements of practical application scenarios, the invention designs a MobileNet hardware implementation method based on FPGA acceleration.
Disclosure of Invention
The invention aims to provide a MobileNet hardware acceleration system based on an FPGA (field programmable gate array) that lets the network fully exploit its parallel unrolling degree during inference, improving resource utilization and system throughput.
In order to achieve the above purpose, the technical scheme of the invention is as follows: a MobileNet hardware acceleration system based on an FPGA comprises a PL end, a CPU end, a communication module and a storage module, wherein the PL end is responsible for the accelerated implementation of the MobileNet network, and the CPU end is responsible for overall task coordination and instruction issuing;
the PL end comprises a core control module and operation modules connected with the core control module;
the communication module is used for realizing data transmission between the PL end and the CPU end as well as between the PL end and the storage module;
the storage module is used for cooperatively storing the PL-end data.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention provides a MobileNet hardware accelerator architecture together with a parallel unrolling strategy adapted to the characteristics of each network layer, so that the network can fully exploit its parallel unrolling degree during inference, improving resource utilization and system throughput.
2. The invention optimizes each module of the accelerator with parallel unrolling and pipelining, in combination with the network characteristics, so as to improve system throughput.
Drawings
FIG. 1 is a MobileNet hardware acceleration system architecture.
Fig. 2 is a design diagram of the depthwise convolution module (compatible with standard convolution) architecture.
Fig. 3 is a design diagram of the point-by-point convolution module (compatible with the fully connected layer) architecture.
FIG. 4 is a diagram of an average pooling layer architecture design.
Fig. 5 is a flow chart of the operation of the MobileNet hardware acceleration system.
Detailed Description
The technical scheme of the invention is specifically explained below with reference to the accompanying drawings.
The architecture of the MobileNet hardware acceleration system provided by the invention is shown in Fig. 1. The CPU end is responsible for overall task coordination and instruction issuing, while the PL end is responsible for the accelerated implementation of the MobileNet network. Because on-chip storage resources are limited, an off-chip DDR memory cooperatively stores the data, and the PL end transmits input and output data by configuring Direct Memory Access (DMA). The instructions, which include the operation mode, configuration parameters, storage locations and so on, are stored in BRAM. The Command Analyzer serves as the core control module, parsing the instructions and outputting the corresponding control signals. The computing units, namely the depthwise convolution module, point-by-point convolution module, SoftMax module, fully connected module and average pooling module, are controlled by the Command Analyzer: each reads an input feature map from the input buffer, performs the corresponding calculation using DSP resources, and caches intermediate data in the output buffer; after the calculation finishes, the data is quantized and activated and finally written back to the input buffer.
The system block diagram of the invention is shown in Fig. 1. The system mainly comprises a storage module, a communication module, the operation modules and a core control module; the structure and function of each part are as follows:
1. memory module
Because the BRAM on the FPGA chip offers limited storage and most practical application scenarios require communication with external storage or interfaces, the design uses an off-chip DRAM to store the input data, related configuration, weights and quantization parameters; once the accelerator starts working, the corresponding data is read into on-chip BRAM to speed up subsequent reads. The MobileNet network model parameters occupy 3852.4 kB, and together with the quantization parameters, configuration parameters and other data the total storage requirement is under 4 MB. Since the ZCU102 development board adopted by the invention provides 32.1 Mb of BRAM, after the initial data is imported from DDR, all data read during intermediate calculation and all computed results are kept in BRAM without further interaction with the off-chip DDR, which greatly improves system speed and performance.
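As a quick sanity check of these figures (a sketch using only the numbers quoted above, not additional data from the patent):

```python
# Storage budget check using the figures quoted in the description.
bram_capacity_MB = 32.1 / 8        # ZCU102 BRAM: 32.1 Mb ~= 4.01 MB
weights_MB = 3852.4 / 1024         # model parameters: ~3.76 MB
headroom_MB = bram_capacity_MB - weights_MB
print(f"capacity {bram_capacity_MB:.2f} MB, weights {weights_MB:.2f} MB, "
      f"headroom {headroom_MB:.2f} MB")   # ~0.25 MB left for quantization/config data
```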
2. Communication module
The communication module connects PL and PS and transfers data between them. The communication module of the system adopts the AXI4 and AXI4-Lite buses.
3. Each operation module
The operation modules comprise a point-by-point convolution module, a depthwise convolution module, an average pooling module and a SoftMax module; their specific structures and functions are as follows.
Depthwise convolution module (compatible with standard convolution): the MobileNet network contains one standard convolution layer and thirteen depthwise convolution layers, fourteen such network layers in total. Since standard convolution and depthwise convolution are quite similar and the network contains only a single standard convolution layer, the module is designed mainly around the characteristics of depthwise convolution, with standard convolution supported for compatibility. Considering the FPGA resources and the structure of the MobileNet network, the two dimensions of input-feature-map channel count and kernel size are unrolled with a parallelism of 32 x 18 (18 = 3 x 3 x 2), and depthwise convolution is computed in parallel by replicating resources such as multipliers and adder trees. In addition, pipelining is used to optimize the depthwise convolution calculation: the operation is subdivided, cycle by cycle, into reading data, multiplication, accumulation, caching intermediate data, reading cached data and so on, so that every stage has continuous input and output in every cycle. The architecture design is shown in Fig. 2.
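As a rough behavioural illustration of this unrolling (a Python/NumPy sketch, not the patent's RTL; the int8-style operands and per-cycle tiling are assumptions):

```python
import numpy as np

P_CH, K = 32, 3   # channel unroll factor and kernel size from the text

def depthwise_tile(window, weights):
    """One 'cycle' of the unrolled datapath.
    window, weights: (P_CH, K, K) input patch and per-channel filters."""
    products = window * weights                        # 32 x 9 multipliers in parallel
    return products.reshape(P_CH, K * K).sum(axis=1)   # per-channel adder trees

window = np.random.randint(-128, 128, (P_CH, K, K)).astype(np.int32)
weights = np.random.randint(-128, 128, (P_CH, K, K)).astype(np.int32)
out = depthwise_tile(window, weights)                  # 32 outputs per cycle
```

Each call models one pipeline beat: all products are formed at once and reduced per channel, mirroring the replicated multiplier and adder-tree resources described above.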
Point-by-point convolution module (compatible with the fully connected layer): considering the FPGA resources and the structural features of the point-by-point convolution layers, the two dimensions of input-feature-map channel count and number of filter sets are unrolled with a parallelism of 32 x 32, and point-by-point convolution is computed in parallel by replicating resources such as multipliers and adder trees. In addition, pipelining is used to optimize the point-by-point convolution calculation: the operation is subdivided, cycle by cycle, into reading data, multiplication, accumulation, caching intermediate data, reading cached data and so on, so that every stage has continuous, mutually independent input and output in every cycle, realizing the pipeline design. The architecture design is shown in Fig. 3. When the row and column counts of a point-by-point convolution degenerate to 1, its operation coincides with that of the fully connected layer, so the fully connected layer can be understood as a point-by-point convolution layer whose input feature map has size 1 x 1 and can be computed with the point-by-point convolution module. This avoids redesigning a fully connected module and greatly reduces system resource consumption.
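A similar behavioural sketch of the 32 x 32 pointwise datapath, including its reuse for the fully connected layer (shapes and operand widths are assumptions):

```python
import numpy as np

P_IN, P_OUT = 32, 32   # input-channel and filter-set unroll factors

def pointwise_step(channels, weights):
    """channels: (P_IN,) one pixel's channel slice; weights: (P_OUT, P_IN).
    Emulates 32 x 32 multipliers feeding 32 adder trees in one cycle."""
    return weights @ channels          # (P_OUT,) partial sums

# The fully connected layer reuses the same datapath: its "feature map" is 1x1.
fc_in = np.random.randint(-128, 128, P_IN).astype(np.int32)
fc_w = np.random.randint(-128, 128, (P_OUT, P_IN)).astype(np.int32)
fc_out = pointwise_step(fc_in, fc_w)   # no separate fully connected module
```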
Average pooling module: considering the convolution module architecture and the storage unit design, and combining the operational characteristics of the average pooling layer, the two dimensions of input-feature-map channel count and row count are unrolled with a parallelism of 32 x 7, and average pooling is computed in parallel by replicating resources such as adder trees and dividers, greatly improving system throughput. In addition, pipelining is used to optimize the average pooling calculation: the process is subdivided, cycle by cycle, into reading data, accumulation, division, storing output data and so on, so that every stage has continuous input and output in every cycle. The architecture design is shown in Fig. 4.
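A behavioural sketch of the 32 x 7 pooling unroll (the 7 x 7 window matches MobileNet's final global average pool; everything else here is an assumption):

```python
import numpy as np

P_CH, ROWS, COLS = 32, 7, 7   # channel and row unroll over a 7x7 window

def avgpool_tile(feat):
    """feat: (P_CH, ROWS, COLS). Row-parallel accumulation, one shared divide."""
    row_sums = feat.sum(axis=2)      # 7 row accumulators per channel, in parallel
    totals = row_sums.sum(axis=1)    # final adder tree per channel
    return totals / (ROWS * COLS)    # divider stage

feat = np.random.rand(P_CH, ROWS, COLS).astype(np.float32)
pooled = avgpool_tile(feat)          # 32 channel averages per tile
```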
SoftMax module: the implementation of the SoftMax function is complex and difficult to implement by hardware, and considering that the major role of the SoftMax layer is probability mapping, whether the SoftMax function is calculated does not affect the classification result, so the SoftMax layer can simplify the design and utilize a comparator to compare the size.
4. Core control module
The core control module is mainly responsible for parsing the instructions sent by the Command Queue, controlling the corresponding modules to perform their operations, and distributing the data streams. The core control module is implemented as a state machine.
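A high-level software model of such a state machine (the state names and instruction fields are assumptions; the patent only specifies that parsing and dispatch are state-machine based):

```python
from enum import Enum, auto

class State(Enum):
    FETCH = auto()    # pop the next instruction from the Command Queue
    DECODE = auto()   # extract op mode, layer config, buffer addresses
    EXECUTE = auto()  # drive the selected compute module
    DONE = auto()

def command_analyzer(command_queue, modules):
    state, instr = State.FETCH, None
    while state is not State.DONE:
        if state is State.FETCH:
            instr = command_queue.pop(0) if command_queue else None
            state = State.DECODE if instr else State.DONE
        elif state is State.DECODE:
            state = State.EXECUTE        # fields assumed pre-parsed in instr
        elif state is State.EXECUTE:
            modules[instr["op"]](instr["config"])
            state = State.FETCH

# e.g. command_analyzer([{"op": "dwconv", "config": cfg}], {"dwconv": run_dwconv})
```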
The work flow of the MobileNet hardware acceleration system provided by the invention is shown in Fig. 5 (a host-side sketch follows the steps):
1) configure the DDR according to the model training result, importing the pictures, weights, configuration and other related data;
2) initialize the Command Queue and load the instruction information into the on-chip BRAM;
3) load the pictures, weights, quantization parameters and other data into the on-chip BRAM;
4) the Command Analyzer controls each computing module to operate as required by the instruction information;
5) the work finishes after all instructions have been executed.
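A host-side Python sketch of these five steps; every function here is a placeholder stub (an assumption) standing in for the platform's real DMA/AXI driver calls:

```python
ddr, bram, queue = {}, {}, []   # toy stand-ins for off-chip DDR and on-chip BRAM

def configure_ddr(image, weights, configs):          # step 1
    ddr.update(image=image, weights=weights, configs=configs)

def init_command_queue(instrs):                      # step 2
    queue.extend(instrs)
    bram["instrs"] = list(instrs)

def load_bram():                                     # step 3
    bram.update(ddr)

def run_accelerator(modules):                        # step 4: Command Analyzer
    while queue:                                     # step 5: stop when drained
        instr = queue.pop(0)
        modules[instr["op"]](instr)                  # dispatch to a module
```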
According to the characteristics of the MobileNet network model, the invention makes full use of the parallelism and reconfigurability of the FPGA platform, customizing the design of the core network layers (the depthwise separable convolution layers) together with the fully connected layer, the average pooling layer and the standard convolution layer to improve computing performance, while keeping parameters configurable in each module so that the modules and the system retain a degree of extensibility and generality. In addition, the invention optimizes each module of the accelerator with parallel unrolling and pipelining, in combination with the network characteristics, so as to improve system throughput.
The above are preferred embodiments of the present invention; all changes made according to the technical scheme of the present invention that produce equivalent functional effects, without exceeding the scope of the technical scheme, belong to the protection scope of the present invention.

Claims (10)

1. A MobileNet hardware acceleration system based on an FPGA, characterized by comprising a PL end, a CPU end, a communication module and a storage module, wherein the PL end is responsible for the accelerated implementation of the MobileNet network, and the CPU end is responsible for overall task coordination and instruction issuing;
the PL end comprises a core control module and operation modules connected with the core control module;
the communication module is used for realizing data transmission between the PL end and the CPU end as well as between the PL end and the storage module;
the storage module is used for cooperatively storing the PL-end data.
2. The FPGA-based MobileNet hardware acceleration system of claim 1, wherein the PL end transmits input and output data to and from the storage module by configuring direct memory access (DMA), and the instructions are stored in BRAM at the PL end.
3. The FPGA-based MobileNet hardware acceleration system according to claim 1, wherein the core control module is a Command Analyzer responsible for parsing the instructions sent by the Command Queue and outputting the corresponding control signals to control the operation of each operation module; the Command Queue interacts with the CPU end through the communication module, and the core control module is implemented as a state machine.
4. The FPGA-based MobileNet hardware acceleration system according to claim 1, wherein the operation modules are respectively a depthwise convolution module, a point-by-point convolution module, a SoftMax module and an average pooling module; each operation module is controlled by the core control module, reads an input feature map from the input buffer, performs the corresponding calculation using DSP resources, and caches intermediate data in the output buffer; after the calculation is completed, the data is quantized, activated, and finally stored in the input buffer.
5. The FPGA-based MobileNet hardware acceleration system of claim 1, wherein the storage module is an off-chip DDR memory.
6. The FPGA-based MobileNet hardware acceleration system of claim 1, wherein the communication module employs AXI4 and AXI4-Lite bus.
7. The FPGA-based MobileNet hardware acceleration system according to claim 4, wherein the depthwise convolution module is compatible with standard convolution and is implemented as follows: the MobileNet network contains one standard convolution layer and thirteen depthwise convolution layers, fourteen such network layers in total; considering the FPGA resources and the structure of the MobileNet network, the two dimensions of input-feature-map channel count and kernel size are unrolled with a parallelism of 32 x 18, and depthwise convolution is computed in parallel by replicating resources including multipliers and adder trees; in addition, pipelining is used to optimize the depthwise convolution calculation, subdividing the operation, cycle by cycle, into reading data, multiplication, accumulation, caching intermediate data and reading cached data, so that every stage has continuous input and output in every cycle.
8. The FPGA-based MobileNet hardware acceleration system according to claim 4, wherein the point-by-point convolution module is compatible with the fully connected layer and is implemented as follows: considering the FPGA resources and the structural features of the point-by-point convolution layers, the two dimensions of input-feature-map channel count and number of filter sets are unrolled with a parallelism of 32 x 32, and point-by-point convolution is computed in parallel by replicating resources including multipliers and adder trees; in addition, pipelining is used to optimize the point-by-point convolution calculation, subdividing the operation, cycle by cycle, into reading data, multiplication, accumulation, caching intermediate data and reading cached data, so that every stage has continuous, mutually independent input and output in every cycle, realizing the pipeline design; when the row and column counts of a point-by-point convolution degenerate to 1, its operation coincides with that of the fully connected layer, so the fully connected layer is a point-by-point convolution layer whose input feature map has size 1 x 1.
9. The FPGA-based MobileNet hardware acceleration system according to claim 4, wherein the average pooling module is implemented as follows: considering the convolution module architecture and the storage module design, and combining the operational characteristics of the average pooling layer, the two dimensions of input-feature-map channel count and row count are unrolled with a parallelism of 32 x 7, and average pooling is computed in parallel by replicating resources including adder trees and dividers; in addition, pipelining is used to optimize the average pooling calculation, subdividing the process, cycle by cycle, into reading data, accumulation, division and storing output data, so that every stage has continuous input and output in every cycle.
10. The FPGA-based MobileNet hardware acceleration system according to claim 4, wherein the SoftMax module is implemented as follows: since the main role of the SoftMax layer is probability mapping, whether the SoftMax function is calculated does not affect the classification result, so the SoftMax layer uses a comparator to find the maximum.
CN202210675284.9A 2022-06-15 2022-06-15 MobileNet hardware acceleration system based on FPGA Pending CN114911628A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210675284.9A CN114911628A (en) 2022-06-15 2022-06-15 MobileNet hardware acceleration system based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210675284.9A CN114911628A (en) 2022-06-15 2022-06-15 MobileNet hardware acceleration system based on FPGA

Publications (1)

Publication Number Publication Date
CN114911628A true CN114911628A (en) 2022-08-16

Family

ID=82770489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210675284.9A Pending CN114911628A (en) 2022-06-15 2022-06-15 MobileNet hardware acceleration system based on FPGA

Country Status (1)

Country Link
CN (1) CN114911628A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020258529A1 (en) * 2019-06-28 2020-12-30 东南大学 Bnrp-based configurable parallel general convolutional neural network accelerator
CN111178518A (en) * 2019-12-24 2020-05-19 杭州电子科技大学 Software and hardware cooperative acceleration method based on FPGA
CN111488983A (en) * 2020-03-24 2020-08-04 哈尔滨工业大学 Lightweight CNN model calculation accelerator based on FPGA
CN114154630A (en) * 2021-11-23 2022-03-08 北京理工大学 Hardware accelerator for quantifying MobileNet and design method thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Lin Yuxuan (林宇轩): "Research on a Real-Time Video Dehazing System Based on FPGA", China Excellent Master's Theses Electronic Journal, Information Science and Technology Series, 15 February 2020 (2020-02-15) *
Xie Sipu (谢思璞) et al.: "FPGA Design and Optimization of Multi-Branch Convolutional Neural Networks", Embedded Technology, vol. 47, no. 7, 6 July 2021 (2021-07-06), pages 97-101 *

Similar Documents

Publication Publication Date Title
CN109447241B (en) Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things
CN111967468B (en) Implementation method of lightweight target detection neural network based on FPGA
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
CN109284817B (en) Deep separable convolutional neural network processing architecture/method/system and medium
CN111178518A (en) Software and hardware cooperative acceleration method based on FPGA
CN108665063B (en) Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator
CN107239824A (en) Apparatus and method for realizing sparse convolution neutral net accelerator
CN110222818B (en) Multi-bank row-column interleaving read-write method for convolutional neural network data storage
CN111898733A (en) Deep separable convolutional neural network accelerator architecture
CN111079923B (en) Spark convolutional neural network system suitable for edge computing platform and circuit thereof
CN113222133B (en) FPGA-based compressed LSTM accelerator and acceleration method
CN112633490B (en) Data processing device, method and related product for executing neural network model
CN113792621B (en) FPGA-based target detection accelerator design method
CN110991630A (en) Convolutional neural network processor for edge calculation
CN111860773B (en) Processing apparatus and method for information processing
CN111210019A (en) Neural network inference method based on software and hardware cooperative acceleration
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN111340198A (en) Neural network accelerator with highly-multiplexed data based on FPGA (field programmable Gate array)
CN115423081A (en) Neural network accelerator based on CNN _ LSTM algorithm of FPGA
CN113033794A (en) Lightweight neural network hardware accelerator based on deep separable convolution
CN109740619B (en) Neural network terminal operation method and device for target recognition
CN114005458A (en) Voice noise reduction method and system based on pipeline architecture and storage medium
CN117574970A (en) Inference acceleration method, system, terminal and medium for large-scale language model
CN114911628A (en) MobileNet hardware acceleration system based on FPGA
CN116822600A (en) Neural network search chip based on RISC-V architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination