CN114911628A - MobileNet hardware acceleration system based on FPGA - Google Patents

MobileNet hardware acceleration system based on FPGA

Info

Publication number
CN114911628A
CN114911628A
Authority
CN
China
Prior art keywords
module
mobilenet
point
convolution
fpga
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210675284.9A
Other languages
Chinese (zh)
Inventor
魏榕山
林宇轩
陈标发
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202210675284.9A
Publication of CN114911628A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources to service a request
    • G06F 9/5027 - Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention relates to a MobileNet hardware acceleration system based on an FPGA. The system comprises a PL end, a CPU end, a communication module and a storage module, wherein the PL end is responsible for accelerating the MobileNet network, and the CPU end is responsible for overall task coordination and instruction issuing; the PL end comprises a core control module and operation modules connected with the core control module; the communication module realizes data transmission between the PL end and the CPU end and between the PL end and the storage module; and the storage module cooperatively stores the PL-end data. The invention enables the network to fully exploit its parallel unrolling degree during inference, improving resource utilization and system throughput.

Description

MobileNet hardware acceleration system based on FPGA
Technical Field
The invention relates to a MobileNet hardware acceleration system based on an FPGA.
Background
With the rapid development of artificial intelligence, and of deep learning in particular, the continuous pursuit of model accuracy has pushed neural network models toward ever greater depth and structural complexity. As a result, the computation and parameter volumes of most existing neural networks make them difficult to deploy in real scenarios such as mobile terminals or edge devices. Applying neural networks in such scenarios faces two main problems: storage and speed. Mobile terminals and edge devices are constrained by cost and power consumption, typically have limited resources, and can hardly store such huge model parameters. Many scenarios also place strict demands on real-time performance, notably military reconnaissance and automotive driver-assistance systems, and the huge computational load of deep neural networks makes them hard to apply there. Research into compact and efficient CNN models is therefore important for these application scenarios.
The emergence and development of lightweight neural networks have cleared the obstacles to deploying neural networks on mobile terminals and edge devices. MobileNet is a new generation of mobile-end convolutional neural network model proposed by Google. Its core idea is to replace standard convolution with depthwise separable convolution; compared with a network built from standard convolution layers, the parameter count and computation can be reduced to roughly one ninth of the original. Although a lightweight neural network sacrifices a little overall accuracy, it greatly relaxes the storage and computing requirements placed on the application scenario.
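The roughly one-ninth figure can be checked with a quick count of multiply-accumulate (MAC) operations. The Python sketch below is illustrative only; the layer shape is an assumption, not a figure from the patent:

```python
# Compare the MAC cost of a standard 3x3 convolution with a depthwise
# separable one (depthwise 3x3 followed by pointwise 1x1).

def standard_conv_macs(h, w, cin, cout, k=3):
    # Each output pixel needs k*k*cin MACs for each of the cout filters.
    return h * w * cout * k * k * cin

def depthwise_separable_macs(h, w, cin, cout, k=3):
    depthwise = h * w * cin * k * k   # one k x k filter per input channel
    pointwise = h * w * cout * cin    # 1x1 convolution across channels
    return depthwise + pointwise

h, w, cin, cout = 112, 112, 64, 128   # illustrative MobileNet-like layer
std = standard_conv_macs(h, w, cin, cout)
dws = depthwise_separable_macs(h, w, cin, cout)
print(f"reduction factor: {std / dws:.1f}x")   # ~8.4x for this shape
```

For a 3 x 3 kernel the theoretical ratio is 1/N + 1/9, where N is the number of output channels, which approaches one ninth as N grows.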
Meanwhile, with the rapid growth of mobile internet applications and the explosive expansion of the data volumes to be computed, a general-purpose processor (CPU) at the edge device can no longer meet the performance demands of energy-efficient computing and diversified applications. The graphics processing unit (GPU) offers excellent computational performance, but its cost and power consumption have confined it largely to training and testing in the early stages of research, and it is hard to deploy on edge devices. An ASIC offers the best performance and power consumption but the worst flexibility, struggling to keep pace with the fast iteration of neural networks; ASIC solutions are also costly and need large shipment volumes to amortize that cost. The reconfigurable, high-performance, low-power FPGA is therefore the optimal choice.
In conclusion, high-performance hardware acceleration built on the programmability of the FPGA has great advantages in target detection applications. Accordingly, to meet the speed, power and resource requirements of practical application scenarios, the invention designs a MobileNet hardware implementation method based on FPGA acceleration.
Disclosure of Invention
The invention aims to provide a MobileNet hardware acceleration system based on an FPGA (field programmable gate array) that lets the network fully exploit its parallel unrolling degree during inference, improving resource utilization and system throughput.
In order to achieve the above purpose, the technical scheme of the invention is as follows: a MobileNet hardware acceleration system based on an FPGA comprises a PL end, a CPU end, a communication module and a storage module, wherein the PL end is responsible for the accelerated implementation of the MobileNet network, and the CPU end is responsible for overall task coordination and instruction issuing;
the PL end comprises a core control module and operation modules connected with the core control module;
the communication module is used for realizing data transmission between the PL end and the CPU end as well as between the PL end and the storage module;
the storage module is used for cooperatively storing the PL-end data.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention provides a MobileNet hardware accelerator architecture together with a parallel unrolling strategy adapted to the characteristics of each network layer, so that the network can fully exploit its parallel unrolling degree during inference, improving resource utilization and system throughput.
2. The invention optimizes each module of the accelerator with parallel unrolling and pipelining, in combination with the network characteristics, so as to improve system throughput.
Drawings
FIG. 1 is a MobileNet hardware acceleration system architecture.
Fig. 2 is a design diagram of the depthwise convolution module (compatible with standard convolution) architecture.
Fig. 3 is a design diagram of the point-by-point convolution module (compatible with the fully connected layer) architecture.
FIG. 4 is a diagram of an average pooling layer architecture design.
Fig. 5 is a flow chart of the operation of the MobileNet hardware acceleration system.
Detailed Description
The technical scheme of the invention is specifically explained below with reference to the accompanying drawings.
The architecture of the MobileNet hardware acceleration system provided by the invention is shown in Fig. 1. The CPU end is responsible for overall task coordination and instruction issuing, while the PL end is responsible for the accelerated implementation of the MobileNet network. Because on-chip storage resources are limited, an off-chip DDR memory cooperatively stores the data, and the PL end transmits input and output data by configuring Direct Memory Access (DMA). The instructions, which include the operation mode, configuration parameters, storage locations and so on, are stored in BRAM. The Command Analyzer serves as the core control module, parsing the instructions and outputting the corresponding control signals. The computing units, namely the depthwise convolution module, point-by-point convolution module, SoftMax module, fully connected module and average pooling module, are controlled by the Command Analyzer: each reads an input feature map from the input buffer, performs the corresponding calculation using DSP resources, and caches intermediate data in the output buffer; after the calculation finishes, the data is quantized and activated and finally written back to the input buffer.
The system block diagram of the invention is shown in Fig. 1. The system mainly comprises a storage module, a communication module, the operation modules and a core control module; the structure and function of each part are as follows:
1. memory module
Because the BRAM on the FPGA chip offers limited storage and most practical application scenarios require communication with external storage or interfaces, the design uses an off-chip DRAM to store the input data, related configuration, weights and quantization parameters; once the accelerator starts working, the corresponding data is read into on-chip BRAM to speed up subsequent reads. The MobileNet network model parameters occupy 3852.4 kB, and together with the quantization parameters, configuration parameters and other data the total storage requirement is under 4 MB. Since the ZCU102 development board adopted by the invention provides 32.1 Mb of BRAM, after the initial data is imported from DDR, all data read during intermediate calculation and all computed results are kept in BRAM without further interaction with the off-chip DDR, which greatly improves system speed and performance.
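As a quick sanity check of these figures (a sketch using only the numbers quoted above, not additional data from the patent):

```python
# Storage budget check using the figures quoted in the description.
bram_capacity_MB = 32.1 / 8        # ZCU102 BRAM: 32.1 Mb ~= 4.01 MB
weights_MB = 3852.4 / 1024         # model parameters: ~3.76 MB
headroom_MB = bram_capacity_MB - weights_MB
print(f"capacity {bram_capacity_MB:.2f} MB, weights {weights_MB:.2f} MB, "
      f"headroom {headroom_MB:.2f} MB")   # ~0.25 MB left for quantization/config data
```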
2. Communication module
The communication module connects PL and PS and transfers data between them. The communication module of the system adopts the AXI4 and AXI4-Lite buses.
3. Each operation module
The operation modules comprise a point-by-point convolution module, a depthwise convolution module, an average pooling module and a SoftMax module; their specific structures and functions are as follows.
Depthwise convolution module (compatible with standard convolution): the MobileNet network contains one standard convolution layer and thirteen depthwise convolution layers, fourteen such network layers in total. Since standard convolution and depthwise convolution are quite similar and the network contains only a single standard convolution layer, the module is designed mainly around the characteristics of depthwise convolution, with standard convolution supported for compatibility. Considering the FPGA resources and the structure of the MobileNet network, the two dimensions of input-feature-map channel count and kernel size are unrolled with a parallelism of 32 x 18 (18 = 3 x 3 x 2), and depthwise convolution is computed in parallel by replicating resources such as multipliers and adder trees. In addition, pipelining is used to optimize the depthwise convolution calculation: the operation is subdivided, cycle by cycle, into reading data, multiplication, accumulation, caching intermediate data, reading cached data and so on, so that every stage has continuous input and output in every cycle. The architecture design is shown in Fig. 2.
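As a rough behavioural illustration of this unrolling (a Python/NumPy sketch, not the patent's RTL; the int8-style operands and per-cycle tiling are assumptions):

```python
import numpy as np

P_CH, K = 32, 3   # channel unroll factor and kernel size from the text

def depthwise_tile(window, weights):
    """One 'cycle' of the unrolled datapath.
    window, weights: (P_CH, K, K) input patch and per-channel filters."""
    products = window * weights                        # 32 x 9 multipliers in parallel
    return products.reshape(P_CH, K * K).sum(axis=1)   # per-channel adder trees

window = np.random.randint(-128, 128, (P_CH, K, K)).astype(np.int32)
weights = np.random.randint(-128, 128, (P_CH, K, K)).astype(np.int32)
out = depthwise_tile(window, weights)                  # 32 outputs per cycle
```

Each call models one pipeline beat: all products are formed at once and reduced per channel, mirroring the replicated multiplier and adder-tree resources described above.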
Point-by-point convolution module (compatible with the fully connected layer): considering the FPGA resources and the structural features of the point-by-point convolution layers, the two dimensions of input-feature-map channel count and number of filter sets are unrolled with a parallelism of 32 x 32, and point-by-point convolution is computed in parallel by replicating resources such as multipliers and adder trees. In addition, pipelining is used to optimize the point-by-point convolution calculation: the operation is subdivided, cycle by cycle, into reading data, multiplication, accumulation, caching intermediate data, reading cached data and so on, so that every stage has continuous, mutually independent input and output in every cycle, realizing the pipeline design. The architecture design is shown in Fig. 3. When the row and column counts of a point-by-point convolution degenerate to 1, its operation coincides with that of the fully connected layer, so the fully connected layer can be understood as a point-by-point convolution layer whose input feature map has size 1 x 1 and can be computed with the point-by-point convolution module. This avoids redesigning a fully connected module and greatly reduces system resource consumption.
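A similar behavioural sketch of the 32 x 32 pointwise datapath, including its reuse for the fully connected layer (shapes and operand widths are assumptions):

```python
import numpy as np

P_IN, P_OUT = 32, 32   # input-channel and filter-set unroll factors

def pointwise_step(channels, weights):
    """channels: (P_IN,) one pixel's channel slice; weights: (P_OUT, P_IN).
    Emulates 32 x 32 multipliers feeding 32 adder trees in one cycle."""
    return weights @ channels          # (P_OUT,) partial sums

# The fully connected layer reuses the same datapath: its "feature map" is 1x1.
fc_in = np.random.randint(-128, 128, P_IN).astype(np.int32)
fc_w = np.random.randint(-128, 128, (P_OUT, P_IN)).astype(np.int32)
fc_out = pointwise_step(fc_in, fc_w)   # no separate fully connected module
```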
Average pooling module: considering the convolution module architecture and the storage unit design, and combining the operational characteristics of the average pooling layer, the two dimensions of input-feature-map channel count and row count are unrolled with a parallelism of 32 x 7, and average pooling is computed in parallel by replicating resources such as adder trees and dividers, greatly improving system throughput. In addition, pipelining is used to optimize the average pooling calculation: the process is subdivided, cycle by cycle, into reading data, accumulation, division, storing output data and so on, so that every stage has continuous input and output in every cycle. The architecture design is shown in Fig. 4.
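A behavioural sketch of the 32 x 7 pooling unroll (the 7 x 7 window matches MobileNet's final global average pool; everything else here is an assumption):

```python
import numpy as np

P_CH, ROWS, COLS = 32, 7, 7   # channel and row unroll over a 7x7 window

def avgpool_tile(feat):
    """feat: (P_CH, ROWS, COLS). Row-parallel accumulation, one shared divide."""
    row_sums = feat.sum(axis=2)      # 7 row accumulators per channel, in parallel
    totals = row_sums.sum(axis=1)    # final adder tree per channel
    return totals / (ROWS * COLS)    # divider stage

feat = np.random.rand(P_CH, ROWS, COLS).astype(np.float32)
pooled = avgpool_tile(feat)          # 32 channel averages per tile
```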
SoftMax module: the implementation of the SoftMax function is complex and difficult to implement by hardware, and considering that the major role of the SoftMax layer is probability mapping, whether the SoftMax function is calculated does not affect the classification result, so the SoftMax layer can simplify the design and utilize a comparator to compare the size.
4. Core control module
The core control module is mainly responsible for parsing the instructions sent by the Command Queue, controlling the corresponding modules to perform their operations, and distributing the data streams. The core control module is implemented as a state machine.
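A high-level software model of such a state machine (the state names and instruction fields are assumptions; the patent only specifies that parsing and dispatch are state-machine based):

```python
from enum import Enum, auto

class State(Enum):
    FETCH = auto()    # pop the next instruction from the Command Queue
    DECODE = auto()   # extract op mode, layer config, buffer addresses
    EXECUTE = auto()  # drive the selected compute module
    DONE = auto()

def command_analyzer(command_queue, modules):
    state, instr = State.FETCH, None
    while state is not State.DONE:
        if state is State.FETCH:
            instr = command_queue.pop(0) if command_queue else None
            state = State.DECODE if instr else State.DONE
        elif state is State.DECODE:
            state = State.EXECUTE        # fields assumed pre-parsed in instr
        elif state is State.EXECUTE:
            modules[instr["op"]](instr["config"])
            state = State.FETCH

# e.g. command_analyzer([{"op": "dwconv", "config": cfg}], {"dwconv": run_dwconv})
```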
The work flow of the MobileNet hardware acceleration system provided by the invention is shown in Fig. 5 (a host-side sketch follows the steps):
1) configure the DDR according to the model training result, importing the pictures, weights, configuration and other related data;
2) initialize the Command Queue and load the instruction information into the on-chip BRAM;
3) load the pictures, weights, quantization parameters and other data into the on-chip BRAM;
4) the Command Analyzer controls each computing module to operate as required by the instruction information;
5) the work finishes after all instructions have been executed.
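A host-side Python sketch of these five steps; every function here is a placeholder stub (an assumption) standing in for the platform's real DMA/AXI driver calls:

```python
ddr, bram, queue = {}, {}, []   # toy stand-ins for off-chip DDR and on-chip BRAM

def configure_ddr(image, weights, configs):          # step 1
    ddr.update(image=image, weights=weights, configs=configs)

def init_command_queue(instrs):                      # step 2
    queue.extend(instrs)
    bram["instrs"] = list(instrs)

def load_bram():                                     # step 3
    bram.update(ddr)

def run_accelerator(modules):                        # step 4: Command Analyzer
    while queue:                                     # step 5: stop when drained
        instr = queue.pop(0)
        modules[instr["op"]](instr)                  # dispatch to a module
```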
According to the characteristics of the MobileNet network model, the invention makes full use of the parallelism and reconfigurability of the FPGA platform, customizing the design of the core network layers (the depthwise separable convolution layers) together with the fully connected layer, the average pooling layer and the standard convolution layer to improve computing performance, while keeping parameters configurable in each module so that the modules and the system retain a degree of extensibility and generality. In addition, the invention optimizes each module of the accelerator with parallel unrolling and pipelining, in combination with the network characteristics, so as to improve system throughput.
The above are preferred embodiments of the present invention; all changes made according to the technical scheme of the present invention that produce equivalent functional effects, without exceeding the scope of the technical scheme, belong to the protection scope of the present invention.

Claims (10)

1. A MobileNet hardware acceleration system based on an FPGA, characterized by comprising a PL end, a CPU end, a communication module and a storage module, wherein the PL end is responsible for the accelerated implementation of the MobileNet network, and the CPU end is responsible for overall task coordination and instruction issuing;
the PL end comprises a core control module and operation modules connected with the core control module;
the communication module is used for realizing data transmission between the PL end and the CPU end as well as between the PL end and the storage module;
the storage module is used for cooperatively storing the PL-end data.
2. The FPGA-based MobileNet hardware acceleration system of claim 1, wherein the PL end transmits input and output data to and from the storage module by configuring direct memory access (DMA), and the instructions are stored in BRAM at the PL end.
3. The FPGA-based MobileNet hardware acceleration system according to claim 1, wherein the core control module is a Command Analyzer responsible for parsing the instructions sent by the Command Queue and outputting the corresponding control signals to control the operation of each operation module; the Command Queue interacts with the CPU end through the communication module, and the core control module is implemented as a state machine.
4. The FPGA-based MobileNet hardware acceleration system according to claim 1, wherein the operation modules are respectively a depthwise convolution module, a point-by-point convolution module, a SoftMax module and an average pooling module; each operation module is controlled by the core control module, reads an input feature map from the input buffer, performs the corresponding calculation using DSP resources, and caches intermediate data in the output buffer; after the calculation is completed, the data is quantized, activated, and finally stored in the input buffer.
5. The FPGA-based MobileNet hardware acceleration system of claim 1, wherein the storage module is an off-chip DDR memory.
6. The FPGA-based MobileNet hardware acceleration system of claim 1, wherein the communication module employs AXI4 and AXI4-Lite bus.
7. The FPGA-based MobileNet hardware acceleration system according to claim 4, wherein the depthwise convolution module is compatible with standard convolution and is implemented as follows: the MobileNet network contains one standard convolution layer and thirteen depthwise convolution layers, fourteen such network layers in total; considering the FPGA resources and the structure of the MobileNet network, the two dimensions of input-feature-map channel count and kernel size are unrolled with a parallelism of 32 x 18, and depthwise convolution is computed in parallel by replicating resources including multipliers and adder trees; in addition, pipelining is used to optimize the depthwise convolution calculation, subdividing the operation, cycle by cycle, into reading data, multiplication, accumulation, caching intermediate data and reading cached data, so that every stage has continuous input and output in every cycle.
8. The FPGA-based MobileNet hardware acceleration system according to claim 4, wherein the point-by-point convolution module is compatible with the fully connected layer and is implemented as follows: considering the FPGA resources and the structural features of the point-by-point convolution layers, the two dimensions of input-feature-map channel count and number of filter sets are unrolled with a parallelism of 32 x 32, and point-by-point convolution is computed in parallel by replicating resources including multipliers and adder trees; in addition, pipelining is used to optimize the point-by-point convolution calculation, subdividing the operation, cycle by cycle, into reading data, multiplication, accumulation, caching intermediate data and reading cached data, so that every stage has continuous, mutually independent input and output in every cycle, realizing the pipeline design; when the row and column counts of a point-by-point convolution degenerate to 1, its operation coincides with that of the fully connected layer, so the fully connected layer is a point-by-point convolution layer whose input feature map has size 1 x 1.
9. The FPGA-based MobileNet hardware acceleration system according to claim 4, wherein the average pooling module is implemented as follows: considering the convolution module architecture and the storage module design, and combining the operational characteristics of the average pooling layer, the two dimensions of input-feature-map channel count and row count are unrolled with a parallelism of 32 x 7, and average pooling is computed in parallel by replicating resources including adder trees and dividers; in addition, pipelining is used to optimize the average pooling calculation, subdividing the process, cycle by cycle, into reading data, accumulation, division and storing output data, so that every stage has continuous input and output in every cycle.
10. The FPGA-based MobileNet hardware acceleration system according to claim 4, wherein the SoftMax module is implemented as follows: since the main role of the SoftMax layer is probability mapping, whether the SoftMax function is calculated does not affect the classification result, so the SoftMax layer uses a comparator to find the maximum.
CN202210675284.9A 2022-06-15 2022-06-15 MobileNet hardware acceleration system based on FPGA Pending CN114911628A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210675284.9A CN114911628A (en) 2022-06-15 2022-06-15 MobileNet hardware acceleration system based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210675284.9A CN114911628A (en) 2022-06-15 2022-06-15 MobileNet hardware acceleration system based on FPGA

Publications (1)

Publication Number Publication Date
CN114911628A true CN114911628A (en) 2022-08-16

Family

ID=82770489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210675284.9A Pending CN114911628A (en) 2022-06-15 2022-06-15 MobileNet hardware acceleration system based on FPGA

Country Status (1)

Country Link
CN (1) CN114911628A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020258529A1 (en) * 2019-06-28 2020-12-30 东南大学 Bnrp-based configurable parallel general convolutional neural network accelerator
CN111178518A (en) * 2019-12-24 2020-05-19 杭州电子科技大学 Software and hardware cooperative acceleration method based on FPGA
CN111488983A (en) * 2020-03-24 2020-08-04 哈尔滨工业大学 Lightweight CNN model calculation accelerator based on FPGA
CN114154630A (en) * 2021-11-23 2022-03-08 北京理工大学 Hardware accelerator for quantifying MobileNet and design method thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Lin Yuxuan (林宇轩): "Research on a Real-Time Video Dehazing System Based on FPGA", China Excellent Master's Theses Electronic Journal, Information Science and Technology Series, 15 February 2020 (2020-02-15) *
Xie Sipu (谢思璞) et al.: "FPGA Design and Optimization of Multi-Branch Convolutional Neural Networks", Embedded Technology, vol. 47, no. 7, 6 July 2021 (2021-07-06), pages 97-101 *

Similar Documents

Publication Publication Date Title
CN109447241B (en) Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things
CN111967468B (en) Implementation method of lightweight target detection neural network based on FPGA
CN108805266B (en) Reconfigurable CNN high-concurrency convolution accelerator
CN109284817B (en) Deep separable convolutional neural network processing architecture/method/system and medium
CN111178518A (en) Software and hardware cooperative acceleration method based on FPGA
CN108665063B (en) Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator
CN107239824A (en) Apparatus and method for realizing sparse convolution neutral net accelerator
CN110222818B (en) Multi-bank row-column interleaving read-write method for convolutional neural network data storage
CN111898733A (en) Deep separable convolutional neural network accelerator architecture
CN111079923B (en) Spark convolutional neural network system suitable for edge computing platform and circuit thereof
CN113222133B (en) FPGA-based compressed LSTM accelerator and acceleration method
CN112633490B (en) Data processing device, method and related product for executing neural network model
CN113792621B (en) FPGA-based target detection accelerator design method
CN110991630A (en) Convolutional neural network processor for edge calculation
CN111860773B (en) Processing apparatus and method for information processing
CN111210019A (en) Neural network inference method based on software and hardware cooperative acceleration
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN111340198A (en) Neural network accelerator with highly-multiplexed data based on FPGA (field programmable Gate array)
CN115423081A (en) Neural network accelerator based on CNN _ LSTM algorithm of FPGA
CN113033794A (en) Lightweight neural network hardware accelerator based on deep separable convolution
CN109740619B (en) Neural network terminal operation method and device for target recognition
CN114005458A (en) Voice noise reduction method and system based on pipeline architecture and storage medium
CN117574970A (en) Inference acceleration method, system, terminal and medium for large-scale language model
CN114911628A (en) MobileNet hardware acceleration system based on FPGA
CN116822600A (en) Neural network search chip based on RISC-V architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination