CN111178518A - Software and hardware cooperative acceleration method based on FPGA - Google Patents

Software and hardware cooperative acceleration method based on FPGA

Info

Publication number
CN111178518A
Authority
CN
China
Prior art keywords
data
convolution
module
neural network
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911350336.XA
Other languages
Chinese (zh)
Inventor
颜成钢
李扬
刘炳涛
孙垚棋
张继勇
张勇东
沈韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201911350336.XA priority Critical patent/CN111178518A/en
Publication of CN111178518A publication Critical patent/CN111178518A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Advance Control (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a software and hardware cooperative acceleration method based on an FPGA (field programmable gate array). The method first compresses the network parameters of a deep learning model by data quantization, and then feeds the quantized fixed-point data into a neural network accelerator for processing; the neural network accelerator comprises an AXI4 bus interface, a convolution calculation module, a data cache module and a data processing module. The software part compresses the neural network model, while the hardware part is a purpose-built hardware architecture (the neural network accelerator); the amount of computation is reduced, high parallelism is exploited for effective acceleration, and the number of memory accesses is reduced to lower the hardware energy consumption. By exploiting the runtime information and the algorithm structure of the convolution calculation, the invention reduces redundant computation and redundant reads of parameter data, accelerates neural network inference on the FPGA hardware platform, improves the real-time performance of the DCNN, achieves higher computing performance and reduces energy consumption.

Description

Software and hardware cooperative acceleration method based on FPGA
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a software and hardware cooperative acceleration method based on an FPGA.
Background
The neural network is an artificial intelligence and machine learning technology; the deep convolutional neural network in particular has received wide attention and has achieved remarkable results in speech recognition, natural language processing and intelligent image processing, especially image recognition. However, the amount of computation of a commonly used network model reaches the order of one billion operations, and the number of parameters reaches the order of hundreds of millions, so that whether a convolutional neural network is being trained or used for recognition, a high-performance GPU, large-capacity storage or a high-power server cluster is generally required to provide computation and storage support. With the popularization of intelligent devices, embedded devices have ever higher requirements for fast and accurate image recognition. For embedded devices with tight resources and sensitive power budgets, the huge amounts of computation and parameters make implementing a convolutional neural network a severe challenge. Therefore, with its powerful parallel capability, flexible design methodology and high performance-to-power ratio, the FPGA has become one of the most attractive platforms for hardware acceleration of convolutional neural networks in embedded devices. The invention provides a model compression and hardware acceleration method.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a software and hardware cooperative acceleration method based on an FPGA (field programmable gate array). The software part compresses the neural network model, while the hardware part is a purpose-built hardware architecture (a neural network accelerator); the amount of computation is reduced, high parallelism is exploited for effective acceleration, and the number of memory accesses is reduced to lower the hardware energy consumption.
A software and hardware cooperative acceleration method based on an FPGA comprises the following specific steps:
Step (1), compressing the network parameters of the deep learning model by data quantization.
First, the value range of the network parameters is analyzed, and the bit width required for the fixed-point data is determined.
The original floating-point data are recorded, quantized data are computed for several candidate precisions and compared with the original floating-point data, the errors between them are accumulated, and the precision that gives the minimum error is selected as the candidate precision, according to the following formula:
2^(w-p-1) > max(|D_max| × 2^p, |D_min| × 2^p)   (1)
In the formula, p is the quantization precision, w is the quantization bit width, and D denotes the floating-point data before quantization, D_max and D_min being its maximum and minimum values.
Using the obtained candidate precision and bit width, the original floating-point data are replaced with the corresponding quantized data; through testing, a group of quantized data whose accuracy loss relative to the original network is small is selected as the fixed-point data used by the hardware; finally, the obtained fixed-point data are stored in an off-chip storage device, completing the compression of the deep learning model network parameters.
Step (2), designing a neural network accelerator;
the neural network accelerator comprises an AXI4 bus interface, a convolution calculation module, a data cache module and a data processing module.
The AXI4 bus interface is a high-performance, address-mapped bus interface based on the AXI bus protocol. It is a general-purpose bus interface, so the accelerator can be attached through it to any bus device that uses the AXI4 protocol. In accordance with the AXI bus protocol, the accelerator and the processing system (PS) of the FPGA use the valid/ready handshake mechanism, which guarantees the correctness of data and command transmission.
The data cache module comprises a data cache region to be calculated, a convolution result buffer and an output result buffer; the convolution result buffer adopts a dual BRAM buffer structure, and the output result buffer is implemented as a first-in first-out (FIFO) queue.
The convolution calculation module is connected to the data cache region to be calculated and the convolution result buffer, and is the main body of the convolutional neural network. The main computation of the convolutional neural network is completed in the convolution operation channel; because the amount of computation is large, a pipelined parallel multiply-add structure is adopted: all multiply-add operations of a single N × N convolution window are processed fully in parallel, so N × N multiply-add operations can be completed in one clock cycle. Meanwhile, a Line Buffer is introduced as the input buffer; it simulates the sliding of the convolution window, multiplexes each input pixel to the greatest extent, greatly reduces the number of repeated reads of the input feature map from RAM, and eliminates part of the addressing logic, thereby saving power and hardware resources.
The data processing module is located between the convolution result buffer and the output result buffer; it is responsible for processing the convolution result data and transmitting the obtained output results to the output result buffer. The data processing module comprises a normalization module, an activation function module and a pooling unit module, and the convolution results are processed in a pipeline by the normalization module, the activation function module and the pooling unit module in sequence; the normalization module performs a multiply-add with normalization coefficients, the activation function module applies a ReLU function, and the pooling unit module applies max-pooling logic.
Step (3), the software and the hardware work cooperatively: the quantized fixed-point data are used for inference by the neural network accelerator, and the operation process is as follows:
The off-chip processor transmits the fixed-point data in the off-chip DDR memory to the neural network accelerator through an AXI bus. The AXI4 bus protocol is used between the off-chip processor and the neural network accelerator, and the valid/ready handshake is used to guarantee correct data transmission: when valid and ready are asserted simultaneously, the data enter the data cache region to be calculated and are then streamed to the convolution calculation module; the convolution calculation module writes convolution result data to the convolution result buffer, and the convolution result buffer passes them to the data processing module for processing. When all the data processed by the data processing module have been read out, a result-ready signal is returned to the off-chip processor, which then reads the result data from the output result buffer through the AXI4 interface.
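As an illustration of the valid/ready handshake referred to above, the following C++ behavioral sketch models a single bus channel: a beat is transferred only in a cycle where the sender's valid and the receiver's ready are both asserted. The names Channel and clock_cycle are illustrative and are not taken from the patent.

```cpp
#include <cstdint>
#include <optional>

// Minimal behavioral model of an AXI-style valid/ready handshake between
// the off-chip processor and the accelerator.
struct Channel {
    bool     valid = false;  // driven by the sender
    bool     ready = false;  // driven by the receiver
    uint64_t data  = 0;      // fixed-point payload
};

// Evaluate one clock cycle; returns the transferred data when the handshake
// completes, or nothing when either side has not asserted its signal.
std::optional<uint64_t> clock_cycle(Channel& ch) {
    if (ch.valid && ch.ready) {
        ch.valid = false;   // sender may now load the next beat
        return ch.data;     // beat accepted by the receiver
    }
    return std::nullopt;    // no transfer this cycle
}
```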
The invention has the following beneficial effects:
By exploiting the runtime information and the algorithm structure of the convolution calculation, the invention reduces redundant computation and redundant reads of parameter data, accelerates neural network inference on the FPGA hardware platform, improves the real-time performance of the DCNN, achieves higher computing performance and reduces energy consumption.
Drawings
FIG. 1 is a schematic diagram of the overall composition of the accelerator of the present invention;
FIG. 2 is a schematic diagram of the convolution calculation module of the present invention;
FIG. 3 is a diagram illustrating the structure of a convolution result buffer according to the present invention;
FIG. 4 is a schematic diagram of a maximum pooling architecture in the data processing module of the present invention.
Detailed Description
The process of the present invention is further illustrated below with reference to the figures and examples.
A software and hardware cooperative acceleration method based on an FPGA comprises the following specific steps:
Step (1), compressing the network parameters of the deep learning model by data quantization.
First, the value range of the network parameters is analyzed, and the bit width required for the fixed-point data is determined.
The original floating-point data are recorded, quantized data are computed for several candidate precisions and compared with the original floating-point data, the errors between them are accumulated, and the precision that gives the minimum error is selected as the candidate precision, according to the following formula:
2^(w-p-1) > max(|D_max| × 2^p, |D_min| × 2^p)   (1)
In the formula, p is the quantization precision, w is the quantization bit width, and D denotes the floating-point data before quantization, D_max and D_min being its maximum and minimum values.
Using the obtained candidate precision and bit width, the original floating-point data are replaced with the corresponding quantized data; through testing, a group of quantized data whose accuracy loss relative to the original network is small is selected as the fixed-point data used by the hardware; finally, the obtained fixed-point data are stored in an off-chip storage device, completing the compression of the deep learning model network parameters.
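For illustration only, the following C++ sketch shows one way the error-minimizing precision search described above could be performed offline, assuming signed w-bit fixed-point values with p fractional bits; the function names quantize and choose_precision are hypothetical and not taken from the patent.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Quantize a floating-point value to signed w-bit fixed point with p
// fractional bits and convert it back, so the rounding error can be measured.
double quantize(double d, int w, int p) {
    const int64_t q_max = (int64_t(1) << (w - 1)) - 1;
    const int64_t q_min = -(int64_t(1) << (w - 1));
    int64_t q = std::llround(d * std::ldexp(1.0, p));  // scale by 2^p and round
    q = std::clamp(q, q_min, q_max);                   // saturate to the bit width
    return std::ldexp(double(q), -p);                  // scale back by 2^-p
}

// Search the fractional precision p (0..w-1) that minimizes the accumulated
// quantization error over the recorded floating-point parameters.
int choose_precision(const std::vector<double>& params, int w) {
    int best_p = 0;
    double best_err = std::numeric_limits<double>::infinity();
    for (int p = 0; p < w; ++p) {
        double err = 0.0;
        for (double d : params)
            err += std::fabs(d - quantize(d, w, p));
        if (err < best_err) { best_err = err; best_p = p; }
    }
    return best_p;
}
```

The selected precision would then be validated by testing the quantized network against the original, as the description states, before the fixed-point data are written to the off-chip storage device.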
Step (2), designing a neural network accelerator;
the neural network accelerator comprises an AXI4 bus interface, a convolution calculation module, a data cache module and a data processing module.
The AXI4 bus interface is a high-performance, address-mapped bus interface based on the AXI bus protocol. It is a general-purpose bus interface, so the accelerator can be attached through it to any bus device that uses the AXI4 protocol. In accordance with the AXI bus protocol, the accelerator and the processing system (PS) of the FPGA use the valid/ready handshake mechanism, which guarantees the correctness of data and command transmission.
The data cache module comprises a data cache region to be calculated, a convolution result buffer and an output result buffer; the convolution result buffer adopts a dual BRAM buffer structure, and the output result buffer is implemented as a first-in first-out (FIFO) queue.
The convolution calculation module is connected to the data cache region to be calculated and the convolution result buffer, and is the main body of the convolutional neural network. The main computation of the convolutional neural network is completed in the convolution operation channel; because the amount of computation is large, a pipelined parallel multiply-add structure is adopted: all multiply-add operations of a single N × N convolution window are processed fully in parallel, so N × N multiply-add operations can be completed in one clock cycle. Meanwhile, a Line Buffer is introduced as the input buffer; it simulates the sliding of the convolution window, multiplexes each input pixel to the greatest extent, greatly reduces the number of repeated reads of the input feature map from RAM, and eliminates part of the addressing logic, thereby saving power and hardware resources.
The data processing module is located between the convolution result buffer and the output result buffer; it is responsible for processing the convolution result data and transmitting the obtained output results to the output result buffer. The data processing module comprises a normalization module, an activation function module and a pooling unit module, and the convolution results are processed in a pipeline by the normalization module, the activation function module and the pooling unit module in sequence; the normalization module performs a multiply-add with normalization coefficients, the activation function module applies a ReLU function, and the pooling unit module applies max-pooling logic.
Step (3), the software and the hardware work cooperatively: the quantized fixed-point data are used for inference by the neural network accelerator, and the operation process is as follows:
The off-chip processor (the PS part of the FPGA) transmits the fixed-point data in the off-chip DDR memory to the neural network accelerator (the PL design part of the FPGA) through an AXI bus. The AXI4 bus protocol is used between the off-chip processor and the neural network accelerator, and the valid/ready handshake is used to guarantee correct data transmission: when valid and ready are asserted simultaneously, the data enter the data cache region to be calculated and are then streamed to the convolution calculation module; the convolution calculation module writes convolution result data to the convolution result buffer, and the convolution result buffer passes them to the data processing module for processing. When all the data processed by the data processing module have been read out, a result-ready signal is returned to the off-chip processor, which then reads the result data from the output result buffer through the AXI4 interface.
The specific working method of the accelerator is shown in fig. 1:
The off-chip processor (the PS part of the FPGA) transmits the fixed-point data in the off-chip DDR memory to the neural network accelerator (the PL design part of the FPGA) through an AXI bus. The AXI4 bus protocol is used between them, and the valid/ready handshake guarantees correct transmission: when valid and ready are asserted simultaneously, the data enter the data cache region to be calculated and are then streamed to the convolution calculation module. When the data update mode is full update, the next batch of feature map data is written sequentially into all feature map cache units of each group; when the data update mode is partial update, each group updates in turn only S feature map cache units, where S is the convolution kernel stride. The weights are stored per channel in each group of convolution kernel storage units according to the numerical information of the number of convolution kernel rows KH, the number of convolution kernel columns KL, the convolution kernel stride S and the number of convolution kernels KC. The convolution calculation module then starts its convolution calculation task: it adopts a pipelined parallel multiply-add structure, processes all multiply-add operations in a single N × N convolution window fully in parallel, and completes the N × N multiply-add operations within one clock cycle. A shift register operation is also introduced; when the start and end positions of a stored feature map row are fetched from a single feature map storage unit, zeros are filled automatically according to the feature map padding number PAD. Each shift completes one batch of convolution calculations, and the feature map data address of the next convolution calculation is then generated from the number of convolution kernel columns KL and the convolution kernel stride S. The data obtained from the convolution calculation module enter the data cache module, where the result buffer adopts a dual BRAM buffer structure. The pooling type is 2 × 2 max pooling, and the data are sent in a Z-shaped order: rows 1 to 2 are sent from top to bottom and from left to right, followed by rows 3 to 4, so that the data received by the output result buffer after result processing are already arranged in order. The result processing runs as a multi-stage pipeline with three modules. First, the normalization module is used: the normalization parameters corresponding to each output channel are written into a normalization parameter buffer before output starts and are fetched in alignment with the result data while the convolution results are output; the result of each convolution kernel corresponds to a pair of parameters a and b, and the normalization submodule performs a multiply-add with the convolution result x, i.e. it outputs y = a·x + b. Different modes are distinguished directly by the values of a and b, and the calculation formula is derived from the batch-normalization formulation of the convolutional neural network model, which completes the normalization operation.
Then the pooling module applies 2 × 2 max pooling, and finally the activation function module applies a ReLU function: negative inputs are set to zero and positive inputs are kept, which consumes very few hardware resources. When all the convolution result data have been read out, a result-ready signal is returned to the off-chip processor, and the processor then reads the result data from the output result FIFO through the AXI4 interface.
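A minimal behavioral sketch of the per-channel normalization (y = a·x + b) and ReLU stages described above is given below in C++; since ReLU and 2 × 2 max pooling commute, applying the activation before or after pooling yields the same result. NormCoeff, normalize_relu and process_channel are illustrative names, not part of the patent.

```cpp
#include <algorithm>
#include <vector>

// Per-output-channel normalization coefficients (a, b), folded from the
// batch-normalization parameters as described above: y = a * x + b.
struct NormCoeff { float a; float b; };

// Process one convolution-result value: normalize, then apply ReLU
// (negative inputs are set to zero, positive inputs are kept).
inline float normalize_relu(float x, const NormCoeff& c) {
    float y = c.a * x + c.b;      // normalization multiply-add
    return std::max(y, 0.0f);     // ReLU
}

// Apply the two stages to a whole channel of convolution results.
void process_channel(std::vector<float>& results, const NormCoeff& c) {
    for (float& x : results) x = normalize_relu(x, c);
}
```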
The convolution calculation module executes the process shown in FIG. 2;
the parallel multiply-add structure of the assembly line adopted by the convolution calculation module makes full use of the high parallel lines of the FPGA. Full parallel processing is achieved for all multiply-add operations within a single nxn convolution window, completing nxn multiply-add operations within 1 clock cycle. Taking a 3 × 3 convolution kernel as an example, the module is designed as shown in fig. 3, and the module can complete 9 times of multiplication operations in each clock cycle, and then can output a final value through a 2-stage parallel adder, where the final value is a pixel on the output feature map. Another key point in the design is to add 3-Line buffers based on the shift register design into the module. Since the convolution layer operation is a convolution window sliding process, for each input feature map, the input pixels except the boundary need to be calculated twice. Therefore, Line Buffer is introduced to serve as an input Buffer, the design can simulate the sliding process of a convolution window, each input pixel is multiplexed to the maximum extent, the repeated reading times of the input characteristic diagram in the RAM are greatly reduced, and partial addressing logic is omitted, so that the purposes of saving power consumption and hardware resources are achieved. The input data, i.e. the data to be calculated, enters the operation module in a data flow mode, and after a pre-filling stage of a plurality of cycles, the whole assembly line can start to continuously output the calculation result. By instantiating a plurality of pipelines, the operation parallelism in the convolution layer can be improved, and the purpose of accelerating convolution operation is achieved.
The structure of the convolution result buffer area is shown in FIG. 3;
The convolution result buffer adopts a dual BRAM (block RAM) buffer structure. To keep data processing stable, a conventional data buffering scheme comprises two phases, data loading and data processing, and for a given on-chip memory the data in it can only be processed after loading has finished, which avoids pipeline stalls caused by packet loss during off-chip data transfers. For a single on-chip RAM, however, no valid data can be provided while the buffer is in the loading state, which reduces computational efficiency. To solve this problem, the invention proposes the "dual BRAM" data cache structure shown in FIG. 3. In state 1, BRAM1 performs data loading while BRAM2 is used for data processing; in state 2, the two BRAMs exchange working states. At the start, when BRAM1 begins data loading, BRAM2 is idle; only after BRAM1 finishes loading do BRAM1 and BRAM2 switch working states and take over the data-processing and data-loading tasks respectively. This data cache guarantees that the data processing unit is always busy throughout the whole process, and data loading and data processing run fully in parallel, so the computing capacity of the FPGA is utilized to the greatest extent.
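A small C++ sketch of the ping-pong behavior of the dual BRAM buffer: one buffer is filled while the other is drained, and the roles swap once loading completes. PingPongBuffer and its methods are illustrative names, not part of the patent.

```cpp
#include <cstddef>
#include <vector>

// Behavioral model of the "dual BRAM" structure: while one buffer is being
// loaded with the next block of convolution results, the other is being
// drained by the result-processing pipeline; the roles swap each pass.
class PingPongBuffer {
public:
    explicit PingPongBuffer(std::size_t depth)
        : bram_{std::vector<float>(depth), std::vector<float>(depth)} {}

    // Buffer currently owned by the data-loading side (state 1: BRAM1 loads).
    std::vector<float>& load_buffer()    { return bram_[load_idx_]; }
    // Buffer currently owned by the processing side (state 1: BRAM2 processes).
    std::vector<float>& process_buffer() { return bram_[1 - load_idx_]; }

    // Called once the load side has filled its buffer: exchange roles so the
    // just-loaded data is processed while the other buffer is refilled.
    void swap_state() { load_idx_ = 1 - load_idx_; }

private:
    std::vector<float> bram_[2];
    int load_idx_ = 0;
};
```

In hardware the two buffers map to separate block RAMs, so loading and processing never contend for the same memory port.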
The maximum pooling structure in the data processing module is shown in FIG. 4;
In order to work together with the pipelined multiply-add module, a matching pipelined pooling module is designed, using 2 × 2 max pooling. The pooling module also uses a Line Buffer structure similar to that of the convolution module. With two cascaded pooling logic units and a register inserted between them, the final value of a 2 × 2 pooling window is guaranteed to be output every two clock cycles.
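For reference, a behavioral C++ sketch of streaming 2 × 2 max pooling follows, assuming raster-order input and even width and height: a half-width line buffer holds the column-pair maxima of the even rows, and one pooled value is emitted per 2 × 2 window, mirroring the one-output-every-two-cycles behavior described above. The function name maxpool2x2 is illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Streaming 2x2 max pooling: pixels arrive in raster order, a half-width
// line buffer keeps the column-pair maxima of the even row, and one pooled
// value is produced for each 2x2 window.
std::vector<float> maxpool2x2(const std::vector<float>& in,
                              std::size_t width, std::size_t height) {
    std::vector<float> out;
    std::vector<float> row_max(width / 2);  // pair maxima from the previous row
    float col_max = 0.0f;                   // running max of the current column pair
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            float px = in[y * width + x];
            if (x % 2 == 0) {
                col_max = px;                              // first column of the pair
            } else {
                col_max = std::max(col_max, px);           // second column of the pair
                if (y % 2 == 0)
                    row_max[x / 2] = col_max;              // even row: store pair max
                else
                    out.push_back(std::max(row_max[x / 2], col_max));  // odd row: emit
            }
        }
    }
    return out;  // (height/2) x (width/2) pooled outputs
}
```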

Claims (1)

1. A software and hardware cooperative acceleration method based on an FPGA is characterized by comprising the following specific steps:
step (1), compressing the network parameters of the deep learning model by data quantization;
firstly, analyzing the value range of the network parameters and determining the bit width required for the fixed-point data;
recording the original floating-point data, computing the quantized data corresponding to several candidate precisions, comparing the obtained quantized data with the original floating-point data, accumulating the errors between them, and selecting the precision with the minimum error as the candidate precision, according to the following formula:
2^(w-p-1) > max(|D_max| × 2^p, |D_min| × 2^p)   (1)
in the formula, p is the quantization precision, w is the quantization bit width, and D denotes the floating-point data before quantization, D_max and D_min being its maximum and minimum values;
using the obtained candidate precision and bit width, replacing the original floating-point data with the quantized data corresponding to the candidate precision, selecting through testing a group of quantized data whose accuracy loss relative to the original network is small as the fixed-point data used by the hardware, and finally storing the obtained fixed-point data in an off-chip storage device to complete the compression of the deep learning model network parameters;
step (2), designing a neural network accelerator;
the neural network accelerator comprises an AXI4 bus interface, a convolution calculation module, a data cache module and a data processing module;
the AXI4 bus interface is a high-performance, address-mapped bus interface based on the AXI bus protocol; it is a general-purpose bus interface, so the accelerator can be attached through it to any bus device that uses the AXI4 protocol; in accordance with the AXI bus protocol, the accelerator and the processing system (PS) of the FPGA use the valid/ready handshake mechanism, which guarantees the correctness of data and command transmission;
the data cache module comprises a data cache region to be calculated, a convolution result buffer and an output result buffer; the convolution result buffer adopts a dual BRAM buffer structure, and the output result buffer is implemented as a first-in first-out (FIFO) queue;
the convolution calculation module is connected to the data cache region to be calculated and the convolution result buffer, and is the main body of the convolutional neural network; the main computation of the convolutional neural network is completed in the convolution operation channel; because the amount of computation is large, a pipelined parallel multiply-add structure is adopted, all multiply-add operations of a single N × N convolution window are processed fully in parallel, and N × N multiply-add operations can be completed in one clock cycle; meanwhile, a Line Buffer is introduced as the input buffer, which simulates the sliding of the convolution window, multiplexes each input pixel to the greatest extent, greatly reduces the number of repeated reads of the input feature map from RAM, and eliminates part of the addressing logic, thereby saving power and hardware resources;
the data processing module is located between the convolution result buffer and the output result buffer, and is responsible for processing the convolution result data and transmitting the obtained output results to the output result buffer; the data processing module comprises a normalization module, an activation function module and a pooling unit module, and the convolution results are processed in a pipeline by the normalization module, the activation function module and the pooling unit module in sequence, wherein the normalization module performs a multiply-add with normalization coefficients, the activation function module applies a ReLU function, and the pooling unit module applies max-pooling logic;
step (3), the software and the hardware work cooperatively: the quantized fixed-point data are used for inference by the neural network accelerator, and the operation process is as follows:
the off-chip processor transmits the fixed-point data in the off-chip DDR memory to the neural network accelerator through an AXI bus; the AXI4 bus protocol is used between the off-chip processor and the neural network accelerator, and the valid/ready handshake is used to guarantee correct data transmission; when valid and ready are asserted simultaneously, the data enter the data cache region to be calculated and are then streamed to the convolution calculation module; the convolution calculation module writes convolution result data to the convolution result buffer, and the convolution result buffer passes them to the data processing module for processing; when all the data processed by the data processing module have been read out, a result-ready signal is returned to the off-chip processor, which then reads the result data from the output result buffer through the AXI4 interface.
CN201911350336.XA 2019-12-24 2019-12-24 Software and hardware cooperative acceleration method based on FPGA Pending CN111178518A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911350336.XA CN111178518A (en) 2019-12-24 2019-12-24 Software and hardware cooperative acceleration method based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911350336.XA CN111178518A (en) 2019-12-24 2019-12-24 Software and hardware cooperative acceleration method based on FPGA

Publications (1)

Publication Number Publication Date
CN111178518A true CN111178518A (en) 2020-05-19

Family

ID=70646347

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911350336.XA Pending CN111178518A (en) 2019-12-24 2019-12-24 Software and hardware cooperative acceleration method based on FPGA

Country Status (1)

Country Link
CN (1) CN111178518A (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814972A (en) * 2020-07-08 2020-10-23 上海雪湖科技有限公司 Neural network convolution operation acceleration method based on FPGA
CN111882051A (en) * 2020-07-29 2020-11-03 复旦大学 Global broadcast data input circuit for neural network processing
CN112001492A (en) * 2020-08-07 2020-11-27 中山大学 Mixed flow type acceleration framework and acceleration method for binary weight Densenet model
CN112003792A (en) * 2020-07-23 2020-11-27 烽火通信科技股份有限公司 Software and hardware cooperative message acceleration method and device
CN112329545A (en) * 2020-10-13 2021-02-05 江苏大学 ZCU104 platform-based convolutional neural network implementation and processing method for application of convolutional neural network implementation in fruit identification
CN112508184A (en) * 2020-12-16 2021-03-16 重庆邮电大学 Design method of fast image recognition accelerator based on convolutional neural network
CN112734011A (en) * 2021-01-04 2021-04-30 北京大学 Deep neural network accelerator collaborative design method based on incremental synthesis
CN112766478A (en) * 2021-01-21 2021-05-07 中国电子科技集团公司信息科学研究院 FPGA pipeline structure for convolutional neural network
CN112862080A (en) * 2021-03-10 2021-05-28 中山大学 Hardware calculation method of attention mechanism of EfficientNet
CN113033087A (en) * 2021-03-17 2021-06-25 电子科技大学 High-speed data transmission method for optical neural network based on FPGA
CN113094118A (en) * 2021-04-26 2021-07-09 深圳思谋信息科技有限公司 Data processing system, method, apparatus, computer device and storage medium
CN113238988A (en) * 2021-06-08 2021-08-10 中科寒武纪科技股份有限公司 Processing system, integrated circuit and board card for optimizing parameters of deep neural network
CN113238987A (en) * 2021-06-08 2021-08-10 中科寒武纪科技股份有限公司 Statistic quantizer, storage device, processing device and board card for quantized data
CN113362292A (en) * 2021-05-27 2021-09-07 重庆邮电大学 Bone age assessment method and system based on programmable logic gate array
CN113392963A (en) * 2021-05-08 2021-09-14 北京化工大学 CNN hardware acceleration system design method based on FPGA
CN113792621A (en) * 2021-08-27 2021-12-14 杭州电子科技大学 Target detection accelerator design method based on FPGA
CN113902099A (en) * 2021-10-08 2022-01-07 电子科技大学 Neural network design and optimization method based on software and hardware joint learning
CN114911628A (en) * 2022-06-15 2022-08-16 福州大学 MobileNet hardware acceleration system based on FPGA
CN115130672A (en) * 2022-06-08 2022-09-30 武汉大学 Method and device for calculating convolution neural network by software and hardware collaborative optimization
CN115658323A (en) * 2022-11-15 2023-01-31 国网上海能源互联网研究院有限公司 FPGA load flow calculation acceleration architecture and method based on software and hardware cooperation
US11775720B2 (en) 2021-07-02 2023-10-03 International Business Machines Corporation Integrated circuit development using machine learning-based prediction of power, performance, and area

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046894A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Method for optimizing an artificial neural network (ann)
CN108280514A (en) * 2018-01-05 2018-07-13 中国科学技术大学 Sparse neural network acceleration system based on FPGA and design method
CN109871949A (en) * 2017-12-22 2019-06-11 泓图睿语(北京)科技有限公司 Convolutional neural networks accelerator and accelerated method
US20190190538A1 (en) * 2017-12-18 2019-06-20 Facebook, Inc. Accelerator hardware for compression and decompression
CN109934339A (en) * 2019-03-06 2019-06-25 东南大学 A kind of general convolutional neural networks accelerator based on a dimension systolic array
WO2019137060A1 (en) * 2018-01-15 2019-07-18 合肥工业大学 Convolutional neural network hardware accelerator based on multicast network-on-chip, and operation mode thereof

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046894A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Method for optimizing an artificial neural network (ann)
US20190190538A1 (en) * 2017-12-18 2019-06-20 Facebook, Inc. Accelerator hardware for compression and decompression
CN109871949A (en) * 2017-12-22 2019-06-11 泓图睿语(北京)科技有限公司 Convolutional neural networks accelerator and accelerated method
CN108280514A (en) * 2018-01-05 2018-07-13 中国科学技术大学 Sparse neural network acceleration system based on FPGA and design method
WO2019137060A1 (en) * 2018-01-15 2019-07-18 合肥工业大学 Convolutional neural network hardware accelerator based on multicast network-on-chip, and operation mode thereof
CN109934339A (en) * 2019-03-06 2019-06-25 东南大学 A kind of general convolutional neural networks accelerator based on a dimension systolic array

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LEANDRO D. MEDUS et al.: "A Novel Systolic Parallel Hardware Architecture for the FPGA Acceleration of Feedforward Neural Networks" *
张榜 et al.: "Design and Implementation of a Convolutional Neural Network Accelerator Based on FPGA" *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814972B (en) * 2020-07-08 2024-02-02 上海雪湖科技有限公司 Neural network convolution operation acceleration method based on FPGA
CN111814972A (en) * 2020-07-08 2020-10-23 上海雪湖科技有限公司 Neural network convolution operation acceleration method based on FPGA
CN112003792A (en) * 2020-07-23 2020-11-27 烽火通信科技股份有限公司 Software and hardware cooperative message acceleration method and device
CN112003792B (en) * 2020-07-23 2022-04-15 烽火通信科技股份有限公司 Software and hardware cooperative message acceleration method and device
CN111882051A (en) * 2020-07-29 2020-11-03 复旦大学 Global broadcast data input circuit for neural network processing
CN111882051B (en) * 2020-07-29 2022-05-20 复旦大学 Global broadcast data input circuit for neural network processing
CN112001492A (en) * 2020-08-07 2020-11-27 中山大学 Mixed flow type acceleration framework and acceleration method for binary weight Densenet model
CN112001492B (en) * 2020-08-07 2023-06-23 中山大学 Mixed running water type acceleration architecture and acceleration method for binary weight DenseNet model
CN112329545B (en) * 2020-10-13 2024-05-14 江苏大学 ZCU104 platform-based convolutional neural network implementation and processing method of application of same in fruit identification
CN112329545A (en) * 2020-10-13 2021-02-05 江苏大学 ZCU104 platform-based convolutional neural network implementation and processing method for application of convolutional neural network implementation in fruit identification
CN112508184A (en) * 2020-12-16 2021-03-16 重庆邮电大学 Design method of fast image recognition accelerator based on convolutional neural network
CN112508184B (en) * 2020-12-16 2022-04-29 重庆邮电大学 Design method of fast image recognition accelerator based on convolutional neural network
CN112734011A (en) * 2021-01-04 2021-04-30 北京大学 Deep neural network accelerator collaborative design method based on incremental synthesis
CN112766478A (en) * 2021-01-21 2021-05-07 中国电子科技集团公司信息科学研究院 FPGA pipeline structure for convolutional neural network
CN112766478B (en) * 2021-01-21 2024-04-12 中国电子科技集团公司信息科学研究院 FPGA (field programmable Gate array) pipeline structure oriented to convolutional neural network
CN112862080A (en) * 2021-03-10 2021-05-28 中山大学 Hardware calculation method of attention mechanism of EfficientNet
CN112862080B (en) * 2021-03-10 2023-08-15 中山大学 Hardware computing method of attention mechanism of Efficient Net
CN113033087B (en) * 2021-03-17 2022-06-07 电子科技大学 High-speed data transmission method for optical neural network based on FPGA
CN113033087A (en) * 2021-03-17 2021-06-25 电子科技大学 High-speed data transmission method for optical neural network based on FPGA
CN113094118A (en) * 2021-04-26 2021-07-09 深圳思谋信息科技有限公司 Data processing system, method, apparatus, computer device and storage medium
CN113392963A (en) * 2021-05-08 2021-09-14 北京化工大学 CNN hardware acceleration system design method based on FPGA
CN113392963B (en) * 2021-05-08 2023-12-19 北京化工大学 FPGA-based CNN hardware acceleration system design method
CN113362292A (en) * 2021-05-27 2021-09-07 重庆邮电大学 Bone age assessment method and system based on programmable logic gate array
CN113238987A (en) * 2021-06-08 2021-08-10 中科寒武纪科技股份有限公司 Statistic quantizer, storage device, processing device and board card for quantized data
CN113238988A (en) * 2021-06-08 2021-08-10 中科寒武纪科技股份有限公司 Processing system, integrated circuit and board card for optimizing parameters of deep neural network
US11775720B2 (en) 2021-07-02 2023-10-03 International Business Machines Corporation Integrated circuit development using machine learning-based prediction of power, performance, and area
CN113792621B (en) * 2021-08-27 2024-04-05 杭州电子科技大学 FPGA-based target detection accelerator design method
CN113792621A (en) * 2021-08-27 2021-12-14 杭州电子科技大学 Target detection accelerator design method based on FPGA
CN113902099A (en) * 2021-10-08 2022-01-07 电子科技大学 Neural network design and optimization method based on software and hardware joint learning
CN113902099B (en) * 2021-10-08 2023-06-02 电子科技大学 Neural network design and optimization method based on software and hardware joint learning
CN115130672B (en) * 2022-06-08 2024-03-08 武汉大学 Software and hardware collaborative optimization convolutional neural network calculation method and device
CN115130672A (en) * 2022-06-08 2022-09-30 武汉大学 Method and device for calculating convolution neural network by software and hardware collaborative optimization
CN114911628A (en) * 2022-06-15 2022-08-16 福州大学 MobileNet hardware acceleration system based on FPGA
CN115658323A (en) * 2022-11-15 2023-01-31 国网上海能源互联网研究院有限公司 FPGA load flow calculation acceleration architecture and method based on software and hardware cooperation

Similar Documents

Publication Publication Date Title
CN111178518A (en) Software and hardware cooperative acceleration method based on FPGA
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
US20220012593A1 (en) Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization
CN109934339B (en) General convolutional neural network accelerator based on one-dimensional pulse array
CN110390385B (en) BNRP-based configurable parallel general convolutional neural network accelerator
CN108280514B (en) FPGA-based sparse neural network acceleration system and design method
CN110516801B (en) High-throughput-rate dynamic reconfigurable convolutional neural network accelerator
CN111242289B (en) Convolutional neural network acceleration system and method with expandable scale
CN107480789B (en) Efficient conversion method and device of deep learning model
CN109447241B (en) Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things
CN111967468A (en) FPGA-based lightweight target detection neural network implementation method
US11763156B2 (en) Neural network compression based on bank-balanced sparsity
CN109146067B (en) Policy convolution neural network accelerator based on FPGA
CN108764466A (en) Convolutional neural networks hardware based on field programmable gate array and its accelerated method
CN113051216B (en) MobileNet-SSD target detection device and method based on FPGA acceleration
CN109284824B (en) Reconfigurable technology-based device for accelerating convolution and pooling operation
CN113792621B (en) FPGA-based target detection accelerator design method
CN113392973B (en) AI chip neural network acceleration method based on FPGA
CN112950656A (en) Block convolution method for pre-reading data according to channel based on FPGA platform
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN109472734B (en) Target detection network based on FPGA and implementation method thereof
CN109740619B (en) Neural network terminal operation method and device for target recognition
Zong-ling et al. The design of lightweight and multi parallel CNN accelerator based on FPGA
CN116822600A (en) Neural network search chip based on RISC-V architecture
CN116011534A (en) FPGA-based general convolutional neural network accelerator implementation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Yan Chenggang

Inventor after: Li Yang

Inventor after: Liu Bingtao

Inventor after: Shi Zhiguo

Inventor after: Sun Yaoqi

Inventor after: Zhang Jiyong

Inventor after: Zhang Yongdong

Inventor after: Shen Tao

Inventor before: Yan Chenggang

Inventor before: Li Yang

Inventor before: Liu Bingtao

Inventor before: Sun Yaoqi

Inventor before: Zhang Jiyong

Inventor before: Zhang Yongdong

Inventor before: Shen Tao