CN111178518A - Software and hardware cooperative acceleration method based on FPGA - Google Patents

Software and hardware cooperative acceleration method based on FPGA

Info

Publication number
CN111178518A
Authority
CN
China
Prior art keywords
data
convolution
module
neural network
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911350336.XA
Other languages
Chinese (zh)
Inventor
颜成钢
李扬
刘炳涛
孙垚棋
张继勇
张勇东
沈韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201911350336.XA priority Critical patent/CN111178518A/en
Publication of CN111178518A publication Critical patent/CN111178518A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Advance Control (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a software and hardware cooperative acceleration method based on an FPGA (field programmable gate array). The method first compresses the network parameters of a deep learning model by data quantization, and then feeds the quantized fixed-point data into a neural network accelerator for processing; the neural network accelerator comprises an AXI4 bus interface, a convolution calculation module, a data cache module and a data processing module. The software part compresses the neural network model, while the hardware part is a purpose-built hardware architecture (the neural network accelerator); the amount of computation is reduced, high parallelism is exploited for effective acceleration, and the number of memory accesses is reduced to lower the hardware energy consumption. By exploiting the runtime information and the algorithm structure of the convolution calculation, the invention reduces redundant computation and redundant reads of parameter data, accelerates neural network inference on the FPGA hardware platform, improves the real-time performance of the DCNN, achieves higher computing performance and reduces energy consumption.

Description

Software and hardware cooperative acceleration method based on FPGA
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a software and hardware cooperative acceleration method based on an FPGA.
Background
The neural network is an artificial intelligence and machine learning technology; the deep convolutional neural network in particular has received wide attention and has achieved remarkable results in speech recognition, natural language processing and intelligent image processing, especially image recognition. However, the amount of computation of a commonly used network model reaches the order of one billion operations, and the number of parameters reaches the order of hundreds of millions, so that whether a convolutional neural network is being trained or used for recognition, a high-performance GPU, large-capacity storage or a high-power server cluster is generally required to provide computation and storage support. With the popularization of intelligent devices, embedded devices have ever higher requirements for fast and accurate image recognition. For embedded devices with tight resources and sensitive power budgets, the huge amounts of computation and parameters make implementing a convolutional neural network a severe challenge. Therefore, with its powerful parallel capability, flexible design methodology and high performance-to-power ratio, the FPGA has become one of the most attractive platforms for hardware acceleration of convolutional neural networks in embedded devices. The invention provides a model compression and hardware acceleration method.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a software and hardware cooperative acceleration method based on an FPGA (field programmable gate array). The software part compresses the neural network model, while the hardware part is a purpose-built hardware architecture (a neural network accelerator); the amount of computation is reduced, high parallelism is exploited for effective acceleration, and the number of memory accesses is reduced to lower the hardware energy consumption.
A software and hardware cooperative acceleration method based on an FPGA comprises the following specific steps:
Step (1), compressing the network parameters of the deep learning model by data quantization.
First, the value range of the network parameters is analyzed, and the bit width required for the fixed-point data is determined.
The original floating-point data are recorded, quantized data are computed for several candidate precisions and compared with the original floating-point data, the errors between them are accumulated, and the precision that gives the minimum error is selected as the candidate precision, according to the following formula:
2^(w-p-1) > max(|D_max| × 2^p, |D_min| × 2^p)   (1)
In the formula, p is the quantization precision, w is the quantization bit width, and D denotes the floating-point data before quantization, D_max and D_min being its maximum and minimum values.
Using the obtained candidate precision and bit width, the original floating-point data are replaced with the corresponding quantized data; through testing, a group of quantized data whose accuracy loss relative to the original network is small is selected as the fixed-point data used by the hardware; finally, the obtained fixed-point data are stored in an off-chip storage device, completing the compression of the deep learning model network parameters.
Step (2), designing a neural network accelerator;
the neural network accelerator comprises an AXI4 bus interface, a convolution calculation module, a data cache module and a data processing module.
The AXI4 bus interface is a high-performance, address-mapped bus interface based on the AXI bus protocol. It is a general-purpose bus interface, so the accelerator can be attached through it to any bus device that uses the AXI4 protocol. In accordance with the AXI bus protocol, the accelerator and the processing system (PS) of the FPGA use the valid/ready handshake mechanism, which guarantees the correctness of data and command transmission.
The data cache module comprises a data cache region to be calculated, a convolution result buffer and an output result buffer; the convolution result buffer adopts a dual BRAM buffer structure, and the output result buffer is implemented as a first-in first-out (FIFO) queue.
The convolution calculation module is connected to the data cache region to be calculated and the convolution result buffer, and is the main body of the convolutional neural network. The main computation of the convolutional neural network is completed in the convolution operation channel; because the amount of computation is large, a pipelined parallel multiply-add structure is adopted: all multiply-add operations of a single N × N convolution window are processed fully in parallel, so N × N multiply-add operations can be completed in one clock cycle. Meanwhile, a Line Buffer is introduced as the input buffer; it simulates the sliding of the convolution window, multiplexes each input pixel to the greatest extent, greatly reduces the number of repeated reads of the input feature map from RAM, and eliminates part of the addressing logic, thereby saving power and hardware resources.
The data processing module is located between the convolution result buffer and the output result buffer; it is responsible for processing the convolution result data and transmitting the obtained output results to the output result buffer. The data processing module comprises a normalization module, an activation function module and a pooling unit module, and the convolution results are processed in a pipeline by the normalization module, the activation function module and the pooling unit module in sequence; the normalization module performs a multiply-add with normalization coefficients, the activation function module applies a ReLU function, and the pooling unit module applies max-pooling logic.
Step (3), the software and the hardware work cooperatively: the quantized fixed-point data are used for inference by the neural network accelerator, and the operation process is as follows:
The off-chip processor transmits the fixed-point data in the off-chip DDR memory to the neural network accelerator through an AXI bus. The AXI4 bus protocol is used between the off-chip processor and the neural network accelerator, and the valid/ready handshake is used to guarantee correct data transmission: when valid and ready are asserted simultaneously, the data enter the data cache region to be calculated and are then streamed to the convolution calculation module; the convolution calculation module writes convolution result data to the convolution result buffer, and the convolution result buffer passes them to the data processing module for processing. When all the data processed by the data processing module have been read out, a result-ready signal is returned to the off-chip processor, which then reads the result data from the output result buffer through the AXI4 interface.
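As an illustration of the valid/ready handshake referred to above, the following C++ behavioral sketch models a single bus channel: a beat is transferred only in a cycle where the sender's valid and the receiver's ready are both asserted. The names Channel and clock_cycle are illustrative and are not taken from the patent.

```cpp
#include <cstdint>
#include <optional>

// Minimal behavioral model of an AXI-style valid/ready handshake between
// the off-chip processor and the accelerator.
struct Channel {
    bool     valid = false;  // driven by the sender
    bool     ready = false;  // driven by the receiver
    uint64_t data  = 0;      // fixed-point payload
};

// Evaluate one clock cycle; returns the transferred data when the handshake
// completes, or nothing when either side has not asserted its signal.
std::optional<uint64_t> clock_cycle(Channel& ch) {
    if (ch.valid && ch.ready) {
        ch.valid = false;   // sender may now load the next beat
        return ch.data;     // beat accepted by the receiver
    }
    return std::nullopt;    // no transfer this cycle
}
```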
The invention has the following beneficial effects:
By exploiting the runtime information and the algorithm structure of the convolution calculation, the invention reduces redundant computation and redundant reads of parameter data, accelerates neural network inference on the FPGA hardware platform, improves the real-time performance of the DCNN, achieves higher computing performance and reduces energy consumption.
Drawings
FIG. 1 is a schematic diagram of the overall composition of the accelerator of the present invention;
FIG. 2 is a schematic diagram of the convolution calculation module of the present invention;
FIG. 3 is a diagram illustrating the structure of a convolution result buffer according to the present invention;
FIG. 4 is a schematic diagram of a maximum pooling architecture in the data processing module of the present invention.
Detailed Description
The process of the present invention is further illustrated below with reference to the figures and examples.
A software and hardware cooperative acceleration method based on an FPGA comprises the following specific steps:
Step (1), compressing the network parameters of the deep learning model by data quantization.
First, the value range of the network parameters is analyzed, and the bit width required for the fixed-point data is determined.
The original floating-point data are recorded, quantized data are computed for several candidate precisions and compared with the original floating-point data, the errors between them are accumulated, and the precision that gives the minimum error is selected as the candidate precision, according to the following formula:
2^(w-p-1) > max(|D_max| × 2^p, |D_min| × 2^p)   (1)
In the formula, p is the quantization precision, w is the quantization bit width, and D denotes the floating-point data before quantization, D_max and D_min being its maximum and minimum values.
Using the obtained candidate precision and bit width, the original floating-point data are replaced with the corresponding quantized data; through testing, a group of quantized data whose accuracy loss relative to the original network is small is selected as the fixed-point data used by the hardware; finally, the obtained fixed-point data are stored in an off-chip storage device, completing the compression of the deep learning model network parameters.
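For illustration only, the following C++ sketch shows one way the error-minimizing precision search described above could be performed offline, assuming signed w-bit fixed-point values with p fractional bits; the function names quantize and choose_precision are hypothetical and not taken from the patent.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Quantize a floating-point value to signed w-bit fixed point with p
// fractional bits and convert it back, so the rounding error can be measured.
double quantize(double d, int w, int p) {
    const int64_t q_max = (int64_t(1) << (w - 1)) - 1;
    const int64_t q_min = -(int64_t(1) << (w - 1));
    int64_t q = std::llround(d * std::ldexp(1.0, p));  // scale by 2^p and round
    q = std::clamp(q, q_min, q_max);                   // saturate to the bit width
    return std::ldexp(double(q), -p);                  // scale back by 2^-p
}

// Search the fractional precision p (0..w-1) that minimizes the accumulated
// quantization error over the recorded floating-point parameters.
int choose_precision(const std::vector<double>& params, int w) {
    int best_p = 0;
    double best_err = std::numeric_limits<double>::infinity();
    for (int p = 0; p < w; ++p) {
        double err = 0.0;
        for (double d : params)
            err += std::fabs(d - quantize(d, w, p));
        if (err < best_err) { best_err = err; best_p = p; }
    }
    return best_p;
}
```

The selected precision would then be validated by testing the quantized network against the original, as the description states, before the fixed-point data are written to the off-chip storage device.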
Step (2), designing a neural network accelerator;
the neural network accelerator comprises an AXI4 bus interface, a convolution calculation module, a data cache module and a data processing module.
The AXI4 bus interface is a high-performance, address-mapped bus interface based on the AXI bus protocol. It is a general-purpose bus interface, so the accelerator can be attached through it to any bus device that uses the AXI4 protocol. In accordance with the AXI bus protocol, the accelerator and the processing system (PS) of the FPGA use the valid/ready handshake mechanism, which guarantees the correctness of data and command transmission.
The data cache module comprises a data cache region to be calculated, a convolution result buffer and an output result buffer; the convolution result buffer adopts a dual BRAM buffer structure, and the output result buffer is implemented as a first-in first-out (FIFO) queue.
The convolution calculation module is connected to the data cache region to be calculated and the convolution result buffer, and is the main body of the convolutional neural network. The main computation of the convolutional neural network is completed in the convolution operation channel; because the amount of computation is large, a pipelined parallel multiply-add structure is adopted: all multiply-add operations of a single N × N convolution window are processed fully in parallel, so N × N multiply-add operations can be completed in one clock cycle. Meanwhile, a Line Buffer is introduced as the input buffer; it simulates the sliding of the convolution window, multiplexes each input pixel to the greatest extent, greatly reduces the number of repeated reads of the input feature map from RAM, and eliminates part of the addressing logic, thereby saving power and hardware resources.
The data processing module is located between the convolution result buffer and the output result buffer; it is responsible for processing the convolution result data and transmitting the obtained output results to the output result buffer. The data processing module comprises a normalization module, an activation function module and a pooling unit module, and the convolution results are processed in a pipeline by the normalization module, the activation function module and the pooling unit module in sequence; the normalization module performs a multiply-add with normalization coefficients, the activation function module applies a ReLU function, and the pooling unit module applies max-pooling logic.
Step (3), the software and the hardware work cooperatively: the quantized fixed-point data are used for inference by the neural network accelerator, and the operation process is as follows:
The off-chip processor (the PS part of the FPGA) transmits the fixed-point data in the off-chip DDR memory to the neural network accelerator (the PL design part of the FPGA) through an AXI bus. The AXI4 bus protocol is used between the off-chip processor and the neural network accelerator, and the valid/ready handshake is used to guarantee correct data transmission: when valid and ready are asserted simultaneously, the data enter the data cache region to be calculated and are then streamed to the convolution calculation module; the convolution calculation module writes convolution result data to the convolution result buffer, and the convolution result buffer passes them to the data processing module for processing. When all the data processed by the data processing module have been read out, a result-ready signal is returned to the off-chip processor, which then reads the result data from the output result buffer through the AXI4 interface.
The specific working method of the accelerator is shown in fig. 1:
The off-chip processor (the PS part of the FPGA) transmits the fixed-point data in the off-chip DDR memory to the neural network accelerator (the PL design part of the FPGA) through an AXI bus. The AXI4 bus protocol is used between them, and the valid/ready handshake guarantees correct transmission: when valid and ready are asserted simultaneously, the data enter the data cache region to be calculated and are then streamed to the convolution calculation module. When the data update mode is full update, the next batch of feature map data is written sequentially into all feature map cache units of each group; when the data update mode is partial update, each group updates in turn only S feature map cache units, where S is the convolution kernel stride. The weights are stored per channel in each group of convolution kernel storage units according to the numerical information of the number of convolution kernel rows KH, the number of convolution kernel columns KL, the convolution kernel stride S and the number of convolution kernels KC. The convolution calculation module then starts its convolution calculation task: it adopts a pipelined parallel multiply-add structure, processes all multiply-add operations in a single N × N convolution window fully in parallel, and completes the N × N multiply-add operations within one clock cycle. A shift register operation is also introduced; when the start and end positions of a stored feature map row are fetched from a single feature map storage unit, zeros are filled automatically according to the feature map padding number PAD. Each shift completes one batch of convolution calculations, and the feature map data address of the next convolution calculation is then generated from the number of convolution kernel columns KL and the convolution kernel stride S. The data obtained from the convolution calculation module enter the data cache module, where the result buffer adopts a dual BRAM buffer structure. The pooling type is 2 × 2 max pooling, and the data are sent in a Z-shaped order: rows 1 to 2 are sent from top to bottom and from left to right, followed by rows 3 to 4, so that the data received by the output result buffer after result processing are already arranged in order. The result processing runs as a multi-stage pipeline with three modules. First, the normalization module is used: the normalization parameters corresponding to each output channel are written into a normalization parameter buffer before output starts and are fetched in alignment with the result data while the convolution results are output; the result of each convolution kernel corresponds to a pair of parameters a and b, and the normalization submodule performs a multiply-add with the convolution result x, i.e. it outputs y = a·x + b. Different modes are distinguished directly by the values of a and b, and the calculation formula is derived from the batch-normalization formulation of the convolutional neural network model, which completes the normalization operation.
Then the pooling module applies 2 × 2 max pooling, and finally the activation function module applies a ReLU function: negative inputs are set to zero and positive inputs are kept, which consumes very few hardware resources. When all the convolution result data have been read out, a result-ready signal is returned to the off-chip processor, and the processor then reads the result data from the output result FIFO through the AXI4 interface.
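A minimal behavioral sketch of the per-channel normalization (y = a·x + b) and ReLU stages described above is given below in C++; since ReLU and 2 × 2 max pooling commute, applying the activation before or after pooling yields the same result. NormCoeff, normalize_relu and process_channel are illustrative names, not part of the patent.

```cpp
#include <algorithm>
#include <vector>

// Per-output-channel normalization coefficients (a, b), folded from the
// batch-normalization parameters as described above: y = a * x + b.
struct NormCoeff { float a; float b; };

// Process one convolution-result value: normalize, then apply ReLU
// (negative inputs are set to zero, positive inputs are kept).
inline float normalize_relu(float x, const NormCoeff& c) {
    float y = c.a * x + c.b;      // normalization multiply-add
    return std::max(y, 0.0f);     // ReLU
}

// Apply the two stages to a whole channel of convolution results.
void process_channel(std::vector<float>& results, const NormCoeff& c) {
    for (float& x : results) x = normalize_relu(x, c);
}
```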
The convolution calculation module executes the process shown in FIG. 2;
the parallel multiply-add structure of the assembly line adopted by the convolution calculation module makes full use of the high parallel lines of the FPGA. Full parallel processing is achieved for all multiply-add operations within a single nxn convolution window, completing nxn multiply-add operations within 1 clock cycle. Taking a 3 × 3 convolution kernel as an example, the module is designed as shown in fig. 3, and the module can complete 9 times of multiplication operations in each clock cycle, and then can output a final value through a 2-stage parallel adder, where the final value is a pixel on the output feature map. Another key point in the design is to add 3-Line buffers based on the shift register design into the module. Since the convolution layer operation is a convolution window sliding process, for each input feature map, the input pixels except the boundary need to be calculated twice. Therefore, Line Buffer is introduced to serve as an input Buffer, the design can simulate the sliding process of a convolution window, each input pixel is multiplexed to the maximum extent, the repeated reading times of the input characteristic diagram in the RAM are greatly reduced, and partial addressing logic is omitted, so that the purposes of saving power consumption and hardware resources are achieved. The input data, i.e. the data to be calculated, enters the operation module in a data flow mode, and after a pre-filling stage of a plurality of cycles, the whole assembly line can start to continuously output the calculation result. By instantiating a plurality of pipelines, the operation parallelism in the convolution layer can be improved, and the purpose of accelerating convolution operation is achieved.
The structure of the convolution result buffer area is shown in FIG. 3;
The convolution result buffer adopts a dual BRAM (block RAM) buffer structure. To keep data processing stable, a conventional data buffering scheme comprises two phases, data loading and data processing, and for a given on-chip memory the data in it can only be processed after loading has finished, which avoids pipeline stalls caused by packet loss during off-chip data transfers. For a single on-chip RAM, however, no valid data can be provided while the buffer is in the loading state, which reduces computational efficiency. To solve this problem, the invention proposes the "dual BRAM" data cache structure shown in FIG. 3. In state 1, BRAM1 performs data loading while BRAM2 is used for data processing; in state 2, the two BRAMs exchange working states. At the start, when BRAM1 begins data loading, BRAM2 is idle; only after BRAM1 finishes loading do BRAM1 and BRAM2 switch working states and take over the data-processing and data-loading tasks respectively. This data cache guarantees that the data processing unit is always busy throughout the whole process, and data loading and data processing run fully in parallel, so the computing capacity of the FPGA is utilized to the greatest extent.
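A small C++ sketch of the ping-pong behavior of the dual BRAM buffer: one buffer is filled while the other is drained, and the roles swap once loading completes. PingPongBuffer and its methods are illustrative names, not part of the patent.

```cpp
#include <cstddef>
#include <vector>

// Behavioral model of the "dual BRAM" structure: while one buffer is being
// loaded with the next block of convolution results, the other is being
// drained by the result-processing pipeline; the roles swap each pass.
class PingPongBuffer {
public:
    explicit PingPongBuffer(std::size_t depth)
        : bram_{std::vector<float>(depth), std::vector<float>(depth)} {}

    // Buffer currently owned by the data-loading side (state 1: BRAM1 loads).
    std::vector<float>& load_buffer()    { return bram_[load_idx_]; }
    // Buffer currently owned by the processing side (state 1: BRAM2 processes).
    std::vector<float>& process_buffer() { return bram_[1 - load_idx_]; }

    // Called once the load side has filled its buffer: exchange roles so the
    // just-loaded data is processed while the other buffer is refilled.
    void swap_state() { load_idx_ = 1 - load_idx_; }

private:
    std::vector<float> bram_[2];
    int load_idx_ = 0;
};
```

In hardware the two buffers map to separate block RAMs, so loading and processing never contend for the same memory port.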
The maximum pooling structure in the data processing module is shown in FIG. 4;
In order to work together with the pipelined multiply-add module, a matching pipelined pooling module is designed, using 2 × 2 max pooling. The pooling module also uses a Line Buffer structure similar to that of the convolution module. With two cascaded pooling logic units and a register inserted between them, the final value of a 2 × 2 pooling window is guaranteed to be output every two clock cycles.
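For reference, a behavioral C++ sketch of streaming 2 × 2 max pooling follows, assuming raster-order input and even width and height: a half-width line buffer holds the column-pair maxima of the even rows, and one pooled value is emitted per 2 × 2 window, mirroring the one-output-every-two-cycles behavior described above. The function name maxpool2x2 is illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Streaming 2x2 max pooling: pixels arrive in raster order, a half-width
// line buffer keeps the column-pair maxima of the even row, and one pooled
// value is produced for each 2x2 window.
std::vector<float> maxpool2x2(const std::vector<float>& in,
                              std::size_t width, std::size_t height) {
    std::vector<float> out;
    std::vector<float> row_max(width / 2);  // pair maxima from the previous row
    float col_max = 0.0f;                   // running max of the current column pair
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            float px = in[y * width + x];
            if (x % 2 == 0) {
                col_max = px;                              // first column of the pair
            } else {
                col_max = std::max(col_max, px);           // second column of the pair
                if (y % 2 == 0)
                    row_max[x / 2] = col_max;              // even row: store pair max
                else
                    out.push_back(std::max(row_max[x / 2], col_max));  // odd row: emit
            }
        }
    }
    return out;  // (height/2) x (width/2) pooled outputs
}
```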

Claims (1)

1. A software and hardware cooperative acceleration method based on an FPGA is characterized by comprising the following specific steps:
step (1), compressing the network parameters of the deep learning model by data quantization;
firstly, analyzing the value range of the network parameters and determining the bit width required for the fixed-point data;
recording the original floating-point data, computing the quantized data corresponding to several candidate precisions, comparing the obtained quantized data with the original floating-point data, accumulating the errors between them, and selecting the precision with the minimum error as the candidate precision, according to the following formula:
2^(w-p-1) > max(|D_max| × 2^p, |D_min| × 2^p)   (1)
in the formula, p is the quantization precision, w is the quantization bit width, and D denotes the floating-point data before quantization, D_max and D_min being its maximum and minimum values;
using the obtained candidate precision and bit width, replacing the original floating-point data with the quantized data corresponding to the candidate precision, selecting through testing a group of quantized data whose accuracy loss relative to the original network is small as the fixed-point data used by the hardware, and finally storing the obtained fixed-point data in an off-chip storage device to complete the compression of the deep learning model network parameters;
step (2), designing a neural network accelerator;
the neural network accelerator comprises an AXI4 bus interface, a convolution calculation module, a data cache module and a data processing module;
the AXI4 bus interface is a high-performance, address-mapped bus interface based on the AXI bus protocol; it is a general-purpose bus interface, so the accelerator can be attached through it to any bus device that uses the AXI4 protocol; in accordance with the AXI bus protocol, the accelerator and the processing system (PS) of the FPGA use the valid/ready handshake mechanism, which guarantees the correctness of data and command transmission;
the data cache module comprises a data cache region to be calculated, a convolution result buffer and an output result buffer; the convolution result buffer adopts a dual BRAM buffer structure, and the output result buffer is implemented as a first-in first-out (FIFO) queue;
the convolution calculation module is connected to the data cache region to be calculated and the convolution result buffer, and is the main body of the convolutional neural network; the main computation of the convolutional neural network is completed in the convolution operation channel; because the amount of computation is large, a pipelined parallel multiply-add structure is adopted, all multiply-add operations of a single N × N convolution window are processed fully in parallel, and N × N multiply-add operations can be completed in one clock cycle; meanwhile, a Line Buffer is introduced as the input buffer, which simulates the sliding of the convolution window, multiplexes each input pixel to the greatest extent, greatly reduces the number of repeated reads of the input feature map from RAM, and eliminates part of the addressing logic, thereby saving power and hardware resources;
the data processing module is located between the convolution result buffer and the output result buffer, and is responsible for processing the convolution result data and transmitting the obtained output results to the output result buffer; the data processing module comprises a normalization module, an activation function module and a pooling unit module, and the convolution results are processed in a pipeline by the normalization module, the activation function module and the pooling unit module in sequence, wherein the normalization module performs a multiply-add with normalization coefficients, the activation function module applies a ReLU function, and the pooling unit module applies max-pooling logic;
step (3), the software and the hardware work cooperatively: the quantized fixed-point data are used for inference by the neural network accelerator, and the operation process is as follows:
the off-chip processor transmits the fixed-point data in the off-chip DDR memory to the neural network accelerator through an AXI bus; the AXI4 bus protocol is used between the off-chip processor and the neural network accelerator, and the valid/ready handshake is used to guarantee correct data transmission; when valid and ready are asserted simultaneously, the data enter the data cache region to be calculated and are then streamed to the convolution calculation module; the convolution calculation module writes convolution result data to the convolution result buffer, and the convolution result buffer passes them to the data processing module for processing; when all the data processed by the data processing module have been read out, a result-ready signal is returned to the off-chip processor, which then reads the result data from the output result buffer through the AXI4 interface.
CN201911350336.XA 2019-12-24 2019-12-24 Software and hardware cooperative acceleration method based on FPGA Pending CN111178518A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911350336.XA CN111178518A (en) 2019-12-24 2019-12-24 Software and hardware cooperative acceleration method based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911350336.XA CN111178518A (en) 2019-12-24 2019-12-24 Software and hardware cooperative acceleration method based on FPGA

Publications (1)

Publication Number Publication Date
CN111178518A true CN111178518A (en) 2020-05-19

Family

ID=70646347

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911350336.XA Pending CN111178518A (en) 2019-12-24 2019-12-24 Software and hardware cooperative acceleration method based on FPGA

Country Status (1)

Country Link
CN (1) CN111178518A (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814972A (en) * 2020-07-08 2020-10-23 上海雪湖科技有限公司 Neural network convolution operation acceleration method based on FPGA
CN111882051A (en) * 2020-07-29 2020-11-03 复旦大学 Global broadcast data input circuit for neural network processing
CN112001492A (en) * 2020-08-07 2020-11-27 中山大学 Mixed flow type acceleration framework and acceleration method for binary weight Densenet model
CN112003792A (en) * 2020-07-23 2020-11-27 烽火通信科技股份有限公司 Software and hardware cooperative message acceleration method and device
CN112329545A (en) * 2020-10-13 2021-02-05 江苏大学 ZCU104 platform-based convolutional neural network implementation and processing method for application of convolutional neural network implementation in fruit identification
CN112508184A (en) * 2020-12-16 2021-03-16 重庆邮电大学 Design method of fast image recognition accelerator based on convolutional neural network
CN112734011A (en) * 2021-01-04 2021-04-30 北京大学 Deep neural network accelerator collaborative design method based on incremental synthesis
CN112766478A (en) * 2021-01-21 2021-05-07 中国电子科技集团公司信息科学研究院 FPGA pipeline structure for convolutional neural network
CN112862080A (en) * 2021-03-10 2021-05-28 中山大学 Hardware calculation method of attention mechanism of EfficientNet
CN113033087A (en) * 2021-03-17 2021-06-25 电子科技大学 High-speed data transmission method for optical neural network based on FPGA
CN113094118A (en) * 2021-04-26 2021-07-09 深圳思谋信息科技有限公司 Data processing system, method, apparatus, computer device and storage medium
CN113238988A (en) * 2021-06-08 2021-08-10 中科寒武纪科技股份有限公司 Processing system, integrated circuit and board card for optimizing parameters of deep neural network
CN113238987A (en) * 2021-06-08 2021-08-10 中科寒武纪科技股份有限公司 Statistic quantizer, storage device, processing device and board card for quantized data
CN113362292A (en) * 2021-05-27 2021-09-07 重庆邮电大学 Bone age assessment method and system based on programmable logic gate array
CN113392963A (en) * 2021-05-08 2021-09-14 北京化工大学 CNN hardware acceleration system design method based on FPGA
CN113792621A (en) * 2021-08-27 2021-12-14 杭州电子科技大学 Target detection accelerator design method based on FPGA
CN113902099A (en) * 2021-10-08 2022-01-07 电子科技大学 Neural network design and optimization method based on software and hardware joint learning
CN114911628A (en) * 2022-06-15 2022-08-16 福州大学 MobileNet hardware acceleration system based on FPGA
CN115130672A (en) * 2022-06-08 2022-09-30 武汉大学 Method and device for calculating convolution neural network by software and hardware collaborative optimization
CN115658323A (en) * 2022-11-15 2023-01-31 国网上海能源互联网研究院有限公司 FPGA load flow calculation acceleration architecture and method based on software and hardware cooperation
US11775720B2 (en) 2021-07-02 2023-10-03 International Business Machines Corporation Integrated circuit development using machine learning-based prediction of power, performance, and area

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046894A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Method for optimizing an artificial neural network (ann)
CN108280514A (en) * 2018-01-05 2018-07-13 中国科学技术大学 Sparse neural network acceleration system based on FPGA and design method
CN109871949A (en) * 2017-12-22 2019-06-11 泓图睿语(北京)科技有限公司 Convolutional neural networks accelerator and accelerated method
US20190190538A1 (en) * 2017-12-18 2019-06-20 Facebook, Inc. Accelerator hardware for compression and decompression
CN109934339A (en) * 2019-03-06 2019-06-25 东南大学 A kind of general convolutional neural networks accelerator based on a dimension systolic array
WO2019137060A1 (en) * 2018-01-15 2019-07-18 合肥工业大学 Convolutional neural network hardware accelerator based on multicast network-on-chip, and operation mode thereof

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046894A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Method for optimizing an artificial neural network (ann)
US20190190538A1 (en) * 2017-12-18 2019-06-20 Facebook, Inc. Accelerator hardware for compression and decompression
CN109871949A (en) * 2017-12-22 2019-06-11 泓图睿语(北京)科技有限公司 Convolutional neural networks accelerator and accelerated method
CN108280514A (en) * 2018-01-05 2018-07-13 中国科学技术大学 Sparse neural network acceleration system based on FPGA and design method
WO2019137060A1 (en) * 2018-01-15 2019-07-18 合肥工业大学 Convolutional neural network hardware accelerator based on multicast network-on-chip, and operation mode thereof
CN109934339A (en) * 2019-03-06 2019-06-25 东南大学 A kind of general convolutional neural networks accelerator based on a dimension systolic array

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LEANDRO D. MEDUS et al.: "A Novel Systolic Parallel Hardware Architecture for the FPGA Acceleration of Feedforward Neural Networks" *
张榜 et al.: "Design and Implementation of a Convolutional Neural Network Accelerator Based on FPGA" *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814972B (en) * 2020-07-08 2024-02-02 上海雪湖科技有限公司 Neural network convolution operation acceleration method based on FPGA
CN111814972A (en) * 2020-07-08 2020-10-23 上海雪湖科技有限公司 Neural network convolution operation acceleration method based on FPGA
CN112003792A (en) * 2020-07-23 2020-11-27 烽火通信科技股份有限公司 Software and hardware cooperative message acceleration method and device
CN112003792B (en) * 2020-07-23 2022-04-15 烽火通信科技股份有限公司 Software and hardware cooperative message acceleration method and device
CN111882051A (en) * 2020-07-29 2020-11-03 复旦大学 Global broadcast data input circuit for neural network processing
CN111882051B (en) * 2020-07-29 2022-05-20 复旦大学 Global broadcast data input circuit for neural network processing
CN112001492A (en) * 2020-08-07 2020-11-27 中山大学 Mixed flow type acceleration framework and acceleration method for binary weight Densenet model
CN112001492B (en) * 2020-08-07 2023-06-23 中山大学 Mixed running water type acceleration architecture and acceleration method for binary weight DenseNet model
CN112329545B (en) * 2020-10-13 2024-05-14 江苏大学 ZCU104 platform-based convolutional neural network implementation and processing method of application of same in fruit identification
CN112329545A (en) * 2020-10-13 2021-02-05 江苏大学 ZCU104 platform-based convolutional neural network implementation and processing method for application of convolutional neural network implementation in fruit identification
CN112508184A (en) * 2020-12-16 2021-03-16 重庆邮电大学 Design method of fast image recognition accelerator based on convolutional neural network
CN112508184B (en) * 2020-12-16 2022-04-29 重庆邮电大学 Design method of fast image recognition accelerator based on convolutional neural network
CN112734011A (en) * 2021-01-04 2021-04-30 北京大学 Deep neural network accelerator collaborative design method based on incremental synthesis
CN112766478A (en) * 2021-01-21 2021-05-07 中国电子科技集团公司信息科学研究院 FPGA pipeline structure for convolutional neural network
CN112766478B (en) * 2021-01-21 2024-04-12 中国电子科技集团公司信息科学研究院 FPGA (field programmable Gate array) pipeline structure oriented to convolutional neural network
CN112862080A (en) * 2021-03-10 2021-05-28 中山大学 Hardware calculation method of attention mechanism of EfficientNet
CN112862080B (en) * 2021-03-10 2023-08-15 中山大学 Hardware computing method of attention mechanism of Efficient Net
CN113033087B (en) * 2021-03-17 2022-06-07 电子科技大学 High-speed data transmission method for optical neural network based on FPGA
CN113033087A (en) * 2021-03-17 2021-06-25 电子科技大学 High-speed data transmission method for optical neural network based on FPGA
CN113094118A (en) * 2021-04-26 2021-07-09 深圳思谋信息科技有限公司 Data processing system, method, apparatus, computer device and storage medium
CN113392963A (en) * 2021-05-08 2021-09-14 北京化工大学 CNN hardware acceleration system design method based on FPGA
CN113392963B (en) * 2021-05-08 2023-12-19 北京化工大学 FPGA-based CNN hardware acceleration system design method
CN113362292A (en) * 2021-05-27 2021-09-07 重庆邮电大学 Bone age assessment method and system based on programmable logic gate array
CN113238987A (en) * 2021-06-08 2021-08-10 中科寒武纪科技股份有限公司 Statistic quantizer, storage device, processing device and board card for quantized data
CN113238988A (en) * 2021-06-08 2021-08-10 中科寒武纪科技股份有限公司 Processing system, integrated circuit and board card for optimizing parameters of deep neural network
US11775720B2 (en) 2021-07-02 2023-10-03 International Business Machines Corporation Integrated circuit development using machine learning-based prediction of power, performance, and area
CN113792621B (en) * 2021-08-27 2024-04-05 杭州电子科技大学 FPGA-based target detection accelerator design method
CN113792621A (en) * 2021-08-27 2021-12-14 杭州电子科技大学 Target detection accelerator design method based on FPGA
CN113902099A (en) * 2021-10-08 2022-01-07 电子科技大学 Neural network design and optimization method based on software and hardware joint learning
CN113902099B (en) * 2021-10-08 2023-06-02 电子科技大学 Neural network design and optimization method based on software and hardware joint learning
CN115130672B (en) * 2022-06-08 2024-03-08 武汉大学 Software and hardware collaborative optimization convolutional neural network calculation method and device
CN115130672A (en) * 2022-06-08 2022-09-30 武汉大学 Method and device for calculating convolution neural network by software and hardware collaborative optimization
CN114911628A (en) * 2022-06-15 2022-08-16 福州大学 MobileNet hardware acceleration system based on FPGA
CN115658323A (en) * 2022-11-15 2023-01-31 国网上海能源互联网研究院有限公司 FPGA load flow calculation acceleration architecture and method based on software and hardware cooperation

Similar Documents

Publication Publication Date Title
CN111178518A (en) Software and hardware cooperative acceleration method based on FPGA
CN111459877B (en) Winograd YOLOv2 target detection model method based on FPGA acceleration
US20220012593A1 (en) Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization
CN109934339B (en) General convolutional neural network accelerator based on one-dimensional pulse array
CN110390385B (en) BNRP-based configurable parallel general convolutional neural network accelerator
CN108280514B (en) FPGA-based sparse neural network acceleration system and design method
CN110516801B (en) High-throughput-rate dynamic reconfigurable convolutional neural network accelerator
CN111242289B (en) Convolutional neural network acceleration system and method with expandable scale
CN107480789B (en) Efficient conversion method and device of deep learning model
CN109447241B (en) Dynamic reconfigurable convolutional neural network accelerator architecture for field of Internet of things
CN111967468A (en) FPGA-based lightweight target detection neural network implementation method
US11763156B2 (en) Neural network compression based on bank-balanced sparsity
CN109146067B (en) Policy convolution neural network accelerator based on FPGA
CN108764466A (en) Convolutional neural networks hardware based on field programmable gate array and its accelerated method
CN113051216B (en) MobileNet-SSD target detection device and method based on FPGA acceleration
CN109284824B (en) Reconfigurable technology-based device for accelerating convolution and pooling operation
CN113792621B (en) FPGA-based target detection accelerator design method
CN113392973B (en) AI chip neural network acceleration method based on FPGA
CN112950656A (en) Block convolution method for pre-reading data according to channel based on FPGA platform
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN109472734B (en) Target detection network based on FPGA and implementation method thereof
CN109740619B (en) Neural network terminal operation method and device for target recognition
Zong-ling et al. The design of lightweight and multi parallel CNN accelerator based on FPGA
CN116822600A (en) Neural network search chip based on RISC-V architecture
CN116011534A (en) FPGA-based general convolutional neural network accelerator implementation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Yan Chenggang

Inventor after: Li Yang

Inventor after: Liu Bingtao

Inventor after: Shi Zhiguo

Inventor after: Sun Yaoqi

Inventor after: Zhang Jiyong

Inventor after: Zhang Yongdong

Inventor after: Shen Tao

Inventor before: Yan Chenggang

Inventor before: Li Yang

Inventor before: Liu Bingtao

Inventor before: Sun Yaoqi

Inventor before: Zhang Jiyong

Inventor before: Zhang Yongdong

Inventor before: Shen Tao